Ollama — Run LLMs on Your Local Machine and Measure Model Latency
Deploy your LLM at no cost on a local machine, a dedicated server, or a Kubernetes cluster. Available for Node.js, Python, Docker, the REST API, and many apps on various platforms.
Why running LLMs locally matters
- Development cost — Calling a production API can get expensive when several people and CI pipelines are hitting it; running locally costs nothing.
- Support the E2E journey — Running locally lets us do end-to-end development and testing with ease.
- Encourage experimentation and learning — Experimenting and learning without paying money is awesome!
- Only use what you need — Of course, you want the smartest model in the world. But do you really need it? Running locally lets you start with the smallest model that works and scale up as needed, which builds small prompt-optimization habits.
- It helps us write better prompts because the model is less smart — Let me explain: a large language model, a.k.a. the smart model, doesn’t demand much prompting skill, but a small one forces us to get better at prompting.
What is Ollama?
Ollama is a tool that lets you run an LLM or SLM (e.g. a 7B model) on your own machine.
- It runs as a REST API service on your machine. This is my favourite feature: it enables many integrations. Simple but powerful.
- It has a CLI — e.g. ollama run llama2
- It can run on Linux, macOS, and Windows.
- It has a library for both Node.js and Python (a quick Python sketch follows this list)
- It can run on Docker which means it can run on Kubernetes Cluster as well.
- It allows you to import and customize a model from GGUF, PyTorch or Safetensors.
- It has a model library that you can push to and pull from. Think of it like Docker for LLMs.
- It has a long list of community integrations.
- It is integrated with LangChain, LlamaIndex, and Haystack
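If you go the Python route, a minimal chat call looks roughly like the sketch below. It assumes you have installed the package with pip install ollama, the Ollama server is running locally, and llama2 has already been pulled.
# Minimal sketch using the ollama Python package (assumes `pip install ollama`,
# a running local Ollama server, and `ollama pull llama2` done beforehand).
import ollama

response = ollama.chat(
    model="llama2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])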
Installation
- Go to https://ollama.com/
- Click the Download button
Start chatting with llama2 right away with
ollama run llama2
or
If you’d like to try a multimodal model that can read images, run
ollama run llava
Then put the image path in the prompt, like this:
>>> What's in this image? /Users/[yourhome]/Desktop/smile.png
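If you prefer scripting over the CLI, the Python library accepts image attachments on a message. The sketch below assumes the llava model has been pulled and that the placeholder image path is replaced with a real file on your machine.
# Sketch: asking a multimodal model about a local image via the Python library.
# The image path below is a placeholder; point it at a real file.
import ollama

response = ollama.chat(
    model="llava",
    messages=[
        {
            "role": "user",
            "content": "What's in this image?",
            "images": ["/Users/[yourhome]/Desktop/smile.png"],
        }
    ],
)
print(response["message"]["content"])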
My favourite feature is the Ollama REST API
Ollama also exposes a REST API, which you can call with curl.
Copy and paste this into your terminal:
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?"
}'
Now you should see the response streaming to you in a few seconds.
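The same endpoint is just as easy to call from code. Here is a rough sketch using Python's requests library to consume the stream; each streamed line is a JSON object carrying a piece of the answer, and the final one has "done": true.
# Sketch: streaming /api/generate with the requests library.
# Assumes the Ollama server is running on its default port 11434.
import json
import requests

payload = {"model": "llama2", "prompt": "Why is the sky blue?"}

with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)  # partial text
        if chunk.get("done"):
            print()  # the last chunk also carries timing and token stats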
The last response in the stream includes additional information, such as the fields described below.
Measure Latency
The last response contains two interesting properties: eval_count and eval_duration.
- eval_count: number of tokens in the response
- eval_duration: time spent generating the response, in nanoseconds
To calculate how fast the response is generated in tokens per second (tokens/s), divide eval_count by eval_duration and multiply by 10^9 (since eval_duration is in nanoseconds).
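As a quick sketch, here is one way to compute that in Python from a non-streaming call (same local endpoint and model as above):
# Sketch: measuring generation speed in tokens/s from the final response fields.
# eval_duration is in nanoseconds, so multiply by 1e9 to get tokens per second.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
).json()

tokens_per_second = resp["eval_count"] / resp["eval_duration"] * 1e9
print(f"{tokens_per_second:.1f} tokens/s")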
Full credit to Ollama’s amazing API Docs
Conclusion
Ollama is an amazing piece of software that helps you get up and running with LLMs. I included “Measure Latency” here because it is important to measure the performance of the tools you use, especially when you need to compare a local model against OpenAI GPT-4, Claude, Gemini, and other services for cost analysis.
You can explore more of Ollama in its repo: https://github.com/ollama/ollama
Have fun hacking!