[Cover image: an ultra-realistic robotic llama, generated with Canva text-to-image]

Ollama: Run an LLM Locally and Measure Model Latency

Deploy your LLM at no cost on a local machine, a dedicated machine, or a Kubernetes cluster. Available through Node.js, Python, Docker, a REST API, and many apps on various platforms.

Sarin Suriyakoon
3 min read · Feb 18, 2024


Why running LLMs locally matters

  • Development cost: using a production API with several people and CI pipelines running can cost far more than running locally.
  • Support the E2E journey: running locally allows us to do end-to-end development and testing with ease.
  • Encourage experimentation and learning: experimenting and learning without paying money is awesome!
  • Only use what you need: of course you want the smartest model in the world, but is it really needed? Running locally helps you start with the smallest model that works and scale up as needed. This practice builds small prompt-optimization habits.
  • A less smart model makes you a better prompt writer: let me explain. A large language model, a.k.a. the smart model, doesn't require much prompting skill, but a small one forces us to get better at prompting.

What is Ollama?

Ollama is a tool that lets you run an LLM or a small language model (SLM, e.g. a 7B model) on your own machine.

  • It is a REST API service on your machine. This is my favourite feature: it enables many integrations. Simple but powerful.
  • It has a CLI, e.g. ollama run llama2
  • It can run on Linux, macOS, and Windows.
  • It has client libraries for both Node.js and Python (see the Python sketch after this list).
  • It can run on Docker, which means it can run on a Kubernetes cluster as well.
  • It allows you to import and customize models from GGUF, PyTorch, or Safetensors.
  • It has a model library that you can push to and pull from. Think of it like Docker for LLMs.
  • It has a long list of Community Integrations.
  • It is integrated with LangChain, LlamaIndex, and Haystack.
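
Here is a minimal sketch of the Python library in action. It assumes you have installed the client with pip install ollama, that the Ollama server is running locally, and that the llama2 model has already been pulled.

# pip install ollama
import ollama

# Send a single chat message to the local Ollama server
response = ollama.chat(
    model="llama2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)

print(response["message"]["content"])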

Installation

Download the installer for your platform from the download section on Ollama's website (ollama.com/download).

Start chatting with llama2 right away with:

ollama run llama2

or

If you'd like to try a multimodal model that can read images:

ollama run llava

Then put the image path in the prompt like this:
>>> What's in this image? /Users/[yourhome]/Desktop/smile.png
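
The Python library can make the same multimodal call. This is a rough sketch under the same assumptions as before (ollama package installed, llava pulled); the image path is just a placeholder to replace with your own.

import ollama

# Ask llava about a local image by passing the file path in the images list
response = ollama.chat(
    model="llava",
    messages=[
        {
            "role": "user",
            "content": "What's in this image?",
            "images": ["/Users/[yourhome]/Desktop/smile.png"],
        }
    ],
)

print(response["message"]["content"])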

My favourite feature is the Ollama REST API

Ollama also exposes a REST API, which you can call with curl.

Copy and paste this into your terminal:

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'

Now you should see the response streaming to you in a few seconds.
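
If you'd rather consume the same endpoint from Python, here is a sketch using the requests package. It assumes Ollama is running on the default port 11434; the API streams newline-delimited JSON objects, each carrying a fragment of the answer.

import json

import requests

payload = {"model": "llama2", "prompt": "Why is the sky blue?"}

# Print the streamed chunks as they arrive; the final chunk has "done": true
with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()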

The last streamed response will show additional information, which is what we use next.

Measure Latency

The last response contains two interesting properties: eval_count and eval_duration.

  • eval_count: the number of tokens in the response
  • eval_duration: time in nanoseconds spent generating the response

To calculate how fast the response is generated in tokens per second (tokens/s), divide eval_count by eval_duration and multiply by 10^9 (eval_duration is in nanoseconds, so the factor converts it to seconds).
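
As a quick sketch of that calculation, the request below disables streaming so the whole answer comes back as a single JSON object containing those fields (same assumptions as before: local Ollama on port 11434 with llama2 pulled).

import requests

payload = {
    "model": "llama2",
    "prompt": "Why is the sky blue?",
    "stream": False,  # return one JSON object instead of a stream
}
resp = requests.post("http://localhost:11434/api/generate", json=payload).json()

eval_count = resp["eval_count"]        # tokens generated in the response
eval_duration = resp["eval_duration"]  # generation time in nanoseconds

tokens_per_second = eval_count / eval_duration * 1e9
print(f"{tokens_per_second:.1f} tokens/s")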

Full credit to Ollama’s amazing API Docs

Conclusion

Ollama is an amazing piece of software for getting up and running with LLMs. I include “Measure Latency” here because it is important to measure the performance of the tools you use, especially when you need to compare them with OpenAI GPT-4, Claude, Gemini, and other services for cost analysis.

You can explore more of Ollama in its GitHub repo (github.com/ollama/ollama).

Have fun hacking!
