Ollama — Run LLMs on Your Local Machine and Measure Model Latency
Deploy your LLM at no cost on a local machine, a dedicated server, or a Kubernetes cluster. Available for Node.js, Python, Docker, the REST API, and many apps on various platforms.
Why running LLMs locally matters
- Development cost — Calling a production API can get expensive when several people and CI pipelines are hitting it; running locally costs nothing.
- Support the E2E journey — Running locally lets us do end-to-end development and testing with ease.
- Encourage experimentation and learning — Experimenting and learning without paying money is awesome!
- Only use what you need — Of course, you want the smartest model in the world. But do you really need it? Running locally lets you start with the smallest model that works and scale up as needed, which builds small prompt-optimization habits.
- It helps us write better prompts because the model is less smart — Let me explain: a large language model, a.k.a. the smart model, doesn’t demand much prompting skill, but a small one forces us to get better at prompting.
What is Ollama?
Ollama is a tool that lets you run an LLM or SLM (e.g. a 7B model) on your own machine.
- It runs as a REST API service on your machine. This is my favourite feature: it enables many integrations. Simple but powerful.
- It has a CLI — e.g. ollama run llama2
- It can run on Linux, macOS, and Windows.
- It has a library for both Node.js and Python (a quick Python sketch follows this list)
- It can run on Docker which means it can run on Kubernetes Cluster as well.
- It allows you to import and customize a model from GGUF, PyTorch or Safetensors.
- It has a model library that you can push to and pull from. Think of it like Docker for LLMs.
- It has a long list of community integrations.
- It is integrated with LangChain, LlamaIndex, and Haystack
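If you go the Python route, a minimal chat call looks roughly like the sketch below. It assumes you have installed the package with pip install ollama, the Ollama server is running locally, and llama2 has already been pulled.
# Minimal sketch using the ollama Python package (assumes `pip install ollama`,
# a running local Ollama server, and `ollama pull llama2` done beforehand).
import ollama

response = ollama.chat(
    model="llama2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])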
Installation
- Go to https://ollama.com/
- Click the Download button
Start chatting with llama2 right away with
ollama run llama2
or
If you’d like to try a multimodal model that can read images, run
ollama run llava
Then put the image path in the prompt, like this:
>>> What's in this image? /Users/[yourhome]/Desktop/smile.png
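If you prefer scripting over the CLI, the Python library accepts image attachments on a message. The sketch below assumes the llava model has been pulled and that the placeholder image path is replaced with a real file on your machine.
# Sketch: asking a multimodal model about a local image via the Python library.
# The image path below is a placeholder; point it at a real file.
import ollama

response = ollama.chat(
    model="llava",
    messages=[
        {
            "role": "user",
            "content": "What's in this image?",
            "images": ["/Users/[yourhome]/Desktop/smile.png"],
        }
    ],
)
print(response["message"]["content"])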
My favourite feature is the Ollama REST API
Ollama also exposes a REST API, which you can call with curl.
Copy and paste this into your terminal:
curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt": "Why is the sky blue?"
}'
Now you should see the response streaming to you in a few seconds.
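The same endpoint is just as easy to call from code. Here is a rough sketch using Python's requests library to consume the stream; each streamed line is a JSON object carrying a piece of the answer, and the final one has "done": true.
# Sketch: streaming /api/generate with the requests library.
# Assumes the Ollama server is running on its default port 11434.
import json
import requests

payload = {"model": "llama2", "prompt": "Why is the sky blue?"}

with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)  # partial text
        if chunk.get("done"):
            print()  # the last chunk also carries timing and token stats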
The last response in the stream includes additional information, such as the fields described below.
Measure Latency
The last response contains two interesting properties: eval_count and eval_duration.
- eval_count: number of tokens in the response
- eval_duration: time spent generating the response, in nanoseconds
To calculate how fast the response is generated in tokens per second (tokens/s), divide eval_count by eval_duration and multiply by 10^9 (since eval_duration is in nanoseconds).
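As a quick sketch, here is one way to compute that in Python from a non-streaming call (same local endpoint and model as above):
# Sketch: measuring generation speed in tokens/s from the final response fields.
# eval_duration is in nanoseconds, so multiply by 1e9 to get tokens per second.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
).json()

tokens_per_second = resp["eval_count"] / resp["eval_duration"] * 1e9
print(f"{tokens_per_second:.1f} tokens/s")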
Full credit to Ollama’s amazing API Docs
Conclusion
Ollama is an amazing piece of software that helps you get up and running with LLMs. I included “Measure Latency” here because it is important to measure the performance of the tools you use, especially when you need to compare a local model against OpenAI GPT-4, Claude, Gemini, and other services for cost analysis.
You can explore more of Ollama in its repo: https://github.com/ollama/ollama
Have fun hacking!