Google Gemini “Ninja Lama”

Deploy Ollama on Local Kubernetes in 15 minutes

Sarin Suriyakoon
5 min read · Mar 12, 2024


Let’s deploy Ollama (an LLM REST API) to your local Kubernetes cluster.

Overview

  • Background
  • Tools
  • Install MicroK8s
  • Add Kubernetes Resource from Dashboard
  • Check the Dashboard
  • Run Port Forward
  • Test if it works!
  • Conclusion
  • Source

Background

Scalability and high availability of your services are crucial in business and real-world scenarios, and Kubernetes is the tool of this era for orchestrating them. Imagine if you could deploy an LLM as a REST API (with multiple models to choose from) that is ready to scale.

Sounds too good to be true, eh?

Well, enter Ollama + Kubernetes.

Tools

We are going to demonstrate this combination of tools on your local macOS machine. If you are using Linux or Windows, both tools are supported; please check the official documentation.

MicroK8s — This tool allows you to run Kubernetes locally. I prefer it over Minikube because it is lightweight, easy to install, and good for quick and easy experiments. It also comes with a Dashboard, which we will use today; the Dashboard helps you operate Kubernetes locally through its web UI.

Ollama — This is a great tool for experimenting with and serving a Large Language Model (LLM) as a REST API without being a data scientist or having extensive AI coding knowledge. Also, the list of models that can be used with Ollama is ever-growing (Gemma, Mistral, Orca, Llama 2, and many more). Check its model library here.

With the LLM exposed as a REST API, you can imagine scaling it like any other service on Kubernetes.

Installing Ollama itself is optional because we will pull its image inside Kubernetes anyway. It is nice to have, though, because you can test our Kubernetes service with the Ollama CLI.
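If you do want the CLI on your machine, the install is quick. A minimal sketch, assuming you use Homebrew on macOS:

# Install the Ollama CLI (optional, only used to test the cluster later)
brew install ollama
# Confirm the CLI is available
ollama --version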

Let’s get cooking.

Install MicroK8s

Run this series of commands in your terminal

Check out https://microk8s.io/docs/install-macos

or use a consolidated list here

# Download
brew install ubuntu/microk8s/microk8s
# Run the installer
microk8s install
# Check the status while Kubernetes starts
microk8s status --wait-ready
# Turn on the Dashboard service
microk8s enable dashboard
# Start using Kubernetes
microk8s kubectl get all --all-namespaces
# Start using the Dashboard
microk8s dashboard-proxy

# Utility commands
# To start and stop
microk8s start
microk8s stop

After running microk8s dashboard-proxy, you should see the URL and token needed to log in to the dashboard.

Dashboard info — copy and paste URL into your browser

Open the URL in a browser and you should see the Dashboard. We are ready to add Kubernetes resources.

MicroK8s Dashboard

Add Kubernetes Resource from Dashboard

Click the + (plus) sign at the top right of the screen, and you should see a form for adding a Kubernetes resource.

Add Kubernetes resource form

Add the following Namespace resource

Namespace

apiVersion: v1
kind: Namespace
metadata:
  name: ollama
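If you prefer the terminal over the dashboard form, the same manifest can be applied with kubectl. A quick sketch, assuming you saved the YAML above as namespace.yaml (the file name is just an example):

# Apply the Namespace manifest from a file
microk8s kubectl apply -f namespace.yaml
# Confirm the namespace exists
microk8s kubectl get namespaces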

Then select this ollama namespace at the top left, beside the Kubernetes logo.

Select namespace — ollama

Let’s continue adding Deployment and Service resources in the aforementioned form.

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP
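Once the Deployment is added, you can also watch it come up from the terminal. A small check, assuming the manifest above was applied unchanged:

# Wait for the ollama Deployment to finish rolling out
microk8s kubectl -n ollama rollout status deployment/ollama
# List the pod it created
microk8s kubectl -n ollama get pods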

Service

apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    name: ollama
  ports:
    - port: 80
      name: http
      targetPort: http
      protocol: TCP
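To confirm the Service selector actually matched the Ollama pod, you can inspect its endpoints; the ENDPOINTS column should show the pod IP on port 11434:

# Check that the Service found the pod
microk8s kubectl -n ollama get endpoints ollama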

Check the Dashboard

You should see Deployment and Service running in the dashboard
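The same check works from the terminal if you would rather not click through the dashboard; something like this should list the Deployment, Service, and pod:

# CLI equivalent of the dashboard view
microk8s kubectl -n ollama get all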

Run Port Forward

Wait until the ollama Deployment and Service are green, then run

microk8s kubectl -n ollama port-forward service/ollama 11434:80

This will forward any localhost:11434 request to service port 80.

localhost:11434 is the Ollama standard address and port.
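With the port forward running, a quick way to confirm the API is reachable is to list the models the server knows about. A fresh pod has pulled nothing yet, so an empty list is expected:

# List models available on the Ollama server (empty on a fresh pod)
curl http://localhost:11434/api/tags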

Test if it works!

Make sure the Ollama desktop app is closed and any ollama serve process is terminated on your local machine. Only MicroK8s should be running Ollama. Then run this

ollama run orca-mini:3b

or with CURL

curl http://localhost:11434/api/generate -d '{
"model": "llama2",
"prompt":"Why is the sky blue?"
}'
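By default, /api/generate streams the response token by token. If you would rather receive a single JSON object, the API also accepts a stream flag, for example:

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'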

The first time you run this, it will take some time to pull the model. Go to the dashboard again and watch the pod log: click Pods, click ollama-[hash key], and open its logs. You should see the Ollama server log output there while the model is pulled.

You should see either the Ollama CLI or curl return a response after the model is loaded. Start chatting!

Conclusion

Now you have a proof of concept of an open-source LLM REST API that is ready to scale into your enterprise Kubernetes cluster.
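When you are ready to go beyond a single replica, scaling follows the usual Kubernetes workflow. A rough sketch; note that because the manifest above does not share a volume, each replica pulls and stores its own copy of the model:

# Scale the ollama Deployment to three replicas
microk8s kubectl -n ollama scale deployment/ollama --replicas=3
# Verify the new pods
microk8s kubectl -n ollama get pods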
