Deploy Ollama on Local Kubernetes in 15 minutes
Let’s deploy Ollama (an LLM REST API) to your local Kubernetes.
Overview
- Background
- Tools
- Install MicroK8s
- Add Kubernetes Resource from Dashboard
- Check the Dashboard
- Run Port Forward
- Test if it works!
- Conclusion
- Source
Background
Scalability and high availability of your services are crucial in business and real-world scenarios, and Kubernetes is the tool of this era for orchestrating them. Imagine if you could deploy an LLM as a REST API (with multiple models to choose from) and have it ready to scale.
Sounds too good to be true, eh?
Well, enter Ollama + Kubernetes.
Tools
We are going to demonstrate this combination of tools on your local macOS machine. If you are using Linux or Windows, both tools are supported as well; please check the official documentation.
MicroK8s — This tool allows you to run Kubernetes locally. I prefer it over Minikube because it’s lightweight, easy to install, and good for quick experiments. It also comes with a Dashboard, which we will use today; its web UI will help you operate Kubernetes locally.
Ollama — This is a great tool for experimenting with and using Large Language Models (LLMs) as a REST API without being a data scientist or having extensive AI coding knowledge. Also, the list of models that can be used in Ollama is ever-growing (Gemma, Mistral, Orca, Llama 2, and many more). Check its library here.
With the LLM behind a REST API, you can scale it like any other service on Kubernetes.
Installing Ollama locally is optional because we will pull its image inside Kubernetes anyway. It is nice to have, though, because you can test our Kubernetes service with the Ollama CLI.
Let’s get cooking.
Install MicroK8s
Run this series of commands in your terminal.
Check out https://microk8s.io/docs/install-macos
or use the consolidated list below.
#Download
brew install ubuntu/microk8s/microk8s
#Run installer
microk8s install
#Check the status while Kubernetes starts
microk8s status --wait-ready
#Turn on Dashboard service
microk8s enable dashboard
#Start using Kubernetes
microk8s kubectl get all --all-namespaces
#Start using Dashboard
microk8s dashboard-proxy
#Utility command
#To Start and Stop
microk8s start
microk8s stop
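One optional convenience, not required for anything below: if you get tired of typing the microk8s prefix, you can alias kubectl to the MicroK8s build (assuming you don’t already have another kubectl you rely on).
#Optional: alias kubectl to the MicroK8s build
alias kubectl='microk8s kubectl'
kubectl get all --all-namespaces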
After running microk8s dashboard-proxy, you should see the URL and token needed to log in to the dashboard.
Open the URL in a browser and you should see the Dashboard. We are ready to add Kubernetes resources.
Add Kubernetes Resource from Dashboard
Click the + plus sign at the top right of the screen, and you should see a form for adding a Kubernetes resource.
Add the following Namespace resource
Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
Then select this ollama namespace at the top left, beside the Kubernetes logo.
Let’s continue adding Deployment and Service resources in the aforementioned form.
Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP
Service
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    name: ollama
  ports:
    - port: 80
      name: http
      targetPort: http
      protocol: TCP
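If you prefer the terminal over the dashboard form, you could apply the same three manifests with kubectl instead; the file names here are just an assumption about how you saved them.
#Alternative to the dashboard form (assumes the manifests were saved to these files)
microk8s kubectl apply -f namespace.yaml
microk8s kubectl apply -f deployment.yaml
microk8s kubectl apply -f service.yaml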
Check the Dashboard
You should see the Deployment and Service running in the dashboard.
Run Port Forward
Wait until the ollama deployment and service are green, then run
microk8s kubectl -n ollama port-forward service/ollama 11434:80
This will forward any localhost:11434 request to the service’s port 80.
localhost:11434 is Ollama’s standard address and port.
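Before loading any model, you can sanity-check the forward itself; as far as I know, Ollama’s root endpoint answers with a short status message.
#Should return a short status text (e.g. "Ollama is running") if the forward works
curl http://localhost:11434/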
Test if it works!
Make sure the Ollama desktop app is closed and any local ollama serve process is terminated. Only MicroK8s should be running Ollama. Then run this
ollama run orca-mini:3b
or with curl
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'
The first time you run this, it will take some time to pull the model. Go to the dashboard again and watch the pod log: click Pods, click ollama-[hash key], and open the Logs view. You should see the model being pulled in the log.
You should see either the ollama CLI or curl return a response once the model is loaded. Start chatting!
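By the way, if you would rather pull a model ahead of time instead of waiting on the first prompt, Ollama’s API also exposes a pull endpoint; the model name below is just an example.
#Pre-pull a model through the forwarded service (progress is streamed as JSON lines)
curl http://localhost:11434/api/pull -d '{
  "name": "orca-mini:3b"
}'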
Conclusion
Now you have a proof of concept of an open-source LLM REST API that is ready to scale into your enterprise Kubernetes cluster.
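If you want to take that scaling claim for a quick spin, it is a one-liner against the deployment we created; just note that with this minimal manifest each replica pulls and keeps its own copy of the models.
#Scale the ollama deployment (the replica count here is arbitrary)
microk8s kubectl -n ollama scale deployment ollama --replicas=3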