Cover image generated with Google Gemini ("llama surrounded by torches")

Convert a PyTorch Model to Quantized GGUF to Run on Ollama

PyTorch model (Bonito) -> GGUF -> quantized, for local inference using Ollama

Sarin Suriyakoon
4 min read · Mar 29, 2024


Background

Most Hugging Face models are published in PyTorch format, and with all the underlying knowledge involved it is sometimes difficult for me to run them locally. Here I propose a slightly easier way: convert the model to GGUF, quantize it, and run it with Ollama, plus a few examples of Hugging Face CLI usage along the way. These small pieces of knowledge will serve you well in the long run.

Here are a few points before we start

  • Quantization is a way to reduce resource usage when running a model. This suits me well because I run models on a local laptop with minimal RAM and only an Intel Iris Plus Graphics GPU with 1536 MB of memory.
  • Suppose you find an interesting PyTorch model on Hugging Face and want to try it out in Ollama or GPT4All: this guide is for you.
  • We will use the Bonito model in this example because it has a great use case: it turns unstructured text into a dataset format that is ready for training. It also uses a prompt template that differs from most models, which makes the Modelfile a little tricky to write and therefore a good learning exercise.

Overview

Tools

Download, set up, and skim the documentation for the tools we will use: the Hugging Face CLI (huggingface-cli), llama.cpp, and Ollama.

Ready? Let’s go!

Download a Source Model

huggingface-cli download BatsResearch/bonito-v1
cp -r [huggingface saved path] [new easy to remember path]/bonito
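
If your version of huggingface-cli supports the --local-dir flag, you can skip the copy step and download straight into an easy-to-remember folder (the folder name bonito is just an example):

huggingface-cli download BatsResearch/bonito-v1 --local-dir bonito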

Convert to GGUF

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
cd ..
python llama.cpp/convert.py bonito
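
By default convert.py writes a full-precision ggml-model-f32.gguf next to the weights. If your checkout supports the --outtype flag, you can produce a smaller f16 intermediate instead; the quantize step below would then take ggml-model-f16.gguf as its input:

python llama.cpp/convert.py bonito --outtype f16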

Quantize to Q4_K_M

cd llama.cpp
make
./quantize [your path to]/bonito/ggml-model-f32.gguf [your path to]/bonito/ggml-model-f32.gguf-Q4_K_M.gguf Q4_K_M
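
Before uploading, you can optionally sanity-check the quantized file with llama.cpp's main binary (the prompt and token count here are just examples):

./main -m [your path to]/bonito/ggml-model-f32.gguf-Q4_K_M.gguf -p "Hello" -n 32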

Upload to Huggingface

cd ~/bonito-gguf
huggingface-cli upload bonito-gguf . .
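
This assumes you are already authenticated with the Hub. If not, log in first, and depending on your huggingface_hub version you may also need to create the repo explicitly:

huggingface-cli login
huggingface-cli repo create bonito-gguf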

Use it in Ollama (Two methods provided)

There are two ways.

  • First, import the model locally: create a Modelfile with the following content (in the FROM line, use your local path to the quantized GGUF), then build it with the ollama create command shown after the Modelfile
FROM ./ggml-model-f32.gguf-Q4_K_M.gguf
TEMPLATE """<|tasktype|>
{{ if .System }}{{ .System }}{{ end }}
<|context|>
{{ .Prompt }}
<|task|>
"""

PARAMETER stop "<|task|>"
PARAMETER stop "<|context|>"
PARAMETER stop "<|tasktype|>"
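
Save the Modelfile next to the GGUF file, then build and run the local model (the name bonito below is just an example):

ollama create bonito -f Modelfile
ollama run bonito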

  • Second, pull and run the copy already published to the Ollama registry with this command in your terminal

ollama run pacozaa/bonito

Try adding article content to the command, like this:

ollama run pacozaa/bonito "RAG is a technique for augmenting LLM knowledge with additional data.

LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data up to a specific point in time that they were trained on. If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the knowledge of the model with the specific information it needs. The process of bringing the appropriate information and inserting it into the model prompt is known as Retrieval Augmented Generation (RAG).

LangChain has a number of components designed to help build Q&A applications, and RAG applications more generally."

Usage

Setting the proper task type is crucial for Bonito.

Here are the task types we can use

SHORTFORM_TO_FULL_TASK_TYPES = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}

Check out the implementation in the Bonito repository on GitHub.

To set the task type with Ollama, use /set system at the >>> prompt after running ollama run pacozaa/bonito.

For example

>>> /set system question generation

>>> [paste your long unstructured text as an input]
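
If you want to script this instead of typing into the interactive prompt (for example, to loop over many articles), you can send the same system and prompt fields to Ollama's REST API. Here is a minimal curl sketch; the article text is abbreviated:

curl http://localhost:11434/api/generate -d '{
  "model": "pacozaa/bonito",
  "system": "question generation",
  "prompt": "RAG is a technique for augmenting LLM knowledge with additional data. [rest of your text]",
  "stream": false
}'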

Conclusion

With these steps and examples, you have now learned how to download a Hugging Face PyTorch model, convert it to GGUF, quantize it, upload it back to Hugging Face, and run it with Ollama.

Also, Bonito can generate datasets from unstructured text: just run Bonito with Ollama and use LangChain to orchestrate the dataset generation.

Source and Credit
