Convert a PyTorch Model to Quantized GGUF and Run It on Ollama
PyTorch model (Bonito) -> GGUF -> quantize, for your local inference using Ollama
Background
Most Hugging Face models are published in PyTorch format, and running them locally can be difficult given all the underlying knowledge involved. Here I propose a slightly easier path: convert the model to GGUF, quantize it, and run it with Ollama, plus a few examples of Hugging Face usage along the way. These small pieces of knowledge will serve you well in the long run.
Here are a few points before we start
- Quantization is a way to reduce resource usage when running a model. For a 7B model like Bonito, going from FP32 weights to Q4_K_M shrinks the model from roughly 28 GB to around 4-5 GB. This suits me well because I always run models on my local laptop with minimal RAM and only Intel Iris Plus Graphics (1536 MB).
- Suppose you find an interesting PyTorch model on Hugging Face and want to try it out in Ollama or GPT4All. This guide is for you.
- We will use the Bonito model in this example because it has such a great use case: it turns unstructured text into a dataset format that is ready for training. It also uses a prompt template quite different from other models, which makes the Modelfile a little tricky to write, and that makes it a great learning exercise.
Overview
- Tools
- Download a source model — We will use https://huggingface.co/BatsResearch/bonito-v1
- Convert to GGUF
- Quantize to GGUF Q4_K_M
- Upload to Huggingface
- Use it in Ollama (two methods provided)
Tools
Download and set up the tools we will use, and skim through their documentation: huggingface-cli (from huggingface_hub), llama.cpp, and Ollama.
Ready? Let’s go!
Download a Source Model
huggingface-cli download BatsResearch/bonito-v1
cp -r [huggingface saved path] [new easy-to-remember path]/bonito
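If you prefer Python over the CLI, here is a minimal sketch using huggingface_hub (the repo id is the real one; the local_dir path is just an example, change it to wherever you keep models):
from huggingface_hub import snapshot_download

# Download every file of BatsResearch/bonito-v1 into ./bonito
snapshot_download(repo_id="BatsResearch/bonito-v1", local_dir="bonito")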
Convert to GGUF
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
cd ..
python llama.cpp/convert.py bonito
Quantize to Q4_K_M
cd llama.cpp
make
./quantize [your path to]/bonito/ggml-model-f32.gguf [your path to]/bonito/ggml-model-f32.gguf-Q4_K_M.gguf Q4_K_M
Upload to Huggingface
cd ~/bonito-gguf
huggingface-cli upload bonito-gguf . .
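The same upload can also be scripted with huggingface_hub; here is a minimal sketch (your-username is a placeholder for your own Hugging Face account, and bonito-gguf is the repo created by the command above):
from huggingface_hub import HfApi

# Push every file in the current folder to the bonito-gguf model repo
api = HfApi()
api.upload_folder(folder_path=".", repo_id="your-username/bonito-gguf", repo_type="model")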
Use it in Ollama (two methods provided)
There are two ways.
- First, import the model locally: create a Modelfile with the content below (in the FROM line, point to your local path of the quantized GGUF file), then build and run it as shown after the Modelfile.
FROM ./ggml-model-f32.gguf-Q4_K_M.gguf
TEMPLATE """<|tasktype|>
{{ if .System }}{{ .System }}{{ end }}
<|context|>
{{ .Prompt }}
<|task|>
"""
PARAMETER stop "<|task|>"
PARAMETER stop "<|context|>"
PARAMETER stop "<|tasktype|>"
- Second, use the model from the Ollama library here https://ollama.com/pacozaa/bonito
Run this command in your terminal
ollama run pacozaa/bonito
Try adding an article's content to the command, like this
ollama run pacozaa/bonito "RAG is a technique for augmenting LLM knowledge with additional data.
LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data up to a specific point in time that they were trained on. If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the knowledge of the model with the specific information it needs. The process of bringing the appropriate information and inserting it into the model prompt is known as Retrieval Augmented Generation (RAG).
LangChain has a number of components designed to help build Q&A applications, and RAG applications more generally."
Usage
Setting the proper task type is crucial for Bonito.
Here are the task types we can use
SHORTFORM_TO_FULL_TASK_TYPES = {
"exqa": "extractive question answering",
"mcqa": "multiple-choice question answering",
"qg": "question generation",
"qa": "question answering without choices",
"ynqa": "yes-no question answering",
"coref": "coreference resolution",
"paraphrase": "paraphrase generation",
"paraphrase_id": "paraphrase identification",
"sent_comp": "sentence completion",
"sentiment": "sentiment",
"summarization": "summarization",
"text_gen": "text generation",
"topic_class": "topic classification",
"wsd": "word sense disambiguation",
"te": "textual entailment",
"nli": "natural language inference",
}
Check out their implementation in the BatsResearch Bonito repository.
To set the task type with Ollama, use /set system at the >>> prompt, after running ollama run pacozaa/bonito.
For example
>>> /set system question generation
>>> [paste your long unstructured text as the input]
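If you would rather script this than type into the interactive prompt, here is a minimal sketch using LangChain's Ollama wrapper (it assumes langchain-community is installed and the Ollama server is running; the system message plays the same role as /set system above):
from langchain_community.llms import Ollama

# The system message carries the Bonito task type, like ">>> /set system question generation"
llm = Ollama(model="pacozaa/bonito", system="question generation")

text = "RAG is a technique for augmenting LLM knowledge with additional data."
print(llm.invoke(text))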
Conclusion
With these steps and examples, you now know how to download a Hugging Face PyTorch model, convert it to GGUF, quantize it, upload it back to Hugging Face, and run it with Ollama.
And with Bonito itself, you can generate datasets from unstructured text: just run Bonito with Ollama and use LangChain to organize the dataset generation.