Cover image generated with Google Gemini ("llama surrounded by torches")

Convert a PyTorch Model to Quantized GGUF to Run on Ollama

PyTorch model (Bonito) -> GGUF -> quantized, for local inference using Ollama

Sarin Suriyakoon
4 min read · Mar 29, 2024


Background

Most Hugging Face models are published in PyTorch format, and with all the underlying knowledge involved it is sometimes difficult for me to run them locally. Here I propose a slightly easier way: convert the model to GGUF, quantize it, and run it with Ollama, plus a few examples of Hugging Face CLI usage along the way. These small pieces of knowledge will serve you well in the long run.

Here are a few points before we start

  • Quantization is a way to reduce resource usage when running a model. This suits me well because I run models on a local laptop with minimal RAM and only an Intel Iris Plus Graphics GPU with 1536 MB of memory.
  • Suppose you find an interesting PyTorch model on Hugging Face and want to try it out in Ollama or GPT4All: this guide is for you.
  • We will use the Bonito model in this example because it has a great use case: it turns unstructured text into a dataset format that is ready for training. It also uses a prompt template that differs from most models, which makes the Modelfile a little tricky to write and therefore a good learning exercise.

Overview

Tools

Download, set up, and skim the documentation for the tools we will use: the Hugging Face CLI (huggingface-cli), llama.cpp, and Ollama.

Ready? Let’s go!

Download a Source Model

huggingface-cli download BatsResearch/bonito-v1
cp -r [huggingface saved path] [new easy to remember path]/bonito
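
If your version of huggingface-cli supports the --local-dir flag, you can skip the copy step and download straight into an easy-to-remember folder (the folder name bonito is just an example):

huggingface-cli download BatsResearch/bonito-v1 --local-dir bonito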

Convert to GGUF

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
cd ..
python llama.cpp/convert.py bonito
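
By default convert.py writes a full-precision ggml-model-f32.gguf next to the weights. If your checkout supports the --outtype flag, you can produce a smaller f16 intermediate instead; the quantize step below would then take ggml-model-f16.gguf as its input:

python llama.cpp/convert.py bonito --outtype f16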

Quantize to Q4_K_M

cd llama.cpp
make
./quantize [your path to]/bonito/ggml-model-f32.gguf [your path to]/bonito/ggml-model-f32.gguf-Q4_K_M.gguf Q4_K_M
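
Before uploading, you can optionally sanity-check the quantized file with llama.cpp's main binary (the prompt and token count here are just examples):

./main -m [your path to]/bonito/ggml-model-f32.gguf-Q4_K_M.gguf -p "Hello" -n 32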

Upload to Huggingface

cd ~/bonito-gguf
huggingface-cli upload bonito-gguf . .
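
This assumes you are already authenticated with the Hub. If not, log in first, and depending on your huggingface_hub version you may also need to create the repo explicitly:

huggingface-cli login
huggingface-cli repo create bonito-gguf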

Use it in Ollama (Two methods provided)

There are two ways.

  • First, import the model locally: create a Modelfile with the following content (in the FROM line, use your local path to the quantized GGUF), then build it with the ollama create command shown after the Modelfile
FROM ./ggml-model-f32.gguf-Q4_K_M.gguf
TEMPLATE """<|tasktype|>
{{ if .System }}{{ .System }}{{ end }}
<|context|>
{{ .Prompt }}
<|task|>
"""

PARAMETER stop "<|task|>"
PARAMETER stop "<|context|>"
PARAMETER stop "<|tasktype|>"
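
Save the Modelfile next to the GGUF file, then build and run the local model (the name bonito below is just an example):

ollama create bonito -f Modelfile
ollama run bonito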

  • Second, pull and run the copy already published to the Ollama registry with this command in your terminal

ollama run pacozaa/bonito

Try adding article content to the command, like this:

ollama run pacozaa/bonito "RAG is a technique for augmenting LLM knowledge with additional data.

LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data up to a specific point in time that they were trained on. If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the knowledge of the model with the specific information it needs. The process of bringing the appropriate information and inserting it into the model prompt is known as Retrieval Augmented Generation (RAG).

LangChain has a number of components designed to help build Q&A applications, and RAG applications more generally."

Usage

Setting the proper task type is crucial for Bonito.

Here are the task types we can use

SHORTFORM_TO_FULL_TASK_TYPES = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}

Check out the implementation in the Bonito repository on GitHub.

To set the task type with Ollama, use /set system at the >>> prompt after running ollama run pacozaa/bonito.

For example

>>> /set system question generation

>>> [paste your long unstructured text as an input]
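
If you want to script this instead of typing into the interactive prompt (for example, to loop over many articles), you can send the same system and prompt fields to Ollama's REST API. Here is a minimal curl sketch; the article text is abbreviated:

curl http://localhost:11434/api/generate -d '{
  "model": "pacozaa/bonito",
  "system": "question generation",
  "prompt": "RAG is a technique for augmenting LLM knowledge with additional data. [rest of your text]",
  "stream": false
}'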

Conclusion

With these steps and examples, you have now learned how to download a Hugging Face PyTorch model, convert it to GGUF, quantize it, upload it back to Hugging Face, and run it with Ollama.

Also, Bonito can generate datasets from unstructured text: just run Bonito with Ollama and use LangChain to orchestrate the dataset generation.

Source and Credit
