Running a Hugging Face Large Language Model (LLM) locally on my laptop
I’ve been playing around with a bunch of Large Language Models (LLMs) on Hugging Face and while the free inference API is cool, it can sometimes be busy, so I wanted to learn how to run the models locally. That’s what we’ll be doing in this blog post.
I’ve created a video showing how to do this on my YouTube channel, Learn Data with Mark, so if you prefer to consume content through that medium, I’ve embedded it below:
You’ll need to install the following libraries if you want to follow along:
pip install 'langchain[llms]' huggingface-hub langchain transformers
The first step is to choose a model that you want to download. I quite like lmsys/fastchat-t5-3b-v1.0, so we’re going to use that one for the rest of the post.
Downloading the LLM
We can download a model by running the following code:
from huggingface_hub import hf_hub_download
HUGGING_FACE_API_KEY = "<hugging-face-api-key-goes-here>"
# Replace this if you want to use a different model
model_id = "lmsys/fastchat-t5-3b-v1.0"
filenames = [ (1)
"pytorch_model.bin", "added_tokens.json", "config.json", "generation_config.json",
"special_tokens_map.json", "spiece.model", "tokenizer_config.json"
]
for filename in filenames:
    downloaded_model_path = hf_hub_download(
        repo_id=model_id,
        filename=filename,
        token=HUGGING_FACE_API_KEY
    )
    # Print where each file ended up in the local Hugging Face cache
    print(downloaded_model_path)
(1) I worked out the filenames by browsing Files and versions on the Hugging Face UI.
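If you’d rather not click around the UI, the huggingface_hub library can also list a repository’s files for you. Here’s a minimal sketch, assuming the same model_id and API key as above:

from huggingface_hub import list_repo_files

# List every file in the model repository so we know what's available to download
for filename in list_repo_files("lmsys/fastchat-t5-3b-v1.0", token=HUGGING_FACE_API_KEY):
    print(filename)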
This model is almost 7GB in size, so you probably want to connect your computer to an ethernet cable to get maximum download speed! As well as downloading the model, the script prints out where each file was saved. In my case it’s the following:
/Users/markhneedham/.cache/huggingface/hub/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024/pytorch_model.bin
/Users/markhneedham/.cache/huggingface/hub/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024/added_tokens.json
/Users/markhneedham/.cache/huggingface/hub/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024/config.json
/Users/markhneedham/.cache/huggingface/hub/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024/generation_config.json
/Users/markhneedham/.cache/huggingface/hub/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024/special_tokens_map.json
/Users/markhneedham/.cache/huggingface/hub/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024/spiece.model
/Users/markhneedham/.cache/huggingface/hub/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024/tokenizer_config.json
Any files that you don’t explicitly download like this will be downloaded the first time that you use the model. I’m downloading everything separately so that I don’t unexpectedly have to wait for things later on!
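If you’d prefer to pull everything down in one go rather than listing filenames by hand, huggingface_hub also has a snapshot_download function that fetches the whole repository. A minimal sketch, assuming the same model and API key as before:

from huggingface_hub import snapshot_download

# Download every file in the repository and return the local snapshot directory
local_dir = snapshot_download(repo_id="lmsys/fastchat-t5-3b-v1.0", token=HUGGING_FACE_API_KEY)
print(local_dir)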
Running the LLM
We’re now going to use the model locally with LangChain so that we can create a repeatable structure around the prompt. Let’s first import some libraries:
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain
And now we’re going to create an instance of our model:
model_id = "lmsys/fastchat-t5-3b-v1.0"
llm = HuggingFacePipeline.from_model_id(
    model_id=model_id,
    task="text2text-generation",
    model_kwargs={"temperature": 0, "max_length": 1000},
)
The value that we use for the task parameter matches the task type shown on the model’s Hugging Face page.

Figure 1. The task type for this model
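Under the hood, from_model_id builds a Transformers pipeline for us. If you want to see roughly what that looks like, or use the model without LangChain, here’s a sketch using the transformers library directly; the generation settings and example prompt are assumptions on my part:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Load the tokenizer and the seq2seq model from the local cache
tokenizer = AutoTokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0")
model = AutoModelForSeq2SeqLM.from_pretrained("lmsys/fastchat-t5-3b-v1.0")

# Wrap them in a text2text-generation pipeline, mirroring the task used above
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, max_length=1000)
print(pipe("What is the capital of France?")[0]["generated_text"])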
Now let’s create a template for what we want the LLM to do when we send it a prompt:
template = """
You are a friendly chatbot assistant that responds conversationally to users' questions.
Keep the answers short, unless specifically asked by the user to elaborate on something.
Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
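To sanity-check what the chain will actually send to the model, you can render the template yourself. A quick sketch, with a sample question of my choosing:

# Render the prompt template with a sample question to see the final prompt text
print(prompt.format(question="Describe some famous landmarks in London"))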
Next, let’s create a little function that asks a question and prints the response:
def ask_question(question):
    result = llm_chain(question)
    print(result['question'])
    print("")
    print(result['text'])
We’ll also create a (ChatGPT generated) Timer context manager to make it easier to see how long it takes to answer each question:
import time
class TimerError(Exception):
    """A custom exception used to report errors in use of Timer class"""

class Timer:
    def __init__(self):
        self._start_time = None

    def __enter__(self):
        if self._start_time is not None:
            raise TimerError("Timer is running. Use .stop() to stop it")
        self._start_time = time.perf_counter()

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._start_time is None:
            raise TimerError("Timer is not running. Use .start() to start it")
        elapsed_time = time.perf_counter() - self._start_time
        self._start_time = None
        print(f"Elapsed time: {elapsed_time:0.4f} seconds")
Now let’s see how well the model knows London:
with Timer():
ask_question("Describe some famous landmarks in London")
Describe some famous landmarks in London
<pad> Some famous landmarks in London include:
* Buckingham Palace
* St. Paul's Cathedral
* The Tower of London
* The London Eye
* The London Eye is a giant wheel that flies over London.
Elapsed time: 17.7592 seconds
I’m not sure about that last bullet, but I do like the idea of a giant wheel flying over the city! Let’s try something else:
with Timer():
ask_question("Tell me about Apache Kafka in a few sentences.")
Tell me about Apache Kafka in a few sentences.
<pad> Apache Kafka is a distributed streaming platform that allows for the real-time processing of large amounts of data. It is designed to be scalable, fault-tolerant, and easy to use.
Elapsed time: 15.7795 seconds
Not too bad. It doesn’t do so well if I ask about Apache Pinot though!
with Timer():
ask_question("Tell me about Apache Pinot in a few sentences.")
Tell me about Apache Pinot in a few sentences.
<pad> Apache Pinot is a Java framework for building web applications that can handle a wide range of tasks, including web development, database management, and web application testing.
Elapsed time: 13.6518 seconds
It’s also nowhere near as fast as ChatGPT, but my computer isn’t as good as the ones that they use!
Having said that, it is pretty cool to be able to run this type of thing on your own machine and I think it could certainly be useful if you want to ask questions about your own documents that you don’t want to send over the internet.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.