Running a Hugging Face Large Language Model (LLM) locally on my laptop
I’ve been playing around with a bunch of Large Language Models (LLMs) on Hugging Face and while the free inference API is cool, it can sometimes be busy, so I wanted to learn how to run the models locally. That’s what we’ll be doing in this blog post.
I’ve also created a video showing how to do this on my YouTube channel, Learn Data with Mark, if you prefer to consume content through that medium.
You’ll need to install the following libraries if you want to follow along:
pip install 'langchain[llms]' huggingface-hub transformers
The first step is to choose a model that you want to download. I quite like lmsys/fastchat-t5-3b-v1.0, so we’re gonna use that one for the rest of the post.
Downloading the LLM
We can download a model by running the following code:
from huggingface_hub import hf_hub_download

HUGGING_FACE_API_KEY = "<hugging-face-api-key-goes-here>"
# Replace this if you want to use a different model
model_id = "lmsys/fastchat-t5-3b-v1.0"
filenames = [  # (1)
    "pytorch_model.bin", "added_tokens.json", "config.json", "generation_config.json",
    "special_tokens_map.json", "spiece.model", "tokenizer_config.json"
]

for filename in filenames:
    downloaded_model_path = hf_hub_download(
        repo_id=model_id,
        filename=filename,
        token=HUGGING_FACE_API_KEY
    )
    print(downloaded_model_path)  # where the file ended up on disk
(1) I worked out the filenames by browsing Files and versions on the Hugging Face UI.
This model is almost 7GB in size, so you probably want to connect your computer to an ethernet cable to get maximum download speed! As well as downloading the model, the script prints out the location of the model. In my case it’s the following:
Any files that you don’t explicitly download like this will be downloaded the first time that you use the model. I’m downloading everything separately so that I don’t unexpectedly have to wait for things later on!
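If you want to double-check that everything arrived, a small standard-library sketch like the following could help. The function name summarize_download is my own invention for illustration, not part of huggingface_hub, and it assumes you pass in the folder that holds the downloaded files yourself (for example, the parent folder of one of the paths that the download script printed).

```python
from pathlib import Path


def summarize_download(model_dir, expected_filenames):
    """Return (missing filenames, total size in bytes) for a model folder.

    Hypothetical helper: model_dir is whatever folder holds the
    downloaded files; nothing here talks to Hugging Face.
    """
    model_dir = Path(model_dir)
    missing = []
    total_bytes = 0
    for name in expected_filenames:
        path = model_dir / name
        if path.is_file():
            # stat() follows symlinks, so linked files report their real size
            total_bytes += path.stat().st_size
        else:
            missing.append(name)
    return missing, total_bytes
```

You could call it with something like summarize_download(Path(downloaded_model_path).parent, filenames) and check that the list of missing files is empty.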
Running the LLM
We’re now going to use the model locally with LangChain so that we can create a repeatable structure around the prompt. Let’s first import some libraries:
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain
And now we’re going to create an instance of our model:
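model_id = "lmsys/fastchat-t5-3b-v1.0"
llm = HuggingFacePipeline.from_model_id(
    model_id=model_id,
    task="text2text-generation",
    model_kwargs={"temperature": 0, "max_length": 1000},
)
The value that we use for task needs to match the task type shown on the model’s page on Hugging Face.
Figure 1. The task type for this model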
Now let’s create a template for what we want the LLM to do when we send it a prompt:
template = """
You are a friendly chatbot assistant that responds conversationally to users' questions.
Keep the answers short, unless specifically asked by the user to elaborate on something.
Question: {question}
"""
prompt = PromptTemplate(template=template, input_variables=["question"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
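To get a feel for what ends up being sent to the model, here’s a rough sketch using plain Python string formatting. It only approximates what the prompt template does with the {question} placeholder: I’m assuming the default f-string style of template, and build_prompt is a made-up helper name rather than anything from LangChain.

```python
# Same wording as the template in this post, closed off so it stands alone
TEMPLATE = """
You are a friendly chatbot assistant that responds conversationally to users' questions.
Keep the answers short, unless specifically asked by the user to elaborate on something.

Question: {question}
"""


def build_prompt(question):
    """Fill the {question} placeholder, roughly like the prompt template would."""
    return TEMPLATE.format(question=question)


print(build_prompt("Describe some famous landmarks in London"))
```

Printing the filled-in prompt like this is a cheap way to spot stray whitespace or a missing placeholder before waiting on the model.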
Next, let’s create a little function that asks a question and prints the response:
def ask_question(question):
    result = llm_chain(question)
    print(result["question"])
    print(result["text"])
We’ll also create a (ChatGPT generated) Timer context manager to make it easier to see how long it takes to answer each question:
import time

class TimerError(Exception):
    """A custom exception used to report errors in use of Timer class"""

class Timer:
    def __init__(self):
        self._start_time = None

    def __enter__(self):
        if self._start_time is not None:
            raise TimerError("Timer is already running")
        self._start_time = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._start_time is None:
            raise TimerError("Timer is not running")
        elapsed_time = time.perf_counter() - self._start_time
        self._start_time = None
        print(f"Elapsed time: {elapsed_time:0.4f} seconds")
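If the class feels like a lot of ceremony, a generator-based context manager is one possible alternative. This is just a sketch: the name timer and the optional label argument are my own inventions, and it prints in the same format as the class version.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label="Elapsed time"):
    """Print how long the body of the with-block took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the body raises, so we always get a timing
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:0.4f} seconds")


with timer():
    sum(range(1_000_000))
```

The try/finally means you still see a timing if the model call blows up halfway through.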
Now let’s see how well the model knows London:
with Timer():
    ask_question("Describe some famous landmarks in London")
Describe some famous landmarks in London
<pad> Some famous landmarks in London include:
* Buckingham Palace
* St. Paul's Cathedral
* The Tower of London
* The London Eye
* The London Eye is a giant wheel that flies over London.
Elapsed time: 17.7592 seconds
I’m not sure about that last bullet, but I do like the idea of a giant wheel flying over the city! Let’s try something else:
with Timer():
    ask_question("Tell me about Apache Kafka in a few sentences.")
Tell me about Apache Kafka in a few sentences.
<pad> Apache Kafka is a distributed streaming platform that allows for the real-time processing of large amounts of data. It is designed to be scalable, fault-tolerant, and easy to use.
Elapsed time: 15.7795 seconds
Not too bad. It doesn’t do so well if I ask about Apache Pinot though!
with Timer():
    ask_question("Tell me about Apache Pinot in a few sentences.")
Tell me about Apache Pinot in a few sentences.
<pad> Apache Pinot is a Java framework for building web applications that can handle a wide range of tasks, including web development, database management, and web application testing.
Elapsed time: 13.6518 seconds
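A side note on the output: all three responses start with a <pad> token. If that bothers you, one simple option is to trim it off before printing. The strip_pad helper below is hypothetical and just does plain string handling; it isn’t something provided by LangChain or transformers.

```python
def strip_pad(text):
    """Remove a leading '<pad>' token and any surrounding whitespace."""
    text = text.strip()
    if text.startswith("<pad>"):
        text = text[len("<pad>"):]
    return text.strip()


print(strip_pad("<pad> Apache Kafka is a distributed streaming platform."))
```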
It’s also nowhere near as fast as ChatGPT, but my computer isn’t as good as the ones that they use!
Having said that, it is pretty cool to be able to run this type of thing on your own machine and I think it could certainly be useful if you want to ask questions about your own documents that you don’t want to send over the internet.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.