Getting started with LLMs locally
So you’ve heard all the hype about Large Language Models (LLMs) and want to run one yourself locally. This guide details how.
Why would I want to run it locally?
- You want to peer behind the curtain of what’s actually happening
- You don’t want your prompts to be sent to a third party
- You want to test the LLM with sensitive intellectual property
- Why not?
Prerequisites
- Ubuntu (ish)
- Plenty of RAM, or a large swapfile/swap partition; 32GB+ is commonly required (see the quick check after this list)
- Python experience
- A fast internet connection
- 50GB+ free storage space
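If you are not sure your machine clears the RAM and storage bars above, here is a quick check you can run from a Python shell. It uses only the standard library, and the sysconf names are Linux-specific (fine on Ubuntu).
# rough check of physical RAM and free disk space on Linux
import os, shutil
ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
free_gb = shutil.disk_usage("/").free / 1024**3
print(f"RAM: {ram_gb:.1f} GB, free disk: {free_gb:.1f} GB")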
Getting started guide
Install Docker
If you have an NVIDIA GPU configured with CUDA, you might be able to follow https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html to use GPU acceleration.
Install git and Git LFS and set them up.
sudo apt install git git-lfs
git lfs install
Download the Docker image with all the tools you need (this is ~10GB):
docker pull huggingface/transformers-all-latest-gpu
Find a model you want to play with on Hugging Face, such as https://huggingface.co/cerebras/Cerebras-GPT-2.7B
You can download it by cloning its git repo (~11GB):
git clone https://huggingface.co/cerebras/Cerebras-GPT-2.7B
Now to run the model:
docker run -it --rm -v $(pwd):/model docker.io/huggingface/transformers-all-latest-gpu
# From now all commands are inside the container
cd /model
# start a python interpreter to use
python3
# from now all commands are inside the python interpreter
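Before loading anything, it is worth checking whether the container can actually see a GPU. This is a small sketch assuming torch is available inside the huggingface/transformers-all-latest-gpu image; note that if you set up the NVIDIA container toolkit, the container also needs GPU access granted on the docker run line (typically with --gpus all). If the check prints False, everything below still works, just on the CPU and more slowly.
# inside the python3 interpreter: check whether a CUDA GPU is visible
import torch
print(torch.cuda.is_available())  # False means inference will run on the CPU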
Actually playing with the model
Now for the Python, typed into the interpreter you just started.
# This stage will take a while as it loads the model into RAM, and may lead to the process getting OOM-killed
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("./Cerebras-GPT-2.7B")
model = AutoModelForCausalLM.from_pretrained("./Cerebras-GPT-2.7B")
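If loading the model runs your machine out of memory, one option that may help is loading the weights in half precision. This is a sketch rather than part of the original steps, and it assumes your installed torch build handles bfloat16 inference on the CPU; otherwise stick with the default load above.
# optional: load the weights in half precision to roughly halve RAM usage
# (assumption: your torch build supports bfloat16 on CPU)
import torch
model = AutoModelForCausalLM.from_pretrained("./Cerebras-GPT-2.7B",
                                             torch_dtype=torch.bfloat16)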
# convenience function to wrap the steps from prompt to response
def model_on_prompt(text):
    # tokenize the prompt into tensors the model can consume
    inputs = tokenizer(text, return_tensors="pt")
    # generate a completion using beam search
    outputs = model.generate(**inputs, num_beams=5,
                             max_new_tokens=50, early_stopping=True,
                             no_repeat_ngram_size=2)
    # decode the token ids back into text and print the first result
    text_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(text_output[0])
# Now to run the model on your prompt
model_on_prompt("What can I use LLMs for?")
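Beam search tends to give fairly deterministic, sometimes repetitive completions. generate() also supports sampling if you want more varied output; the parameter values below are only illustrative starting points.
# sampling-based generation for more varied completions
inputs = tokenizer("What can I use LLMs for?", return_tensors="pt")
outputs = model.generate(**inputs, do_sample=True, temperature=0.8,
                         top_p=0.9, max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])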