
Getting started with LLMs locally

So you’ve heard all the hype about Large Language Models (LLMs) and want to run one yourself locally. This guide details how.

Why would I want to run it locally?

  • You want to peer behind the curtain of what’s actually happening
  • You don’t want your prompts to be sent to a third party
  • You want to test the LLM with sensitive intellectual property
  • Why not?

Prerequisites

  • Ubuntu (ish)
  • Huge amounts of RAM, or a huge swapfile/swap partition. 32GB+ is commonly required (a quick way to check what you have is sketched after this list)
  • Python experience
  • A fast internet connection
  • 50GB+ free storage space
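
If you’re not sure how much RAM and swap you have, here’s a minimal Python sketch (assuming a standard Linux /proc/meminfo, as on Ubuntu) that prints the totals:

# print total RAM and swap from /proc/meminfo (Linux only)
def read_meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            # values are reported in kB, e.g. "MemTotal:  32768000 kB"
            info[key] = int(value.strip().split()[0])
    return info

meminfo = read_meminfo()
print(f"RAM:  {meminfo['MemTotal'] / 1024 / 1024:.1f} GiB")
print(f"Swap: {meminfo['SwapTotal'] / 1024 / 1024:.1f} GiB")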

Getting started guide

Install Docker

If you have an NVIDIA GPU configured with CUDA, you might be able to follow https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html to use GPU acceleration.
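
If you do go the GPU route, you’ll likely also need to pass --gpus all to the docker run command later on. Once you’re inside the container’s Python interpreter, a quick way to check whether PyTorch can actually see the GPU (assuming the container’s PyTorch build has CUDA support) is:

# inside the container's python3 interpreter
import torch
print(torch.cuda.is_available())   # True if a GPU is visible to PyTorch
print(torch.cuda.device_count())   # how many GPUs it can see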

Install Git and Git LFS and set them up.

sudo apt install git git-lfs
git lfs install

Download the Docker image with all the tools you need; it’s ~10GB.

docker pull huggingface/transformers-all-latest-gpu

Find the model you want to play with on Hugging Face, such as https://huggingface.co/cerebras/Cerebras-GPT-2.7B

You can download it by cloning the Git repo (~11GB):

git clone https://huggingface.co/cerebras/Cerebras-GPT-2.7B
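
If the clone finishes suspiciously quickly, Git LFS may have left behind small pointer files instead of the multi-gigabyte weights. Here’s a quick sanity-check sketch (assuming the repo was cloned into ./Cerebras-GPT-2.7B) that lists file sizes so you can spot anything that should be huge but isn’t:

# list files in the cloned model directory with their sizes
import os

model_dir = "./Cerebras-GPT-2.7B"
for name in sorted(os.listdir(model_dir)):
    path = os.path.join(model_dir, name)
    if os.path.isfile(path):
        size_mb = os.path.getsize(path) / (1024 * 1024)
        print(f"{name:40s} {size_mb:10.1f} MB")
# the weight files (e.g. pytorch_model*.bin) should be gigabytes, not a few
# hundred bytes - if they're tiny, run `git lfs pull` inside the repo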

Now run the model:

docker run -it --rm  -v $(pwd):/model docker.io/huggingface/transformers-all-latest-gpu
# From now all commands are inside the container
cd /model
# start a python interpreter to use
python3
# from now all commands are inside the python interpreter

Actually playing with the model

Now for the Python, in the REPL:

# This stage will take a while as it loads the model into RAM,
# and may get OOM-killed if you run out of memory
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("./Cerebras-GPT-2.7B")
model = AutoModelForCausalLM.from_pretrained("./Cerebras-GPT-2.7B")
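# Optional, and an assumption rather than something this model requires:
# if you are tight on RAM, you can try loading the weights in half precision
# instead, which roughly halves the memory needed (needs a reasonably recent
# transformers + torch):
# import torch
# model = AutoModelForCausalLM.from_pretrained("./Cerebras-GPT-2.7B",
#                                              torch_dtype=torch.float16)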

# convenience function to wrap the steps from prompt to response
def model_on_prompt(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, num_beams=5,
                        max_new_tokens=50, early_stopping=True,
                        no_repeat_ngram_size=2)
    text_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(text_output[0])

# Now to run the model on your prompt
model_on_prompt("What can I use LLMs for?")

Copyright © 2023 Richard Finlay Tweed. All rights reserved. All views expressed are my own