How to get started with LLMs within your own four walls: a quick guide

LLMs have been the hot topic ever since the launch of ChatGPT. The area moves at a staggering pace, with each day bringing new models, frameworks, and announcements. However, in the wake of this explosion (we refuse to call it “hype”), at least in Europe, we have also grown a bit prudent and critical: if we use LLMs in the cloud, who is collecting our data? You don’t want to be the person who hands over company secrets to ChatGPT, only to have them reappear in someone else’s presentation (this actually happened, see here).

Instead, thanks to open-source (or, at least, open-weights) models such as the incredible LLaMa 3, you can run LLMs perfectly fine on your own machine and stay in control of your data, too! As you might have heard, LLMs are power-hungry and have an especially burning appetite for your graphics card’s memory - those billions of parameters have to be stored somewhere. Fear not, and look to open source once more: the community has quickly found ways to reduce memory requirements through a process called quantization. You can imagine this as restricting the number of decimal digits stored for each parameter of the network. Of course, quantization can be overdone: quantize too aggressively and the model quickly starts producing wrong results.
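To make the idea concrete, here is a toy sketch in Python/NumPy of what 4-bit quantization boils down to: store only a small integer per weight plus a shared scale factor. The actual schemes used by llama.cpp and ollama are more elaborate (per-block scales, several q4 variants), so treat this purely as an illustration:

import numpy as np

# Toy 4-bit quantization: each weight becomes an integer in -7..7 plus one shared scale.
# Real q4 formats work on blocks of weights, but the principle is the same.
weights = np.random.randn(8).astype(np.float32)

scale = np.abs(weights).max() / 7                               # one scale for the whole block
q = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)   # this is what gets stored
dequantized = q * scale                                         # this is what the model computes with

print("original:   ", np.round(weights, 3))
print("dequantized:", np.round(dequantized, 3))
print("max error:  ", float(np.abs(weights - dequantized).max()))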

In our experience, quantizing the weights down from 16-bit (“fp16”) to 4-bit (“q4”) does do some damage, but the networks remain perfectly usable. In return, you get to run larger (and thus better!) LLMs on your machine. While we see new LLMs and fine-tunes (see further down) every day, we’ve hand-picked a few that we’ve been using for a long time. The following table shows some recommended LLMs for various applications. An “x” means that the system in question can handle the LLM.

Please note that while some systems can run the larger networks, it might not be wise to do so in terms of execution speed. Also, this table is dramatically simplified - the memory requirements depend on the length of the inputs (the context size) as well.
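As a rough rule of thumb, the weights alone occupy about (number of parameters) × (bits per parameter) / 8 bytes; everything else (most notably the KV cache, which grows with the context size) comes on top. A hypothetical back-of-the-envelope calculation for an 8-billion-parameter model:

# Rough VRAM estimate for the model weights alone; the KV cache and runtime
# overhead come on top, and real q4 formats store slightly more than 4 bits
# per weight because of the scale factors.
def weight_memory_gb(params_billion, bits_per_param):
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for bits, label in [(16, "fp16"), (4, "q4")]:
    print(f"8B model @ {label}: ~{weight_memory_gb(8, bits):.0f} GB")
# 8B model @ fp16: ~16 GB
# 8B model @ q4:   ~4 GB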

Now, with all that out of the way, let’s finally run one of these LLMs. For this, we’ll use ollama, a beginner-friendly LLM-server-in-a-box. Please follow the install instructions on its “Download” page.

Once the installation process is complete, a simple:

ollama run phi3

is sufficient to execute the following steps:

• Download a quantized version of the model, including required data such as the prompt template - basically a template string that tells ollama how to format your messages so the LLM understands the structure of the conversation (see the sketch after this list)

• Move the model to the GPU (if you have one installed)

• Start a conversation on the command line (seriously, just start typing and enjoy!)
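To give you an idea of what such a prompt template does, here is a purely illustrative sketch in the style of the widely used ChatML format; it is not the exact template ollama ships for phi3 (you can inspect the real one with ollama show phi3 --modelfile):

# Illustrative only: a generic ChatML-style template, not phi3's actual template.
# The template's job is to turn a list of chat messages into one flat prompt string.
def apply_chat_template(messages):
    prompt = ""
    for msg in messages:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    return prompt + "<|im_start|>assistant\n"   # cue the model to answer next

print(apply_chat_template([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is the sky blue?"},
]))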

However, the real kicker is invisible: an OpenAI-compatible server is started in the background on port 11434. Thanks to this, you can now plug your local LLM into most apps that work with the OpenAI API. This opens up countless possibilities - stay tuned for the next posts, in which we present a few plug-and-play examples (with code!).
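As a small teaser, here is a minimal sketch using the official openai Python package (pip install openai); ollama exposes its OpenAI-compatible API under the /v1 path and ignores the API key, which the client nevertheless insists on:

# Minimal sketch: talk to the local ollama server through the OpenAI client.
# Assumes `pip install openai` and that `ollama run phi3` has already pulled the model.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # required by the client, ignored by ollama
)

response = client.chat.completions.create(
    model="phi3",
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
)
print(response.choices[0].message.content)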

For now, let’s focus on chat again. Ollama’s chat in the terminal is fine, but we want more comfort - multiple conversations and persistent storage, think ChatGPT interface. Luckily, open source delivers once again with Open WebUI. If you have Docker installed, issue the following command:

docker run -d --network=host -e OLLAMA_BASE_URL=http://127.0.0.1:11434 -e WEBUI_AUTH=False -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main

After about 30 seconds, open your browser at http://localhost:8080, select your model in the top left corner, and start chatting!

This concludes our short getting started guide. Watch out for follow-up posts in which we’ll explore some applications in which your LLMs come in handy. Until then, feel free to play around and push your machine to the limit!

PS: Below the “AI” layer, what your machine really executes are classic HPC operations. In case your LLM runs too slowly, feel free to reach out - after all, performance tuning for HPC is our core strength at HPCLabs!
