LLM Hosting Beyond One User - A Guide for Small Groups

In our previous post, we provided a crash course on getting started with LLMs locally using ollama and explored models that can run efficiently on end-user systems. We wrapped up by showing you how to launch a chat UI and start chatting away!

Expanding Beyond a Single User

This time, we’re taking things a step further: hosting multiple LLMs for a small group, like an office team in a small-to-medium enterprise, your local club, or your tax advisory colleagues. This guide is perfect for any small group (2-5 people) looking to dive into LLMs without having their conversations stored in the cloud.

Compared to our last guide, we’ll be expanding the setup in two key ways:

  1. Multiple Models: We’ll show you how to use multiple models on a budget, even without a high-end workstation.
  2. Concurrent Users: We’ll enable multiple users to access the server simultaneously, processing requests concurrently.

Let’s start by reviewing what we already have: ollama is fantastic, but it’s designed for a single user on a single machine. In fact, ollama wraps the well-known llama.cpp project and, in its default configuration, shares the same limitation: it processes only one request at a time. While this works well in a single-user setup, it causes delays in a group setting, as everyone waits for their turn, even if your machine has spare capacity.

Continuous Batching

The key feature that addresses this issue is continuous batching. Here’s a simple way to think about it: LLMs are deep neural networks that process information layer by layer. Imagine an assembly line in a car factory: multiple stations work in parallel, each adding parts to a different car as it moves down the line. As long as there is spare capacity on the line, the LLM engine can accept new requests and push them through the layers alongside the ones already in flight, instead of making them wait until the current batch is done. Even better, thanks to KV caching, requests that share the same prompt prefix can be processed faster together!
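To make this more concrete, below is a deliberately simplified Python sketch of such a scheduling loop. Everything in it (the slot count, the request lengths, the function name) is invented for illustration, and vllm’s real scheduler is far more sophisticated, but the principle of admitting new work as soon as a slot frees up is the same.

# Toy model of continuous batching (illustration only, not vllm's code):
# each step generates one token for every active request, finished
# requests leave the batch immediately, and waiting requests are
# admitted as soon as a slot becomes free.
from collections import deque

MAX_SLOTS = 4  # pretend the GPU has room for four requests at once

def run_continuous_batching(requests):
    waiting = deque(requests)   # entries are (request_id, tokens_to_generate)
    active = {}                 # request_id -> tokens still to generate
    step = 0
    while waiting or active:
        # Admit new requests whenever a slot is free; with static batching
        # they would have to wait for the whole batch to finish instead.
        while waiting and len(active) < MAX_SLOTS:
            req_id, length = waiting.popleft()
            active[req_id] = length
        # One decode step: every active request produces one more token.
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] == 0:
                print(f"step {step}: request {req_id} finished")
                del active[req_id]
        step += 1

run_continuous_batching([("A", 3), ("B", 5), ("C", 2), ("D", 4), ("E", 1)])

In this toy run, request E is admitted as soon as C finishes, rather than waiting for the entire first batch of four to complete. That is exactly the behavior that keeps a shared server responsive.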

Lesson learned: To serve a group, you’ll need a server that supports continuous batching. Based on our experiences at HPCLabs, we recommend two open-source projects:

  1. vllm: A production-grade LLM server that offers high performance, squeezing in requests until your machine is fully loaded.
  2. aphrodite: A more agile and feature-rich option than vllm, though with slightly less stability.

For a straightforward setup, we recommend using vllm. In the remainder of this post, we’ll guide you through setting up vllm to host the Llama 3.1 8B model.

Setting Up vllm

For this setup, we assume you have a machine with a recent NVIDIA GPU (e.g., RTX 3090) with at least 8 GB of VRAM (for smaller models, refer to our previous post). Start by installing vllm in a separate Python virtual environment to avoid conflicts with your system:

python -m venv venv

source venv/bin/activate

pip install vllm
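If you want to verify the installation before launching the server, a quick import check is enough (this simply prints the installed vllm version; the exact number will differ on your machine):

python -c "import vllm; print(vllm.__version__)"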

Installation might take some time due to the number of dependencies. Once done, you can start the server with the following command:

python -m vllm.entrypoints.openai.api_server --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 -q awq --dtype auto --gpu-memory-utilization 0.98 --served-model-name llama3-8b --max-model-len 16384 --host 0.0.0.0 --port 5000

That might look complex, but don’t worry—we’ll break it down later. When you start vllm, it will download the model from Hugging Face. After a while, you’ll see this message in the terminal:

INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)

You’re ready to go! Fire up your ChatUI from last time and start chatting: the UI will automatically detect the llama3-8b model. For extra fun, invite your friends on the network to join in and watch everyone’s questions get answered at the same time, rather than one after the other!
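If you prefer to test the concurrency from a script rather than a chat UI, here is a small sketch using the official openai Python client (pip install openai), which works because vllm exposes an OpenAI-compatible API. The server address, the dummy API key, and the example questions are placeholders; adjust them to your setup.

# Fire a handful of requests at the server in parallel; with continuous
# batching they are processed concurrently instead of queueing up.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

questions = [
    "Explain continuous batching in one sentence.",
    "Give me three ideas for a team lunch.",
    "What is the capital of Austria?",
]

def ask(question):
    reply = client.chat.completions.create(
        model="llama3-8b",
        messages=[{"role": "user", "content": question}],
        max_tokens=100,
    )
    return reply.choices[0].message.content

with ThreadPoolExecutor(max_workers=len(questions)) as pool:
    for question, answer in zip(questions, pool.map(ask, questions)):
        print(f"Q: {question}\nA: {answer}\n")

Because the requests arrive at the same time, vllm batches them together, and all three answers should come back in roughly the time a single one would take.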

Congratulations, you’ve just laid the foundation for your group’s personal AI assistant. If you’re only interested in using it, you can stop here. In the next section, we’ll dive into some details about vllm.

Details

One thing you’ll notice after launching vllm (for example, in nvidia-smi) is that it allocates almost all of your GPU’s VRAM. This is intentional: --gpu-memory-utilization 0.98 tells vllm to grab 98% of your VRAM and hold onto it. The memory not taken up by the model weights is used for the KV cache, which is what lets several requests fit on the GPU at once, so if you can spare the VRAM, we recommend it.

Let’s break down a few more parameters we used:

  1. --host 0.0.0.0 --port 5000: This is crucial! The host 0.0.0.0 tells the server to listen on all network interfaces, so other machines on your network can reach it, not just localhost.
  2. --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 -q awq --dtype auto: This selects a 4-bit quantized version of the Llama 3.1 8B model and specifies the quantization method (awq), while --dtype auto lets vllm pick a suitable data type on its own.
  3. --served-model-name llama3-8b: This lets you choose the name under which the model is served, which also makes it easy to check that the server is running: just execute curl http://[your-server]:5000/v1/models and make sure the name appears (see the example right after this list)!
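For reference, here is roughly what that check looks like from Python, using the requests package. The hostname is a placeholder, and the response follows the usual OpenAI models-list format, so the name you passed via --served-model-name should show up as an id:

# Quick health check from any machine on the network.
import requests

resp = requests.get("http://your-server:5000/v1/models", timeout=5)
print([model["id"] for model in resp.json()["data"]])  # expect: ['llama3-8b']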

Finally, if others can’t find the LLM on the network, check your firewall settings. If you need help, feel free to contact us at HPCLabs.

Multiple Models

But wait, didn’t we promise to extend the setup to multiple models too? Yes, indeed! However, vllm alone can only host one model at a time. When you want several models but only have limited GPU memory, you’ll need to swap models in and out on demand.

While vllm can’t do this by itself, there are other projects that add exactly this capability. One of them is TabbyAPI, which we will present in a follow-up post, so stay tuned! In the meantime, enjoy your group’s new AI assistant.

PS: For added privacy, start vllm with --disable-log-requests so your colleagues’ prompts don’t end up in the server log.
