Serving Multiple AI Models Effortlessly

In our previous post, we introduced vLLM, a production-ready, state-of-the-art inference engine, along with key concepts like layers and continuous batching. We hope by now you’ve been able to set up vLLM and have enjoyed using it! We think vLLM is fantastic; it has been an essential part of our workflow for a while now.

Why Self-Host?

At HPClabs, we love the productivity boost we get from using LLMs for tasks like coding and more. However, contrary to what the advertising might suggest, this wasn’t an easy journey: we had to experiment with various models, tools, and workflows. Our setup includes an AMD Ryzen 5800X, 64 GB of RAM, an RTX 3090, and a TITAN RTX, which is a pretty solid machine. Yet we can tell you firsthand that even with this setup, no matter which models you use, it is not on par with top-tier offerings like ChatGPT or Claude 3.5 Sonnet from Anthropic.

That said, the drop in LLM inference costs for the top models since the launch of the original ChatGPT has been nothing short of dramatic: roughly a 240-fold reduction.

Still, there are valid reasons to consider self-hosting. Among them are:

- Data Ownership: Your data remains yours, and only yours. No data is sent to the cloud or used as training material for future models.

- Cost Management: If you already have decent hardware or are willing to make an upfront investment, hosting your own inference can help cap your costs.

- Hands-On Experience: We firmly believe that to truly learn and understand something, you need to get your hands dirty. Self-hosting allows you to observe, tune, and adjust models to better suit your specific needs.

As Mitko Vasilev eloquently states on LinkedIn: "Make sure you own your AI. AI in the cloud is not aligned with you; it's aligned with the company that owns it."

Serving Multiple Models with Ease

It’s clear that serving LLMs for multiple users requires both power and VRAM. While most of us don’t have a $500,000 DGX machine in the basement, we still want to explore and experiment with different models to see which are most useful for daily tasks. The solution? Dynamic loading and unloading of models.

With dynamic model management, your server loads one model at a time to serve clients. When a different model is requested, it unloads the current model and loads the new one.

This setup isn’t ideal for larger teams, but it lets you explore different models without manually switching between them. Fortunately, the open-source community offers great tools for this. One such tool is TabbyAPI, a lightweight API server built on top of the ExLlamaV2 inference engine. It lets you serve multiple models and load/unload them dynamically on request.

Setting Up TabbyAPI

Let’s walk through setting up TabbyAPI. Imagine you’re using a machine with a 24 GB VRAM GPU, like an RTX 3090, and you want to switch between two modes: coding and general-purpose. For coding tasks, we recommend Mistral's 22B model, "Codestral," which fits into 24 GB at 4-bit quantization. For general-purpose tasks, we use Meta's Llama 3.1 8B Instruct model at 6-bit quantization.
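As a rough sanity check (ignoring the KV cache and runtime overhead), the weights alone take about parameters × bits ÷ 8 bytes: roughly 22B × 4 bits ≈ 11 GB for Codestral and 8B × 6 bits ≈ 6 GB for Llama 3.1. Since only one model is loaded at a time, each fits comfortably into 24 GB of VRAM with room left over for context.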

Here’s how to set up TabbyAPI and download the models. Make sure your machine has at least 32 GB of RAM and about 50 GB of available storage, and is running a recent Linux distribution.
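If you want to double-check those requirements first, two standard commands will do (run them in the directory where you plan to put TabbyAPI):

free -h    # total memory should show 32 GB or more
df -h .    # at least ~50 GB should be available on this filesystem

With that confirmed, download TabbyAPI: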

git clone https://github.com/theroyallab/tabbyAPI.git TabbyAPI

If this is your first time cloning large model repositories, you’ll need to set up Git Large File Storage (LFS):

git lfs install
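Note that git lfs install only sets up the Git hooks; the git-lfs binary itself must already be present. On Debian or Ubuntu, for example, you can install it with:

sudo apt install git-lfs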

Next, download the models. These commands may take some time, since they pull several gigabytes of model weights:

cd TabbyAPI/models

git clone https://huggingface.co/machinez/Codestral-22B-v0.1-exl2 codestral

cd codestral

git checkout 4_0

cd ../

git clone https://huggingface.co/turboderp/Llama-3.1-8B-Instruct-exl2 llama3-small

cd llama3-small

git checkout 6.0bpw

cd ../
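If everything went well, the models directory should now contain one folder per model (alongside any placeholder files the repository ships with), matching the names we will reference in the configuration:

ls
codestral  llama3-small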

Once the downloads are complete, it’s time to configure and launch TabbyAPI. Go back to the TabbyAPI root directory (cd ..) and edit the following settings in config.yml:

network:
  host: 0.0.0.0
  disable_auth: True
  [...]

model:
  model_name: llama3-small
  [...]

This configuration disables the need for an API key and ensures that Llama 3.1 is loaded by default. Keep in mind that host: 0.0.0.0 makes the server reachable from your entire local network, so only disable authentication in an environment you trust. Now you can launch TabbyAPI:

./start.sh

If prompted for the CUDA version, choose 12.x, then wait for TabbyAPI to start and report that it’s ready:

INFO:     Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)

Finally, check if both models are ready for serving:

curl http://localhost:5000/v1/models
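If both downloads worked, the response lists both model folders, roughly like this (the exact fields vary between TabbyAPI versions):

{"object": "list", "data": [{"id": "codestral", "object": "model", ...}, {"id": "llama3-small", "object": "model", ...}]}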

Great! You’re now ready to experiment with multiple models using any tool compatible with OpenAI servers, such as OpenWebUI.
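To talk to the currently loaded model, send a standard OpenAI-style chat completion request. Here is a minimal sketch, using the default port and the model names from above (the prompt is just an example):

curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3-small",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}]
      }'

Switching to Codestral happens through TabbyAPI's model management endpoints, for example a POST to /v1/model/load naming the other model, which unloads the current one and loads the new one. These endpoints normally require an admin key, but we disabled authentication above. The exact request payload differs between versions, so check the interactive API documentation the server exposes (usually at http://localhost:5000/docs) for the schema your installation expects.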

Tips for Maximizing Your Setup

By now, after following our previous posts, you should be able to self-host several interesting models, connect them to a UI, or integrate them into scripts and pipelines.

AI, especially open-source AI, has been developing at a staggering pace. New tools and use cases emerge daily. At HPClabs, we’ve embraced these advancements, using AI to improve our workflows. Now, with the power of local AI at your fingertips, here are some useful tools to maximize your productivity:

Code Tools (your AI junior pair programmers!)

Continue.dev: A Visual Studio Code extension that integrates chat and autocomplete directly into the IDE. It allows you to apply code from the chat interface as a diff to your codebase. We recommend DeepSeek’s fast 6.7B coder model for autocomplete.

Aider: An AI agent for coding. It autonomously identifies the necessary files in your codebase, implements the prompt, and offers the result as a Git commit.

Claude.dev: Another AI coding agent. Originally tied to Claude, it has since been extended to support several backends.

General Tools

Open Interpreter: Talk to your terminal in plain language instead of memorizing cryptic shell commands. By default, it asks for confirmation before executing anything.

Obsidian Copilot: Fire off custom prompts filled with notes from your Obsidian vault, or chat with your vault's notes.

Happy hacking!

PS: Codestral currently doesn’t include a prompt template. You can create one by placing the following file in TabbyAPI's templates folder as codestral.jinja:

{% if messages[0]['role'] == 'system' -%}
   {{ bos_token + '[INST] <<SYS>>\n' + messages[0]['content'] | trim + '\n<</SYS>>\n\n' }}
   {% set messages = messages[1:] %}
{% else -%}
   {{ bos_token + '[INST] ' }}
{% endif -%}

{% for message in messages -%}
   {% if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
       {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
   {% endif -%}

   {% if message['role'] == 'user' -%}
       {% if loop.index0 > 0 -%}
           {{ bos_token + '[INST] ' }}
       {% endif -%}
       {{ message['content'] | trim + '[/INST]' }}
   {% elif message['role'] == 'assistant' -%}
       {{ message['content'] | trim + eos_token }}
   {% endif -%}
{% endfor -%}

Then, set

prompt_template: codestral

in your config.yml.
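In the config layout shown above, this key sits in the model section, alongside your other model settings (a minimal sketch; keep the rest of your configuration as it is):

model:
  [...]
  prompt_template: codestral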
