My local AI journey - Part 2
A couple of weeks ago, I wrote about my first steps running LLMs locally. This post is the continuation of that story. Be warned: these posts are not really meant as a guide, but more as a collection of links with some context.
I soon figured out that Ollama's model library was quite limited. Through some DuckDuckGo'ing, I found that many more models are available, specifically through Hugging Face. Donato Capitella, whom I've mentioned a couple of times now on this blog, also mentioned running a 235-billion-parameter model, using about 100 GB.
But how to get this up and running? Well, I ended up running these bigger models through Docker. Donato Capitella actually prepares Docker images with the latest drivers and the llama.cpp server, which lets you run many if not most of the models available on Hugging Face.
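Getting the model files onto disk is a separate step. A minimal sketch using the official `huggingface-cli` tool — the repository name and quantization filter below are placeholders, not necessarily what I used:

```shell
# Install the Hugging Face CLI (ships with the huggingface_hub Python package)
pip install -U "huggingface_hub[cli]"

# Download only the GGUF files of a repo into a local models folder.
# Repo name and --include pattern are examples; pick your own model/quant.
huggingface-cli download unsloth/Qwen3-Coder-Next-GGUF \
  --include "*UD-Q8_K_XL*" \
  --local-dir ~/models
```

Filtering with `--include` matters here: GGUF repos often contain many quantization variants, and you usually only want one.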
I also found out about Unsloth, a community that (among other things) quantizes models so they can run on systems with relatively limited resources. They also have a lot of guides for using their models. I am currently running their Qwen3-Coder-Next model, more or less following their guide (except for the installation of llama.cpp, but more on that in a bit). The quantized UD-Q8_K_XL version takes about 90 GB, which fits perfectly on my machine (you do need to configure some kernel parameters to provide enough video memory; see here).
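For reference, the kernel tweak boils down to raising how much system RAM the integrated GPU may claim. The exact parameter names and values depend on your kernel version and RAM size, so treat the lines below as a sketch rather than a recipe:

```shell
# /etc/default/grub -- let the iGPU allocate most of system RAM.
# ttm.pages_limit / ttm.page_pool_size are counted in 4 KiB pages;
# the value below (~96 GiB) is an example, size it for your machine.
GRUB_CMDLINE_LINUX_DEFAULT="... ttm.pages_limit=25165824 ttm.page_pool_size=25165824"

# Then regenerate the grub config and reboot:
# sudo update-grub && sudo reboot
```

Without this, llama.cpp can fail to allocate the model even though the machine has plenty of free RAM, because the default GPU memory limits are much lower.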
The last part of the puzzle is a (web) application to actually interact with these models. llama.cpp does ship a web client, but from what I saw, it's quite limited. I am currently using OpenWebUI, as well as IntelliJ's local AI client. OpenWebUI lets you configure tools and RAG pipelines, which I have messed around with and may write a bit about at a later date.
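Clients like OpenWebUI can hook into llama.cpp because its server exposes an OpenAI-compatible API. A quick smoke test from the command line, assuming the server is running on port 8082 as in the compose file further down:

```shell
# Send a chat request to llama.cpp's OpenAI-compatible endpoint.
# The loaded model answers; no "model" field is strictly needed here.
curl http://localhost:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Say hello in one word."}
        ]
      }'
```

In OpenWebUI you point an OpenAI-type connection at the same base URL (`http://<host>:8082/v1`) to get the model in the model picker.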
Containerizing these apps seemed the easiest way forward, so I ended up with a Docker Compose file that looks (partly) like this:
```yaml
services:
  openWebUI:
    image: ghcr.io/open-webui/open-webui:v0.8.3
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - ./data/openWebUI:/app/backend/data
    restart: unless-stopped
  llama-cpp:
    image: kyuz0/amd-strix-halo-toolboxes:rocm-7.2
    container_name: llama.cpp
    ports:
      - "8082:8082"
    volumes:
      - /<home>/models:/models # mount the folder where your models are stored
    restart: unless-stopped
    entrypoint: /bin/bash
    command: -c "llama-server --models-dir /models -c 50000 -ngl 999 -fa 1 --no-mmap --port 8082 --host 0.0.0.0" # --host exposes the API to the network
    tty: true
    devices: # required to expose the GPU to Docker
      - /dev/kfd
      - /dev/dri
    security_opt:
      - seccomp=unconfined
```
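With that file in place, bringing everything up is a single command; llama.cpp's health endpoint is a convenient way to check whether the model has finished loading:

```shell
# Start both containers in the background
docker compose up -d

# llama.cpp's server reports readiness on /health:
# 503 while the model is still loading, 200 once it is ready
curl -i http://localhost:8082/health

# OpenWebUI is then reachable in the browser on http://localhost:3000
```

If something goes wrong, `docker compose logs llama-cpp` shows the server output, including whether the GPU was actually picked up.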
That's it for now! The next step is to experiment with (open source) AI agents, using tools and RAG. More on that in the future!