maartenjan.dev

Hi! I'm Maarten-Jan, a Software Engineer focused on Java, Kotlin, Event Sourcing, Kubernetes and of course Bash! I sometimes write down what I do.

My local AI journey - Part 3

A couple of weeks ago, I wrote about running llama.cpp within a Docker setup. I also mentioned I had started using the Qwen3-Coder-Next model. In this post I want to share some of my experiences using this (and other) models.

First off, I download models using the Hugging Face CLI. Downloading a specific quantisation of a specific model can be done with a command like:

hf download unsloth/Qwen3-Coder-Next-GGUF --local-dir /home/<username>/models/Qwen3-Coder-Next-GGUF --include "*UD-Q8_K_XL*"

If you're downloading over an SSH connection to a remote machine, you may want to run the command in the background, since the download can take a while. This can be done with nohup:

nohup <hf command> > output.log 2>&1 &

Replace <hf command> with the specific download command. This redirects both regular output and errors to the output.log file, and keeps the download running even if your SSH session disconnects.
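Putting the two together, the full background download looks like this (note the quotes around the --include pattern, so your shell doesn't try to expand the glob locally):

```shell
# Download the Q8 quantisation in the background, detached from the terminal;
# output and errors go to output.log
nohup hf download unsloth/Qwen3-Coder-Next-GGUF \
  --local-dir /home/<username>/models/Qwen3-Coder-Next-GGUF \
  --include "*UD-Q8_K_XL*" > output.log 2>&1 &

# Follow the progress; Ctrl-C stops tailing, not the download
tail -f output.log
```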

Having downloaded the model, I obviously started using it. I found the unsloth/Qwen3-Coder-Next-GGUF UD-Q8_K_XL quantisation quite usable. It's pretty fast (about 35 tokens per second), and it uses about 95Gi of memory with a context window of 50,000 tokens.
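For reference, a minimal llama.cpp server invocation with a comparable context size might look like the sketch below. The exact file name and flags are assumptions here, not the setup from the previous post; adapt them to your own paths and hardware:

```shell
# Sketch: serve the downloaded GGUF with a 50,000-token context window.
# -c sets the context size, -ngl offloads layers to the GPU (if available),
# and the server exposes an OpenAI-compatible API on port 8080.
llama-server \
  -m /home/<username>/models/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-UD-Q8_K_XL.gguf \
  -c 50000 \
  -ngl 99 \
  --host 0.0.0.0 --port 8080
```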

In the last couple of weeks, I used it mostly through OpenWebUI, or through Conduit, an Android client for OpenWebUI (so not talking to the model directly). Having a searchable history of queries available was sometimes useful.

I also started experimenting with OpenCode, an open source AI agent. I did not pick OpenCode for any specific reason, other than that it is an open source tool.
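To point an agent like OpenCode at a local model, you can register the llama.cpp server as a custom OpenAI-compatible provider. The fragment below is a sketch based on OpenCode's custom-provider configuration; the provider and model names are my own placeholders, so check OpenCode's documentation for the exact schema:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama-cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (local)",
      "options": {
        "baseURL": "http://localhost:8080/v1"
      },
      "models": {
        "qwen3-coder-next": {
          "name": "Qwen3-Coder-Next"
        }
      }
    }
  }
}
```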

I have not been using the agent as much as I probably should, but from what I've seen, I'm quite impressed. The setup I have now feels pretty usable. I mostly use OpenCode to review my code and create tests, and it performs well enough to save me time!