Don't skimp on the quant when using MoE (unsloth.ai)

submitted 1 month ago by troed@fedia.io to c/localllama@sh.itjust.works


Maybe it was just me, but in case others have done the same this post might help someone else too.

I have a workstation with plenty of CPU and system RAM, but I'm "GPU poor" in that I only have a 5060 Ti with 16GB of VRAM. I also need the GPU for regular system activities, which leaves around 14GB of VRAM for the LLM.

I'm using this setup exclusively for development and system management tasks, and I've found Qwen 3.6 35B A3B to excel compared to other models. I don't have the VRAM to run the 27B dense model, so I've spent time getting the most out of the MoE.

Or so I thought. Since "everyone" says to use Unsloth UD-Q4_K_XL, that's the quant I've been using, and I've gone back and forth a bit on MTP/no MTP, a larger UB, and mmproj, since I've also started using a browser MCP, etc.

Today I took another look at their quant chart and thought that, since it's a MoE, maybe I could run Q5_K_S, which would be a step up?

Well. Now I'm using Q6_K, because it turns out I can run it with the exact same settings I'd optimized for my Q4_K_XL setup. That means there are no drawbacks, just a better-performing model. I've already noticed that it gets itself out of loops, whereas before I sometimes had to interrupt it.
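For a rough sense of the sizes involved, here is a minimal Python sketch that estimates weight size as total parameters times average bits per weight. The bits-per-weight values are the nominal figures for llama.cpp's plain quant types; Unsloth's UD quants mix types per tensor, so the real file sizes will differ, and `estimated_size_gb` is a hypothetical helper, not part of any tool mentioned in the post.

```python
# Rough GGUF weight-size estimate: total parameters x bits per weight / 8.
# Nominal bits per weight for llama.cpp's plain quant types (assumption:
# UD quants mix tensor types, so actual files will not match exactly).
NOMINAL_BPW = {
    "Q4_K": 4.5,
    "Q5_K": 5.5,
    "Q6_K": 6.5625,
    "Q8_0": 8.5,
}


def estimated_size_gb(total_params_billions: float, quant: str) -> float:
    """Approximate weight size in GB (10^9 bytes) for a given quant type."""
    bits = total_params_billions * 1e9 * NOMINAL_BPW[quant]
    return bits / 8 / 1e9


if __name__ == "__main__":
    # 35 is the total parameter count (in billions) of the 35B A3B model.
    for quant in NOMINAL_BPW:
        print(f"{quant}: ~{estimated_size_gb(35, quant):.1f} GB")
```

Under these assumptions, none of the variants fit in about 14GB of VRAM, so part of the weights would sit in system RAM at Q4 as well as at Q6. That would be consistent with the step up costing only extra system RAM here.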

This is my setup. I get >1000 t/s prefill and >20 t/s inference. I'm not chasing faster inference since I actively read the thought process when working with the LLM, but I've increased ub to get faster prefill since that's otherwise just waiting time.

./llama-server \
    -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K \
    -c 160000 \
    -n 32768 \
    -fa on \
    -ub 2048 \
    -ctk q8_0 \
    -ctv q8_0 \
    --no-mmap \
    --mlock \
    --no-warmup \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.0 \
    --host 0.0.0.0
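To put the quoted rates next to the limits in the command above, a small Python sketch follows. It assumes the rates stay constant at the stated lower bounds (1000 t/s prefill, 20 t/s decode), which is a simplification since speeds usually drop as the context fills. `seconds_to_process` is a hypothetical helper.

```python
def seconds_to_process(tokens: int, tokens_per_second: float) -> float:
    """Time needed to push `tokens` through at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


CONTEXT_TOKENS = 160_000  # -c 160000
MAX_NEW_TOKENS = 32_768   # -n 32768
PREFILL_TPS = 1000.0      # ">1000 t/s prefill", taken at its lower bound
DECODE_TPS = 20.0         # ">20 t/s inference", taken at its lower bound

if __name__ == "__main__":
    prefill = seconds_to_process(CONTEXT_TOKENS, PREFILL_TPS)
    decode = seconds_to_process(MAX_NEW_TOKENS, DECODE_TPS)
    print(f"prefill of a full context: about {prefill / 60:.1f} min")
    print(f"maximum-length reply:      about {decode / 60:.1f} min")
```

At these rates, a cold prefill of the whole context takes a few minutes at most, which is the waiting time the larger ub setting is meant to reduce.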

I also use Opencode with the DCP and Superpowers plugins, which make a tremendous difference to both context handling and planning. I have no need for a larger context. I even compact early quite often, since the tasks get done before reaching the limit.

[–] brockhold@lemmy.world 5 points 1 month ago (1 children)

This is exactly why I'm so sad that I didn't buy more DDR4 back when it was reasonably priced. I run Unsloth's Qwen 3.5 122B A10B UD-Q4_K_XL, and while it works great, I really wish I had enough RAM for Q6 or even Q8. The speed wouldn't be wildly worse, but the difference in output quality is noticeable. I'm just glad it works as well as it does in Q4. It's mostly limited by my main RAM bandwidth; the GPU helps, but I only barely hit 15 t/s decode with MTP hitting >80%.

[–] troed@fedia.io 2 points 1 month ago (1 children)

15 t/s is workable IMHO. What are your system specs? I have 96GB DDR5 but never thought about going to an even larger MoE.

[–] brockhold@lemmy.world 1 points 1 month ago* (last edited 1 month ago)

Ryzen 5950X with a slight underclock
2x32+2x16 DDR4 2666
Radeon 7900XTX 24GB with a 250W power limit

15 t/s is entirely usable as an agent, but I can't go above 131k ctx and must set --parallel 1 to fit in available RAM+VRAM. It actually still OOMs periodically, since even with nothing else running it only has a couple of GB to work with. If that happens with the context mostly filled, then you're looking at ten minutes of prompt processing before the next token.

I'm not about to buy another 64GB pair to replace the 16s, but... I wish I had done that years ago.
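The remark about being limited by main RAM bandwidth can be illustrated with a back-of-the-envelope Python sketch. It assumes dual-channel DDR4-2666 at theoretical peak, 10B active parameters all read from system RAM once per token, and nominal llama.cpp bits per weight. It does not model the GPU holding part of the weights or MTP, and both function names are hypothetical.

```python
def peak_bandwidth_gb_s(mt_per_s: float, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (a 64-bit channel moves 8 bytes per transfer)."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9


def decode_ceiling_tps(active_params_billions: float,
                       bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every active weight is read from RAM once per token."""
    gb_read_per_token = active_params_billions * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    bw = peak_bandwidth_gb_s(2666)  # DDR4-2666, dual channel (assumption)
    print(f"peak bandwidth: {bw:.1f} GB/s")
    # Nominal bits per weight; the UD quants in the thread will differ.
    for name, bpw in [("Q4_K", 4.5), ("Q6_K", 6.5625), ("Q8_0", 8.5)]:
        print(f"{name}: RAM-only ceiling ~{decode_ceiling_tps(10, bpw, bw):.1f} t/s")
```

The reported 15 t/s sits above the RAM-only Q4 figure this produces, which would be consistent with the GPU and MTP doing real work. The sketch is only meant to show how the per-token read volume scales with the quant.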