Maybe it was just me, but in case others have done the same this post might help someone else too.
I have a workstation with plenty of CPU and system RAM, but I'm "GPU poor" in that I only have a 5060Ti with its 16GB of VRAM. Additionally, I need to use the GPU for regular system activities too which means I only have around ~14GB of VRAM available for the LLM.
I'm exclusively using this setup for development and system management tasks, and I've found Qwen 3.6 35B A3B to excel compared to other models. I don't have the VRAM to run the 27GB dense model, so I've spent time on getting the best usage out of the MoE.
Or so I thought. Since "everyone" says to use Unsloth UD-Q4_K_XL that's the quant I've been using, and I've gone a bit back'n'forth with MTP/no MTP, UB increase, mmproj since I've also started using a browser MCP etc.
Today I took another look at their quant chart and thought that since it's MoE maybe I could run Q5_K_S which would be a step up?
Well. Now I'm using Q6_K because it turns out I could run that with the exact same settings as I've optimized my Q4_K_XL setup for which means there are no drawbacks - just a better performing model. I've already noticed how it's able to get out of loops while before I had to interrupt it sometimes.
This is my setup. I get >1000 t/s prefill and >20 t/s inference. I'm not chasing faster inference since I actively read the thought process when working the LLM - but I've increased ub to get faster prefill since that's just waiting time otherwise.
./llama-server
-hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K \
-c 160000 \
-n 32768 \
-fa on \
-ub 2048 \
-ctk q8_0 \
-ctv q8_0 \
--no-mmap \
--mlock \
--no-warmup \
--chat-template-kwargs '{"preserve_thinking": true}' \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--host 0.0.0.0
I also use Opencode with the DCP and Superpowers plugins, which make a tremendous difference both to context handling as well as planning. I have no need for a larger context - I even compact early quite often since the tasks get done before reaching the limit.