troed

joined 3 years ago
 

I've just published an Opencode plugin I made since I couldn't find one that fulfilled the exact use case I had myself. Publishing and posting in case it's useful for someone else too.

When working with local models I have the need to know why there's suddenly nothing happening (besides the blue cylon-bar) but the plugins I found only showed token generation data when something was passed into Opencode. That meant that during >1 minute prefills there was no output at all.

This plugin uses the /slots endpoint (enabled by default) in llama-server to deduce whether it's currently generating tokens or doing prompt processing, and also the current tps for that activity. Now I can just run llama-server as a daemon and I no longer feel the need to go inspect its output just to see what's up.

It's likely only useful in a single-user scenario, but it has been tested with both single and multiple parallel slots.
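To illustrate the idea, here is a minimal sketch of how a slot's state could be classified from two consecutive polls of /slots. This is not the plugin's actual logic, just one possible heuristic. It assumes each slot object exposes an `is_processing` flag and a decoded-token counter (`next_token.n_decoded` in recent llama-server builds); both field names are assumptions and may differ between versions. The function names are hypothetical.

```shell
# classify_slot IS_PROCESSING PREV_DECODED CUR_DECODED
# Heuristic (assumption): a busy slot whose decoded-token counter is not
# moving is still doing prompt processing; once it moves, it is generating.
classify_slot() {
    is_processing=$1
    prev=$2
    cur=$3
    if [ "$is_processing" != "true" ]; then
        echo idle
    elif [ "$cur" -gt "$prev" ]; then
        echo generating
    else
        echo prefill
    fi
}

# tokens_per_second PREV_DECODED CUR_DECODED INTERVAL_SECONDS
# Rate of decoded tokens between two polls, printed with one decimal.
tokens_per_second() {
    awk -v p="$1" -v c="$2" -v dt="$3" 'BEGIN { printf "%.1f\n", (c - p) / dt }'
}
```

The inputs would come from polling the endpoint at a fixed interval, for example with `curl -s http://localhost:8080/slots` (assuming the default port), and extracting the two fields per slot.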

Installation:

opencode plugin @troed/oc-ls-stats@latest --global

[–] troed@fedia.io 4 points 1 week ago (1 children)

Yeah I'm more thinking about the four wheel drive Gen 4.

[–] troed@fedia.io 22 points 1 week ago (5 children)

They'll never be allowed to compete head to head. It's like Formula E, which will always be different enough from Formula 1 to never show what a pure matchup would look like.

 

Since I like having more than one local LLM to switch between when analysing tricky development issues, I decided to try out this new MoE model today. It's a 30B A3B, which means it's basically a drop-in replacement for Qwen 3.6 35B A3B, and the same llama.cpp parameters are suitable.

On their own published benchmark metrics it's supposed to be slightly worse than Qwen, but so far it's not something I've noticed. It's tuned to work well in Opencode which is how I'm running it as well.

Try it out, see how it works for you. I know that there are those who would rather use a Canadian than a Chinese model in today's political climate, and it does seem to perform better than Gemma 4, at least for me. Just don't forget to use the PR linked from unsloth's description until it has been merged into main.

[–] troed@fedia.io 2 points 2 weeks ago (1 children)

15t/s is workable IMHO. What are your system specs? I have 96GB DDR5 but never thought about going to an even larger MoE.

 

Maybe it was just me, but in case others have done the same this post might help someone else too.

I have a workstation with plenty of CPU and system RAM, but I'm "GPU poor" in that I only have a 5060Ti with its 16GB of VRAM. Additionally, I need to use the GPU for regular system activities too, which means I only have around 14GB of VRAM available for the LLM.

I'm exclusively using this setup for development and system management tasks, and I've found Qwen 3.6 35B A3B to excel compared to other models. I don't have the VRAM to run the 27B dense model, so I've spent time on getting the best usage out of the MoE.

Or so I thought. Since "everyone" says to use Unsloth UD-Q4_K_XL that's the quant I've been using, and I've gone a bit back'n'forth with MTP/no MTP, UB increase, mmproj since I've also started using a browser MCP etc.

Today I took another look at their quant chart and thought that since it's MoE maybe I could run Q5_K_S which would be a step up?

Well. Now I'm using Q6_K, because it turns out I could run that with the exact same settings I had optimized for my Q4_K_XL setup. That means there are no drawbacks - just a better performing model. I've already noticed that it's able to get out of loops, whereas before I sometimes had to interrupt it.

This is my setup. I get >1000 t/s prefill and >20 t/s inference. I'm not chasing faster inference since I actively read the thought process when working with the LLM - but I've increased ub to get faster prefill since that's just waiting time otherwise.

./llama-server \
    -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K \
    -c 160000 \
    -n 32768 \
    -fa on \
    -ub 2048 \
    -ctk q8_0 \
    -ctv q8_0 \
    --no-mmap \
    --mlock \
    --no-warmup \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.0 \
    --host 0.0.0.0
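To put the throughput figures above into perspective, here is a small back-of-the-envelope helper. The 60000-token prompt and 500-token reply are illustrative numbers only, not measurements from this setup, and the function name is hypothetical.

```shell
# seconds_for TOKENS TOKENS_PER_SECOND
# Rough wall-clock time in whole seconds for a given token count and rate.
seconds_for() {
    awk -v n="$1" -v r="$2" 'BEGIN { printf "%.0f\n", n / r }'
}

seconds_for 60000 1000   # prefill of a 60000-token prompt at 1000 t/s
seconds_for 500 20       # generating a 500-token reply at 20 t/s
```

At these rates a large prompt still means about a minute of silent waiting, which is why prefill speed matters more here than generation speed.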

I also use Opencode with the DCP and Superpowers plugins, which make a tremendous difference both to context handling and to planning. I have no need for a larger context - I even compact early quite often since the tasks get done before reaching the limit.

[–] troed@fedia.io 18 points 1 month ago (3 children)

It's similar to being an assembler coder when higher level languages with compilers came along. No need for management purging, you'll simply be competing for a smaller segment of assignments.

I don't know of a single developer who has actually used LLM aids and says there's no benefit to them. Those who refuse do so out of other convictions and don't really know the difference between an LLM aiding in tasks and full-on yolo vibe coding.

[–] troed@fedia.io -3 points 1 month ago

Which I've done, and not a single person has looked them up. The reason for that is that no one here is actually interested in the subject - they just cannot accept their feels about humans being special snowflakes not having any support in the science.

[–] troed@fedia.io -1 points 1 month ago

I've sourced two of the foremost specialists on the subject. Blackmore's "Consciousness: An Introduction" amounts to a full university semester on the subject. No, I don't really see it as my job to condense that down in a post here. Anyone who's actually interested can start by reading the summaries that are freely available online instead of posting bad takes at me.

[–] troed@fedia.io -2 points 1 month ago (1 children)

Oh I haven't seen a single person replying so far who has shown any interest in being "better informed".

[–] troed@fedia.io -2 points 1 month ago (3 children)

Yes, as I've described here: https://blog.troed.se/posts/the-delta-between-an-llm-and-consciousness/

I didn't say human brains function like LLMs

Today's LLMs are based on a Google research paper from 2017. Another paper that would solve this was published by Google in December last year: https://aipapersacademy.com/nested-learning-hope/

[–] troed@fedia.io -1 points 1 month ago (2 children)

Not a single person who has commented is interested in an actual discussion regarding the science on consciousness. It's all this: https://blog.troed.se/posts/the-coming-cognitive-disbelief/

[–] troed@fedia.io -1 points 1 month ago (4 children)

I don't care. See how easy it is? Either you're interested in the subject and you would already know that what I wrote is completely uncontroversial, or you spend time making ignorant posts because a simple fact disagrees with your feels.

[–] troed@fedia.io -1 points 1 month ago (10 children)

The difference between you and me is that I've studied the subject. You have not. It's not on me to teach you the contents of the literature.

Go be annoying somewhere else.

[–] troed@fedia.io -1 points 1 month ago (13 children)

I don't really care much for what you think - I already sourced two well known experts on the subject in another post in this thread.

 

74% of Ukrainians support fighting Russia even without U.S. assistance. A significant majority—59% of respondents—also believe that Ukraine can defeat Russia on the battlefield

only 6% of respondents said they were willing to make territorial concessions regarding areas occupied by Russia after the full-scale invasion in 2022

Additionally, 70% of respondents are against lowering the mobilization age,

Original article is paywalled, quotes from https://ukrainetoday.org/74-of-ukrainians-ready-to-resist-russia-without-u-s-aid-support-zelenskyys-actions/

 

We're consolidating our social media presence due to limited resources and no longer posting on Mastodon. Follow us on Reddit

Please tell us that you're not moving away from Lemmy/Mbin too. There's a gigantic tone-deafness in asking your supporters to use centralized social media at this specific time, and it's hard to accept that you don't realize it.

(quote from Proton's mastodon.social account info - there wasn't even a post made about it)

 

Swedish author and famous pro-Ukraine blogger Lars Wilderäng (Cornucopia) reports today that the Swedish security expert Karl Emil Nikka has revealed that Kagi is using the Kremlin propaganda tool Yandex as a backend for searches.

Wilderäng speculates this might mean search terms are leaking to Russia, while others worry about how the Kremlin can thus get its talking points into western search results.

Security expert Karl Emil Nikka tells us that the search engine Kagi, popular among tech geeks, uses Russian Yandex, which was introduced after the full-scale invasion. This, of course, gives Russia the opportunity to look at what is searched for via Kagi.

Link (in Swedish), see 11:22 update: https://cornucopia.se/2024/10/uppdateras-ryssland-medger-bruk-av-c-stridsmedel-mot-ukraina-rysk-pilot-som-mordade-68-ukrainare-ihjalslagen-med-hammare-bland-de-allra-storsta-ryska-forlusterna-under-kriget-igar/
