overview for troed

Self-hosted fully working ST:TNG voice computer interface in c/startrek@startrek.website

troed@fedia.io 3 points 2 days ago

Qwen3-TTS does excellent cloning and is very fast. On my workstation (5060Ti) it renders the audio at around 2.5x realtime. I've added an example with my parameters now.

Self-hosted fully working ST:TNG voice computer interface in c/startrek@startrek.website

troed@fedia.io 2 points 2 days ago

I've added example scripts for client, server and MCP configs. It should be much easier to replicate with those now.

Self-hosted fully working ST:TNG voice computer interface in c/startrek@startrek.website

troed@fedia.io 16 points 2 days ago (1 child)

Oh thanks all for the nice comments! My fork is at https://github.com/troed/speech-to-speech but beware it's not kept in any other state than what I've last pushed :) Also you'll have to supply your own chime wavs and voice sample etc.

105

Self-hosted fully working ST:TNG voice computer interface (video.troed.se)

submitted 3 days ago by troed@fedia.io to c/startrek@startrek.website

24 comments

This has been my dream since I was a teen - and now the tech is real. This is a fork of an existing speech-to-speech project by Hugging Face, with a lot of additions for the right chimes, the voice* and the overall wakeword architecture.

I've had it running for a day and ... I like it. I'll most likely wire up a version of this throughout the house so the whole family can live the trekkie life :)

*) Cloning Barrett's voice ... I know. It's just little old me showing that we now have the tech, not some commercial venture. It's done with the utmost respect for her craft throughout all those years!

Field note: measuring your context budget *before* a run, not after in c/localllama@sh.itjust.works

troed@fedia.io 1 point 4 days ago

When I start llama-server I point it to a models config that has a unique max context size per model. Each context is allocated at its max size as soon as the server starts, so if the server comes up I know I'll be able to use that full context size too.

I'm actually a bit unsure as to how you run it since you get OOMs during usage :)

I also use the DCP plugin for Opencode to help manage the context cache and have less of a disruption as it gets compressed, but I wouldn't need to for the above to work. When I hit the context limit the context would still get compressed.

[Early release] Bazinga - web service and CLI to simplify pairing, waking and Steam user account switching for Remote Play in c/bazzite@lemmy.world

troed@fedia.io 1 point 6 days ago

Less-than-beautiful UI added to the clients (still works as CLI if run with parameters). It was somewhat problematic to find an easily cross-compilable Go UI lib.

Also made the typical pipe-to-bash installation flow, awaiting proper ujust in some far away future.

12

[Early release] Bazinga - web service and CLI to simplify pairing, waking and Steam user account switching for Remote Play (git.sync.wtf)

submitted 6 days ago by troed@fedia.io to c/bazzite@lemmy.world

1 comment

We have a Bazzite box that different family members often stream from to their own laptops/computers. It's cumbersome having to walk to the room it's in to wake it from sleep or to log their user into Steam so that a Remote Play session can be started.

Created a little utility that runs a web service on the Bazzite box that can switch the logged in Steam user (guardrailed so that no switching can be done if someone is currently playing).

Additionally the binary (Windows, macOS and Linux) can run as a CLI to do the switching, as well as to send a Wake-on-LAN packet and perform Steam Link connect PIN pairing.
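
For readers unfamiliar with the wake step: a Wake-on-LAN "magic packet" is six 0xFF bytes followed by the target MAC address repeated 16 times, normally sent as a UDP broadcast. The Python sketch below only illustrates that packet format. It is not Bazinga's code, and the MAC address, broadcast address and port 9 are placeholder assumptions.

```python
import socket

def build_magic_packet(mac: str) -> bytes:
    """Return a Wake-on-LAN magic packet for a MAC like 'aa:bb:cc:dd:ee:ff'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    # Six 0xFF bytes, then the MAC repeated 16 times (102 bytes in total).
    return b"\xff" * 6 + mac_bytes * 16

def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the packet over UDP; address and port are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))

packet = build_magic_packet("00:11:22:33:44:55")  # placeholder MAC
print(len(packet))  # 102
```

The sleeping machine's network card must have Wake-on-LAN enabled for the packet to have any effect.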

Note: This is very early, only tested on our own system at the moment. It might also work on plain SteamOS with very few adaptations; that is completely untested though, since I don't have such a machine.

Source repo for building (no ready binaries until I know it works for others too): https://git.sync.wtf/troed/bazinga

llama rpc exists and works really well in c/localllama@sh.itjust.works

troed@fedia.io 3 points 1 week ago

Regular 1Gbit/s with three switches in total between them - distance about 40m (different buildings).

22

llama rpc exists and works really well (github.com)

submitted 1 week ago by troed@fedia.io to c/localllama@sh.itjust.works

5 comments

I've been going back'n'forth for a while on whether to invest in a second 5060Ti to be able to go dual GPU on my main workstation. The target has always been to be able to run Qwen 3.6 27B at a usable model quant, kv-cache quant and context size.

Today I tried out llama.cpp with the rpc option. I sort of knew at the back of my head that there were projects doing cross-computer inference but I didn't know it was built into llama.cpp.

Sure, this requires that you have another machine on your local network (latency matters) with a GPU - but in my case I actually have two. One server that has a 12GB A2000 and one Bazzite box with an AMD something also with 12GB VRAM.

I just compiled the latest llama.cpp HEAD with the rpc option enabled on my headless server and launched my main llama-server pointing to it. With the 12GB of the A2000 in the mix I'm now able to comfortably run the 27B model Q4_K_M, Q8_0 kv-cache and 140000 context size, and I've still not maxed out the available VRAM.

PP at ~400 (a bit low, will see if that can be tuned*) and TG at 25 (MTP and ngram enabled) makes this perfectly usable. That's the tokens/second during inference I would've expected from a dual GPU setup in the workstation anyway.
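
To put those two figures in perspective, here is a back-of-the-envelope estimate in Python. Only the ~400 t/s prefill and 25 t/s generation rates come from the post; the prompt and reply sizes are made-up examples, and prompt caching is ignored.

```python
def turn_seconds(prompt_tokens: int, reply_tokens: int,
                 pp_tps: float = 400.0, tg_tps: float = 25.0) -> float:
    """Estimate wall-clock time for one turn: prefill time plus generation time."""
    return prompt_tokens / pp_tps + reply_tokens / tg_tps

# Hypothetical turn: 20,000 uncached prompt tokens, 500 generated tokens.
estimate = turn_seconds(20_000, 500)
print(f"{estimate:.0f} s")  # 20000/400 + 500/25 = 50 + 20 = 70 s
```

With these assumed sizes the prefill dominates, which is why tuning PP is the obvious next step.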

So, if you also happen to have another machine on your network with spare VRAM - try out llama rpc. I have no idea why it took this long for me to do.

*) This is literally my first run with random defaults - tuning starts now

Telenor buys Bahnhof (Telenor köper Bahnhof) in c/sweden@lemmy.world

troed@fedia.io 9 points 1 week ago

Until now my position has been that whatever broadband options I have, I choose (and recommend to others) Bahnhof, because whether or not they're the cheapest I know that my money goes to something good for all of us.

With Telenor as owner I'll now instead go strictly by features vs cost. Either they deliver the best or they don't.

Mullvad on the Donation Controversy in c/privacy@lemmy.world

troed@fedia.io 5 points 1 week ago

I think this wouldn't have been made a big deal of if it hadn't been for the conspiracy theories about Proton just before.

Stop Using OpenCode in c/hackernews@lemmy.bestiver.se

troed@fedia.io 3 points 1 week ago

Yeah it was quite obvious from all the "clanker" that this is a typical run-of-the-mill AI-hater, but I found this funny anyway:

"The weight count is too low to reproduce the training set verbatim"

That's not what anyone wants. That's not how LLMs gain "intelligence", at all. The whole point of training on large datasets is to NOT internalize training data verbatim.

Ukrainian parliament appoints new Cabinet, delays decision on defense minister in c/ukraine@sopuli.xyz

troed@fedia.io 6 points 2 weeks ago

"Oleksandr Kravchenko, a partner at U.S. consulting firm McKinsey & Company, will be Ukraine's economy and ecology minister"

o_O

Qwen 27B Q4 at usable speed on 16GB VRAM in c/localllama@sh.itjust.works

troed@fedia.io 6 points 2 weeks ago

The trick in itself (the FFN tensors selectively being off GPU) shouldn't be.

Qwen 27B Q4 at usable speed on 16GB VRAM in c/localllama@sh.itjust.works

troed@fedia.io 10 points 2 weeks ago

Not sure I understand but I'm on Linux fwiw.

31

Qwen 27B Q4 at usable speed on 16GB VRAM (www.reddit.com)

submitted 2 weeks ago by troed@fedia.io to c/localllama@sh.itjust.works

5 comments

Not my settings! All credits to "Stainless-Bacon". I did however replicate the setup just now and I think it's worthy of spreading.

The trick is targeted offloading of only one kind of tensor (the FFN tensors of selected layers) to CPU, which makes it possible to run a higher quant that would otherwise be too slow to be usable.

Remember to compile llama.cpp with GGML_CUDA_FA_ALL_QUANTS=ON

Original settings below. Since I'm using the iGPU for the system I'm slightly tweaking them to use even more VRAM. On my 5060Ti I'm getting prefill 500-600tps (ub at 2048) and tg at ~10tps.

export GGML_CUDA_DISABLE_GRAPHS=1
llama-server \
  --model Qwen3.6-27B-Q4_K_M_MTP.gguf \
  --chat-template-file froggeric_fix.jinja \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --jinja \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --fit off \
  --n-gpu-layers 99 \
  --override-tensor 'blk\.(2[0-9]|3[0-9]|4[0-3])\.ffn_.*=CPU' \
  --ctx-size 96000 \
  --batch-size 512 \
  --ubatch-size 512 \
  --cache-type-k q5_0 \
  --cache-type-v q4_1 \
  --parallel 1 \
  --temp 0.60 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0 \
  --flash-attn on \
  --no-mmap \
  --host 0.0.0.0
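
To see which layers the --override-tensor pattern above actually sends to CPU, the regex can be checked in isolation. This is a small Python sketch, assuming the pattern is searched against tensor names of the form blk.<index>.<tensor>; the block count of 64 and the three FFN tensor names are illustrative assumptions, so check the GGUF metadata of the model you run.

```python
import re

# Pattern copied from the --override-tensor argument (the part before "=CPU").
PATTERN = re.compile(r"blk\.(2[0-9]|3[0-9]|4[0-3])\.ffn_.*")

# Assumptions for illustration: 64 blocks and these FFN tensor names.
N_LAYERS = 64
FFN_TENSORS = ("ffn_gate.weight", "ffn_up.weight", "ffn_down.weight")

def offloaded_layers(n_layers: int = N_LAYERS) -> list[int]:
    """Return block indices whose FFN tensors all match the pattern."""
    return [
        i for i in range(n_layers)
        if all(PATTERN.search(f"blk.{i}.{t}") for t in FFN_TENSORS)
    ]

layers = offloaded_layers()
print(layers[0], layers[-1], len(layers))  # 20 43 24

# Attention tensors in the same blocks do not match, so they stay on the GPU.
print(PATTERN.search("blk.25.attn_q.weight"))  # None
```

Widening or narrowing the index ranges in the pattern is the knob for trading VRAM use against speed.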

6

My llama-server suddenly started error 400 on the chat template - this fixed it (huggingface.co)

submitted 2 weeks ago by troed@fedia.io to c/localllama@sh.itjust.works

0 comments

I regularly recompile llama.cpp but after having done so today my existing setup stopped working. llama-server started throwing an error 400 on the chat template in the Ornith 1.0 35B model I'm mostly using.

Couldn't find anything (except a closed stale issue on llama.cpp that seemed slightly similar) so I followed the advice from another poster there and installed the linked "fixed" Qwen chat templates.

That worked, so, just posting this here in case it might help someone else too.

5

GPU bifurcation - options (fedia.io)

submitted 3 weeks ago by troed@fedia.io to c/localllama@sh.itjust.works

4 comments

I have an HP Z2 G9 where I've replaced the original A2000 GPU with a 5060Ti. I'm pondering getting another 5060Ti to be able to go up to dense models instead of MoE, but the PCIe options on this motherboard aren't the best. I will need to use a PCIe extender no matter what I opt for since there's physically no room for another GPU (the 5060Ti I have is 2.5 slots wide) and so I'm wondering whether I would even gain something from using an M.2 slot adapter instead.

Any insight welcome.

(1) PCI Express Gen5 slot x16 mechanical/ x16 electrical (full height, full length)
(1) PCI Express Gen3 slot x4 mechanical/ x1 electrical (full height, full length, open-ended)
(1) PCI Express Gen3 slot x16 mechanical/ x4 electrical (full height, full length)
(1) PCI Express Gen3 slot x4 mechanical/ x4 electrical (full height, full length, open-ended)
(1) M.2 2280 Storage (PCIe Gen4 x4)
(1) M.2 2280 Storage (PCIe Gen4 x4)
(1) M.2 2280 Storage (PCIe Gen4 x4)
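
As a rough way to compare the slots listed above, the theoretical link bandwidth can be computed from the PCIe generation and lane count. This Python sketch uses the standard per-lane rates (8, 16 and 32 GT/s for Gen3, Gen4 and Gen5, all with 128b/130b encoding); it assumes the card negotiates the slot's full generation and width, and real throughput will be lower because of protocol overhead.

```python
# Per-lane transfer rates in GT/s; Gen3 and later use 128b/130b line encoding.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}

def link_gb_per_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * (128 / 130) / 8 * lanes

options = {
    "Gen5 x16 (primary slot)": link_gb_per_s(5, 16),
    "Gen3 x4 electrical slot": link_gb_per_s(3, 4),
    "Gen3 x1 electrical slot": link_gb_per_s(3, 1),
    "M.2 adapter (Gen4 x4)": link_gb_per_s(4, 4),
}
for name, bw in options.items():
    print(f"{name}: {bw:.2f} GB/s")
```

On paper the Gen4 x4 M.2 link has twice the bandwidth of the Gen3 x4 slots, so an M.2 adapter is not an obviously worse choice for a second GPU.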

17

Opencode llama-server prefill/generation stats plugin (codeberg.org)

submitted 1 month ago by troed@fedia.io to c/localllama@sh.itjust.works

3 comments

I've just published an Opencode plugin I made since I couldn't find one that fulfilled the exact use case I had myself. Publishing and posting in case it's useful for someone else too.

When working with local models I have the need to know why there's suddenly nothing happening (besides the blue cylon-bar) but the plugins I found only showed token generation data when something was passed into Opencode. That meant that during >1 minute prefills there was no output at all.

This plugin uses the /slots endpoint (enabled by default) in llama-server to deduce whether it's currently generating tokens or doing prompt processing, and also the current tps for that activity. Now I can just run llama-server as a daemon and I no longer feel the need to go inspect its output just to see what's up.

It's likely only useful in a single-user scenario, but it has been tested with both single and multiple parallel slots.
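
The idea of deducing activity from /slots can be sketched in a few lines of Python. This is not the plugin's code. The field names is_processing and next_token.n_decoded are my assumption about the JSON shape returned by llama-server and may differ between versions, the default URL is a guess, and the sample data is hand-written.

```python
import json
import urllib.request

def fetch_slots(base_url: str = "http://127.0.0.1:8080") -> list[dict]:
    """GET /slots from a running llama-server (URL is an assumed default)."""
    with urllib.request.urlopen(f"{base_url}/slots", timeout=2) as resp:
        return json.load(resp)

def classify(slot: dict) -> str:
    """Guess a slot's activity from assumed fields of the /slots response."""
    if not slot.get("is_processing", False):
        return "idle"
    decoded = slot.get("next_token", {}).get("n_decoded", 0)
    # Busy but nothing decoded yet: treat as prompt processing (prefill).
    return "generating" if decoded > 0 else "prefill"

def tokens_per_second(prev_decoded: int, cur_decoded: int, interval_s: float) -> float:
    """Generation rate from the change in decoded tokens between two polls."""
    return max(cur_decoded - prev_decoded, 0) / interval_s

# Hand-written sample data, not real server output.
samples = [
    {"id": 0, "is_processing": False},
    {"id": 1, "is_processing": True, "next_token": {"n_decoded": 0}},
    {"id": 2, "is_processing": True, "next_token": {"n_decoded": 120}},
]
print([classify(s) for s in samples])  # ['idle', 'prefill', 'generating']
```

Polling fetch_slots() on a fixed interval and feeding consecutive n_decoded values into tokens_per_second() would give a live rate under these assumptions.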

Installation:

opencode plugin @troed/oc-ls-stats@latest --global

10

North Mini Code v1.0 - a Qwen 3.6 35B MoE alternative (huggingface.co)

submitted 1 month ago by troed@fedia.io to c/localllama@sh.itjust.works

6 comments

Since I like having more than one local LLM to switch between when analysing tricky development issues I decided to try out this new MoE model today. It's a 30B A3B, which means it's basically a drop-in replacement for Qwen 3.6 35B A3B, and the same llama.cpp parameters are suitable.

On their own published benchmark metrics it's supposed to be slightly worse than Qwen, but so far it's not something I've noticed. It's tuned to work well in Opencode which is how I'm running it as well.

Try it out, see how it works for you. I know that there are those who would rather use a Canadian than Chinese model in today's political climate and it does seem to perform better than Gemma 4 at least for me. Just don't forget to use the PR linked from unsloth's description until it has been merged into main.

28

Don't skimp on the quant when using MoE (unsloth.ai)

submitted 2 months ago by troed@fedia.io to c/localllama@sh.itjust.works

3 comments

Maybe it was just me, but in case others have done the same this post might help someone else too.

I have a workstation with plenty of CPU and system RAM, but I'm "GPU poor" in that I only have a 5060Ti with its 16GB of VRAM. Additionally, I need to use the GPU for regular system activities too, which means I only have around 14GB of VRAM available for the LLM.

I'm exclusively using this setup for development and system management tasks, and I've found Qwen 3.6 35B A3B to excel compared to other models. I don't have the VRAM to run the 27B dense model, so I've spent time on getting the best usage out of the MoE.

Or so I thought. Since "everyone" says to use Unsloth UD-Q4_K_XL that's the quant I've been using, and I've gone a bit back'n'forth with MTP/no MTP, UB increase, mmproj since I've also started using a browser MCP etc.

Today I took another look at their quant chart and thought that since it's MoE maybe I could run Q5_K_S which would be a step up?

Well. Now I'm using Q6_K because it turns out I could run that with the exact same settings as I've optimized my Q4_K_XL setup for which means there are no drawbacks - just a better performing model. I've already noticed how it's able to get out of loops while before I had to interrupt it sometimes.

This is my setup. I get >1000 t/s prefill and >20 t/s inference. I'm not chasing faster inference since I actively read the thought process when working with the LLM - but I've increased ub to get faster prefill since that's just waiting time otherwise.

./llama-server \
    -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K \
    -c 160000 \
    -n 32768 \
    -fa on \
    -ub 2048 \
    -ctk q8_0 \
    -ctv q8_0 \
    --no-mmap \
    --mlock \
    --no-warmup \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.0 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.0 \
    --host 0.0.0.0
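
How much VRAM a given -c / -ctk / -ctv combination costs can be estimated with simple arithmetic. The Python sketch below uses the ggml block sizes (for example q8_0 stores 32 values in 34 bytes); the layer count, KV head count and head dimension are HYPOTHETICAL placeholders rather than the real values for this model, so read the actual numbers from the GGUF metadata. Only the 160000 context and q8_0 cache types come from the command above.

```python
# Bytes per 32-element block for some ggml cache types (values plus scale/offset data).
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q5_0": 22, "q4_1": 20}

def kv_cache_bytes(ctx: int, n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   type_k: str = "q8_0", type_v: str = "q8_0") -> int:
    """Estimate the combined K and V cache size for a given context length."""
    elems = ctx * n_attn_layers * n_kv_heads * head_dim  # elements in K (same for V)
    if elems % 32 != 0:
        raise ValueError("element count must be a multiple of the block size (32)")
    return elems // 32 * (BLOCK_BYTES[type_k] + BLOCK_BYTES[type_v])

# HYPOTHETICAL architecture numbers, for illustration only.
size = kv_cache_bytes(ctx=160_000, n_attn_layers=10, n_kv_heads=2, head_dim=256)
print(f"{size / 2**30:.2f} GiB")
```

Under the same assumptions, switching both caches from f16 to q8_0 cuts the estimate to 68/128 of its size, which is the kind of headroom that makes a 160000 context fit.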

I also use Opencode with the DCP and Superpowers plugins, which make a tremendous difference both to context handling and to planning. I have no need for a larger context - I even compact early quite often since the tasks get done before reaching the limit.

196

The majority of Ukrainians are ready to fight Russia without U.S. support (www.economist.com)

submitted 1 year ago by troed@fedia.io to c/globalnews@lemmy.zip

21 comments

74% of Ukrainians support fighting Russia even without U.S. assistance. A significant majority—59% of respondents—also believe that Ukraine can defeat Russia on the battlefield

only 6% of respondents said they were willing to make territorial concessions regarding areas occupied by Russia after the full-scale invasion in 2022

Additionally, 70% of respondents are against lowering the mobilization age,

Original article is paywalled, quotes from https://ukrainetoday.org/74-of-ukrainians-ready-to-resist-russia-without-u-s-aid-support-zelenskyys-actions/

98

Are you still here Proton? (fedia.io)

submitted 1 year ago by troed@fedia.io to c/protonprivacy@lemmy.world

26 comments

We're consolidating our social media presence due to limited resources and no longer posting on Mastodon. Follow us on Reddit

Please tell us that you're not moving away from Lemmy/Mbin too. There's a gigantic tone-deafness in asking your supporters to use centralized social media at this specific time, and it's hard to accept that you don't realize it.

(quote from Proton's mastodon.social account info - there wasn't even a post made about it)

69

Kagi search engine working with Russia (fedia.io)

submitted 2 years ago by troed@fedia.io to c/privacy@lemmy.ml

42 comments

Swedish author and famous pro-Ukraine blogger Lars Wilderäng (Cornucopia) reports today that the Swedish security expert Karl Emil Nikka has revealed that Kagi is using the Kremlin propaganda tool Yandex as a backend for searches.

Wilderäng speculates this might mean search terms are leaking to Russia, while others worry that the Kremlin can thereby get its talking points into western search results.

Security expert Karl Emil Nikka tells us that the search engine Kagi, popular among tech geeks, uses Russian Yandex, which was introduced after the full-scale invasion. This, of course, gives Russia the opportunity to look at what is searched for via Kagi.

Link (in Swedish), see 11:22 update: https://cornucopia.se/2024/10/uppdateras-ryssland-medger-bruk-av-c-stridsmedel-mot-ukraina-rysk-pilot-som-mordade-68-ukrainare-ihjalslagen-med-hammare-bland-de-allra-storsta-ryska-forlusterna-under-kriget-igar/