this post was submitted on 03 Jul 2026
468 points (99.2% liked)
Technology
I was under the impression that you load the model into VRAM each time and unload it when you're finished using it. What I meant is that this is less power efficient than just keeping it in VRAM.
Thing is, the input/reading part is cheap; it's wastefully generating extra output tokens that costs you more in energy (or money, if you're using an external service). Put it this way: Claude has historically had 3 models: Haiku (small), Sonnet (medium), Opus (big). Sonnet 5 came out recently, and people using Claude Code have reported that it's so verbose that it's now more expensive to use for the same task than Opus, which costs much more per Mtok. That would mean it probably also uses more energy than the bigger model.
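To put rough numbers on the verbosity point, here's a minimal sketch. Every price and token count below is made up for illustration (none of them are real Sonnet or Opus figures); it only shows how a cheaper-per-Mtok model can end up costing more on the same task once it generates enough extra output.

```python
def task_cost_usd(input_tokens, output_tokens, in_price_per_mtok, out_price_per_mtok):
    """Cost of one task in USD, given per-million-token prices."""
    return (input_tokens * in_price_per_mtok
            + output_tokens * out_price_per_mtok) / 1_000_000

# Hypothetical numbers, not real pricing or measured usage:
# the "medium" model is cheaper per Mtok but writes far more output.
medium_cost = task_cost_usd(
    input_tokens=50_000, output_tokens=40_000,
    in_price_per_mtok=3.0, out_price_per_mtok=15.0,
)
big_cost = task_cost_usd(
    input_tokens=50_000, output_tokens=12_000,
    in_price_per_mtok=5.0, out_price_per_mtok=25.0,
)

print(f"medium model: ${medium_cost:.2f}, big model: ${big_cost:.2f}")
```

With these assumed values the output side dominates the bill, which is why verbosity matters more than the sticker price per Mtok.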
At that point, why bother with a local model? You could use DeepSeek V4 Flash and probably spend less than a tenner a month on it. It's surprisingly capable (I mean, sometimes you can barely tell it's not a frontier model) and costs next to nothing to use.
That's sort of what my workflow does when I use OpenCode. A bigger model (GLM-5.2 or GPT-5.5, depending on which one hasn't run into its usage limit) reads my prompt, the .md files describing the repo, and the overall file structure of the repo, then fires off parallel DeepSeek V4 Flash scouts on usage credits to read and summarize files as needed. The big model then does the planning, and DeepSeek V4 Flash is again the one that executes the plan via subagents. The subagents running DeepSeek usually come back at 1-2 cents in cost.
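For anyone trying to picture that flow, here's a rough sketch of the shape of it. This is not OpenCode's actual API or config; all three functions are hypothetical stand-ins that fake the model calls with plain string handling, just to show the scout, plan, execute order and where the parallelism sits.

```python
from concurrent.futures import ThreadPoolExecutor

def scout_summarize(path, text):
    """Stand-in for a cheap-model scout: 'summarizes' a file as its first line."""
    lines = text.splitlines()
    return path, (lines[0] if lines else "")

def plan_with_big_model(prompt, summaries):
    """Stand-in for the big model: emits one plan step per summarized file."""
    return [
        f"step {i + 1}: apply '{prompt}' to {path}"
        for i, (path, _summary) in enumerate(summaries)
    ]

def execute_step(step):
    """Stand-in for a cheap-model subagent carrying out one plan step."""
    return f"done: {step}"

def run_workflow(prompt, files):
    with ThreadPoolExecutor() as pool:
        # Scouts read and summarize files in parallel.
        summaries = list(pool.map(lambda item: scout_summarize(*item), files.items()))
        # The big model only sees the prompt plus the short summaries.
        steps = plan_with_big_model(prompt, summaries)
        # Subagents execute the plan steps in parallel.
        return list(pool.map(execute_step, steps))

results = run_workflow(
    "rename foo to bar",
    {"a.py": "import os\n", "b.py": "print(1)\n"},
)
print(results)
```

The point of the split is that the expensive model only handles the small, high-value parts (understanding the prompt and planning), while the bulk reading and writing goes to the cheap one.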
I did try a Qwen-3.6 distillation locally, and it was pretty capable in terms of output, but it's more expensive for me than DeepSeek Flash at API usage costs, since electricity isn't free here and my GPU is 2 generations old. And it's slow as hell, since it has to offload a lot to the CPU and system RAM instead of running on the GPU.
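If you want to run the same comparison for your own setup, a back-of-the-envelope sketch looks like this. The wattage, speed, electricity price, and API price below are all placeholder assumptions, not measurements from my machine or anyone's real pricing; plug in your own.

```python
def local_cost_per_mtok(power_watts, tokens_per_second, price_per_kwh):
    """Electricity cost of generating one million tokens locally."""
    seconds = 1_000_000 / tokens_per_second
    kwh = (power_watts / 1000) * (seconds / 3600)
    return kwh * price_per_kwh

# Placeholder assumptions: an older GPU partly offloading to CPU is slow,
# so it draws power for a long time per token generated.
local = local_cost_per_mtok(power_watts=350, tokens_per_second=10, price_per_kwh=0.35)
api = 0.30  # hypothetical API price per million output tokens

print(f"local: {local:.2f} per Mtok, API: {api:.2f} per Mtok")
```

Slow generation is what kills it: halving tokens per second doubles the energy per token, so heavy CPU offload hits both your patience and your power bill.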
The big models I only use as subscriptions, which I'm prepared to end at any moment if they reduce the usage I get. Let the AI companies eat the cost; I'll never pay them API pricing if they want 20 or 30 dollars for a million output tokens.