this post was submitted on 03 Jul 2026

[–] boonhet@sopuli.xyz 1 points 9 hours ago

> I am absolutely not burning more energy than a frontier model by doing things like putting my laptop to sleep or shutting down unused services when I want to conserve battery power.

I was under the impression that you keep loading the model into VRAM and unloading it when you're finished using it. I meant that this is less power efficient than just keeping it in VRAM.

> That’s actually not true. In fact it’s much the opposite. Frontier models churn through tokens at a much higher rate, because of their higher complexity and higher number of parameters.

Thing is, the input/reading part is cheap, and wastefully generating extra output tokens costs you more in energy (or money, if you're using an external service). Put it this way: Claude has historically had 3 models: Haiku (small), Sonnet (medium), and Opus (big). Sonnet 5 came out recently, and people using Claude Code have reported that it's so verbose that it's now more expensive to use for the same task than Opus, which has a much higher price per Mtok (million tokens). That would mean it probably also uses more energy than the bigger model.
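To put rough numbers on that, here's a quick sketch of the arithmetic. Every price and token count below is made up for illustration and is not real pricing for any model. The point is only that a lower per-Mtok price can lose to verbosity.

```python
def task_cost_usd(input_tokens, output_tokens, price_in_per_mtok, price_out_per_mtok):
    """Cost of one task in USD, given prices per million tokens (Mtok)."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000

# Hypothetical "medium" model: cheaper per Mtok, but very verbose.
medium_cost = task_cost_usd(
    input_tokens=50_000, output_tokens=40_000,
    price_in_per_mtok=3.0, price_out_per_mtok=15.0,
)

# Hypothetical "big" model: pricier per Mtok, but terse on the same task.
big_cost = task_cost_usd(
    input_tokens=50_000, output_tokens=12_000,
    price_in_per_mtok=5.0, price_out_per_mtok=25.0,
)

print(f"medium: ${medium_cost:.2f}, big: ${big_cost:.2f}")
```

With these assumed numbers the output tokens dominate the bill, so the "cheaper" model ends up costing more for the same task.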

> …Almost never? I’m not a fan of letting AI do much of ANY of my coding, because it will inevitably bloat my codebase with garbage regardless of which model I use. So I severely restrict my model usage to simple, clearly-defined, narrow-scoped tasks that can save me a bit of time, and that’s it. With guardrails and discipline like that, I barely ever have the need to re-prompt.

At that point, why bother with a local model? You could use DeepSeek V4 Flash and probably spend less than a tenner a month on it. It's surprisingly capable (I mean, sometimes you can barely tell it's not a frontier model) and costs next to nothing to use.

> If you must use a frontier model for something, have it do that work after receiving the output from an agent using a small model to read and summarize your code.

It's sort of what my workflow does when I use OpenCode. Bigger model (GLM-5.2 or GPT-5.5 depending on which one hasn't run into its usage limit) reads my prompt, the .md files describing the repo and the overall file structure of the repo, then fires off parallel DeepSeek V4 Flash scouts on usage credits to read and summarize the files as needed. The big model then does the planning and again DeepSeek V4 Flash is the one to execute it via subagents. The subagents running DeepSeek usually come back with 1-2 cents in cost.
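Roughly, the shape of that workflow looks like the sketch below. This is not OpenCode's actual API: `cheap_model`, `big_model`, and `run_task` are hypothetical stand-ins that just return strings, so the example runs without any network calls.

```python
from concurrent.futures import ThreadPoolExecutor

def cheap_model(prompt: str) -> str:
    """Hypothetical stub for a small, cheap model (scout / executor)."""
    return f"[cheap] {prompt[:40]}"

def big_model(prompt: str) -> str:
    """Hypothetical stub for the big planning model."""
    return f"[big] plan based on {len(prompt)} chars of context"

def run_task(user_prompt: str, repo_docs: str, files: dict) -> dict:
    # 1. The big model only sees the prompt, the repo docs and the file listing.
    listing = "\n".join(sorted(files))
    # Assumption: in a real setup the planner picks which files to scout;
    # here we simply scout every file.
    wanted = sorted(files)

    # 2. Cheap scouts read and summarize the files in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        summaries = list(pool.map(
            lambda path: cheap_model(f"Summarize {path}:\n{files[path]}"),
            wanted,
        ))

    # 3. The big model plans from the summaries, not from the raw files.
    plan = big_model("\n".join([user_prompt, repo_docs, listing, *summaries]))

    # 4. A cheap subagent executes the plan.
    result = cheap_model(f"Execute: {plan}")
    return {"summaries": summaries, "plan": plan, "result": result}

out = run_task("add logging", "README contents", {"a.py": "print(1)", "b.py": "print(2)"})
print(out["plan"])
```

The design choice is that the expensive model never reads raw file contents, only short summaries, so most of the token volume lands on the cheap model.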

I did try a Qwen-3.6 distillation locally and it was pretty capable in terms of output, but it's more expensive for me than DeepSeek V4 Flash at API prices, since electricity isn't free here and my GPU is 2 generations old. And it's slow as hell, since it has to offload a lot to CPU/RAM instead of keeping everything on GPU/VRAM.
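For anyone who wants to check this for their own setup, here's a back-of-the-envelope sketch. The wattage, tokens per second, electricity price, and API price are all placeholder assumptions, not my measurements or anyone's real pricing. Plug in your own.

```python
def local_cost_per_mtok(power_watts, tokens_per_second, price_per_kwh):
    """Electricity cost of generating one million tokens locally."""
    hours = 1_000_000 / tokens_per_second / 3600  # time to generate 1 Mtok
    kwh = power_watts / 1000 * hours              # energy used in that time
    return kwh * price_per_kwh

# Hypothetical local setup: slow because of CPU/RAM offloading.
local = local_cost_per_mtok(power_watts=300, tokens_per_second=10, price_per_kwh=0.25)

# Hypothetical API output price per Mtok for a cheap hosted model.
api = 0.30

print(f"local: {local:.2f} per Mtok, API: {api:.2f} per Mtok")
```

Under these assumed numbers, slow generation is what kills the local option: the machine draws power for about 28 hours to produce a million tokens.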

I only use the big models through subscriptions, which I'm prepared to cancel at any moment if they reduce the usage I get. Let the AI companies eat the cost. I'll never pay them API pricing if they want 20 or 30 dollars for a million output tokens.