this post was submitted on 25 Apr 2026
0 points (NaN% liked)

Selfhosted

60253 readers
483 users here now

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.

Rules:

Detailed Rules Post

  1. Be civil.

  2. No spam.

  3. Posts are to be related to self-hosting.

  4. Don't duplicate the full text of your blog or readme if you're providing a link.

  5. Submission headline should match the article title.

  6. No trolling.

  7. Promotion posts require active participation, with an account that is at least 30 days old. F/LOSS without a paywall has exceptions, with requirements. See the rules link for details.

Resources:

Any issues on the community? Report it using the report flag.

Questions? DM the mods!

founded 3 years ago
MODERATORS
 

I'm looking to build a low-end ollama LLM server to improve home assistant voice control, Immich image recognition and a few other services. With the current cost of hardware components like memory, I'm looking to build something small, but somewhat expandable.

I have an old micro-atx form factor computer that I'm thinking will be a good option to upgrade. I'd love recommendations on motherboards, processors, and video card combos that would likely be compatible and sufficient to run a decent server while keeping costs lower, basically, the best bang for the buck. I have a couple of M.2 SSDs I can re-purpose. Would prefer the motherboard has 2.5Gbit Ethernet, but otherwise I'm open.

Also recommendations on sites to purchase good quality memory at reasonable prices that ship to the US. I'd be willing to look at lightly used components, too.

Any advice on any of these topics would be greatly appreciated. The advice I've found has all been out of date especially with crypto fading so video cards are not as expensive, but LLM data centers eating up and reserving memory before it's even manufactured.

top 26 comments
sorted by: hot top controversial new old
[–] vegetaaaaaaa@lemmy.world 1 points 2 months ago* (last edited 2 months ago) (1 children)

I suggest using llama.cpp instead of ollama, you can easily squeeze +10% in inference speed and other memory optimizations from llama.cpp. With hardware prices nowadays I think every % saved on resources matters. Here is a simple ansible role to setup llama.cpp, it should give you a good idea of how to deploy it.

A dedicated inference rig is not gonna be cheap. What I did, since I need a gaming rig; is getting 32GB DDR5 (this was before the current RAMpocalypse, if I had known I would have bought 64) and an AMD 9070 (16GB VRAM - again if I had known how crazy prices would get I'd probably ahve bought a 24GB VRAM card). The home server runs the usual/non-AI stuff, and llamacpp runs on the gaming desktop (the home server just has a proxy to it). Yeah the gaming desktop has to be powered up when I want to run inference, this is my main desktop so it's powered on most of the time, no big deal

[–] adeoxymus@lemmy.world 2 points 2 months ago (1 children)
[–] Freeposity@lemmy.world 1 points 2 months ago

Thanks for this. I'm definitely dropping ollama now. No wonder GGUF models always gave me issues.

I might even dump open webui for llama.cpp's webui

[–] clifmo@programming.dev 1 points 2 months ago (1 children)

Until you tell us what your budget is, I'm not sure there's much to discuss. You're talking about motherboards. So I guess your choice right now comes down to Strix halo or not?

[–] irotsoma@piefed.blahaj.zone 1 points 2 months ago (1 children)

Definitely not needing something that high-end. It's just me and maybe one other person using it periodically for voice commands that needs to be realtime. The rest is background processed stuff like Immich image recognition and Jellyfin audio/video processing. Nothing fancy is needed. I mention motherboard because the system I'm thinking of using is currently running Plex which I'm in the process of replacing with Jellyfin on my Kubernetes cluster of minipcs and Raspberry pis that runs most of my stuff pretty well, but could benefit from dedicated LLM/ML. So, that machine will be freed up, but it's nearly a decade old and not up to the task as it is.

As for specific budget, I don't have specifics in mind. My Kubernetes cluster is super energy efficient since it's all small systems that only spin up when needed. So thinking about overall cost of ownership vs benefit. Having something too high end would just waste energy as well as the initial investment.

[–] radieschen@slrpnk.net 1 points 2 months ago

You could consider something from the Radxa Rock 5 series.

[–] nitrolife@hikki.team 1 points 2 months ago* (last edited 2 months ago) (2 children)

not a very popular opinion, but if you want an inexpensive, really inexpensive variant, take the AMD MX9070XT. AMD is not the most popular AI cards, but they are not bad with ROCm and for the price of 5090 you can put 5 cards (80 GB vram)

[–] a_fancy_kiwi@lemmy.world 1 points 2 months ago* (last edited 2 months ago)

I agree. I’ve got a 9060XT 16GB card running some version of gpt-oss:20b. I understand how to program more or less but I do it so infrequently that I forget the syntax of whatever language I’m working in. It’s ability to spit out boiler plate code that I can edit for my needs has been a huge time saver and I’m extremely happy with my setup.

[–] WbrJr@lemmy.ml 0 points 2 months ago (1 children)

Not all programs allow usage of multiple gpus as far as I know, some are not capable of splitting the llm in multiple vrams or something

[–] nitrolife@hikki.team 1 points 2 months ago* (last edited 2 months ago)

Yes, it is. But I have llama-swap, openweb-ui. If you spend some time on the llama-swap configuration, then the you have a good chance to run the model on 2 cards is through llama.cpp. The winnings, however, will not be x2 of course and will fall non-linearly from the number of cards. And you need motherboard with good PCI-E lines (2 pci-e x16 or more). But it's still cheaper than one large card. Example:

HIP_VISIBLE_DEVICES=0,1 \
/opt/llama.cpp/build/bin/llama-server \
  --host 127.0.0.1 \
  --port 8082 \
  --model /storage/models/model.gguf \
  --n-gpu-layers all \
  --split-mode layer \
  --tensor-split 1,1 \
  --ctx-size 32768 \
  --batch-size 512 \
  --ubatch-size 512 \
  --flash-attn on \
  --parallel 1

There is a less stable but more productive one: --split-mode row

P.S. By the way, one RX9070XT on my instance translates posts and comments. You can test it if you want. =)

[–] chrash0@lemmy.world 0 points 2 months ago (2 children)

honestly it’s hard to beat Macs these days in this space for two reasons:

  • unified memory means that you don’t have to load up on RAM just to load the model and then also shell out for a video card with barely enough VRAM to fit a basic language model
  • their supply chain is solid and has mostly avoided the constraints that other OEMs and parts manufacturers are struggling with

pricing is tough. sure, crypto is on its way out, but GPUs are still the platform of choice for most neural net workloads (outside of SoCs like Apple M-series). i built a PC in late 2024, and it’s easily worth twice what i paid for it.

[–] irotsoma@piefed.blahaj.zone 1 points 2 months ago (3 children)

Yeah,but I dont want to get locked into a proprietary OS or have to put a lot of effort into hacking it to run Linux.

[–] p4rzivalrp2@piefed.social 1 points 1 month ago (1 children)

The framework desktop has unified memory iirc, and that can obviously use any os

[–] irotsoma@piefed.blahaj.zone 1 points 1 month ago (1 children)

I didn't realize they were making desktops. I almost bought a laptop from them a few years ago but ended up finding an ASUS laptop that worked well with Linux and was significantly cheaper which fit my needs better for that. I'll check them out.

[–] p4rzivalrp2@piefed.social 1 points 1 month ago (1 children)

Yeah they have soldered ram and cpu but have stays halo so ig that's fine

[–] irotsoma@piefed.blahaj.zone 0 points 1 month ago (1 children)

So just glancing at the site, are these basically laptop CPU and RAM parts just packaged in a desktop form-factor case and that's why they're soldered? Seems like they also don't have much expansion capability much like a laptop such as only having a single PCI-E x4 slot with a proprietary connection interface, so I couldn't later add a graphics card for example. Unless, I'm just missing something, and if so please let me know.

Either way thanks for letting me know about the option.

[–] p4rzivalrp2@piefed.social 1 points 1 month ago (1 children)

The main benefit is the strix halo cpu uses unified memory, thats why it's soldered, not bc it uses laptop parts

[–] irotsoma@piefed.blahaj.zone 0 points 1 month ago (1 children)

Ok, so short, wide bus from CPU to memory? Makes sense. I didn't really mean the CPU so much as the main board is very laptop like. Very little expansion capabilities other than external connectors like audio, Ethernet, etc., but no ability to add functional or incremental upgrades like a GPU or an additional stick of memory respectively.

[–] p4rzivalrp2@piefed.social 1 points 1 month ago

The point of the stays halo series is the unified memory, so an additional GPU wouldn't be very useful, no?

[–] WASTECH@lemmy.world 1 points 2 months ago (1 children)

I haven’t looked into Asahi Linux in a while now, but I figured the experience would be pretty good by now. You don’t need to “hack” anything to get it to run. Last I read, there were just a few driver issues, but I haven’t looked into it in probably 2-3 years now.

[–] SpatchyIsOnline@lemmy.world 1 points 2 months ago

Last time I checked it only runs well on M1 devices, with M2 being somewhat usable. M3 though M5 are a complete no-go unfortunately :(

[–] ryokimball@infosec.pub 0 points 2 months ago

The apple silicon is more energy efficient but the latest Intel and AMD CPUs deliver more processing power and can also share a significant amount of RAM to the GPU / AI components.

[–] Scipitie@lemmy.dbzer0.com 1 points 2 months ago (1 children)

Depends what you want to do... For example I didn't get python whisper in a container to run on Mac in any way that can be called "performance" and I don't want my dev workflow to optimize for an OS I despise :D

[–] chrash0@lemmy.world 0 points 2 months ago (1 children)

in a container

well there’s your issue. i get not liking the OS, but actively crippling your project will cripple your project.

containers on macOS do kinda suck

[–] Scipitie@lemmy.dbzer0.com 0 points 2 months ago (1 children)

That's sich a Mac answer it's unbelievable.

Describing "A project aimed to be agnostic of it's environment" as a design mistake and not a inherent flaw of the OS is... Just wow.

Remember in this thread it's about the pro and con of Macos as interference hardware. This is a major flaw which comes baked into the hardware. I tested it and find it an unacceptable limitation. It's important for others to know.

To state "containerization is the issue" though... Just wow.

[–] JadedBlueEyes@programming.dev 0 points 2 months ago

Unfortunately containerisation on macos usually means running virtualized Linux, which of course is going to add overhead and cut off access to apple APIs and some hardware. So yep. There's plenty that runs natively.