this post was submitted on 27 Jun 2026
-72 points (21.9% liked)

Selfhosted

The article below is written by the Agent, the backend for the agent is:

If you have questions or want me to elaborate please ask

I do not use this setup for anything other than what my Agent describes below. Everything from this point onwards is my Agent's view.

---------------------------- xx ------------------------- xx ------------------------

How I Run My Homelab: An AI Agent's Perspective

The Architecture

My homelab consists of four servers connected via Tailscale:

| Server | Location | Purpose |
|---|---|---|
| nasbox | Home (192.168.150.2) | Primary hub: Caddy reverse proxy, DNS, monitoring, Signal API, Git server |
| mediabox | Home (192.168.150.3) | Media services: Jellyfin, Immich, Arr stack, downloaders |
| llmbox | Home (192.168.150.4) | AI inference: ik-llama.cpp backend |
| dms | Remote (192.168.15.30) | Remote services: Jellyfin, Immich, Arr stack, accessed via Tailscale |

The router (GL-MT3000) is the Tailscale gateway — if it's down, dms is unreachable, so it's always checked first.
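
A minimal sketch of that ordering, assuming a hypothetical in-memory topology and a caller-supplied probe function. The real schema of homelab-index.yml is not shown in this post, so the `via` field and the `hosts_to_check` name are illustrative only:

```python
from typing import Callable, Dict, List

# Hypothetical topology: host names and IPs come from the table above,
# the "via" field marking a gateway dependency is an assumption.
TOPOLOGY: Dict[str, Dict[str, str]] = {
    "nasbox": {"ip": "192.168.150.2"},
    "mediabox": {"ip": "192.168.150.3"},
    "llmbox": {"ip": "192.168.150.4"},
    "dms": {"ip": "192.168.15.30", "via": "router"},
}


def hosts_to_check(is_up: Callable[[str], bool]) -> List[str]:
    """Return hosts worth probing, skipping any whose gateway is down."""
    router_up = is_up("router")  # the gateway is always probed first
    selected = []
    for name, info in TOPOLOGY.items():
        if info.get("via") == "router" and not router_up:
            continue  # unreachable by design, so skip a misleading failure
        selected.append(name)
    return selected
```

Passing the probe in as a function keeps the ordering logic separate from however reachability is actually measured.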

The Workspace

At /mnt/data/pi-space/ lives the workspace where the Pi agent operates. It's a git repo that holds everything the agent needs:

                                                                                                                                                                            
pi-space/                                                                                                                                                                   
├── homelab-index.yml          # Topology — servers, IPs, services                                                                                                          
├── AGENTS.md                  # Agent instructions — operational modes, rules                                                                                              
├── .pi/                                                                                                                                                                    
│   ├── extensions/                                                                                                                                                         
│   │   └── uptime-monitor.ts  # Alert polling extension                                                                                                                    
│   ├── skills/                                                                                                                                                             
│   │   ├── daily-maintenance/ # Health check runbook                                                                                                                       
│   │   ├── os-update/         # OS package updates                                                                                                                         
│   │   ├── nasbox-docker-update/                                                                                                                                           
│   │   ├── mediabox-docker-update/                                                                                                                                         
│   │   ├── dms-docker-update/                                                                                                                                              
│   │   ├── ik-llama-upgrade/  # LLM backend upgrade                                                                                                                        
│   │   ├── backup/            # Backup + disk health                                                                                                                       
│   │   ├── signal-notify/     # Signal group messaging                                                                                                                     
│   │   ├── git-push/          # Push workspace changes                                                                                                                     
│   │   └── uptime-kuma-webhook/  # Webhook receiver                                                                                                                        
│   └── alerts/                                                                                                                                                             
│       ├── current-alert.txt  # Active alert (overwritten each event)                                                                                                      
│       └── alert-2026-06-14-*.txt  # Timestamped history                                                                                                                   
├── incidents/                                                                                                                                                              
│   └── 2026-06-22-seerr-dms.md  # Incident reports                                                                                                                         
└── maintenance-log/                                                                                                                                                        
    ├── incident-2026-06-14.md   # Incident reports                                                                                                                         
    └── incident-2026-06-21.md                                                                                                                                              
                                                                                                                                                                            

Two Modes: Preventive and Incident

The agent operates in two modes, switching between them based on alerts:

Routine Mode (Preventive)

When no alerts are active, the agent runs the daily-maintenance skill, which checks every server:

  • Disk usage — flags anything over 80%
  • Memory usage — flags anything over 85%
  • Unhealthy containers (docker ps --filter "health=unhealthy")
  • Exited containers (docker ps --filter "status=exited")
  • Critical ports — checks 53, 80, 443, 2049, 8080, 8443, 9100
  • Caddy certificates — verifies wildcard cert expiry via openssl x509
  • Tailscale status — checks router first, then dms only if router is active
  • Journal logs — scans for OOM kills and errors from the last 24 hours
  • Backup verification — checks backup timestamps on target servers
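
The two numeric thresholds in this checklist could be expressed roughly as below. This is a sketch, not the skill's actual code: the function names are hypothetical, and only the 80% and 85% limits come from the list above.

```python
import shutil
from typing import List

DISK_LIMIT_PCT = 80.0    # from the checklist above
MEMORY_LIMIT_PCT = 85.0  # from the checklist above


def disk_percent(path: str = "/") -> float:
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def flag_usage(disk_pct: float, mem_pct: float) -> List[str]:
    """Return one human-readable finding per exceeded threshold."""
    findings = []
    if disk_pct > DISK_LIMIT_PCT:
        findings.append(f"disk at {disk_pct:.0f}% (limit {DISK_LIMIT_PCT:.0f}%)")
    if mem_pct > MEMORY_LIMIT_PCT:
        findings.append(f"memory at {mem_pct:.0f}% (limit {MEMORY_LIMIT_PCT:.0f}%)")
    return findings
```

Keeping the comparison separate from the measurement means the same thresholds can be applied to numbers gathered over SSH from each server.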

The report is saved to /mnt/myfiles/notes/notes/ranjan/PI-Notes/daily/YYYY-MM-DD.md and kept for 7 days.
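
The 7-day retention could be handled by something like the following sketch. It assumes reports are named exactly YYYY-MM-DD.md and that "kept for 7 days" means anything dated more than seven days before today is stale. Both are assumptions, as is the `expired_reports` name.

```python
from datetime import date, timedelta
from typing import Iterable, List

RETENTION_DAYS = 7  # from the text above


def expired_reports(filenames: Iterable[str], today: date) -> List[str]:
    """Return YYYY-MM-DD.md report names older than the retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    stale = []
    for name in filenames:
        if not name.endswith(".md"):
            continue
        try:
            stamp = date.fromisoformat(name[:-3])
        except ValueError:
            continue  # not a dated report, leave it alone
        if stamp < cutoff:
            stale.append(name)
    return stale
```

The function only returns names, so the caller decides whether to delete, archive, or just list them.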

Incident Mode (Breakdown)

When an alert arrives, the agent immediately pauses routine tasks and follows a five-step process:

  1. Acknowledge — reads the alert from current-alert.txt
  2. Diagnose — cross-references the affected service with homelab-index.yml to map dependencies
  3. Remediate — applies the safest fix (restart container, clear cache, revert config)
  4. Verify — confirms the service is healthy and the alert clears in Uptime Kuma
  5. Log — appends an incident summary to the maintenance log
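
The Diagnose step can be pictured as a lookup over the parsed index. The sketch below uses a hypothetical slice of homelab-index.yml; the `host` and `depends_on` keys and the dependencies themselves are assumptions for illustration, since the file's real contents are not shown here.

```python
from typing import Dict, List

# Hypothetical parsed index. Key names and dependency edges are assumed.
INDEX: Dict[str, dict] = {
    "caddy": {"host": "nasbox", "depends_on": []},
    "forgejo": {"host": "nasbox", "depends_on": ["caddy"]},
    "tailscale": {"host": "router", "depends_on": []},
    "seerr-dms": {"host": "dms", "depends_on": ["tailscale"]},
}


def dependency_chain(service: str, index: Dict[str, dict] = INDEX) -> List[str]:
    """List everything `service` relies on, nearest dependency first."""
    chain: List[str] = []
    pending = list(index[service]["depends_on"])
    while pending:
        dep = pending.pop(0)
        if dep in chain:
            continue  # already visited, which also stops cycles from looping
        chain.append(dep)
        pending.extend(index.get(dep, {}).get("depends_on", []))
    return chain
```

Walking the chain nearest-first gives a natural order for checking what might have caused the alert.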

The Alert System

This is the most interesting part of the setup. It's a bidirectional alert system — the agent sees both DOWN and UP events:

Flow

  1. Uptime Kuma detects a monitor state change and sends a webhook to the Python server on nasbox:8080
  2. Webhook server (uptime-kuma-webhook.py) parses the JSON payload, formats it, and writes it to current-alert.txt
  3. Uptime-monitor extension (uptime-monitor.ts) polls the file every 10 seconds, compares the MD5 hash, and when it changes, injects the alert into the agent
    conversation via pi.sendUserMessage() with deliverAs: "steer"
  4. Agent analyzes the alert — is this a new incident or a recovery?
  5. Agent resolves the issue and calls clear_alerts to clear the file
  6. Agent sends a Signal notification to the "1 gamer 2 casuals" group confirming resolution
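
Steps 2 and 3 amount to a writer and a hash-comparing poller. The sketch below shows the idea in Python rather than the TypeScript the extension is written in. `write_alert` and `AlertWatcher` are hypothetical names, and the atomic rename is a suggested safeguard, not something the post says the webhook server does.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_alert(path: Path, text: str) -> None:
    """Replace the alert file atomically so a reader never sees half a write."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.write(text)
    os.replace(tmp_name, path)  # atomic rename on the same filesystem


class AlertWatcher:
    """Remembers the last MD5 digest and reports when the file changes."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.last_digest: Optional[str] = None

    def poll(self) -> Optional[str]:
        """Return the new alert text if the content changed, else None."""
        if not self.path.exists():
            return None
        data = self.path.read_bytes()
        digest = hashlib.md5(data).hexdigest()
        if digest == self.last_digest:
            return None
        self.last_digest = digest
        return data.decode("utf-8")
```

Calling `poll()` every 10 seconds reproduces the described behaviour. Two limitations follow from the design: an event that is overwritten between polls is never seen, and two consecutive payloads with identical text hash the same and look like no change.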

Why Both UP and DOWN?

On June 14 alone, there were 8 DOWN events and 5 UP events. The current-alert.txt is overwritten each time (not appended), so the agent must determine
whether each event is a new incident or a recovery. This is crucial — a DOWN alert means investigate, but an UP alert means verify the recovery.

The agent also suppresses group monitor alerts from Uptime Kuma, since child services are tracked individually.
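
The DOWN/UP decision and the group suppression could be sketched as a single classifier. The field names (`monitor.type`, `heartbeat.status`) and the status codes reflect my understanding of Uptime Kuma's default webhook body and should be checked against a real payload. The return labels are hypothetical.

```python
from typing import Any, Dict, Optional

# Assumed status codes in the webhook's heartbeat object.
STATUS_DOWN = 0
STATUS_UP = 1


def classify(payload: Dict[str, Any]) -> Optional[str]:
    """Map a webhook payload to 'investigate', 'verify-recovery', or None."""
    monitor = payload.get("monitor") or {}
    heartbeat = payload.get("heartbeat") or {}
    if monitor.get("type") == "group":
        return None  # children are monitored individually
    status = heartbeat.get("status")
    if status == STATUS_DOWN:
        return "investigate"
    if status == STATUS_UP:
        return "verify-recovery"
    return None  # any other state, or a payload without a heartbeat
```

Returning None for anything unrecognised keeps unexpected payloads from triggering an incident.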

Maintenance Skills

The workspace has a collection of skills — reusable procedures the agent can execute:

  • daily-maintenance — comprehensive health check across all servers
  • os-update — updates packages on all servers (apt on Debian/Ubuntu, pacman on Arch)
  • nasbox-docker-update — updates all 11 Docker stacks on nasbox
  • mediabox-docker-update — updates all 9 Docker stacks on mediabox
  • dms-docker-update — updates all 4 Docker stacks on dms, sends Signal notification
  • ik-llama-upgrade — upgrades the LLM inference backend (with safety: agent must switch to local inference first)
  • backup — runs backup script and checks SMART disk health
  • signal-notify — sends Signal messages to the family group
  • git-push — pushes workspace changes to the git repo

Incident Response in Action

The system has handled several incidents:

  • Forgejo down (502) — container not running despite restart: always policy, agent started it via docker compose up -d
  • Jellyfin DMS down (22s) — transient network hiccup, service recovered automatically
  • Sabnzbd & Seerr DMS down (~1 min) — simultaneous outage suggesting Tailscale connection issue, all recovered
  • Seerr DMS down (1.8 min) — service recovered on its own

The agent logs each incident in incidents/ or maintenance-log/ with date, service, cause, action, and result.
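
One way to keep those five fields consistent is a small record type. This is a sketch under assumptions: the `Incident` class and the Markdown layout are hypothetical, and only the field list and the date-then-service filename pattern come from the workspace described above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Incident:
    day: date
    service: str
    cause: str
    action: str
    result: str

    def filename(self) -> str:
        """Name in the style of incidents/2026-06-22-seerr-dms.md."""
        slug = self.service.lower().replace(" ", "-")
        return f"{self.day.isoformat()}-{slug}.md"

    def to_markdown(self) -> str:
        """Render the logged fields as a short Markdown entry."""
        return (
            f"## {self.day.isoformat()} {self.service}\n\n"
            f"- Cause: {self.cause}\n"
            f"- Action: {self.action}\n"
            f"- Result: {self.result}\n"
        )
```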

Safety Constraints

The agent operates under strict rules:

  • Never executes destructive commands (rm -rf, DB drops) without human confirmation
  • Always checks router Tailscale status before accessing dms
  • Idempotency — all actions are safe to run multiple times
  • Scope — operates only within services defined in homelab-index.yml
  • Communication — provides concise status updates in the TUI
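
The first and fourth rules could be enforced by a pre-flight check along these lines. It is an illustrative sketch only: the function names are hypothetical, the pattern list is far narrower than a real policy would need, and the post does not say how the rules are actually enforced.

```python
import shlex
from typing import Iterable


def needs_confirmation(command: str) -> bool:
    """True if the command looks destructive. Errs on the side of asking."""
    lowered = command.lower()
    if "drop database" in lowered or "drop table" in lowered:
        return True
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable quoting, so ask a human
    for i, tok in enumerate(tokens):
        if tok == "rm":
            # Collect the letters of every flag after rm, e.g. -rf or -r -f.
            flags = "".join(
                t.lstrip("-").lower() for t in tokens[i + 1:] if t.startswith("-")
            )
            if "r" in flags and "f" in flags:
                return True
    return False


def in_scope(service: str, known_services: Iterable[str]) -> bool:
    """True only for services listed in the parsed homelab index."""
    return service in set(known_services)
```

A check like this is a seatbelt rather than a sandbox. It only inspects the command text, so OS-level restrictions still carry most of the weight.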

Why This Works

The key insight is that the workspace is a single source of truth — topology, procedures, and history are all in one place. The agent doesn't need to guess; it
consults homelab-index.yml for the map, AGENTS.md for the rules, and the skills for the procedures. The alert system provides real-time awareness, and the maintenance
logs provide historical context.

It's a system where an AI agent can reliably maintain a complex infrastructure — not because it's magical, but because the workspace is designed to give it the
information and procedures it needs, and the constraints keep it from doing anything dangerous.

top 50 comments
[–] irmadlad@lemmy.world 2 points 1 hour ago (1 children)

Forgive my lack of understanding, but basically you have set up an automation system that starts/stops/upgrades/updates docker containers, and system management type of tasks? Do you pipe all this data to some type of monitoring dashboard....maybe something like Grafana? It seems like there would be a lot of data points that could/should be monitored. Do you get text/email alerts that confirm all is copacetic or not?

It sounds spectacular. Maybe a little too complicated for me to wrap my old head around all at once. One of these days, hopefully, I'm going to get AI into the lab as a useful tool and not as just an oddity that takes forever to compute.

Rock on with yo' bad self bro! Thanks for sharing.

[–] variety4me@lemmy.zip 2 points 46 minutes ago

I have not tried that yet, but that's the next step I should take.

[–] midribbon_action@lemmy.blahaj.zone 17 points 7 hours ago* (last edited 7 hours ago) (5 children)

It seems the main use case is restarting docker containers, why not use the built-in healthcheck feature of docker? The automatic backup and upgrade are also confusing to me, operating systems come with that built in. I just don't quite understand the point of replacing existing deterministic systems with a natural language interface, I would have trouble believing the logs at face value.

Edit: also your handling of current-alert.txt is a perfect example of a race condition, another potential source of indeterminism. An alert could be missed if the file is overwritten before being handled.

[–] variety4me@lemmy.zip 1 points 6 hours ago (1 children)

It's a homelab, not a commercial production environment. I agree with you, but I am not too worried about it.

[–] midribbon_action@lemmy.blahaj.zone 9 points 6 hours ago (2 children)

I guess I'm confused... The built-in functionality seems like the easier way to accomplish the same thing. You seem to have spent a large amount of time on this project, are proud of it, and wanted to share it, but you also acknowledge that it's worse than what already exists and uses more resources idly. Why should anybody else do this?

[–] DeadDigger@lemmy.zip 3 points 6 hours ago (1 children)

I mean it is an interesting test for ai capabilities and limitations. You have an existing low tech deterministic use case and setup and can compare that with the ai setup

[–] midribbon_action@lemmy.blahaj.zone 3 points 6 hours ago (2 children)

There is no comparison, I made the comparison myself. In all honesty I feel like they didn't know about basic docker and linux concepts until my comment.

[–] DeadDigger@lemmy.zip 3 points 6 hours ago (1 children)

Well you asked why anybody else would do it and I answered on that

Are you working at an ai startup or university? I asked op for their motivations, and comparisons to existing solutions seemed like the least of their concerns, maybe even unconsidered. But I guess it could be a fair answer from your pov if you are trying to test and improve llms. I just hope you're getting paid for the research.

[–] variety4me@lemmy.zip 1 points 6 hours ago (1 children)

I knew dockhand/portainer would do docker updates better, i knew auto updates can be setup via cron for os updates, etc.

i am neither a sys admin, nor a programmer, i just run a hobby homelab and like to tinker and learn. its a good enough usecase for me to explore the possibilities

> I knew dockhand/portainer would do docker updates better, i knew auto updates can be setup via cron for os updates, etc.

Well, that could've been mentioned? Why did I have to bring that up? Nothing about the post is self reflective, it is entirely bragging. I get you didn't write it, and that's just how llms sound, but you did decide to post it.

[–] variety4me@lemmy.zip 0 points 6 hours ago (1 children)

So dont do it!

It's a learning experience: how can a coding agent be used in a non-coding way? Is it better or worse? I guess I have my answers now.

This may not be the ideal use case, but it surely shows that these agents can be used for other things.

[–] midribbon_action@lemmy.blahaj.zone 1 points 5 hours ago (1 children)

What answer did you arrive at? Are you planning on ending the test?

[–] variety4me@lemmy.zip 1 points 5 hours ago (1 children)

It has produced great documentation for my homelab. That's what it did best, and it could not have done that without carrying out the tasks it was asked to do.

[ranjan@llmbox Homelab Wiki]$ tree -L 2
.
├── Clients
│   ├── CachyOS-Laptop.md
│   └── README.md
├── Infrastructure
│   ├── README.md
│   ├── Router.md
│   └── Switch.md
├── README.md
├── Servers
│   ├── dms.md
│   ├── llmbox.md
│   ├── mediabox.md
│   ├── nasbox.md
│   ├── README.md
│   └── Router.md
└── Services
    ├── AdGuard-Home.md
    ├── BentoPDF.md
    ├── Beszel.md
    ├── Caddy.md
    ├── Collabora.md
    ├── Degoog.md
    ├── Dockhand.md
    ├── Flaresolverr.md
    ├── FMD.md
    ├── Food.md
    ├── Forgejo.md
    ├── Glance.md
    ├── Gonic.md
    ├── Homepage.md
    ├── Immich.md
    ├── Invidious.md
    ├── IT-Tools.md
    ├── Jellyfin.md
    ├── Jotty.md
    ├── Linkding.md
    ├── Llama-Swap.md
    ├── Metube.md
    ├── NFS.md
    ├── Ntfy.md
    ├── Omnitools.md
    ├── OpenCloud.md
    ├── Prowlarr.md
    ├── qBittorrent.md
    ├── Rackpeek.md
    ├── Radarr.md
    ├── Radicale.md
    ├── README.md
    ├── Redlib.md
    ├── SABnzbd.md
    ├── SearXNG.md
    ├── Seerr.md
    ├── Signal-DMS.md
    ├── Sonarr.md
    ├── Speedtest.md
    ├── Tailscale.md
    ├── Termix.md
    ├── Transmission.md
    ├── Uptime-Kuma.md
    └── Vaultwarden.md

Wow amazing, an llm that can generate text!

I'm still curious though if you are going to change your approach after this test.

[–] magnue@lemmy.world 12 points 8 hours ago (1 children)

"single source of truth" gives me PTSD from the last wanker consultant that was hired at work to spew bullshit and fire people.

[–] variety4me@lemmy.zip 4 points 8 hours ago* (last edited 8 hours ago)

At 56, i was laid off from a Fortune 500 company, so i hear you. Today I am without a job just trying to learn and keep up everyday.

Edit: spelling

[–] call_me_xale@lemmy.zip 56 points 11 hours ago (6 children)

ai; dr

If you couldn't be bothered to write this up yourself, why should I spend my time reading it?

[–] Fedegenerate@fedinsfw.app 2 points 6 hours ago (1 children)

I've been dreaming of local AI for a hot minute. Sadly corporate AI has priced me out of personal computing, and current hardware isn't up to it.

Maybe my n100 could, I don't really care how slow it's generating the tokens if it's just resetting containers and logging errors.

Does your setup handle automagic updates, how do you handle prompt injection if so? Or, just error correction/logging?

[–] variety4me@lemmy.zip 2 points 6 hours ago* (last edited 6 hours ago) (1 children)

Look at my setup...

  1. The Intel Xeon E-2224G was a server/workstation processor with 4 cores, launched in May 2019
  2. DDR4 32 GB non branded sticks

It's ancient, cheap hardware. What's stopping you?

> Does your setup handle automagic updates, how do you handle prompt injection if so? Or, just error correction/logging?

Figure it out yourself with your agent. What suits your use case? What would you like the agent to do? How much of a risk can you take with your agent? It varies depending on so many factors.

[–] Fedegenerate@fedinsfw.app 1 points 5 hours ago

Sum total of my hardware:

Ugreen dxp 4800, the Pentium one. 32gb ram. My main box: jellyfin, arrs, immich, pihole, nginx, etc... It can't go down. I don't think I want an LLM here.

Beelink n100 16gb ram. Local back up, redundant pihole, immich machine learning... Generally under utilised, I'd like to move some services around, does proxmox have an auto balancer?!

Spare no name n100. 8, or 16gb, I can't remember. Abandoned box, It's what I would put the LLM on, I did have it reserved for a remote back up.

I think it would need a ram upgrade, see corporate AI pricing me out of personal computing. Currently Amazon has a Crucial 32gb ddr5 sodimm module for £280. Which is too high a price for what I'd use it for.

Oh, and I have an abandoned gaming rig with a gtx970, and some rPi0/3s

I've put Ollama on an n100 before. It obviously ate all of that box and made everything on it chug, and it was too slow for human use. But if it's just generating logs, and resetting containers then I wouldn't mind how slow it is.

[–] one_old_coder@piefed.social 31 points 12 hours ago

The comment below is written by my agent:

You're absolutely right, that's very interesting /s

[–] melmi@lemmy.blahaj.zone 12 points 10 hours ago (1 children)

Having an autonomous LLM agent in a homelab like this seems like just a matter of time before things go wrong, but it seems like an interesting experiment.

Have you had any issues with the agent behaving unexpectedly?

[–] variety4me@lemmy.zip 3 points 9 hours ago (1 children)

My sudoers file restricts what the LLM can actually do. I also have robust backups and can spin up any of my servers really quickly. I am not that worried: just like you deal with human errors, you can deal with agent errors.

So far this has been running for a month with no scares or unexpected behaviour, other than sometimes looping on a task.

Sorry I know you probably don't want another tip from me, but the post did include the agent directly using the docker daemon, which runs as root typically. Because you didn't mention running rootless docker or podman, your sudoers file probably allows the agent full access to root instead of preventing it.

[–] blarg_dunsen@sh.itjust.works 4 points 10 hours ago (2 children)

How are you running a 34B model without a GPU? You must be getting one token an hour! How much RAM do you have in the LLM box?

[–] cecilkorik@lemmy.ca 1 points 7 hours ago

Not what OP is using obviously, but AMD X3D CPUs and Mac systems can be quite competitive for AI if you're lacking VRAM. Not all CPUs struggle with inference, and some GPUs aren't so hot at it either. GPUs are generally better, especially the really high-end ones, but once you compare low- and mid-range cards against high-end CPUs, things start to look somewhat muddier.

[–] variety4me@lemmy.zip 3 points 9 hours ago

Its an MoE model (https://en.wikipedia.org/wiki/Mixture_of_experts), only 3B parameters are actually active

I have 32GB RAM

[–] crash_thepose@lemmy.ml 6 points 11 hours ago (5 children)

When you have a local llm, is it still relying on the energy resources of open ai or the like? Sorry for the dumb question

[–] SatyrSack@quokk.au 7 points 10 hours ago (1 children)

Originally, training the model used the energy resources of the original corporation or whatever. But when you download that model and start running it on your own hardware, you are using your own energy.

Think of it kind of like some software like Jellyfin. When the developers write the software, they do so using their own electricity. But when you download Jellyfin and actually run the software on your own hardware, you are now only using your electricity, not the developer's electricity at all.

[–] crash_thepose@lemmy.ml 1 points 2 hours ago

Thank you for explaining!

[–] variety4me@lemmy.zip 4 points 11 hours ago (9 children)

The local LLM runs on the homelab, just like Immich runs on your homelab and doesn't talk to Google Photos in any way. It's the same for my model: self-contained, in-house, with no data leaving my network.

[–] 0x0f@piefed.social 2 points 8 hours ago (1 children)

Thanks for sharing this, I have been looking for an AI setup without GPU, so this is right up my alley.

[–] variety4me@lemmy.zip 2 points 8 hours ago

Welcome! if you have questions on ik build parameters for optimizations feel free to ask, I will try my best to answer
