this post was submitted on 17 Mar 2026

No Stupid Questions (Developer Edition)

Hey, so… you won’t believe this, but I just spent hours chasing one of those bugs that makes you question your life choices.

Everything looked fine at first. You know that feeling, right? Pods running, services up, nothing obviously broken. But then… random failures. Logs screaming “connection refused,” traces looking like total nonsense.

So I’m sitting there like, “Okay… what is even happening?”

Well, after digging way too deep, I finally found it. Turns out it was a race condition. Yeah, one of those. It was happening between federation hooks and Redis cache invalidation. Basically, things were happening slightly out of order… just enough to break stuff randomly.

And the worst part? It didn’t fail every time. Only sometimes. Can you imagine that?

So yeah, I kept going back and forth, thinking I fixed it… then boom, same issue again.
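To make the failure mode concrete, here’s a minimal sketch of why ordering matters. All names here are hypothetical stand-ins, not the actual Lemmy code: an in-memory map plays the role of the Redis cache, and the “hook” fails whenever it runs before invalidation:

```rust
use std::collections::HashMap;

// Hypothetical in-memory stand-in for the Redis cache: key -> version.
struct Cache(HashMap<String, u64>);

impl Cache {
    fn invalidate(&mut self, key: &str) { self.0.remove(key); }
    fn get(&self, key: &str) -> Option<u64> { self.0.get(key).copied() }
}

// A federation hook that expects the cache entry to already be invalidated
// (so fresh data gets re-fetched). If invalidation hasn't run yet, it sees
// a stale version and the delivery "randomly" fails downstream.
fn federation_hook(cache: &Cache, key: &str) -> Result<(), String> {
    match cache.get(key) {
        None => Ok(()),
        Some(v) => Err(format!("stale cache entry v{}", v)),
    }
}

fn main() {
    let mut cache = Cache(HashMap::from([("post:42".to_string(), 1)]));

    // Ordering A: hook fires before invalidation -> the intermittent failure.
    assert!(federation_hook(&cache, "post:42").is_err());

    // Ordering B: invalidation runs first -> everything looks fine.
    cache.invalidate("post:42");
    assert!(federation_hook(&cache, "post:42").is_ok());
}
```

Which ordering you get depends entirely on task scheduling, which is why it only failed sometimes.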

Here’s what I ended up doing. Nothing fancy at all. Just added exponential backoff to the retry logic:

```rust
async fn retry_federation(activity: Activity, max_retries: u32) -> Result<()> {
    let mut delay = Duration::from_millis(100);
    for attempt in 0..max_retries {
        match send_to_relays(&activity).await {
            Ok(_) => return Ok(()),
            // Not the last attempt: back off, then retry with double the delay.
            Err(_) if attempt < max_retries - 1 => {
                tokio::time::sleep(delay).await;
                delay *= 2;
            }
            // Last attempt: give up and surface the error.
            Err(e) => return Err(e),
        }
    }
    Err(anyhow!("Federation failed after {} retries", max_retries))
}
```

And yeah… that actually fixed it.

Not some big architectural change. Just… “wait a bit and try again” 😄

So yeah, lesson learned, timing issues in distributed systems are sneaky. Especially with federation stuff. Cold starts, retries, cache timing… all of it can mess with you.

Anyway, that was my day. What do you think? Ever had a bug like this where everything looked fine but totally wasn’t?

top 2 comments
[–] BB_C@programming.dev 1 points 1 week ago (1 children)

0…max_retries

Good thing Rust replaced ... with ..= for inclusive range syntax. Otherwise, the webshit markdown implementation used by Lemmy UI replacing ... with the ligature would have been confusingly problematic 😉.

And this seems to be yet another case showing that federation was poorly designed, and should have been designed as pull-based (and batchable/packable), instead of endlessly spam-pushing individual messages and hoping for the best.

[–] jackevans@lemmy.world 1 points 14 hours ago

Haha yeah, the ... vs ..= thing still trips me up sometimes; if Markdown started “optimizing” that, I probably would’ve lost my mind after hours of debugging 😅.

And yep, totally agree on the federation design angle. Push-based federation feels like a relic that made sense early on but doesn’t scale well, especially once retries, delivery guarantees, and caching get involved. Pull-based + batching could’ve made timing issues way less painful…
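For anyone curious what pull-based, batched federation could look like, here’s a rough sketch (purely illustrative, not a real protocol or anything from Lemmy’s codebase): the receiver polls an outbox with a cursor and gets a batch back, instead of the sender pushing each activity and retrying on failure:

```rust
// Hypothetical pull-based federation: the receiving instance polls for
// everything since a cursor instead of being pushed individual messages.
struct Activity { id: u64, body: String }

struct Outbox { activities: Vec<Activity> }

impl Outbox {
    // Return up to `limit` activities after `cursor`, plus the new cursor.
    // Resuming from the returned cursor makes delivery naturally resumable:
    // a failed pull just gets retried with the same cursor, no duplicates.
    fn pull_since(&self, cursor: u64, limit: usize) -> (Vec<&Activity>, u64) {
        let batch: Vec<&Activity> = self
            .activities
            .iter()
            .filter(|a| a.id > cursor)
            .take(limit)
            .collect();
        let next = batch.last().map(|a| a.id).unwrap_or(cursor);
        (batch, next)
    }
}

fn main() {
    let outbox = Outbox {
        activities: (1..=5)
            .map(|id| Activity { id, body: format!("activity {}", id) })
            .collect(),
    };

    // First pull: a batch of two, cursor advances to 2.
    let (batch, cursor) = outbox.pull_since(0, 2);
    assert_eq!(batch.len(), 2);
    assert_eq!(batch[0].body, "activity 1");
    assert_eq!(cursor, 2);

    // Second pull resumes from the cursor: the remaining three, no duplicates.
    let (batch, cursor) = outbox.pull_since(cursor, 10);
    assert_eq!(batch.len(), 3);
    assert_eq!(cursor, 5);
}
```

The nice property is that the cursor makes timing a non-issue: if a pull fails, you retry with the same cursor and get the same batch, rather than racing pushes against cache state.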