this post was submitted on 05 Aug 2023
These were because of recent spam bots.
I made some changes today. We now have 4 containers for the UI (we only had 1 before) and 4 for the backend (we only had 2).
It seems that when you delete a user and tell Lemmy to also remove the content (the spam), it tells the database to mark all of the content as deleted.
Kbin.social had about 30 users who posted 20 to 30 posts each, which I told Lemmy to delete.
This only marks it as deleted for Reddthat users until the mods mark the post as deleted and it federates out.
The problem
The UPDATE in the database (marking the spam content as deleted) takes a while and the backend waits(?) for the database to finish.
Even though the backend has 20 different connections to the database, it uses 1 connection for the UPDATE and then waits/gets stuck.
This is what is causing the outages unfortunately and it's really pissing me off to be honest. I can't remove content / action reports without someone seeing an error.
I don't see anything in the 0.18.3 release notes that would solve this.
Temp Solution
So to combat this a little I've increased our backend processes from 2 to 4 and our front-end from 1 to 4.
My idea is that if 1 of the backend processes gets "locked" up while performing tasks, the other 3 processes should take care of it.
This is unfortunately an assumption. If the "removal" performs an UPDATE on the database and the /other/ backend processes are aware of this and wait as well, that would count as "locking" up the database. Then it won't matter how many processes I scale out to: the applications will lock up and cause us downtime.
Next Steps
Note: we are kinda doing point #3 already, as nginx does a round-robin (tries each backend sequentially). But from what I've seen in part of the logs, it can't differentiate between one that is down and one that is up. (From the nginx documentation, that feature is a paid one.)
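For anyone following along, here is a rough sketch of what the free version of nginx can do about a stuck backend. It is not our actual config: the service name lemmy, port 8536, and all the timeout values are assumptions for illustration. As far as I know, the paid part is the active health check (the health_check directive); the passive max_fails / fail_timeout parameters and proxy_next_upstream are in open-source nginx.

```nginx
upstream lemmy {
    # Hypothetical service name and port; adjust to the real compose setup.
    # After max_fails failed attempts within fail_timeout, nginx skips this
    # server for fail_timeout before trying it again.
    server lemmy:8536 max_fails=2 fail_timeout=15s;
}

server {
    listen 80;

    location / {
        proxy_pass http://lemmy;

        # Give up on an unresponsive backend sooner than the 60s defaults.
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;

        # On a connection error or timeout, pass the request to the next
        # server in the upstream group.
        proxy_next_upstream error timeout;
    }
}
```

One caveat: by default nginx does not retry non-idempotent requests (such as POST) on another server, so this would mostly help with page loads rather than with the moderation actions themselves.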
Cheers, Tiff
Updates hiding in the comments again!
We are now using v0.18.3!
There was extended downtime because docker wouldn't cooperate AT ALL.
The nginx proxy container would not resolve the DNS. So after rebuilding the containers twice and investigating the docker network settings, a "simple" reboot of the server fixed it!
upstream lemmy-ui & upstream lemmy. These are DNS entries which are cached for a period of time. So if a new container comes online, the proxy doesn't actually find it, because it has cached all the IPs that lemmy-ui resolves to. (In this example it would have been only 1, so when we add more containers the proxy would never find them.)
4.1 You can read more here: http://forum.nginx.org/read.php?2,215830,215832#msg-215832
I get notified whenever reddthat goes down, and most of the time it coincided with me banning users and removing content. So I didn't look into it much, but honestly the uptime isn't great. (Red is <95% uptime, which means we were down for 1 hour!)
Actually, it is terrible.
With the changes we've made, I'll be monitoring it over the next 48 hours to confirm that we no longer have any real issues. Then I'll make a real announcement.
Thanks all for joining our little adventure!
Tiff
For number 4, can you set a cron job to constantly flush DNS cache?
It's the internal nginx cache. It /shouldn't/ be a problem once I update the configuration to handle it.
We can add a resolver line with valid=5s so it will recheck every 5 seconds instead of whatever the internal docker TTL cache is.
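Roughly what that could look like, as a sketch and not the final config. The assumptions here are mine: 127.0.0.11 is Docker's embedded DNS server on user-defined networks, and lemmy-ui:1234 is a placeholder service name and port. Note that the resolver is only consulted when proxy_pass uses a variable; a hostname written directly in an upstream block is still resolved once at startup.

```nginx
server {
    listen 80;

    # Docker's embedded DNS (assumed: containers are on a user-defined network).
    # valid=5s overrides the TTL of the DNS answer, so cached lookups expire
    # after 5 seconds.
    resolver 127.0.0.11 valid=5s;

    location / {
        # Putting the target in a variable makes nginx resolve the name at
        # request time through the resolver above, instead of once at startup.
        # lemmy-ui:1234 is a placeholder; it must not also be the name of an
        # upstream block, or nginx will use that block instead.
        set $lemmy_ui http://lemmy-ui:1234;
        proxy_pass $lemmy_ui;
    }
}
```

The trade-off is that this bypasses the upstream block, so any per-server settings defined there no longer apply to this location.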