
I'm sending this to my boss to remind him why monitoring disk space is vital.

all 18 comments
[-] Pons_Aelius@kbin.social 51 points 1 year ago* (last edited 1 year ago)

$100 says there is a series of emails sent by a sysadmin/DBA over the past couple months warning about this issue in explicit detail and its increasing urgency, that have been ignored.

The person sending the emails will still get chewed out, because they failed to make the higher-ups realise this is a real problem.

[-] ShunkW@lemmy.world 10 points 1 year ago* (last edited 1 year ago)

I used to be a sysadmin, now I'm a software developer. At one of my old jobs at a massive corporation, they decided to consolidate several apps' DB servers onto one host. We found out about this after it had already happened, because they at least properly set up CNAME records, so it was seamless to us. Some data was lost in the move, but with literally billions of records in our DB we didn't notice until it triggered a scream test for our users. We were also running up against data storage limits.

They ended up undoing the change, which caused us a data-merge nightmare that lasted several full workdays.

[-] emmanuel_car@kbin.social 9 points 1 year ago

…it triggered a scream test for our users.

This phrase has brought me much joy.

[-] ShunkW@lemmy.world 7 points 1 year ago

It's such an accurate term. I worked in IAM for a while and when no one claimed ownership of an application account, we'd go with a scream test. Lock the account and see who screams at us lol.

[-] luciferofastora@discuss.online 4 points 1 year ago

We had that some time ago with a service account for a specific system where individual personal accounts weren't (yet) feasible. The credentials were supposed to be treated with confidence and not shared without the admins' approval. Yeah, you can guess how that went.

When the time came to migrate access to the system to a different solution using personal accounts, it was announced that the service account password would be changed and henceforth kept under strict control by the sysadmin, who would remotely enter it where it was needed but never hand it out in clear text. That announcement was sent to all the authorised credential holders with the instruction to pass it on if anyone else had been given access, and repeated shortly before the change.

The change was even delayed for some sensitive reasons, but eventually went through. Naturally, everyone was prepared, had gone through the steps to request the new access and all was well. Nobody called to complain about things breaking, no error tickets were submitted to entirely unrelated units that had to dig around to find out who was actually responsible, and all lived happily ever after. In particular, the writer of this post was blissfully left alone and not involuntarily crowned the main point of contact by any upset users passing their name on to other people the writer had never even seen the name of.

[-] ShunkW@lemmy.world 5 points 1 year ago

When I was working in that old job we had one particular fiasco that legit stresses me to remember. We have this account, no one knows what it does, but the password has never been rotated, it's not vaulted, etc. There's 5 apps that share the DB. I contact all the app owners, no response.

I wait a week and escalate to their bosses. No response. I send emails every single day to everyone, including all the dev teams. Not one "lemme check on that" or anything. Our policy was to wait 90 freaking days for a non-single-user account. I'm getting yelled at to get this ticket closed when the day comes.

I go in, lock the account, change the password, and kill all DB sessions. Within 15 minutes I'm paged for a priority one incident because a trading app is down, causing the whole floor to be out and they're losing millions every minute.
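For anyone curious, the lockdown described here (lock the account, then kill its live sessions) boils down to a couple of DDL statements on most RDBMSes. A minimal sketch that just builds the statements, assuming Oracle-style syntax; the account name and session IDs below are made up:

```python
# Sketch of the lockdown sequence: lock the suspect service account,
# then kill each of its live sessions. Oracle-style syntax assumed;
# "APP_SVC" and the sid/serial pairs are hypothetical.
def lockdown_statements(user, sessions):
    stmts = [f"ALTER USER {user} ACCOUNT LOCK"]
    for sid, serial in sessions:
        stmts.append(f"ALTER SYSTEM KILL SESSION '{sid},{serial}' IMMEDIATE")
    return stmts
```

In practice you'd pull the session list from `v$session` (or the equivalent on your DB) before killing anything.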

I tell them what I did and forward the emails to everyone. The executive director is screaming at me, telling me I'm gonna be fired soon and I'd better fix it right now.

Sure, I can unlock the account and even force the password back to the old version. What's that? No one knows what the old password was? Nothing I can do. Fortunately my executive director was awesome and stepped in to take the call. Overall they were down for an hour and a half. I looked at the incident later and they claimed $100 million in losses. The app owners wanted me fired. They got the uno reverse though and lost their jobs over it.

Fuck that job lol.

[-] luciferofastora@discuss.online 2 points 1 year ago

Our system wasn't quite as critical, thankfully, but the app owners failing to respond to "Hey, by the way, the service account for your database is gonna be closed" is just gross negligence. My condolences that you had to take the brunt of their scrambling to cover their asses.

For all the complaints I may have about certain processes and keeping certain stakeholders in the loop about changing the SQL Views they depend on, at least I acknowledge that plenty of people did heed the announcement and make the switch. It's just that the "Oops, that mail must have drowned in my pile of IDGAF what our sysadmins are writing about again. Can't you just give me the new password again, pretty please?" are far more visible.

[-] Brkdncr@kbin.social 2 points 1 year ago

The only thing worse than a single database server is several poorly maintained database servers. The idea was right, but maybe the implementation was wrong.

[-] Toad_the_Fungus@kbin.social 8 points 1 year ago

my $100 goes to them storing so much of their customers' personal data on the servers

[-] Pons_Aelius@kbin.social 6 points 1 year ago

While that story is shitty I doubt the manufacturing control DB and customer data DB are anywhere near each other.

[-] Faceman2K23@discuss.tchncs.de 4 points 1 year ago

Of course, to get anything done by corporate japan you need to put it in writing and fax it.

[-] MrPoopyButthole@lemmy.world 13 points 1 year ago* (last edited 1 year ago)

I had a call recording server crash 2 days ago, and when I inspected it I found some jackass had partitioned the drive into 50G and 450G partitions and then never used the 450G one. The root partition had 20 KB of free space remaining. The DB was so beyond fucked I had to create a new server.
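A root partition creeping toward zero is exactly what a dumb periodic check catches before the DB eats itself. A minimal stdlib sketch; the mount point and the 5 GiB floor are assumptions to tune per host:

```python
import shutil

def free_space_ok(path="/", min_free_bytes=5 * 2**30):
    """True while the filesystem holding `path` still has at least
    the free-space floor remaining; alert (page, email, etc.) when
    this flips to False -- not when you're down to 20 KB."""
    return shutil.disk_usage(path).free >= min_free_bytes
```

Wire it into cron or whatever scheduler you have; the point is that it runs unattended.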

[-] kylian0087@lemmy.world 0 points 1 year ago

I hope there were at least backups... so rebuild and restore.

[-] MelodiousFunk@kbin.social 10 points 1 year ago

I bet at one time they had a functional threshold alerting system. Then someone missed something (because they're human) and management ordered more alerts "so it doesn't happen again." Wash, rinse, repeat over the course of years (combined with VM sprawl and acquiring competitors) until there's no semblance of sanity left, having gone far past notification fatigue and well into "my job is just checking email and updating tickets now." But management insists that all of those alerts are needed because Joe Bob missed an email... which there are now exponentially more of... and the board is permanently half red anyway because the CTO (bless his sociopathic heart) decreed that 80% is the company standard for alerts and a bunch of stuff just lives there happily so good luck seeing something new.

...I was not expecting to process that particular trauma this evening.

[-] WhyYesZoidberg@lemmy.world 2 points 1 year ago

/me cries in OpenStack and no online disk resizing

[-] kowcop@aussie.zone 1 points 1 year ago

Sack both the capacity management and the reporting & analytics teams

[-] shipbreaker@lemmy.world 1 points 1 year ago

This is in Japan. Not gonna happen.

this post was submitted on 07 Sep 2023
121 points (96.9% liked)

Sysadmin


A community dedicated to the profession of IT Systems Administration