Programming

21127 readers

369 users here now

Welcome to the main community in programming.dev! Feel free to post anything relating to programming here!

Cross posting is strongly encouraged in the instance. If you feel your post or another person's post makes sense in another community cross post into it.

Hope you enjoy the instance!

Rules

Follow the programming.dev instance rules
Keep content related to programming in some way
If you're posting long videos try to add in some form of tldr for those who don't want to watch videos

Wormhole

Follow the wormhole through a path of communities !webdev@programming.dev

founded 2 years ago

MODERATORS

snowe@programming.dev

Ategon@programming.dev

MaungaHikoi@lemmy.nz

UlrikHD@programming.dev

database greenhorn (discuss.tchncs.de)

submitted 1 month ago* (last edited 1 month ago) by PoisonedPrisonPanda@discuss.tchncs.de to c/programming@programming.dev

49 comments fedilink hide all child comments

hi my dears, I have an issue at work where we have to work with millions (150 mln~) of product data points. We are using SQL server because it was inhouse available for development. however using various tables growing beyond 10 mln the server becomes quite slow and waiting/buffer time becomes >7000ms/sec. which is tearing our complete setup of various microservices who read, write and delete from the tables continuously down. All the stackoverflow answers lead to - its complex. read a 2000 page book.

the thing is. my queries are not that complex. they simply go through the whole table to identify any duplicates which are not further processed then, because the processing takes time (which we thought would be the bottleneck). but the time savings to not process duplicates seems now probably less than that it takes to compare batches with the SQL table. the other culprit is that our server runs on a HDD which is with 150mb read and write per second probably on its edge.

the question is. is there a wizard move to bypass any of my restriction or is a change in the setup and algorithm inevitable?

edit: I know that my questions seems broad. but as I am new to database architecture I welcome any input and discussion since the topic itself is a lifetime know-how by itself. thanks for every feedbach.

you are viewing a single comment's thread
view the rest of the comments

[–] wwb4itcgas@lemm.ee 4 points 1 month ago (1 children)

To paraquote H. L. Mencken: For every problem, there is a solution that's cheap, fast, easy to implement -- and wrong.

Silver bullets and magic wands don't really exist, I'm afraid. There's amble reasons for DBA's being well-paid people.

There's basically three options: Either increase the hardware capabilities to be able to handle the amount of data you want to deal with, decrease the amount of data so that the hardware you've got can handle it at the level of performance you want or... Live with the status quo.

If throwing more hardware at the issue was an option, I presume you would just have done so. As for how to viably decrease the amount of data in your active set, well, that's hard to say without knowledge of the data and what you want to do with it. Is it a historical dataset or time series? If so, do you need to integrate the entire series back until the dawn of time, or can you narrow the focus to a recent time window and shunt old data off to cold storage? Is all the data per sample required at all times, or can details that are only seldom needed be split off into separate detail tables that can be stored on separate physical drives at least?

[–] PoisonedPrisonPanda@discuss.tchncs.de 1 points 1 month ago (1 children)

To paraquote H. L. Mencken: For every problem, there is a solution that’s cheap, fast, easy to implement – and wrong.

This can be the new slogan of our development. :')

I have convinced management to switch to a modern server. In addition we hope refactoring our approach (no random reads, no dedupe processes for a whole table, etc.) will lead us somewhere.

As for how to viably decrease the amount of data in your active set, well, that’s hard to say without knowledge of the data and what you want to do with it. Is it a historical dataset or time series?

Actually now. We are adding a layer of processing products to an already in-production system which handles already multiple millions of products on a daily basis. Since we not only have to process the new/updated products but have to catch up with processing the historical (older) products as well its a massive amount of products. We thought since the order is not important to use a random approach to catch up. But I see now that this is a major bottleneck in our design.

If so, do you need to integrate the entire series back until the dawn of time, or can you narrow the focus to a recent time window and shunt old data off to cold storage?

so no. No narrowing.

Is all the data per sample required at all times, or can details that are only seldom needed be split off into separate detail tables that can be stored on separate physical drives at least?

Also no IMO. since we dont want a product to be processed twice, we want to ensure deduplication - this requires knowledge of all already processed products. Therefore comparing with the whole table everytime.

[–] wwb4itcgas@lemm.ee 1 points 1 month ago

Sorry for taking so long to get back to you on this, but I'm not always on Lemmy. There's always more code to be written - you know how it is, I'm sure.

Given the constraints you outline, one other avenue of attack could be to consider the time-sensitivity of product updates and the relative priority thereof. If it's acceptable for updates to products to lag somewhat, you can at least perform them at a lower rate over longer time, thus reducing hardware load at any given time. If the periodic updates are make to the same per-product values, you could even potentially get smart and replace queued updates not yet performed, if they're superseded by a subsequent change before they're actually committed thus further reducing load.