When detecting duplicates up front gets expensive, the trick is to process the records anyway, but in a way that de-duplicates the result of processing them.
Usually, that means writing the output of the next processing step into a (new) table whose primary key contains every field that could make a record a duplicate.
Then, as the records are processed, just let each duplicate overwrite that same row.
The resulting table is a list of keys containing no duplicates.
(Tip: This can be a good process to run overnight.)
(Tip: Be sure the job also marks each source record as processed/de-duped, so the overnight job only ever has to look at new, not-yet-processed records.)
Then, we drive all future processing steps from that new de-duplicated table, joining back to the original data for the remaining record details, using only whichever of the duplicate records was processed last. (Since they're duplicates anyway, we don't care which one wins, as long as only one does.)
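For the curious, here's a minimal sketch of the whole thing using Python's built-in sqlite3. Every name in it (the orders table, its columns, the choice that duplicates are defined by customer_id + sku + order_date) is made up for illustration; the real point is the primary key on the de-dup table and the overwrite-on-conflict write.

```python
# Minimal sketch of de-duplication by overwrite. All table/column names are
# hypothetical; duplicates here are defined by (customer_id, sku, order_date).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER,
        sku         TEXT,
        order_date  TEXT,
        amount      REAL,
        deduped     INTEGER DEFAULT 0       -- "already processed" flag (see tip above)
    );

    -- The de-duplicated table: its primary key is every field that can make
    -- two records duplicates of each other.
    CREATE TABLE orders_dedup (
        customer_id INTEGER,
        sku         TEXT,
        order_date  TEXT,
        source_id   INTEGER,                -- whichever duplicate was written last
        PRIMARY KEY (customer_id, sku, order_date)
    );
""")

# Sample data; the first two rows are duplicates of each other.
conn.executemany(
    "INSERT INTO orders (customer_id, sku, order_date, amount) VALUES (?, ?, ?, ?)",
    [
        (1, "A-100", "2025-05-01", 9.99),
        (1, "A-100", "2025-05-01", 9.99),  # duplicate
        (2, "B-200", "2025-05-01", 4.50),
    ],
)

# Pass 1 (the overnight job): write every not-yet-processed record into the
# de-dup table. On a primary-key collision the duplicate simply overwrites
# the existing row, so the last duplicate wins.
with conn:
    conn.execute("""
        INSERT OR REPLACE INTO orders_dedup (customer_id, sku, order_date, source_id)
        SELECT customer_id, sku, order_date, id
        FROM orders
        WHERE deduped = 0
    """)
    conn.execute("UPDATE orders SET deduped = 1 WHERE deduped = 0")

# Pass 2: drive later steps from the de-duplicated table, joining back to the
# one surviving duplicate for the remaining details.
rows = conn.execute("""
    SELECT d.customer_id, d.sku, d.order_date, o.amount
    FROM orders_dedup AS d
    JOIN orders AS o ON o.id = d.source_id
""").fetchall()

print(len(rows))  # 2 -- the duplicate pair collapsed into one row
```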
This tends to mean one full pass through the raw data to build the de-duplicated list, and then a second pass through the de-duplicated list for all remaining steps, so roughly 2n processing time. (But the first n can be a long-running background job, and the second n can be optimized with indexes supporting the needs of each future processing step.)
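For instance, continuing the hypothetical tables from the sketch above, those second-pass indexes are just whatever each later step filters or joins on:

```python
# Hypothetical indexes for the second pass, continuing the sketch above.
# Which columns to index depends entirely on what each later step needs.
conn.execute("CREATE INDEX IF NOT EXISTS idx_dedup_customer ON orders_dedup (customer_id)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_dedup_date ON orders_dedup (order_date)")
```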