When detecting duplicates up front gets expensive, the trick is to process the records anyway, but in a way that de-duplicates the result of processing them.
Usually, that means writing the output of the next processing step into a (new) table whose primary key contains every field that could make a record a duplicate.
Then, as the records are processed, just let each duplicate overwrite that same row.
The resulting table is a list of keys containing no duplicates.
(Tip: This can be a good process to run overnight.)
(Tip: Be sure the job also marks each source record as processed/de-duped, so the overnight job only ever has to look at new, not-yet-processed records.)
Then, we drive all future processing steps from that new de-duplicated table, joining back to the original data for the remaining record details, using only whichever of the duplicate records was processed last. (Since they're duplicates anyway, we don't care which one wins, as long as only one does.)
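For the curious, here's a minimal sketch of the whole thing using Python's built-in sqlite3. Every name in it (the orders table, its columns, the choice that duplicates are defined by customer_id + sku + order_date) is made up for illustration; the real point is the primary key on the de-dup table and the overwrite-on-conflict write.

```python
# Minimal sketch of de-duplication by overwrite. All table/column names are
# hypothetical; duplicates here are defined by (customer_id, sku, order_date).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER,
        sku         TEXT,
        order_date  TEXT,
        amount      REAL,
        deduped     INTEGER DEFAULT 0       -- "already processed" flag (see tip above)
    );

    -- The de-duplicated table: its primary key is every field that can make
    -- two records duplicates of each other.
    CREATE TABLE orders_dedup (
        customer_id INTEGER,
        sku         TEXT,
        order_date  TEXT,
        source_id   INTEGER,                -- whichever duplicate was written last
        PRIMARY KEY (customer_id, sku, order_date)
    );
""")

# Sample data; the first two rows are duplicates of each other.
conn.executemany(
    "INSERT INTO orders (customer_id, sku, order_date, amount) VALUES (?, ?, ?, ?)",
    [
        (1, "A-100", "2025-05-01", 9.99),
        (1, "A-100", "2025-05-01", 9.99),  # duplicate
        (2, "B-200", "2025-05-01", 4.50),
    ],
)

# Pass 1 (the overnight job): write every not-yet-processed record into the
# de-dup table. On a primary-key collision the duplicate simply overwrites
# the existing row, so the last duplicate wins.
with conn:
    conn.execute("""
        INSERT OR REPLACE INTO orders_dedup (customer_id, sku, order_date, source_id)
        SELECT customer_id, sku, order_date, id
        FROM orders
        WHERE deduped = 0
    """)
    conn.execute("UPDATE orders SET deduped = 1 WHERE deduped = 0")

# Pass 2: drive later steps from the de-duplicated table, joining back to the
# one surviving duplicate for the remaining details.
rows = conn.execute("""
    SELECT d.customer_id, d.sku, d.order_date, o.amount
    FROM orders_dedup AS d
    JOIN orders AS o ON o.id = d.source_id
""").fetchall()

print(len(rows))  # 2 -- the duplicate pair collapsed into one row
```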
This tends to mean one full pass through the raw data to build the de-duplicated list, and then a second pass through the de-duplicated list for all remaining steps, so roughly 2n processing time. (But the first n can be a long-running background job, and the second n can be optimized with indexes supporting the needs of each future processing step.)
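For instance, continuing the hypothetical tables from the sketch above, those second-pass indexes are just whatever each later step filters or joins on:

```python
# Hypothetical indexes for the second pass, continuing the sketch above.
# Which columns to index depends entirely on what each later step needs.
conn.execute("CREATE INDEX IF NOT EXISTS idx_dedup_customer ON orders_dedup (customer_id)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_dedup_date ON orders_dedup (order_date)")
```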