this post was submitted on 02 May 2025
Did this because of the convincing replies in this thread. We're migrating to modern hardware and switching from SQL Server to PostgreSQL (because it's already used by the other system we work with, and there is know-how available in this domain).
But how can we then ensure that we aren't adding/processing products which are already in the "final" table, when I have no knowledge of ALL the products in that table?
This is helpful and matches what I experienced. At the peak of the overload the CPU load was pretty much zero - all the time was spent on disk reads/writes, which was caused by our poor query design/architecture.
Could you elaborate on what you mean by read replicas? Storing data in memory?
Yes, I'll swallow the pill. Thanks to the replies here I have many starting points.
RTFM is nice - but starting at page 0 is overwhelming.
Without knowing your schema, I can't answer this precisely. However, the database doesn't need to scan all rows in a table to check whether a value exists if you build an index on the relevant columns. If your products have some unique ID (or tuple of columns), you can usually build an index on those values, which means the DB maintains what is basically a lookup table for the indexed columns.
Without going into too much detail, you can think of an index as a way for a DB to make a "contains" (or "retrieve") operation drop from O(n) (check all rows) to something much faster, such as O(log n). The tradeoff is that the index takes extra space.
This comes with the added benefit that uniqueness constraints can be easily enforced on indexed columns if needed. And yes, your PK is indexed by default.
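To make that concrete, here's a minimal sketch of a unique index doing both jobs - fast existence checks and duplicate rejection. It uses SQLite for a self-contained demo, but the concept is the same in PostgreSQL; the table and column names (`products`, `sku`) are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT, name TEXT)")

# A unique index lets the DB check membership without scanning every row,
# and enforces that the same product can't be inserted twice.
conn.execute("CREATE UNIQUE INDEX idx_products_sku ON products (sku)")

conn.execute("INSERT INTO products VALUES ('A-1', 'widget')")

# Skip rows whose key already exists; in PostgreSQL the equivalent is
# INSERT ... ON CONFLICT (sku) DO NOTHING.
conn.execute("INSERT OR IGNORE INTO products VALUES ('A-1', 'widget again')")

count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(count)  # the duplicate was ignored, so only one row exists
```

So for your question above: index the product ID in the "final" table, and let the database reject (or silently skip) anything already present, rather than trying to track the full set of products yourself.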
Read more about indexes in Postgres's docs. It actually has pretty readable documentation in my experience. Or read a book on indexes, watch a video, etc. The concept is universal.
This highly depends on your needs. I'll link PG's docs on replication though.
If you're migrating right now, I wouldn't think about this too much. Replicas are basically duplicates of your database hosted on different servers (ideally in different data centers, or even different regions if possible). Replicas work together to stay in sync, and depending on the kind of replica and the kind of query, any replica may be able to handle an incoming query (rather than everything going through a single central database).
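For a rough idea of what this looks like in PostgreSQL specifically, streaming replication starts with a couple of settings on the primary (a sketch based on PG's docs; your actual values will differ):

```ini
# postgresql.conf on the primary
wal_level = replica        # ship enough WAL for a standby to replay
max_wal_senders = 5        # connection slots reserved for replicas/backups
```

After allowing the standby's replication user in the primary's pg_hba.conf, the standby is typically initialized with `pg_basebackup -R`, which writes `standby.signal` and `primary_conninfo` so it follows the primary. But as said, this can wait until after the migration.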
If all you need are backups, though, then replicas could be overkill. Either way, you usually don't want prod data stored on a single machine. I would talk to your management about backup requirements and potentially availability/uptime requirements.