submitted 1 year ago* (last edited 1 year ago) by kernelPanic@lemmy.ml to c/machinelearning@lemmy.ml

When I train my PyTorch Lightning model on two GPUs in JupyterLab with strategy="ddp_notebook", only two CPU cores are used, each pinned at 100%. How can I overcome this CPU bottleneck?

Edit: I profiled with PyTorchProfiler and the bottleneck turned out to be the old SSDs used on the server.
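For anyone hitting the same wall, here is a minimal sketch of how such a diagnosis can be made with `torch.profiler` (the tiny model, dataset, and sizes below are stand-ins, not the OP's actual code): if `DataLoader`-related frames dominate total CPU time, the input pipeline (workers, disk I/O) is the bottleneck rather than the model.

```python
import torch
from torch.profiler import profile, ProfilerActivity
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset and model; replace with the real pipeline under test.
dataset = TensorDataset(torch.randn(256, 32))
loader = DataLoader(dataset, batch_size=32)
model = torch.nn.Linear(32, 1)

# Record CPU activity for one pass over the data.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for (batch,) in loader:
        model(batch)

# Sort by total CPU time; DataLoader/IO frames at the top point to a
# data-loading bottleneck, model ops at the top point to compute.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```

In Lightning the same report can be obtained without a manual loop by passing a profiler to the `Trainer` (e.g. `Trainer(profiler="pytorch")`).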

top 2 comments
[-] Spott@lemmy.world 4 points 1 year ago

Without knowing more, I would expect it is a dataloader issue: your CPUs are bottlenecked trying to get enough data to your GPUs.

You can add more workers to your dataloader in order to parallelize it, though this can sometimes lead to weird multiprocessing bugs, so if things start acting strange, that might be the reason.
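Concretely, this is just a matter of raising `num_workers` on the `DataLoader` (the dataset and sizes below are illustrative, assuming a typical GPU training setup):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 512 samples of 16 features each.
dataset = TensorDataset(torch.randn(512, 16), torch.randint(0, 2, (512,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=2,            # worker processes fetch batches in parallel
    pin_memory=True,          # page-locked buffers speed up host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs (needs num_workers > 0)
)

n_batches = sum(1 for _ in loader)
print(n_batches)  # 512 / 64 = 8
```

A rule of thumb is to start around the number of physical cores per GPU and tune from there; too many workers can oversubscribe the CPU or exhaust shared memory.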

[-] troye888@lemmy.one 3 points 1 year ago

Yup, this. If you'd like more help, we need the code, or at least a minimal reproducible example.

this post was submitted on 16 Aug 2023
