GPipe and PipeDream: Scaling AI training in every direction

Data science is hard work, not a magical incantation. Whether an AI model performs as advertised depends on how well it’s been trained, and there’s no “one size fits all” approach for training AI models.

The necessary evil of distributed AI training

Scaling is one of the trickiest considerations when training AI models. Training can be especially challenging when a model grows too resource hungry to be processed in its entirety on any single computing platform. A model may have grown so large that it exceeds the memory limit of a single processor, or it may depend on an accelerator that requires specially developed algorithms or infrastructure. Training data sets may also grow so huge that training takes an inordinately long time and becomes prohibitively expensive.

Scaling can be a piece of cake if we don’t require the model to be particularly good at its assigned task. But as we ramp up the required level of inference accuracy, the training process can stretch on longer and chew up ever more resources. Addressing this issue isn’t simply a matter of throwing more powerful hardware at the problem. As with many application workloads, one can’t rely on faster processors alone to sustain linear scaling as AI model complexity grows.

Distributed training may be necessary. If the components of a model can be partitioned and distributed to optimized nodes for processing in parallel, the time needed to train a model can be reduced significantly. However, parallelization can itself be a fraught exercise, considering how fragile a construct a statistical model can be.
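To make the idea of partitioning concrete, here is a minimal sketch in plain Python. It is an illustration only, not how GPipe or PipeDream is actually implemented. It assumes a model that can be treated as an ordered list of layers, that every stage takes one equal time step per micro-batch, and that only the forward pass is scheduled. The function names partition_layers and pipeline_schedule are hypothetical.

```python
def partition_layers(layers, num_stages):
    """Split an ordered list of layers into contiguous, near-equal stages.

    Assumption: layers are equally costly, so balancing by count is enough.
    """
    base, extra = divmod(len(layers), num_stages)
    stages, start = [], 0
    for i in range(num_stages):
        # The first `extra` stages each take one additional layer.
        size = base + (1 if i < extra else 0)
        stages.append(layers[start:start + size])
        start += size
    return stages


def pipeline_schedule(num_stages, num_microbatches):
    """List, for each time step, the (stage, microbatch) pairs that run together.

    Assumption: forward pass only, one time step per stage per micro-batch.
    """
    steps = []
    for t in range(num_stages + num_microbatches - 1):
        # Stage s works on micro-batch t - s, if that micro-batch exists.
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_microbatches]
        steps.append(active)
    return steps


stages = partition_layers(list(range(10)), num_stages=4)
schedule = pipeline_schedule(num_stages=4, num_microbatches=8)
print(stages)
print(len(schedule), "pipelined steps versus", 4 * 8, "sequential steps")
```

Under these simplifying assumptions, four stages processing eight micro-batches finish in 11 time steps rather than 32, because several stages are busy at once after the pipeline fills. Real systems also have to schedule the backward pass, balance stages by actual compute and memory cost, and keep weight updates consistent, which is where the fragility mentioned above comes in.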
