Debugging the Dreaded NaN
Briefly

NaN (Not a Number) values in deep learning models can cause significant disruptions during training, leading to costly inefficiencies. They are hard to debug because they appear sporadically, depending on model state, input data, and other stochastic factors. To address this challenge in PyTorch workloads, a systematic debugging method is proposed: save copies of input batches, check gradients for NaNs, and create checkpoints so that failures can be reproduced. The approach improves debugging precision while leveraging PyTorch Lightning's conveniences to streamline machine learning development.
NaNs in deep learning workloads are among the most frustrating issues to encounter: their sporadic nature and dependence on model state, input data, and other factors complicate debugging.
Given the considerable cost of AI model training, dedicated tools for capturing and analyzing NaN occurrences are strongly recommended to prevent wasted compute.
In this post, we demonstrate a debugging mechanism for NaNs in PyTorch that saves copies of the inputs and model state so that errors can be reproduced, as sketched below.
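A minimal sketch of such a mechanism in plain PyTorch follows. It is illustrative rather than the post's exact implementation: `train_step`, `loss_fn`, and the output file name are placeholder names, and the idea is simply to scan gradients for NaNs after `backward()` and, on a hit, dump everything needed to replay the step offline.

```python
import torch

def train_step(model, optimizer, loss_fn, batch, step):
    inputs, targets = batch
    # Keep detached copies of the raw inputs so the failing step can be replayed offline.
    saved_batch = (inputs.detach().clone(), targets.detach().clone())

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Scan all gradients for NaNs before the optimizer consumes them.
    has_nan = any(
        p.grad is not None and torch.isnan(p.grad).any()
        for p in model.parameters()
    )
    if has_nan:
        # Capture everything needed to reproduce: inputs, weights, optimizer state.
        torch.save(
            {
                "step": step,
                "batch": saved_batch,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict(),
            },
            f"nan_debug_step_{step}.pt",
        )
        raise RuntimeError(f"NaN gradient detected at step {step}")

    optimizer.step()
    return loss.item()
```

For pinpointing the exact operation that first produced a NaN, PyTorch's built-in `torch.autograd.set_detect_anomaly(True)` is an alternative, though it slows training considerably.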
PyTorch Lightning offers an effective way to implement NaN debugging, providing conveniences that simplify the development of machine learning models.
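In Lightning, the same logic fits naturally into a `Callback`. The sketch below assumes Lightning 2.x; `on_train_batch_start` and `on_after_backward` are standard hooks, while the callback name and saved file paths are illustrative:

```python
import torch
import lightning as L

class NaNDebugCallback(L.Callback):
    """Save the current batch and a checkpoint whenever a NaN gradient appears."""

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        # Hold on to the batch so it can be saved if this step fails.
        self._batch, self._batch_idx = batch, batch_idx

    def on_after_backward(self, trainer, pl_module):
        # Runs after loss.backward() and before the optimizer step.
        for name, p in pl_module.named_parameters():
            if p.grad is not None and torch.isnan(p.grad).any():
                step = trainer.global_step
                # Save the offending inputs and full model state for offline replay.
                torch.save(
                    {"batch": self._batch, "batch_idx": self._batch_idx},
                    f"nan_batch_step_{step}.pt",
                )
                trainer.save_checkpoint(f"nan_model_step_{step}.ckpt")
                raise RuntimeError(f"NaN gradient in {name} at step {step}")
```

Register it with `L.Trainer(callbacks=[NaNDebugCallback()])`; Lightning also ships a `detect_anomaly=True` Trainer flag that wraps PyTorch's anomaly detection.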
Read at towardsdatascience.com