Hardware faults that Meta terms 'silent data corruptions' (SDCs) can corrupt AI model parameters, leading to incorrect or degraded model outputs and impacting AI service quality.
Meta researchers propose measuring this risk with the 'parameter vulnerability factor' (PVF), a standardized metric for an AI model's vulnerability to parameter corruptions that can be adapted to different hardware fault models and tasks.
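In spirit, a PVF-style estimate can be obtained by fault injection: corrupt parameters under a chosen fault model and measure how often the model's output changes. Below is a minimal sketch of that idea; the tiny linear "model", the single-bit-flip fault model, and the function names are illustrative assumptions, not Meta's actual methodology.

```python
import random
import struct

def flip_bit(value, bit):
    """Flip one bit in the IEEE-754 float64 representation of value."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def estimate_pvf(weights, x, trials=2000, seed=0):
    """Fraction of single-bit parameter corruptions that change the
    model's output class -- a PVF-style vulnerability estimate."""
    rng = random.Random(seed)

    def predict(w):
        # Output class = index of the largest dot product (linear model).
        return max(range(len(w)),
                   key=lambda k: sum(wi * xi for wi, xi in zip(w[k], x)))

    baseline = predict(weights)  # fault-free output
    mismatches = 0
    for _ in range(trials):
        corrupted = [row[:] for row in weights]
        i = rng.randrange(len(weights))
        j = rng.randrange(len(weights[0]))
        corrupted[i][j] = flip_bit(corrupted[i][j], rng.randrange(64))
        if predict(corrupted) != baseline:
            mismatches += 1
    return mismatches / trials

if __name__ == "__main__":
    rng = random.Random(1)
    weights = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
    x = [rng.gauss(0, 1) for _ in range(8)]
    print(f"estimated PVF: {estimate_pvf(weights, x):.3f}")
```

Changing the fault model (e.g., flipping only exponent bits, or corrupting whole words) or the mismatch criterion (e.g., a quality-metric drop instead of a changed class) adapts the same loop to other hardware fault models and tasks, which is the flexibility PVF is meant to standardize.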
As Meta documents, the growing complexity and heterogeneity of AI hardware systems make them increasingly susceptible to hardware faults and complicate fault detection.
Meta simulated silent-data-corruption incidents on DLRM, its deep learning recommendation model, underscoring the need to quantify and address AI model vulnerability to hardware faults to keep AI services reliable.