As we engage with customers on their own models and training, we do have averages. Because we'll know how long it should take for the size of a model and the number of compute elements that you have.
Because you could say, 'OK, I've been checkpointing once a day. And somewhere between day four and day five, I had a problem.' You may not know you had a problem until day six because the job didn't die, but you're looking at the results and something's weird.
Collection
[
|
...
]