We have comprehensive systems in place to ensure that your workloads start and continuously run on healthy GPUs:
- We continuously monitor temperatures, clock frequencies, and error events (slowdowns, clock caps, etc.); see the sketch after this list.
- An automated system reacts to suspicious metrics: it cordons and drains affected nodes, performs restarts and GPU resets, and runs a suite of stress tests.
- Failures are escalated to data centers for GPU replacement through both automated and manual processes.
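
For a concrete sense of what this telemetry involves, here is a minimal sketch that reads the same class of signals (temperature, SM clock, throttle reasons) through NVML's Python bindings. It is illustrative only, not our actual monitoring agent:

```python
import pynvml  # NVML Python bindings: pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Core thermal and clock telemetry for GPU 0.
temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
sm_clock_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)

# Bitmask explaining why clocks are currently capped, if at all.
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
thermal_slowdown = bool(
    reasons
    & (
        pynvml.nvmlClocksThrottleReasonSwThermalSlowdown
        | pynvml.nvmlClocksThrottleReasonHwThermalSlowdown
    )
)

print(f"temp={temp_c}C sm_clock={sm_clock_mhz}MHz throttled={thermal_slowdown}")
pynvml.nvmlShutdown()
```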
Additionally, customers have visibility into and control over the health of their applications:
- Non-invasive custom health checks run in parallel with requests and are probed every 15 seconds. This is an easy way for your app to check things like memory usage and gracefully terminate when usage gets too high (first sketch after this list).
- Invasive custom health checks can be configured to run between requests in order to perform more thorough GPU tests (second sketch below).
- During a request, your code can return a 503 error, which causes the worker to be gracefully stopped and replaced (third sketch below); see the readiness & liveness docs.
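
As an illustration of a non-invasive check, here is a minimal sketch assuming (hypothetically) that the platform probes an HTTP endpoint and treats a non-200 response as unhealthy; the port and the 90% memory threshold are illustrative:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

import pynvml

pynvml.nvmlInit()
HANDLE = pynvml.nvmlDeviceGetHandleByIndex(0)


class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        # Report unhealthy once GPU memory usage crosses an illustrative 90%.
        mem = pynvml.nvmlDeviceGetMemoryInfo(HANDLE)
        status = 200 if mem.used / mem.total < 0.9 else 503
        self.send_response(status)
        self.end_headers()


HTTPServer(("", 8081), Health).serve_forever()
```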
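A sketch of what an invasive check might do between requests, when it has exclusive use of the GPU. The matrix-multiply test, sizes, and tolerances are illustrative stand-ins for a real diagnostic suite:

```python
import torch


def gpu_sanity_check() -> bool:
    # Run a small matmul on the GPU and verify it against the CPU result.
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    gpu_result = (a @ b).cpu()
    cpu_result = a.cpu() @ b.cpu()
    # Loose tolerances: we are looking for gross corruption, not rounding.
    return torch.allclose(gpu_result, cpu_result, rtol=1e-3, atol=1e-1)
```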
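Finally, a sketch of the 503 pattern during a request, using FastAPI as a hypothetical serving framework; the endpoint name and the failure flag are illustrative assumptions:

```python
from fastapi import FastAPI, Response

app = FastAPI()
UNHEALTHY = False  # e.g. flipped when an internal sanity check fails


@app.post("/infer")
def infer(response: Response):
    if UNHEALTHY:
        # A 503 tells the platform to gracefully stop and replace this worker.
        response.status_code = 503
        return {"error": "worker unhealthy"}
    return {"result": "ok"}
```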