GPU Health

We have comprehensive systems in place to ensure that your workloads start and continuously run on healthy GPUs:
  • We continuously monitor temperature, frequencies, and error events (slowdowns, caps, etc.)
  • An automated system reacts to suspicious metrics: it cordons and drains nodes, performs restarts and GPU resets, and runs a suite of stress tests
  • Failures are escalated to data centers for GPU replacement, through both automated and manual processes
Additionally, customers have visibility into and control over the health of their applications:
  • You can define custom health checks, either invasive or non-invasive; see the health check endpoint docs
  • Non-invasive custom health checks can run in parallel with requests and are probed every 15 seconds. This is an easy way for your app to check things like memory usage and gracefully terminate when it gets too high.
  • Invasive custom health checks can be configured to run between requests in order to perform more thorough GPU tests
  • During a request, your code can return a 503 error, which causes the worker to be gracefully stopped and replaced; see the readiness & liveness docs
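
The application-side options above can be sketched in a framework-agnostic way. This is a minimal illustration, not the platform's API: the names `HealthResult`, `non_invasive_check`, `handle_request`, and the `MEMORY_LIMIT_BYTES` threshold are all hypothetical, and it assumes that a non-200 status from a health check is what marks the worker as unhealthy. How these functions are wired to an actual endpoint is covered by the health check endpoint docs, not by this sketch.

```python
from dataclasses import dataclass

# Hypothetical threshold chosen for illustration; not a platform default.
MEMORY_LIMIT_BYTES = 8 * 1024**3


@dataclass
class HealthResult:
    """Status code and short explanation returned by a check or a request."""
    status_code: int
    detail: str


def non_invasive_check(memory_used_bytes: int,
                       limit_bytes: int = MEMORY_LIMIT_BYTES) -> HealthResult:
    """Cheap check that is safe to run while requests are in flight.

    Assumption: reporting 503 here asks for a graceful shutdown.
    """
    if memory_used_bytes > limit_bytes:
        return HealthResult(503, "memory usage above limit")
    return HealthResult(200, "ok")


def handle_request(payload: dict, gpu_ok: bool) -> HealthResult:
    """Request handler that returns 503 when it detects an unhealthy GPU.

    `gpu_ok` stands in for whatever GPU test the application runs itself.
    """
    if not gpu_ok:
        return HealthResult(503, "gpu unhealthy")
    return HealthResult(200, f"processed {len(payload)} fields")
```

Keeping the checks as plain functions makes them easy to unit test and to call from whichever web framework serves the application. The non-invasive check does only a cheap comparison, which matters because it may be probed every 15 seconds alongside live requests.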