Use this playbook to separate origin-side 500 failures from temporary 503 dependency or capacity outages, then apply safe retry and escalation paths.
Last reviewed: February 23, 2026 | Editorial standard: source-backed operational guidance
A 500 indicates the origin code or runtime failed while handling the request. A 503 indicates the service temporarily cannot serve requests due to overload controls, maintenance, or upstream unavailability. Route-level error shape, circuit-breaker state, and dependency health metrics separate the two paths quickly.
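The split above can be sketched as a small triage helper. This is a minimal illustration, not a library API; the function and label names are assumptions chosen for this example.

```python
def classify_failure(status: int, headers: dict) -> str:
    """Separate origin-side faults (500) from transient unavailability (503).

    Labels are illustrative: 'origin-fault' feeds the escalation path,
    'transient' feeds the retry path.
    """
    if status == 500:
        # Origin code or runtime failed while handling the request.
        return "origin-fault"
    if status == 503:
        # Service temporarily cannot serve; a Retry-After header
        # strengthens the "transient" signal.
        return "transient" if "Retry-After" in headers else "transient-unconfirmed"
    return "other"
```

In practice this classification would be joined with breaker state and dependency health metrics before committing to a retry or an escalation.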
Retry-After provides a recovery window that clients should honor before the next attempt. Backoff with jitter still matters because many clients otherwise retry at the same moment once the window expires. Correct Retry-After handling reduces retry storms and dampens secondary 503 waves.
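A delay calculation that honors the server-provided window and still desynchronizes clients can look like the following sketch. The function name and parameter defaults are assumptions; the jitter strategy is the common "full jitter" variant.

```python
import random
from typing import Optional

def next_delay(retry_after_s: Optional[float], attempt: int,
               base: float = 0.5, cap: float = 30.0) -> float:
    """Compute the wait before the next retry attempt.

    Always waits at least the Retry-After window when one was given,
    then adds jitter so clients do not all fire at window expiry.
    """
    backoff = min(cap, base * (2 ** attempt))  # capped exponential backoff
    jitter = random.uniform(0, backoff)        # full-jitter: uniform over [0, backoff]
    if retry_after_s is not None:
        return retry_after_s + jitter          # honor the server window first
    return jitter
```

Adding jitter on top of the window, rather than retrying exactly at expiry, is what prevents the synchronized secondary 503 wave described above.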
Responders should retry transient 503 paths when dependency health and breaker metrics show a recovery trend. Responders should escalate immediately when 500 spikes include unhandled exceptions, crash loops, or hard dependency failures. Responders should also escalate 503 incidents when Retry-After windows repeat without recovery and regional health remains degraded.
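The three responder rules can be written as a single decision function. This is a hypothetical triage sketch; the boolean inputs stand in for the signals the playbook names (exception fingerprints, breaker metrics, repeated Retry-After windows) and would come from real monitoring in practice.

```python
def decide(status: int, unhandled_or_crash_loop: bool,
           recovery_trend: bool, windows_repeat_degraded: bool) -> str:
    """Map the playbook's three responder rules to an action label."""
    if status == 500 and unhandled_or_crash_loop:
        # 500 spike with unhandled exceptions, crash loops,
        # or hard dependency failures: escalate immediately.
        return "escalate-now"
    if status == 503 and recovery_trend:
        # Dependency health and breaker metrics trending to recovery.
        return "retry"
    if status == 503 and windows_repeat_degraded:
        # Retry-After windows repeat without recovery; region degraded.
        return "escalate"
    return "monitor"
```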
GCP diagnosis should start with Cloud Monitoring error-rate metrics by service, route, and region to confirm scope and timing. Cloud Error Reporting should then group exception fingerprints so teams can separate recurring crash signatures from transient backend saturation. In gRPC status-code terms, UNAVAILABLE maps to transient backend unavailability and transport-level disruption, while INTERNAL maps to origin-side execution faults and application defects. This split keeps retry decisions on transient paths and escalation decisions on true origin failures.
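The UNAVAILABLE/INTERNAL split maps directly onto the retry-versus-escalate routing. The sketch below uses the canonical gRPC status-code names as plain strings to stay self-contained; a real integration would compare against `grpc.StatusCode` values from the grpc package, and the set contents here reflect only the two codes this playbook discusses.

```python
# Transient transport/backend disruption: safe to retry with backoff.
RETRYABLE = {"UNAVAILABLE"}
# Origin-side execution faults and application defects: escalate.
ORIGIN_FAULT = {"INTERNAL"}

def route_decision(code: str) -> str:
    """Route a gRPC status-code name to the retry or escalation path."""
    if code in RETRYABLE:
        return "retry-with-backoff"
    if code in ORIGIN_FAULT:
        return "escalate"
    return "inspect"
```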