Use this playbook to separate origin-side 500 failures from temporary 503 dependency or capacity outages, then apply safe retry and escalation paths.
Last reviewed: February 23, 2026 | Editorial standard: source-backed operational guidance
A 500 indicates the origin code or runtime failed while handling the request. A 503 indicates the service temporarily cannot serve requests due to overload controls, maintenance, or upstream unavailability. Route-level error shape, circuit-breaker state, and dependency health metrics separate the two paths quickly.
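The split above can be sketched as a small triage helper. This is a minimal illustration, not a library API; the function and label names are assumptions chosen for this example.

```python
def classify_failure(status: int, headers: dict) -> str:
    """Separate origin-side faults (500) from transient unavailability (503).

    Labels are illustrative: 'origin-fault' feeds the escalation path,
    'transient' feeds the retry path.
    """
    if status == 500:
        # Origin code or runtime failed while handling the request.
        return "origin-fault"
    if status == 503:
        # Service temporarily cannot serve; a Retry-After header
        # strengthens the "transient" signal.
        return "transient" if "Retry-After" in headers else "transient-unconfirmed"
    return "other"
```

In practice this classification would be joined with breaker state and dependency health metrics before committing to a retry or an escalation.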
Retry-After provides a recovery window that clients should honor before the next attempt. Backoff with jitter still matters because many clients otherwise retry at the same moment once the window expires. Correct Retry-After handling reduces retry storms and dampens secondary 503 waves.
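A delay calculation that honors the server-provided window and still desynchronizes clients can look like the following sketch. The function name and parameter defaults are assumptions; the jitter strategy is the common "full jitter" variant.

```python
import random
from typing import Optional

def next_delay(retry_after_s: Optional[float], attempt: int,
               base: float = 0.5, cap: float = 30.0) -> float:
    """Compute the wait before the next retry attempt.

    Always waits at least the Retry-After window when one was given,
    then adds jitter so clients do not all fire at window expiry.
    """
    backoff = min(cap, base * (2 ** attempt))  # capped exponential backoff
    jitter = random.uniform(0, backoff)        # full-jitter: uniform over [0, backoff]
    if retry_after_s is not None:
        return retry_after_s + jitter          # honor the server window first
    return jitter
```

Adding jitter on top of the window, rather than retrying exactly at expiry, is what prevents the synchronized secondary 503 wave described above.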
Responders should retry transient 503 paths when dependency health and breaker metrics show a recovery trend. Responders should escalate immediately when 500 spikes include unhandled exceptions, crash loops, or hard dependency failures. Responders should also escalate 503 incidents when Retry-After windows repeat without recovery and regional health remains degraded.
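The three responder rules can be written as a single decision function. This is a hypothetical triage sketch; the boolean inputs stand in for the signals the playbook names (exception fingerprints, breaker metrics, repeated Retry-After windows) and would come from real monitoring in practice.

```python
def decide(status: int, unhandled_or_crash_loop: bool,
           recovery_trend: bool, windows_repeat_degraded: bool) -> str:
    """Map the playbook's three responder rules to an action label."""
    if status == 500 and unhandled_or_crash_loop:
        # 500 spike with unhandled exceptions, crash loops,
        # or hard dependency failures: escalate immediately.
        return "escalate-now"
    if status == 503 and recovery_trend:
        # Dependency health and breaker metrics trending to recovery.
        return "retry"
    if status == 503 and windows_repeat_degraded:
        # Retry-After windows repeat without recovery; region degraded.
        return "escalate"
    return "monitor"
```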
GCP diagnosis should start with Cloud Monitoring error-rate metrics by service, route, and region to confirm scope and timing. Cloud Error Reporting should then group exception fingerprints so teams can separate recurring crash signatures from transient backend saturation. In gRPC status-code terms, UNAVAILABLE maps to transient backend unavailability and transport-level disruption, while INTERNAL maps to origin-side execution faults and application defects. This split keeps retry decisions on transient paths and escalation decisions on true origin failures.
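The UNAVAILABLE/INTERNAL split maps directly onto the retry-versus-escalate routing. The sketch below uses the canonical gRPC status-code names as plain strings to stay self-contained; a real integration would compare against `grpc.StatusCode` values from the grpc package, and the set contents here reflect only the two codes this playbook discusses.

```python
# Transient transport/backend disruption: safe to retry with backoff.
RETRYABLE = {"UNAVAILABLE"}
# Origin-side execution faults and application defects: escalate.
ORIGIN_FAULT = {"INTERNAL"}

def route_decision(code: str) -> str:
    """Route a gRPC status-code name to the retry or escalation path."""
    if code in RETRYABLE:
        return "retry-with-backoff"
    if code in ORIGIN_FAULT:
        return "escalate"
    return "inspect"
```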