UNAVAILABLE
GCP UNAVAILABLE means the service is currently unavailable or the serving path could not accept the call. It is usually transient, but only explicitly safe calls should be retried with backoff.
Last reviewed: April 16, 2026 | Source-backed guidance under our editorial policy
What Does Unavailable Mean?
The caller could not reach a healthy serving path or the service could not accept the call right now. UNAVAILABLE is often transient, but the fastest way to prolong the incident is to let every client retry immediately without budgets or idempotency controls.
Common Causes
- A zonal or regional serving path is degraded, overloaded, or temporarily offline.
- Upstream dependency or control-plane health problems make the backend unable to accept or complete the call.
- Connection pool, subchannel, DNS, or TLS establishment failures break the serving path before application logic can respond.
- Clients retry too aggressively and amplify partial degradation into a wider outage.
- Autoscaling lag, cold capacity, or connection warm-up leaves no healthy path under burst load.
How to Fix Unavailable
1. Retry only explicitly safe or idempotent operations with bounded exponential backoff and full jitter.
2. Capture peer, region, attempt count, and connection state to determine whether the outage is localized or global.
3. Reduce concurrency, open circuit breakers, or fail over to a healthy region before adding more retry pressure.
4. Align retries, connection budgets, and deadlines so recovery traffic does not outrun the remaining service budget.
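The first and fourth steps above can be sketched as a small retry wrapper. This is a minimal illustration, not a client-library API: `TransientUnavailable`, `full_jitter_delay`, and `call_with_retries` are hypothetical names, and the defaults (0.5 s base, 5 s cap, 4 attempts) simply mirror the retry policy shown under Implementation Examples. The 10-second deadline is an assumption.

```python
import random
import time


class TransientUnavailable(Exception):
    """Hypothetical stand-in for a call that failed with UNAVAILABLE."""


def full_jitter_delay(attempt, base=0.5, cap=5.0, rng=random.random):
    """Full jitter: scale rng() by min(cap, base * 2**attempt) seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(op, *, idempotent, max_attempts=4, deadline_s=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Retry op() on TransientUnavailable only if the caller marks it idempotent."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = clock()
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientUnavailable:
            last_attempt = attempt == max_attempts - 1
            # Non-idempotent calls and exhausted attempts surface the error.
            if not idempotent or last_attempt:
                raise
            delay = full_jitter_delay(attempt)
            # Give up early if the backoff would outlive the caller's deadline.
            if clock() - start + delay >= deadline_s:
                raise
            sleep(delay)
```

Passing `idempotent` explicitly at each call site keeps the replay decision visible in code review instead of hiding it in a global default.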
Step-by-Step Diagnosis for Unavailable
1. Capture failure samples with endpoint, region, zone, retry attempt, peer address, and latency per hop.
2. Inspect DNS resolution, TLS handshakes, connection pools, subchannel health, and load-balancer endpoint selection.
3. Differentiate a localized serving-path outage from client-caused retry amplification or channel exhaustion.
4. Re-test with bounded concurrency and controlled retries to prove whether the service stabilizes when pressure is reduced.
5. Compare healthy and failing regions or zones to isolate whether the incident is path-specific.
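To make the zone comparison in the last step concrete, the sketch below groups captured failure samples by zone and labels the incident scope. The sample shape (dicts with `zone` and `status` keys), the 5% threshold, and the function names are assumptions for illustration. Adapt them to whatever your logging pipeline actually emits.

```python
from collections import defaultdict


def failure_rate_by_zone(samples):
    """samples: iterable of dicts with 'zone' and 'status' keys (assumed shape)."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for sample in samples:
        totals[sample["zone"]] += 1
        if sample["status"] == "UNAVAILABLE":
            failures[sample["zone"]] += 1
    return {zone: failures[zone] / totals[zone] for zone in totals}


def classify_scope(rates, threshold=0.05):
    """Label the incident by how many zones exceed the assumed threshold."""
    noisy = sorted(zone for zone, rate in rates.items() if rate > threshold)
    if not noisy:
        return "healthy", noisy
    if len(rates) < 2:
        # One zone of data cannot distinguish a local from a global outage.
        return "inconclusive", noisy
    if len(noisy) == len(rates):
        return "global", noisy
    return "localized", noisy
```

A "localized" result points toward path isolation and failover, while "global" points toward shared dependencies or a retry storm.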
Seen in Production
- One zone behind a regional endpoint loses healthy backends, so only callers pinned to that path see UNAVAILABLE while neighboring zones remain healthy.
- A dependency brownout causes the service to shed requests with UNAVAILABLE before it can build full responses, and immediate client retries double the pressure.
- A rollout opens too many new channels at once, connection warm-up stalls, and short bursts of UNAVAILABLE appear before autoscaling catches up.
- Client libraries with no retry budget fan out thousands of instant retries, turning a partial outage into a broader saturation event.
Serving Path and Scope Audit
- Correlate UNAVAILABLE spikes with endpoint, region, and zone health (example: one backend pool is degraded while others still pass health checks).
- Trace dependency and control-plane failures feeding the serving path (example: downstream auth or datastore outage makes the frontend reject calls before useful work begins).
Retry Amplification and Channel Health
- Inspect retry counts, backoff behavior, and retry budgets (example: immediate retries triple request rate during a 2-minute brownout).
- Check channel, subchannel, and connection-pool state (example: exhausted connection pool or repeated TLS failures collapse one client cohort into UNAVAILABLE).
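One way to picture a retry budget is a token scheme in which ordinary requests earn a fraction of a retry and each retry spends a whole one. The class below is a hedged sketch of that idea, not the mechanism of any particular client library: `RetryBudget`, the 10% ratio, and the cap of ten banked retries are all assumptions. It uses integer units to avoid floating-point drift and is not thread-safe as written.

```python
class RetryBudget:
    """Allow retries only up to roughly retry_percent of recent request volume."""

    UNITS_PER_RETRY = 100

    def __init__(self, retry_percent=10, max_banked_retries=10):
        self.retry_percent = retry_percent
        self.max_units = max_banked_retries * self.UNITS_PER_RETRY
        self.units = 0

    def record_request(self):
        """Each first-attempt request deposits retry_percent units, up to the cap."""
        self.units = min(self.max_units, self.units + self.retry_percent)

    def try_spend(self):
        """Return True and withdraw one retry's worth of units if the budget allows."""
        if self.units >= self.UNITS_PER_RETRY:
            self.units -= self.UNITS_PER_RETRY
            return True
        return False
```

With these assumed defaults, a client that has sent 20 requests may issue at most two retries, so a brownout cannot triple its request rate the way unbudgeted immediate retries can.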
Decision Shortcut: Local Path vs Global Outage
- If one region or zone is noisy while others stay clean, prioritize serving-path isolation and failover before widening global deadlines.
- If every caller cohort fails at once, inspect shared dependencies, load-balancer health, or retry storms before blaming one client library.
- If read traffic recovers with jittered retries but writes still need caution, keep idempotency boundaries explicit instead of treating all operations the same.
Wrong Fix to Avoid
- Do not let every client retry immediately with no retry budget, especially during partial outages.
- Do not widen deadlines first if the real problem is an unhealthy path, exhausted channel pool, or dependency outage.
- Do not assume every mutation is safe to replay just because the top-level code is transient.
Implementation Examples
{
  "requestId": "req_8f11ca",
  "status": "UNAVAILABLE",
  "region": "us-central1",
  "zone": "us-central1-b",
  "attempt": 3,
  "peer": "dns:///inventory-grpc",
  "message": "upstream connect error or disconnect/reset before headers"
}

grpcurl -plaintext us-central1-inventory.internal:8080 grpc.health.v1.Health/Check

{
  "retryPolicy": {
    "maxAttempts": 4,
    "initialBackoff": "0.5s",
    "maxBackoff": "5s",
    "backoffMultiplier": 2,
    "retryableStatusCodes": ["UNAVAILABLE"]
  }
}

Incident Timeline
15:03 UTC
One serving path or dependency begins to degrade
Signal: Zone-level health, channel establishment, or a critical dependency starts failing before aggregate latency fully spikes.
Why it matters: The earliest useful signal is usually where healthy path selection breaks, not the final gRPC status alone.
15:05 UTC
Callers start seeing UNAVAILABLE from the affected path
Signal: Requests routed to the degraded path fail fast while healthy zones or endpoints may still serve successfully.
Why it matters: This is where locality matters: one broken path can look global if clients retry without visibility.
15:06 UTC
Unbounded retries amplify the incident
Signal: Retry volume rises, connection pools churn, and the extra load spills into paths that were previously healthy.
Why it matters: Retry behavior can become the second outage if budgets and jitter are missing.
15:14 UTC
Traffic is throttled, failed paths are isolated, and service recovers
Signal: Failover, bounded retries, and reduced concurrency restore healthy request flow without overloading the remaining backends.
Why it matters: That confirms the fix lived in serving-path recovery and traffic discipline rather than broad timeout inflation.
Seen in Production
Zonal backend outage causes temporary service unavailability
Frequency: common
Example: Traffic routed to one zone starts returning UNAVAILABLE while neighboring zones continue serving successfully.
Fix: Shift traffic to healthy zones, reduce retry amplification, and restore the affected backend pool.
Client retry storm amplifies partial backend degradation
Frequency: common
Example: Aggressive immediate retries double request volume and collapse connection pools during a short dependency brownout.
Fix: Deploy jittered backoff with retry budgets and concurrency caps.
Connection warm-up failure makes new traffic cohorts flap
Frequency: medium
Example: A rollout opens too many fresh connections at once, TLS or subchannel establishment stalls, and callers intermittently see UNAVAILABLE.
Fix: Warm channels intentionally, cap concurrent startup traffic, and monitor pool churn during rollout.
Wrong Fix vs Better Fix
Fan-out retries vs bounded retry budgets
Wrong fix: Let every caller retry immediately until the transient outage clears.
Better fix: Bound retries to explicitly safe paths, add full jitter, and cap total retry volume per caller cohort.
Why this is better: UNAVAILABLE is often transient, but retry amplification can convert a partial outage into a longer saturation event.
Raise deadlines vs restore a healthy serving path
Wrong fix: Increase deadlines globally because requests are failing during the outage window.
Better fix: Isolate the unhealthy path, restore dependency health, or fail over before changing end-to-end timeout budgets.
Why this is better: Longer waits do not fix a dead endpoint, broken channel pool, or unavailable dependency.
Retry writes like reads vs preserve idempotency boundaries
Wrong fix: Treat every operation as safely replayable because the top-level code is transient.
Better fix: Retry reads and explicitly safe operations automatically, and gate mutating retries behind idempotency or reconciliation.
Why this is better: Transient transport failure does not automatically make duplicate side effects safe.
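The idempotency boundary described above can be expressed as an explicit gate in front of automatic retries. This is a sketch under stated assumptions: the method names in `SAFE_METHODS` are hypothetical read-only RPCs, and `idempotency_key` stands in for whatever deduplication token your service actually supports.

```python
# Hypothetical read-only RPC names; replace with your service's real safe methods.
SAFE_METHODS = frozenset({"GetItem", "ListItems"})


def may_auto_retry(method, status, idempotency_key=None):
    """Decide whether a failed call may be replayed without human review."""
    if status != "UNAVAILABLE":
        return False
    if method in SAFE_METHODS:
        return True
    # Mutations are replayed only when the server can deduplicate them.
    return idempotency_key is not None
```

A mutation without a key falls through to `False`, which routes it to reconciliation rather than a blind replay.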
Debugging Tools
- Cloud Monitoring availability and latency dashboards
- Endpoint health checks and regional status telemetry
- Distributed tracing for dependency-hop failures
- Retry-budget and concurrency instrumentation
- gRPC channel or subchannel state telemetry
How to Verify the Fix
- Replay representative workloads and confirm UNAVAILABLE rate returns within SLO thresholds.
- Validate retries now converge without causing secondary latency, connection churn, or error spikes.
- Confirm healthy-region or failover paths can serve traffic during simulated disruption.
- Verify retry-budget and concurrency metrics remain bounded during recovery tests.
How to Prevent Recurrence
- Standardize retry budgets, jitter, and timeout policy across clients, gateways, and background workers.
- Implement circuit breaking, backpressure, outlier detection, and graceful degradation for dependency outages.
- Continuously test failover and recovery playbooks with synthetic disruption drills.
- Monitor channel health, connection-pool churn, and retry volume as first-class outage signals.
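Circuit breaking, mentioned in the list above, can be illustrated with a deliberately small state machine. The `CircuitBreaker` class, the threshold of five consecutive failures, and the 30-second cool-down are illustrative assumptions. Production breakers usually add rolling windows and a single-probe half-open state, which this sketch omits.

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; allow calls again after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """Return True if a call may be sent to the dependency right now."""
        if self.opened_at is None:
            return True
        # Simplified half-open: every call passes once the cool-down elapses,
        # and the next recorded failure re-opens the breaker.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

Callers check `allow()` before each attempt and fail fast or degrade gracefully when it returns `False`, which removes load from the dependency while it recovers.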
Pro Tip
- Cap maximum concurrent retries per caller cohort so one degraded path cannot trigger a fleet-wide retry storm.
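A cap on concurrent retries can be approximated with a non-blocking semaphore, one instance per caller cohort. The sketch below assumes a threaded client. `RetryConcurrencyCap` and the default of two slots are hypothetical choices for illustration.

```python
import threading
from contextlib import contextmanager


class RetryConcurrencyCap:
    """Limit how many retries one caller cohort may have in flight at once."""

    def __init__(self, max_concurrent_retries=2):
        self._sem = threading.BoundedSemaphore(max_concurrent_retries)

    @contextmanager
    def slot(self):
        """Yield True if a retry slot was acquired, False if the cap is reached."""
        acquired = self._sem.acquire(blocking=False)
        try:
            yield acquired
        finally:
            if acquired:
                self._sem.release()
```

A caller wraps each retry in `with cap.slot() as ok:` and fails fast when `ok` is `False`, so excess retries are dropped instead of queued behind a degraded path.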
Provider Context
This guidance is specific to GCP services. Always validate implementation details against official provider documentation before deploying to production.