Failure Model

What failures can high availability solve, and what can’t it solve?

Failure Scenario Analysis

Single Node Failures

Primary Process Crash

Scenario: The PostgreSQL primary process crashes or is killed (e.g. kill -9)

flowchart LR
    subgraph Detection["🔍 Failure Detection"]
        D1["Patroni detects process gone"]
        D2["Attempts to restart PostgreSQL"]
        D3["Restart fails, stop lease renewal"]
        D1 --> D2 --> D3
    end

    subgraph Failover["🔄 Failover"]
        F1["Etcd lease expires (~10s)"]
        F2["Trigger election, latest replica wins"]
        F3["New primary promoted"]
        F4["HAProxy detects new primary"]
        F1 --> F2 --> F3 --> F4
    end

    subgraph Impact["📊 Impact"]
        I1["Write service down: 15-30s"]
        I2["Read service: brief interruption"]
        I3["Data loss: < 1MB or 0"]
    end

    Detection --> Failover --> Impact

    style D1 fill:#ffcdd2
    style F3 fill:#c8e6c9
    style I1 fill:#fff9c4
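
A small drill for reproducing this flow, sketched under the assumptions used elsewhere in this section (/pg/data as the data directory, /etc/patroni/patroni.yml as the Patroni config). Note that Patroni will usually just restart the crashed postmaster in place; the failover branch only runs if that restart fails.

# Crash the primary postmaster; its PID is the first line of postmaster.pid
sudo kill -9 "$(head -1 /pg/data/postmaster.pid)"

# Watch Patroni either restart it locally or, if the restart fails,
# let the leader lease expire and promote the most up-to-date replica
watch -n1 patronictl -c /etc/patroni/patroni.yml list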

Patroni Process Failure

Scenario: Patroni process is killed or crashes

flowchart TB
    FAULT["Patroni process failure"]

    subgraph Detection["Failure Detection"]
        D1["Patroni stops lease renewal"]
        D2["PostgreSQL continues running<br/>(orphan state)"]
        D3["Etcd lease countdown"]
    end

    subgraph FailsafeOn["failsafe_mode: true"]
        FS1["Check if can access other Patroni"]
        FS2["✅ Can → Continue as primary"]
        FS3["❌ Cannot → Self-demote"]
    end

    subgraph FailsafeOff["failsafe_mode: false"]
        FF1["Trigger switchover after lease expires"]
        FF2["Original primary demotes"]
    end

    FAULT --> Detection
    Detection --> FailsafeOn
    Detection --> FailsafeOff

    style FAULT fill:#f44336,color:#fff
    style FS2 fill:#4CAF50,color:#fff
    style FS3 fill:#ff9800,color:#fff
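
Which branch applies depends on whether failsafe_mode is enabled in the cluster-wide (DCS) configuration. A quick check, sketched assuming Patroni's default REST API port of 8008:

# Is failsafe_mode enabled for this cluster?
patronictl -c /etc/patroni/patroni.yml show-config | grep -i failsafe

# Ask a member's REST API for its own view of role and state
curl -s http://127.0.0.1:8008/patroni | jq '.role, .state'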

Replica Failure

Scenario: Any replica node fails

Impact:

  • Read-only traffic redistributed to other replicas
  • If no other replicas remain, the primary serves read-only traffic
  • ✅ Write service completely unaffected

Recovery:

  • Node recovery triggers Patroni auto-start
  • Auto-resync from primary
  • Restored as replica role
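
If the recovered node fails to resync cleanly (for example after a timeline divergence), it can be rebuilt from the primary. A sketch; the cluster and member names are placeholders:

# Patroni starts with the node and normally rejoins it as a replica
systemctl start patroni

# If the data directory has diverged, wipe and rebuild it from the primary
patronictl -c /etc/patroni/patroni.yml reinit <cluster_name> <member_name>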

Multi-Node Failures

Two of Three Nodes Fail (2/3 Failure)

Scenario: 3-node cluster, 2 nodes fail simultaneously

flowchart TB
    subgraph Analysis["Situation Analysis"]
        A1["Etcd loses majority (1/3 < 2/3)"]
        A2["Cannot perform leader election"]
        A3["Auto-switchover mechanism fails"]
    end

    subgraph Survivor["Surviving Node Status"]
        S1{"Surviving node is?"}
        S2["🟢 Primary<br/>Continues running under failsafe_mode"]
        S3["🔵 Replica<br/>Cannot auto-promote"]
    end

    A1 --> A2 --> A3 --> S1
    S1 -->|"Primary"| S2
    S1 -->|"Replica"| S3

    style A1 fill:#ffcdd2
    style S2 fill:#c8e6c9
    style S3 fill:#fff9c4

Emergency Recovery Procedure:

# 1. Confirm surviving node status
patronictl -c /etc/patroni/patroni.yml list

# 2. If surviving node is replica, manually promote
pg_ctl promote -D /pg/data

# 3. Or use pg-promote script
/pg/bin/pg-promote

# 4. Modify HAProxy config, point directly to surviving node
# Comment out health checks, hardcode routing

# 5. After recovering Etcd cluster, reinitialize
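
For step 4, one temporary option is to hard-code the surviving node into a dedicated primary frontend and bypass the Patroni health check entirely. The file path, port, and address below are assumptions and must be adapted to the local HAProxy layout:

# Sketch only: route primary traffic straight to the survivor, no health check
# (make sure the HAProxy unit actually loads this file, e.g. via a -f conf directory)
cat > /etc/haproxy/pg-emergency.cfg <<'EOF'
listen pg-primary-emergency
    bind *:5433
    mode tcp
    server survivor 10.10.10.11:5432
EOF
systemctl reload haproxy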

One of Two Nodes Fails (1/2 Failure)

Scenario: 2-node cluster, primary fails

Problem:

  • Etcd has only 2 members; with one down, the survivor cannot form a majority
  • Cannot complete election
  • Replica cannot auto-promote

Solutions:

  1. Add an external Etcd arbiter node
  2. Manually intervene and promote the replica
  3. Use a Witness node

Manual Promotion Steps:

  1. Confirm primary is truly unrecoverable
  2. Stop replica Patroni: systemctl stop patroni
  3. Manual promote: pg_ctl promote -D /pg/data
  4. Start PostgreSQL directly: systemctl start postgres
  5. Update application connection strings or HAProxy config
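
The same steps as a shell sketch (service unit names follow the ones used above; run it only after confirming the old primary is truly gone):

# 2. Stop Patroni so it no longer manages the instance
systemctl stop patroni

# 3. Promote the surviving replica
#    (if stopping Patroni also stopped PostgreSQL in your setup, start it again first)
pg_ctl promote -D /pg/data

# 4. Keep PostgreSQL running under its own service unit
systemctl start postgres

# 5. Update application connection strings or the HAProxy configuration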

Etcd Cluster Failures

Etcd Single Node Failure

Scenario: 3-node Etcd cluster, 1 node fails

Impact:

  • ✅ Etcd still has majority (2/3)
  • ✅ Service runs normally
  • ✅ PostgreSQL HA unaffected

Recovery:

  • Fix failed node
  • Use etcd-add to rejoin
  • Or replace with new node
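
Quorum can be verified from any surviving member before and after the repair (authentication and --endpoints flags omitted for brevity):

# Membership and per-endpoint health of the Etcd cluster (v3 API)
etcdctl member list -w table
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster -w table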

Etcd Majority Lost

Scenario: 3-node Etcd cluster, 2 nodes fail

flowchart TB
    subgraph Impact["❌ Impact"]
        I1["Etcd cannot write"]
        I2["Patroni cannot renew lease"]
        I3["failsafe_mode activates"]
        I4["Cannot perform failover"]
    end

    subgraph PG["PostgreSQL Behavior"]
        P1["🟢 Primary: Continues running"]
        P2["🔵 Replica: Continues replicating"]
        P3["✅ New writes can continue"]
    end

    subgraph Limit["⚠️ Limitations"]
        L1["Cannot switchover"]
        L2["Cannot failover"]
        L3["Config changes cannot take effect"]
    end

    Impact --> PG --> Limit

    style I1 fill:#ffcdd2
    style P1 fill:#c8e6c9
    style L1 fill:#fff9c4

Recovery Priority:

  1. Restore Etcd majority
  2. Verify PostgreSQL status
  3. Check Patroni lease renewal
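
While the majority is lost, the symptoms are easy to confirm on the surviving primary; a sketch:

# DCS errors and failsafe activation show up in the Patroni log
journalctl -u patroni --since "10 minutes ago" | grep -iE "etcd|failsafe"

# The primary keeps accepting connections and writes in the meantime
psql -c "SELECT pg_is_in_recovery()"   # returns 'f' on the still-running primary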

Network Partition

Primary Network Isolation

Scenario: The primary is cut off from Etcd and possibly from the other nodes

flowchart LR
    subgraph Isolated["🔒 Isolated Side (Primary)"]
        P1["Primary"]
        CHECK{"failsafe_mode<br/>check"}
        CONT["Continue running"]
        DEMOTE["Self-demote"]

        P1 --> CHECK
        CHECK -->|"Can access other Patroni"| CONT
        CHECK -->|"Cannot access"| DEMOTE
    end

    subgraph Majority["✅ Majority Side"]
        E[("Etcd")]
        P2["Replica"]
        ELECT["Trigger election"]
        NEWPRI["New primary emerges"]

        E --> ELECT --> P2 --> NEWPRI
    end

    Isolated -.->|"Network partition"| Majority

    style P1 fill:#ff9800,color:#fff
    style DEMOTE fill:#f44336,color:#fff
    style NEWPRI fill:#4CAF50,color:#fff

Split-brain Protection:

  • Patroni failsafe_mode
  • Old primary self-detection
  • Fencing (optional)
  • Watchdog (optional)
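
failsafe_mode itself is a cluster-wide (DCS) setting. A sketch of enabling it, assuming patronictl edit-config is available on the node:

# Allow an isolated primary that can still reach every other Patroni member
# to keep its leader role while the DCS is unreachable
patronictl -c /etc/patroni/patroni.yml edit-config --set "failsafe_mode=true"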

Watchdog Mechanism

For protection in extreme scenarios:

watchdog:
  mode: automatic                     # off|automatic|required
  device: /dev/watchdog
  safety_margin: 5                    # Safety margin (seconds)

How it works:

  • Patroni periodically writes to watchdog device
  • If Patroni stops feeding the device, the kernel resets the node
  • Ensures old primary doesn’t continue serving
  • Prevents severe split-brain scenarios
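
A quick sanity check that the node actually has a usable watchdog device (softdog is the common software fallback when no hardware watchdog exists):

# Load the software watchdog module if no hardware device is present
modprobe softdog

# The device must exist and be writable by the user running Patroni
ls -l /dev/watchdog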

Best Practices

Production Environment Checklist

Infrastructure:

  • At least 3 nodes (PostgreSQL)
  • At least 3 nodes (Etcd; can be co-located with the PG nodes)
  • Nodes distributed across different failure domains (racks/AZs)
  • Network latency < 10ms (same city) or < 50ms (cross-region)
  • 10GbE network (recommended)

Parameter Configuration:

  • pg_rto: tune to network conditions (15-60s)
  • pg_rpo: set per business requirements (0 or 1MB)
  • pg_conf: choose an appropriate template (oltp/crit)
  • patroni_watchdog_mode: evaluate whether it is needed

Monitoring Alerts:

  • Patroni status monitoring (leader/replication lag)
  • Etcd cluster health monitoring
  • Replication lag alerts (lag > 1MB)
  • failsafe_mode activation alerts
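
For the replication-lag alert above, a probe sketch that reports per-standby lag in bytes (run on the primary):

psql -Atc "SELECT application_name,
                  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
           FROM pg_stat_replication"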

Disaster Recovery Drills:

  • Regularly execute failover drills
  • Verify RTO/RPO meets expectations
  • Test backup recovery procedures
  • Verify monitoring alert effectiveness

Common Troubleshooting

Failover Failures:

# Check Patroni status
patronictl -c /etc/patroni/patroni.yml list

# Check Etcd cluster health
etcdctl endpoint health

# Check replication lag
psql -c "SELECT * FROM pg_stat_replication"

# View Patroni logs
journalctl -u patroni -f

Split-brain Handling:

# 1. Confirm which node is the "true" primary (pg_is_in_recovery() returns f on a primary)
psql -c "SELECT pg_is_in_recovery()"

# 2. Stop the "wrong" primary
systemctl stop patroni

# 3. Use pg_rewind to sync
pg_rewind --target-pgdata=/pg/data --source-server="host=<true_primary>"

# 4. Restart Patroni
systemctl start patroni