Tradeoffs

How do you balance the RTO and RPO parameters to build an HA solution that fits your business needs?

Overview

Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are two parameters requiring careful tradeoffs when designing high availability clusters.

  • RTO (Recovery Time Objective) defines the maximum acceptable time to restore write capability after a primary failure.
  • RPO (Recovery Point Objective) defines the maximum amount of data that may be lost when the primary fails.

The default RTO and RPO values used by Pigsty meet reliability requirements for most scenarios. You can adjust them based on your hardware level, network quality, and business requirements.

The upper limit of unavailability during failover is controlled by the pg_rto parameter, which defaults to 30s. Increasing it lengthens the write outage after a primary failure, while decreasing it raises the rate of false-positive failovers (e.g., repeated switchovers caused by brief network jitter).

The upper limit of potential data loss is controlled by the pg_rpo parameter, which defaults to 1MB. Lowering it reduces the data-loss ceiling during failover, but increases the chance that automatic failover is refused because no replica is healthy enough (i.e., all replicas lag too far behind).
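
Both knobs can be overridden per cluster in the Pigsty inventory. A minimal sketch is shown below; the cluster name and host addresses are placeholders:

```yaml
pg-test:                         # hypothetical cluster name
  hosts:
    10.10.10.11: { pg_seq: 1, pg_role: primary }
    10.10.10.12: { pg_seq: 2, pg_role: replica }
  vars:
    pg_cluster: pg-test
    pg_rto: 30                   # failover unavailability ceiling, in seconds (default: 30)
    pg_rpo: 1048576              # potential data-loss ceiling, in bytes (default: 1MiB)
```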


Parameter Details

pg_rto

Recovery Time Objective (RTO) in seconds. Defaults to 30 seconds, meaning the cluster is expected to restore write capability within 30 seconds of a primary failure.

If the primary instance is missing for this long, a new leader election is triggered. Lower is not always better: reducing this value shortens the write outage during failover, but makes the cluster more sensitive to short-lived network jitter and therefore more prone to false-positive failovers. Configure it based on your network conditions and business constraints, trading off failure probability against failure impact.
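
Pigsty derives Patroni's timing parameters from pg_rto (see the tuning table in the Failover Timing section). A rough sketch of the resulting Patroni values, assuming ttl = pg_rto and loop_wait = retry_timeout = pg_rto / 3:

```yaml
# Illustrative Patroni timing derived from pg_rto: 30 (values are assumptions based on
# the ttl = pg_rto and loop_wait = pg_rto / 3 relationships described later in this page)
ttl: 30               # leader lease lifetime: a primary missing this long triggers election
loop_wait: 10         # interval of Patroni's HA loop (lease renewal / health checks)
retry_timeout: 10     # timeout for retrying DCS and PostgreSQL operations
```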

pg_rpo

Recovery Point Objective (RPO) in bytes. Defaults to 1048576 (1MiB), meaning up to 1MiB of data loss can be tolerated during failover.

When the primary goes down and all replicas are lagging, you face a difficult choice: promote a replica immediately and restore service quickly, accepting a bounded amount of data loss (e.g., less than 1MB); wait for the primary to come back online (which may never happen) to avoid any data loss; or abandon automatic failover and wait for human intervention. Configure this value according to your business preference, trading off availability against consistency.

Additionally, you can always guarantee RPO = 0 by enabling synchronous commit (e.g., using the crit.yml template), at the cost of some cluster latency and throughput.
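
A minimal sketch of such a consistency-first cluster definition (cluster name and addresses are placeholders): switching pg_conf to the crit.yml template trades latency and throughput for RPO = 0:

```yaml
pg-critical:                     # hypothetical cluster name
  hosts:
    10.10.10.21: { pg_seq: 1, pg_role: primary }
    10.10.10.22: { pg_seq: 2, pg_role: replica }
    10.10.10.23: { pg_seq: 3, pg_role: replica }
  vars:
    pg_cluster: pg-critical
    pg_conf: crit.yml            # consistency-first template: synchronous commit, RPO = 0
```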


Protection Modes

For RPO (data loss) tradeoffs, you can draw on the design philosophy of Oracle Data Guard's three protection modes.

Oracle Data Guard defines three protection modes, and PostgreSQL can implement the same three classic HA switchover strategies through configuration.

Maximum Performance

  • Default mode, used by Patroni config templates other than crit.yml: Patroni synchronous_mode: false, synchronous replication disabled
  • Transaction commits only require local WAL persistence; there is no waiting on replicas, and replica failures are completely transparent to the primary
  • A primary failure may lose WAL that has not yet been sent to or received by replicas (typically < 1MB; on a good network, replication lag is usually 10-100ms, roughly 10-100KB)
  • Optimized for performance, suitable for regular business scenarios that can tolerate minor data loss during failures
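
In Patroni terms this mode simply leaves synchronous replication off; a minimal sketch of the corresponding settings:

```yaml
# Maximum Performance: asynchronous replication (default templates)
synchronous_mode: false          # Patroni does not manage synchronous replication
synchronous_mode_strict: false   # irrelevant while synchronous_mode is off
```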

Maximum Availability

  • Pigsty uses crit.yml template with synchronous_mode: true + synchronous_mode_strict: false
  • Under normal conditions, commits wait for at least one replica to confirm, achieving zero data loss; when all sync replicas fail, the cluster automatically degrades to async mode to keep serving writes
  • Balances data safety and service availability, recommended configuration for production core business
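
A minimal sketch of the Patroni settings that crit.yml corresponds to, per the description above:

```yaml
# Maximum Availability: synchronous replication with automatic degradation
synchronous_mode: true           # commits wait for at least one sync replica
synchronous_mode_strict: false   # degrade to async when all sync replicas fail
```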

Maximum Protection

  • Pigsty uses the crit.yml template (Patroni synchronous_mode: true), further configured for strict synchronous mode
  • With synchronous_mode_strict: true, the primary refuses writes when all sync replicas fail, preventing any data loss
  • Optionally configure synchronous_commit: 'remote_apply' for read-write consistency
  • Transactions must be persisted on at least one replica before commit returns; you can specify the sync replica list and configure more sync replicas for better disaster tolerance
  • Suitable for financial transactions, medical records, and other scenarios requiring extremely high data integrity
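
A minimal sketch of the strict-sync Patroni settings described above; remote_apply is the optional extra step for read-your-writes consistency on replicas:

```yaml
# Maximum Protection: strict synchronous replication, no degradation
synchronous_mode: true
synchronous_mode_strict: true          # primary refuses writes if no sync replica is available
postgresql:
  parameters:
    synchronous_commit: remote_apply   # optional: replicas must apply WAL before commit returns
```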

| Dimension | Max Performance | Max Availability | Max Protection |
|-----------|-----------------|------------------|----------------|
| Name | Maximum Performance | Maximum Availability | Maximum Protection |
| Replication | Async | Sync | Sync |
| Data Loss | Possible (replication lag) | Zero normally, minor when degraded | Zero |
| Primary Write Latency | Lowest | Medium (+1 network round trip) | Medium (+1 network round trip) |
| Throughput | Highest | Lower | Lower |
| Replica Failure Impact | None | Auto degrade, continue service | Primary stops writes |
| RPO | < 1MB | = 0 (normal) / < 1MB (degraded) | = 0 |
| Use Case | Regular business, performance-first | Important business, safety-first | Financial core, compliance-first |
| Pigsty Config | Default | pg_conf: crit.yml | pg_conf: crit.yml + strict mode |

The three modes differ in how Patroni's two core parameters are configured: synchronous_mode and synchronous_mode_strict.

  • synchronous_mode: Whether Patroni manages synchronous replication. If enabled, synchronous_mode_strict determines whether strict sync mode is used.
  • synchronous_mode_strict = false: Default, allows degrading to async mode when replicas fail, primary continues service (Maximum Availability)
  • synchronous_mode_strict = true: Disables degradation, primary stops writes until sync replicas recover (Maximum Protection)

| Mode | synchronous_mode | synchronous_mode_strict | Replica Failure Behavior |
|------|------------------|-------------------------|--------------------------|
| Max Performance | false | - | No impact |
| Max Availability | true | false | Auto degrade to async |
| Max Protection | true | true | Primary refuses writes |

Pigsty uses availability-first mode by default, meaning it will fail over as quickly as possible when the primary fails, and data not yet replicated to replicas may be lost (on a typical 10GbE network, replication lag is usually a few KB to 100KB). If you need zero data loss during failover, use the crit.yml template for consistency-first mode, which guarantees no data is lost during failover at the cost of some latency and throughput.

Maximum Protection mode requires additional synchronous_mode_strict: true configuration, which causes the primary to stop write service when all sync replicas fail! Ensure you have at least two sync replicas, or accept this risk before enabling this mode.


Failover Timing

For RTO (recovery time) tradeoffs, refer to the Patroni failure-detection and switchover timing breakdown below.

Recovery Time Objective (RTO) consists of multiple phases:

gantt
    title RTO Time Breakdown (Default config pg_rto=30s)
    dateFormat ss
    axisFormat %Ss

    section Failure Detection
    Patroni detect/stop lease renewal    :a1, 00, 10s

    section Election Phase
    Etcd lease expires           :a2, after a1, 2s
    Candidate election (compare LSN)    :a3, after a2, 3s

    section Promotion Phase
    Execute promote            :a4, after a3, 3s
    Update Etcd state          :a5, after a4, 2s

    section Traffic Switch
    HAProxy detects new primary      :a6, after a5, 5s
    HAProxy confirm (rise)     :a7, after a6, 3s
    Service restored                :milestone, after a7, 0s

| Parameter | Impact | Tuning Advice |
|-----------|--------|---------------|
| pg_rto | Baseline for TTL/loop_wait/retry_timeout | Can reduce to 15-20s on stable networks |
| ttl | Failure detection time window | = pg_rto |
| loop_wait | Patroni check interval | = pg_rto / 3 |
| inter | HAProxy health check interval | Can reduce to 1-2s |
| fall | Failure determination count | Can reduce to 2 |
| rise | Recovery determination count | Can reduce to 2 |
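
Following the advice above, a sketch of more aggressive tuning for a stable network. Only pg_rto is a Pigsty cluster parameter; the HAProxy inter/fall/rise values live in the rendered HAProxy health checks, so they are shown as comments and the exact knobs should be treated as an assumption:

```yaml
pg-fast:                         # hypothetical cluster tuned for faster failover
  vars:
    pg_cluster: pg-fast
    pg_rto: 15                   # ttl=15, loop_wait/retry_timeout≈5: faster detection, more jitter-sensitive
    # HAProxy side (in the generated health checks, not cluster vars):
    #   inter 2s, fall 2, rise 2  ->  roughly 4-6s to confirm the new primary
```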

When the primary fails, the system goes through these phases:

sequenceDiagram
    autonumber
    participant Primary as 🟢 Primary
    participant Patroni_P as Patroni (Primary)
    participant Etcd as 🟠 Etcd Cluster
    participant Patroni_R as Patroni (Replica)
    participant Replica as 🔵 Replica
    participant HAProxy as HAProxy

    Note over Primary: T=0s Primary failure occurs

    rect rgb(255, 235, 235)
        Note right of Primary: Failure Detection Phase (0-10s)
        Primary-x Patroni_P: Process crash
        Patroni_P--x Etcd: Stop lease renewal
        HAProxy--x Patroni_P: Health check fails
        Etcd->>Etcd: Lease countdown starts
    end

    rect rgb(255, 248, 225)
        Note right of Etcd: Election Phase (10-20s)
        Etcd->>Etcd: Lease expires, release leader lock
        Patroni_R->>Etcd: Check eligibility (LSN, replication lag)
        Etcd->>Patroni_R: Grant leader lock
    end

    rect rgb(232, 245, 233)
        Note right of Replica: Promotion Phase (20-30s)
        Patroni_R->>Replica: Execute PROMOTE
        Replica-->>Replica: Promoted to new primary
        Patroni_R->>Etcd: Update state
        HAProxy->>Patroni_R: Health check /primary
        Patroni_R-->>HAProxy: 200 OK
    end

    Note over HAProxy: T≈30s Service restored
    HAProxy->>Replica: Route write traffic to new primary

Key timing formula:

RTO ≈ TTL + Election_Time + Promote_Time + HAProxy_Detection

Where:
- TTL = pg_rto (default 30s)
- Election_Time ≈ 1-2s
- Promote_Time ≈ 1-5s
- HAProxy_Detection = fall × inter + rise × fastinter ≈ 12s

Actual RTO is typically 15-40s, lower than the naive sum of the terms above because the detection, election, promotion, and traffic-switch phases partly overlap and failure detection often takes less than the full TTL window. It also depends on:
- Network latency
- Replica WAL replay progress
- PostgreSQL recovery speed