Failover Testing

Section: High Availability | Article 19
Audience: System Administrators
Last Updated: 2026-04-07


Overview

After configuring your RP-PAM HA cluster, you should test failover to confirm that:

  1. The standby node promotes to leader when the primary fails.
  2. Promotion completes within 15 seconds.
  3. No data is lost during the transition.
  4. Active sessions survive the failover.
  5. The original primary node rejoins as a standby when restarted.

This article provides step-by-step procedures for both Linux and Windows environments.


Prerequisites

| Requirement | Detail |
| --- | --- |
| Working HA cluster | At least 2 nodes configured per HA Multi-Node Setup |
| Redis running | Connected and healthy (see Redis Setup) |
| Cluster status healthy | All nodes showing "status": "healthy" |
| Admin JWT token | A valid authentication token for API calls |
| A test user account | For verifying session continuity |

Step 1: Verify Cluster Is Healthy Before Testing

Confirm the cluster is in a known-good state before introducing failures.

Linux:

curl -s http://localhost:7101/system/health/cluster \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .

PowerShell:

$cluster = Invoke-RestMethod -Uri "http://localhost:7101/system/health/cluster" `
  -Headers @{ Authorization = "Bearer $adminJwt" }
$cluster | ConvertTo-Json -Depth 5

Confirm:

  • clusterHealthy is true
  • All nodes show "status": "healthy"
  • redisConnected is true
  • quorumMet is true
  • Note which node is leaderNode (e.g., "node1")
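
The checks above can be scripted so that a test run stops early when the cluster is not in a known-good state. The following is a minimal sketch, not a definitive implementation. It assumes clusterHealthy, redisConnected, and quorumMet are top-level booleans in the health response, and that nodes are listed in a top-level "nodes" array whose entries carry a "status" field. The array name and shape are assumptions; adjust the jq filter to match the response your cluster actually returns.

```shell
# preflight_ok: reads a cluster health JSON document on stdin and returns 0
# only if every pre-test condition holds.
# ASSUMPTION: nodes appear under a top-level "nodes" array, each with "status".
preflight_ok() {
    jq -e '
        .clusterHealthy == true
        and .redisConnected == true
        and .quorumMet == true
        and ((.nodes // []) | length >= 2)
        and ((.nodes // []) | all(.status == "healthy"))
    ' >/dev/null
}

# Example use (not executed here):
#   curl -sf http://localhost:7101/system/health/cluster \
#     -H "Authorization: Bearer $ADMIN_JWT" | preflight_ok \
#     || { echo "Cluster is not healthy; aborting test" >&2; exit 1; }
```

Because jq -e maps a false or null result to a non-zero exit status, the function also fails when the request returns nothing, which is the safe outcome before a destructive test.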


Step 2: Open an Active Session

To verify session continuity, log in to the web portal and leave the session open during the failover.

  1. Open a browser and navigate to https://your-pam-server:7101 (or the VIP / load balancer address).
  2. Log in with a test user account.
  3. Navigate to any page (e.g., the Dashboard or Access Requests).
  4. Keep this browser tab open throughout the test.

Step 3: Stop the Leader Node

Simulate a leader failure by stopping the RP-PAM service on the current leader.

On Linux (assuming node1 is the leader)

# On node1:
sudo systemctl stop rppam

# Record the time
echo "Leader stopped at: $(date -u +%Y-%m-%dT%H:%M:%SZ)"

On Windows (assuming node1 is the leader)

# On node1:
Stop-Service RpPam

# Record the time
Write-Host "Leader stopped at: $((Get-Date).ToUniversalTime().ToString('yyyy-MM-ddTHH:mm:ssZ'))"

Note: Use stop (graceful shutdown), not kill. A graceful shutdown simulates the most common failure mode. For a hard failure test, see Advanced: Hard Failure Test below.


Step 4: Verify Standby Promotes to Leader

Within 15 seconds, the standby node should detect the leader failure and promote itself.

Monitor From the Standby Node (node2)

Linux:

# Poll cluster status until node2 becomes leader
for i in $(seq 1 30); do
    LEADER=$(curl -sf http://localhost:7101/system/health/cluster \
      -H "Authorization: Bearer $ADMIN_JWT" 2>/dev/null | jq -r .leaderNode)
    echo "$(date -u +%H:%M:%S) — Leader: $LEADER"
    if [ "$LEADER" = "node2" ]; then
        echo "SUCCESS: node2 promoted to leader"
        break
    fi
    sleep 1
done

PowerShell:

# Poll cluster status until node2 becomes leader
for ($i = 1; $i -le 30; $i++) {
    try {
        $cluster = Invoke-RestMethod -Uri "http://localhost:7101/system/health/cluster" `
          -Headers @{ Authorization = "Bearer $adminJwt" } -ErrorAction SilentlyContinue
        $leader = $cluster.leaderNode
        Write-Host "$(Get-Date -Format 'HH:mm:ss') - Leader: $leader"
        if ($leader -eq "node2") {
            Write-Host "SUCCESS: node2 promoted to leader"
            break
        }
    } catch {
        Write-Host "$(Get-Date -Format 'HH:mm:ss') - Waiting..."
    }
    Start-Sleep -Seconds 1
}

Expected Timeline

| Time | Event |
| --- | --- |
| T+0s | Leader node stopped |
| T+5s | Standby detects missed heartbeat |
| T+10s | Leader lock expires in Redis |
| T+10-15s | Standby acquires leader lock and promotes |
| T+15s | Standby is serving as leader; VIP transferred (if configured) |

If promotion takes longer than 15 seconds, check:

  • leaderLockTtlSeconds value (default 30; can be reduced to 15 for faster failover)
  • Redis latency (high latency delays lock operations)
  • Network latency between nodes
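
To record the promotion time as a number rather than reading it off the poll output, the timestamp printed in Step 3 can be subtracted from the time at which the standby reported itself as leader. The sketch below assumes both timestamps are ISO 8601 UTC strings and that GNU date is available (its -d option is not supported by BSD or macOS date). The function names are illustrative, not part of RP-PAM.

```shell
# elapsed_seconds START END: prints the number of seconds between two
# ISO 8601 UTC timestamps. Requires GNU date for the -d option.
elapsed_seconds() {
    local start end
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $(( end - start ))
}

# within_target ELAPSED LIMIT: succeeds when ELAPSED is strictly below LIMIT.
within_target() {
    [ "$1" -lt "$2" ]
}

# Example use (timestamps are placeholders):
#   took=$(elapsed_seconds "$STOPPED_AT" "$PROMOTED_AT")
#   within_target "$took" 15 && echo "PASS: ${took}s" || echo "FAIL: ${took}s"
```

For a hard failure test, pass your configured leaderLockTtlSeconds as the limit instead of 15, since the standby may have to wait for the lock to expire.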


Step 5: Verify No Data Loss

After the standby has promoted, verify that recent data is intact.

Linux:

# List recent audit events (these should include events from before the failover)
# The URL is quoted so the shell does not treat "?" as a glob character.
curl -s "http://localhost:7101/api/v1/audit/events?limit=10" \
  -H "Authorization: Bearer $ADMIN_JWT" | jq '.items[].timestamp'

# Check a known resource or user
curl -s "http://localhost:7101/api/v1/users?limit=5" \
  -H "Authorization: Bearer $ADMIN_JWT" | jq '.items[].username'

PowerShell:

# List recent audit events
$events = Invoke-RestMethod -Uri "http://localhost:7101/api/v1/audit/events?limit=10" `
  -Headers @{ Authorization = "Bearer $adminJwt" }
$events.items | Select-Object timestamp, eventType | Format-Table

# Check a known resource or user
$users = Invoke-RestMethod -Uri "http://localhost:7101/api/v1/users?limit=5" `
  -Headers @{ Authorization = "Bearer $adminJwt" }
$users.items | Select-Object username | Format-Table
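
Rather than checking the audit timestamps by eye, you can count how many of the returned events predate the moment the leader was stopped. This is a rough sketch that assumes every timestamp uses the same fixed-width ISO 8601 UTC format (for example 2026-04-07T10:00:00Z), so that plain string comparison orders them correctly. The function name is illustrative.

```shell
# count_before CUTOFF: reads one timestamp per line on stdin and prints how
# many are earlier than CUTOFF.
# ASSUMPTION: all timestamps share one fixed-width ISO 8601 UTC format, so
# string comparison matches chronological order.
count_before() {
    awk -v cutoff="$1" '$0 != "" && $0 < cutoff { n++ } END { print n + 0 }'
}

# Example use (STOPPED_AT is the time recorded in Step 3):
#   curl -s "http://localhost:7101/api/v1/audit/events?limit=10" \
#     -H "Authorization: Bearer $ADMIN_JWT" \
#     | jq -r '.items[].timestamp' | count_before "$STOPPED_AT"
```

A result of 0 suggests that no pre-failover events are visible on the new leader and warrants a closer look. A non-zero result shows only that some earlier events survived, not that every record did.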

For local-sync Mode

If your cluster uses local-sync database mode, additionally verify that the standby's database contains all expected records:

# Check replication status (run on node2 after promotion)
curl -s http://localhost:7101/system/health/database \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .replicationStatus

Expected: "replicationStatus": "in-sync" or a minimal lag (< 1 second).


Step 6: Verify Active Sessions Survive

Return to the browser tab you left open in Step 2.

  1. Refresh the page. It should load normally without requiring re-authentication.
  2. Navigate to a different page. Verify the portal is fully functional.
  3. Submit a test action (e.g., view a resource detail). Confirm the action completes successfully.

If using a VIP or load balancer, the browser should connect seamlessly to the new leader. If connecting directly to node1, you will need to change the URL to node2.

Note: If the session does not survive, check that Redis is properly storing session tokens. Sessions are cached in Redis and are available to all nodes regardless of which node originally issued the token.


Step 7: Restart the Original Leader

Restart the stopped node. It should automatically rejoin the cluster as a standby.

Linux (on node1)

sudo systemctl start rppam

# Wait for it to become healthy
# jq compares the parsed value, so whitespace in the JSON does not matter
until curl -sf http://localhost:7101/system/health/ping | jq -e '.status == "healthy"' >/dev/null 2>&1; do
    sleep 2
done
echo "node1 is healthy and has rejoined the cluster"

Windows (on node1)

Start-Service RpPam

# Wait for it to become healthy
do {
    Start-Sleep -Seconds 2
    try {
        $health = Invoke-RestMethod -Uri "http://localhost:7101/system/health/ping"
    } catch {
        $health = $null   # Service is not accepting connections yet
    }
} while ($health.status -ne "healthy")

Write-Host "node1 is healthy and has rejoined the cluster"
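
The wait loops above run indefinitely if the node never becomes healthy. In an unattended test it can help to bound the wait. The following sketch shows one way to do that in shell. wait_until and node_healthy are hypothetical helper names, and the 120-second limit in the example is an arbitrary placeholder rather than a documented RP-PAM value.

```shell
# wait_until TIMEOUT INTERVAL COMMAND...: re-runs COMMAND every INTERVAL
# seconds until it succeeds (returns 0) or TIMEOUT seconds pass (returns 1).
wait_until() {
    local timeout="$1" interval="$2" deadline
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    until "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}

# node_healthy: succeeds when the local ping endpoint reports "healthy".
node_healthy() {
    curl -sf http://localhost:7101/system/health/ping \
        | jq -e '.status == "healthy"' >/dev/null 2>&1
}

# Example use (not executed here):
#   wait_until 120 2 node_healthy \
#     && echo "node1 is healthy and has rejoined the cluster" \
#     || echo "node1 did not become healthy in time" >&2
```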

Step 8: Verify Final Cluster Status

From either node, confirm the cluster is fully healthy with both nodes online.

Linux:

curl -s http://localhost:7101/system/health/cluster \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .

PowerShell:

$cluster = Invoke-RestMethod -Uri "http://localhost:7101/system/health/cluster" `
  -Headers @{ Authorization = "Bearer $adminJwt" }
$cluster | ConvertTo-Json -Depth 5

Expected:

| Field | Value |
| --- | --- |
| clusterHealthy | true |
| leaderNode | "node2" (the promoted node remains leader) |
| node1 role | "standby" |
| node1 status | "healthy" |
| node2 role | "primary" |
| node2 status | "healthy" |

Note: node2 remains the leader after node1 rejoins. RP-PAM does not automatically fail back. If you want node1 to be the leader again, you can trigger a manual failover (see below).


Optional: Manual Failover (Fail Back)

If you want to return leadership to node1:

Linux:

sudo /opt/rppam/tools/rppam-cluster failover \
  --target node1 \
  --confirm

PowerShell:

& "C:\Program Files\Ravenphyre\RP-PAM\tools\rppam-cluster.exe" failover `
  --target node1 `
  --confirm

This performs a graceful leadership transfer: node2 releases the leader lock, node1 acquires it, and services are redirected with minimal disruption.


Advanced: Hard Failure Test

To simulate a sudden crash (power loss, kernel panic):

Linux

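# WARNING: This immediately kills the service's processes without graceful shutdown.
# systemctl kill targets only the rppam unit; pgrep -f rppam can also match unrelated processes.
sudo systemctl kill --signal=SIGKILL rppam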

Windows

# WARNING: This immediately terminates the process
Stop-Process -Name "Ravenphyre.RpPam.Host" -Force

After a hard failure, the failover timeline may be slightly longer (up to the full leaderLockTtlSeconds) because the killed node cannot release its lock gracefully. The standby must wait for the lock to expire.


Failover Test Checklist

Use this checklist to document your failover test results:

  • [ ] Cluster was healthy before test (all nodes green, quorum met)
  • [ ] Leader node stopped successfully
  • [ ] Standby promoted within _____ seconds (target: < 15)
  • [ ] VIP transferred correctly (if applicable)
  • [ ] API responds on new leader
  • [ ] No data loss confirmed (audit events, users, resources all present)
  • [ ] Active browser session survived failover
  • [ ] Original leader restarted and rejoined as standby
  • [ ] Final cluster status is healthy with all nodes
  • [ ] Test date: ______
  • [ ] Tested by: ______

Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| Standby does not promote | Redis unreachable from standby | Verify standby can reach Redis; check redis-cli ping from standby |
| Promotion takes > 30 seconds | Leader lock TTL too long | Reduce leaderLockTtlSeconds to 15 in rppam.config |
| Data missing after failover (local-sync) | Replication lag before failure | Check replicationStatus; consider reducing outboxPollIntervalSeconds |
| Session lost after failover | Redis not storing sessions | Verify redis.enabled is true and Redis is reachable from all nodes |
| Original node does not rejoin | Configuration mismatch | Ensure peerEndpoints on the restarted node includes the current leader |
| Split-brain (both nodes claim leader) | Redis connection issue during election | Check Redis connectivity from both nodes; consider adding a witness node |
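
One way to spot a possible split-brain is to ask each node which node it believes is the leader and compare the answers. The sketch below covers only the comparison step. It assumes each node's view is read from the leaderNode field of its own cluster health response, and the host names in the example are placeholders.

```shell
# leaders_agree LEADER...: succeeds only if every argument is the same
# non-empty value, i.e. all queried nodes report the same leader.
leaders_agree() {
    local first="$1" value
    # An empty value or jq's literal "null" means a node reported no leader.
    [ -n "$first" ] && [ "$first" != "null" ] || return 1
    for value in "$@"; do
        [ "$value" = "$first" ] || return 1
    done
    return 0
}

# Example use (node1.example and node2.example are placeholder host names):
#   v1=$(curl -sf http://node1.example:7101/system/health/cluster \
#          -H "Authorization: Bearer $ADMIN_JWT" | jq -r .leaderNode)
#   v2=$(curl -sf http://node2.example:7101/system/health/cluster \
#          -H "Authorization: Bearer $ADMIN_JWT" | jq -r .leaderNode)
#   leaders_agree "$v1" "$v2" || echo "Nodes disagree on the leader" >&2
```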


RP-PAM v1.0.0 -- Copyright 2026 Ravenphyre. All rights reserved.