Failover Testing¶

Section: High Availability | Article 19
Audience: System Administrators
Last Updated: 2026-04-07

Overview¶

After configuring your RP-PAM HA cluster, you should test failover to confirm that:

The standby node promotes to leader when the primary fails.
Promotion completes within 15 seconds.
No data is lost during the transition.
Active sessions survive the failover.
The original primary node rejoins as a standby when restarted.

This article provides step-by-step procedures for both Linux and Windows environments.

Prerequisites¶

Requirement	Detail
Working HA cluster	At least 2 nodes configured per HA Multi-Node Setup
Redis running	Connected and healthy (see Redis Setup)
Cluster status healthy	All nodes showing `"status": "healthy"`
Admin JWT token	A valid authentication token for API calls
A test user account	For verifying session continuity

Step 1: Verify Cluster Is Healthy Before Testing¶

Confirm the cluster is in a known-good state before introducing failures.

Linux:

curl -s http://localhost:7101/system/health/cluster \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .

PowerShell:

$cluster = Invoke-RestMethod -Uri "http://localhost:7101/system/health/cluster" `
  -Headers @{ Authorization = "Bearer $adminJwt" }
$cluster | ConvertTo-Json -Depth 5

Confirm the following:

clusterHealthy is true.
All nodes show "status": "healthy".
redisConnected is true.
quorumMet is true.

Note which node is reported as leaderNode (e.g., "node1"). That is the node you will stop in Step 3.

Step 2: Create a Test Session (Optional but Recommended)¶

To verify session continuity, log in to the web portal and leave the session open during the failover.

Open a browser and navigate to https://your-pam-server:7101 (or the VIP / load balancer address).
Log in with a test user account.
Navigate to any page (e.g., the Dashboard or Access Requests).
Keep this browser tab open throughout the test.

Step 3: Stop the Leader Node¶

Simulate a leader failure by stopping the RP-PAM service on the current leader.

On Linux (assuming node1 is the leader)¶

# On node1:
sudo systemctl stop rppam

# Record the time
echo "Leader stopped at: $(date -u +%Y-%m-%dT%H:%M:%SZ)"

On Windows (assuming node1 is the leader)¶

# On node1:
Stop-Service RpPam

# Record the time
Write-Host "Leader stopped at: $((Get-Date).ToUniversalTime().ToString('yyyy-MM-ddTHH:mm:ssZ'))"

Note: Use stop (graceful shutdown), not kill. A graceful shutdown simulates the most common failure mode. For a hard failure test, see Advanced: Hard Failure Test below.

Step 4: Verify Standby Promotes to Leader¶

Within 15 seconds, the standby node should detect the leader failure and promote itself.

Monitor From the Standby Node (node2)¶

Linux:

# Poll cluster status until node2 becomes leader
for i in $(seq 1 30); do
    LEADER=$(curl -sf http://localhost:7101/system/health/cluster \
      -H "Authorization: Bearer $ADMIN_JWT" 2>/dev/null | jq -r .leaderNode)
    echo "$(date -u +%H:%M:%S) — Leader: $LEADER"
    if [ "$LEADER" = "node2" ]; then
        echo "SUCCESS: node2 promoted to leader"
        break
    fi
    sleep 1
done

PowerShell:

# Poll cluster status until node2 becomes leader
for ($i = 1; $i -le 30; $i++) {
    try {
        $cluster = Invoke-RestMethod -Uri "http://localhost:7101/system/health/cluster" `
          -Headers @{ Authorization = "Bearer $adminJwt" } -ErrorAction Stop
        $leader = $cluster.leaderNode
        Write-Host "$(Get-Date -Format 'HH:mm:ss') - Leader: $leader"
        if ($leader -eq "node2") {
            Write-Host "SUCCESS: node2 promoted to leader"
            break
        }
    } catch {
        Write-Host "$(Get-Date -Format 'HH:mm:ss') - Waiting..."
    }
    Start-Sleep -Seconds 1
}

Expected Timeline¶

Time	Event
T+0s	Leader node stopped
T+5s	Standby detects missed heartbeat
T+10s	Leader lock expires in Redis
T+10-15s	Standby acquires leader lock and promotes
T+15s	Standby is serving as leader; VIP transferred (if configured)

If promotion takes longer than 15 seconds, check:

The leaderLockTtlSeconds value (default 30; can be reduced to 15 for faster failover)
Redis latency (high latency delays lock operations)
Network latency between nodes
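To compare a real run against the 15-second target, it helps to record how long promotion actually took rather than reading it off the polling output. The sketch below is one way to do that. The function names `time_promotion` and `current_leader` are hypothetical. The leader lookup is passed in as a command name so it can be replaced or stubbed. The measurement starts when polling starts, so begin polling on the standby immediately after stopping the leader, and expect whole-second resolution only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: polls until the expected node reports as leader, then
# prints the elapsed seconds. Returns 1 if the timeout is reached first.
#   $1 = node name expected to become leader (e.g. node2)
#   $2 = timeout in seconds
#   $3 = name of a command that prints the current leaderNode
time_promotion() {
    local expected=$1 timeout=$2 get_leader=$3
    local start elapsed leader
    start=$(date +%s)
    while :; do
        # A failed lookup (node still electing, API unreachable) counts as "no leader yet".
        leader=$("$get_leader" 2>/dev/null) || leader=""
        elapsed=$(( $(date +%s) - start ))
        if [ "$leader" = "$expected" ]; then
            echo "$elapsed"
            return 0
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "$elapsed"
            return 1
        fi
        sleep 1
    done
}

# Leader lookup using the cluster health endpoint shown in this article.
current_leader() {
    curl -sf http://localhost:7101/system/health/cluster \
      -H "Authorization: Bearer $ADMIN_JWT" | jq -r .leaderNode
}

# Usage (on node2, right after stopping node1):
#   secs=$(time_promotion node2 30 current_leader) \
#     && echo "Promoted in ${secs}s (target: < 15)" \
#     || echo "No promotion after ${secs}s"
```

The printed value can go straight into the "Standby promoted within _____ seconds" line of the checklist.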

Step 5: Verify No Data Loss¶

After the standby has promoted, verify that recent data is intact.

Linux:

# List recent audit events (these should include events from before the failover)
curl -s "http://localhost:7101/api/v1/audit/events?limit=10" \
  -H "Authorization: Bearer $ADMIN_JWT" | jq '.items[].timestamp'

# Check a known resource or user
curl -s "http://localhost:7101/api/v1/users?limit=5" \
  -H "Authorization: Bearer $ADMIN_JWT" | jq '.items[].username'

PowerShell:

# List recent audit events
$events = Invoke-RestMethod -Uri "http://localhost:7101/api/v1/audit/events?limit=10" `
  -Headers @{ Authorization = "Bearer $adminJwt" }
$events.items | Select-Object timestamp, eventType | Format-Table

# Check a known resource or user
$users = Invoke-RestMethod -Uri "http://localhost:7101/api/v1/users?limit=5" `
  -Headers @{ Authorization = "Bearer $adminJwt" }
$users.items | Select-Object username | Format-Table
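The spot checks above confirm that data is present after failover, but not that nothing disappeared. A stricter check is to capture a snapshot before stopping the leader and compare it with one taken after promotion. The sketch below assumes you saved one value per line to two files, for example audit timestamps or usernames extracted with `jq -r`. The function name and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints every line that is present in the "before"
# snapshot but absent from the "after" snapshot. Returns non-zero if any
# line is missing.
#   $1 = file captured before the leader was stopped
#   $2 = file captured after the standby promoted
missing_after_failover() {
    local before=$1 after=$2 missing
    # -F fixed strings, -x whole-line match, -v invert: keep lines of
    # "before" that match no line of "after". grep exits 1 when nothing
    # is printed, which is the success case here, hence "|| true".
    missing=$(grep -Fxv -f "$after" "$before" || true)
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing"
        return 1
    fi
    return 0
}

# Usage:
#   (before Step 3)  curl -s ".../api/v1/users?limit=5" -H ... | jq -r '.items[].username' > before.txt
#   (after Step 4)   curl -s ".../api/v1/users?limit=5" -H ... | jq -r '.items[].username' > after.txt
#   missing_after_failover before.txt after.txt && echo "No records missing"
```

For audit events, take the "after" snapshot with a larger `limit` than the "before" snapshot, since events generated by the failover itself may push older entries out of a small window.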

For `local-sync` Mode¶

If your cluster uses local-sync database mode, additionally verify that the standby's database contains all expected records:

# Check replication status (run on node2 after promotion)
curl -s http://localhost:7101/system/health/database \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .replicationStatus

Expected: "replicationStatus": "in-sync" or a minimal lag (< 1 second).

Step 6: Verify Active Sessions Survive¶

Return to the browser tab you left open in Step 2.

Refresh the page. It should load normally without requiring re-authentication.
Navigate to a different page. Verify the portal is fully functional.
Submit a test action (e.g., view a resource detail). Confirm the action completes successfully.

If using a VIP or load balancer, the browser should connect seamlessly to the new leader. If connecting directly to node1, you will need to change the URL to node2.

Note: If the session does not survive, check that Redis is properly storing session tokens. Sessions are cached in Redis and are available to all nodes regardless of which node originally issued the token.

Step 7: Restart the Original Leader¶

Restart the stopped node. It should automatically rejoin the cluster as a standby.

Linux (on node1)¶

sudo systemctl start rppam

# Wait for it to become healthy
until [ "$(curl -sf http://localhost:7101/system/health/ping | jq -r .status 2>/dev/null)" = "healthy" ]; do
    sleep 2
done
echo "node1 is healthy and has rejoined the cluster"

Windows (on node1)¶

Start-Service RpPam

# Wait for it to become healthy
do {
    Start-Sleep -Seconds 2
    try {
        $health = Invoke-RestMethod -Uri "http://localhost:7101/system/health/ping" -ErrorAction Stop
    } catch {
        $health = $null   # Service is not accepting connections yet
    }
} while ($health.status -ne "healthy")

Write-Host "node1 is healthy and has rejoined the cluster"

Step 8: Verify Final Cluster Status¶

From either node, confirm the cluster is fully healthy with both nodes online.

Linux:

curl -s http://localhost:7101/system/health/cluster \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .

PowerShell:

$cluster = Invoke-RestMethod -Uri "http://localhost:7101/system/health/cluster" `
  -Headers @{ Authorization = "Bearer $adminJwt" }
$cluster | ConvertTo-Json -Depth 5

Expected:

Field	Value
`clusterHealthy`	`true`
`leaderNode`	`"node2"` (the promoted node remains leader)
node1 `role`	`"standby"`
node1 `status`	`"healthy"`
node2 `role`	`"primary"`
node2 `status`	`"healthy"`

Note: node2 remains the leader after node1 rejoins. RP-PAM does not automatically fail back. If you want node1 to be the leader again, you can trigger a manual failover (see below).

Optional: Manual Failover (Fail Back)¶

If you want to return leadership to node1:

Linux:

sudo /opt/rppam/tools/rppam-cluster failover \
  --target node1 \
  --confirm

PowerShell:

& "C:\Program Files\Ravenphyre\RP-PAM\tools\rppam-cluster.exe" failover `
  --target node1 `
  --confirm

This performs a graceful leadership transfer: node2 releases the leader lock, node1 acquires it, and services are redirected with minimal disruption.

Advanced: Hard Failure Test¶

To simulate a sudden crash (power loss, kernel panic):

Linux¶

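# WARNING: This immediately kills the service's processes without graceful shutdown.
# systemctl kill targets only the rppam unit; pgrep -f would match any command line containing "rppam".
sudo systemctl kill --signal=SIGKILL rppam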

Windows¶

# WARNING: This immediately terminates the process
Stop-Process -Name "Ravenphyre.RpPam.Host" -Force

After a hard failure, the failover timeline may be slightly longer (up to the full leaderLockTtlSeconds) because the killed node cannot release its lock gracefully. The standby must wait for the lock to expire.
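Since a crashed leader can hold its lock for up to the full TTL, it is worth comparing the configured TTL with your promotion target before running a hard failure test. The sketch below does only that arithmetic. The function name is hypothetical, and the values in the usage comment (default TTL of 30, target of 15) are the ones quoted in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compares the leader lock TTL with the promotion target.
#   $1 = leaderLockTtlSeconds as configured
#   $2 = promotion target in seconds (this article uses 15)
# Returns 0 if a hard failure can still meet the target, 1 otherwise.
hard_failure_within_target() {
    local ttl=$1 target=$2
    if [ "$ttl" -le "$target" ]; then
        echo "OK: lock expires within ${ttl}s, inside the ${target}s target"
        return 0
    fi
    echo "WARN: a crashed leader can hold the lock for up to ${ttl}s, $(( ttl - target ))s past the ${target}s target"
    return 1
}

# Usage:
#   hard_failure_within_target 30 15   # default TTL against the 15-second target
#   hard_failure_within_target 15 15   # TTL reduced to 15
```

This ignores heartbeat detection time and Redis latency, so treat a passing result as a necessary condition rather than a guarantee.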

Failover Test Checklist¶

Use this checklist to document your failover test results:

[ ] Cluster was healthy before test (all nodes green, quorum met)
[ ] Leader node stopped successfully
[ ] Standby promoted within _____ seconds (target: < 15)
[ ] VIP transferred correctly (if applicable)
[ ] API responds on new leader
[ ] No data loss confirmed (audit events, users, resources all present)
[ ] Active browser session survived failover
[ ] Original leader restarted and rejoined as standby
[ ] Final cluster status is healthy with all nodes
[ ] Test date: ______
[ ] Tested by: ______

Troubleshooting¶

Problem	Cause	Solution
Standby does not promote	Redis unreachable from standby	Verify standby can reach Redis; check `redis-cli ping` from standby
Promotion takes > 30 seconds	Leader lock TTL too long	Reduce `leaderLockTtlSeconds` to 15 in `rppam.config`
Data missing after failover (`local-sync`)	Replication lag before failure	Check `replicationStatus`; consider reducing `outboxPollIntervalSeconds`
Session lost after failover	Redis not storing sessions	Verify `redis.enabled` is `true` and Redis is reachable from all nodes
Original node does not rejoin	Configuration mismatch	Ensure `peerEndpoints` on the restarted node includes the current leader
Split-brain (both nodes claim leader)	Redis connection issue during election	Check Redis connectivity from both nodes; consider adding a witness node

Next Steps¶

Service Account Setup (AD) -- Configure Active Directory service accounts
Witness Node Setup -- Add a tiebreaker for even-number clusters
VIP Failover Configuration -- Configure a virtual IP
