HA and Cluster Troubleshooting¶
Section: Troubleshooting | Article 50
Audience: System Administrators
Last Updated: 2026-04-07
Overview¶
This article covers issues specific to RP-PAM High Availability (HA) multi-node deployments. For general troubleshooting, see General Troubleshooting. For HA setup, see HA Multi-Node Setup.
Checking Cluster Status¶
Before troubleshooting, check the current cluster state.
Via REST API¶
PowerShell:
Invoke-RestMethod -Uri "https://rppam.corp.local:7101/api/v1/admin/cluster/status" `
-Headers @{ Authorization = "Bearer $adminJwt" } | ConvertTo-Json -Depth 3
curl:
curl -s "https://rppam.corp.local:7101/api/v1/admin/cluster/status" \
-H "Authorization: Bearer $ADMIN_JWT" | jq .
Healthy cluster response:
{
  "clusterId": "cluster-001",
  "status": "healthy",
  "leaderNodeId": "node-1",
  "nodes": [
    {
      "nodeId": "node-1",
      "hostname": "rppam-01.corp.local",
      "role": "leader",
      "status": "healthy",
      "lastHeartbeat": "2026-04-07T14:30:00Z"
    },
    {
      "nodeId": "node-2",
      "hostname": "rppam-02.corp.local",
      "role": "standby",
      "status": "healthy",
      "lastHeartbeat": "2026-04-07T14:30:01Z"
    },
    {
      "nodeId": "node-3",
      "hostname": "rppam-03.corp.local",
      "role": "standby",
      "status": "healthy",
      "lastHeartbeat": "2026-04-07T14:29:59Z"
    }
  ],
  "quorum": true,
  "redisStatus": "connected"
}
Per-Node Health¶
Check each node individually:
# Node 1
curl -s http://rppam-01.corp.local:7101/system/health/ping | jq .
# Node 2
curl -s http://rppam-02.corp.local:7101/system/health/ping | jq .
# Node 3
curl -s http://rppam-03.corp.local:7101/system/health/ping | jq .
Nodes Cannot Communicate¶
Symptoms¶
- Cluster status shows nodes as "unreachable"
- Heartbeat timestamps are stale (more than 30 seconds old)
- Log shows `NodeCommunicationException` or `Heartbeat timeout for node`
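The 30-second staleness threshold from the symptoms above can be checked mechanically. The following is a minimal sketch, assuming GNU date (for the -d option) and ISO 8601 UTC timestamps like the lastHeartbeat values in the cluster status response. The function names heartbeat_age and is_stale are illustrative and not part of RP-PAM.

```shell
# Sketch: flag a heartbeat as stale when it is more than 30 seconds old.
# Assumption: GNU date is available (it parses ISO 8601 timestamps via -d).
STALE_AFTER_SECONDS=30

# heartbeat_age LAST_HEARTBEAT [NOW] -> prints the age in whole seconds.
# NOW defaults to the current time; pass it explicitly to compare two timestamps.
heartbeat_age() {
  local last_epoch now_epoch
  last_epoch=$(date -u -d "$1" +%s) || return 2
  now_epoch=$(date -u -d "${2:-now}" +%s) || return 2
  echo $(( now_epoch - last_epoch ))
}

# is_stale LAST_HEARTBEAT [NOW] -> exit status 0 if older than the threshold.
is_stale() {
  local age
  age=$(heartbeat_age "$@") || return 2
  (( age > STALE_AFTER_SECONDS ))
}
```

For example, is_stale "2026-04-07T14:29:59Z" && echo "heartbeat is stale" compares a lastHeartbeat value against the current time on the machine running the check, so that machine's own clock should be synchronized.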
Diagnosis¶
Step 1: Test inter-node connectivity.
From Node 1 to Node 2:
# PowerShell
Test-NetConnection -ComputerName "rppam-02.corp.local" -Port 7101
Test-NetConnection -ComputerName "rppam-02.corp.local" -Port 7102 # Cluster communication port
Step 2: Check firewall rules. RP-PAM uses two ports for HA:
| Port | Purpose | Direction |
|---|---|---|
| 7101 | API and health checks | Inbound from clients and other nodes |
| 7102 | Cluster communication (gossip, heartbeats, leader election) | Inbound from other nodes only |
Windows:
Linux:
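On Linux hosts that use firewalld, the two rules in the table could be expressed roughly as in the sketch below. It only prints the commands so they can be reviewed before running. It assumes that both ports are TCP (consistent with the Test-NetConnection checks in Step 1), that firewalld is the active firewall, and that peers are identified by IPv4 address. The function name and the example addresses are hypothetical.

```shell
# Sketch: print (not run) firewalld commands for the two RP-PAM HA ports.
# Assumptions: firewalld is in use, both ports are TCP, peers are IPv4 addresses.
API_PORT=7101
CLUSTER_PORT=7102

# print_firewall_commands PEER_IP...
print_firewall_commands() {
  local peer
  # API and health checks: inbound from clients and other nodes.
  echo "firewall-cmd --permanent --add-port=${API_PORT}/tcp"
  # Cluster communication: inbound from the listed peer nodes only.
  for peer in "$@"; do
    echo "firewall-cmd --permanent --add-rich-rule='rule family=\"ipv4\" source address=\"${peer}\" port port=\"${CLUSTER_PORT}\" protocol=\"tcp\" accept'"
  done
  # Apply the permanent rules to the running firewall.
  echo "firewall-cmd --reload"
}
```

Running print_firewall_commands 10.0.0.2 10.0.0.3 on Node 1 (with the real addresses of Node 2 and Node 3 substituted) prints one rule for port 7101, one rich rule per peer for port 7102, and a reload command.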
Step 3: Verify cluster configuration. Each node must know about the other nodes. Check rppam.config on each node:
{
  "cluster": {
    "nodeId": "node-1",
    "clusterPort": 7102,
    "peers": [
      "rppam-02.corp.local:7102",
      "rppam-03.corp.local:7102"
    ]
  }
}
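Ensure the peers list on each node names every other node in the cluster, with the correct hostname and cluster port, and that the lists agree with each other across all nodes.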
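Comparing peers lists by eye is error-prone once there are several nodes. The sketch below extracts the peer entries from a copy of rppam.config and compares them with the set you expect for that node. It is deliberately rough: it matches any quoted "host:port" string in the file instead of parsing the JSON, and the function names are illustrative.

```shell
# Sketch: compare the peers in a copy of rppam.config with an expected set.
# Assumption: peer entries are the only quoted strings of the form "host:port".

# list_peers CONFIG_FILE -> prints each host:port entry, one per line, sorted.
list_peers() {
  grep -oE '"[A-Za-z0-9.-]+:[0-9]+"' "$1" | tr -d '"' | sort
}

# peers_match CONFIG_FILE EXPECTED_PEER... -> exit status 0 if the file lists
# exactly the expected peers (order does not matter).
peers_match() {
  local config=$1
  shift
  local expected
  expected=$(printf '%s\n' "$@" | sort)
  [ "$(list_peers "$config")" = "$expected" ]
}
```

For Node 1 in the example above, peers_match rppam.config rppam-02.corp.local:7102 rppam-03.corp.local:7102 should succeed. Where jq is installed, jq -r '.cluster.peers[]' rppam.config is a more precise way to list the entries.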
Leader Election Failing¶
Symptoms¶
- Cluster status shows `"leaderNodeId": null`
- All nodes report role as "standby" or "candidate"
- Log shows `Leader election timed out` or `No quorum available`
Cause¶
Leader election requires a quorum (a strict majority of nodes). If half or more of the nodes are down or unreachable, no leader can be elected.
| Total Nodes | Quorum Requirement | Tolerated Failures |
|---|---|---|
| 2 | 2 (both must agree) | 0 |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
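The rows in the table follow from the majority rule: quorum is floor(n / 2) + 1, and the number of tolerated failures is whatever remains. A small sketch of that arithmetic, with illustrative function names:

```shell
# Sketch: majority-quorum arithmetic for a cluster of TOTAL_NODES nodes.

# quorum_size TOTAL_NODES -> smallest strict majority: floor(n / 2) + 1
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# tolerated_failures TOTAL_NODES -> nodes that can fail while a majority remains
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# has_quorum TOTAL_NODES REACHABLE_NODES -> exit status 0 if a majority is reachable
has_quorum() {
  (( $2 >= $1 / 2 + 1 ))
}
```

For example, has_quorum 3 1 fails because one reachable node out of three is not a majority. The same arithmetic shows why a fourth node adds no fault tolerance over three: quorum_size 4 is 3, so only one failure is tolerated.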
Solution¶
Step 1: Ensure enough nodes are online:
for node in rppam-01 rppam-02 rppam-03; do
  echo -n "$node: "
  curl -s -o /dev/null -w "%{http_code}" "http://$node.corp.local:7101/system/health/ping"
  echo
done
Step 2: If nodes are online but election still fails, check Redis connectivity (Redis is required for distributed locking):
Step 3: If a node is stuck in "candidate" state, restart it:
Step 4: If the cluster is in a persistent election loop, check for clock skew between nodes. Leader election uses timestamps for lease management:
# Check time on each node
for node in rppam-01 rppam-02 rppam-03; do
  echo -n "$node: "
  ssh "$node" date -u
done
Ensure NTP is synchronized across all nodes (maximum acceptable skew: 5 seconds).
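To compare the clocks against the 5-second limit without reading timestamps by eye, collect each node's time as epoch seconds and compute the spread. The sketch below assumes the values were gathered with something like ssh "$node" date -u +%s. The function names are illustrative.

```shell
# Sketch: check the spread of node clocks against the 5-second limit.
MAX_SKEW_SECONDS=5

# max_skew EPOCH... -> prints the spread (max - min) of the given epoch seconds
max_skew() {
  local min=$1 max=$1 t
  for t in "$@"; do
    if (( t < min )); then min=$t; fi
    if (( t > max )); then max=$t; fi
  done
  echo $(( max - min ))
}

# skew_ok EPOCH... -> exit status 0 if the spread is within MAX_SKEW_SECONDS
skew_ok() {
  (( $(max_skew "$@") <= MAX_SKEW_SECONDS ))
}

# Example use (not run here):
#   times=()
#   for node in rppam-01 rppam-02 rppam-03; do
#     times+=("$(ssh "$node" date -u +%s)")
#   done
#   skew_ok "${times[@]}" || echo "clock skew exceeds ${MAX_SKEW_SECONDS}s"
```

Because the ssh calls run one after another, connection latency is included in the measured spread, so this check can overestimate skew. A reported failure is a prompt to inspect NTP status on each node, not proof of a clock problem.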
Split-Brain¶
Symptoms¶
- Two nodes both claim to be the leader
- Clients get inconsistent responses depending on which node they reach
- Log shows `Split-brain detected` warnings
Cause¶
Split-brain occurs when network partitioning causes nodes to lose communication with each other but remain individually healthy. Each partition may elect its own leader.
Immediate Response¶
Step 1: Identify the situation:
# Check each node's view of the cluster
for node in rppam-01 rppam-02 rppam-03; do
  echo "=== $node ==="
  curl -s "https://$node.corp.local:7101/api/v1/admin/cluster/status" \
    -H "Authorization: Bearer $ADMIN_JWT" | jq '{leaderNodeId, quorum}'
done
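Once each node's view has been collected, the comparison itself is simple: split-brain is suspected when the nodes report more than one distinct leader. The sketch below assumes the per-node results have been reduced to lines of the form "NODE LEADER_ID" (for example by adding -r '.leaderNodeId' to the jq call). That line format and the function names are assumptions for illustration.

```shell
# Sketch: detect disagreement about the leader from per-node reports.
# Input on stdin: one "NODE LEADER_ID" line per node; "null" means no leader seen.

# count_leaders -> prints the number of distinct non-null leader IDs reported
count_leaders() {
  awk '$2 != "" && $2 != "null" { seen[$2] = 1 }
       END { n = 0; for (id in seen) n++; print n }'
}

# split_brain_suspected -> exit status 0 when more than one leader is reported
split_brain_suspected() {
  [ "$(count_leaders)" -gt 1 ]
}
```

A count of 0 points to the leader election problem described earlier, 1 is the normal case, and 2 or more means the nodes disagree and the steps below apply.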
Step 2: RP-PAM's split-brain protection uses the witness node (or Redis-based fencing) to resolve conflicts. If the protection did not trigger:
- Stop the minority partition (the side with fewer nodes).
- Verify the majority partition is healthy.
- Fix the network issue that caused the partition.
- Restart the stopped node(s). They will rejoin as standbys.
Prevention¶
- Use a witness node for 2-node clusters — see Witness Node Setup
- Use 3 or 5 nodes for automatic quorum-based protection
- Ensure reliable networking between nodes (same subnet recommended)
Redis Connection Issues¶
Symptoms¶
- Cluster operations fail or are degraded
- Log shows `RedisConnectionException` or `Redis connection refused`
- Session data not shared between nodes (users logged out when hitting a different node)
Diagnosis¶
Step 1: Test Redis connectivity from the RP-PAM server:
# PowerShell (using Test-NetConnection)
Test-NetConnection -ComputerName "redis.corp.local" -Port 6379
Step 2: If Redis requires authentication:
Step 3: Check Redis configuration in rppam.config:
{
  "cluster": {
    "redis": {
      "connectionString": "redis.corp.local:6379,password=your-redis-password,ssl=false,abortConnect=false",
      "instanceName": "rppam:"
    }
  }
}
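When testing Redis by hand, it helps to use exactly the host, port, and options that RP-PAM is configured with. The sketch below splits a connection string of the form shown above into its parts. It assumes the first comma-separated item is host:port and the rest are key=value options, and it would mis-split a password that contains a comma. The function names are illustrative.

```shell
# Sketch: pull the endpoint and options out of a Redis connection string.
# Assumption: format is "host:port,key=value,key=value,..." as in the example.

# redis_endpoint CONNECTION_STRING -> prints "HOST PORT"
redis_endpoint() {
  local endpoint=${1%%,*}
  local host=${endpoint%%:*}
  local port=6379   # Redis default, used when no port is given
  if [[ $endpoint == *:* ]]; then
    port=${endpoint##*:}
  fi
  echo "$host $port"
}

# redis_option CONNECTION_STRING KEY -> prints the value of KEY; exit 1 if absent
redis_option() {
  local items item
  IFS=, read -r -a items <<< "$1"
  for item in "${items[@]}"; do
    if [[ $item == "$2="* ]]; then
      echo "${item#*=}"
      return 0
    fi
  done
  return 1
}
```

The extracted values can then be passed to redis-cli, for example with -h and -p for the endpoint and the REDISCLI_AUTH environment variable for the password, followed by the ping command. A healthy server replies PONG. If ssl=true is set, redis-cli needs its --tls option as well.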
Common Redis Issues¶
| Issue | Cause | Solution |
|---|---|---|
| Connection refused | Redis not running or wrong port | Start Redis; verify port |
| Authentication failed | Wrong password | Update password in rppam.config |
| SSL/TLS error | ssl=true but Redis doesn't have TLS | Set ssl=false or configure Redis TLS |
| Timeout | Redis overloaded or network latency | Check Redis memory usage; optimize eviction policy |
| Data lost after Redis restart | Redis not configured for persistence | Enable AOF or RDB persistence in redis.conf |
Check Redis health:
redis-cli -h redis.corp.local info server | head -10
redis-cli -h redis.corp.local info memory | head -10
redis-cli -h redis.corp.local info clients | head -10
Failover Not Completing¶
Symptoms¶
- Leader node went down but standby did not promote
- Clients receive errors or timeouts instead of being served by a standby
- Log shows `Failover initiated` but never `Failover completed`
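The last symptom can be checked against a log file directly: a failover is still open if the most recent "Failover initiated" line has no "Failover completed" line after it. The sketch below only relies on those two message strings. The log file location and line format are not specified here, so the path is passed in as an argument, and the function name is illustrative.

```shell
# Sketch: report whether the most recent failover in a log never completed.
# Assumption: the log contains the literal messages "Failover initiated" and
# "Failover completed"; everything else about the line format is ignored.

# failover_incomplete LOG_FILE -> exit status 0 if a failover is still pending
failover_incomplete() {
  awk '
    /Failover initiated/ { pending = 1 }
    /Failover completed/ { pending = 0 }
    END { exit (pending ? 0 : 1) }
  ' "$1"
}
```

Run it against the RP-PAM log on each standby, for example failover_incomplete /path/to/rppam.log && echo "failover did not complete", substituting the actual log path for your installation.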
Diagnosis¶
Step 1: Check the standby nodes:
curl -s "https://rppam-02.corp.local:7101/api/v1/admin/cluster/status" \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .
Step 2: Common failover blockers:
| Blocker | Cause | Solution |
|---|---|---|
| No quorum | Not enough nodes for majority vote | Bring additional nodes online |
| Redis unreachable | Cannot acquire distributed lock | Fix Redis connectivity |
| Standby not healthy | Standby has its own issues (DB down, etc.) | Fix the standby's health first |
| VIP not moving | VIP failover script failed | Check VIP configuration — see VIP Failover |
Step 3: Force a failover (if automatic failover is stuck):
PowerShell:
Invoke-RestMethod -Uri "https://rppam-02.corp.local:7101/api/v1/admin/cluster/promote" `
-Method POST `
-Headers @{ Authorization = "Bearer $adminJwt" }
curl:
curl -s -X POST "https://rppam-02.corp.local:7101/api/v1/admin/cluster/promote" \
-H "Authorization: Bearer $ADMIN_JWT" | jq .
This forces the target node to attempt promotion to leader.
Step 4: After the old leader comes back online, it should automatically rejoin as a standby. If it tries to reclaim leadership:
# Demote the old leader (run on the old leader node)
curl -s -X POST "https://rppam-01.corp.local:7101/api/v1/admin/cluster/demote" \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .
VIP Not Moving During Failover¶
Symptoms¶
- Failover completed (new leader elected) but clients still cannot connect
- VIP (Virtual IP) still points to the old leader
Diagnosis¶
Step 1: Check which node currently holds the VIP:
Windows:
Linux:
Step 2: Check the VIP failover script logs on the new leader node. RP-PAM executes a script during failover:
Windows:
Linux:
Step 3: Manually move the VIP if needed — see VIP Failover Configuration.
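For Step 1 on Linux, one way to see whether a node currently holds the VIP is to look for the address in the output of ip -o -4 addr show. The sketch below reads that output on stdin so it can be run on each node in turn. It assumes an IPv4 VIP, and the function name and the address used in the example are hypothetical.

```shell
# Sketch: check whether a VIP appears in `ip -o -4 addr show` output.
# In that one-line-per-address format, field 4 is ADDRESS/PREFIX.

# holds_vip VIP -> reads ip output on stdin; exit status 0 if VIP is assigned
holds_vip() {
  awk -v vip="$1" '
    { split($4, parts, "/"); if (parts[1] == vip) found = 1 }
    END { exit (found ? 0 : 1) }
  '
}
```

On each node, ip -o -4 addr show | holds_vip 10.0.0.100 && echo "this node holds the VIP" (with your real VIP substituted) should succeed on exactly one node. If it succeeds on the old leader after failover, the VIP did not move.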
Troubleshooting Summary¶
| Problem | First Check | Second Check |
|---|---|---|
| Nodes unreachable | Port 7101 + 7102 connectivity | Firewall rules |
| No leader elected | Quorum met? (enough nodes) | Redis connectivity |
| Split-brain | Network between nodes | Witness node configured? |
| Redis errors | redis-cli ping | Connection string in config |
| Failover stuck | Standby health | Redis lock availability |
| VIP not moving | VIP failover script logs | Manual VIP check with ip addr |
Next Steps¶
- HA Multi-Node Setup — HA architecture and setup
- Redis Configuration for HA — Redis setup guide
- VIP Failover Configuration — VIP setup and testing
- Failover Testing — How to test your HA setup
RP-PAM v1.0.0 — Copyright 2026 Ravenphyre. All rights reserved.