HA and Cluster Troubleshooting

Section: Troubleshooting | Article 50
Audience: System Administrators
Last Updated: 2026-04-07

Overview

This article covers issues specific to RP-PAM High Availability (HA) multi-node deployments. For general troubleshooting, see General Troubleshooting. For HA setup, see HA Multi-Node Setup.

Checking Cluster Status

Before troubleshooting, check the current cluster state.

Via REST API

PowerShell:

Invoke-RestMethod -Uri "https://rppam.corp.local:7101/api/v1/admin/cluster/status" `
    -Headers @{ Authorization = "Bearer $adminJwt" } | ConvertTo-Json -Depth 3

curl:

curl -s "https://rppam.corp.local:7101/api/v1/admin/cluster/status" \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .

Healthy cluster response:

{
  "clusterId": "cluster-001",
  "status": "healthy",
  "leaderNodeId": "node-1",
  "nodes": [
    {
      "nodeId": "node-1",
      "hostname": "rppam-01.corp.local",
      "role": "leader",
      "status": "healthy",
      "lastHeartbeat": "2026-04-07T14:30:00Z"
    },
    {
      "nodeId": "node-2",
      "hostname": "rppam-02.corp.local",
      "role": "standby",
      "status": "healthy",
      "lastHeartbeat": "2026-04-07T14:30:01Z"
    },
    {
      "nodeId": "node-3",
      "hostname": "rppam-03.corp.local",
      "role": "standby",
      "status": "healthy",
      "lastHeartbeat": "2026-04-07T14:29:59Z"
    }
  ],
  "quorum": true,
  "redisStatus": "connected"
}

Per-Node Health

Check each node individually:

# Node 1
curl -s https://rppam-01.corp.local:7101/system/health/ping | jq .

# Node 2
curl -s https://rppam-02.corp.local:7101/system/health/ping | jq .

# Node 3
curl -s https://rppam-03.corp.local:7101/system/health/ping | jq .

Nodes Cannot Communicate

Symptoms

Cluster status shows nodes as "unreachable"
Heartbeat timestamps are stale (more than 30 seconds old)
Log shows NodeCommunicationException or Heartbeat timeout for node
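
The 30-second staleness rule can be checked mechanically against the lastHeartbeat values in the cluster status response. The following is a minimal sketch, assuming GNU date (for the -d option) and ISO 8601 UTC timestamps as shown in the sample response. The function names and the threshold variable are illustrative and not part of RP-PAM.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: decide whether a lastHeartbeat timestamp is stale.
# Assumes GNU date and ISO 8601 UTC timestamps.

STALE_AFTER_SECONDS=30  # threshold taken from the symptom list

# heartbeat_age <iso-timestamp> [now-epoch]
# Prints the heartbeat age in whole seconds.
heartbeat_age() {
  local ts="$1"
  local now="${2:-$(date -u +%s)}"
  local beat
  beat=$(date -u -d "$ts" +%s) || return 1
  echo $(( now - beat ))
}

# is_stale <iso-timestamp> [now-epoch]
# Succeeds (exit 0) when the heartbeat is older than the threshold.
is_stale() {
  local age
  age=$(heartbeat_age "$@") || return 2
  (( age > STALE_AFTER_SECONDS ))
}

# Example with a fixed "now" so the result is reproducible.
now=$(date -u -d "2026-04-07T14:31:00Z" +%s)
if is_stale "2026-04-07T14:30:00Z" "$now"; then
  echo "stale"
else
  echo "fresh"
fi
```

In practice, omit the second argument so the current time is used, and pass each node's lastHeartbeat value in turn.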

Diagnosis

Step 1: Test inter-node connectivity.

From Node 1 to Node 2:

# PowerShell
Test-NetConnection -ComputerName "rppam-02.corp.local" -Port 7101
Test-NetConnection -ComputerName "rppam-02.corp.local" -Port 7102  # Cluster communication port

# Linux
nc -zv rppam-02.corp.local 7101
nc -zv rppam-02.corp.local 7102

Step 2: Check firewall rules. RP-PAM uses two ports for HA:

Port	Purpose	Direction
7101	API and health checks	Inbound from clients and other nodes
7102	Cluster communication (gossip, heartbeats, leader election)	Inbound from other nodes only

Windows:

Get-NetFirewallRule -DisplayName "*RP-PAM*" | Format-Table DisplayName, Enabled, Direction

Linux:

sudo ss -tlnp | grep -E "7101|7102"
sudo ufw status          # Ubuntu
sudo firewall-cmd --list-ports  # RHEL

Step 3: Verify cluster configuration. Each node must list every other node as a peer. Check rppam.config on each node:

{
  "cluster": {
    "nodeId": "node-1",
    "clusterPort": 7102,
    "peers": [
      "rppam-02.corp.local:7102",
      "rppam-03.corp.local:7102"
    ]
  }
}

Ensure each node's peers list names every other node in the cluster, as in the node-1 example above, and that hostnames and the cluster port match across all nodes.
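
The peers check can be sketched as a set comparison: a node's peers should equal the full membership minus the node itself. This is an illustrative helper only. The MEMBERS array and the peers_ok function are assumptions for this example, and the peer values still have to be copied from each node's rppam.config by hand.

```bash
#!/usr/bin/env bash
# Full cluster membership as host:clusterPort (example values from this article).
MEMBERS=(
  "rppam-01.corp.local:7102"
  "rppam-02.corp.local:7102"
  "rppam-03.corp.local:7102"
)

# peers_ok <self> <peer> [<peer> ...]
# Succeeds when the given peers are exactly MEMBERS minus <self>.
peers_ok() {
  local self="$1"; shift
  local expected actual m
  expected=$(for m in "${MEMBERS[@]}"; do
               [[ "$m" == "$self" ]] || printf '%s\n' "$m"
             done | sort)
  actual=$(printf '%s\n' "$@" | sort)
  [[ "$expected" == "$actual" ]]
}

# Example: the node-1 configuration shown in this article.
if peers_ok "rppam-01.corp.local:7102" \
     "rppam-02.corp.local:7102" "rppam-03.corp.local:7102"; then
  echo "node-1 peers list is complete"
else
  echo "node-1 peers list is missing or has extra entries"
fi
```

A typo in a hostname or a wrong port in any peer entry makes the comparison fail, which is the same class of mistake the manual check is looking for.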

Leader Election Failing

Symptoms

Cluster status shows "leaderNodeId": null
All nodes report role as "standby" or "candidate"
Log shows Leader election timed out or No quorum available

Cause

Leader election requires a quorum (a strict majority of nodes). If half or more of the nodes are down or unreachable, no leader can be elected. This is why a 2-node cluster cannot tolerate any failure.

Total Nodes	Quorum Requirement	Tolerated Failures
2	2 (both must agree)	0
3	2	1
5	3	2

Solution

Step 1: Ensure enough nodes are online:

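for node in rppam-01 rppam-02 rppam-03; do
  echo -n "$node: "
  # --max-time keeps an unreachable node from stalling the loop; curl prints 000 when no response is received
  curl -s -o /dev/null --max-time 5 -w "%{http_code}" "https://$node.corp.local:7101/system/health/ping"
  echo
done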

Step 2: If nodes are online but election still fails, check Redis connectivity (Redis is required for distributed locking):

redis-cli -h redis.corp.local -p 6379 ping

Step 3: If a node is stuck in "candidate" state, restart it:

sudo systemctl restart rppam

Step 4: If the cluster is in a persistent election loop, check for clock skew between nodes. Leader election uses timestamps for lease management:

# Check time on each node
for node in rppam-01 rppam-02 rppam-03; do
  echo -n "$node: "
  ssh "$node.corp.local" date -u
done

Ensure NTP is synchronized across all nodes (maximum acceptable skew: 5 seconds).
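
To compare the collected times against the 5-second limit, the spread between the earliest and latest clock can be computed from epoch seconds. A minimal sketch, assuming the values were gathered with something like `ssh <node> date -u +%s`; the function name and the sample values are illustrative.

```bash
#!/usr/bin/env bash
MAX_SKEW_SECONDS=5  # limit stated in this article

# max_skew <epoch> [<epoch> ...]
# Prints the difference between the largest and smallest timestamp.
max_skew() {
  local min="$1" max="$1" t
  for t in "$@"; do
    (( t < min )) && min=$t
    (( t > max )) && max=$t
  done
  echo $(( max - min ))
}

# Sample epoch values standing in for three nodes.
skew=$(max_skew 1775572200 1775572203 1775572198)
if (( skew > MAX_SKEW_SECONDS )); then
  echo "skew ${skew}s exceeds limit"
else
  echo "skew ${skew}s within limit"
fi
```

Because the nodes are queried one after another, SSH connection time adds to the apparent skew. Treat a result near the limit as a prompt to check NTP status on each node rather than as a precise measurement.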

Split-Brain

Symptoms

Two nodes both claim to be the leader
Clients get inconsistent responses depending on which node they reach
Log shows Split-brain detected warnings

Cause

Split-brain occurs when network partitioning causes nodes to lose communication with each other but remain individually healthy. Each partition may elect its own leader.

Immediate Response

Step 1: Identify the situation:

# Check each node's view of the cluster
for node in rppam-01 rppam-02 rppam-03; do
  echo "=== $node ==="
  curl -s https://$node.corp.local:7101/api/v1/admin/cluster/status \
    -H "Authorization: Bearer $ADMIN_JWT" | jq '{leaderNodeId, quorum}'
done
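
The per-node views from the loop above can be reduced to a single answer by counting how many distinct leaders are reported. The sketch below takes the leaderNodeId values as arguments, treating the literal string null as "no leader". The function name is hypothetical.

```bash
#!/usr/bin/env bash
# distinct_leaders <leaderNodeId> [<leaderNodeId> ...]
# Prints the number of distinct, non-null leaders among the reported values.
distinct_leaders() {
  printf '%s\n' "$@" | grep -v -x -e 'null' -e '' | sort -u | wc -l | tr -d ' '
}

# Example: two nodes report node-1 as leader, one reports node-2.
reported=(node-1 node-2 node-1)
count=$(distinct_leaders "${reported[@]}")
case "$count" in
  0) echo "no leader elected" ;;
  1) echo "single leader" ;;
  *) echo "split-brain suspected: $count leaders reported" ;;
esac
```

A count of 0 points to the leader election section of this article, and a count above 1 matches the split-brain symptoms described here.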

Step 2: RP-PAM's split-brain protection uses the witness node (or Redis-based fencing) to resolve conflicts. If the protection did not trigger:

Stop the minority partition (the side with fewer nodes):

# On the minority node(s)
sudo systemctl stop rppam

Verify the majority partition is healthy (query a node on the majority side; rppam-01 in this example):

curl -s https://rppam-01.corp.local:7101/api/v1/admin/cluster/status \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .

Fix the network issue that caused the partition.

Restart the stopped node(s):

```
sudo systemctl start rppam
```

They will rejoin as standbys.

Prevention

Use a witness node for 2-node clusters — see Witness Node Setup
Use 3 or 5 nodes for automatic quorum-based protection
Ensure reliable networking between nodes (same subnet recommended)

Redis Connection Issues

Symptoms

Cluster operations fail or are degraded
Log shows RedisConnectionException or Redis connection refused
Session data not shared between nodes (users logged out when hitting a different node)

Diagnosis

Step 1: Test Redis connectivity from the RP-PAM server:

# PowerShell (using Test-NetConnection)
Test-NetConnection -ComputerName "redis.corp.local" -Port 6379

# Linux
redis-cli -h redis.corp.local -p 6379 ping
# Expected: PONG

Step 2: If Redis requires authentication:

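# -a exposes the password to the process list and shell history; redis-cli can read it from the REDISCLI_AUTH environment variable instead
redis-cli -h redis.corp.local -p 6379 -a "your-redis-password" ping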

Step 3: Check Redis configuration in rppam.config:

{
  "cluster": {
    "redis": {
      "connectionString": "redis.corp.local:6379,password=your-redis-password,ssl=false,abortConnect=false",
      "instanceName": "rppam:"
    }
  }
}
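
When testing with redis-cli, it helps to use exactly the host, port, and TLS setting that RP-PAM is configured with. The sketch below splits a connection string of the form shown above into its parts. It assumes the first comma-separated element is host:port, that the remaining elements are key=value options, and that the password contains no comma. The function names are illustrative.

```bash
#!/usr/bin/env bash
# redis_endpoint <connection-string>
# Prints "host port" from the first comma-separated element.
# Falls back to port 6379 when no port is given.
redis_endpoint() {
  local first="${1%%,*}" host port
  host="${first%%:*}"
  port=6379
  [[ "$first" == *:* ]] && port="${first##*:}"
  printf '%s %s\n' "$host" "$port"
}

# redis_option <connection-string> <key>
# Prints the value of a key=value option; fails if the key is absent.
redis_option() {
  local conn="$1" key="$2" part parts
  IFS=',' read -r -a parts <<< "$conn"
  for part in "${parts[@]}"; do
    if [[ "$part" == "$key="* ]]; then
      printf '%s\n' "${part#*=}"
      return 0
    fi
  done
  return 1
}

conn='redis.corp.local:6379,password=your-redis-password,ssl=false,abortConnect=false'
read -r host port <<< "$(redis_endpoint "$conn")"
echo "host=$host port=$port ssl=$(redis_option "$conn" ssl)"
```

The extracted values can then be passed to `redis-cli -h "$host" -p "$port" ping`. If ssl is true, a TLS-enabled redis-cli build with the `--tls` option would be needed for an equivalent test.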

Common Redis Issues

Issue	Cause	Solution
Connection refused	Redis not running or wrong port	Start Redis; verify port
Authentication failed	Wrong password	Update password in `rppam.config`
SSL/TLS error	`ssl=true` but Redis doesn't have TLS	Set `ssl=false` or configure Redis TLS
Timeout	Redis overloaded or network latency	Check Redis memory usage; optimize eviction policy
Data lost after Redis restart	Redis not configured for persistence	Enable AOF or RDB persistence in `redis.conf`

Check Redis health:

redis-cli -h redis.corp.local info server | head -10
redis-cli -h redis.corp.local info memory | head -10
redis-cli -h redis.corp.local info clients | head -10

Failover Not Completing

Symptoms

Leader node went down but standby did not promote
Clients receive errors or timeouts instead of being served by a standby
Log shows Failover initiated but never Failover completed

Diagnosis

Step 1: Check the standby nodes:

curl -s https://rppam-02.corp.local:7101/api/v1/admin/cluster/status \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .

Step 2: Common failover blockers:

Blocker	Cause	Solution
No quorum	Not enough nodes for majority vote	Bring additional nodes online
Redis unreachable	Cannot acquire distributed lock	Fix Redis connectivity
Standby not healthy	Standby has its own issues (DB down, etc.)	Fix the standby's health first
VIP not moving	VIP failover script failed	Check VIP configuration — see VIP Failover

Step 3: Force a failover (if automatic failover is stuck):

PowerShell:

Invoke-RestMethod -Uri "https://rppam-02.corp.local:7101/api/v1/admin/cluster/promote" `
    -Method POST `
    -Headers @{ Authorization = "Bearer $adminJwt" }

curl:

curl -s -X POST "https://rppam-02.corp.local:7101/api/v1/admin/cluster/promote" \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .

This forces the target node to attempt promotion to leader.
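
Since promotion is only attempted, it is worth confirming that the cluster actually reports the new leader. The following sketch polls until an expected leaderNodeId appears or the attempts run out. Everything here is an assumption for illustration: the function names, the retry counts, and the idea that rppam-02 corresponds to node-2 as in the sample status response.

```bash
#!/usr/bin/env bash
# wait_for_leader <expected-node-id> <attempts> <delay-seconds> <command...>
# Runs <command...> repeatedly until it prints <expected-node-id>.
wait_for_leader() {
  local expected="$1" attempts="$2" delay="$3"
  shift 3
  local i leader
  for (( i = 1; i <= attempts; i++ )); do
    leader=$("$@" 2>/dev/null) || leader=""
    if [[ "$leader" == "$expected" ]]; then
      echo "leader is $expected after $i check(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "leader is still '${leader:-unknown}' after $attempts check(s)" >&2
  return 1
}

# Hypothetical fetcher built from the status call used in this article.
# Requires curl, jq, and ADMIN_JWT in the environment.
fetch_leader() {
  curl -s --max-time 5 "https://rppam-02.corp.local:7101/api/v1/admin/cluster/status" \
    -H "Authorization: Bearer $ADMIN_JWT" | jq -r '.leaderNodeId'
}

# Example invocation (commented out because it contacts the cluster):
# wait_for_leader node-2 12 5 fetch_leader
```

Passing the fetch command as an argument keeps the polling logic separate from the API call, so the same loop can be pointed at any node or tested with a stub.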

Step 4: After the old leader comes back online, it should automatically rejoin as a standby. If it tries to reclaim leadership instead, demote it:

# Demote the old leader (the request is addressed to the old leader, rppam-01 in this example)
curl -s -X POST "https://rppam-01.corp.local:7101/api/v1/admin/cluster/demote" \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .

VIP Not Moving During Failover

Symptoms

Failover completed (new leader elected) but clients still cannot connect
VIP (Virtual IP) still points to the old leader

Diagnosis

Step 1: Check which node currently holds the VIP:

Windows:

# On each node
Get-NetIPAddress | Where-Object { $_.IPAddress -eq "10.0.1.100" }

Linux:

# On each node
ip addr show | grep "10.0.1.100"
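
A plain grep treats the dots as wildcards and also matches substrings, so an address such as 110.0.1.100 would be reported as a hit. For scripted checks, an exact comparison is safer. The sketch below reads the one-line-per-address output of `ip -o -4 addr show` and compares the address field exactly. The function name and the sample output are illustrative.

```bash
#!/usr/bin/env bash
# holds_vip <vip>
# Reads `ip -o -4 addr show` output on stdin; succeeds if an interface
# carries exactly <vip> (the prefix length is ignored).
holds_vip() {
  awk -v vip="$1" '
    { split($4, addr, "/"); if (addr[1] == vip) found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Sample output standing in for a node with the VIP as a secondary address.
sample='2: eth0    inet 10.0.1.11/24 brd 10.0.1.255 scope global eth0
2: eth0    inet 10.0.1.100/24 scope global secondary eth0'

if holds_vip "10.0.1.100" <<< "$sample"; then
  echo "this node holds the VIP"
else
  echo "this node does not hold the VIP"
fi
```

On a real node, run `ip -o -4 addr show | holds_vip 10.0.1.100`. Exactly one node should succeed after a completed failover.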

Step 2: Check the VIP failover script logs on the new leader node. RP-PAM executes a script during failover:

Windows:

Get-Content "C:\ProgramData\Ravenphyre\RP-PAM\Logs\vip-failover.log" -Tail 20

Linux:

tail -n 20 /var/log/rppam/vip-failover.log

Step 3: Manually move the VIP if needed — see VIP Failover Configuration.

Troubleshooting Summary

Problem	First Check	Second Check
Nodes unreachable	Port 7101 + 7102 connectivity	Firewall rules
No leader elected	Quorum met? (enough nodes)	Redis connectivity
Split-brain	Network between nodes	Witness node configured?
Redis errors	`redis-cli ping`	Connection string in config
Failover stuck	Standby health	Redis lock availability
VIP not moving	VIP failover script logs	Manual VIP check with `ip addr`

Next Steps

HA Multi-Node Setup — HA architecture and setup
Redis Configuration for HA — Redis setup guide
VIP Failover Configuration — VIP setup and testing
Failover Testing — How to test your HA setup