HA and Cluster Troubleshooting

Section: Troubleshooting | Article 50
Audience: System Administrators
Last Updated: 2026-04-07


Overview

This article covers issues specific to RP-PAM High Availability (HA) multi-node deployments. For general troubleshooting, see General Troubleshooting. For HA setup, see HA Multi-Node Setup.


Checking Cluster Status

Before troubleshooting, check the current cluster state.

Via REST API

PowerShell:

Invoke-RestMethod -Uri "https://rppam.corp.local:7101/api/v1/admin/cluster/status" `
    -Headers @{ Authorization = "Bearer $adminJwt" } | ConvertTo-Json -Depth 3

curl:

curl -s "https://rppam.corp.local:7101/api/v1/admin/cluster/status" \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .

Healthy cluster response:

{
  "clusterId": "cluster-001",
  "status": "healthy",
  "leaderNodeId": "node-1",
  "nodes": [
    {
      "nodeId": "node-1",
      "hostname": "rppam-01.corp.local",
      "role": "leader",
      "status": "healthy",
      "lastHeartbeat": "2026-04-07T14:30:00Z"
    },
    {
      "nodeId": "node-2",
      "hostname": "rppam-02.corp.local",
      "role": "standby",
      "status": "healthy",
      "lastHeartbeat": "2026-04-07T14:30:01Z"
    },
    {
      "nodeId": "node-3",
      "hostname": "rppam-03.corp.local",
      "role": "standby",
      "status": "healthy",
      "lastHeartbeat": "2026-04-07T14:29:59Z"
    }
  ],
  "quorum": true,
  "redisStatus": "connected"
}

Per-Node Health

Check each node individually:

# Node 1
curl -s https://rppam-01.corp.local:7101/system/health/ping | jq .

# Node 2
curl -s https://rppam-02.corp.local:7101/system/health/ping | jq .

# Node 3
curl -s https://rppam-03.corp.local:7101/system/health/ping | jq .

Nodes Cannot Communicate

Symptoms

  • Cluster status shows nodes as "unreachable"
  • Heartbeat timestamps are stale (more than 30 seconds old)
  • Log shows NodeCommunicationException or Heartbeat timeout for node
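
The 30-second staleness rule can be checked mechanically against the lastHeartbeat values in the cluster status response. The following is a minimal sketch, assuming GNU date is available; the function names heartbeat_age and is_stale are illustrative and not part of RP-PAM.

```shell
# Sketch only: heartbeat_age and is_stale are hypothetical helper names.
# Requires GNU date (for `date -d`).

# heartbeat_age <iso8601-utc-timestamp> [now-epoch]
# Prints the age of the heartbeat in whole seconds.
heartbeat_age() {
  local ts now hb
  ts="$1"
  now="${2:-$(date -u +%s)}"
  hb=$(date -u -d "$ts" +%s) || return 1
  echo $(( now - hb ))
}

# is_stale <iso8601-utc-timestamp> [now-epoch] [threshold-seconds]
# Exits 0 when the heartbeat is older than the threshold (default 30).
is_stale() {
  local age
  age=$(heartbeat_age "$1" "${2:-}") || return 2
  [ "$age" -gt "${3:-30}" ]
}
```

For example, is_stale "2026-04-07T14:30:00Z" && echo "stale" compares a node's lastHeartbeat against the current time.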

Diagnosis

Step 1: Test inter-node connectivity.

From Node 1 to Node 2:

# PowerShell
Test-NetConnection -ComputerName "rppam-02.corp.local" -Port 7101
Test-NetConnection -ComputerName "rppam-02.corp.local" -Port 7102  # Cluster communication port

# Linux
nc -zv rppam-02.corp.local 7101
nc -zv rppam-02.corp.local 7102

Step 2: Check firewall rules. RP-PAM uses two ports for HA:

Port   Purpose                                                       Direction
7101   API and health checks                                         Inbound from clients and other nodes
7102   Cluster communication (gossip, heartbeats, leader election)   Inbound from other nodes only

Windows:

Get-NetFirewallRule -DisplayName "*RP-PAM*" | Format-Table DisplayName, Enabled, Direction

Linux:

sudo ss -tlnp | grep -E "7101|7102"
sudo ufw status          # Ubuntu
sudo firewall-cmd --list-ports  # RHEL

Step 3: Verify cluster configuration. Each node must list every other node as a peer. Check rppam.config on each node:

{
  "cluster": {
    "nodeId": "node-1",
    "clusterPort": 7102,
    "peers": [
      "rppam-02.corp.local:7102",
      "rppam-03.corp.local:7102"
    ]
  }
}

Ensure each node's peers list names every other node in the cluster, with the correct hostname and cluster port, and that the lists are consistent across all nodes.
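
One way to check consistency is to derive the peers list each node should have from the full member list, then compare it with that node's rppam.config. This is a minimal sketch, assuming every node uses the same cluster port (7102 in the example above); expected_peers is a hypothetical helper, not an RP-PAM command.

```shell
# Sketch only: expected_peers is a hypothetical helper name.
# CLUSTER_PORT defaults to 7102, the cluster port used in this article.

# expected_peers <this-node-hostname> <all-member-hostnames...>
# Prints one "host:port" entry for every member except the node itself.
expected_peers() {
  local self host
  self="$1"; shift
  for host in "$@"; do
    if [ "$host" != "$self" ]; then
      printf '%s:%s\n' "$host" "${CLUSTER_PORT:-7102}"
    fi
  done
}
```

Running expected_peers rppam-01.corp.local rppam-01.corp.local rppam-02.corp.local rppam-03.corp.local prints the two entries shown in the node-1 example above.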


Leader Election Failing

Symptoms

  • Cluster status shows "leaderNodeId": null
  • All nodes report role as "standby" or "candidate"
  • Log shows Leader election timed out or No quorum available

Cause

Leader election requires a quorum (a majority of nodes). If half or more of the nodes are down or unreachable, no leader can be elected.

Total Nodes   Quorum Requirement    Tolerated Failures
2             2 (both must agree)   0
3             2                     1
5             3                     2
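
The values in the table follow from the majority rule: the quorum is floor(n / 2) + 1, and the number of tolerated failures is n minus that quorum. A short sketch of the arithmetic (the function names are illustrative, not part of RP-PAM):

```shell
# Sketch only: hypothetical helper names for the majority arithmetic.

# quorum_size <total-nodes>: smallest number of nodes that forms a majority.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# tolerated_failures <total-nodes>: nodes that can be lost while keeping quorum.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# has_quorum <total-nodes> <reachable-nodes>: exits 0 when a majority is reachable.
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}

for n in 2 3 5; do
  echo "nodes=$n quorum=$(quorum_size "$n") tolerated=$(tolerated_failures "$n")"
done
```

The arithmetic also shows why an even node count adds no fault tolerance: a 4-node cluster needs 3 votes and tolerates 1 failure, the same as a 3-node cluster.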

Solution

Step 1: Ensure enough nodes are online:

for node in rppam-01 rppam-02 rppam-03; do
  echo -n "$node: "
  curl -s -o /dev/null -w "%{http_code}" https://$node.corp.local:7101/system/health/ping
  echo
done

Step 2: If nodes are online but election still fails, check Redis connectivity (Redis is required for distributed locking):

redis-cli -h redis.corp.local -p 6379 ping

Step 3: If a node is stuck in "candidate" state, restart it:

sudo systemctl restart rppam

Step 4: If the cluster is in a persistent election loop, check for clock skew between nodes. Leader election uses timestamps for lease management:

# Check time on each node
for node in rppam-01 rppam-02 rppam-03; do
  echo -n "$node: "
  ssh "$node.corp.local" date -u
done

Ensure NTP is synchronized across all nodes (maximum acceptable skew: 5 seconds).
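
Comparing timestamps by eye is error-prone, so the spread can be computed instead. The sketch below assumes each node's clock has been read as epoch seconds (for example with date -u +%s over ssh); max_skew and skew_within_limit are hypothetical helper names.

```shell
# Sketch only: hypothetical helpers for comparing clock readings.

# max_skew <epoch-seconds...>
# Prints the difference between the earliest and latest reading.
max_skew() {
  local min max t
  min="$1"; max="$1"
  for t in "$@"; do
    if [ "$t" -lt "$min" ]; then min="$t"; fi
    if [ "$t" -gt "$max" ]; then max="$t"; fi
  done
  echo $(( max - min ))
}

# skew_within_limit <limit-seconds> <epoch-seconds...>
# Exits 0 when the spread does not exceed the limit.
skew_within_limit() {
  local limit
  limit="$1"; shift
  [ "$(max_skew "$@")" -le "$limit" ]
}
```

Readings collected one after another over ssh include connection latency, so a result close to the 5-second limit is better confirmed with the NTP client's own offset report on each node.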


Split-Brain

Symptoms

  • Two nodes both claim to be the leader
  • Clients get inconsistent responses depending on which node they reach
  • Log shows Split-brain detected warnings

Cause

Split-brain occurs when network partitioning causes nodes to lose communication with each other but remain individually healthy. Each partition may elect its own leader.

Immediate Response

Step 1: Identify the situation:

# Check each node's view of the cluster
for node in rppam-01 rppam-02 rppam-03; do
  echo "=== $node ==="
  curl -s https://$node.corp.local:7101/api/v1/admin/cluster/status \
    -H "Authorization: Bearer $ADMIN_JWT" | jq '{leaderNodeId, quorum}'
done
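
To compare the answers quickly, the per-node leaderNodeId values can be reduced to a count of distinct leaders. This is a sketch, assuming the values have been collected as "node leaderNodeId" pairs, one per line (for example by printing jq -r '.leaderNodeId' next to each node name); count_leaders is a hypothetical helper.

```shell
# Sketch only: count_leaders is a hypothetical helper name.

# count_leaders: reads "<node> <leaderNodeId>" pairs on stdin and prints the
# number of distinct leaders reported, ignoring empty and "null" values.
count_leaders() {
  awk '
    $2 != "" && $2 != "null" { seen[$2] = 1 }
    END { n = 0; for (k in seen) n++; print n }
  '
}
```

A result of 1 means every node agrees on the leader, 0 means no node reports a leader, and 2 or more means different nodes report different leaders, which matches the split-brain symptom.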

Step 2: RP-PAM's split-brain protection uses the witness node (or Redis-based fencing) to resolve conflicts. If the protection did not trigger, resolve the partition manually:

  1. Stop the minority partition (the side with fewer nodes):

    # On the minority node(s)
    sudo systemctl stop rppam
    

  2. Verify the majority partition is healthy:

    curl -s https://rppam-01.corp.local:7101/api/v1/admin/cluster/status \
      -H "Authorization: Bearer $ADMIN_JWT" | jq .
    

  3. Fix the network issue that caused the partition.

  4. Restart the stopped node(s):

    sudo systemctl start rppam
    
    They will rejoin as standbys.

Prevention

  • Use a witness node for 2-node clusters — see Witness Node Setup
  • Use 3 or 5 nodes for automatic quorum-based protection
  • Ensure reliable networking between nodes (same subnet recommended)

Redis Connection Issues

Symptoms

  • Cluster operations fail or are degraded
  • Log shows RedisConnectionException or Redis connection refused
  • Session data not shared between nodes (users logged out when hitting a different node)

Diagnosis

Step 1: Test Redis connectivity from the RP-PAM server:

# PowerShell (using Test-NetConnection)
Test-NetConnection -ComputerName "redis.corp.local" -Port 6379

# Linux
redis-cli -h redis.corp.local -p 6379 ping
# Expected: PONG

Step 2: If Redis requires authentication:

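# Passing -a on the command line exposes the password in the process list;
# redis-cli also reads the password from the REDISCLI_AUTH environment variable.
REDISCLI_AUTH="your-redis-password" redis-cli -h redis.corp.local -p 6379 ping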

Step 3: Check Redis configuration in rppam.config:

{
  "cluster": {
    "redis": {
      "connectionString": "redis.corp.local:6379,password=your-redis-password,ssl=false,abortConnect=false",
      "instanceName": "rppam:"
    }
  }
}
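
When testing connectivity by hand, it helps to use exactly the endpoint and options that RP-PAM is configured with. The sketch below splits a connection string of the form shown above into its endpoint and key=value options. It assumes the first comma-separated element is host:port and that option values contain no commas; the function names are illustrative.

```shell
# Sketch only: hypothetical helpers for reading a connection string of the
# form "host:port,key=value,key=value".

# redis_endpoint <connection-string>: prints the first element (host:port).
redis_endpoint() {
  printf '%s\n' "${1%%,*}"
}

# redis_option <connection-string> <key>: prints the value of the first
# matching key=value element, or nothing when the key is absent.
redis_option() {
  printf '%s\n' "$1" | tr ',' '\n' |
    awk -F= -v key="$2" '$1 == key { print substr($0, length(key) + 2); exit }'
}
```

For the example configuration, redis_endpoint returns redis.corp.local:6379 and redis_option with the key ssl returns false, which are the values to use when reproducing the connection with redis-cli.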

Common Redis Issues

Issue                           Cause                                  Solution
Connection refused              Redis not running or wrong port        Start Redis; verify port
Authentication failed           Wrong password                         Update password in rppam.config
SSL/TLS error                   ssl=true but Redis doesn't have TLS    Set ssl=false or configure Redis TLS
Timeout                         Redis overloaded or network latency    Check Redis memory usage; optimize eviction policy
Data lost after Redis restart   Redis not configured for persistence   Enable AOF or RDB persistence in redis.conf

Check Redis health:

redis-cli -h redis.corp.local info server | head -10
redis-cli -h redis.corp.local info memory | head -10
redis-cli -h redis.corp.local info clients | head -10


Failover Not Completing

Symptoms

  • Leader node went down but standby did not promote
  • Clients receive errors or timeouts instead of being served by a standby
  • Log shows Failover initiated but never Failover completed
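
The last symptom can be checked by counting the two messages in the RP-PAM log. This is a sketch, assuming the log contains the literal strings "Failover initiated" and "Failover completed" quoted above; the log file path is not specified in this article and must be supplied, and failover_incomplete is a hypothetical helper name.

```shell
# Sketch only: failover_incomplete is a hypothetical helper name.

# failover_incomplete <log-file>
# Exits 0 when the log has more "Failover initiated" entries than
# "Failover completed" entries, 1 when they balance, 2 if the file is unreadable.
failover_incomplete() {
  local started completed
  [ -r "$1" ] || return 2
  # grep -c exits non-zero when the count is 0, so its status is ignored here.
  started=$(grep -c "Failover initiated" "$1" || true)
  completed=$(grep -c "Failover completed" "$1" || true)
  [ "$started" -gt "$completed" ]
}
```

Call it with the path to the node's RP-PAM log, for example failover_incomplete "$LOG_FILE" && echo "failover did not complete".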

Diagnosis

Step 1: Check the standby nodes:

curl -s https://rppam-02.corp.local:7101/api/v1/admin/cluster/status \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .

Step 2: Common failover blockers:

Blocker               Cause                                        Solution
No quorum             Not enough nodes for majority vote           Bring additional nodes online
Redis unreachable     Cannot acquire distributed lock              Fix Redis connectivity
Standby not healthy   Standby has its own issues (DB down, etc.)   Fix the standby's health first
VIP not moving        VIP failover script failed                   Check VIP configuration; see VIP Failover

Step 3: Force a failover (if automatic failover is stuck):

PowerShell:

Invoke-RestMethod -Uri "https://rppam-02.corp.local:7101/api/v1/admin/cluster/promote" `
    -Method POST `
    -Headers @{ Authorization = "Bearer $adminJwt" }

curl:

curl -s -X POST "https://rppam-02.corp.local:7101/api/v1/admin/cluster/promote" \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .

This forces the target node to attempt promotion to leader.

Step 4: After the old leader comes back online, it should automatically rejoin as a standby. If it tries to reclaim leadership:

# Demote the old leader (the request is sent to the old leader node)
curl -s -X POST "https://rppam-01.corp.local:7101/api/v1/admin/cluster/demote" \
  -H "Authorization: Bearer $ADMIN_JWT" | jq .

VIP Not Moving During Failover

Symptoms

  • Failover completed (new leader elected) but clients still cannot connect
  • VIP (Virtual IP) still points to the old leader

Diagnosis

Step 1: Check which node currently holds the VIP:

Windows:

# On each node
Get-NetIPAddress | Where-Object { $_.IPAddress -eq "10.0.1.100" }

Linux:

# On each node
ip addr show | grep "10.0.1.100"
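
A plain grep treats the dots as wildcards and also matches longer addresses that contain the VIP, such as 110.0.1.100. For scripted checks, an exact comparison is safer. The sketch below assumes the one-line output format of ip -o -4 addr show, where the fourth field is address/prefix; holds_vip is a hypothetical helper name.

```shell
# Sketch only: holds_vip is a hypothetical helper name.

# holds_vip <address>
# Reads `ip -o -4 addr show` output on stdin and exits 0 when an interface
# carries exactly that address (the prefix length is ignored).
holds_vip() {
  awk -v vip="$1" '
    { split($4, a, "/"); if (a[1] == vip) found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage on a node:
#   ip -o -4 addr show | holds_vip 10.0.1.100 && echo "this node holds the VIP"
```

Run the check on every node; after a completed failover, only the new leader is expected to report the VIP.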

Step 2: Check the VIP failover script logs on the new leader node. RP-PAM executes a script during failover:

Windows:

Get-Content "C:\ProgramData\Ravenphyre\RP-PAM\Logs\vip-failover.log" -Tail 20

Linux:

tail -n 20 /var/log/rppam/vip-failover.log

Step 3: Manually move the VIP if needed — see VIP Failover Configuration.


Troubleshooting Summary

Problem             First Check                     Second Check
Nodes unreachable   Port 7101 + 7102 connectivity   Firewall rules
No leader elected   Quorum met? (enough nodes)      Redis connectivity
Split-brain         Network between nodes           Witness node configured?
Redis errors        redis-cli ping                  Connection string in config
Failover stuck      Standby health                  Redis lock availability
VIP not moving      VIP failover script logs        Manual VIP check with ip addr


RP-PAM v1.0.0 — Copyright 2026 Ravenphyre. All rights reserved.