HA Multi-Node Setup¶
Section: High Availability | Article 15
Audience: System Administrators
Last Updated: 2026-04-06
Overview¶
RP-PAM supports high-availability (HA) clustering with two or more nodes. In an HA deployment, one node is the primary (leader) and the others are standby nodes. If the primary fails, a standby automatically promotes to leader within seconds. Users experience no downtime.
This article walks through a complete 2-node HA setup from scratch. If you need three or more nodes, follow the 2-node guide first, then see Adding More Nodes at the end.
Prerequisites¶
Before starting, ensure you have:
| Requirement | Detail |
|---|---|
| Two servers | Both meeting the System Requirements for a medium deployment (8 cores, 16 GB RAM each) |
| RP-PAM installed on both | Same version on both nodes (Windows install or Linux install) |
| Redis instance | Redis 6.2 or later, accessible from both nodes (see Redis Setup) |
| Network connectivity | Ports 7001-7012, 5201, and 6379 open between nodes (see Network Requirements) |
| Database | Either a shared external database or local databases on each node (see Database Mode) |
| License | Enterprise or MSP tier license (Standard tier does not include HA) |
| Load balancer | nginx, HAProxy, or cloud load balancer for production use (see Load Balancer Example) |
Naming Convention¶
Throughout this guide:
| Name | IP Address | Role |
|---|---|---|
| node1 | 10.0.1.10 | Primary |
| node2 | 10.0.1.11 | Standby |
| redis | 10.0.1.20 | Redis server (can also run on node1) |
| db-server | 10.0.1.30 | External database (if using external-shared mode) |
Replace these with your actual hostnames and IP addresses.
Step 1: Verify Network Connectivity¶
Before configuring anything, confirm that both nodes can reach each other and the supporting infrastructure.
Windows PowerShell (run on each node):
# From node1, test connectivity to node2
Test-NetConnection -ComputerName 10.0.1.11 -Port 5201
Test-NetConnection -ComputerName 10.0.1.11 -Port 7001
# Test Redis
Test-NetConnection -ComputerName 10.0.1.20 -Port 6379
# Test database (MSSQL example)
Test-NetConnection -ComputerName 10.0.1.30 -Port 1433
Linux (run on each node):
# From node1, test connectivity to node2
nc -zv 10.0.1.11 5201
nc -zv 10.0.1.11 7001
# Test Redis
nc -zv 10.0.1.20 6379
# Test database (PostgreSQL example)
nc -zv 10.0.1.30 5432
All tests must succeed before proceeding. If any fail, review your firewall rules.
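The Linux checks above test only one port from the 7001-7012 range. To cover every inter-node port listed in the prerequisites, the checks can be looped. This is a minimal bash sketch: check_port and check_peer_ports are illustrative names, not RP-PAM tools, and the sketch uses bash's built-in /dev/tcp redirection plus the coreutils timeout command instead of nc.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so nc is not required.
check_port() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Check the sync port and the full gRPC range on a peer node
# (port values taken from the prerequisites table).
check_peer_ports() {
  local peer="$1" failed=0 port
  for port in 5201 $(seq 7001 7012); do
    if check_port "$peer" "$port"; then
      echo "OK   ${peer}:${port}"
    else
      echo "FAIL ${peer}:${port}"
      failed=1
    fi
  done
  return "$failed"
}
```

Run check_peer_ports 10.0.1.11 on node1 and check_peer_ports 10.0.1.10 on node2. A FAIL can also mean that nothing is listening on that port yet, so treat it as a prompt to check both the firewall and the RP-PAM service on the peer.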
Step 2: Install and Configure Redis¶
Redis is required for all HA deployments. It serves as the distributed cache and coordinates leader election between nodes.
Option A: Redis on a Dedicated Server (Recommended for Production)¶
Linux (on the Redis server):
# Ubuntu / Debian
sudo apt update && sudo apt install -y redis-server
# RHEL / CentOS
sudo dnf install -y redis
Edit /etc/redis/redis.conf (on some RHEL-family releases the file is /etc/redis.conf):
# Bind to all interfaces (restrict via firewall)
bind 0.0.0.0
# Set a strong password
requirepass YOUR_REDIS_PASSWORD_HERE
# Enable persistence
appendonly yes
appendfilename "appendonly.aof"
# Optional: enable TLS (recommended for production)
# tls-port 6379
# port 0
# tls-cert-file /etc/redis/tls/redis.crt
# tls-key-file /etc/redis/tls/redis.key
# tls-ca-cert-file /etc/redis/tls/ca.crt
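# Ubuntu / Debian (service name: redis-server)
sudo systemctl enable redis-server
sudo systemctl restart redis-server
sudo systemctl status redis-server
# RHEL / CentOS (service name: redis)
sudo systemctl enable redis
sudo systemctl restart redis
sudo systemctl status redis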
Windows (on the Redis server):
Redis does not officially support Windows. For Windows environments, use one of:
- Memurai (Redis-compatible, native Windows): download from https://www.memurai.com/
- Redis in WSL2: run Redis inside Windows Subsystem for Linux
- Redis in Docker Desktop: docker run -d --name redis -p 6379:6379 redis:7 --requirepass YOUR_REDIS_PASSWORD_HERE
The Docker approach is simplest for Windows:
docker run -d --name rppam-redis `
-p 6379:6379 `
--restart unless-stopped `
redis:7 --requirepass YOUR_REDIS_PASSWORD_HERE
Option B: Redis Co-Located on Node 1¶
For smaller deployments, Redis can run on the same server as Node 1. Use the same installation steps above but install on Node 1 instead of a dedicated server. Ensure Node 2 can reach Node 1 on port 6379.
Verify Redis¶
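From each RP-PAM node, confirm that Redis answers and that persistence is on. This is a bash sketch, assuming redis-cli is installed on the node (on Debian/Ubuntu it ships in the redis-tools package) and that REDIS_PASSWORD holds your requirepass value; verify_redis is an illustrative name.

```shell
# Ping Redis and report its persistence setting.
# Defaults are this guide's example Redis address and port.
verify_redis() {
  local host="${1:-10.0.1.20}" port="${2:-6379}"
  if [ -z "${REDIS_PASSWORD:-}" ]; then
    echo "REDIS_PASSWORD is not set" >&2
    return 1
  fi
  # REDISCLI_AUTH passes the password without exposing it in the process list.
  # Expected output: PONG
  REDISCLI_AUTH="$REDIS_PASSWORD" redis-cli -h "$host" -p "$port" ping
  # Expected output: appendonly, then yes (if persistence is enabled)
  REDISCLI_AUTH="$REDIS_PASSWORD" redis-cli -h "$host" -p "$port" config get appendonly
}
```

The Docker commands above do not enable append-only persistence, so expect no there unless you add --appendonly yes to the container command. If you enabled the TLS settings in redis.conf, add redis-cli's --tls and --cacert options.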
Step 3: Choose Your Database Mode¶
RP-PAM supports two database modes for HA clusters. Choose the one that fits your environment.
Option A: external-shared (Recommended)¶
Both nodes connect to the same external database server. This is the simplest approach and is recommended for most deployments.
| Aspect | Detail |
|---|---|
| How it works | A single MSSQL or PostgreSQL instance serves both nodes. Both read and write to the same database. |
| Advantages | Simple setup. No data replication to manage. Standard database HA (AlwaysOn, replication) can protect the database itself. |
| Disadvantages | Database is a single point of failure unless you also configure database-level HA. |
| Best for | Organisations that already have enterprise database infrastructure with their own HA and backup. |
Option B: local-sync (Advanced)¶
Each node has its own local database. RP-PAM handles replication between nodes automatically.
| Aspect | Detail |
|---|---|
| How it works | Each node runs its own database instance. When the primary writes data, the write is replicated to N/2+1 nodes (quorum) before it is considered committed. |
| Advantages | No external database dependency. A compromised or failed node can be isolated without affecting the cluster — the other nodes continue operating with their own data. |
| Disadvantages | More complex setup. Each node needs sufficient disk space for a full database copy. Write latency is slightly higher due to quorum replication. |
| Best for | Air-gapped environments, high-security deployments, or organisations that prefer self-contained nodes. |
| Quorum rule | Writes must reach N/2+1 nodes, with N/2 rounded down. For a 2-node cluster, that means both nodes must acknowledge the write. For a 3-node cluster, 2 of 3 must acknowledge. |
| Compromised node | If a node is suspected of compromise, it can be isolated (removed from the cluster) immediately. The remaining nodes retain a complete, consistent copy of all data. |
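The quorum rule is plain integer arithmetic. The bash sketch below computes the majority quorum and how many quorum-participating nodes can fail while writes still commit; write_quorum and tolerated_failures are illustrative names, not RP-PAM commands.

```shell
# Majority write quorum for N nodes: floor(N/2) + 1.
# Bash arithmetic is integer division, so n / 2 already rounds down.
write_quorum() {
  local n="$1"
  echo $(( n / 2 + 1 ))
}

# Number of nodes that can fail while writes can still reach quorum.
tolerated_failures() {
  local n="$1"
  echo $(( n - $(write_quorum "$n") ))
}

for n in 2 3 4 5; do
  echo "nodes=$n quorum=$(write_quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

With two nodes the quorum is 2 and no failure is tolerated; three nodes tolerate one failure, and a fourth node raises the quorum to 3 without tolerating a second failure, which is why odd cluster sizes are the usual choice.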
Step 4: Configure Node 1 (Primary)¶
Edit rppam.config on Node 1. The file is located at:
Windows: C:\ProgramData\Ravenphyre\RP-PAM\rppam.config
Linux: /etc/rppam/rppam.config
Using external-shared Database Mode¶
{
"node": {
"nodeName": "node1",
"grpcPortBase": 7001
},
"database": {
"globalConnectionString": "Server=10.0.1.30,1433;Database=RpPam_Global;User Id=rppam_svc;Password=YOUR_DB_PASSWORD;Encrypt=True;TrustServerCertificate=False",
"dbType": "mssql"
},
"databaseSync": {
"mode": "external-shared",
"nodeRole": "primary",
"peerEndpoints": [
"https://10.0.1.11:5201"
],
"writeQuorum": 1
},
"redis": {
"enabled": true,
"connectionString": "10.0.1.20:6379,password=YOUR_REDIS_PASSWORD_HERE,ssl=false",
"keyPrefix": "rppam:",
"tlsEnabled": false
},
"cluster": {
"leaderLockTtlSeconds": 30,
"leaderRenewalIntervalSeconds": 10,
"outboxPollIntervalSeconds": 5,
"heartbeatIntervalSeconds": 5,
"virtualIp": "",
"vipInterface": ""
}
}
Using local-sync Database Mode¶
If using local-sync, change the database and databaseSync sections:
{
"node": {
"nodeName": "node1",
"grpcPortBase": 7001
},
"database": {
"globalConnectionString": "Server=127.0.0.1,1433;Database=RpPam_Global;User Id=rppam_svc;Password=YOUR_DB_PASSWORD;Encrypt=True;TrustServerCertificate=True",
"dbType": "mssql"
},
"databaseSync": {
"mode": "local-sync",
"nodeRole": "primary",
"peerEndpoints": [
"https://10.0.1.11:5201"
],
"writeQuorum": 2,
"backup": {
"enabled": true,
"schedule": "0 2 * * *",
"retentionDays": 30
}
},
"redis": {
"enabled": true,
"connectionString": "10.0.1.20:6379,password=YOUR_REDIS_PASSWORD_HERE,ssl=false",
"keyPrefix": "rppam:",
"tlsEnabled": false
},
"cluster": {
"leaderLockTtlSeconds": 30,
"leaderRenewalIntervalSeconds": 10,
"outboxPollIntervalSeconds": 5,
"heartbeatIntervalSeconds": 5,
"virtualIp": "",
"vipInterface": ""
}
}
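Note: For a 2-node local-sync cluster, writeQuorum must be 2 (both nodes). This ensures no committed write is lost if either node fails, but it also means writes cannot be committed while either node is offline. For larger clusters, set writeQuorum to a majority (N/2+1, with N/2 rounded down): 2 for 3 nodes, 3 for 5 nodes.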
Step 5: Configure Node 2 (Standby)¶
Edit rppam.config on Node 2.
Using external-shared Database Mode¶
{
"node": {
"nodeName": "node2",
"grpcPortBase": 7001
},
"database": {
"globalConnectionString": "Server=10.0.1.30,1433;Database=RpPam_Global;User Id=rppam_svc;Password=YOUR_DB_PASSWORD;Encrypt=True;TrustServerCertificate=False",
"dbType": "mssql"
},
"databaseSync": {
"mode": "external-shared",
"nodeRole": "standby",
"peerEndpoints": [
"https://10.0.1.10:5201"
],
"writeQuorum": 1
},
"redis": {
"enabled": true,
"connectionString": "10.0.1.20:6379,password=YOUR_REDIS_PASSWORD_HERE,ssl=false",
"keyPrefix": "rppam:",
"tlsEnabled": false
},
"cluster": {
"leaderLockTtlSeconds": 30,
"leaderRenewalIntervalSeconds": 10,
"outboxPollIntervalSeconds": 5,
"heartbeatIntervalSeconds": 5,
"virtualIp": "",
"vipInterface": ""
}
}
Using local-sync Database Mode¶
{
"node": {
"nodeName": "node2",
"grpcPortBase": 7001
},
"database": {
"globalConnectionString": "Server=127.0.0.1,1433;Database=RpPam_Global;User Id=rppam_svc;Password=YOUR_DB_PASSWORD;Encrypt=True;TrustServerCertificate=True",
"dbType": "mssql"
},
"databaseSync": {
"mode": "local-sync",
"nodeRole": "standby",
"peerEndpoints": [
"https://10.0.1.10:5201"
],
"writeQuorum": 2,
"backup": {
"enabled": true,
"schedule": "0 2 * * *",
"retentionDays": 30
}
},
"redis": {
"enabled": true,
"connectionString": "10.0.1.20:6379,password=YOUR_REDIS_PASSWORD_HERE,ssl=false",
"keyPrefix": "rppam:",
"tlsEnabled": false
},
"cluster": {
"leaderLockTtlSeconds": 30,
"leaderRenewalIntervalSeconds": 10,
"outboxPollIntervalSeconds": 5,
"heartbeatIntervalSeconds": 5,
"virtualIp": "",
"vipInterface": ""
}
}
Key differences from Node 1:
- nodeName is "node2"
- nodeRole is "standby"
- peerEndpoints points to Node 1's address (10.0.1.10) instead of Node 2's
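A common mistake is a node listing its own address in peerEndpoints. The bash sketch below is a hypothetical pre-start check: it greps the config text rather than parsing JSON, so it assumes the endpoint format used in this guide (https://IP:5201).

```shell
# Check that a config lists at least one peer and never the node itself.
# Usage: check_peers <path-to-rppam.config> <this-node-ip>
check_peers() {
  local config="$1" own_ip="$2" peers
  # Count endpoint URLs on the sync port (5201).
  peers=$(grep -oE '"https://[0-9.]+:5201"' "$config" | wc -l | tr -d ' ')
  if [ "$peers" -eq 0 ]; then
    echo "ERROR: no peerEndpoints found in $config" >&2
    return 1
  fi
  # A node must list the other nodes, not its own address.
  if grep -qF "\"https://${own_ip}:5201\"" "$config"; then
    echo "ERROR: $config lists its own address ${own_ip} as a peer" >&2
    return 1
  fi
  echo "OK: ${peers} peer endpoint(s), none pointing at ${own_ip}"
}
```

For example, on Node 2 run check_peers /etc/rppam/rppam.config 10.0.1.11.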
Step 6: Start the Cluster¶
Start the primary node first, then the standby.
Windows PowerShell¶
# On Node 1 (primary) — start first
Start-Service RpPam
# Wait for Node 1 to become healthy
do {
Start-Sleep -Seconds 2
$health = Invoke-RestMethod -Uri "http://localhost:7101/system/health/ping" -ErrorAction SilentlyContinue
} while ($health.status -ne "healthy")
Write-Host "Node 1 is healthy"
# On Node 2 (standby) — start second
Start-Service RpPam
Linux¶
# On Node 1 (primary) — start first
sudo systemctl start rppam
# Wait for Node 1 to become healthy
until curl -sf http://localhost:7101/system/health/ping | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'; do
sleep 2
done
echo "Node 1 is healthy"
# On Node 2 (standby) — start second
sudo systemctl start rppam
Step 7: Verify Cluster Status¶
Once both nodes are running, verify the cluster is healthy.
Windows PowerShell¶
# Check cluster status from either node
# ($adminJwt must already contain an administrator JWT)
$cluster = Invoke-RestMethod -Uri "http://localhost:7101/system/health/cluster" `
    -Headers @{ Authorization = "Bearer $adminJwt" }
$cluster | ConvertTo-Json -Depth 5
Expected response:
{
"clusterHealthy": true,
"leaderNode": "node1",
"nodes": [
{
"nodeName": "node1",
"role": "primary",
"status": "healthy",
"lastHeartbeat": "2026-04-06T10:30:00Z",
"databaseMode": "external-shared"
},
{
"nodeName": "node2",
"role": "standby",
"status": "healthy",
"lastHeartbeat": "2026-04-06T10:30:02Z",
"databaseMode": "external-shared"
}
],
"redisConnected": true,
"writeQuorum": 1,
"quorumMet": true
}
Linux (curl)¶
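The Linux equivalent of the PowerShell query can be written as a small bash function. ADMIN_JWT is an assumed variable holding an administrator JWT (the same token as $adminJwt in the PowerShell example), and cluster_status is an illustrative name.

```shell
# Query cluster status from either node (default: the local node).
cluster_status() {
  local host="${1:-localhost}"
  if [ -z "${ADMIN_JWT:-}" ]; then
    echo "ADMIN_JWT is not set" >&2
    return 1
  fi
  curl -sf -H "Authorization: Bearer ${ADMIN_JWT}" \
    "http://${host}:7101/system/health/cluster"
}
```

The response is the same JSON as the expected response above; if jq is installed, pipe it through jq . to pretty-print it.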
What to Check¶
| Field | Expected Value | Problem if Wrong |
|---|---|---|
| clusterHealthy | true | One or more nodes unhealthy; check logs |
| leaderNode | "node1" | If empty, leader election has not completed; check Redis |
| status (both nodes) | "healthy" | If "unreachable", check firewall rules between nodes |
| redisConnected | true | If false, Redis is down or unreachable |
| quorumMet | true | If false, not enough nodes are online to satisfy writeQuorum |
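The checks in the table can be scripted. This bash sketch pattern-matches the JSON text with grep instead of using a JSON parser, so it assumes the field names shown in the expected response; json_field_matches and check_cluster_json are illustrative names.

```shell
# Return 0 if JSON text ($1) contains "field": value, where $2 is the field
# name and $3 a regex for the value; whitespace around the colon is ignored.
json_field_matches() {
  grep -Eq "\"$2\"[[:space:]]*:[[:space:]]*$3" <<<"$1"
}

# Compare a cluster status response with the "What to Check" table.
check_cluster_json() {
  local json="$1" rc=0 statuses total unhealthy
  json_field_matches "$json" clusterHealthy true || { echo "clusterHealthy is not true: check node logs"; rc=1; }
  json_field_matches "$json" leaderNode '"[^"]+"' || { echo "leaderNode is empty: check Redis"; rc=1; }
  json_field_matches "$json" redisConnected true || { echo "redisConnected is not true: Redis down or unreachable"; rc=1; }
  json_field_matches "$json" quorumMet true || { echo "quorumMet is not true: too few nodes for writeQuorum"; rc=1; }

  # Each node entry carries a "status" field; all of them should be "healthy".
  statuses=$(grep -oE '"status"[[:space:]]*:[[:space:]]*"[^"]*"' <<<"$json")
  total=$(grep -c . <<<"$statuses")
  unhealthy=$(grep -v '"healthy"' <<<"$statuses" | grep -c .)
  if [ "$total" -eq 0 ]; then
    echo "no node status entries found"; rc=1
  elif [ "$unhealthy" -gt 0 ]; then
    echo "${unhealthy} of ${total} node(s) not healthy: check firewall rules between nodes"; rc=1
  fi
  return "$rc"
}
```

For example: check_cluster_json "$(curl -sf -H "Authorization: Bearer $ADMIN_JWT" http://localhost:7101/system/health/cluster)" prints one line per failed check and returns non-zero if any check fails.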
Step 8: Configure a Load Balancer¶
In production, users should connect through a load balancer rather than directly to individual nodes. This ensures automatic failover if the primary becomes unavailable.
nginx Example¶
Install nginx on a separate server or on one of the RP-PAM nodes.
# http context: send "Connection: upgrade" only for WebSocket requests
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}
upstream rppam_cluster {
    # Primary node (preferred)
    server 10.0.1.10:7101 weight=5;
    # Standby node (used only when the primary is unavailable)
    server 10.0.1.11:7101 backup;
}
server {
    listen 443 ssl;
    server_name pam.corp.local;
    ssl_certificate /etc/nginx/ssl/pam.corp.local.crt;
    ssl_certificate_key /etc/nginx/ssl/pam.corp.local.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    # Health endpoint passthrough with short timeouts. Open-source nginx does not
    # poll this path itself: it marks a node down passively when proxied requests
    # fail (max_fails=1, fail_timeout=10s by default). Active health checks
    # require NGINX Plus or HAProxy.
    location /system/health/ping {
        proxy_pass http://rppam_cluster;
        proxy_connect_timeout 3s;
        proxy_read_timeout 5s;
    }
    location / {
        # Nodes serve HTTP on port 7101 (as in the Step 6 health check); TLS terminates here
        proxy_pass http://rppam_cluster;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # WebSocket support (for live session monitoring)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_connect_timeout 10s;
        proxy_read_timeout 300s;
        proxy_send_timeout 60s;
    }
}
After installing nginx:
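After saving the configuration, validate and reload nginx, then confirm that requests reach a healthy node through the proxy. A bash sketch, assuming nginx is managed by systemd; the function names are illustrative and pam.corp.local is this guide's example hostname.

```shell
# Check the configuration syntax, then reload nginx gracefully
# (reload only happens if the syntax check passes).
apply_nginx_config() {
  sudo nginx -t && sudo systemctl reload nginx
}

# Confirm the proxy forwards to a healthy node. If the certificate is issued
# by an internal CA, add: --cacert /path/to/your-ca.crt
check_via_proxy() {
  local host="${1:-pam.corp.local}"
  curl -sf "https://${host}/system/health/ping"
}
```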
Users now connect to https://pam.corp.local instead of individual node addresses.
Adding More Nodes (3+)¶
Adding a third (or fourth, fifth, etc.) node follows the same pattern as Node 2.
Step-by-Step¶
1. Install RP-PAM on the new server using the same version as the existing nodes.
2. Edit rppam.config on the new node. Use nodeRole: "standby" and list all other nodes in peerEndpoints.
3. Update existing nodes. Add the new node's endpoint to peerEndpoints on every existing node, then restart each node for the change to take effect (rolling restart, one at a time). For example, on Node 1 and Node 2, add "https://10.0.1.12:5201" to their peerEndpoints arrays.
4. Update the load balancer to include the new node (in the nginx example, add a server 10.0.1.12:7101 backup; line to the upstream block).
5. Start the new node.
6. Verify cluster status. All three nodes should appear.
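For a third node, node3 at 10.0.1.12 (the address used in step 3), steps 2 and 5 can be sketched on Linux as follows. The sketch assumes external-shared mode and reuses Node 2's example values; write_node3_config and start_new_node are illustrative names. writeQuorum is set to 2 per the Write Quorum table below; keeping the same value on every node is an assumption based on this guide's examples, not a documented requirement. For local-sync, start from Node 2's local-sync configuration instead.

```shell
# Write an rppam.config for node3. The target path is an argument so the file
# can be written to a temporary location and reviewed first.
write_node3_config() {
  local target="${1:-/etc/rppam/rppam.config}"
  cat > "$target" <<'EOF'
{
  "node": {
    "nodeName": "node3",
    "grpcPortBase": 7001
  },
  "database": {
    "globalConnectionString": "Server=10.0.1.30,1433;Database=RpPam_Global;User Id=rppam_svc;Password=YOUR_DB_PASSWORD;Encrypt=True;TrustServerCertificate=False",
    "dbType": "mssql"
  },
  "databaseSync": {
    "mode": "external-shared",
    "nodeRole": "standby",
    "peerEndpoints": [
      "https://10.0.1.10:5201",
      "https://10.0.1.11:5201"
    ],
    "writeQuorum": 2
  },
  "redis": {
    "enabled": true,
    "connectionString": "10.0.1.20:6379,password=YOUR_REDIS_PASSWORD_HERE,ssl=false",
    "keyPrefix": "rppam:",
    "tlsEnabled": false
  },
  "cluster": {
    "leaderLockTtlSeconds": 30,
    "leaderRenewalIntervalSeconds": 10,
    "outboxPollIntervalSeconds": 5,
    "heartbeatIntervalSeconds": 5,
    "virtualIp": "",
    "vipInterface": ""
  }
}
EOF
}

# Start the new node and wait up to 120 seconds for it to report healthy.
start_new_node() {
  local i
  sudo systemctl start rppam
  for i in $(seq 1 60); do
    if curl -sf http://localhost:7101/system/health/ping |
      grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'; then
      echo "New node is healthy"
      return 0
    fi
    sleep 2
  done
  echo "New node did not report healthy within 120 seconds" >&2
  return 1
}
```

Run write_node3_config as root, or write to a temporary path and copy the reviewed file into place with sudo. After start_new_node succeeds, repeat the Step 7 cluster status check; the nodes array should list node1, node2, and node3.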
Write Quorum for 3+ Nodes¶
| Cluster Size | Recommended writeQuorum | Meaning |
|---|---|---|
| 2 nodes | 2 (local-sync) or 1 (external-shared) | Both nodes must confirm (local-sync), or the single shared database handles it (external-shared) |
| 3 nodes | 2 | Majority. Cluster survives 1 node failure. |
| 5 nodes | 3 | Majority. Cluster survives 2 node failures. |
For external-shared mode, writeQuorum controls cluster-level acknowledgement only. The external database handles its own consistency.
DR Node (Disaster Recovery)¶
A DR node is a special standby node located in a separate data centre or region. It receives data via asynchronous replication (not synchronous quorum), so it does not add latency to normal writes.
Key Differences from a Standard Standby¶
| Aspect | Standard Standby | DR Node |
|---|---|---|
| Replication | Synchronous (participates in quorum) | Asynchronous (does not participate in quorum) |
| Latency impact | Must be low-latency to primary | Can be high-latency (cross-region) |
| Automatic failover | Yes — can be promoted to leader | No — must be manually promoted |
| Purpose | Local HA | Disaster recovery for regional/site failure |
Configuring a DR Node¶
{
"node": {
"nodeName": "dr-node",
"grpcPortBase": 7001
},
"databaseSync": {
"mode": "local-sync",
"nodeRole": "dr",
"peerEndpoints": [
"https://10.0.1.10:5201",
"https://10.0.1.11:5201"
],
"writeQuorum": 2
}
}
The nodeRole: "dr" setting tells RP-PAM that this node:
- Receives replicated data asynchronously
- Does not participate in leader election
- Does not count toward write quorum
- Can be manually promoted to primary in a disaster
Promoting a DR Node¶
If the primary data centre is completely lost:
Windows PowerShell:
& "C:\Program Files\Ravenphyre\RP-PAM\tools\rppam-cluster.exe" promote-dr `
--node dr-node `
--confirm
Linux:
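On Linux, the equivalent call can be sketched as follows, assuming the Linux build of rppam-cluster accepts the same promote-dr, --node, and --confirm arguments as the Windows tool. The tool's Linux install path is not given in this article, so the sketch requires you to set RPPAM_CLUSTER (a hypothetical variable) to it; promote_dr is an illustrative name.

```shell
# Promote a DR node to primary (Linux counterpart of the PowerShell call).
# RPPAM_CLUSTER must point at the rppam-cluster binary on the DR node.
promote_dr() {
  local node="${1:-dr-node}"
  if [ -z "${RPPAM_CLUSTER:-}" ] || [ ! -x "$RPPAM_CLUSTER" ]; then
    echo "Set RPPAM_CLUSTER to the path of the rppam-cluster binary" >&2
    return 1
  fi
  "$RPPAM_CLUSTER" promote-dr --node "$node" --confirm
}
```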
After promotion:
1. The DR node becomes the new primary.
2. Update DNS or the load balancer to point to the DR node.
3. Verify cluster status and data integrity.
4. When the original data centre recovers, add the old nodes back as standbys.
Warning: There may be a small amount of data loss when promoting a DR node, limited to writes that were committed on the primary but not yet replicated to the DR node. The promotion tool reports the replication lag at the time of promotion.
Troubleshooting¶
| Problem | Likely Cause | Solution |
|---|---|---|
| "Leader election failed" | Redis unreachable | Check Redis is running and both nodes can connect |
| Node shows "unreachable" | Firewall blocking port 5201 or 7001-7012 | Open inter-node ports (see Network Requirements) |
| "Quorum not met" | Too few nodes online | Start additional nodes; check writeQuorum setting |
| Standby never syncs (local-sync) | peerEndpoints misconfigured | Verify each node lists the other nodes' addresses, not its own |
| Split-brain warning | Redis connectivity issue | Check that all nodes can reach the same Redis instance |
| High write latency (local-sync) | Cross-network replication | Ensure nodes are on the same low-latency network, or switch to external-shared |
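For the split-brain row, one way to confirm that every node reaches the same Redis instance is to compare the run_id field that Redis reports in INFO server: it identifies a single running Redis process, so it must be identical when read from every node (it changes when Redis restarts, so compare the values at the same time). A bash sketch, assuming redis-cli is installed on each node and REDIS_PASSWORD holds your requirepass value; parse_run_id and redis_run_id are illustrative names.

```shell
# Extract run_id from INFO output on stdin (INFO lines end in CRLF).
parse_run_id() {
  tr -d '\r' | awk -F: '$1 == "run_id" { print $2 }'
}

# Print the run_id of the Redis instance at host:port. Pass the host from
# this node's redis connectionString and run it on every node.
redis_run_id() {
  local host="${1:-10.0.1.20}" port="${2:-6379}"
  if [ -z "${REDIS_PASSWORD:-}" ]; then
    echo "REDIS_PASSWORD is not set" >&2
    return 1
  fi
  REDISCLI_AUTH="$REDIS_PASSWORD" redis-cli -h "$host" -p "$port" info server | parse_run_id
}
```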
Next Steps¶
- Redis Configuration for HA — Advanced Redis settings, TLS, and persistence
- VIP Failover Configuration — Configure a virtual IP for automatic client failover
- Failover Testing — Test that failover works correctly
- LVS Relay Setup — For HA clusters, deploy a single LVS relay behind your load balancer so license validation survives internet or upstream outages
- HA Manager — Manage and monitor your cluster from a single tool
RP-PAM v1.0.0 — Copyright 2026 Ravenphyre. All rights reserved.