Operating Standards · Playbook 004

Solana Validator Upgrade Playbook

How to Upgrade Safely Without Downtime, Replay Failures, or Lost Rewards.

Why Validator Upgrades Are High Risk

Most Solana validators don't fail randomly. They fail during upgrades. Steady-state operation is predictable, but the upgrade window forces the system into its most vulnerable state.

Upgrades introduce a chain of high-stress events:

  • Full system restarts: Clearing caches and forcing a cold start of the validator process.
  • Snapshot replays: The intensive process of reconstructing the state from a local or remote snapshot.
  • Peak Disk IO: Massive read/write operations during the unpack and replay phases.
  • Memory Pressure: Higher RAM allocation required for initial state loading compared to idle voting.
  • Timing Sensitivity: The pressure to sync before missing a leader slot or falling too far behind the root.

⚠️ Critical Consequence

A validator that runs "fine" day-to-day can fall behind the cluster, replay slowly, miss votes, and lose rewards quietly if the hardware cannot sustain the peak burst required during the upgrade window.

The Operator Mindset

Treat upgrades as planned incidents, not routine maintenance. Professional operators acknowledge that every restart is a potential point of failure.

Professional Standards:

  • Prepare days in advance: Don't wait for the cluster percentage to hit 80% to start thinking about it.
  • Validate recovery paths: Know exactly where your backup snapshots are.
  • Measure replay behavior: Keep a log of how long previous versions took to start.
  • Assume failure: Have a rollback binary ready before you stop the service.
"If your upgrade plan is 'restart and hope,' you're gambling with rewards."

Pre-Upgrade Checklist (Non-Negotiable)

Before touching the validator binary, confirm all of the following technical requirements.

01
Cluster Awareness
  • Confirm the target Solana release version and cluster-wide adoption state.
  • Check official upgrade window recommendations from the Solana Foundation or core devs.
  • Monitor cluster stability: check for TPS drops, fork rates, or network congestion.

Never upgrade during: Congestion events, large forks, or emergency patch windows unless explicitly required.
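The adoption check can be scripted rather than eyeballed. A minimal sketch, assuming the JSON shape of `solana validators --output json` (a `validators` array with `version` and `activatedStake` fields; verify the field names against your CLI release):

```python
def stake_on_version(report: dict, target: str) -> float:
    """Return the fraction of activated stake running `target` (e.g. "1.18.23")."""
    total = sum(v["activatedStake"] for v in report["validators"])
    on_target = sum(
        v["activatedStake"]
        for v in report["validators"]
        if v.get("version", "").startswith(target)
    )
    return on_target / total if total else 0.0

# Illustrative sample shaped like `solana validators --output json`.
sample = {
    "validators": [
        {"version": "1.18.23", "activatedStake": 600},
        {"version": "1.18.22", "activatedStake": 300},
        {"version": "1.17.34", "activatedStake": 100},
    ]
}
print(f"{stake_on_version(sample, '1.18.23'):.0%} of stake on target")  # → 60%
```

Weighting by stake rather than node count matters: a version running on many small validators can still represent a minority of the cluster.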

02
Snapshot Readiness

Upgrades almost always require a snapshot reload or partial replay. Verify:

  • Snapshot source availability (local cache vs. remote).
  • Snapshot integrity and recent timestamp.
  • Download and unpack speed baselines.

If snapshot handling is slow during normal operation, it will become a critical bottleneck after an upgrade restart.

03
Disk IO Health Check

Confirm the health of your NVMe array. Upgrades stress the disk harder than steady-state voting.

  • NVMe latency is stable (sub-ms).
  • No IO wait spikes in the hours leading up to the window.
  • Sufficient free space for new snapshots and ledger growth.
  • No shared disks or noisy neighbors (critical for cloud/VM setups).
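A crude pre-window probe can confirm commit latency on the ledger disk. This is a sketch, not a substitute for `fio` or continuous iostat monitoring; point `path` at the filesystem the ledger actually lives on:

```python
import os
import statistics
import tempfile
import time

def fsync_latency_ms(path: str = ".", samples: int = 50, block: int = 4096) -> dict:
    """Time small write+fsync cycles as a rough proxy for commit latency."""
    timings = []
    payload = os.urandom(block)
    with tempfile.NamedTemporaryFile(dir=path) as f:
        for _ in range(samples):
            t0 = time.perf_counter()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
            timings.append((time.perf_counter() - t0) * 1000.0)
    timings.sort()
    return {"p50_ms": statistics.median(timings),
            "p99_ms": timings[int(len(timings) * 0.99)]}

# A healthy NVMe array should stay well under 1 ms at p99 for 4 KiB fsyncs.
```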

04
Memory Headroom Verification

Initial replay consumes significantly more RAM than steady voting.

  • Confirm no swap is enabled (Swap = Death for Solana).
  • Verify ample free memory (30GB+ headroom recommended).
  • Ensure no memory growth trends are nearing physical limits.
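Both the swap check and the headroom check can be read straight out of `/proc/meminfo`. A minimal sketch (the 30 GB figure mirrors the checklist; tune it to your workload):

```python
def _meminfo_kb(meminfo: str, key: str) -> int:
    """Pull a field like 'MemAvailable:   41943040 kB' out of /proc/meminfo text."""
    for line in meminfo.splitlines():
        if line.startswith(key + ":"):
            return int(line.split()[1])
    raise KeyError(key)

def upgrade_memory_ok(meminfo: str, min_headroom_gib: float = 30.0) -> bool:
    """True only if swap is fully disabled AND headroom clears the bar."""
    swap_off = _meminfo_kb(meminfo, "SwapTotal") == 0
    headroom = _meminfo_kb(meminfo, "MemAvailable") / (1024 ** 2)  # kB -> GiB
    return swap_off and headroom >= min_headroom_gib

# On the host itself: upgrade_memory_ok(open("/proc/meminfo").read())
```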

05
Replay Time Baseline

You must know your numbers. Track:

  • How long replay takes on the current version.
  • How it compares to previous hardware baselines.

Increasing replay time is an early warning sign that your hardware is falling behind the cluster state growth.
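"Knowing your numbers" can be enforced with a few lines: keep replay durations from past upgrades and flag the next window when the latest run drifts. The 1.25x tolerance here is an illustrative threshold, not a standard:

```python
import statistics

def replay_regressed(history_s: list[float], latest_s: float,
                     tolerance: float = 1.25) -> bool:
    """Flag when the latest replay exceeds the historical median by `tolerance`x."""
    if not history_s:
        return False  # no baseline yet; record this run and move on
    return latest_s > statistics.median(history_s) * tolerance

history = [410.0, 395.0, 430.0]   # seconds, from the last three upgrades
replay_regressed(history, 640.0)  # True: drifting, investigate the hardware
replay_regressed(history, 420.0)  # False: within baseline
```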

Upgrade Execution: Safe Procedure

Step 1: Controlled Shutdown

Gracefully stop the validator service. Confirm the process exits cleanly and no orphaned processes or hung ledger locks remain. Never force kill (kill -9) unless absolutely required, as it increases the risk of ledger corruption.
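The "exits cleanly" wait can be made explicit instead of guessed. A sketch with the liveness check injected so it stays testable; in production `is_running` might be `lambda: Path(f"/proc/{pid}").exists()` after `systemctl stop` on your unit (unit name and PID source are yours):

```python
import time
from typing import Callable

def wait_for_exit(is_running: Callable[[], bool], timeout_s: float = 120.0,
                  poll_s: float = 0.5) -> bool:
    """Poll until the validator process is gone; True if it exited in time."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not is_running():
            return True
        time.sleep(poll_s)
    return False  # do NOT escalate to kill -9 reflexively; inspect first
```

A `False` return is the moment to investigate hung ledger locks, not the moment to reach for SIGKILL.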

Step 2: Binary Upgrade

  • Verify the checksum of the new Solana release binary.
  • Update binaries only. Avoid introducing config file changes at the same time.
  • Keep the previous version available for instant rollback.

One variable at a time. Do not mix version changes with parameter tuning.
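Checksum verification is worth scripting so it never gets skipped under time pressure. A sketch; the published digest comes from the release page you downloaded from:

```python
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Stream the file so multi-hundred-MB binaries don't load into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify_release(binary_path: str, published_digest: str) -> bool:
    """Compare the local binary against the digest published with the release."""
    return sha256_of(binary_path) == published_digest.strip().lower()
```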

Step 3: Restart & Initial Sync

Upon restart, tail the validator logs immediately. Monitor:

  • Replay progress percentages.
  • Disk IO latency and IOPS.
  • Memory usage spikes.
  • Root advancement and gossip connection.

If replay stalls for longer than 60 seconds without log movement: Stop immediately. Investigate. Do not "wait it out." Time lost is rewards lost.
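The 60-second stall rule is easy to automate. A sketch of a watchdog fed with the latest slot parsed from the logs (the exact log line format varies by release, so parsing is left out); the clock is injectable for testing:

```python
import time

class ReplayWatchdog:
    """Raise the alarm if the replayed slot stops advancing for `stall_s` seconds."""

    def __init__(self, stall_s: float = 60.0, now=time.monotonic):
        self.stall_s = stall_s
        self.now = now
        self.last_slot = None
        self.last_change = now()

    def observe(self, slot: int) -> bool:
        """Feed the newest slot seen in the logs; returns True when stalled."""
        if slot != self.last_slot:
            self.last_slot = slot
            self.last_change = self.now()
            return False
        return self.now() - self.last_change > self.stall_s
```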

Post-Upgrade Validation (Critical)

A validator that is "running" is not necessarily healthy. You must verify these four pillars:

Replay Success

No errors in logs, no repeated retries, and no abnormal delays in starting the tower.

Vote Behavior

Votes are being submitted on time. Zero latency spikes in the first 5 minutes of voting.

Resource Metrics

CPU and memory return to steady-state baselines. Disk latency remains sub-ms.

Cluster Sync

Root advancement is normal and there are no fork issues detected via gossip.

Rollback Strategy (Don't Skip This)

Every upgrade must have a defined rollback path. If you wait until failure to think about rollback, you've already lost the battle.

Rollback Trigger Points:

  • Replay fails repeatedly during restart.
  • Performance degrades (vote latency increases relative to peers).
  • Unexpected bugs or stability issues appear post-sync.

Execution:

  1. Stop the validator service.
  2. Restore the previous known-good binary version.
  3. Restart using a known-good snapshot.
  4. Validate replay and voting behavior immediately.
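Rollback speed is mostly a layout decision made before the upgrade. One common pattern (the directory layout here is an assumption, not a Solana requirement) keeps each release in its own versioned directory and points the service unit at an `active` symlink, so steps 1-2 reduce to one atomic re-point:

```python
import os
from pathlib import Path

def activate(release_dir: Path, active_link: Path) -> None:
    """Atomically repoint `active` -> release_dir; rollback is the same call.

    Assumed layout: /opt/solana/v1.18.22/, /opt/solana/v1.18.23/, and a
    service unit whose binary path resolves through /opt/solana/active/.
    """
    tmp = active_link.with_name(active_link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    tmp.symlink_to(release_dir, target_is_directory=True)
    os.replace(tmp, active_link)  # rename(2) is atomic on the same filesystem
```

With this layout the previous binary never leaves the disk, so "restore the known-good version" costs seconds, not a re-download.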

If your infrastructure makes rollback slow or risky, it is a sign that your hardware is underpowered or your storage layer is inefficient.

Common Upgrade Failure Modes

Failure 1: Replay Takes Too Long

Root Cause: Slow NVMe throughput, IO contention, or undersized disk arrays.

Result: Validator joins the cluster too late, resulting in missed leader slots and reward penalties.

Failure 2: OOM Kill During Replay

Root Cause: Insufficient physical RAM, no memory headroom, or version-specific leaks.

Result: Continuous restart loops and extended catastrophic downtime.

Failure 3: Performance Decay

Root Cause: CPU thermal throttling, disk saturation, or config incompatibility.

Result: Silent reward loss and poor voting efficiency compared to the cluster mean.

Infrastructure Quality Determines Success

Upgrades expose the weaknesses that normal operation hides. A system that is barely keeping up will break when forced to replay the state.

Comparison: Bare Metal vs. Cloud During Upgrades

Cloud Instances
  • Throttle sustained IO during huge snapshot writes.
  • Shared resources cause unpredictable replay times.
  • Behave inconsistently under heavy CPU replay load.
Dedicated Bare Metal
  • Delivers predictable, maximum throughput for snapshots.
  • Reduces upgrade window risk via consistent low latency.
  • Simplifies rollback through raw hardware performance.

Operators who prioritize upgrade safety rarely stay on virtualized infrastructure for long-term production nodes.

Upgrade Timing & Strategy

Speed is less important than safety. Disciplined operators follow these timing rules:

  • Avoid Day 0: Do not upgrade in the first hour of a release unless it's a critical security fix.
  • Observe Early Adopters: Watch Discord and Telegram for "canary" reports from others.
  • Low Activity Windows: Time your restart during periods where your leader slots are not imminent.
  • Staggered Deployment: If running multiple nodes, never upgrade them simultaneously.

Monitoring During the Window

Upgrades without monitoring are blind operations. You must track these metrics in real-time:

  • Replay Progress: Watch the log counts for state reconstruction.
  • Disk Latency: Monitor write latency during the snapshot extract phase.
  • Memory Usage: Ensure you aren't hitting 90%+ of total physical RAM.
  • Vote Delay: Compare your vote lag to the cluster average immediately upon restart.
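Vote delay versus the cluster is the clearest "am I healthy" signal in this list. A sketch comparing your last vote to the median of peers' last votes (as reported by e.g. `solana validators`); the 32-slot threshold is an illustrative choice, not a protocol constant:

```python
import statistics

def vote_lag_slots(my_last_vote: int, peer_last_votes: list[int]) -> int:
    """How far our last vote trails the median of the cluster's last votes."""
    return int(statistics.median(peer_last_votes)) - my_last_vote

def lagging(my_last_vote: int, peer_last_votes: list[int],
            threshold: int = 32) -> bool:
    return vote_lag_slots(my_last_vote, peer_last_votes) > threshold
```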

Post-Upgrade Review (Every Time)

After the node is healthy, document the following to inform your future capacity planning:

  1. Total downtime from stop to vote.
  2. Peak resource usage (CPU/RAM/IO) during the upgrade.
  3. Comparison with previous version performance.
  4. Update your hardware baselines in your internal documentation.

Final Checklist

Before Calling Success:

Validator fully synced and voting at 100% efficiency.

Replay time remains within your historical acceptable baseline.

No resource saturation (CPU/IO/RAM) post-restart.

Rollback path verified and ready for the next release.

Final Thoughts

Solana validator upgrades are unavoidable. The difference between smooth upgrades and silent reward loss comes down to preparation, infrastructure quality, and disciplined execution. Treat upgrades as planned incidents—that mindset alone will save you money.