Field Manual

Solana Validator Troubleshooting Guide (2025)

Last Updated: December 2025 • Technical Level: Expert

Introduction

Running a Solana validator is not “set and forget.” Even with good hardware, validators fail for predictable reasons: disk IO saturation, snapshot issues, ledger bloat, RPC overload, voting latency, or network hiccups.

This guide is designed as a field manual — not marketing content. If your validator misses slots, drops out of the leader schedule, fails to vote, or randomly crashes, this page will help you identify what’s actually broken, why it happens, and how to fix it.

How to Use This Guide

Each section follows the same structure:

Symptom: What you observe.
Root Cause: What’s actually happening.
How to Confirm: Commands/logs to check.
Fix: Exact actions to take.
Prevention: How to avoid it long-term.

1. Validator Is Missing Slots or Votes

Symptom

Explorer shows missed leader slots. Vote credit rate drops. Commission earnings decline. Validator looks “healthy” but underperforms.

Root Causes

Solana is extremely sensitive to latency spikes, not average performance.

CPU Bottlenecks: Poor single-thread performance impacting PoH generation.
Disk IO Saturation: Accounts DB lookups waiting heavily on NVMe.
Ledger Lag: Slow banking stage processing.

How to Confirm

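solana catchup --our-localhost

This queries the node's local RPC port. Check whether the reported slot distance from the cluster is large or still growing. For vote health, compare your Last Vote and Root Slot against the rest of the cluster in the output of solana validators.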

Also monitor hardware:

iostat -x 1
top

If NVMe %util stays above 80%, or r_await/w_await stays above 5 ms, your disk is the bottleneck.
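The thresholds above can be checked mechanically. The following is a minimal sketch, not a tuned monitor: check_iostat is a hypothetical helper name, it only looks at devices whose name starts with nvme, and it finds columns by header name because their positions differ between sysstat versions.

```shell
# Flag NVMe devices that exceed 80% util or 5 ms read/write await.
# Reads `iostat -x` output on stdin. check_iostat is a hypothetical helper.
check_iostat() {
  awk -v util_max=80 -v await_max=5 '
    # Header row: remember which column holds which metric.
    $1 == "Device" || $1 == "Device:" {
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    # Data rows for NVMe devices, once a header has been seen.
    ("%util" in col) && NF >= col["%util"] && $1 ~ /^nvme/ {
      util = $(col["%util"])
      ra = ("r_await" in col) ? $(col["r_await"]) : 0
      wa = ("w_await" in col) ? $(col["w_await"]) : 0
      if (util + 0 > util_max || ra + 0 > await_max || wa + 0 > await_max)
        printf "%s SATURATED util=%s r_await=%s w_await=%s\n", $1, util, ra, wa
    }
  '
}
```

Usage: iostat -x 1 10 | check_iostat. The first report iostat prints is an average since boot, so judge by the later samples.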

The Fix

Ensure you are on bare metal, not a VPS or Cloud Instance.
Move ledger and accounts to separate physical NVMe drives.
Disable unnecessary RPC services (metrics/geyser).

🚀 Solution: Upgrade Hardware

If your IO wait causes missed slots, you need dedicated lanes. We recommend Cherry Servers for their dedicated NVMe Gen4 arrays.

2. Validator Falls Behind (Never Catches Up)

Symptom

solana catchup never reaches "caught up". Slot lag keeps increasing. Node restarts don't fix it.

Root Causes

Snapshot Download Slow: Downloading and unpacking the snapshot takes so long that the node starts too far behind the tip and cannot replay its way back.
Ledger Corruption: Bad state requiring a reset.
Insufficient Throughput: Hardware cannot handle the rate of block replay (TPS).

How to Confirm

journalctl -u solana-validator -n 200

Look for snapshot unpack delays or RocksDB stalls.

The Fix

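Stop the validator, clear the ledger, and restart the service so the node fetches a fresh snapshot on boot.

systemctl stop solana-validator
rm -rf /mnt/ledger/*
systemctl start solana-validator

Start the service with your usual flags. --no-wait-for-vote-to-start-leader does not trigger a snapshot download, and the node will not fetch one at all if --no-snapshot-fetch is set.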
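A wildcard delete is dangerous when the path comes from a script variable: if the variable is empty, the glob expands against the filesystem root. The sketch below guards against that. safe_wipe_ledger is a hypothetical helper name, and it assumes GNU or BSD find for -mindepth and -delete.

```shell
# Hypothetical helper: empty a ledger directory only if the path looks sane.
safe_wipe_ledger() {
  dir=$1
  # Refuse empty paths, the filesystem root, and anything that is not a directory.
  if [ -z "$dir" ] || [ "$dir" = "/" ] || [ ! -d "$dir" ]; then
    echo "refusing to wipe '$dir'" >&2
    return 1
  fi
  # Remove everything inside the directory (dotfiles included), keep the directory.
  find "${dir:?}" -mindepth 1 -delete
}
```

Usage, with the validator service already stopped: safe_wipe_ledger /mnt/ledger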

3. Validator Crashes Randomly

Symptom

Validator exits without warning. Systemd log shows random stops. Downtime events visible on explorer.

Root Causes

Out-of-Memory (OOM): RAM is exhausted and the Linux kernel kills the validator process.
Kernel Limits: Open file descriptors exceeded.
Disk Full: Ledger filled the drive.

How to Confirm

dmesg | grep -i oom
df -h

The Fix

Ensure 256GB RAM minimum.
Disable swap (swapping kills performance).
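Increase file limits: ulimit -n 1000000 for an interactive shell, or LimitNOFILE=1000000 in the systemd unit, because ulimit in your shell does not apply to a service that systemd starts.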
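To confirm that the limit actually reached the running process, read it back from /proc rather than trusting the shell. A minimal sketch for Linux: soft_nofile is a hypothetical helper name, and the service name solana-validator is taken from the commands in this guide.

```shell
# Print the soft "Max open files" limit from a /proc/<pid>/limits file on stdin.
# soft_nofile is a hypothetical helper name.
soft_nofile() {
  # Line format: "Max open files  <soft>  <hard>  files"
  awk '/^Max open files/ { print $4 }'
}

# Example (not run here): look up the service's main PID via systemd.
# pid=$(systemctl show -p MainPID --value solana-validator)
# soft_nofile < "/proc/$pid/limits"
```

If the value printed is still 1024 or similar, the limit was set in a shell and never applied to the service.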

4. High Disk IO Wait (iowait)

Symptom

CPU looks idle but performance is bad. iowait > 15%. Vote latency spikes.

Root Causes

Ledger and accounts DB sharing the same disk.
Slow NVMe (Consumer grade or worn out).
Filesystem misconfiguration (not using noatime).

The Fix

Separate your disks physically.

Disk 1: /mnt/ledger (Write Heavy)
Disk 2: /mnt/accounts (Read/Write Random Heavy)
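A quick way to verify the layout is to compare the device behind each mount point and to inspect the mount options. This is a sketch with hypothetical helper names (backing_device, same_device, has_noatime). It only detects two paths on the same filesystem; two partitions carved from one physical NVMe will still look like different devices, so check lsblk for that case.

```shell
# Device backing a path, taken from POSIX-format df output.
backing_device() {
  df -P "$1" | awk 'NR == 2 { print $1 }'
}

# Succeeds when both paths live on the same filesystem device.
same_device() {
  [ "$(backing_device "$1")" = "$(backing_device "$2")" ]
}

# Succeeds when a comma-separated mount option string contains noatime.
has_noatime() {
  case ",$1," in
    *,noatime,*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not run here), using the paths from this guide:
# same_device /mnt/ledger /mnt/accounts && echo "ledger and accounts share a disk"
# has_noatime "$(findmnt -no OPTIONS -T /mnt/accounts)" || echo "noatime missing"
```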

Avoid RAID 5/6. Use RAID 0 only if you have redundancy elsewhere.

5. Validator Votes But Earns Low Rewards

Symptom

Validator is online. Votes appear. Rewards are lower than expected.

The Fix

Reduce Commission: Drop to 0-5% temporarily to attract delegations.
Improve Uptime: Consistency is key.
Marketing: Publish performance stats. Use Stakewiz to verify how your datacenter is listed.

6. RPC Overload Causes Validator Lag

Symptom

Validator lags during high RPC traffic bursts. Clients timeout. Votes delayed.

Root Causes

Running public RPC services (port 8899 exposed) on the same machine as your validator voting identity.

The Fix

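Strict Separation. Firewall the voting validator so that it only exchanges traffic with the cluster (gossip, block propagation, repair, and transaction ingress), and keep port 8899 closed to the public. Use a separate machine for RPC traffic.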
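One way to confirm that the RPC port is not reachable from outside is to check which address it is bound to. The sketch below parses ss -ltn output; rpc_exposed is a hypothetical helper name, and it only tells you about the bind address, not about firewall rules in front of the host.

```shell
# Succeeds if the given TCP port (default 8899) is listening on a
# non-loopback address. Reads `ss -ltn` output on stdin.
# rpc_exposed is a hypothetical helper name.
rpc_exposed() {
  awk -v port="${1:-8899}" '
    $1 == "LISTEN" {
      # Local address is field 4, e.g. 0.0.0.0:8899 or [::1]:8899.
      n = split($4, parts, ":")
      addr = substr($4, 1, length($4) - length(parts[n]) - 1)
      if (parts[n] == port && addr != "127.0.0.1" && addr != "[::1]") found = 1
    }
    END { exit found ? 0 : 1 }
  '
}
```

Usage: ss -ltn | rpc_exposed 8899 && echo "RPC port is bound to a public interface"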

7. Snapshot Download Is Extremely Slow

Symptom

Initial sync takes hours or days. Validator never catches up.

The Fix

Change Source: Use --no-snapshot-fetch if you have local data.
Trusted Peers: Use known fast boot nodes or trusted download sites (Jito, etc).
Network: Ensure you have 1Gbps+ unmetered (Cherry Servers provides this).

8. Validator Gets Delinquent

Symptom

Explorer shows “delinquent”. Voting stops. Rewards drop to zero.

Root Causes

Extended downtime, missed restart after crash, or incompatible cluster restart.

The Fix

Restart the validator service immediately.
Ensure specific flags like --expected-shred-version match the cluster.
Wait for full catchup before checking leader logs.

9. Software Version Issues

Symptom

Node behaves differently after upgrade. Unexpected errors in logs.

The Fix

Read Release Notes: Solana updates often deprecate flags.
Testnet First: Always upgrade your Testnet node before Mainnet.
Downgrade: Keep the previous binary version handy for instant rollbacks.

10. Monitoring & Alerts (Non-Optional)

If you don’t have alerts, you’re flying blind.

Minimum Stack

Prometheus + Grafana: Visualizing trends.
Slot Lag Alert: PagerDuty/Telegram if > 50 slots behind.
Disk Usage Alert: Warning at 75%, Critical at 90%.
Vote Credits: Alert if zero credits for 5 minutes.
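If a full Prometheus stack is not in place yet, the disk and slot-lag thresholds above can be approximated from cron. This is a minimal sketch under stated assumptions: disk_usage_pct, disk_level, and lag_exceeds are hypothetical helper names, notify stands in for whatever PagerDuty or Telegram hook you use, and the slot-lag example assumes the node answers RPC on localhost.

```shell
# Percentage used on the filesystem holding a path (POSIX-format df output).
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Map a usage percentage onto the 75% / 90% thresholds from the list above.
disk_level() {
  if [ "$1" -ge 90 ]; then echo critical
  elif [ "$1" -ge 75 ]; then echo warning
  else echo ok
  fi
}

# Succeeds when cluster_slot - local_slot is greater than the threshold.
lag_exceeds() {
  [ $(( $1 - $2 )) -gt "$3" ]
}

# Example (not run here); notify is a placeholder for your alerting hook.
# level=$(disk_level "$(disk_usage_pct /mnt/ledger)")
# [ "$level" = ok ] || notify "ledger disk $level"
# lag_exceeds "$(solana slot -u mainnet-beta)" "$(solana slot -u localhost)" 50 \
#   && notify "validator more than 50 slots behind"
```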

Final Operator Advice

Most validator failures are not bugs. They are underpowered hardware, poor storage design, lack of monitoring, or bad assumptions about VPS reliability.

Solana rewards operators who treat validators like production infrastructure, not side projects.
