Solana Validator Troubleshooting Guide (2025)
Introduction
Running a Solana validator is not “set and forget.” Even with good hardware, validators fail for predictable reasons: disk IO saturation, snapshot issues, ledger bloat, RPC overload, voting latency, or network hiccups.
This guide is a field manual, not marketing content. If your validator misses slots, drops out of the leader schedule, fails to vote, or randomly crashes, this page will help you identify what’s actually broken, why it happens, and how to fix it.
How to Use This Guide
Each section follows the same structure:
- Symptom: What you observe.
- Root Cause: What’s actually happening.
- How to Confirm: Commands/logs to check.
- Fix: Exact actions to take.
- Prevention: How to avoid it long-term.
1. Validator Is Missing Slots or Votes
Explorer shows missed leader slots. Vote credit rate drops. Commission earnings decline. Validator looks “healthy” but underperforms.
Root Causes
Solana is extremely sensitive to latency spikes, not average performance.
- CPU Bottlenecks: Poor single-thread performance impacting PoH generation.
- Disk IO Saturation: Accounts DB lookups waiting heavily on NVMe.
- Ledger Lag: Slow banking stage processing.
How to Confirm
solana catchup --our-localhost
Check for high Root slot distance and Vote latency.
Also monitor hardware:
iostat -x 1
top
If NVMe %util is consistently > 80% or await > 5ms, your disk is the bottleneck.
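The thresholds above can be checked mechanically. The following is a minimal sketch, not a definitive tool: flag_busy_disks is a hypothetical helper name, and it assumes NVMe device names start with "nvme". Column layouts differ between sysstat versions, so it looks columns up by header name and handles both the single await column and the split r_await/w_await columns.

```shell
# Sketch: print NVMe devices that exceed the thresholds from the text above.
# Reads `iostat -x` output on stdin. flag_busy_disks is a hypothetical helper.
flag_busy_disks() {
  awk -v max_util=80 -v max_await=5 '
    # Return the named column of the current row, or 0 if the column is absent.
    function val(name) { return (name in col) ? $(col[name]) + 0 : 0 }

    $1 ~ /^Device/ {
      split("", col)                       # reset the column map
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $1 ~ /^nvme/ {
      await = val("await")                 # older sysstat: single await column
      if (val("r_await") > await) await = val("r_await")
      if (val("w_await") > await) await = val("w_await")
      if (val("%util") > max_util || await > max_await) print $1
    }
  '
}
```

A possible invocation is LC_ALL=C iostat -x 1 5 | flag_busy_disks. A device is printed once for every report in which it crosses a threshold, so a device that shows up in most reports is the "consistently" saturated case. The first iostat report covers the time since boot and is best ignored.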
The Fix
- Ensure you are on bare metal, not a VPS or Cloud Instance.
- Move ledger and accounts to separate physical NVMe drives.
- Disable unnecessary RPC services (metrics/geyser).
🚀 Solution: Upgrade Hardware
If your IO wait causes missed slots, you need dedicated lanes. We recommend Cherry Servers for their dedicated NVMe Gen4 arrays.
View Cherry Servers Inventory →
2. Validator Falls Behind (Never Catches Up)
solana catchup never reaches "caught up". Slot lag keeps increasing. Node restarts don't fix it.
Root Causes
- Snapshot Download Slow: Downloading and unpacking the snapshot takes so long that the network has moved further ahead by the time replay starts.
- Ledger Corruption: Bad state requiring a reset.
- Insufficient Throughput: Hardware cannot handle the rate of block replay (TPS).
How to Confirm
journalctl -u solana-validator -n 200
Look for snapshot unpack delays or RocksDB stalls.
The Fix
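Stop the validator, clear the ledger, and start the service again so it fetches a fresh snapshot on startup. This assumes the service unit already carries your normal startup flags and that snapshot fetching is not disabled.
systemctl stop solana-validator
rm -rf /mnt/ledger/*
systemctl start solana-validator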
3. Validator Crashes Randomly
Validator exits without warning. Systemd log shows random stops. Downtime events visible on explorer.
Root Causes
- Out-of-Memory (OOM): RAM exhausted, Linux kernel kills the process.
- Kernel Limits: Open file descriptors exceeded.
- Disk Full: Ledger filled the drive.
How to Confirm
dmesg | grep -i oom
df -h
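To see which process the OOM killer actually hit, the kernel log can be filtered a little further. This is a rough sketch that assumes the common "Killed process <pid> (<name>)" wording, which may differ between kernel versions. oom_victims is a hypothetical helper name.

```shell
# Sketch: list the names of processes terminated by the kernel OOM killer.
# Reads kernel log text on stdin. oom_victims is a hypothetical helper.
# The kernel truncates process names to 15 characters, so expect
# "solana-validato" rather than the full binary name.
oom_victims() {
  sed -nE 's/.*Killed process [0-9]+ \(([^)]*)\).*/\1/p' | sort -u
}
```

It can be fed from the command above, for example dmesg | oom_victims. Empty output means no matching OOM kill is in the current kernel log buffer, which does not rule out older events that have already rotated out.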
The Fix
- Ensure 256GB RAM minimum.
- Disable swap (swapping kills performance).
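- Increase file limits:
ulimit -n 1000000
For a systemd-managed validator, set LimitNOFILE=1000000 in the service unit instead, because ulimit only changes the limit for the current shell and the processes it starts.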
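A limit raised in one shell does not prove the running validator has it, so it helps to check the value that is actually in effect. The helper below is a small sketch with a hypothetical name, nofile_ok, that compares a limit against the 1000000 target above.

```shell
# Sketch: succeed when an open-file limit meets the 1000000 target above.
# nofile_ok is a hypothetical helper; $1 is the limit to check.
nofile_ok() {
  limit=$1
  [ "$limit" = "unlimited" ] && return 0
  [ "$limit" -ge 1000000 ]
}
```

For the current shell, call it as nofile_ok "$(ulimit -n)". For a running process, the soft limit is the fourth field of the "Max open files" line in /proc/<PID>/limits, which can be extracted with awk '/^Max open files/ {print $4}' and passed to the same helper.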
4. High Disk IO Wait (iowait)
CPU looks idle but performance is bad. iowait > 15%. Vote latency spikes.
Root Causes
- Ledger and accounts DB sharing the same disk.
- Slow NVMe (Consumer grade or worn out).
- Filesystem misconfiguration (not using noatime).
The Fix
Separate your disks physically.
- Disk 1: /mnt/ledger (Write Heavy)
- Disk 2: /mnt/accounts (Read/Write Random Heavy)
Avoid RAID 5/6. Use RAID 0 only if you have redundancy elsewhere.
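The layout above can be sanity-checked from the mount table. This is a minimal sketch under a few assumptions: check_layout is a hypothetical helper, the paths are the ones used in this guide, and the comparison is by device name only, so two partitions of the same physical drive would still pass.

```shell
# Sketch: verify the two-disk layout and noatime option described above.
# Reads /proc/mounts-style lines (device mountpoint fstype options ...) on stdin.
# check_layout is a hypothetical helper. No output means every check passed.
check_layout() {
  awk -v ledger=/mnt/ledger -v accounts=/mnt/accounts '
    $2 == ledger || $2 == accounts {
      dev[$2] = $1
      # Wrap the option list in commas so only the exact option matches.
      if (("," $4 ",") !~ /,noatime,/) { print $2 ": not mounted with noatime" }
    }
    END {
      if (!(ledger in dev))   { print ledger ": not a separate mount" }
      if (!(accounts in dev)) { print accounts ": not a separate mount" }
      if ((ledger in dev) && (accounts in dev) && dev[ledger] == dev[accounts]) {
        print "ledger and accounts share " dev[ledger]
      }
    }
  '
}
```

Run it as check_layout < /proc/mounts and treat any printed line as something to fix before looking for other causes of iowait.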
5. Validator Votes But Earns Low Rewards
Validator is online. Votes appear. Rewards are lower than expected.
The Fix
- Reduce Commission: Drop to 0-5% temporarily to attract delegations.
- Improve Uptime: Consistency is key.
- Marketing: Publish performance stats. Use Stakewiz to verify your data center.
6. RPC Overload Causes Validator Lag
Validator lags during high RPC traffic bursts. Clients timeout. Votes delayed.
Root Causes
Running public RPC services (port 8899 exposed) on the same machine as your validator voting identity.
The Fix
Strict Separation. Your validator should effectively be firewalled to only talk to the gossip network. Use a separate machine for RPC traffic.
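One quick way to spot the exposure described above is to look at which address the RPC port is bound to. The sketch below uses a hypothetical helper, rpc_exposed, and only inspects bind addresses in ss output. It says nothing about firewall rules, which may still block the port.

```shell
# Sketch: detect an RPC port that is listening on a non-loopback address.
# Reads `ss -ltn` output on stdin; 8899 is the RPC port named above.
# rpc_exposed is a hypothetical helper. Exit status 0 means "exposed".
rpc_exposed() {
  awk -v port=8899 '
    $1 == "LISTEN" {
      n = split($4, part, ":")            # local address:port is column 4
      addr = substr($4, 1, length($4) - length(part[n]) - 1)
      if (part[n] == port && addr != "127.0.0.1" && addr != "[::1]") found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}
```

A possible use on the voting machine is: if ss -ltn | rpc_exposed; then echo "RPC port is reachable beyond localhost"; fi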
7. Snapshot Download Is Extremely Slow
Initial sync takes hours or days. Validator never catches up.
The Fix
- Change Source: Use --no-snapshot-fetch if you have local data.
- Trusted Peers: Use known fast boot nodes or trusted download sites (Jito, etc).
- Network: Ensure you have 1Gbps+ unmetered (Cherry Servers provides this).
8. Validator Gets Delinquent
Explorer shows “delinquent”. Voting stops. Rewards drop to zero.
Root Causes
Extended downtime, missed restart after crash, or incompatible cluster restart.
The Fix
- Restart the validator service immediately.
- Ensure specific flags like --expected-shred-version match the cluster.
- Wait for full catchup before checking leader logs.
9. Software Version Issues
Node behaves differently after upgrade. Unexpected errors in logs.
The Fix
- Read Release Notes: Solana updates often deprecate flags.
- Testnet First: Always upgrade your Testnet node before Mainnet.
- Downgrade: Keep the previous binary version handy for instant rollbacks.
10. Monitoring & Alerts (Non-Optional)
If you don’t have alerts, you’re flying blind.
Minimum Stack
- Prometheus + Grafana: Visualizing trends.
- Slot Lag Alert: PagerDuty/Telegram if > 50 slots behind.
- Disk Usage Alert: Warning at 75%, Critical at 90%.
- Vote Credits: Alert if zero credits for 5 minutes.
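The disk and slot-lag thresholds above are simple enough to express as shell helpers before a full Prometheus setup exists. This is only a sketch: the function names are hypothetical, and how the slot numbers are obtained and where the alert is sent are left to your own tooling.

```shell
# Sketch of the thresholds listed above as plain shell helpers.
# Function names are hypothetical; wire the results into your own alerting.

# $1 = disk usage in percent (integer). Prints ok, warning, or critical.
disk_alert_level() {
  if   [ "$1" -ge 90 ]; then echo critical
  elif [ "$1" -ge 75 ]; then echo warning
  else echo ok
  fi
}

# $1 = cluster slot, $2 = local slot.
# Exit status 0 when the node is more than 50 slots behind.
slot_lag_alert() {
  [ $(( $1 - $2 )) -gt 50 ]
}
```

The disk percentage can come from the Capacity column of df -P for the ledger mount with the trailing % removed. The two slot numbers would come from whatever source you trust for the cluster tip and for your own node.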
Final Operator Advice
Most validator failures are not bugs. They are underpowered hardware, poor storage design, lack of monitoring, or bad assumptions about VPS reliability.
Solana rewards operators who treat validators like production infrastructure, not side projects.
Need to upgrade? Deploy professional hardware on Cherry Servers →