Field Manual

Solana RPC Node Troubleshooting Guide (2025)

Introduction

A Solana RPC node is not a validator.

It does not vote. It does not produce blocks. But when it fails, everything downstream breaks: wallets fail, dApps error, indexers lag, and users complain. Your infrastructure reputation dies quietly.

Most RPC failures are predictable, repeatable, and preventable. This guide exists to help you identify the real bottleneck, fix it quickly, and harden your RPC stack so it doesn’t page you again.

1. RPC Requests Timing Out

Symptom

Clients receive timeout or 504 errors. Wallets fail to load balances. dApps intermittently break. Requests succeed only after multiple retries.

Root Causes

RPC nodes are read-heavy, not consensus-heavy. Most people configure them incorrectly.

  • CPU Saturation: JSON-RPC processing bottlenecking on single threads under heavy request load.
  • Disk IO Bottlenecks: Accounts DB lookups hitting IOPS limits.
  • Concurrency: Too many simultaneous connections.
  • WebSocket Overload: PubSub notifications drowning the CPU.
  • Config Errors: Validator-style flags still enabled.

How to Confirm

Monitor your system resources first:

top
htop
iostat -x 1

Check RPC latency explicitly:

curl -X POST http://localhost:8899 \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}'

If latency spikes > 500ms under load, you are CPU or IO bound.
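
To put a number on it, curl can report per-request timing. A minimal sketch, assuming the same local endpoint on port 8899 as above:

curl -o /dev/null -s -w 'getHealth total: %{time_total}s\n' \
-X POST http://localhost:8899 \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"getHealth"}'

Run it repeatedly while the node is under real load; sustained totals well above 0.5s confirm the pattern described above.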

The Fix

  • Disable Validator Features: Ensure --no-voting is set.
  • Increase CPU Cores: RPC request handling parallelizes well, so more cores mean more concurrent requests served.
  • Separate Ledger & Accounts: Put them on physically different NVMe drives.
  • Limit Connections: Use Nginx to cap concurrent connections per IP (sketch below).
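
A minimal Nginx sketch for that last point, capping concurrent connections per client IP. The zone name, the limit of 20, and the backend address are assumptions to adapt, not canonical values:

limit_conn_zone $binary_remote_addr zone=rpc_conn:10m;

server {
    listen 80;

    location / {
        limit_conn rpc_conn 20;           # max concurrent connections per client IP
        proxy_pass http://127.0.0.1:8899; # local RPC service
        proxy_set_header Host $host;
    }
}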

Prevention

Treat RPC nodes like API servers. Scale horizontally (add more nodes) instead of vertically. Never run RPC + Validator on the same box.

2. High Memory Usage / OOM Crashes

Symptom

RPC process crashes unexpectedly. System logs show "OOM killer" invoked. Node restarts every time load increases.

Root Causes

  • Account Indexing: Loading too many accounts (or all accounts) into RAM.
  • Memory Leaks: Rare, but possible in older versions.
  • No Limits: Running without cgroup or systemd memory limits (see the sketch under The Fix).
  • History Depth: --enable-rpc-transaction-history configured without abundant RAM.

How to Confirm

dmesg | grep -i oom
free -h

The Fix

  • Increase RAM: 128GB is the absolute minimum, and only enough for a toy node; 256GB - 512GB is standard for production.
  • Disable Unused Indexes: Do not use --account-index program-id unless specifically required.
  • Reduce History: Lower the RPC history depth if you don't need archival data.
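
For the "No Limits" root cause above, systemd can cap the service so a runaway node is contained instead of dragging the whole host down with it. A minimal sketch; the unit name and thresholds are assumptions and must be sized for your hardware:

# /etc/systemd/system/solana-rpc.service.d/memory.conf  (unit name is hypothetical)
[Service]
MemoryHigh=230G   # soft cap: the kernel starts reclaiming aggressively above this
MemoryMax=245G    # hard cap: the service is killed before the host-wide OOM killer gets involved

# Apply it:
# systemctl daemon-reload && systemctl restart solana-rpc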

3. RPC Node Lags Behind the Cluster

Symptom

RPC returns stale block data. Slots lag behind public explorers (Solscan/Solana Beach). Clients see outdated balances.

Root Causes

  • Slow Snapshot Replay: The node takes too long to replay from its snapshot and catch up after a restart.
  • Disk Throughput: NVMe cannot keep up with the write speed of the cluster (RocksDB compaction).
  • Network Latency: Poor peering with the rest of the cluster.

How to Confirm

solana slot
solana catchup

Compare your slot height against api.mainnet-beta.solana.com.
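
A quick sketch of that comparison using getSlot, assuming jq is installed:

LOCAL=$(curl -s -X POST http://localhost:8899 \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' | jq .result)
REMOTE=$(curl -s -X POST https://api.mainnet-beta.solana.com \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' | jq .result)
echo "local=$LOCAL remote=$REMOTE lag=$((REMOTE - LOCAL))"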

The Fix

  • Re-sync: Delete the ledger and start from a fresh, trusted snapshot (e.g., from Jito).
  • Improve Disk: Ensure you are using Gen4 NVMe drives with high TBW and IOPS.
  • Better Data Center: Move to a location with better connectivity to the rest of the cluster.

Prevention

Use high-quality NVMe (Samsung 990 Pro / Enterprise U.2). Keep accounts DB on a dedicated disk. Avoid cheap cloud providers like Hetzner Cloud (Shared vCPU) or DigitalOcean.

4. WebSocket Connections Dropping

Symptom

`onAccountChange` listeners disconnect. Streaming clients drop frequently. Real-time apps fail.

Root Causes

  • Socket Limits: Too many open sockets for the OS user.
  • File Descriptors: ulimit is too low.
  • Load Balancer: Nginx or HAProxy timeouts closing idle connections.

How to Confirm

ulimit -n
netstat -an | wc -l

The Fix

  • Increase Limits: Raise the file-descriptor limit to 1000000; in a systemd unit this is LimitNOFILE=1000000, not ulimit -n (sketch below).
  • Limit Connections: Configure your application to limit the max WebSockets per client.
  • Timeouts: Add connection timeouts to your LB to prune dead connections gracefully.
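
A sketch covering the first and third fixes. The unit name, the WebSocket port (8900 is the usual default, one above the RPC port), and the timeout values are assumptions:

# systemd unit, e.g. solana-rpc.service (name hypothetical)
[Service]
LimitNOFILE=1000000

# Nginx location proxying WebSocket traffic
location /ws {
    proxy_pass http://127.0.0.1:8900;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;    # required for the WebSocket upgrade handshake
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 300s;                   # prune dead connections instead of holding them forever
    proxy_send_timeout 300s;
}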

Prevention

Use reverse proxies properly. Rate-limit abusive clients. Separate WS traffic from HTTP traffic if possible.

5. RPC Is Fast at First, Then Degrades

Symptom

Node performs well immediately after a restart. Gradually slows down over hours or days until it becomes unusable.

Root Causes

  • Memory Fragmentation: Long-running processes fragmenting RAM (especially with jemalloc).
  • Account Cache Bloat: The in-memory cache growing too large.
  • Unbounded Requests: A slow memory leak or growing queue of pending requests.

How to Confirm

Track memory usage over time. Monitor request counts and response latency trends.

The Fix

  • Scheduled Restarts: It's common to schedule a restart every 24-48 hours during a maintenance window (cron sketch below).
  • Tune Cache: Revisit cache-related settings instead of relying on defaults.
  • Limit Fan-out: Ensure `getProgramAccounts` requests are not scanning huge account sets without filters (dataSize/memcmp).
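
For the scheduled-restart fix, plain cron is the simplest form. The unit name and time window are assumptions; pick a low-traffic window and stagger restarts if several nodes sit behind one load balancer:

# /etc/cron.d/solana-rpc-restart  (unit name is hypothetical)
# Restart the RPC service every day at 04:30
30 4 * * * root systemctl restart solana-rpc.service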

6. Disk IO Saturation on RPC Nodes

Symptom

High iowait. Latency spikes under load. CPU appears idle but the system is unresponsive.

Root Causes

  • Shared Disks: Putting Accounts DB and Ledger on the same physical drive.
  • Slow NVMe: Using consumer drives or cloud block storage (EBS/PD).
  • Excessive Indexing: Indexing too many program IDs.

How to Confirm

iostat -x 1

Look for high await times (over 5-10ms consistently).
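
Also confirm whether the accounts DB and ledger actually share a physical device; the mount points below are assumptions:

findmnt -no SOURCE /mnt/accounts
findmnt -no SOURCE /mnt/ledger
lsblk -o NAME,MODEL,ROTA,SIZE

If both paths resolve to the same device, the "Shared Disks" root cause applies.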

The Fix

  • Separate Disks: Dedicate one drive to --accounts and another to --ledger.
  • Upgrade NVMe: Move to Enterprise NVMe.
  • Reduce Reads: Optimize your `getProgramAccounts` filters.

Prevention

Design storage layout upfront. Avoid RAID 5/6 (write penalties). Use RAID 0 or 10 if you understand the risks.

7. RPC Node Crashes During Traffic Spikes

Symptom

Node dies during peak usage events (NFT drops, market crashes). Traffic surges cause immediate downtime.

Root Causes

  • No Rate Limiting: Unlimited requests allowed per IP.
  • No Load Balancing: Direct connection to the node.
  • Single-Node Failure: No redundancy.

How to Confirm

Check access logs and request rates during the crash.

The Fix

  • Add Rate Limits: Implement `limit_req_zone` in Nginx (sketch below).
  • Load Balancer: Place an LB in front of your RPC.
  • Add Nodes: Run multiple RPC nodes behind the LB.
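
A combined sketch of rate limiting plus a small upstream pool in Nginx; the rate, burst, and backend addresses are assumptions to tune:

limit_req_zone $binary_remote_addr zone=rpc_req:10m rate=50r/s;

upstream solana_rpc {
    server 10.0.0.11:8899;   # RPC node 1 (addresses assumed)
    server 10.0.0.12:8899;   # RPC node 2
}

server {
    listen 80;

    location / {
        limit_req zone=rpc_req burst=100 nodelay;  # absorb short spikes, reject sustained abuse
        proxy_pass http://solana_rpc;
    }
}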

Prevention

Assume abuse will happen. Build for worst-case traffic. Never expose a single RPC node publicly without a proxy.

8. Incorrect RPC Configuration Flags

Symptom

Poor performance despite good hardware. Unexpected behavior.

Root Causes

  • Copy-Paste Errors: Copying validator configs to RPC.
  • Deprecated Flags: Using old flags.
  • Defaults: Relying on default settings not tuned for RPC.

The Fix

  • Review Flags: Check the official documentation for the latest RPC flags.
  • Remove Validator Flags: Ensure --no-voting is set and no --vote-account or other validator-only flags are carried over.
  • Tune for Reads: Optimize for a read-heavy workload (example flag set below).
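
As a reference point, a read-tuned startup often looks roughly like the sketch below rather than a copied validator config. Flag names reflect recent Agave releases and all paths are placeholders; verify every flag against the official documentation before use:

# Binary is solana-validator or agave-validator depending on your release
agave-validator \
--identity /path/to/rpc-identity.json \
--ledger /mnt/ledger \
--accounts /mnt/accounts \
--entrypoint entrypoint.mainnet-beta.solana.com:8001 \
--no-voting \
--full-rpc-api \
--private-rpc \
--rpc-port 8899 \
--rpc-bind-address 0.0.0.0 \
--enable-rpc-transaction-history \
--limit-ledger-size \
--dynamic-port-range 8000-8020 \
--log /var/log/solana/rpc.log

The --private-rpc flag keeps the RPC address out of gossip; keep it if you front the node with your own proxy.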

9. Network & Routing Issues

Symptom

Good hardware, bad performance. Inconsistent peer connections.

Root Causes

  • Poor Routing: Bad paths to Solana peers.
  • Congestion: Data center overselling bandwidth.
  • ISP: Suboptimal ISP for blockchain traffic.

The Fix

  • Change Provider: Switch to a provider with better peering (Latitude, Cherry Servers).
  • Change Region: Move to a less congested region.
  • Peer Directly: Manually peer with known good nodes.
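
Before paying for a move, sanity-check the current path from the box itself. mtr against a mainnet entrypoint reports per-hop loss and latency (install mtr first):

# 100 probes, wide report, show AS numbers per hop
mtr -rwz -c 100 entrypoint.mainnet-beta.solana.com

Consistent loss or latency jumps at your provider's edge point to the routing and congestion causes above.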

10. Monitoring RPC Nodes (Mandatory)

If you don’t monitor RPC nodes, you’re blind. You must track at minimum:

  • Request Latency
  • Error Rates (5xx codes)
  • Slot Lag
  • Memory Usage
  • Disk IO Utilization
  • Active Connection Counts

Use Prometheus and Grafana for visualization. Set up alerts for slot lag > 50 slots and free RAM < 10% (rule sketch below).
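
A minimal Prometheus alerting-rule sketch for those two thresholds. The slot-lag metric name is a placeholder that depends on your exporter; the memory metrics are standard node_exporter names:

groups:
  - name: solana-rpc
    rules:
      - alert: RpcSlotLag
        expr: solana_slot_lag > 50                # placeholder metric; wire it to your exporter
        for: 2m
      - alert: RpcLowMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 5m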

Final Operator Advice

RPC nodes fail silently until users complain. By the time you hear about it, damage is already done.

Treat RPC infrastructure like production APIs, customer-facing systems, and revenue-critical services. Because that’s what they are.

Ready to upgrade your infrastructure?

Deploy High-Performance RPC Nodes on Cherry Servers →