Node Recovery


Sonic nodes are generally stable and can run for extended periods without operator intervention. However, in rare cases, the software may encounter failures.

This page outlines the most common causes of such failures, suggests configuration adjustments to prevent them, and provides recovery steps to help you get your node back online as quickly as possible.

1. How to recognize a failed node?

Regardless of the reason, the outcome of a failure is always the same. The node uses a database to track the state of all accounts, smart contracts, transactions, and consensus data.

Parts of the database are kept in memory during normal operation. Pre-calculated and updated data is eventually pushed into persistent storage. If the update cannot finish properly, the persistent copy of the database is left in an inconsistent state and may be corrupted.

If the node cannot verify the database integrity and consistency, it refuses to start. The sonicd daemon terminates shortly after you attempt to start it, and the output log will contain an error message similar to this:

failed to initialize the node: failed to make consensus engine: 
failed to open existing databases: dirty state: gossip-21614: DE

You may encounter a similar error message:

database is empty or the genesis import interrupted

In this case, check that the correct path is set for --datadir and that the filesystem privileges on the folders and files along that path are correct. The sonicd daemon needs read/write access to the data folder and the files within.

If the path and access rights are correct, check the genesis import logs. The log should confirm a successful import once the process is done. If the import was interrupted, for example due to insufficient storage space or an I/O error, fix the underlying issue first and then repeat the import.
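A quick way to verify the data directory is to inspect its ownership and permissions before starting the daemon. The sketch below assumes the /var/lib/sonic path and the sonic service user from the example service unit later on this page; substitute your own --datadir value and user.

    # Check that the data directory exists and is owned by the node user
    ls -ld /var/lib/sonic
    ls -l /var/lib/sonic
    # Grant the node user read/write access if the ownership is wrong
    chown -R sonic:sonic /var/lib/sonic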

2. Why did the node fail?

In general, there are three major reasons for Sonic node failure:

  • Misconfiguration of the Node Software

    The node resource management configuration is intentionally left for the node operator to decide. This opens options for resource sharing and for fine-tuning your node to your specific use case. If not configured properly, it may also lead to unexpected crashes and failures.

  • Resource Exhaustion

    There are many different node utilization scenarios. Some are very predictable and easily controllable; others may be more random. This mostly applies to the RPC interface and its operation. If your RPC load can vary greatly, especially if it involves calls with a high input/output disparity such as the debug namespace, your node may run out of RAM and be forcefully terminated by the operating system.

  • Hardware Failure

    Soft failures, for example storage space exhaustion, are usually detected correctly by the node software. In this case, the node tries to terminate gracefully. If an unpredictable issue arises, the node may terminate forcefully, leading to the startup failure described above. This usually happens due to a storage device failure, various types of I/O errors, operating system limits (file descriptors, open sockets, etc.), or a forced system restart.

How to Fix Node Misconfiguration

  • Check that the environment variable GOMEMLIMIT is present and set correctly. If the sonicd node is the only user-space software running on the system, it should be set to between 85% and 90% of the available RAM. Make sure to properly account for any other software running on your system, including modules executed only occasionally or on a schedule. The value should include units for clarity and readability; for example, use GOMEMLIMIT=28GiB instead of a value in bytes without the unit.

  • Check that the cache size is explicitly specified and set between 12 GiB and 20 GiB. Values lower than 12 GiB may cause issues with processing large blocks, and there is no real benefit in going over 20 GiB of cache. The value itself is specified in MiB; for example, --cache 12000 represents roughly 12 GiB. A sizing sketch follows after this list.

  • Check your operating system limits, especially for open file descriptors and sockets. Consult your operating system documentation for the details relevant to your setup.

    tee -a /etc/security/limits.conf > /dev/null <<EOT
    @sonic           soft    nofile          950000
    @sonic           hard    nofile          950000
    EOT
  • If you use systemd or a similar software lifecycle management tool, check the shutdown timeout of the service. The node shutdown procedure, if deployed on the recommended solid state drive (SSD), usually takes less than 15 seconds. If you use remotely connected persistent storage, especially if an intermediate layer like cloud virtualization or Kubernetes is involved, the shutdown may take significantly longer. We recommend setting the timeout to at least 1 minute.

    [Service]
    Type=simple
    User=sonic
    Group=sonic
    Environment="GOMEMLIMIT=28GiB"
    ExecStart=/usr/bin/sonicd --datadir=/var/lib/sonic --cache=16000
    LimitNOFILE=934073
    TimeoutSec=300
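To make the GOMEMLIMIT and --cache guidance above concrete, here is a minimal sizing sketch. The 32 GiB host size and the data directory path are assumptions; only the flags already shown on this page are used.

    # Assumed: a 32 GiB host dedicated to sonicd.
    # 85-90% of available RAM for the Go runtime: ~0.875 * 32 GiB = 28 GiB
    export GOMEMLIMIT=28GiB
    # --cache is given in MiB; 16000 MiB (~16 GiB) stays within the 12-20 GiB range
    sonicd --datadir=/var/lib/sonic --cache=16000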

How to Fix Resource Exhaustion

The most common reason for resource exhaustion leading to a forced node termination and database corruption is RPC interface utilization. If the allocated memory crosses a system-imposed threshold, the node is terminated unexpectedly and your state database will be corrupted.

  • Verify what other software components are utilizing the system resources

    Repeated crashes of your Sonic node may suggest that another system component is consuming the system resources. Examples include a cloud-scheduled system update, a log clean-up process, a solid state drive sweep and optimization, or a planned backup task. If not accounted for, these components may trigger RAM exhaustion. In such cases, the biggest consumer is usually the Sonic node, and the system may decide to terminate it to free the resources needed to finish the pending work. Update your Sonic node resource limits, especially GOMEMLIMIT and --cache, to account for these components.

  • Check the usage pattern of your node

    Some types of RPC API calls have a very high ratio between input and output size, for example the debug and trace namespaces and batch API calls. The whole output has to be kept in memory until the user request is resolved. Configure your system according to the expected usage pattern. We recommend at least 64 GiB of RAM for regular RPC nodes; high-demand nodes should run on 128 GiB of RAM or more.

  • Do not over-provision cache and/or allocation limits

    Increasing the cache above 20 GiB has diminishing returns; you should set it between 12 and 20 GiB. The memory allocated for the cache cannot be used to process RPC calls or to store the responses to be transmitted to end users. Make sure to set the GOMEMLIMIT value so that the system has enough RAM available for regular operating system tasks; this usually means between 85% and 90% of the available system RAM. A typical server operating system needs only around 1–2 GiB to run properly, but its usage can spike to 4 GiB or more. If your node utilizes a software RAID, allocate enough RAM for the RAID module to effectively manage storage read/write operations.

  • Moderate access to public RPC API

    If your node has a public RPC API interface, we strongly recommend using a middleware layer to moderate access to the node RPC port. A great start would be a proxy server, for example HAproxy or Nginx, configured to limit the number of parallel requests executed on the node interface. The proxy can queue incoming calls before they are safely pushed to the node for processing. You can also set the proxy to limit the number of incoming calls from a single source, or a group of sources, and reject excessive traffic. A minimal proxy sketch follows after this list.

  • Load-balance your traffic

    The best approach to handling unpredictable RPC API traffic is to use multiple backend nodes and to balance the traffic based on resource consumption. This usually requires a more complex infrastructure setup. The benefit of such an approach is that regular maintenance does not make your interface inaccessible. Using several smaller systems instead of a single overpowered one may allow you to better control your resource costs and makes your system far more resilient as a whole. The downside is a much more complex configuration and maintenance, which usually requires knowledge of high availability setups and scalable architecture deployment.
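As a starting point for the access moderation described above, the following is a minimal sketch of an Nginx reverse proxy in front of the node RPC port. The listen port, the upstream address, and the rate and connection limits are assumptions; tune them to your own traffic profile.

    # Place inside the http {} context of nginx.conf
    limit_req_zone  $binary_remote_addr zone=rpc_req:10m  rate=20r/s;
    limit_conn_zone $binary_remote_addr zone=rpc_conn:10m;

    server {
        listen 8545;                            # assumed public-facing port

        location / {
            limit_req  zone=rpc_req burst=40;   # queue short bursts, reject the rest
            limit_conn rpc_conn 10;             # cap parallel requests per source IP
            proxy_pass http://127.0.0.1:18545;  # assumed address of the local node RPC
        }
    }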

How to Fix Hardware Failures

A hardware failure usually cannot be predicted, but there are some steps you can take to lower the impact of some types of failures.

  • Configure graceful shutdown, especially on cloud deployments

    Cloud-based systems offer obvious benefits compared to bare metal solutions. They can be deployed easily, scaled dynamically, and migrated very quickly between different regions and/or system groups. If you opt for this type of configuration, make sure the shutdown process of the node uses a graceful ACPI shutdown, or its equivalent, and is set to wait for the procedure to finish. If that is not possible, configure the shutdown timer so that your Sonic node has enough time to sync its cache with the persistent storage. Please refer to the node deployment guidelines for the usual termination times. A non-cloud environment usually does not suffer from migrations or unexpected resets, but it may still need to be shut down for hardware maintenance. Make sure to configure your system timeouts and the shutdown procedure to always allow your Sonic node to close properly.

A validator node must always use a single instance deployment. If you attempt to run multiple validator nodes with the same signing key, your validator will be penalized for double-signing, removed from the network, and your stake will be slashed.

  • Monitor your system health and resources

    This may be obvious, but your system will age and will eventually fail. Modern systems usually have sensors which can warn you about possible failures and shortages in advance. Refer to your system documentation for details about the available monitoring solutions and system-level alerts, allowing you to take appropriate steps to resolve an imminent crash before it happens. A short health-check sketch follows after this list.

  • Use redundant storage (RAID) to prevent a crash caused by a single drive failure

    Persistent storage failure is the most common type of hardware issue you may encounter. The node state data can be easily obtained from the Sonic network; there is no exclusive content on your drive beyond the <datadir>/keystore folder. Surely, your account keys and the validator consensus keys are already backed up securely. However, if the storage drive holding the state database fails, the system will inevitably crash. To prevent the immediate impact of such a failure, you should utilize RAID storage with the redundancy level configured to match your resiliency expectations.

  • Use a high availability deployment setup for up-time sensitive RPC systems

    As discussed above, if your RPC node is part of a critical infrastructure, you should always opt for some level of node redundancy. It not only allows you to recover after an unexpected system failure, but also helps you perform regular maintenance without taking the whole system down. A great starting point may be a proxy-based load balancer, for example HAproxy.
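The following is a small health-check sketch covering the monitoring and RAID points above. The device name and data directory are assumptions; adjust them to your own hardware, and note that smartctl usually ships in the smartmontools package.

    df -h /var/lib/sonic        # free space on the volume holding the state database
    free -h                     # available RAM and swap usage
    smartctl -H /dev/nvme0n1    # overall drive health (assumed device name)
    cat /proc/mdstat            # software RAID status, if mdadm RAID is used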

3. Recover From the Node Failure

If your node runs in RPC mode and it ran for at least 15 minutes after being initialized from the snapshot before it crashed, you may be able to recover the live state from the archive database. In this case, follow these steps:

  1. Identify and fix the reason for the crash.

  2. Make sure you have the latest available Sonic node version. If your local version is outdated, build the latest Sonic node software. Use the version command to check your node version:

    $ sonicd version
    Sonic
    Version: 2.0.5
    Git Commit: ea9e363178ea9fb28a723beb803fb5dcf223cbbc
    ...

  3. Try to recover the state from your archive database.

    GOMEMLIMIT=28GiB sonictool --datadir <path> --cache 12000 heal

    The heal command will attempt to recover the state. If the healing succeeds, you can start your node normally and the node will finish syncing from the network. Any failure means your archive state is not consistent and cannot be used to recover from the failure.

If the heal process failed, you need to remove the <datadir>/carmen and <datadir>/chaindata folders, download the latest archive or pruned database snapshot, and rebuild your state from it. Refer to the archive node database priming guide for the details of the process. A short clean-up sketch follows below.
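A minimal clean-up sketch, assuming the node runs as the sonic systemd service from the example unit above and uses /var/lib/sonic as its data directory; the snapshot download and database priming itself are described in the archive node database priming guide.

    systemctl stop sonic                                   # stop the node service first
    rm -rf /var/lib/sonic/carmen /var/lib/sonic/chaindata  # drop the corrupted state
    # download and unpack the latest snapshot, then rebuild the database
    # following the archive node database priming guide before starting the node again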

If your node runs in validator mode, it does not have enough data to recover from the failure on its own. After you resolve the reason for the node crash, you need to download the latest state snapshot and rebuild the corrupted database from it. Please refer to the validator node deployment guide for instructions on how to obtain and unpack the latest snapshot. You should also consider upgrading your node to the latest available version of the node software before you proceed.
