Node Recovery
Sonic nodes are generally stable and can run for extended periods without operator intervention. However, in rare cases, the software may encounter failures.
This page outlines the most common causes of such failures, suggests configuration adjustments to prevent them, and provides recovery steps to help you get your node back online as quickly as possible.
Regardless of the cause, the outcome of a failure is always the same. The node uses a database to track the state of all accounts, smart contracts, transactions, and consensus data.
Parts of this database are kept in memory during normal operation, and pre-calculated and updated data is eventually flushed to persistent storage. If that flush cannot finish properly, the persistent copy of the database is left in an inconsistent state and may be corrupted.
If the node cannot verify the database integrity and consistency, it refuses to start. The sonicd
daemon terminates shortly after you attempt to start it, and the output log will contain an error message indicating that the database integrity check failed.
In general, there are three major reasons for Sonic node failure:
The node's resource management configuration is intentionally left to the node operator. This allows resource sharing and fine-tuning the node for your specific use case, but an improper configuration can also lead to unexpected crashes and failures.
Node utilization scenarios vary widely. Some are very predictable and easy to control; others are more random. This mostly applies to the RPC interface and its operation. If your RPC load varies greatly, especially if it involves calls with a large output-to-input disparity, such as those in the debug
namespace, your node may run out of RAM and be forcefully terminated by the operating system.
Soft failures, for example storage space exhaustion, are usually detected correctly by the node software. In that case, the node tries to terminate gracefully. If an unpredictable issue arises, the node may terminate forcefully, leading to the startup failure described above. This usually happens due to a storage device failure, various types of I/O errors, operating system limits (file descriptors, open sockets, etc.), or a forced system restart.
Check that the environment variable GOMEMLIMIT
is present and set correctly. If the sonicd
node is the only user-space software running on the system, it should be set to between 85% and 90% of the available RAM. Make sure to properly account for any other software running on your system, including modules executed only occasionally or on a schedule. The value should include units for clarity and readability: prefer GOMEMLIMIT=28GiB
over a raw byte value without a unit.
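As a minimal sketch, assuming the node runs under systemd as sonicd.service on a dedicated 32 GiB host (the unit name and file path are assumptions, not part of this guide), the limit can be set through a drop-in file:

# /etc/systemd/system/sonicd.service.d/memory.conf  (unit name and path are assumed)
[Service]
# Roughly 87% of a 32 GiB host; lower this if other services share the machine
Environment=GOMEMLIMIT=28GiB

After editing the drop-in, run systemctl daemon-reload and restart the service so the new limit takes effect.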
Check that the cache size is explicitly specified and set between 12 GiB and 20 GiB. Values lower than 12 GiB may cause issues with processing large blocks, and there is no real benefit to going above 20 GiB. The value itself is specified in MiB. For example, --cache 12000
represents approximately 12 GiB.
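For illustration, assuming a 16 GiB cache and a placeholder data directory (both values are examples, not recommendations from this guide), the flag is passed on the sonicd command line:

# 16384 MiB = 16 GiB of state cache; the data directory path is a placeholder
sonicd --datadir /var/lib/sonic --cache 16384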
Check your operating system limits, especially for open file descriptors and/or sockets. Consult your operating system documentation for your case.
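On Linux, for example, you can inspect the current file-descriptor limit and, for a systemd-managed node, raise it in the service unit (the limit below is an illustrative assumption):

# Show the soft limit on open file descriptors for the current shell
ulimit -n

# Raise the limit for the node service via a systemd drop-in (value is illustrative)
[Service]
LimitNOFILE=65536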
If you use systemd or a similar software lifecycle management tool, check the shutdown timeout of the service. The node shutdown procedure, when deployed on a recommended solid-state drive (SSD), usually takes less than 15 seconds. If you use remotely connected persistent storage, especially when an intermediate layer such as cloud virtualization or Kubernetes is involved, the shutdown may take significantly longer. We recommend setting the timeout to at least 1 minute.
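A minimal systemd sketch, assuming the service is named sonicd.service (an assumption), that gives the node two minutes to flush its cache before being killed:

# /etc/systemd/system/sonicd.service.d/shutdown.conf
[Service]
# Allow at least one minute for a clean shutdown; 120 s leaves headroom for slower storage
TimeoutStopSec=120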
The most common cause of resource exhaustion leading to forced node termination and database corruption is RPC interface utilization. If the allocated memory crosses a system-imposed threshold, the node is terminated unexpectedly and your state database will be corrupted.
Verify what other software components are utilizing the system resources
Repeated crashes of your Sonic node may suggest that another system component is consuming system resources. Examples include a cloud-scheduled system update, a log clean-up process, a solid-state drive sweep and optimization, or a planned backup task. If not accounted for, these components may trigger RAM exhaustion. In such cases, the biggest consumer is typically the Sonic node, and the system may decide to terminate it to free the resources needed to finish the pending work. Update your Sonic node resource limits, especially GOMEMLIMIT
and --cache
, to account for these components.
Check the usage pattern of your node
Some types of RPC API calls produce output that is much larger than their input. Examples include the debug
and trace
namespaces and batch API calls. The whole output has to be kept in memory until the user request is resolved. Configure your system according to the expected usage pattern. We recommend at least 64 GiB of RAM for regular RPC nodes; high-demand nodes should run on 128 GiB of RAM or more.
Do not over-provision cache and/or allocation limits
Increasing the cache above 20 GiB has diminishing returns; you should set it between 12 and 20 GiB. Memory allocated to the cache cannot be used to process RPC calls or to store the responses to be transmitted to end users. Make sure to set the GOMEMLIMIT
value so that the system has enough RAM left for regular operating system tasks; this usually means 85% to 90% of the available system RAM. A typical server operating system needs only around 1–2 GiB to run properly, but usage can spike to 4 GiB or more. If your node uses software RAID, allocate enough RAM for the RAID module to effectively manage storage read/write operations.
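To make the arithmetic concrete, here is an illustrative sizing for a dedicated 64 GiB RPC host (the exact figures are assumptions you should adapt to your own setup):

# GOMEMLIMIT=56GiB  -> roughly 87% of 64 GiB reserved for the Go runtime
# --cache 16000     -> ~16 GiB state cache, inside the recommended 12-20 GiB range
# The remaining headroom covers RPC responses, the OS, and occasional background tasks
GOMEMLIMIT=56GiB sonicd --cache 16000    # (other sonicd flags omitted)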
Moderate access to public RPC API
Load-balance your traffic
The best approach to handling unpredictable RPC API traffic is to use multiple backend nodes and balance the traffic between them based on resource consumption. This usually requires a more complex infrastructure setup. The benefit of such an approach is that regular maintenance does not make your interface inaccessible. Using several smaller systems instead of a single overpowered one may also allow you to better control your resource costs and makes your system as a whole far more resilient. The downside is a much more complex configuration and maintenance, which usually requires knowledge of high-availability setups and scalable architecture deployment.
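One possible shape of such a setup, sketched here with nginx purely as an example balancer (the proxy choice, backend addresses, and port are assumptions, not a recommendation from this guide):

# Inside the http {} block of nginx.conf: spread RPC traffic across two backend nodes
upstream sonic_rpc {
    least_conn;                                           # pick the least busy backend
    server 10.0.0.11:18545 max_fails=3 fail_timeout=30s;  # addresses and port are illustrative
    server 10.0.0.12:18545 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://sonic_rpc;                      # forward RPC calls to the pool
    }
}

With two or more backends behind the proxy, one node can be taken down for maintenance or recovery while the others keep serving traffic.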
A hardware failure usually cannot be predicted, but there are steps you can take to lower the impact of certain types of failures.
Configure graceful shutdown, especially on cloud deployments
Cloud-based systems offer obvious benefits compared to bare-metal solutions. They can be deployed easily, scaled dynamically, and migrated very quickly between different regions and/or system groups. If you opt for this type of configuration, make sure the shutdown process of the node uses a graceful ACPI shutdown, or its equivalent, and is set to wait for the procedure to finish. If that is not possible, configure the shutdown timer so that your Sonic node has enough time to sync its cache with the persistent storage. Please refer to the node deployment guidelines for the usual termination times. A non-cloud environment usually does not suffer from migrations or unexpected resets; however, it may still need to be shut down for hardware maintenance. Make sure to configure your system timeouts and the shutdown procedure to always allow your Sonic node to close properly.
A validator node must always use a single instance deployment. If you attempt to run multiple validator nodes with the same signing key, your validator will be penalized for double-signing, removed from the network, and your stake will be slashed.
Monitor your system health and resources
This may be obvious, but your system will age and eventually fail. Modern systems usually have sensors that can warn you about possible failures and shortages in advance. Refer to your system documentation for details about available monitoring solutions and system-level alerts, so you can take appropriate steps to resolve an imminent failure before it happens.
If your node runs in RPC mode and ran for at least 15 minutes after initialization from the snapshot before it crashed, you may be able to recover the live state from the archive database. In this case, follow these steps:
Try to recover the state from your archive database.
The heal will attempt to recover the state. If the healing succeeds, you can start your node normally and the node will finish syncing from the network. Any failure means your archive state is not consistent and cannot be used for recovery.
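A minimal sketch of the heal attempt, assuming a systemd-managed node, a DATADIR variable pointing at your node data directory, and the sonictool heal command; verify the exact invocation against your node version's documentation or its help output:

# Stop the node before touching the database
sudo systemctl stop sonicd

# Attempt to rebuild the live state from the archive database
sonictool --datadir "${DATADIR}" heal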
If your node has a public RPC API interface, we strongly recommend using a middleware layer to moderate access to the node's RPC port. A good starting point is a reverse proxy configured to limit the number of parallel requests executed on the node interface. The proxy can queue incoming calls before they can be safely pushed to the node for processing. You can also set the proxy to limit the number of incoming calls from a single source, or a group of sources, and reject excessive traffic.
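As an illustration of such a middleware layer, here is a sketch using nginx (again only an example; the rates, zone sizes, and port are assumptions) that queues short bursts and caps parallel requests per source:

# Inside the http {} block of nginx.conf: throttle incoming RPC calls before they reach the node
limit_req_zone  $binary_remote_addr zone=rpc_rate:10m rate=20r/s;
limit_conn_zone $binary_remote_addr zone=rpc_conn:10m;

server {
    listen 80;
    location / {
        limit_req  zone=rpc_rate burst=50;     # queue short bursts, reject the excess
        limit_conn rpc_conn 10;                # at most 10 parallel requests per source
        proxy_pass http://127.0.0.1:18545;     # local node RPC endpoint (port assumed)
    }
}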
Use redundant storage (RAID) to prevent crashes caused by a single drive failure
Persistent storage failure is the most common type of hardware issue you may encounter. The node state data can easily be obtained from the Sonic network; there is no exclusive content on your drive beyond the <datadir>/keystore
folder, and your account keys and validator consensus keys should of course already be backed up securely. However, if the storage drive holding the state database fails, the system will inevitably crash. To prevent the immediate impact of such a failure, you should use redundant storage (RAID) with the redundancy level configured to match your resiliency expectations.
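One possible way to achieve this on Linux is a software RAID-1 mirror built with mdadm (device names and the mount point are illustrative; hardware RAID or a cloud provider's redundant volumes work equally well):

# Mirror two NVMe drives so a single drive failure does not take the node down
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
sudo mkfs.ext4 /dev/md0
sudo mount /dev/md0 /var/lib/sonic     # assumed location of the node data directory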
Use a high-availability deployment for uptime-sensitive RPC systems
As discussed above, if your RPC node is part of critical infrastructure, you should always opt for some level of node redundancy. It not only allows you to recover after an unexpected system failure, but also helps you perform regular maintenance without taking the whole system offline. A good starting point may be a proxy-based load balancer.
If your node runs in validator mode, it does not have enough data to recover from the failure on its own. After you resolve the cause of the node crash, you need to download the latest state snapshot and rebuild the corrupted database from it. Please refer to the validator node deployment guide for the latest snapshot. You should also consider updating to the latest version of the node software before you proceed.
Identify and resolve the reason for the crash.
Make sure you have the latest node version. If your local version is outdated, update the Sonic node software. Use the version
command to check your node version:
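sonicd version     # prints the installed node version; see the sonicd help output if the subcommand differs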
If your heal process failed, you need to remove the <datadir>/carmen
and <datadir>/chaindata
folders, download the latest archive or pruned database snapshot, and rebuild your state from it. Refer to the node deployment guide for the details of the process.
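A hedged sketch of the clean-up step, assuming a systemd-managed node and a DATADIR variable pointing at your node data directory (both assumptions):

# Stop the node, then remove the corrupted state databases
sudo systemctl stop sonicd
rm -rf "${DATADIR}/carmen" "${DATADIR}/chaindata"
# Download the latest snapshot and rebuild the database from it, following the
# deployment guide, before starting the node again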