Node Recovery
Sonic nodes are generally stable and can run for extended periods without operator intervention. However, in rare cases, the software may encounter failures.
This page outlines the most common causes of such failures, suggests configuration adjustments to prevent them, and provides recovery steps to help you get your node back online as quickly as possible.
Regardless of the cause, the outcome of a failure is always the same. The node uses a database to track the state of all accounts, smart contracts, transactions, and consensus data.
Parts of this database are kept in memory during normal operation, and pre-calculated and updated data is eventually flushed to persistent storage. If that flush cannot finish properly, the persistent copy of the database is left in an inconsistent state and may be corrupted.
If the node cannot verify the database integrity and consistency, it refuses to start. The sonicd
daemon terminates shortly after you attempt to start it, and the output log will contain an error message indicating that the database integrity check failed.
In general, there are three major reasons for Sonic node failure:
The node's resource management configuration is intentionally left to the node operator. This allows resource sharing and fine-tuning the node for your specific use case, but an improper configuration can also lead to unexpected crashes and failures.
Node utilization scenarios vary widely. Some are very predictable and easy to control; others are more random. This mostly applies to the RPC interface and its operation. If your RPC load varies greatly, especially if it involves calls with a large output-to-input disparity, such as those in the debug
namespace, your node may run out of RAM and be forcefully terminated by the operating system.
Soft failures, for example storage space exhaustion, are usually detected correctly by the node software. In that case, the node tries to terminate gracefully. If an unpredictable issue arises, the node may terminate forcefully, leading to the startup failure described above. This usually happens due to a storage device failure, various types of I/O errors, operating system limits (file descriptors, open sockets, etc.), or a forced system restart.
Check that the environment variable GOMEMLIMIT
is present and set correctly. If the sonicd
node is the only user-space software running on the system, it should be set to between 85% and 90% of the available RAM. Make sure to properly account for any other software running on your system, including modules executed only occasionally or on a schedule. The value should include units for clarity and readability: prefer GOMEMLIMIT=28GiB
over a raw byte value without a unit.
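As a minimal sketch, assuming the node runs under systemd as sonicd.service on a dedicated 32 GiB host (the unit name and file path are assumptions, not part of this guide), the limit can be set through a drop-in file:

# /etc/systemd/system/sonicd.service.d/memory.conf  (unit name and path are assumed)
[Service]
# Roughly 87% of a 32 GiB host; lower this if other services share the machine
Environment=GOMEMLIMIT=28GiB

After editing the drop-in, run systemctl daemon-reload and restart the service so the new limit takes effect.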
Check that the cache size is explicitly specified and set between 12 GiB and 20 GiB. Values lower than 12 GiB may cause issues with processing large blocks, and there is no real benefit to going above 20 GiB. The value itself is specified in MiB. For example, --cache 12000
represents approximately 12 GiB.
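For illustration, assuming a 16 GiB cache and a placeholder data directory (both values are examples, not recommendations from this guide), the flag is passed on the sonicd command line:

# 16384 MiB = 16 GiB of state cache; the data directory path is a placeholder
sonicd --datadir /var/lib/sonic --cache 16384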
Check your operating system limits, especially for open file descriptors and/or sockets. Consult your operating system documentation for your case.
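On Linux, for example, you can inspect the current file-descriptor limit and, for a systemd-managed node, raise it in the service unit (the limit below is an illustrative assumption):

# Show the soft limit on open file descriptors for the current shell
ulimit -n

# Raise the limit for the node service via a systemd drop-in (value is illustrative)
[Service]
LimitNOFILE=65536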
If you use systemd or a similar software lifecycle management tool, check the shutdown timeout of the service. The node shutdown procedure, when deployed on a recommended solid-state drive (SSD), usually takes less than 15 seconds. If you use remotely connected persistent storage, especially when an intermediate layer such as cloud virtualization or Kubernetes is involved, the shutdown may take significantly longer. We recommend setting the timeout to at least 1 minute.
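A minimal systemd sketch, assuming the service is named sonicd.service (an assumption), that gives the node two minutes to flush its cache before being killed:

# /etc/systemd/system/sonicd.service.d/shutdown.conf
[Service]
# Allow at least one minute for a clean shutdown; 120 s leaves headroom for slower storage
TimeoutStopSec=120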
The most common cause of resource exhaustion leading to forced node termination and database corruption is RPC interface utilization. If the allocated memory crosses a system-imposed threshold, the node is terminated unexpectedly and your state database will be corrupted.
Verify what other software components are utilizing the system resources
Repeated crashes of your Sonic node may suggest that another system component is consuming system resources. Examples include a cloud-scheduled system update, a log clean-up process, a solid-state drive sweep and optimization, or a planned backup task. If not accounted for, these components may trigger RAM exhaustion. In such cases, the biggest consumer is typically the Sonic node, and the system may decide to terminate it to free the resources needed to finish the pending work. Update your Sonic node resource limits, especially GOMEMLIMIT
and --cache
, to account for these components.
Check the usage pattern of your node
Some types of RPC API calls produce output that is much larger than their input. Examples include the debug
and trace
namespaces and batch API calls. The whole output has to be kept in memory until the user request is resolved. Configure your system according to the expected usage pattern. We recommend at least 64 GiB of RAM for regular RPC nodes; high-demand nodes should run on 128 GiB of RAM or more.
Do not over-provision cache and/or allocation limits
Increasing the cache above 20 GiB has diminishing returns; you should set it between 12 and 20 GiB. Memory allocated to the cache cannot be used to process RPC calls or to store the responses to be transmitted to end users. Make sure to set the GOMEMLIMIT
value so that the system has enough RAM left for regular operating system tasks; this usually means 85% to 90% of the available system RAM. A typical server operating system needs only around 1–2 GiB to run properly, but usage can spike to 4 GiB or more. If your node uses software RAID, allocate enough RAM for the RAID module to effectively manage storage read/write operations.
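To make the arithmetic concrete, here is an illustrative sizing for a dedicated 64 GiB RPC host (the exact figures are assumptions you should adapt to your own setup):

# GOMEMLIMIT=56GiB  -> roughly 87% of 64 GiB reserved for the Go runtime
# --cache 16000     -> ~16 GiB state cache, inside the recommended 12-20 GiB range
# The remaining headroom covers RPC responses, the OS, and occasional background tasks
GOMEMLIMIT=56GiB sonicd --cache 16000    # (other sonicd flags omitted)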
Moderate access to public RPC API
Load-balance your traffic
The best approach to handling unpredictable RPC API traffic is to use multiple backend nodes and balance the traffic between them based on resource consumption. This usually requires a more complex infrastructure setup. The benefit of such an approach is that regular maintenance does not make your interface inaccessible. Using several smaller systems instead of a single overpowered one may also allow you to better control your resource costs and makes your system as a whole far more resilient. The downside is a much more complex configuration and maintenance, which usually requires knowledge of high-availability setups and scalable architecture deployment.
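One possible shape of such a setup, sketched here with nginx purely as an example balancer (the proxy choice, backend addresses, and port are assumptions, not a recommendation from this guide):

# Inside the http {} block of nginx.conf: spread RPC traffic across two backend nodes
upstream sonic_rpc {
    least_conn;                                           # pick the least busy backend
    server 10.0.0.11:18545 max_fails=3 fail_timeout=30s;  # addresses and port are illustrative
    server 10.0.0.12:18545 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://sonic_rpc;                      # forward RPC calls to the pool
    }
}

With two or more backends behind the proxy, one node can be taken down for maintenance or recovery while the others keep serving traffic.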
A hardware failure usually cannot be predicted, but there are steps you can take to lower the impact of certain types of failures.
Configure graceful shutdown, especially on cloud deployments
Cloud-based systems offer obvious benefits compared to bare-metal solutions. They can be deployed easily, scaled dynamically, and migrated very quickly between different regions and/or system groups. If you opt for this type of configuration, make sure the shutdown process of the node uses a graceful ACPI shutdown, or its equivalent, and is set to wait for the procedure to finish. If that is not possible, configure the shutdown timer so that your Sonic node has enough time to sync its cache with the persistent storage. Please refer to the node deployment guidelines for the usual termination times. A non-cloud environment usually does not suffer from migrations or unexpected resets; however, it may still need to be shut down for hardware maintenance. Make sure to configure your system timeouts and the shutdown procedure to always allow your Sonic node to close properly.
A validator node must always use a single instance deployment. If you attempt to run multiple validator nodes with the same signing key, your validator will be penalized for double-signing, removed from the network, and your stake will be slashed.
Monitor your system health and resources
This may be obvious, but your system will age and eventually fail. Modern systems usually have sensors that can warn you about possible failures and shortages in advance. Refer to your system documentation for details about available monitoring solutions and system-level alerts, so you can take appropriate steps to resolve an imminent failure before it happens.
If your node runs in RPC mode and ran for at least 15 minutes after initialization from the snapshot before it crashed, you may be able to recover the live state from the archive database. In this case, follow these steps:
Try to recover the state from your archive database.
The heal will attempt to recover the state. If the healing succeeds, you can start your node normally and the node will finish syncing from the network. Any failure means your archive state is not consistent and cannot be used for recovery.
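A minimal sketch of the heal attempt, assuming a systemd-managed node, a DATADIR variable pointing at your node data directory, and the sonictool heal command; verify the exact invocation against your node version's documentation or its help output:

# Stop the node before touching the database
sudo systemctl stop sonicd

# Attempt to rebuild the live state from the archive database
sonictool --datadir "${DATADIR}" heal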
If your node has a public RPC API interface, we strongly recommend using a middleware layer to moderate access to the node's RPC port. A good starting point is a reverse proxy configured to limit the number of parallel requests executed on the node interface. The proxy can queue incoming calls before they can be safely pushed to the node for processing. You can also set the proxy to limit the number of incoming calls from a single source, or a group of sources, and reject excessive traffic.
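As an illustration of such a middleware layer, here is a sketch using nginx (again only an example; the rates, zone sizes, and port are assumptions) that queues short bursts and caps parallel requests per source:

# Inside the http {} block of nginx.conf: throttle incoming RPC calls before they reach the node
limit_req_zone  $binary_remote_addr zone=rpc_rate:10m rate=20r/s;
limit_conn_zone $binary_remote_addr zone=rpc_conn:10m;

server {
    listen 80;
    location / {
        limit_req  zone=rpc_rate burst=50;     # queue short bursts, reject the excess
        limit_conn rpc_conn 10;                # at most 10 parallel requests per source
        proxy_pass http://127.0.0.1:18545;     # local node RPC endpoint (port assumed)
    }
}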
Use redundant storage (RAID) to prevent crashes caused by a single drive failure
Persistent storage failure is the most common type of hardware issue you may encounter. The node state data can easily be obtained from the Sonic network; there is no exclusive content on your drive beyond the <datadir>/keystore
folder, and your account keys and validator consensus keys should of course already be backed up securely. However, if the storage drive holding the state database fails, the system will inevitably crash. To prevent the immediate impact of such a failure, you should use redundant storage (RAID) with the redundancy level configured to match your resiliency expectations.
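One possible way to achieve this on Linux is a software RAID-1 mirror built with mdadm (device names and the mount point are illustrative; hardware RAID or a cloud provider's redundant volumes work equally well):

# Mirror two NVMe drives so a single drive failure does not take the node down
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
sudo mkfs.ext4 /dev/md0
sudo mount /dev/md0 /var/lib/sonic     # assumed location of the node data directory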
Use a high-availability deployment for uptime-sensitive RPC systems
As discussed above, if your RPC node is part of critical infrastructure, you should always opt for some level of node redundancy. It not only allows you to recover after an unexpected system failure, but also helps you perform regular maintenance without taking the whole system offline. A good starting point may be a proxy-based load balancer.
If your node runs in validator mode, it does not have enough data to recover from the failure on its own. After you resolve the cause of the node crash, you need to download the latest state snapshot and rebuild the corrupted database from it. Please refer to the validator node deployment guide for the latest snapshot. You should also consider updating to the latest version of the node software before you proceed.
Identify and resolve the reason for the crash.
Make sure you have the latest node version. If your local version is outdated, update the Sonic node software. Use the version
command to check your node version:
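sonicd version     # prints the installed node version; see the sonicd help output if the subcommand differs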
If your heal process failed, you need to remove the <datadir>/carmen
and <datadir>/chaindata
folders, download the latest archive or pruned database snapshot, and rebuild your state from it. Refer to the node deployment guide for the details of the process.
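A hedged sketch of the clean-up step, assuming a systemd-managed node and a DATADIR variable pointing at your node data directory (both assumptions):

# Stop the node, then remove the corrupted state databases
sudo systemctl stop sonicd
rm -rf "${DATADIR}/carmen" "${DATADIR}/chaindata"
# Download the latest snapshot and rebuild the database from it, following the
# deployment guide, before starting the node again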