Dear Validators and Node Operators,
I am excited to share an important update regarding the State Sync mechanism of the protocol.
Our goal is to reinvent how nodes sync state, making it faster, more scalable, and more reliable.
This update is particularly relevant to those who run nodes and care about the protocol’s performance.
Goals
Reinventing State Sync will let us achieve three bigger goals:
Goal 1: Make it cheap to run a Chunk-only Producer
Sharding Phase 1 introduced Chunk-only Producers as a cost-effective way to participate in the network. The idea was that a node with a small stake gets the same rate of rewards, but only needs to run inexpensive hardware. This new Validator role tracks only a single shard and produces chunks for that shard. However, we discovered that there is no reliable mechanism for these nodes to switch between tracking different shards.
Goal 2: Scale the network to 100 Shards
Today the network requires Validator nodes to track all shards, which prevents the network from scaling beyond 4 shards.
To scale the network, validators will have to validate only subsets of transactions, as described in Sharding Phase 2. This means we can’t move to Sharding Phase 2 without a reliable mechanism for switching between tracked shards.
Goal 3: Sync a node that is far behind
When a node is offline for a significant period, syncing all the blocks created during that time becomes costly and time-consuming. State Sync allows nodes to skip processing most blocks and sync only the blocks created in the current epoch. See the State Sync documentation for details. This mechanism will not only improve sync time, but will also make it easier to switch between tracked shards.
Why State Sync doesn’t work today
Currently, State Sync faces challenges on testnet and mainnet for the following reasons:
- The states of Shard 2 and Shard 3 are too large.
- Obtaining state parts is very inefficient.
When a node needs to obtain the state of a shard, it requests state parts from other nodes in the network. However, this process poses a significant challenge: computing all the necessary state parts for a large shard can take as long as 70 hours, primarily because it requires hundreds of millions of random lookups in RocksDB. This workload noticeably degrades the performance of Validator nodes, leading many of them to throttle or completely ignore state part requests.
Requiring 70 hours of computation to synchronize a single node is impractical for the network as a whole.
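As a back-of-the-envelope illustration (the figures here are ours, for intuition only, not exact measurements): about 250 million random RocksDB reads at roughly 1 ms per disk-bound read amounts to 250,000,000 ms ≈ 250,000 s, i.e. roughly 70 hours.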
The Plan: Fast, Scalable, and Reliable State Sync
We’re going to completely rework the State Sync mechanism to make it fast, scalable, and reliable.
The plan comprises two main parts:
- State Sync via S3 - to be code complete and tested by end of June 2023
- BitTorrent-like State Sync - to be code complete by end of December 2023
State Sync via S3
This is a short-term solution that addresses all the issues mentioned above, but at the cost of being centralized.
As soon as an epoch completes, multiple Pagoda-managed nodes will ensure that the State is available in a publicly accessible location in AWS S3 within a few hours.
Nodes requiring the State for specific shards can discover and download it from the AWS S3 location, making State Sync faster and more reliable.
This short-term solution leverages existing infrastructure: Pagoda already provides and manages snapshots of the database for testnet and mainnet using AWS S3, and this is the de-facto standard way to set up new nodes or to “reprovision” nodes to fix issues such as DB corruption. While this approach is centralized, downloading state from the same service seems an acceptable tradeoff until a decentralized solution is implemented. Note also that you’ll be able to make your own dumps of State to S3 or other external storage.
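To make the discovery step concrete, here is a minimal sketch of how a node might construct the location of a state part in external storage. The bucket name, key layout, and function below are illustrative assumptions for this post, not the actual nearcore implementation:

```rust
/// Illustrative sketch only: the bucket name and key layout are our
/// assumptions for this example, not necessarily what nearcore will ship.
fn state_part_key(
    chain_id: &str,    // e.g. "mainnet" or "testnet"
    epoch_height: u64, // the epoch whose state is being synced
    shard_id: u64,     // the shard the node wants to track
    part_id: u64,      // index of this part within the shard's state
    num_parts: u64,    // total number of parts for the shard
) -> String {
    format!(
        "chain_id={chain_id}/epoch_height={epoch_height}/shard_id={shard_id}/state_part_{part_id:06}_of_{num_parts:06}"
    )
}

fn main() {
    // A node that needs shard 3's state for epoch 1234 would fetch keys
    // like this from the publicly readable bucket with a plain HTTP GET:
    let key = state_part_key("mainnet", 1234, 3, 0, 42);
    println!("s3://example-state-dump-bucket/{key}");
}
```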
As part of our efforts to enhance the State Sync mechanism, we are undertaking a rework of several key aspects. These include:
- Format of state parts: By optimizing the format of state parts, we can represent the same data more compactly, significantly reducing the number of bytes that need to be downloaded during State Sync.
- Size of state parts: The current 1 MB state parts carry excessive overhead. By increasing the part size to 30 MB, we can reduce the overall size of state dumps by 20% for mainnet and 68% for testnet. The savings come from deduplicating identical values, such as identical Contracts deployed to multiple accounts, and from reducing the overhead required for validating state parts.
- Creation of state parts: Instead of walking the trie and performing a random lookup for each Node and Value, we plan to leverage Flat Storage with small values inlined. This change is expected to speed up the creation of state parts by an order of magnitude.
- Application of state parts: Rather than waiting for all state parts to become available before applying them sequentially, we will apply each State Part as soon as it becomes available (see the sketch after this list). While this may result in some Nodes and Values being written multiple times, the trade-off ensures a faster and more continuous application of state parts.
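To illustrate the last point, here is a minimal sketch of the streaming approach, with hypothetical types and placeholder logic rather than nearcore code: a download stage hands each part to an apply stage over a channel the moment it arrives.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-ins for real state parts and their application;
// this is a sketch of the idea, not nearcore code.
struct StatePart {
    id: u64,
    bytes: Vec<u8>,
}

fn download_part(id: u64) -> StatePart {
    // Placeholder for fetching a part from a peer or external storage.
    StatePart { id, bytes: vec![0u8; 1024] }
}

fn apply_part(part: &StatePart) {
    // Placeholder for writing the part's Nodes and Values to the local
    // store. Some entries may be written more than once across parts;
    // that is the duplication accepted in exchange for streaming.
    println!("applied part {} ({} bytes)", part.id, part.bytes.len());
}

fn main() {
    let num_parts = 8;
    let (tx, rx) = mpsc::channel::<StatePart>();

    // Download stage: hands each part over as soon as it is fetched,
    // in whatever order parts happen to arrive.
    let downloader = thread::spawn(move || {
        for id in 0..num_parts {
            tx.send(download_part(id)).expect("apply stage hung up");
        }
    });

    // Apply stage: consumes each part immediately instead of waiting for
    // all parts to be present before starting sequential application.
    for part in rx {
        apply_part(&part);
    }
    downloader.join().unwrap();
}
```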
We aim to make this code complete and tested by the end of June 2023, and include it in nearcore version 1.36.0.
BitTorrent-like State Sync
In the long term, we envision a decentralized solution inspired by the BitTorrent protocol, which will enable all nodes to participate in State Sync. Here’s a glimpse of our plans.
Nodes with State will create state parts and make them available for download by their peers. Using information exchange similar to BitTorrent, nodes will facilitate the distribution and retrieval of state parts across the network. This approach eliminates the disproportionate load on validator nodes.
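As a rough sketch of the kind of exchange we have in mind (message and helper names here are hypothetical; the real protocol is still being designed), a peer could advertise the parts it holds with a bitfield, and a syncing node could pick a missing part and a peer to request it from:

```rust
// Hypothetical message: a peer advertises which parts of a shard's state
// it holds, much like BitTorrent's "have"/bitfield messages. In practice
// this would also carry shard and epoch identifiers.
struct Have {
    parts: Vec<bool>, // parts[i] == true means the peer can serve part i
}

/// Pick the next (peer, part) pair to request: the first part we are
/// still missing that some peer advertises.
fn pick_request(local: &[bool], peers: &[Have]) -> Option<(usize, usize)> {
    for (part_id, have_local) in local.iter().enumerate() {
        if *have_local {
            continue; // we already hold this part
        }
        for (peer_id, have) in peers.iter().enumerate() {
            if have.parts.get(part_id).copied().unwrap_or(false) {
                return Some((peer_id, part_id));
            }
        }
    }
    None // nothing left to request
}

fn main() {
    // We hold parts 0 and 3; two peers advertise what they can serve.
    let local = vec![true, false, false, true];
    let peers = vec![
        Have { parts: vec![true, true, false, true] },
        Have { parts: vec![false, false, true, true] },
    ];
    if let Some((peer_id, part_id)) = pick_request(&local, &peers) {
        // A real client would download the part, mark it as held,
        // and repeat until no parts are missing.
        println!("request part {part_id} from peer {peer_id}");
    }
}
```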
We aim to make this mechanism available by the end of 2023.
Extra: Epoch Sync
The Epoch Sync mechanism is mostly code complete. We aim to complete the effort and release this feature by the end of September 2023.
We are very excited about the problem Epoch Sync is going to solve: the ability to start a node with an empty database and have it fully synced in just a few hours.
Instead of downloading every single header in the network’s history, Epoch Sync introduces a light-client sync approach in which only a single header per epoch needs to be downloaded. This can make header sync roughly 1000x faster.
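For intuition (assuming roughly one-second blocks and mainnet’s 43,200-block epochs, both approximations on our part): a node catching up on ten epochs would download 10 headers instead of about 432,000. The number of headers shrinks by a factor of tens of thousands; since other sync work remains, the end-to-end speedup is smaller, hence the ~1000x estimate.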
Conclusion: Collaboration and Feedback
We are dedicated to making State Sync highly functional as soon as possible. By implementing State Sync via S3 as a short-term solution and working towards a BitTorrent-like decentralized solution in the long term, we aim to address the challenges associated with state synchronization.
We appreciate your patience throughout this process and encourage your feedback, suggestions, and concerns regarding our proposed plan and solutions. Together, we can build a robust and scalable protocol.
Thank you for your ongoing support and participation.