Dear Validators and Node Operators,
I am excited to share an important update regarding the State Sync mechanism of the protocol.
Our goal is to reinvent how nodes sync state, making it faster, more scalable, and more reliable.
This update is particularly relevant to those who run nodes and care about the protocol’s performance.
Goals
Reinventing State Sync will let us achieve three bigger goals:
Goal 1: Make it cheap to run a Chunk-only Producer
Sharding Phase 1 introduced Chunk-only Producers as a cost-effective way to participate in the network. The idea was that a node with a small stake gets the same rate of rewards, but only needs to run inexpensive hardware. This new Validator role tracks only a single shard and produces chunks for that shard. However, we discovered that there is no reliable mechanism for these nodes to switch between tracking different shards.
Goal 2: Scale the network to 100 Shards
Today the network requires Validator nodes to track all shards, which prevents the network from scaling beyond 4 shards.
To scale the network, validators will have to validate only subsets of transactions, as described in Sharding Phase 2. This means we can’t move to Sharding Phase 2 without a reliable mechanism for switching between tracked shards.
Goal 3: Sync a node that is far behind
When a node is offline for a significant period, syncing all the blocks created during that time becomes costly and time-consuming. State Sync allows nodes to skip processing most blocks and sync only the blocks created in the current epoch. See the State Sync documentation for details. This mechanism will not only improve sync time, but will also make it easier to switch between tracked shards.
Why State Sync doesn’t work today
Currently, State Sync faces challenges on testnet and mainnet for the following reasons:
- The states of Shard 2 and Shard 3 are too large.
- Obtaining state parts is very inefficient.
When a node needs to obtain the state of a shard, it requests state parts from other nodes in the network. However, this process poses a significant challenge: computing all the necessary state parts for a large shard can take as long as 70 hours, primarily because it requires hundreds of millions of random lookups in RocksDB. This workload noticeably degrades the performance of Validator nodes, leading many of them to throttle or completely ignore state part requests.
Requiring 70 hours of computation to synchronize a single node is impractical for the network as a whole.
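As a back-of-the-envelope illustration (the figures here are ours, for intuition only, not exact measurements): about 250 million random RocksDB reads at roughly 1 ms per disk-bound read amounts to 250,000,000 ms ≈ 250,000 s, i.e. roughly 70 hours.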
The Plan: Fast, Scalable, and Reliable State Sync
We’re going to completely rework the State Sync mechanism to make it fast, scalable, and reliable.
The plan comprises two main parts:
- State Sync via S3 - to be code complete and tested by end of June 2023
- BitTorrent-like State Sync - to be code complete by end of December 2023
State Sync via S3
This is a short-term solution that addresses all the issues mentioned above, but at the cost of being centralized.
As soon as an epoch completes, multiple Pagoda-managed nodes will ensure that the State is available in a publicly accessible location in AWS S3 within a few hours.
Nodes requiring the State for specific shards can discover and download it from the AWS S3 location, making State Sync faster and more reliable.
This short-term solution leverages existing infrastructure: Pagoda already provides and manages snapshots of the database for testnet and mainnet using AWS S3, and this is the de-facto standard way to set up new nodes or to “reprovision” nodes to fix issues such as DB corruption. While this approach is centralized, downloading state from the same service seems an acceptable tradeoff until a decentralized solution is implemented. Note also that you’ll be able to make your own dumps of State to S3 or other external storage.
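To make the discovery step concrete, here is a minimal sketch of how a node might construct the location of a state part in external storage. The bucket name, key layout, and function below are illustrative assumptions for this post, not the actual nearcore implementation:

```rust
/// Illustrative sketch only: the bucket name and key layout are our
/// assumptions for this example, not necessarily what nearcore will ship.
fn state_part_key(
    chain_id: &str,    // e.g. "mainnet" or "testnet"
    epoch_height: u64, // the epoch whose state is being synced
    shard_id: u64,     // the shard the node wants to track
    part_id: u64,      // index of this part within the shard's state
    num_parts: u64,    // total number of parts for the shard
) -> String {
    format!(
        "chain_id={chain_id}/epoch_height={epoch_height}/shard_id={shard_id}/state_part_{part_id:06}_of_{num_parts:06}"
    )
}

fn main() {
    // A node that needs shard 3's state for epoch 1234 would fetch keys
    // like this from the publicly readable bucket with a plain HTTP GET:
    let key = state_part_key("mainnet", 1234, 3, 0, 42);
    println!("s3://example-state-dump-bucket/{key}");
}
```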
As part of our efforts to enhance the State Sync mechanism, we are undertaking a rework of several key aspects. These include:
- Format of state parts: By optimizing the format of state parts, we can represent the same data more compactly, significantly reducing the number of bytes that need to be downloaded during State Sync.
- Size of state parts: The current 1 MB state parts carry excessive overhead. By increasing the part size to 30 MB, we can reduce the overall size of state dumps by 20% for mainnet and 68% for testnet. The savings come from deduplicating identical values, such as identical Contracts deployed to multiple accounts, and from reducing the overhead required for validating state parts.
- Creation of state parts: Instead of walking the trie and performing a random lookup for each Node and Value, we plan to leverage Flat Storage with small values inlined. This change is expected to speed up the creation of state parts by an order of magnitude.
- Application of state parts: Rather than waiting for all state parts to become available before applying them sequentially, we will apply each State Part as soon as it becomes available (see the sketch after this list). While this may result in some Nodes and Values being written multiple times, the trade-off ensures a faster and more continuous application of state parts.
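To illustrate the last point, here is a minimal sketch of the streaming approach, with hypothetical types and placeholder logic rather than nearcore code: a download stage hands each part to an apply stage over a channel the moment it arrives.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-ins for real state parts and their application;
// this is a sketch of the idea, not nearcore code.
struct StatePart {
    id: u64,
    bytes: Vec<u8>,
}

fn download_part(id: u64) -> StatePart {
    // Placeholder for fetching a part from a peer or external storage.
    StatePart { id, bytes: vec![0u8; 1024] }
}

fn apply_part(part: &StatePart) {
    // Placeholder for writing the part's Nodes and Values to the local
    // store. Some entries may be written more than once across parts;
    // that is the duplication accepted in exchange for streaming.
    println!("applied part {} ({} bytes)", part.id, part.bytes.len());
}

fn main() {
    let num_parts = 8;
    let (tx, rx) = mpsc::channel::<StatePart>();

    // Download stage: hands each part over as soon as it is fetched,
    // in whatever order parts happen to arrive.
    let downloader = thread::spawn(move || {
        for id in 0..num_parts {
            tx.send(download_part(id)).expect("apply stage hung up");
        }
    });

    // Apply stage: consumes each part immediately instead of waiting for
    // all parts to be present before starting sequential application.
    for part in rx {
        apply_part(&part);
    }
    downloader.join().unwrap();
}
```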
We aim to make this code complete and tested by the end of June 2023, and include it in nearcore version 1.36.0.
BitTorrent-like State Sync
In the long term, we envision a decentralized solution inspired by the BitTorrent protocol, which will enable all nodes to participate in State Sync. Here’s a glimpse of our plans.
Nodes with State will create state parts and make them available for download by their peers. Using information exchange similar to BitTorrent, nodes will facilitate the distribution and retrieval of state parts across the network. This approach eliminates the disproportionate load on validator nodes.
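As a rough sketch of the kind of exchange we have in mind (message and helper names here are hypothetical; the real protocol is still being designed), a peer could advertise the parts it holds with a bitfield, and a syncing node could pick a missing part and a peer to request it from:

```rust
// Hypothetical message: a peer advertises which parts of a shard's state
// it holds, much like BitTorrent's "have"/bitfield messages. In practice
// this would also carry shard and epoch identifiers.
struct Have {
    parts: Vec<bool>, // parts[i] == true means the peer can serve part i
}

/// Pick the next (peer, part) pair to request: the first part we are
/// still missing that some peer advertises.
fn pick_request(local: &[bool], peers: &[Have]) -> Option<(usize, usize)> {
    for (part_id, have_local) in local.iter().enumerate() {
        if *have_local {
            continue; // we already hold this part
        }
        for (peer_id, have) in peers.iter().enumerate() {
            if have.parts.get(part_id).copied().unwrap_or(false) {
                return Some((peer_id, part_id));
            }
        }
    }
    None // nothing left to request
}

fn main() {
    // We hold parts 0 and 3; two peers advertise what they can serve.
    let local = vec![true, false, false, true];
    let peers = vec![
        Have { parts: vec![true, true, false, true] },
        Have { parts: vec![false, false, true, true] },
    ];
    if let Some((peer_id, part_id)) = pick_request(&local, &peers) {
        // A real client would download the part, mark it as held,
        // and repeat until no parts are missing.
        println!("request part {part_id} from peer {peer_id}");
    }
}
```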
We aim to make this mechanism available by the end of 2023.
Extra: Epoch Sync
The Epoch Sync mechanism is mostly code complete. We aim to complete the effort and release this feature by the end of September 2023.
We are very excited about the problem Epoch Sync is going to solve: the ability to start a node with an empty database and have it fully synced in just a few hours.
Instead of downloading every single header in the network’s history, Epoch Sync introduces a light-client sync approach in which only a single header per epoch needs to be downloaded. This can make header sync roughly 1000x faster.
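For intuition (assuming roughly one-second blocks and mainnet’s 43,200-block epochs, both approximations on our part): a node catching up on ten epochs would download 10 headers instead of about 432,000. The number of headers shrinks by a factor of tens of thousands; since other sync work remains, the end-to-end speedup is smaller, hence the ~1000x estimate.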
Conclusion: Collaboration and Feedback
We are dedicated to making State Sync highly functional as soon as possible. By implementing State Sync via S3 as a short-term solution and working towards a BitTorrent-like decentralized solution in the long term, we aim to address the challenges associated with state synchronization.
We appreciate your patience throughout this process and encourage your feedback, suggestions, and concerns regarding our proposed plan and solutions. Together, we can build a robust and scalable protocol.
Thank you for your ongoing support and participation.