Storage Write Optimizations

This document explores an idea that might improve the performance of storage writes. It is an extension that can be built on top of FlatStorage.

The bottleneck of today’s storage write performance is still the read path. Writing to RocksDB is fast, but to update a leaf node in the trie, all nodes on the path to it must be read and updated. There are ideas around moving these reads out of the critical path of block processing, possibly to another thread or by delaying them. While these ideas could help, it is unclear how much of a performance benefit they can bring and how the background process can be charged. Moving the db accesses to a background thread doesn’t make them free; they are still bounded by IO and CPU resources.

How can we improve the performance of writes, more specifically, the read part of writes? One issue with today’s trie storage is that each trie node is stored under its hash, which means the access pattern is essentially random and not efficient. To solve this problem, in FlatStorage, instead of only storing the leaf nodes, we can also store the intermediate trie nodes: the data of the trie node at trie key “key: Vec<u8>” is stored under location “key” in column “DBCol::FlatStateTrieNodes”. In addition, for these intermediate trie nodes we can store the value of the node itself instead of a value ref, because their sizes are bounded.
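A minimal sketch of what this key scheme could look like, using a BTreeMap as a stand-in for the RocksDB column; the helper names and the shard-id prefix are illustrative assumptions, not the actual nearcore API:

/// Hypothetical stand-in for a serialized trie node.
type RawTrieNode = Vec<u8>;

/// Build the db key for the trie node located at `trie_path` (the nibbles from the
/// root to this node). Because the path itself is the key, a leaf and all of its
/// ancestors share key prefixes and end up close together in RocksDB.
fn flat_trie_node_db_key(shard_id: u64, trie_path: &[u8]) -> Vec<u8> {
    let mut key = shard_id.to_le_bytes().to_vec();
    key.extend_from_slice(trie_path);
    key
}

/// Store the full node value (not just a value ref) for an intermediate node;
/// intermediate nodes have bounded size, so inlining them is cheap.
fn put_trie_node(
    column: &mut std::collections::BTreeMap<Vec<u8>, RawTrieNode>,
    shard_id: u64,
    trie_path: &[u8],
    node: RawTrieNode,
) {
    column.insert(flat_trie_node_db_key(shard_id, trie_path), node);
}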

Similar to how the existing FlatStorage design works, this map just stores a snapshot of the trie state at the last final block, so we will need to deal with forks. We can continue to use the FlatStateDelta idea from the current design, but we will need to estimate the size of the new deltas, since each delta may now include all the intermediate trie nodes touched by a block, and that might be too large.
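As a rough sketch of how such an estimate could be made (the struct and the per-entry overhead constant are assumptions for illustration, not the existing FlatStateDelta type):

use std::collections::HashMap;

/// Hypothetical per-block delta, analogous to FlatStateDelta but keyed by trie
/// path and carrying full trie node values; `None` marks a deleted node.
struct FlatTrieNodeDelta {
    changes: HashMap<Vec<u8>, Option<Vec<u8>>>,
}

impl FlatTrieNodeDelta {
    /// Rough in-memory size estimate: key bytes plus value bytes plus an assumed
    /// fixed per-entry overhead. This would decide whether deltas can stay in
    /// memory or need to move into a db column as described below.
    fn estimated_size_bytes(&self) -> usize {
        const PER_ENTRY_OVERHEAD: usize = 48; // assumed hashmap entry overhead
        self.changes
            .iter()
            .map(|(key, node)| key.len() + node.as_ref().map_or(0, |n| n.len()) + PER_ENTRY_OVERHEAD)
            .sum()
    }
}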

If the estimate turns out to be so large that we can’t fit many deltas in memory, another solution is to store the deltas for each key in a db column. More specifically, the db column will store

Row: Vec<u8> (the trie key of the node)
Column: TrieNodeWithChanges

struct TrieNodeWithChanges {
    // The value of this key at the current flat head, which is usually the last final block.
    value_at_head: Option<RawTrieNodeWithSize>,
    // A map from block hash to the value of the trie node after that block,
    // present only if the block makes a change to this trie node.
    changes: HashMap<CryptoHash, Option<RawTrieNodeWithSize>>,
}

To read the value of a trie node at a given block, we simply fetch the corresponding TrieNodeWithChanges. It’s easy to calculate the value of the trie node at the target block from the information stored in this struct.
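For example, assuming the chain can provide the block hashes on the path from the target block back to (but not including) the flat head, the lookup could look like the sketch below; the struct from above is re-declared with stub types so the snippet is self-contained:

use std::collections::HashMap;

type RawTrieNodeWithSize = Vec<u8>; // stub for the real node type
type CryptoHash = [u8; 32];         // stub for the real hash type

struct TrieNodeWithChanges {
    value_at_head: Option<RawTrieNodeWithSize>,
    changes: HashMap<CryptoHash, Option<RawTrieNodeWithSize>>,
}

/// `blocks_from_target_to_head` is ordered from the target block back towards
/// the flat head (exclusive). The first block on that path that changed this
/// node determines its value; otherwise the value at the flat head applies.
fn node_value_at_block(
    entry: &TrieNodeWithChanges,
    blocks_from_target_to_head: &[CryptoHash],
) -> Option<RawTrieNodeWithSize> {
    for block_hash in blocks_from_target_to_head {
        if let Some(value) = entry.changes.get(block_hash) {
            return value.clone();
        }
    }
    entry.value_at_head.clone()
}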

Using this new design, one write to state will still require the same number of db reads to fetch all the trie nodes on the path to the leaf node. But these reads will hit RocksDB in a much nicer pattern since the keys share prefixes. In addition, this allows better cache utilization when the same key, or keys sharing a prefix, are modified in consecutive blocks. Previously, since the key of a trie node was its hash, a write created a new trie node under a different hash and added it to the cache, while the old trie node could stay in the cache, taking up space even though it would never be accessed again.

One implication of this design is that ColState would basically not be accessed (read) at all during normal block processing, which brings two nice side effects:

  • It would help a lot with archival nodes (before cold storage is delivered), as the size of flat storage will be the same on archival and RPC nodes, so block processing performance should be the same.
  • We could experiment with putting the flat storage column in a separate RocksDB instance, which could be mounted on a RAM disk. So if someone has 40GB of RAM, their node would be super fast.

It’s hard to say how much benefit this could bring before the implementation. Theoretically, RocksDB is known to behave much better when keys are accessed in sorted order. This project can also serve as an experiment to see how much data locality could improve storage performance. If the results turn out to be good, we should consider reorganizing trie keys to store data of the same account together.
