Design Doc of near-sandbox

This design has been approved internally and is posted here for community-level discussion.

Background

We have two recent key results for contract developers: 1) easily reproduce a testnet/mainnet transaction locally (or at least without running a full-state node themselves) in order to see its gas breakdown, and 2) dry run contract methods against real state without deploying the contract.

Motivation

Other blockchains provide similar functionality; for example, the Ganache/Truffle suite lets developers do both. With the current NEAR toolkits, getting a gas breakdown locally carries a heavy setup burden: developers have to recompile the code with gas breakdown enabled, sync a node with the full testnet state, and maintain other non-trivial setup scripts.

Proposed Design

For gas breakdown, the consensus is to enable gas profiling data on every node, not only on a special archival node, since it adds only a negligible performance penalty. Progress is tracked in https://github.com/near/nearcore/pull/4185.

For dry runs and for reproducing a contract execution locally, we decided to build a Ganache-like node that is compiled from neard with a special feature, which I call the “sandbox” feature. The exact name isn’t decided, but I use “sandbox” throughout this design.

A neard compiled with the sandbox feature is called near-sandbox in the rest of this doc. It has the following features:

  • Starts a local node from a given genesis. The genesis format is the same as neard’s.
  • Provides all RPCs, the transaction runtime, store, etc. that neard has.
  • Adds RPCs to patch state and fast forward in time.
  • Removes no components from neard. Instead, it is prohibited from joining testnet and mainnet, because patching state is dangerous and would break consensus.
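
The last point can be pictured as a startup guard. The following is only a sketch of the idea in Python; the real check would live in neard’s Rust code, and the names PUBLIC_CHAIN_IDS, SandboxNetworkError and ensure_sandbox_can_start are hypothetical, not part of nearcore.

```python
# Sketch only: every name here is hypothetical, not nearcore's API.
PUBLIC_CHAIN_IDS = frozenset({"mainnet", "testnet"})


class SandboxNetworkError(RuntimeError):
    """Raised when a sandbox build is pointed at a public network."""


def ensure_sandbox_can_start(chain_id: str, sandbox_enabled: bool) -> None:
    """Refuse to start a sandbox-enabled node on mainnet or testnet."""
    if sandbox_enabled and chain_id in PUBLIC_CHAIN_IDS:
        raise SandboxNetworkError(
            f"sandbox node must not join {chain_id}: "
            "patching state would break consensus"
        )


# A sandbox build may start on a local chain,
# and a normal build may still join mainnet.
ensure_sandbox_can_start("localnet", sandbox_enabled=True)
ensure_sandbox_can_start("mainnet", sandbox_enabled=False)
```

The guard keeps every neard component in place and only rejects the dangerous combination, which is why the change to the codebase stays small.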

Justification

The Ganache-like approach has proven to be the most popular one among contract developers. It is hard to dispute that it provides the best developer experience and easy onboarding from other blockchains. So the justification here is to demonstrate that this approach isn’t too complex, or much more work, compared to the alternatives.

First, it doesn’t require much refactoring or reorganization of the nearcore architecture to remove components such as the network layer and chain layer. Instead, it adds a check that prohibits joining mainnet/testnet. This results in a minimal change to the nearcore codebase while still preventing users from accidentally messing up testnet/mainnet with the local-test-only features. The disadvantage is longer compilation time, but in practice this is acceptable because:

  • Most compilation time is spent on nearcore’s dependencies rather than on nearcore itself, so compiling only the contract runtime would not be much faster.
  • Compilation time can be removed entirely with additional DevOps work, such as releasing a binary or Dockerfile and integrating them with the NEAR Truffle equivalent.

Second, we do need additional features such as patch state and time travel. These are more complicated to implement in near-sandbox than in near-sdk-sim because more layers of components are involved, but it is still doable, based on this proof-of-concept draft: feat(rpc): patch contract code/state, account/access key for sandbox node by ailisp · Pull Request #4287 · near/nearcore · GitHub

The major benefit over near-sdk-sim is that near-sandbox provides the same interface as a real node. Developers can write tests with RPC-based primitives (deploy this contract, call this method, expect that result, etc.) and run them on a local near-sandbox and on testnet in a uniform way. They also get a realistic experience and real timing/gas consumption for free. And it’s framework agnostic: developers don’t have to write integration tests with near-sdk-rs.

Alternatives

First, we don’t consider any near-vm-runner-standalone based approach. It acts at a lower level, directly applying a single action based on some ApplyState context, which makes it inconvenient for contract developers to execute cross-contract, multiple-action transactions.

near-sdk-sim based approach

Developers write tests with near-sdk-sim. near-sdk-sim gives a gas breakdown, and to make it align with a real node, contract states, fee configs, and execution context (block height etc.) should be fetched from a real node and provided to near-sdk-sim. We assume all this information is in node-snapshot.json (so the download step is also designed to output such a file). This node-snapshot.json is a serde-json-deserializable GenesisConfig struct defined in the near-sdk-sim module. The main design choice is to use the GenesisConfig struct to carry as much of the context, chain config, fee config, contract state, etc. as possible, in order to reproduce the execution result and gas breakdown seen on testnet. The benefits of this approach are:

  • The current interface of near-sdk-sim takes only a GenesisConfig and mocks the chain behavior of a real node, such as producing blocks. Everything down to nearcore’s runtime.apply is mocked; from runtime.apply onward it is a black box with the same execution logic as a real node. GenesisConfig is the only thing through which you can configure node behavior without an architectural change to near-sdk-sim.
  • It can simulate most cases of logic, and all cases of fees, as on a real node.
  • The main piece of logic it cannot reproduce is block_hash, but that is also difficult in the alternative near-sdk-sim based approaches: near-sdk-sim executes a single transaction, whereas a real node executes multiple transactions in a block, so the content and hash of the block differ.

Some people may think of overriding ApplyState per block in near-sdk-sim for a more fine-tuned reproduction. It doesn’t look necessary to mutate more information than ApplyState to reproduce a transaction, and I cannot think of a valid use case for overriding ApplyState per block. The things that cannot be configured at genesis and are calculated by the runtime are (prev_)block_hash, rand_seed and epoch_id:

  • Users may rely on whether a block_hash exists for some contract, but not on the exact content of the block hash.
  • Users cannot rely on rand_seed being deterministic. A transaction run on a real node is scheduled in an unknown block, and if it is cross-contract there is no guarantee that the next transaction is scheduled one block after the first. It therefore does not make sense to reproduce a function result that depends on rand_seed remaining unchanged.
  • epoch_id is also a hash; contracts rely on whether the epoch exists and on the epoch height, not on the content of epoch_id.

Additional RPC node on special mainnet/testnet approach

Add an RPC and enable it only on one of our hosted RPC nodes:

  • dry_run

The dry_run command takes the same arguments as broadcast_tx_commit, plus optional wasm files that override the contracts called in the transaction, and does not apply state updates to the chain. The response includes the same content as the broadcast_tx_commit response, plus the gas breakdown of the transaction.

Credit for this approach goes mostly to Bowen Wang. It does not strictly fit the requirements, but the resulting effect and developer experience are close.
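
As an illustration of what such a call could look like on the wire, here is a minimal sketch that builds a JSON-RPC request body. dry_run is only a proposal in this doc, so the parameter names signed_tx_base64 and wasm_overrides, and the helper build_dry_run_request, are assumptions made up for the example.

```python
import base64
import json
from typing import Dict, Optional


def build_dry_run_request(
    signed_tx: bytes,
    wasm_overrides: Optional[Dict[str, bytes]] = None,
) -> dict:
    """Build a JSON-RPC body for the proposed (hypothetical) dry_run method.

    signed_tx: serialized signed transaction, as for broadcast_tx_commit.
    wasm_overrides: optional map of contract account id -> wasm bytes that
        should replace the deployed code for this dry run only.
    """
    params = {"signed_tx_base64": base64.b64encode(signed_tx).decode("ascii")}
    if wasm_overrides:
        params["wasm_overrides"] = {
            account_id: base64.b64encode(code).decode("ascii")
            for account_id, code in wasm_overrides.items()
        }
    return {"jsonrpc": "2.0", "id": "dontcare", "method": "dry_run", "params": params}


# Placeholder bytes stand in for a real signed transaction and wasm file.
request = build_dry_run_request(b"signed-tx-bytes", {"app.testnet": b"\x00asm"})
print(json.dumps(request, indent=2))
```

The body could then be posted with curl or any HTTP client to the hosted node that enables the method.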

Effect (functionality requirement)

  • Developers should be able to dry run a contract without deploying it, and without breaking the deployed contract on chain.

For this requirement, the dry_run RPC does not change any state on chain, including the contract itself and the contract’s state.

  • Developers can reproduce a testnet transaction locally (or without running a node with full states by themselves) in order to see the gas breakdown.

Developers can reproduce the transaction with the dry_run RPC. But just to see the gas breakdown they don’t have to reproduce anything: we have agreed to enable the gas profile by default, so they already have this information in the response they get when submitting the transaction.

Developer Experience

The original purpose of the “dry run and local reproduce” goals is that developers shouldn’t have to download many gigabytes of state, recompile nearcore with a special feature, etc. The rationale for local reproduction is to speed up debugging of the gas breakdown, because the user can tweak the contract and rerun to see a new gas breakdown. So although this approach doesn’t actually run the transaction locally, it does solve the problem:

  • Developers only need curl or near-api-js to call the two new RPCs; no full local node is required.
  • Developers can see the gas breakdown of an executed transaction without having to rerun it to reproduce it with extra gas info.
  • The dry run RPC doesn’t change the actual chain’s storage. Although it’s not a local dry run, contract developers don’t care about the difference.

Justifications of not doing alternatives

near-sdk-sim based approach

The major problem with near-sdk-sim is that it doesn’t behave the same way a real node does when scheduling and executing a transaction. A lot of components or layers are either removed or mocked. It also lacks the RPCs of a real node, so developers cannot query transactions, state, etc. from it the way they would when interacting with a real node. These could be added separately, but that is duplicated work, and even if added, the mocking could make them behave differently, so developers would need to learn two sets of tools: real-node testing infra vs. sim testing infra.

It is also framework dependent: the developer has to write Rust integration tests.

Additional RPC node on special mainnet/testnet approach

dry_run isn’t as convenient as downloading the state into a local near-sandbox node and using a “tweak contract code - redeploy - rerun” workflow. It also doesn’t give you special powers such as patching state and fast forwarding in time.

User story of using near ganache/truffle

This section gives an overview of how developers are going to use the near-sandbox node and the NEAR Truffle-equivalent tooling in the real world. Note that the Truffle equivalent is out of scope for this design doc, but near-sandbox alone cannot describe the developer workflow, so the features expected from near-truffle are briefly mentioned.

Write some tests and run them on a real node and on near-sandbox

The test deploys a contract, calls a contract method, and expects some result. This is the most common kind of test users would expect. The user’s workflow will be:

  • Write a test with near-api-js that deploys a contract, calls it, and expects a result. near-api-js can take either a local near-sandbox RPC URL or a mainnet RPC URL.
  • Write some other tests that only work with a local near-sandbox RPC URL, because they use near-sandbox specific RPCs, like patch_state and time_forward. This and the above are like writing Truffle tests in JavaScript.
  • Run the tests with near-cli: near test. This is like truffle test.
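
The split between portable tests and sandbox-only tests in the workflow above can be sketched as follows. This is a Python illustration of the selection logic only, not the proposed near test command. The types RpcTarget and ContractTest, the function runnable_tests, and the test names are all hypothetical, and the two URLs are assumed example endpoints.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RpcTarget:
    name: str
    rpc_url: str
    is_sandbox: bool  # True only for a local near-sandbox node


@dataclass(frozen=True)
class ContractTest:
    name: str
    needs_sandbox: bool  # True if the test calls patch_state / time_forward


def runnable_tests(tests: List[ContractTest], target: RpcTarget) -> List[str]:
    """Return the names of the tests that can run against the given target."""
    return [t.name for t in tests if target.is_sandbox or not t.needs_sandbox]


# Assumed example endpoints; substitute whatever the test config provides.
SANDBOX = RpcTarget("sandbox", "http://localhost:3030", is_sandbox=True)
MAINNET = RpcTarget("mainnet", "https://rpc.mainnet.near.org", is_sandbox=False)

TESTS = [
    ContractTest("deploy_call_expect", needs_sandbox=False),
    ContractTest("patch_state_then_call", needs_sandbox=True),
    ContractTest("time_forward_then_call", needs_sandbox=True),
]

for target in (SANDBOX, MAINNET):
    print(target.name, runnable_tests(TESTS, target))
```

Only the deploy/call/expect test runs against both targets; the tests that rely on sandbox-only RPCs are skipped for a real network.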

See gas breakdown on a real node

This is planned to be enabled on all nodes by Add config option for gas profiling by matklad · Pull Request #4262 · near/nearcore · GitHub. Once enabled, this information is automatically available when the user receives the response of a transaction (or just views it in the explorer).

Reproduction of testnet transaction on a local node

The user downloads state from our hosted special RPC node, which allows state exports of unlimited size. The user then writes a test that patches the local near-sandbox with the downloaded state and calls contract functions in the test.

Note: How is the exported state imported into near-sandbox?

There are two ways of doing this.

  • Aggregate the contract states that you’re about to reproduce into records in the genesis config, and start the node with that genesis file (a local file, quite similar to starting neard). The benefit is that reproduction is straightforward to set up, because we can have a script where the user specifies the account, contract account id, etc., and the tool downloads the state from testnet and aggregates it into a genesis file.

  • Start a node from scratch and call near-sandbox’s special RPC patch_state to import the exported state (also a local file). The benefit is that you can do this on the fly as a step in a test.

My plan is to build the second approach first, since the effect of the first can also be achieved with the second, but not the reverse. We can do the first in the future, since it’s really convenient for developers.
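
The two import paths can be compared side by side in a small sketch. It assumes the exported state is a list of genesis-style records, that the genesis file keeps them under a records key, and that the patch RPC is called sandbox_patch_state with a records parameter. The helper names and the sample record are made up for illustration.

```python
import copy
from typing import List


def records_into_genesis(genesis: dict, exported_records: List[dict]) -> dict:
    """Path 1: return a copy of the genesis config with the records appended."""
    new_genesis = copy.deepcopy(genesis)
    new_genesis.setdefault("records", []).extend(copy.deepcopy(exported_records))
    return new_genesis


def patch_state_request(exported_records: List[dict]) -> dict:
    """Path 2: build a JSON-RPC body that imports the records on the fly."""
    return {
        "jsonrpc": "2.0",
        "id": "dontcare",
        "method": "sandbox_patch_state",  # parameter shape is an assumption
        "params": {"records": copy.deepcopy(exported_records)},
    }


# Placeholder record; the field names follow the genesis layout as assumed here.
exported = [{"Contract": {"account_id": "app.testnet", "code": "<base64 wasm>"}}]

genesis = records_into_genesis({"chain_id": "localnet", "records": []}, exported)
request = patch_state_request(exported)

# Both paths carry exactly the same records into the node.
print(genesis["records"] == request["params"]["records"])
```

Because the same records feed both paths, a genesis-building script can later be layered on top of the patch-based import without changing the export format.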

Dry run, or simulate contract behavior and gas usage on a local node

This is done in the same way as reproducing a testnet transaction.

Implementation Plan

On nearcore (near-sandbox)

proof of concept near-sandbox

Add a sandbox feature to every near crate that behaves differently between a sandbox and a normal node. In the neard crate, detect and exit when a neard with the sandbox feature enabled tries to join testnet/mainnet. In near-jsonrpc, add patch_state and time_travel RPCs that are enabled only with the sandbox feature. These new RPCs call methods, also guarded by the sandbox feature, in near-chain, near-client, near-store, etc., to apply changes to the blockchain state.

patch state RPC

This RPC takes a list of state records to patch. To make it easy for developers, they are in the same format as genesis state records, so developers can paste and tweak them from a genesis file. nearcore has more kinds of state record, but only four of them should be used for patching state: Account, AccessKey, (Contract)Code and (Contract)Data. patch state can only add a new key-value pair or mutate the value of an existing key; there is no delete, because developers are expected to start from a minimal state and add their contract code, contract-specific state, and auxiliary accounts/keys.

In the implementation, the RPC calls a new helper function in Chain, and Chain stores the pending states_to_patch and passes them to ChainUpdate in Chain::process_block_single. ChainUpdate then passes the states_to_patch to RuntimeAdapter and finally to Runtime, which applies them after applying transactions and receipts. In practice, the patch_state RPC is initiated by a JavaScript function call with near-api-js, and there won’t be transactions or receipts to process in the same block, so we don’t need to consider whether transactions or states_to_patch should be processed first in a block.
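
The add-or-mutate semantics can be sketched with a toy key-value state. This is not the nearcore implementation. The record kind names Contract and Data, the field names account_id, public_key and data_key, and the helpers record_key and apply_patch are assumptions modeled on the genesis record layout.

```python
from typing import Dict, List, Tuple

# The four record kinds allowed in a patch (names assumed for this sketch).
ALLOWED_KINDS = ("Account", "AccessKey", "Contract", "Data")

StateKey = Tuple[str, str, str]


def record_key(record: dict) -> StateKey:
    """Derive a unique state key from a single-variant record."""
    if len(record) != 1:
        raise ValueError("a record must have exactly one kind")
    kind, body = next(iter(record.items()))
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"record kind {kind!r} cannot be patched")
    if kind == "AccessKey":
        suffix = body["public_key"]
    elif kind == "Data":
        suffix = body["data_key"]
    else:
        suffix = ""  # one account / one contract code per account id
    return (kind, body["account_id"], suffix)


def apply_patch(state: Dict[StateKey, dict], records: List[dict]) -> None:
    """Insert new keys or overwrite existing ones; nothing is ever deleted."""
    keys = [record_key(r) for r in records]  # validate everything first
    for key, record in zip(keys, records):
        state[key] = next(iter(record.values()))


state: Dict[StateKey, dict] = {}
apply_patch(state, [{"Data": {"account_id": "app.test", "data_key": "U1RBVEU=", "value": "AQ=="}}])
apply_patch(state, [{"Data": {"account_id": "app.test", "data_key": "U1RBVEU=", "value": "Ag=="}}])
print(len(state), state[("Data", "app.test", "U1RBVEU=")]["value"])
```

Validating the whole batch before writing means a patch with one unsupported record leaves the toy state untouched.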
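
The ordering described above, where a patch is queued by the RPC and applied in the next processed block after that block’s transactions, can be modeled with a toy chain. ToyChain and its methods are hypothetical stand-ins for the Chain, ChainUpdate and Runtime hand-off, not nearcore code.

```python
from typing import Dict, List, Tuple

KeyValue = Tuple[str, str]


class ToyChain:
    """Toy model: patches are queued, then applied when a block is processed."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self.pending_patch: List[KeyValue] = []
        self.height = 0

    def patch_state(self, records: List[KeyValue]) -> None:
        """RPC entry point: only queues the records, does not touch state."""
        self.pending_patch.extend(records)

    def process_block(self, transactions: List[KeyValue]) -> None:
        """Apply the block's transactions first, then the pending patch."""
        for key, value in transactions:
            self.state[key] = value
        for key, value in self.pending_patch:
            self.state[key] = value
        self.pending_patch = []
        self.height += 1


chain = ToyChain()
chain.patch_state([("greeting", "patched")])
print(chain.state)  # still empty: the patch waits for the next block
chain.process_block([("greeting", "from_tx")])
print(chain.state)  # the patch was applied after the transaction
```

In this toy model a patch overrides a transaction that writes the same key in the same block, which is the case the paragraph above argues does not come up in practice.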

Time travel RPC

This has been implemented quite comprehensively in this PR, with usages shown here: https://github.com/near/nearcore/pull/3661/files#diff-6625191f7707675724047c058d796999a2b0660da3ce1b6226992506ca40eecb
I think we just need more approvals to merge the PR, and can then reuse its functionality and expose it via a sandbox_time_travel RPC.

On DevOps:

  • Configure mainnet and testnet archival nodes to allow state downloads of unlimited size
  • Release a near-sandbox binary together with neard so contract developers don’t need to compile it themselves

Note that unlimited state size download is a controversial point that not everyone agrees on. The alternative is that users have to run a node locally, join testnet/mainnet, and sync in order to obtain state > 50 kib.

NEAR Truffle Equivalence

As said, near-sandbox alone cannot complete the user story; a Truffle-equivalent component is needed. The actual design and implementation of this part is out of scope for this document. We will also deprecate the existing testing approaches (sim tests & unit tests) in favor of this. Below is one possible way of doing it; it is not meant to be the final way, and its pros/cons have not been carefully examined:

On near-api-js:

  • Add APIs that wrap the patch-state, time-forward, download-state, and download-gas-breakdown RPCs

On near-cli:

  • Add a near test command that runs JavaScript tests written with near-api-js
  • Optionally, it takes a near-test-config.json that indicates which near-api-js config should be used
  • Optionally, add a near-cli start-test-node command which downloads the near-sandbox binary and runs it

Additional notes

A big benefit of writing tests in JavaScript is that people usually write a frontend that calls contract methods with near-api-js, so they can reuse the JS test code in the frontend right away.

I would also suggest building the Truffle-like tool on top of near-api-js and near-cli rather than starting a new tool, because many of the features are already available, just not yet in an easily accessible way.

In the long term, we should also consider adding an equivalent of the truffle migrate command to NEAR; people often struggle to figure out the best practice for contract state migration.

Appendix - historical discussions

First, we will not implement a CLI, because developers prefer to write tests.

Then back to the two goals:

  1. see gas breakdown of a testnet transaction

  2. dry run some contract methods without deploying the contract

Bowen Wang points out that seeing the gas breakdown and dry running the contract on real testnet doesn’t have to happen in near-sdk-sim. It should happen on a specialized RPC node, because we’ll need a specialized RPC node anyway. And it’s more difficult and less accurate to reproduce the transaction in a simulator than to let it execute on the real node and expose the gas breakdown, or start a dry run, via RPC.

Aleksey Kladov points out the major reason that contract-runtime was considering reproducing the execution with a local near-sdk-sim instead of remotely on an RPC node (or requiring developers to run a testnet node): to allow developers to dry run a contract locally without deploying it to a real node, so they can iterate quickly while debugging. In the RPC approach this is done by a dry_run RPC which takes {contract_code, params}.

My proposal is:

  1. Use a special RPC node that exposes the gas breakdown for executed transactions. Do not add an RPC to dry run transactions on a real node.
  2. Download “node_snapshot.json”, described in the near-sdk-sim based approach section, and use near-sdk-sim to load the snapshot for a good-enough re-run of an existing transaction. near-sdk-sim already has good dry run infrastructure; it just needs to load state from testnet.

For the known limited cases of not reproducing the exact result that concerned Bowen Wang, such as block hash as user doesn’t no[...] The goal of reproducing is for developers to debug and understand the bottleneck in gas spent, so they can make a small change to avoid the known imperfect repro while still reflecting the gas usage, and debug on that.

Max Zavershynskyi proposes the following:
Completely abandon the idea of simulating nearcore runtime in near-sdk-sim. Instead commit fully to having a Ganache-like version of neard that exposes controls like: fast forward time, patch state of the contract. Our contract test suite then will be running tests by executing them against RPC (whether it is a local node or a remote Mainnet node), copying Truffle DevX.

The benefits from it:

  • Contract testsuite provides DevX familiar to existing blockchain developers working with Ganache/Truffle;
  • Testsuite is entirely decoupled from client implementation. The same testsuite can be used with Rust client implementation or any future alternative implementation;
  • We don’t need to maintain simulation code outside the nearcore;

The disadvantages:

  • We need to make compilation of the local node really fast (e.g. by making it compile in non-optimizing mode) or otherwise contract developers will complain about deteriorated DevX;
  • Bo: the large state needed to run a local node is a bigger problem that developers will complain about: they’d download 100G of state to sync a mainnet node. The patch-state functionality for contracts may address this.

From Josh Quintal:

I agree with the approach of abandoning near-sdk-sim in favor of focusing on a Ganache-like neard. Users will tolerate a long install time if the functionality afterward is fast. We can iterate on that. Installing Ganache-like neard globally would limit the long installation process to initial install and updates.

I’m not sure of the performance differences between neard and near-sdk-sim around testing, but as long as there’s reasonable parity we can iterate on test speed as well.

Additional functionality:

  • Configurable block times
    • Instant blocks in Ganache were a great benefit for new developers and essential for running tests. Usually more experienced devs would assign a short block time, especially for UI/UX testing.
  • Import state
    • I’m not sure if this is easier with NEAR’s architecture so I’m not recommending anything here, but for reference: Ganache handles this via its “forking feature”. The user supplies an endpoint URL, then Ganache monitors requests, querying the live network for missing data. For example a tx against a mainnet contract would first duplicate the contract and its state in Ganache and run the tx against that.
  • Configuring consensus parameters

Appendix - additional features I found helpful after trying the initial implementation

These features will not appear in the initial near-sandbox; they were identified after we finalized its initial design. If they are confirmed by the wider community, they will be implemented in future versions.

  • A sync version of patch_state. Currently, patch_state is applied in the next block, but the sandbox_patch_state RPC is asynchronous and doesn’t wait for the state to be applied.
  • sandbox_snapshot and sandbox_revert RPCs, which take a snapshot of and revert the blockchain state on the fly. The benefit is saving test time by using these methods in beforeTest setups. The alternative is to shut down the node, dump state, and restart between tests, which is both harder to automate and slower.
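
The intended effect of snapshot and revert in a test setup can be sketched with an in-memory model. Neither RPC exists yet, so ToySandbox and its snapshot and revert methods are hypothetical and only mimic the behavior described above.

```python
import copy
from typing import Dict, List


class ToySandbox:
    """In-memory stand-in for a node that supports snapshot and revert."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}
        self._snapshots: List[Dict[str, str]] = []

    def snapshot(self) -> int:
        """Save a copy of the current state and return its id."""
        self._snapshots.append(copy.deepcopy(self.state))
        return len(self._snapshots) - 1

    def revert(self, snapshot_id: int) -> None:
        """Restore the state saved under snapshot_id."""
        self.state = copy.deepcopy(self._snapshots[snapshot_id])


node = ToySandbox()
node.state["counter"] = "0"      # expensive shared setup, done once
baseline = node.snapshot()

for new_value in ("1", "2"):     # each "test" mutates state, then reverts
    node.state["counter"] = new_value
    node.revert(baseline)

print(node.state)
```

Each test starts from the same baseline without restarting the node, which is where the time saving comes from.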

I like this idea. I think ultimately we’ll need a NEAR light client capable of doing this not just for debugging purposes.