Disclaimer. I cannot claim to be entirely impartial in this discussion, since I lean towards the “Wasm suspension” and “Wasm after Wasm” solutions while disliking both of them, but I tried to write this post objectively.
Background
It is fairly clear that contract development with synchronous/blocking cross-contract calls is easier than with asynchronous calls. The reasons are the same as in regular non-blockchain development: one does not need to think about non-atomicity, synchronization primitives, callbacks, deadlocks, race conditions, etc. Unfortunately, just as in non-blockchain development, the path to scalable applications lies through merging concurrency with the business logic. The NEAR blockchain is scalable by design, so its contract development experience has embraced inevitable asynchrony from the first days. However, even though asynchronous smart contract development is likely to be widely adopted in the future, forcing it on all NEAR developers creates an adoption barrier that inhibits our mission of making DevX as smooth as possible. Specifically, async contracts are uncommon in most of the blockchain world, which makes existing contract development patterns non-transferable to NEAR. Additionally, even in the web2 world, typical developers do not need to understand most of the intricacies of concurrency. It is clear that NEAR has to allow contract development without intricate knowledge of concurrency.
Developer Needs
Concurrency goes against several developer needs. We need to separate them since the proposed solutions might focus on some of the needs more than on the others.
Transaction Atomicity
Developers want to know that when their code interacts with multiple contracts in a single transaction it is always “all or nothing”, meaning if one contract fails then so should all others, including those that have already been executed. There are actually two reasons to care about atomicity:
- Security – asset operations should only be performed when “everything goes well”. Taking care of asset recovery when a transaction fails midway introduces risks, because it is more error-prone and adds potential abuse angles. The same goes for operations on any critical data, not just assets;
- Business logic complexity. For async contracts, assuming that any contract can fail, the number of cases that a developer needs to deal with is worst-case exponential in the number of contracts involved, while for sync contracts this number is always one. Savvy non-blockchain developers have learnt to drastically reduce the number of cases to linear by using concurrency architectural patterns like lock hierarchies. Unfortunately, these methods are very complex and the number of cases is still higher than one.
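To make the “all or nothing” requirement concrete, here is a minimal Rust sketch. It is not NEAR code: Ledger, transfer, run_atomically, and sample_ledger are hypothetical names, and a single in-memory map stands in for the state of several contracts. The transaction takes a snapshot, applies its steps in order, and restores the snapshot if any step fails, so the steps that already executed are undone as well.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the combined state of several contracts.
#[derive(Clone, Debug, PartialEq)]
struct Ledger {
    balances: HashMap<String, u64>,
}

impl Ledger {
    /// Moves `amount` from `from` to `to`; fails on insufficient funds.
    fn transfer(&mut self, from: &str, to: &str, amount: u64) -> Result<(), String> {
        let available = self.balances.get(from).copied().unwrap_or(0);
        if available < amount {
            return Err(format!("{from} has only {available}"));
        }
        self.balances.insert(from.to_string(), available - amount);
        *self.balances.entry(to.to_string()).or_insert(0) += amount;
        Ok(())
    }
}

/// Runs every step against the ledger. If any step fails, all earlier
/// steps are undone by restoring the snapshot taken at the start.
fn run_atomically(ledger: &mut Ledger, steps: &[(&str, &str, u64)]) -> Result<(), String> {
    let snapshot = ledger.clone();
    for (from, to, amount) in steps {
        if let Err(e) = ledger.transfer(from, to, *amount) {
            *ledger = snapshot;
            return Err(e);
        }
    }
    Ok(())
}

/// Example starting state: only "alice" holds funds.
fn sample_ledger() -> Ledger {
    let mut balances = HashMap::new();
    balances.insert("alice".to_string(), 100);
    Ledger { balances }
}
```

With a sync model the developer reasons about exactly this one outcome per transaction. With async calls, each step could succeed or fail independently, and the recovery logic that the snapshot provides here would have to be written by hand in callbacks.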
Code Simplicity
Code rapidly gets more complex when a developer needs to deal with callbacks, channels, locks, actors, etc. It is a blessing that in many languages a developer can just add async/await keywords to their synchronous code and magically make it scalable. Unfortunately, it is not a universal solution:
- It does not provide atomicity just by itself;
- It does not help with resource re-entrance, e.g. we would still sometimes need to lock the balance before the transfer;
- Mindless usage can lead to very unscalable applications, e.g. having overly aggressive resource guards/locks;
- Sometimes a developer is forced to understand the internals of async/await, which are quite complex.
Therefore, concurrency leads to code complexity, except for the cases when mindless async/await can be used.
Interoperability
Synchronous/blocking contracts are highly interoperable. Most of the time you only need to know the interface of another contract to interop with it. With async contracts, you also need to know the state of another contract, and even understand its internal state machine. Also, the interface can be statically enforced which prevents incorrectly constructed interop from being deployed, while the state machine interface cannot.
General Direction
A large number of web2 developers do not use explicit concurrency in their code today, except for decorating some of the code with async/await keywords. There is also a smaller group of developers with a deep understanding of concurrency who take care of the projects that require it. NEAR developers, on the other hand, are almost universally required to work with concurrency via callbacks and state machines. The NEAR developer experience needs to match the distribution of skills among web2 developers for seamless developer adoption. Therefore, NEAR needs one or more solutions that allow a large number of developers to build on NEAR without working with concurrency, while still reserving tools for concurrency experts to build highly scalable applications.
Proposed Solutions
Fortunately, there are multiple ways of implementing a synchronous experience in an asynchronous environment. We split the proposed solutions into two categories:
- Emulation of a sync environment on top of the async environment;
- Extensions of the async environment with sync functionality;
Emulations
SDK lock
When a contract has the #[lockable] decorator, the SDK would automatically create a lock for the contract that prevents it from being interacted with until the callback to this contract is resolved. All cross-contract calls would be forced to have callbacks, and if there are multiple cross-contract calls, their callbacks would be merged into one using the promise_and host function. The contract will fail if it is in the “locked” state and is called by anything other than the callback. If any of the callbacks indicates contract failure, the state is reverted in the callback. The problem of callbacks failing because they run out of gas is solved by always writing state into temporary storage first, which gets committed on the callback or abandoned when the callback times out.
Also, instead of locking the entire state of the contract, the lock might cover only a subset of keys predefined by the developer, e.g. #[lockable(key1, key2, ...)]. Carefully implemented contracts might be able to run in parallel if they store user data under non-overlapping keys.
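The lock-and-stage behaviour described above can be sketched in plain Rust. This is only an illustration under stated assumptions: LockableContract, CallError, call, and on_callback are hypothetical names and not part of any existing SDK, the lock covers the whole contract rather than a key subset, and callback timeouts are not modelled. Writes go into a staging area while the lock is held and only reach committed state if the merged callback reports success.

```rust
use std::collections::HashMap;

/// Hypothetical model of a contract guarded by the proposed SDK lock.
#[derive(Default)]
struct LockableContract {
    /// State visible to everyone once a callback has committed it.
    committed: HashMap<String, u64>,
    /// Writes made while a cross-contract call is in flight.
    /// `Some` doubles as the "locked" flag.
    staged: Option<HashMap<String, u64>>,
}

#[derive(Debug, PartialEq)]
enum CallError {
    /// A regular call arrived while a callback was still pending.
    Locked,
    /// A callback arrived although no call was in flight.
    NotLocked,
}

impl LockableContract {
    /// Entry point for ordinary calls: stages `writes` and takes the lock.
    fn call(&mut self, writes: &[(&str, u64)]) -> Result<(), CallError> {
        if self.staged.is_some() {
            return Err(CallError::Locked);
        }
        let staged = writes.iter().map(|&(k, v)| (k.to_string(), v)).collect();
        self.staged = Some(staged);
        Ok(())
    }

    /// Entry point for the merged callback: commits the staged writes if
    /// every cross-contract call succeeded, otherwise drops them.
    /// Either way the lock is released.
    fn on_callback(&mut self, all_succeeded: bool) -> Result<(), CallError> {
        let staged = self.staged.take().ok_or(CallError::NotLocked)?;
        if all_succeeded {
            self.committed.extend(staged);
        }
        Ok(())
    }
}
```

A key-subset variant would keep one such staging area per locked key group instead of one per contract, which is what would let calls on non-overlapping keys proceed in parallel.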
Pros:
- Partially addresses atomicity need. In the chain of contract calls A->B->C->D, A and B are reverted if C or D fail, while C and D are not reverted if A or B fail on callbacks;
- Does not require protocol changes;
- Can be deprecated from the SDK while old dApps continue to work, which means we can iterate on its design as much as we want;
- Works for all contracts, including those on different shards;
Cons:
- Does not address simplicity and interoperability needs;
- It is a gray box for developers – they still need to understand at a surface level what is happening;
- Off-chain code needs to differentiate transactions that fail because of the lock;
- During congestion lots of transactions will consume gas, but will time out on callbacks;
Wasm suspension
We add a host function that exits the contract, records its Wasm memory and state diff into temporary storage, and waits for a certain callback to be resolved, upon which it resumes Wasm execution. On the SDK side, developers use simple blocking interfaces to another contract, which behind the scenes perform suspension and promises/callbacks. When a contract interacts with state key-values, it needs to grab either a read or a write reference, which gets dropped on the callback. If another transaction tries to grab a write reference to a key that has a live read or write reference, it will fail.
To partially address the situation when a transaction arrives at the wrong time and tries to grab a write reference while a suspended contract holds a reference to the same key, instead of failing this transaction immediately we have two options:
- Put the receipt back into the delayed receipt queue, but deduct the burned gas;
- Require such contracts to declare the keys that they are going to touch, so that we don’t schedule them for execution while a competing reference is held by a suspended contract. This could potentially create livelocks, though.
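The reference rule for suspended contracts can be sketched as a small bookkeeping table. RefTable, Conflict, and the method names are hypothetical and chosen only for illustration. The post specifies that a write reference fails when the key has a live read or write reference; the sketch additionally assumes that a read reference fails while a write reference is live (the usual reader-writer rule), which the post does not state.

```rust
use std::collections::{HashMap, HashSet};

/// Returned when a reference cannot be granted right now.
#[derive(Debug, PartialEq)]
struct Conflict;

/// Hypothetical table of live references held by suspended contracts.
#[derive(Default)]
struct RefTable {
    /// Number of live read references per key (entries are always >= 1).
    readers: HashMap<String, usize>,
    /// Keys that currently have a live write reference.
    writers: HashSet<String>,
}

impl RefTable {
    /// Assumption: reads are refused while a write reference is live.
    fn acquire_read(&mut self, key: &str) -> Result<(), Conflict> {
        if self.writers.contains(key) {
            return Err(Conflict);
        }
        *self.readers.entry(key.to_string()).or_insert(0) += 1;
        Ok(())
    }

    /// From the post: a write fails if any read or write reference is live.
    fn acquire_write(&mut self, key: &str) -> Result<(), Conflict> {
        if self.writers.contains(key) || self.readers.contains_key(key) {
            return Err(Conflict);
        }
        self.writers.insert(key.to_string());
        Ok(())
    }

    /// Called when the callback resolves and a read reference is dropped.
    fn release_read(&mut self, key: &str) {
        let remaining = match self.readers.get_mut(key) {
            Some(count) => {
                *count -= 1;
                *count
            }
            None => return,
        };
        if remaining == 0 {
            self.readers.remove(key);
        }
    }

    /// Called when the callback resolves and the write reference is dropped.
    fn release_write(&mut self, key: &str) {
        self.writers.remove(key);
    }
}
```

Both options listed above change only what happens on a Conflict: the receipt is either re-queued with burned gas deducted, or it is never scheduled while a conflicting reference is live.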
Pros:
- Addresses atomicity, simplicity, and interoperability needs, since it will look exactly like regular sync function calls;
- Works for all contracts, including those on different shards;
- Only changes the Contract Runtime; the Transaction Runtime is unchanged, except for taking care of the state diff;
- Productionizing Wasm suspension tech might be overall beneficial for improving quality and understanding of our Wasm VM;
Cons:
- Requires production-quality Wasm suspension tech;
- Modifies Contract Runtime;
- Will cost >2x more gas to write into the state and suspend Wasm execution;
- Off-chain code needs to differentiate transactions that fail because they attempted to grab a reference at the wrong time;
Notes
Because neither approach uses a scheduler, both force us to make an unfortunate choice:
- Either some transactions will fail because they arrive at the wrong time;
- Or the design will allow some form of livelock.
Extensions
Wasm in Wasm
The idea, suggested by @illia, is to add a host function that allows calling another contract without terminating the current contract, in a blocking way. This, however, requires contracts to be guaranteed to be on the same shard. A potential solution is to introduce an enclave that can host multiple contracts with different states and that cannot be split by resharding. An additional feature of such an enclave is an access control system for the state and contract code, which allows the code and state of one contract to be accessed by another contract.
We can either preserve the existing cross-contract API and require people to still use promises/callbacks in the SDK, or add a new host function that allows a classic blocking call.
Pros:
- Atomicity and Interoperability;
- Simplicity – depending on cross-contract API;
- No new Wasm tech;
Cons:
- Might create complex resource interference and removes some optimizations. During transaction execution we would need to have multiple Wasmer modules running at the same time. Each Wasm module requires resource allocation, which makes it difficult to reason about whether that would force shards to compete for resources. Also, we won’t be able to do performance optimizations like module pre-loading or reusing one Wasm memory for all contracts, since we won’t know ahead of time which Wasm modules need to be loaded;
- Breaks the “no collocation incentives” invariant, see below;
- Requires defining a new type of collocated state with an access control system, which will increase the complexity of the runtime;
- Does not work for all contracts;
Wasm after Wasm
Similarly to “Wasm in Wasm”, we could introduce an enclave with a new type of state and an access control system. However, in this solution we allow contracts to schedule a contract call in the regular way but indicate that it needs to be executed immediately after the current contract, in the same block. This will fail for contract calls outside the enclave. However, it can be extended to work with all types of receipts and not just contract calls. The benefit is that we still execute one contract at a time, and from the perspective of systems like Explorer/DevConsole there will be no modification to the receipts.
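A rough way to picture the scheduling rule is a receipt queue in which an “execute immediately after me” request jumps to the front, provided the receiver belongs to the same enclave. EnclaveQueue, ScheduleError, and the method names below are hypothetical, and receipts are reduced to the receiver's account name. This is a sketch of the idea only, not of how the NEAR runtime orders receipts.

```rust
use std::collections::{HashSet, VecDeque};

#[derive(Debug, PartialEq)]
enum ScheduleError {
    /// The receiver is not part of the caller's enclave.
    OutsideEnclave,
}

/// Hypothetical per-enclave receipt queue; each entry is a receiver name.
struct EnclaveQueue {
    members: HashSet<String>,
    pending: VecDeque<String>,
}

impl EnclaveQueue {
    fn new(members: &[&str]) -> Self {
        EnclaveQueue {
            members: members.iter().map(|m| m.to_string()).collect(),
            pending: VecDeque::new(),
        }
    }

    /// Regular scheduling: the call waits behind everything already queued.
    fn schedule(&mut self, receiver: &str) {
        self.pending.push_back(receiver.to_string());
    }

    /// "Wasm after Wasm" scheduling: the call runs right after the current
    /// contract, but only if the receiver lives in the same enclave.
    fn schedule_immediately_after(&mut self, receiver: &str) -> Result<(), ScheduleError> {
        if !self.members.contains(receiver) {
            return Err(ScheduleError::OutsideEnclave);
        }
        self.pending.push_front(receiver.to_string());
        Ok(())
    }

    /// Picks the next contract to execute; still one contract at a time.
    fn pop_next(&mut self) -> Option<String> {
        self.pending.pop_front()
    }
}
```

Note that the receipts themselves are unchanged in this picture; only their position in the queue differs, which matches the point about Explorer/DevConsole above.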
Pros:
- Atomicity;
- No new Wasm tech;
Cons:
- Does not fix contract development simplicity or interoperability;
- Requires defining a new type of collocated state with an access control system, which will increase the complexity of the runtime;
- Breaks the “no collocation incentives” invariant, see below;
- Does not work for all contracts;
Intra-Shard Scheduler
We could allow one or all shards to run multiple runtimes concurrently, which would allow executing multiple cross-contract calls within the same block, as long as they fit into the attached gas limit. Unfortunately, this requires developing a complex scheduling mechanism and requiring transactions to be annotated with the key-values that they will touch during execution. Since contracts would need guaranteed collocation on the same shard, we would also need to implement some form of enclave or pinning, similar to Wasm in/after Wasm.
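The post does not specify a scheduling algorithm, so the following is only one possible illustration of why key annotations are needed: a greedy pass that groups transactions into batches whose declared key sets do not overlap, so that each batch could in principle run on concurrent runtimes. AnnotatedTx, tx, and schedule_batches are hypothetical names, and gas limits are ignored.

```rust
use std::collections::HashSet;

/// Hypothetical transaction annotated with the state keys it will touch.
struct AnnotatedTx {
    id: u32,
    keys: HashSet<String>,
}

/// Convenience constructor for the example.
fn tx(id: u32, keys: &[&str]) -> AnnotatedTx {
    AnnotatedTx {
        id,
        keys: keys.iter().map(|k| k.to_string()).collect(),
    }
}

/// Greedily places each transaction into the first batch whose keys are
/// disjoint from its own; transactions inside one batch never conflict.
fn schedule_batches(txs: &[AnnotatedTx]) -> Vec<Vec<u32>> {
    // Each batch tracks the union of keys used so far and the member ids.
    let mut batches: Vec<(HashSet<String>, Vec<u32>)> = Vec::new();
    for t in txs {
        match batches.iter_mut().find(|batch| batch.0.is_disjoint(&t.keys)) {
            Some((used, ids)) => {
                used.extend(t.keys.iter().cloned());
                ids.push(t.id);
            }
            None => batches.push((t.keys.clone(), vec![t.id])),
        }
    }
    batches.into_iter().map(|(_, ids)| ids).collect()
}
```

Without the annotations the scheduler could not know the key sets before execution, which is why this solution asks transactions to declare them up front.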
Pros:
- Atomicity;
- Increases single chunk capacity;
- No new Wasm tech;
Cons:
- Requires scheduler and transaction annotations;
- Does not fix contract development simplicity or interoperability;
- Increases hardware requirements;
- Requires defining a new type of collocated state with an access control system, which will increase the complexity of the runtime;
- Breaks the “no collocation incentives” invariant, see below;
- Does not work for all contracts;
Notes
Runtime Invariant
The current NEAR runtime was designed with an important invariant in mind – there is no financial or other incentive to collocate with some smart contract on a shard. The way it processes transactions and receipts explicitly enforces this invariant. This allows two things:
- Future proof. We can indefinitely iterate on protocol designs without being afraid of breaking some contracts apart. This allows aggressive strategies around resharding and shard size reductions which can lead to performance optimizations and/or reductions in hardware requirements;
- No perverse incentives. We decided to completely eliminate the need to think of scenarios where projects would fight to be collocated with some popular project, like an exchange. Game-theoretic scenarios might get quite complex and we don’t want to deal with this extra complexity.
The Wasm in Wasm, Wasm after Wasm, and Intra-Shard Scheduler solutions all require contracts to be collocated, which breaks this invariant.
Pinning and Enclave
A potentially more lightweight alternative to an enclave with an access control system is allowing contracts to “pin” themselves to another contract upon deployment. Implementation-wise it is easier to do, since we only need to position this contract under the state trie of another contract. However, it would still require modifications to the Runtime code and it would still break the invariant. On the positive side, we can later roll out the access control system incrementally on top of it.
Comparison
Qualities | SDK lock | Wasm suspension | Wasm in Wasm | Wasm after Wasm | Intra-Shard Scheduler |
---|---|---|---|---|---|
Implementation Complexity | High | Medium | High | Medium | Very High |
Atomicity | ![]() | ![]() | ![]() | ![]() | ![]() |
Code Simplicity | ![]() | ![]() | ![]() / ![]() | ![]() | ![]() |
Interoperability | ![]() | ![]() | ![]() / ![]() | ![]() | ![]() |
No Transaction Runtime Modifications | ![]() | ![]() | ![]() | ![]() | ![]() |
No Contract Runtime Modifications | ![]() | ![]() | ![]() | ![]() | ![]() |
No New Transaction Failures | ![]() | ![]() | ![]() | ![]() | ![]() |
Works for All Contracts | ![]() | ![]() | ![]() | ![]() | ![]() |
No Contract Code Modifications | ![]() | ![]() | ![]() / ![]() | ![]() | ![]() |
Preserves Runtime Invariant | ![]() | ![]() | ![]() | ![]() | ![]() |
Increases Shard Capacity | ![]() | ![]() | ![]() | ![]() | ![]() |
Does Not Increase Hardware Requirements | ![]() | ![]() | ![]() | ![]() | ![]() |
Legend:
- Asterisk means “almost”.
- ![]() / ![]() – depends on whether we keep the Promise API or have a blocking API for the Wasm in Wasm solution;
Call for Action
Before we select a solution (or solutions), turn this into a NEP, and enter the quality control process, please contribute your thoughts and ideas in any form in this thread.