[PROPOSAL] Limited replayability / Limited past protocol version support in neard

Summary

Under this proposal tools and programs implementing the NEAR specification(s) will no longer commit to perpetual support for all past revisions of the protocol. Users wishing to interact with the historical blocks and other entities pertaining to the older protocol versions should use the corresponding versions of the tools and programs.

Motivation

Today, any change to neard is required to preserve the exact behavior seen in all past versions of the program. Among other things, this is necessary to enable replay of the blockchain all the way back to genesis, as implemented by the various neard view-state sub-commands.

Some of the prerequisite infrastructure to reconcile the operational needs and the desire to modify the protocol already exists in the form of protocol versioning. Practice has nevertheless demonstrated that protocol versioning incurs a significant hit to development velocity. Every time a change is made, engineers need to carefully consider the interaction between the change and protocol-visible behaviour. In cases where the change is intentionally a breaking change to the protocol, care needs to be taken to also maintain the former behaviour for the previous version of the protocol. In extreme cases the only feasible way to maintain compatibility is to duplicate the affected subsystem in full. The first time this happens, logic to switch between different versions of the functionality must be implemented as well. Such logic further impedes future evolution of the implementation.

We are not able to consistently verify whether our efforts to maintain compatible behaviour are implemented correctly in practice. Verifying that protocol versioning is implemented correctly by replaying even portions of the history is an extremely time-consuming process (taking weeks to months per experiment) and requires significant effort to set up. This makes verification of code changes quite onerous, to the point where there has only been one attempt, made on behalf of the Contract Runtime team back in 2021.

It would be ideal if we could make the compatibility requirements satisfiable by construction and remove the burden this functionality imposes on the development process. This proposal suggests an approach that does so at the expense of a relatively negligible increase in operational complexity and some additional coordination.

Proposal

After the implementation of this proposal a release of neard would support either:

  • the latest protocol version only; or
  • the latest protocol version and one preceding protocol version only.

Supporting the latest protocol version only achieves correct support for the corresponding protocol version by construction. By using a binary dedicated for each version, we have a much stronger guarantee that exactly the same code is used to handle entities at a specified protocol version. Little effort or consideration is necessary in how new developments and improvements interact with the project’s legacy. On the other hand, this is perhaps quite at odds with the protocol upgrade roll-out negotiation that occurs as part of the routine release process. With every release of neard the deployment process that follows is largely a two-step affair. Today, operators first deploy a newer version of neard on their systems. At some later time, a vote occurs at a protocol level to switch to a newer protocol version. This requires some coordination between all involved parties, but the truly synchronous portion of the process – the protocol upgrade vote – occurs automatically.

Toning down the proposal a little, supporting the latest protocol version and one preceding protocol version with each release of neard is a nice intermediate step. This way we can still benefit from development velocity and correctness improvements, albeit partially, but operationally no observable differences to the validators would be expected, and the implementation doesn’t need to deal with any changes related to protocol version upgrades.
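The two options above can be sketched as a simple version-window check. This is only an illustration: the names `SupportPolicy`, `supported_versions` and `can_process` are hypothetical, and the sketch assumes protocol versions are consecutive integers, so that "one preceding protocol version" means `latest - 1`.

```python
from enum import Enum


class SupportPolicy(Enum):
    """Hypothetical names for the two options discussed above."""
    LATEST_ONLY = 1          # the latest protocol version only
    LATEST_AND_PREVIOUS = 2  # the latest and one preceding version


def supported_versions(latest: int, policy: SupportPolicy) -> range:
    """Protocol versions a release built for `latest` would handle.

    Assumes consecutive integer protocol versions.
    """
    oldest = latest - (policy.value - 1)
    return range(oldest, latest + 1)


def can_process(block_version: int, latest: int, policy: SupportPolicy) -> bool:
    """True if a block at `block_version` falls inside the release's window."""
    return block_version in supported_versions(latest, policy)
```

Under the second option a release targeting protocol version 56 would accept blocks at versions 55 and 56 and reject anything older, which would have to be handled by an older release.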

Implementation of latest-protocol-version-only scheme

NOTE: Ideas expressed in this section haven’t been verified to be implementable. If you see any significant issues with this proposal from the standpoint of either an operator or an implementer, please leave a comment.

One option for implementing the latest-protocol-version-only scheme would be to have neard cleanly terminate or exec a new binary whenever it is requested to switch to a more recent protocol version. In such a scheme, neard startup would consult a configuration file, or a similar convention, to figure out the binary for the required protocol version:

56: "/opt/bin/neard-1.29"
55: "/opt/bin/neard-1.28"
...
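A minimal sketch of how such a lookup could work, assuming the mapping is stored as `version: "path"` lines exactly as in the example above. The file format, the function names and the hand-off via `os.execv` are all assumptions made for illustration; none of this exists in neard today.

```python
import os


def parse_binary_map(text: str) -> dict:
    """Parse lines of the form `56: "/opt/bin/neard-1.29"` into {56: path}."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "...":  # skip blanks and the elision marker
            continue
        version, _, path = line.partition(":")
        mapping[int(version)] = path.strip().strip('"')
    return mapping


def binary_for(mapping: dict, protocol_version: int) -> str:
    """Return the configured binary, failing loudly if none is configured."""
    try:
        return mapping[protocol_version]
    except KeyError:
        raise LookupError(
            f"no neard binary configured for protocol version {protocol_version}"
        ) from None


def switch_to(mapping: dict, protocol_version: int, args: list) -> None:
    """Replace the current process with the binary for `protocol_version`.

    os.execv does not return on success, so any state to be kept
    must be persisted or handed over before calling this.
    """
    path = binary_for(mapping, protocol_version)
    os.execv(path, [path, *args])


EXAMPLE = '56: "/opt/bin/neard-1.29"\n55: "/opt/bin/neard-1.28"\n'
```

Failing with an explicit error when a version has no entry matters here: a missing binary would otherwise only be discovered at protocol upgrade time.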

From the operational standpoint this is perhaps even nicer than the setup we have today: deploying a new release amounts to dropping a binary into a predetermined location and appending an entry to a configuration file. That said, it is important to acknowledge that there are new risks as well. It is much harder to confirm that the new binary is able to run at all, for example. The newly deployed binary will be invoked for the first time at protocol upgrade time, which leaves a much smaller window to resolve any mistakes.

If implemented naively, this approach will have some unpalatable consequences for the pace at which blocks are processed during the protocol upgrade, because spinning up a new neard instance takes a significant amount of time. These are all solvable with additional implementation effort: peer connections can be inherited, the process for the new version can be launched some time ahead of the switch so it has time to warm up, etc.
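To make the "launch ahead of the switch" idea concrete, here is a rough sketch of how a node might decide when to start the next binary. Everything in it is hypothetical: the function names, the assumption that the switch height is known in advance, and the assumption that warm-up time and block time can be estimated.

```python
import math


def launch_height(switch_height: int, warmup_seconds: float,
                  block_seconds: float, margin_blocks: int = 0) -> int:
    """Block height at which the new binary should be started.

    The warm-up estimate is converted into a number of blocks (rounded up)
    and subtracted from the height at which the new version takes over.
    """
    warmup_blocks = math.ceil(warmup_seconds / block_seconds)
    return max(0, switch_height - warmup_blocks - margin_blocks)


def should_launch(current_height: int, switch_height: int,
                  warmup_seconds: float, block_seconds: float) -> bool:
    """True once the chain has reached the computed launch height."""
    return current_height >= launch_height(
        switch_height, warmup_seconds, block_seconds
    )
```

The `margin_blocks` parameter is there because warm-up estimates will be imprecise; starting a few blocks early costs some resources, while starting late stalls block processing.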

Security Implications

This change should overall improve the security and reliability of the implementation in the long run. With developers being able to make changes more confidently, without concerns for backwards compatibility, the overall health of the code base will improve. As a direct consequence of this change, fixes to vulnerabilities can be implemented and rolled out more quickly.

Drawbacks

The major drawback of this proposal is the more difficult setup for replay. Any replay involving blocks at multiple protocol versions would need to use the corresponding releases of neard to apply the blocks. Compared to the overall effort to set up a replay, this additional requirement should be negligible.
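As an illustration of what a multi-version replay would involve, the sketch below splits a block range into contiguous segments by protocol version and assigns a binary to each. The `(height, protocol_version)` input shape and the `replay_plan` helper are assumptions for the example, not an existing tool.

```python
from itertools import groupby


def replay_plan(blocks, binaries):
    """Build a list of (binary, first_height, last_height) segments.

    `blocks` is an iterable of (height, protocol_version) pairs in height
    order; `binaries` maps protocol version to the neard release to use.
    """
    plan = []
    for version, group in groupby(blocks, key=lambda block: block[1]):
        heights = [height for height, _ in group]
        if version not in binaries:
            raise LookupError(f"no binary for protocol version {version}")
        plan.append((binaries[version], heights[0], heights[-1]))
    return plan


blocks = [(100, 55), (101, 55), (102, 56), (103, 56)]
binaries = {55: "/opt/bin/neard-1.28", 56: "/opt/bin/neard-1.29"}
plan = replay_plan(blocks, binaries)
```

Each segment would then be replayed with its own binary, with the state produced by one segment serving as the starting point of the next.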

This proposal introduces additional complexity at the intersection of resource control, hand-off and protocol upgrades. Implementation will need to extend the current protocol upgrade mechanism significantly to enable the functionality required to implement this proposal.

NB: this proposal likely does not involve any changes to the protocol per se, and as such it doesn’t really fit the NEP process. Unless the discussion reveals any reason to modify the protocol as part of this proposal, this thread is the RFC.

I am very much in favor of limiting how far back a binary can replay old blocks! Two thoughts.

  1. Should the supported range be defined as a number of epochs rather than a number of protocol versions? (Think about what happens if a bad protocol version is released and replaced within the same day.)
  2. The impact on challenges (sent after an upgrade for blocks before the upgrade) is not really discussed yet. I think that’s an important case to consider, but input from someone on the core team would be more useful than my guesses.
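To illustrate the first point, here is a small sketch of an epoch-based window. The activation epochs are made up and the helper name is hypothetical; the idea is only to show which protocol versions were live within the last N epochs.

```python
def versions_in_epoch_window(activations, current_epoch, window_epochs):
    """Protocol versions that were active within the last `window_epochs`.

    `activations` maps protocol version -> first epoch at which it was
    active; a version stays active until the next one is activated.
    """
    oldest_epoch = current_epoch - window_epochs
    ordered = sorted(activations.items(), key=lambda item: item[1])
    supported = []
    for i, (version, start) in enumerate(ordered):
        # Last epoch of this version: one before the next activation,
        # or the current epoch for the newest version.
        end = ordered[i + 1][1] - 1 if i + 1 < len(ordered) else current_epoch
        if start <= current_epoch and end >= oldest_epoch:
            supported.append(version)
    return supported


# Made-up history: version 55 was "bad" and replaced after a single epoch.
activations = {54: 100, 55: 200, 56: 201}
```

With this history, a "latest plus one preceding version" rule at version 56 covers only 55 and 56, which is two epochs of history, whereas a ten-epoch window evaluated at epoch 202 would still cover version 54.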

That’s a good point. Would it work if nodes were required to keep previous binaries around for some time in order to handle this?

Perhaps @minzhang or @mm-near can answer that?

I like the idea very much. We should also consider what to do when a bug is discovered that has been present in previous neard versions. We may need to patch the fix to all affected versions and re-release.

Good point. I suppose we will need some infra or manual process to investigate all prior versions when we discover a bug.

A similar and harder challenge would be finding bugs in older versions that are not present in the latest version. I suppose it depends on how much we care about them, that is, how big an impact they can have.

Thanks for writing down the proposal @nagisa.

I agree that we need a solution for this problem, as it will get harder and harder for our code to support very old blocks without a ton of engineering debt.

But I think that what you’re proposing here is a little bit too drastic - supporting just the latest (or latest-1) versions only - and I’d strongly prefer a time (epoch) -based solution instead.

For starters - we could try doing a one-off - deciding the lowest protocol version (50?) that we’ll be supporting going forward and:

  • double check that current code at HEAD is actually able to correctly produce all the past blocks before this version
  • clean up the code.

“For starters - we could try doing a one-off - deciding the lowest protocol version (50?) that we’ll be supporting going forward and: [...]”

I believe a one-off experiment to be one of the worse options we have available to us. It is important, I feel, that whatever approach we take, it is applied consistently and automatically. A one-off does not achieve either of those properties.

With a one-off experiment we may achieve the goal of removing some dead code, but at what expense (and to what other benefit)? If we don’t adjust our tools to handle the ad-hoc drops of past protocol version support at the same time, interacting with the NEAR protocol may become just way too clunky and manual for some use-cases. In order to avoid that we’d need to adjust our tooling in some way. Making a one-off change is going to be a great motivation for us to avoid all of that extra work. It will not act as sufficient motivation for us to develop new functionality in a way that makes dropping old protocol versions easy either, which only means that dropping support for protocol versions would remain a painful and time-consuming procedure going forward.

Dropping support for protocol versions needs to become a natural part of the project. It needs to be done consistently and predictably. If the consensus ends up being that supporting 1 or 2 past protocol versions per release is too drastic, I’m happy to consider alternatives, whether they mean supporting more versions or supporting them for a specified amount of time.

That said, even if drastic, I believe that supporting just one protocol version per release is ideal from the developer’s point of view. In practice a scheme like this would mean that we don’t need to worry about a dependency update being a protocol breaking change. We don’t need to worry about duplicating the code, even if briefly. We can always fix security vulnerabilities the way that seems proper, without spinning our wheels debating how to fix them while also preserving the original behaviour (what if the behaviour is the vulnerability?). neard could be developed as if it were a “regular” project all the time, to the great benefit of everybody’s productivity. Although I don’t have any data on the current protocol versioning scheme being an insurmountable roadblock for new (especially external) contributors, I would be quite surprised if it didn’t consistently and quickly come up at least as a snag. And snags hurt retention.

Supporting two protocol versions trades off many of these benefits for a more straightforward operational model. What value would supporting further additional versions have? Why is a time/epoch based model better? What would be the tradeoffs?
