Namada’s height-activated protocol upgrade failed to activate on the testnet at block height 37370 (18th of January 2023, between 19:00 and 21:00 UTC). This post diagnoses what happened and describes the steps we plan to take to avoid this kind of issue in the future.
Transactions rejected by the validity predicates
The most recent Namada public testnet spawned on the 12th of January 2023 at 17:00 UTC following a decentralised genesis process, with Namada’s release version 0.13.0.
The network ran reliably after genesis, but many community members reported issues with transactions. Transactions were accepted and added to the mempool, but once they were executed and included in blocks, validity predicates rejected them. The core team investigated and traced the problem to two fields in the genesis file, whitelist_vp and whitelist_tx.
These fields contain the hashes of the allowed transaction and validity predicate WASM files. Before including a transaction, Namada checks that the hashes of the executed transaction and validity predicate appear in the whitelists. The problem was that the two arrays stored the hashes in lowercase, whereas &vp_hash.to_string() and &tx_hash.to_string() return uppercase hashes, so the checks failed. The fix for the genesis file was to list the hashes in uppercase. The code was fixed by writing both arrays to storage in lowercase and making the checks case-insensitive.
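The mismatch can be illustrated with a minimal Rust sketch. The function names below are hypothetical and are not taken from the Namada codebase; the sketch only contrasts an exact string comparison with a case-insensitive one over a whitelist that was normalised to lowercase.

```rust
/// Hypothetical helper: normalises whitelist entries to lowercase,
/// mirroring the idea of writing the arrays to storage in lowercase.
fn normalise_whitelist(entries: &[&str]) -> Vec<String> {
    entries.iter().map(|e| e.to_ascii_lowercase()).collect()
}

/// Exact, case-sensitive lookup. An uppercase hash never matches a
/// lowercase whitelist entry, which is the failure mode described above.
fn is_whitelisted_exact(hash: &str, whitelist: &[String]) -> bool {
    whitelist.iter().any(|allowed| allowed.as_str() == hash)
}

/// Case-insensitive lookup. Hex hashes are ASCII, so an ASCII-only
/// comparison is sufficient. How an empty whitelist should be treated
/// is outside the scope of this sketch.
fn is_whitelisted(hash: &str, whitelist: &[String]) -> bool {
    whitelist
        .iter()
        .any(|allowed| allowed.eq_ignore_ascii_case(hash))
}
```

With a whitelist holding "abcdef01", the exact lookup rejects "ABCDEF01" while the case-insensitive lookup accepts it.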
Deploying this fix required upgrading the network to the patched protocol version v0.13.1. To allow enough time to coordinate a decentralised upgrade without halting the network, the upgrade was scheduled to activate at a future block height, 37370.
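As a rough sketch of what height-based activation means, the snippet below selects a rule set from the block height. The Rules enum and rules_at function are hypothetical names, and the assumption that the new rules apply to the block at the activation height itself (rather than the one after it) is ours, not something this post confirms.

```rust
/// Activation height of the upgrade, as stated in this post.
const UPGRADE_HEIGHT: u64 = 37370;

/// Hypothetical marker for which protocol rules apply to a block.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Rules {
    Legacy,
    Patched,
}

/// Assumption: the patched rules apply from UPGRADE_HEIGHT onwards,
/// inclusive. Every node must make the same choice at the same height,
/// otherwise nodes compute different state roots for the same block.
fn rules_at(height: u64) -> Rules {
    if height >= UPGRADE_HEIGHT {
        Rules::Patched
    } else {
        Rules::Legacy
    }
}
```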
Protocol release process issue
On the 17th of January, the Namada community was informed about the release and given instructions for upgrading the network with the fix. The issue was that two release versions had been published: v0.13.1 and v0.13.1-hardfork. Several clarifications were issued asking validators to upgrade their nodes to v0.13.1-hardfork. Unfortunately, some validators upgraded to v0.13.1 while others upgraded to v0.13.1-hardfork, which resulted in two different state roots at block height 37370. The network forked in two, with neither fork having sufficient voting power (2/3) to make progress and continue producing blocks.
During the investigation, the team also found that both protocol versions contained another bug that would’ve prevented nodes from syncing from scratch.
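The stall can be sketched with a small quorum calculation. The voting power figures used in the usage note are invented for illustration, the function names are hypothetical, and the strict "more than 2/3" threshold is the usual Tendermint-style rule, assumed here rather than quoted from Namada's code.

```rust
use std::collections::BTreeMap;

/// Returns true if `power` is strictly more than two thirds of `total`.
/// Uses integer arithmetic (widened to u128) to avoid rounding and overflow.
fn has_quorum(power: u64, total: u64) -> bool {
    (power as u128) * 3 > (total as u128) * 2
}

/// Sums voting power per release version. Each input pair is
/// (release version a validator runs, that validator's voting power).
fn power_by_version(validators: &[(&str, u64)]) -> BTreeMap<String, u64> {
    let mut totals = BTreeMap::new();
    for (version, power) in validators {
        *totals.entry((*version).to_string()).or_insert(0u64) += *power;
    }
    totals
}
```

For example, with a hypothetical split of 40 units of voting power on v0.13.1 and 60 on v0.13.1-hardfork, has_quorum returns false for both groups, so neither side can commit a block.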
Fixed protocol version and deployment paths
A new protocol version containing the correct code was created: v0.13.2. After careful assessment, the team identified two options for deploying it: upgrading the network or restarting it.
Option 1) Recover the network with another upgrade
This path required coordinating with the community validators running v0.13.1 (but not those running v0.13.1-hardfork) so that they would upgrade to v0.13.2. Recovering the network is always the preferred option when possible, but in this case it carried a lot of complexity and hence risk of failure. It involved: creating a release version able to resync from 0 (2.5 to 3 hours, only relevant for nodes on v0.13.1) and then activate the hardfork; simulating a hardfork in a devnet to test that release; and communicating with all affected validators and waiting for them to upgrade to the correct version.
Option 2) Restarting the network
This option required releasing v0.13.2, testing it in a devnet, and restarting the network with the correct version of the protocol and the same genesis validator set as on the 12th of January 2023.
Going forward
Given the coordination complexity and risks of option 1 in a decentralised network, the team proposed restarting the network. To avoid the same issues going forward, the team has agreed to the following process improvements:
Devnet configurations (devnets are the core team's internal testnets for QA) will be as close to those of public testnets as possible, including genesis configurations. This would’ve helped catch the issue with the hash whitelist check.
Each upgrade will ship as a single release only, which would’ve significantly decreased the risk of validators upgrading to different versions.
Join Namada's Discord server for feedback and questions.