DCP0004 or "On RCs and fully testing consensus changes"

DCP0004 has just been released for appreciation by the stakeholders and will (hopefully) be held for voting in the next few months. But what is the importance of it?

TL;DR

The 1.4.0rc1 and rc2 release candidates had an unintended change that caused several reorgs due to miners allowing transactions in blocks that previous versions rejected (a hard fork). Some (or one or a few) miners upgraded and started generating hard forked blocks and now we have to carry a new vote to correct this behavior so that Lightning Network transactions can function.

See the next section for a longer explanation of the events.

In the order of events

PRs 1471, 1500 and 1520 introduced a change in how the dcrd nodes handle utxos (unspent transaction outputs). This was not supposed to be a hard forking change and was introduced to fix some inconsistent behavior regarding the confirmation count of transactions (see issue 618 and the PR’s notes as a starting point of why this work was done).

In the course of correcting the semantics for the confirmation count of mined transactions, it ended up fixing a bug in the original implementation of the sequence lock validation, triggered when including a transaction with enabled sequence locks in a block.

This difference in validation rules caused nodes running anything older than the 1.4 rcs to reject blocks having the offending transaction while nodes running the rc would accept them.

Some PoW miners upgraded to the rc and started generating blocks that were rejected by the older nodes. In older nodes this was manifested by lines such as the following in their logs (this is the earliest error message I found in the logs of one of my nodes running an older version of the software):

2019-01-18 08:44:23.000 [INF] BMGR: Rejected block 000000000000000012f274c9066f002b43338492805a28ece2a031abbc3bee58 from x.x.x.x:y: unable to find unspent output 2ebfc374f31ceb18343619e028cc0924246ca5f597bc9408ff10cd8d82ea2f7b:0 referenced from transaction 495559c93da67b88b3ad92d8ecc4ec54df47fa441fdc56dc4f21728ee2782fe7:0

Newer nodes on the other hand, started seeing a bunch of reorg messages due to the PoS system working as intended. In a pure PoW blockchain the 1.4 miners would have diverged into their own chain and then users would have to individually decide which chain to follow possibly even resulting in the creation of a whole new coin. But the PoS component resulted in the 1.4 chain stalling after very few blocks causing the observed reorgs.

While these shallow reorgs (the maximum recorded reorg depth was of 3 blocks) for nodes that had already upgraded to the 1.4 rcs are not a huge problem when they happen during the regular course of network operation, if this went unnoticed or unfixed for much longer there was a risk that more VSPs or solo voters would start upgrading their voting wallets, generating deeper reorgs which would then start seriously disrupting users and forcing a network-wide effort for either updating to the new rules or reverting to the older.

At this time, issues started popping up about invalid blocks and stuck transactions (1568, 1573).

After a bit of head scratching and trying different things, @davecgh nailed down the problem to the combination of an existing bug in sequence lock enabled transactions and the changes introduced in the utxo semantics reversal PRs.

The next step was reversing the hard forking change (done in PR 1570) so that the next rc (1.4.0rc3) and eventually the final 1.4.0 version could be generated with the correct behavior in place.

While this corrects the issue causing the acceptance of different blocks in different versions, the buggy validation behavior of transactions with sequence lock enabled means that Lightning Network transactions cannot be used on mainnet. This in turn required the writing of DCP0004 and its corresponding implementation (PRs 1578, 1579).

For the next few months this agenda will be enabled for voting, and stakeholders will need to approve of this change for sequence locks in general and LN in particular to be usable.

This was a summary of all the major events in uncovering this issue. Now, for some more details and conclusions.

Quick review of utxo semantics

In versions before 1.4 (or more correctly before commit df1898c) the validation code considered a given transaction approved (or somewhat incorrectly referenced in some places as “valid”) only on the following block, assuming that (following) block approved the parent where the transaction was added. In other words, one could loosely interpret that as “transactions are not approved until the next block approves it”.

The “Reverse utxo set semantics” commit switched the code around, so that transactions are always considered approved unless the next block explicitly disapproved its parent. Again, one could loosely interpret that as “transactions are approved by default unless a future block disapproves it”.

This second interpretation generally makes more sense, given that (for example) an output of transaction is spendable within the block it was created, so it’s definitely “valid” for spending purposes, all other things being the same.

Several small inconsistencies in the software arose from using the first interpretation, which can be seen from reading the relevant issues and commits (starting at issue 618), so the utxo semantics reversal was implemented.

Origin of the Sequence Lock Bug

The sequence bug originated due to the interaction of these two bits of code as seen in the checkConnectBlock function for version 1.3.0:

Source: decred/dcrd/blob/release-v1.3.0/blockchain/validate.go#L2375:L2380

and

Source: decred/dcrd/blob/release-v1.3.0/blockchain/validate.go#L2429:L2439

The error is quite subtle if not pointed out or when thinking using the wrong utxo approval interpretation, but it consists of:

Loading the utxoviewpoint with the spent outputs of the regular transactions of the previous block (that’s done by using the ViewpointPrevValidInitial constant);
Then checking the sequence lock on transactions of the current block (done by iterating over the block.Transactions[1:]).

Can you spot the problem now? Given that the utxos are loaded for the previous block, attempting to check the sequence lock for transactions in the current block triggers an “output referenced from transaction either does not exist or has already been spent” error (due to the previous transaction not existing in the utxoView).

Sequence locks would only work if the previous block also contained a reference to the relevant prior transaction (or if this prior transaction was referenced from a stake transaction in the current block since block approval only affects regular transactions).

This is now captured on a set of tests that were implemented to ensure the legacy behavior (that is, the behavior with the bug) remains even in new versions of the software:

Source: decred/dcrd/blob/release-v1.4.0/blockchain/validate_test.go#L249:L252

1.4 Release Candidates & Hard Fork

After commit df1898c that bug was naturally solved because starting at it, the utxo set was loaded for the current block (the following is the same relevant spot in checkConnectBlock, but this time in version 1.4.0):

Source: decred/dcrd/blob/release-v1.4.0/blockchain/validate.go#L2939:L2947

Notice that there is no reference to the ViewpointPrevValidInitial constant and not even a pointer to the parent block. So the utxo set is loaded for transactions spent in the current block and the status of sequence locks of those transactions can be checked correctly.

Due to the difference in these two implementations, blocks produced by miners using the first two release candidates for version 1.4.0 could include transactions with sequence locks enabled even if the transactions they were spending from were not included in a prior block. These blocks, however, would be rejected by versions 1.3.0 and lower of the full node.

Due to most (all?) VSPs not yet running one of the 1.4 rcs, these blocks were unable to get the required votes to keep the 1.4 only chain running, therefore causing several reorgs, as seen in this image (transaction 495559c9 had one of its inputs with sequence: 0, enabling sequence locks for it):

Buggy transaction included in sidechain blocks

DCP0004 Proposal & Voting

So now with the problem identified, the older (buggy) behavior was reimplemented by creating a special function createLegacySeqLockView which returns a utxo view that maintains the same older semantics of loading utxo set data from the previous block.

But in order to use transactions carrying a sequence lock enabled input, the behavior needs to be fixed rather than maintained.

This fix represents a change in consensus rules, so it’s a good thing that in decred we actually have a way of deploying consensus-changing upgrades without having to go through the drama associated with uncontrolled hard forks!

The process for upgrading the network has been started in the past week with the writing and implementation of DCP0004, which should be put for a vote in 1.4.0 as the nodes (and in particular miners, VSPs and stakeholders) start updating their nodes.

Preventing regressions & similar bugs in the future

The root cause for the original buggy behavior in the sequence lock calculation was lack of comprehensive integration tests that ensured that transactions carrying a sequence lock enabled input could be included in blocks under several different scenarios, such as when the previous transaction being spent was/wasn’t included in the parent block.

While there were several unit tests for decoding the sequence time lock value and ensuring it behaved correctly, writing tests that required a full backing blockchain setup in a specific state (e.g.: TestCheckConnectBlockTemplate) was a bit verbose and unwieldy (to put it mildly) and required generating very long chains before hitting the thresholds for activating the features under test, making the CI timeout and fail with false positives.

So the integration test suite was refactored across several PRs (starting at 1583) to simplify the procedure for writing such tests. PR 1589 in particular deals with simplifying the tests for the correct behavior of sequence locks after the agenda is successfully approved and activated and on the diff you can see the amount of boilerplate code that could be removed after this refactoring.

Conclusion

This was the first time I’ve experienced a hard forking bug up close, so it was quite enlightning and further imprinting how hard it is to make changes that affect consensus-critical code and how meticulous you need to be when implementing these changes and their accompanying tests.

It was also very helpful that at least one miner was willing to test the newest node version, so that the bug was caught before being released to the wider community and becoming a bigger problem to solve.

Acknowledgements

Thanks to @davecgh, @jy-p and @richardred for reviewing, proof-reading and adding corrections to this article.

Matheus Degiovani