Todo list

Posted: 2024-01-20

As Paul Ford wrote, “working is hard, but thinking about working is pretty fun. The result is the software industry.” In lieu of spending my spare time writing code, I wrote a todo list of code I’d like to write!

TLA+

I like the mental challenge of formal verification with TLA+. It’s a heavyweight tool, but it can be invaluable for finding protocol bugs in distributed systems before they become impossible to fix.

The TLA+ model checker is called TLC, and it works by exhaustively searching the state space and verifying invariants for each state. It can be hard to scope a model to both express the essential aspects of the system and be checkable in a reasonable amount of time.1

There are two changes I’d like to make to TLC.

Symmetries

To reduce the search space, TLA+ allows authors to assert that states have symmetry. That is, some states are equivalent for the purposes of invariant checking. When checking a model with symmetry, TLC permutes each reached state by the allowed symmetries and chooses the minimal permutation under an arbitrary order. I’ll refer to this minimal permutation as the “canonical” state.2 That canonical state is used to determine if the candidate is a new state to explore, or has already been reached and checked.

The only symmetries I ever attempted to use in TLA+ were permutations of equivalent model values. Model values are opaque, unique values, in my case used for things like “client identities” or “Paxos acceptor identities.” But symmetry in TLA+ is more powerful than that, and authors can provide arbitrary functions whose domain and range are model values.

As a consequence of this generality, finding the canonical state is an instance of the NP-hard constructive orbit problem, so TLC simply evaluates every permutation to find the canonical state. However, for the simple case of permuted sets of model values we can do much better, as Lamport noted in 2015.
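
Concretely, the per-state work of exhaustive canonicalization amounts to something like the following sketch, where Value, State, and Perm are stand-ins rather than TLC’s actual representation:

    // Hypothetical sketch of exhaustive canonicalization: apply every
    // permutation in the symmetry group to the state and keep the
    // lexicographically smallest result.
    #include <algorithm>
    #include <vector>

    using Value = int;                 // stand-in for a TLC value
    using State = std::vector<Value>;  // a state as an ordered tuple of values
    using Perm = std::vector<Value>;   // perm[v] is the image of model value v

    State Apply(const Perm& perm, const State& s) {
      State out = s;
      for (Value& v : out) v = perm[v];
      return out;
    }

    // O(|perms| * |state|) work for every state reached during checking.
    State Canonical(const std::vector<Perm>& perms, const State& s) {
      State best = s;
      for (const Perm& perm : perms) best = std::min(best, Apply(perm, s));
      return best;
    }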

As long as the members of each symmetric set are “dense” in the sort order of all values,3 the permutation that yields a candidate state’s canonical form can be constructed in linear time by inspecting the candidate’s elements in lexicographic order (sketched in code after the list):

  1. If the element is not a symmetry set member, skip to the next element.
  2. If the element is not a key in the permutation map, insert it. Use the smallest unused member of its symmetry set as the value.
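
Here’s that construction as a sketch with the same kind of stand-in types; it assumes flat states whose elements compare directly, whereas real TLC values can be compound:

    // Hypothetical sketch of the linear-time construction: walk the state's
    // elements in order, and the first time a symmetry set member appears,
    // greedily map it to the smallest member of its set not yet used. With
    // "dense" symmetry sets this yields the lexicographically minimal state.
    #include <map>
    #include <set>
    #include <vector>

    using Value = int;
    using State = std::vector<Value>;

    // symmetry_set_of[v] names the symmetry set containing v, if any;
    // unused_members[set] holds the members of that set not yet assigned.
    State CanonicalFast(const State& s,
                        const std::map<Value, int>& symmetry_set_of,
                        std::map<int, std::set<Value>> unused_members) {
      std::map<Value, Value> perm;  // the permutation built so far
      State out;
      out.reserve(s.size());
      for (Value v : s) {
        auto set_it = symmetry_set_of.find(v);
        if (set_it == symmetry_set_of.end()) {
          out.push_back(v);  // step 1: not a symmetry set member, keep as-is
          continue;
        }
        auto [perm_it, inserted] = perm.try_emplace(v);
        if (inserted) {
          // step 2: first occurrence of v, map it to the smallest unused
          // member of its symmetry set
          std::set<Value>& unused = unused_members[set_it->second];
          perm_it->second = *unused.begin();
          unused.erase(unused.begin());
        }
        out.push_back(perm_it->second);
      }
      return out;
    }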

In a spec with two symmetry sets of three elements each, this approach would reduce the per-state symmetry work by ~36x, since TLC would otherwise apply all 3! × 3! = 36 permutations to each state. For a spec with one symmetry set of three elements and another of five, the reduction would be ~720x (3! × 5! = 720).4 Making this change would likely close this 7-year-old GitHub issue.

Distributed checking

Extremely large models are impractical to check on a single machine; I understand that it’s common in academia to let a model checker run over the weekend. To speed things up with more CPU, TLC supports distributed checking, but it doesn’t tolerate process or machine failures well.5 It would be good to tackle distributed checking with modern distributed systems tools.

This is a pretty large project. There are fundamentally four parts of the model checking machinery (a single-machine sketch follows the list):

  1. The representation of states
  2. The code to evaluate a state and check its invariants
  3. A queue containing unevaluated states
  4. A set of already-evaluated states (e.g. represented as hashes or fingerprints)
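
Here’s a rough sketch of how those four parts fit together on a single machine; the helper functions are assumed stand-ins for the spec-evaluation machinery, not TLC’s actual code:

    // Rough single-machine checking loop over the four parts above.
    #include <cstdint>
    #include <deque>
    #include <unordered_set>
    #include <vector>

    using State = std::vector<int>;               // 1. representation of states
    bool EvaluateInvariants(const State&);        // 2. invariant checking and
    std::vector<State> Successors(const State&);  //    successor expansion (assumed)
    uint64_t Fingerprint(const State&);           //    state hashing (assumed)

    void Check(const std::vector<State>& initial_states) {
      // 3. queue of unevaluated states
      std::deque<State> unevaluated(initial_states.begin(), initial_states.end());
      // 4. fingerprints of already-evaluated states
      std::unordered_set<uint64_t> evaluated;
      while (!unevaluated.empty()) {
        State s = std::move(unevaluated.front());
        unevaluated.pop_front();
        if (!evaluated.insert(Fingerprint(s)).second) continue;  // already seen
        if (!EvaluateInvariants(s)) return;  // invariant violation: stop and report
        for (State& next : Successors(s)) unevaluated.push_back(std::move(next));
      }
    }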

At Google, I used FlumeJava for dynamic parallelism, but this approach to model checking was very IO-intensive and only feasible to run when there weren’t higher-priority workloads. FlumeJava requires a dataflow graph, which meant that the model had to be evaluated in layers,6 the number of layers to evaluate had to be predetermined at graph setup time,7 and there was a shuffle/sort step between each layer to update the fingerprint set. The FlumeJava setup was efficient at processing the very wide layers in the middle of the model, but much slower than stock TLC for the narrower layers at the start of the model.8 There was an impedance mismatch between the continuous process of state expansion and the static dataflow graph.

One approach to bringing scale-out compute to model checking in a more fault-tolerant way might be to emit states into a multi-consumer queue (e.g. Kafka) and keep the fingerprint set in a distributed RAM cache (e.g. Redis). Even if the queue and fingerprint set are kept in separate systems like this, we can guarantee that every valid state is evaluated at least once:

  1. When a processor dequeues a state, it checks if the state’s fingerprint is already in the fingerprint set. If so, the processor can discard the state from the queue without evaluating its successors.
  2. Otherwise the processor must first evaluate the state’s successors and insert them into the queue, then insert the state’s fingerprint into the fingerprint set, then finally discard the state from the queue. In that order, a crash partway through can only duplicate work, never lose a state.

In this naive approach, when multiple states generate the same successor it’s quite likely that states will be inserted into the queue more than once. I think this would stress the queue and have low goodput. However, it may be sufficient to add a single optimization: processors need not enqueue new states whose fingerprints are already in the fingerprint set. Because the fingerprint set never sees deletions, if a state is discardable at enqueue time it is guaranteed to still be discardable at dequeue time.
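
Here’s a sketch of that worker loop with the enqueue-time filter included. StateQueue and FingerprintSet are abstract stand-ins for whatever backs them (Kafka, Redis, or otherwise), and the helper functions are assumed:

    // Hypothetical worker for the split queue / fingerprint-set design. The
    // ordering is what matters: successors are enqueued before the state's
    // fingerprint is recorded, and the dequeued state is discarded last.
    #include <cstdint>
    #include <string>
    #include <vector>

    struct StateQueue {
      virtual std::string Dequeue() = 0;            // leases a state to this worker
      virtual void Enqueue(const std::string&) = 0;
      virtual void Ack(const std::string&) = 0;     // discards the leased state
      virtual ~StateQueue() = default;
    };

    struct FingerprintSet {
      virtual bool Contains(uint64_t fp) = 0;
      virtual void Insert(uint64_t fp) = 0;
      virtual ~FingerprintSet() = default;
    };

    // Assumed helpers standing in for the spec-evaluation machinery.
    uint64_t Fingerprint(const std::string& state);
    void CheckInvariants(const std::string& state);  // reports violations
    std::vector<std::string> Successors(const std::string& state);

    void Worker(StateQueue& queue, FingerprintSet& seen) {
      for (;;) {
        const std::string state = queue.Dequeue();
        if (seen.Contains(Fingerprint(state))) {
          queue.Ack(state);  // 1. already evaluated elsewhere: just discard
          continue;
        }
        CheckInvariants(state);
        for (const std::string& next : Successors(state)) {
          // Optimization: skip successors that are already known fingerprints.
          if (!seen.Contains(Fingerprint(next))) queue.Enqueue(next);
        }
        seen.Insert(Fingerprint(state));  // 2. only after successors are enqueued
        queue.Ack(state);                 //    and discard the input state last
      }
    }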

If that’s insufficient to make model checking CPU-bound rather than IO-bound, then the queue and fingerprint set should be merged into a single system which can guarantee that any given fingerprint is enqueued at most once, possibly with multiple queues partitioned across the fingerprint space.
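
In that merged design, states would be routed to the partition that owns their fingerprint (e.g. fingerprint modulo the partition count), and the core of each hypothetical partition server could be as simple as a local insert-if-absent followed by a local enqueue:

    // Hypothetical partition server owning one slice of the fingerprint space.
    // Because the fingerprint set and the queue for that slice live together,
    // "enqueue only if never seen before" is a single local step, so a given
    // fingerprint is enqueued at most once.
    #include <cstdint>
    #include <deque>
    #include <string>
    #include <unordered_set>

    class Partition {
     public:
      // Returns true if the state was new for this partition and was enqueued.
      bool OfferState(uint64_t fingerprint, const std::string& state) {
        if (!fingerprints_.insert(fingerprint).second) return false;
        queue_.push_back(state);
        return true;
      }

     private:
      std::unordered_set<uint64_t> fingerprints_;
      std::deque<std::string> queue_;
    };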

Good metrics would also be important. There are a lot of nice options out there for both metrics collection and visualization.

Cancellable queue

For fun, I wrote a multi-producer, multi-consumer queue which operates on values in place,9 can accommodate variable-length records, and has safely cancellable waits. Its API for pinning and releasing records that have been dequeued uses intuitive value semantics. There are a few things I’d like to do to extend it and experiment with modern C++ language features.

Open-loop benchmarking via coroutines

I wrote the queue with a pair of benchmarks attached: a full-throttle throughput microbenchmark with fixed-size records and a more holistic benchmark which varied record sizes and tracked latencies. However, the holistic benchmark is closed-loop,10 so when the queue is in overload its reported latencies are not necessarily predictive of what a user might see.

I think it would be fun to support open-loop benchmarks using coroutines, if only to learn more about the coroutine system in C++. I’d rewrite the enqueue function as a coroutine which suspends if the queue is full, and the dequeue function as a coroutine which suspends if the queue is empty. In each case, a waiting coroutine would be resumed by the “other side” of the queue, or by cancellation.
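
As a single-threaded sketch of the shape this might take (no locking or cancellation, and not the actual queue’s API), an awaitable enqueue could look like:

    // Minimal sketch: enqueue() returns an awaitable that suspends the
    // producer coroutine while the queue is full, and dequeue() resumes one
    // blocked producer. The real queue would need thread-safety and
    // cancellation on top of this shape.
    #include <coroutine>
    #include <cstddef>
    #include <deque>
    #include <iostream>
    #include <string>
    #include <utility>

    // A fire-and-forget coroutine type with no result.
    struct Task {
      struct promise_type {
        Task get_return_object() { return {}; }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
      };
    };

    class BoundedQueue {
      std::size_t cap_;
      std::deque<std::string> items_;
      std::deque<std::coroutine_handle<>> blocked_producers_;

     public:
      explicit BoundedQueue(std::size_t cap) : cap_(cap) {}

      // The awaitable returned by enqueue(): ready if there is room, otherwise
      // the producer suspends until a dequeue makes room and resumes it.
      struct EnqueueAwaiter {
        BoundedQueue& q;
        std::string record;
        bool await_ready() const noexcept { return q.items_.size() < q.cap_; }
        void await_suspend(std::coroutine_handle<> h) {
          q.blocked_producers_.push_back(h);
        }
        void await_resume() { q.items_.push_back(std::move(record)); }
      };

      EnqueueAwaiter enqueue(std::string record) {
        return {*this, std::move(record)};
      }

      // Pop one record and hand its freed slot to the longest-waiting producer.
      std::string dequeue() {
        std::string r = std::move(items_.front());
        items_.pop_front();
        if (!blocked_producers_.empty()) {
          std::coroutine_handle<> h = blocked_producers_.front();
          blocked_producers_.pop_front();
          h.resume();  // completes that producer's pending enqueue
        }
        return r;
      }

      bool empty() const { return items_.empty(); }
    };

    Task producer(BoundedQueue& q, int id) {
      for (int i = 0; i < 3; ++i) {
        co_await q.enqueue("producer " + std::to_string(id) + " record " +
                           std::to_string(i));
      }
    }

    int main() {
      BoundedQueue q(2);
      producer(q, 1);  // fills the queue, then suspends on its third record
      producer(q, 2);  // suspends immediately: the queue is already full
      while (!q.empty()) std::cout << q.dequeue() << "\n";
    }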

The real trick would then be to implement the latency measurement code as a coroutine, too! Ultimately more consistent enqueue interarrival times would yield more realistic latency measurements, even when the queue is overloaded.
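
Even without coroutines, the measurement idea can be sketched serially: take each latency from the operation’s scheduled start time rather than from whenever the benchmark loop got around to issuing it. The coroutine version would go further and keep issuing on schedule with many enqueues in flight. DoEnqueue here is an assumed stand-in:

    // Latency of each operation, measured from its *scheduled* start time, so
    // backlog caused by slow earlier operations is not hidden from the results.
    #include <chrono>
    #include <vector>

    void DoEnqueue();  // assumed: one (possibly blocking) enqueue

    std::vector<std::chrono::nanoseconds> MeasureFromSchedule(
        int n, std::chrono::nanoseconds interarrival) {
      using clock = std::chrono::steady_clock;
      std::vector<std::chrono::nanoseconds> latencies;
      const clock::time_point start = clock::now();
      for (int i = 0; i < n; ++i) {
        const clock::time_point scheduled = start + i * interarrival;
        while (clock::now() < scheduled) {
          // wait for this operation's scheduled slot (spin for simplicity)
        }
        DoEnqueue();
        latencies.push_back(std::chrono::duration_cast<std::chrono::nanoseconds>(
            clock::now() - scheduled));
      }
      return latencies;
    }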

Efficiency via emplace

The enqueue API takes a std::string_view argument and copies its contents into the queue. It’d be nicer to allow in-place construction via an emplace API to avoid copies in some circumstances. This requires a little refactoring to put more of the enqueue code in the header so the template with the forwarding references can be instantiated, but is not fundamentally difficult.
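
As a sketch of the mechanism (a toy fixed-capacity buffer, not the queue itself), an emplace built on a forwarding-reference template and placement new might look like:

    // Self-contained sketch of the emplace idea: each record is constructed
    // directly in the buffer's storage from forwarded constructor arguments,
    // where an enqueue(std::string_view) would copy an already-built record.
    #include <cstddef>
    #include <new>
    #include <string>
    #include <utility>

    template <typename Record, std::size_t Capacity>
    class TinyBuffer {
     public:
      TinyBuffer() = default;
      TinyBuffer(const TinyBuffer&) = delete;
      TinyBuffer& operator=(const TinyBuffer&) = delete;

      template <typename... Args>
      bool emplace(Args&&... args) {
        if (size_ == Capacity) return false;
        // Construct the record in place; no temporary Record, no copy.
        new (storage_ + size_ * sizeof(Record))
            Record(std::forward<Args>(args)...);
        ++size_;
        return true;
      }

      Record& operator[](std::size_t i) {
        return *std::launder(
            reinterpret_cast<Record*>(storage_ + i * sizeof(Record)));
      }

      ~TinyBuffer() {
        for (std::size_t i = 0; i < size_; ++i) (*this)[i].~Record();
      }

     private:
      alignas(Record) unsigned char storage_[Capacity * sizeof(Record)];
      std::size_t size_ = 0;
    };

    // Usage: the std::string is built in place from a count and a character.
    // TinyBuffer<std::string, 8> buf;
    // buf.emplace(1000, 'x');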

Efficiency via batch ops

It’d be nice to support batched dequeue especially, to reduce locking delays when records are released. I’m not sure whether I’d like to do this as an API addition, or as a wholesale replacement of the single-record dequeue function with a batched one.

A very fast replicated log

Finally, a primitive I’ve seen implemented a few times in domain-specific ways is a replicated log decoupled from, but clearly meant for, a specific replicated state machine. Consider write-ahead logs, message queues, or other consistency-critical data structures. I’ve mostly seen logs decoupled from the larger state to minimize latency.

I think it would be fun to write an extremely limited replicated state machine for a very high-performance ordered log. The only supported operations would be appending opaque bytes (or atomic batches of opaque bytes) to the tail and discarding them from the head.11 Much of the interesting work would be in making the state machine’s fault-tolerance properties configurable. Rather than an N-of-N phase 2 quorum as with Kafka, it could be flexible and support e.g. a strict majority quorum like classic Paxos, or a 2-of-N quorum to tolerate a single failure with nicer latency characteristics. Rather than requiring the proposer to be colocated with an acceptor as in Raft, it could support proposers that are not colocated with any acceptor.
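
To make “configurable” concrete, here’s a hypothetical sketch of a phase-2 (accept) quorum policy; any real design would also need phase-1 quorums chosen to intersect every phase-2 quorum:

    // Hypothetical configurable phase-2 (accept) quorum policy. A real design
    // must also pick phase-1 quorums that intersect every phase-2 quorum
    // (e.g. |Q1| + |Q2| > N for simple counting quorums).
    #include <cstddef>
    #include <stdexcept>

    enum class QuorumPolicy {
      kAllAcceptors,  // N of N
      kMajority,      // strict majority, as in classic Paxos
      kFixedCount,    // e.g. 2 of N for lower tail latency
    };

    std::size_t RequiredAcks(QuorumPolicy policy, std::size_t num_acceptors,
                             std::size_t fixed_count = 2) {
      switch (policy) {
        case QuorumPolicy::kAllAcceptors: return num_acceptors;
        case QuorumPolicy::kMajority:     return num_acceptors / 2 + 1;
        case QuorumPolicy::kFixedCount:   return fixed_count;
      }
      throw std::logic_error("unknown quorum policy");
    }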

I think a useful addition, not common in other similar systems, would be to include a client-controlled element in the ballot tuple.12 This way, applications with a primary client can retain strong-consistency13 knowledge of the log contents even if the proposer changes due to e.g. a machine failure. I’ll perhaps expand on why this would be helpful in a future post.
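
Footnote 12 has the background on ballot tuples; purely as an illustration (the field names and their ordering are hypothetical, not a committed design), such a ballot might look like:

    // Hypothetical ballot shape: a lexicographically ordered tuple containing
    // a client-controlled element alongside the usual round number and
    // proposer identity. The field order here is illustrative only.
    #include <cstdint>
    #include <tuple>

    struct Ballot {
      uint64_t client_epoch;  // client-controlled element
      uint64_t round;
      uint32_t proposer_id;

      friend bool operator<(const Ballot& a, const Ballot& b) {
        return std::tie(a.client_epoch, a.round, a.proposer_id) <
               std::tie(b.client_epoch, b.round, b.proposer_id);
      }
    };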

Thinking about working

I might play around with my own cancellable queue, but realistically, in the absence of a work-related reason to take on one of these tasks for an employer, I’m unlikely to spend my dwindling spare time (I have a six-month-old!) on any of the other, more complex projects. But thinking about creative intellectual work is pretty fun, and I’m glad I found some time to put these thoughts into writing. Maybe one of my readers will tell me related work is already under way or complete!

--Chris


  1. There is also a theorem prover which can be used to demonstrate that a spec maintains its invariants without explicitly exploring the state space. I got much of the way through a proof of a nontrivial system, but ultimately running TLC was more effective at finding issues. My understanding is that ~every large project that uses TLA+ checks the models with TLC. ↩︎

  2. TLC values include atoms like numbers and compound objects like sets, and there is a strict weak order over all possible TLC values. The model variables whose values make up the state are treated as an ordered tuple, and states are compared lexicographically. ↩︎

  3. That is, given a set S of symmetric model values and a total order over elements of S, and the set V containing all possible values (such that S is a subset of V), and recalling there is a strict weak order over elements of V, then: \A v \in V : v \notin S => ~(\E a, b \in S : a < v /\ v < b) ↩︎

  4. TLA+ has a somewhat misleading syntax for defining multiple symmetry sets. Permutations(S) returns the set of functions which permute S. To use multiple symmetry sets, like S == {s1, s2} and T == {t1, t2}, you write: Symm == Permutations(S) \union Permutations(T). By the definitions of \union and Permutations, this looks wrong: minimizing a state like << s2, t2 >> needs to apply a function from each set, so you might expect to need \X to form a cross product. Instead, TLC handles this by special-casing symmetry definitions built as unions of sets of permutation functions. ↩︎

  5. It’s a great irony that a tool for checking the properties of fault-tolerant distributed systems does not tolerate faults well! ↩︎

  6. Specifically, the first layer contains the initial states, the second layer contains all new states reachable from the first layer, the third layer contains all new states reachable from the second layer, and so on. ↩︎

  7. Technically, this is true of each FlumeJava pipeline, but a program could (and mine did) instantiate multiple pipelines. Unfortunately, to use the output of a prior pipeline as input to the next, the prior outputs have to be fully serialized to stable storage. ↩︎

  8. In the first 10 minutes checking a model with several narrow layers early on, stock TLC on a single machine might evaluate the same number of states as FlumeJava with arbitrary amounts of compute available. ↩︎

  9. Records live in the queue because I considered making the queue shareable across processes, and therefore across address spaces. That wouldn’t be a fundamental API change, but it would make some of the condvar wait code much less pleasant. I was writing the queue for fun, so I left it single-process. ↩︎

  10. For a succinct summary of the differences between open-loop and closed-loop offered load, see this blog post from Marc Brooker. ↩︎

  11. To be self-contained, the state machine should also store the acceptor set and quorum configuration. ↩︎

  12. While descriptions of Paxos generally present the ballot as a single number, the protocol in fact only requires that there is a total order over ballots. It’s not unusual for Paxos ballots in real implementations to be a tuple of integers ordered lexicographically, or even for the ballot to include the proposer’s identity. ↩︎

  13. Intellectually, my favorite way to achieve this would be to allow “appends” of zero bytes. But in reality I think that’s too clever by half and in a system as complex as a Paxos implementation it’s better to be explicit and include a noop state machine command. ↩︎

