Disaggregated systems are distributed systems

Posted: 2023-04-02

In January, The Diff’s open thread asked about the consequences of compute models becoming more abstract than physical hardware. I’ve designed, developed, and operated distributed systems whose purpose is to make storage more abstract than physical hardware, and learned some important lessons. Abstraction leads to disaggregation. Disaggregated storage forces all client systems to be distributed systems. Disaggregated systems are distributed systems.

Abstraction and disaggregation

Distributed systems are more complex than centralized systems. If all of a system’s resources are locally attached to all its workers, and all its workers can communicate perfectly (e.g. all messages are delivered exactly once, in order), it’s feasible to mentally model the allowed behaviors for the workers and the state. It can make life easier for the system’s clients too, since it’s easier for the system to provide linearizable semantics. This centralized model maps to a single physical machine.1

We build distributed systems to overcome the scaling and availability limitations of centralized systems. E.g. a distributed storage service abstracts away physical media, instead offering a similarly-performant interface to an “infinitely large” disk which “never fails.” To achieve the illusion, the storage is necessarily across a network, disaggregated from the client workers. This means that every system which uses the service is a distributed system (spanning the clients and the storage service), which makes the mental model much more subtle.

Compute abstraction and unexpected complexity

A concrete compute model is to have a single physical box in a closet, which must be unplugged before a replacement box can start up. With locally attached storage, this is a centralized system. With disaggregated storage, it is a distributed system.

The classic unreliable network model allows messages to be delayed, dropped, reordered, or repeated. Some disaggregated storage protocols (e.g. NFSv3) do not guarantee consistency under this model. Unplugging a box doesn’t necessarily prevent its messages from arriving at storage and causing mutations, even after a replacement box is plugged in! In practice, though, the physical topology of a box in a closet almost certainly follows a stronger network model, e.g. with bounded delays or no reordering. This is fortunate, because otherwise many systems would not work at all.

Abstract compute models like AWS Lambda, Google Cloud Functions, or Cloudflare Workers allow applications to ignore the physical limitations of a single machine and assume an “infinitely large” pool of CPU-seconds per second. To achieve the illusion, the workers are necessarily across a network, separated from each other and from storage.

In this model, it’s infeasible for all workers to share locally attached storage, because the workers are not necessarily locally attached to one another! Therefore, all storage must be disaggregated. Distributed workers also communicate over a more complex network topology, and the delays, drops, reorders, and repeats of the unreliable network model are likely to occur in practice. A system using a compute abstraction like workers is therefore necessarily a distributed system, and has to carefully model how workers interact with shared state.

The scary part is that it’s easy to stumble into problems accidentally. Changing the compute model for a system with disaggregated storage might not feel like it changes the system’s consistency properties, because it doesn’t! The disaggregated storage meant that it was already a distributed system. However, the worker model took the risks of the abstract unreliable network and made them likely in practice. A system change which makes an existing risk much more common is harder to notice than one which adds a new risk that was not possible before.

Handling disaggregated state

A few strategies can make it easier to handle the distributed systems challenges posed by compute/storage disaggregation. They apply equally well to physical machines, VMs, or worker abstractions.

Use CP systems over AP systems

Eventual consistency is extremely hard to reason about, and even AP systems are unavailable sometimes. It’s hard enough to write distributed systems well, so unless the use case is extremely narrow or the benefits are extremely obvious, prefer CP2 systems over AP systems for backing storage.

Use a transactional database

Transactional databases have defined rules for concurrent operations. It’s reasonable to run a private instance in a deployment with well-defined consistency properties.3 Cloud-native services (e.g. RDS, Aurora, DynamoDB on AWS, or Cloud SQL, AlloyDB,4 Spanner on GCP) will specify their consistency properties. Dig a little deeper than the marketing materials to expose the tradeoffs they make between availability, performance, and wire format compatibility with open source databases.
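
As a minimal sketch of what those defined rules buy you, the snippet below uses Python’s built-in sqlite3 module standing in for whatever transactional database you run; the table, the columns, and the optimistic version check are illustrative assumptions, not a prescription.

    import sqlite3

    # sqlite3 stands in for whatever transactional database you actually run; the schema is illustrative.
    conn = sqlite3.connect("state.db", isolation_level=None)  # autocommit mode; we manage transactions explicitly
    conn.execute("CREATE TABLE IF NOT EXISTS app_state (name TEXT PRIMARY KEY, value TEXT, version INTEGER)")
    conn.execute("INSERT OR IGNORE INTO app_state VALUES ('config', '{}', 0)")

    def update_config(conn, expected_version, new_value):
        """Apply the update only if nobody else changed the row since we read it."""
        conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
        try:
            cur = conn.execute(
                "UPDATE app_state SET value = ?, version = version + 1 "
                "WHERE name = 'config' AND version = ?",
                (new_value, expected_version),
            )
            conn.execute("COMMIT")
            # rowcount == 0 means another worker won the race; re-read and retry.
            return cur.rowcount == 1
        except Exception:
            conn.execute("ROLLBACK")
            raise

A worker reads (value, version), computes the new value, and calls update_config; a False return means it lost the race and should start over from a fresh read.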

Use compare-and-swap

It’s possible to get strong consistency for data that is too big to put into a transactional database by atomically swapping contents. E.g. Google Cloud Storage supports compare-and-swap natively, using ifGenerationMatch on insert.
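
Here is a minimal sketch of that pattern using the google-cloud-storage Python client; the bucket and object names are placeholders, and the retry policy is left to the caller.

    from google.cloud import storage
    from google.api_core import exceptions

    client = storage.Client()
    blob = client.bucket("my-bucket").blob("state/manifest.json")  # placeholder names

    def cas_write(blob, new_contents):
        """Overwrite the object only if it hasn't changed since we last read it."""
        blob.reload()  # fetch current metadata, including the generation number
        observed_generation = blob.generation
        try:
            blob.upload_from_string(new_contents, if_generation_match=observed_generation)
            return True
        except exceptions.PreconditionFailed:
            # Someone else wrote a newer generation; the caller should re-read and retry.
            return False

Passing if_generation_match=0 instead asks GCS to create the object only if it does not already exist.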

AWS S3 does not support compare-and-swap. You can get similar behavior by guaranteeing that writers use disjoint keys5 and using a transactional database to identify the consistent version of an application-level object.6
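
Below is a rough sketch of that workaround, assuming boto3 for S3 and using sqlite3 as a stand-in for the transactional database; the bucket name, table, and key layout are invented for illustration.

    import sqlite3
    import uuid

    import boto3

    s3 = boto3.client("s3")
    db = sqlite3.connect("pointers.db", isolation_level=None)
    db.execute("CREATE TABLE IF NOT EXISTS current_blob (name TEXT PRIMARY KEY, s3_key TEXT)")

    def publish(name, payload):
        # 1. Write the blob under a key no other writer will ever use.
        key = f"blobs/{name}/{uuid.uuid4()}"  # disjoint keys: every writer picks a fresh UUID
        s3.put_object(Bucket="my-bucket", Key=key, Body=payload)  # placeholder bucket name

        # 2. Make it the consistent "current" version by committing a pointer transactionally.
        db.execute("BEGIN IMMEDIATE")
        db.execute(
            "INSERT INTO current_blob (name, s3_key) VALUES (?, ?) "
            "ON CONFLICT(name) DO UPDATE SET s3_key = excluded.s3_key",
            (name, key),
        )
        db.execute("COMMIT")
        return key

Readers look up the pointer first and then fetch the blob; blobs whose pointer commit never happens are the ones that never become visible and eventually need garbage collection.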

Use a system with strict ownership semantics

Some systems can serve all clients from a single primary worker, but have standby workers running in case the primary has issues. Such systems can get by if their storage doesn’t support concurrent access, but does support a strict ownership transfer.

For example, a GCP Persistent Disk can be attached to a single VM at a time.7 That single attached VM can successfully mutate the PD, and any delayed messages from earlier VMs are guaranteed to be rejected.8
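
One way to picture that guarantee is an attachment epoch (a fencing token): the store remembers which attachment currently owns it and rejects writes tagged with an older epoch. The toy model below illustrates the behavior only; it says nothing about how Persistent Disk is actually implemented.

    class FencedStore:
        """Toy in-memory model of storage with strict ownership transfer."""

        def __init__(self):
            self.current_epoch = 0
            self.data = {}

        def attach(self):
            # Each new attachment gets a strictly larger epoch; older owners are fenced off.
            self.current_epoch += 1
            return self.current_epoch

        def write(self, epoch, key, value):
            if epoch != self.current_epoch:
                raise PermissionError(f"stale epoch {epoch}; current owner is epoch {self.current_epoch}")
            self.data[key] = value

    store = FencedStore()
    old_owner = store.attach()      # the original box in the closet
    new_owner = store.attach()      # its replacement takes over ownership
    store.write(new_owner, "k", 1)  # succeeds
    try:
        store.write(old_owner, "k", 2)   # a delayed message from the old owner...
    except PermissionError as rejected:
        print(rejected)                  # ...is rejected instead of silently corrupting state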

Use time-based leases

Systems can use a time-based lease to implement mutual exclusion. At most one worker may hold the lease. If the lease holder can’t guarantee that it will finish its work or extend the lease before expiration, it must terminate.

Leases don’t provide strict guarantees about mutual exclusion under arbitrary clock drift and the unreliable network model, but they do under a stronger model where clock drift, message delays, and message repeats are bounded.9
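
As a sketch of the lease holder’s obligation (terminate rather than overrun), here is a toy work loop; the lease length, the safety margin, and the chunked work are all illustrative, and a real lease would be granted and renewed through a CP store rather than a local function.

    import sys
    import time

    LEASE_SECONDS = 10.0   # illustrative lease length
    SAFETY_MARGIN = 2.0    # stop well before expiration to absorb bounded drift and delays

    def acquire_or_renew_lease():
        """Placeholder: a real implementation would compare-and-swap a lease record in a CP store."""
        return time.monotonic() + LEASE_SECONDS

    def do_one_chunk_of_work(chunk):
        """Stand-in for the real mutation of shared state."""
        pass

    lease_expires = acquire_or_renew_lease()
    for chunk in range(1000):
        # Before each unit of work, make sure it can finish before the lease runs out.
        if time.monotonic() + SAFETY_MARGIN >= lease_expires:
            lease_expires = acquire_or_renew_lease()
            if time.monotonic() + SAFETY_MARGIN >= lease_expires:
                # Can't guarantee completion (or renewal) before expiration: terminate.
                sys.exit("lost the lease; exiting before doing unsafe work")
        do_one_chunk_of_work(chunk)

The interesting cases are the ones this toy elides: the renewal failing, or the safety margin being eaten by the drift and delays discussed below.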

Financial regulatory agencies enforce a small clock drift assumption, in the range of 200 microseconds.10 Regulation obviously can’t dictate physical reality, but typical lease timelines are measured in seconds, 4-5 orders of magnitude beyond the target clock drift. That’s a pretty long time to discover and remove bad potential lease holders. In practice, time-based leases are a reasonable approach for on-premises systems.

I don’t expect the cloud providers to explicitly offer a clock drift guarantee. However, in practice I would expect them to provide clocks with no more than, say, 10 milliseconds of drift ~all the time.

All told, I’d be quite comfortable using time-based leases for any system where a single worker can handle the full offered load, and it’s acceptable to clean up a corruption once every 10 years or so.11

Warning: consistency does not compose

Except for time-based leases, all the discussion above assumes a single consistent disaggregated store for all the system’s state. Be careful: consistency does not compose! If system state is spread across two consistent stores, writing to those stores independently can still result in inconsistent system state. They need to be joined together into one consistent protocol to get consistent behavior.12
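
A tiny sketch of the failure mode, using two separate sqlite3 databases as stand-ins for two independently consistent stores; the balance/ledger schema and the simulated crash are invented for illustration.

    import sqlite3

    # Two independently consistent stores: each commit is atomic within its own store only.
    balances = sqlite3.connect("balances.db", isolation_level=None)
    ledger = sqlite3.connect("ledger.db", isolation_level=None)
    balances.execute("CREATE TABLE IF NOT EXISTS balance (account TEXT PRIMARY KEY, cents INTEGER)")
    ledger.execute("CREATE TABLE IF NOT EXISTS entries (account TEXT, delta_cents INTEGER)")
    balances.execute("INSERT OR IGNORE INTO balance VALUES ('alice', 0)")

    def deposit(account, cents, crash_between_commits=False):
        balances.execute("BEGIN IMMEDIATE")
        balances.execute("UPDATE balance SET cents = cents + ? WHERE account = ?", (cents, account))
        balances.execute("COMMIT")  # store #1 is updated, perfectly consistently -- on its own

        if crash_between_commits:
            raise RuntimeError("worker died here")  # simulated crash between the two commits

        ledger.execute("BEGIN IMMEDIATE")
        ledger.execute("INSERT INTO entries VALUES (?, ?)", (account, cents))
        ledger.execute("COMMIT")

    try:
        deposit("alice", 100, crash_between_commits=True)
    except RuntimeError:
        pass
    # The system is now inconsistent: the balance moved, but no ledger entry explains why.
    # Each store upheld its own guarantees; only a protocol spanning both (e.g. two-phase commit) helps.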

Thoughts

As people move to more abstract compute, they must use disaggregated storage, and that makes their systems distributed systems. The worker model in particular guarantees that the more interesting distributed consensus problems will show up, since the workers are distributed as well. The model you use for state storage and mutation is much more important when using workers.

--Chris


  1. To an extent! The closer you look at modern computer architecture, the more like a distributed system it looks. Moreover, people often loosen inter-thread messaging guarantees to gain performance, which can make the mental model much more difficult. I specifically mean that it’s easier to model systems assuming (1) perfect communication between workers and (2) the existence of a “hard crash” which is guaranteed to stop all old workers and coalesce shared data into a fixed state before starting any new workers. ↩︎

  2. This is Eric Brewer’s CAP theorem formulation. For Daniel Abadi’s PACELC formulation, prefer PC/EC systems. ↩︎

  3. Lamport demonstrates that the strong consistency properties of a physical primary/secondary replication scheme with human-triggered failover are equivalent to those of a Paxos replicated state machine. ↩︎

  4. I designed and implemented AlloyDB’s distributed consensus systems. ↩︎

  5. S3 supports object versioning, which you can think of as a way to guarantee disjoint keys for concurrent puts. ↩︎

  6. Don’t sleep on the difficulty of garbage collecting blobs that never become visible! On the one hand, this is “just” additional dollar cost. On the other hand, that cost can add up, and it will tend to accrue more quickly when the application is misbehaving. ↩︎

  7. Shared PD makes things more complicated, but the model is analogous at the block-range level. ↩︎

  8. I suspect, but don’t know for sure, that AWS EBS volumes offer the same guarantees. Send feedback if you know of any distinctions! ↩︎

  9. The lease-granting system itself is typically safe under the unreliable network model, but any other systems that the lease holder sends messages to might not! ↩︎

  10. If all clocks have to be within 100 microseconds of UTC, then the worst range between two clocks is 200 microseconds: one that’s the maximum ahead and another that’s the maximum behind. ↩︎

  11. This is ~every system. Most data corruption is straightforwardly caused by buggy code, not subtle behaviors allowed by a distributed consensus protocol. People write bugs much more often than once every 10 years. ↩︎

  12. For example, a Spanner database is split up into multiple Paxos groups. Even though each group is a consistent replicated state machine, a transaction which uses two groups has to use another consensus protocol (in Spanner’s case, two-phase commit) to consistently access both state machines. ↩︎

