196 points by tripdout 13 hours ago | 17 comments
refibrillator 11 hours ago
The CAP theorem is great, though it does omit latency, which leads you to its logical extension, PACELC [1]:

If there is a network Partition you must choose Availability or Consistency, Else the tradeoff is Latency vs Consistency.

This offers practical intuition for certain design choices.

For example, Google Spanner [2] is a distributed DB that offers globally consistent reads/writes, but to achieve high performance (low latency) nodes must synchronize with multiple extremely accurate reference clocks (GPS & atomic) and follow a complex two-phase commit protocol that ensures transactional linearizability using the lower bounds and uncertainty of timestamps.
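
To make the latency cost concrete, here is a minimal sketch of the commit-wait idea, assuming a TrueTime-style clock API (the interval function and the uncertainty value are illustrative stand-ins, not Spanner's actual interface):

    import time

    EPSILON = 0.007  # assumed clock uncertainty bound, in seconds

    def truetime_now():
        # Hypothetical TrueTime-style clock: returns an interval
        # [earliest, latest] guaranteed to contain the true time.
        t = time.time()
        return (t - EPSILON, t + EPSILON)

    def commit():
        # Pick a commit timestamp no earlier than any possible "now".
        _, latest = truetime_now()
        commit_ts = latest
        # Commit-wait: block until commit_ts is definitely in the past,
        # so no later transaction can be assigned a smaller timestamp.
        # Tighter clocks mean a smaller EPSILON and lower commit latency.
        while truetime_now()[0] < commit_ts:
            time.sleep(0.001)
        return commit_ts  # only now is it safe to acknowledge the write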

As another example, your multicore CPU leverages a cache coherency protocol that faces remarkably similar tradeoffs. Perhaps others have made this connection before…it does feel like some sort of universal law of physics.

[1] https://en.m.wikipedia.org/wiki/PACELC_theorem

[2] https://static.googleusercontent.com/media/research.google.c...

danielheath 8 hours ago
Is a Partition meaningfully distinguishable from especially high Latency?
Retr0id 1 hour ago
I'm sure I'm far from the first, but I've also made that connection before. I think I phrased it "modern CPUs are just very small distributed systems"
ojkelly 1 hour ago
> Perhaps others have made this connection before…it does feel like some sort of universal law of physics.

This has been on my mind too, and I can’t help but think the fundamental concept underpinning it all is causality.

After diving into Spanner and reading about “consistency as logical monotonicity” (CALM [0]), I’m convinced there’s definitely something to databases beyond CAP.

I’ve yet to find a simple and clear law or theorem that captures what you’re hinting at, but it feels like we must be getting close. This has been bouncing around my head for a few years, since I first wrote a toy CRDT DB.

It seems to show up anywhere we have more than one system with independent memory (a place where state can be held) that needs to maintain a shared representation of, or fact about, something.

Within one memory (akin to a reference frame in quantum physics), we can do anything we want. We can mutate state without any interaction or interference from the outside world. This also sounds like a pure function. By itself, this is academic, theoretical; to the outside world it does not exist. If a tree falls in the woods…

So if we want to interact with any other systems, we then need to coordinate, and the question is how.

The issue and pattern seem to rhyme everywhere: CPUs, HDDs, SSDs, file systems, networks, UIs, people, teams, etc. The best possible collaboration seems to be that which requires the least coordination. Money is a good example here: someone can string together a series of products from companies that know nothing about each other, by coordinating with money as a means of exchange. Not to mention being able to buy complex technology with no idea of the supply chain behind it. I don’t have to coordinate with a mine to use a computer that contains something from said mine.

It sort of looks like distributed databases build a virtual memory on top of many smaller memories, which need to share some parts with each other to protect the system as a whole.

New information may need a few writes before it can be considered a new fact. I think this is an implementation detail, in that it’s irrelevant to the observer (who has no control over it).

This isn’t eventual consistency, which is perhaps the cruder form of this, where the implementation detail above is wide open for all to see. Instead, new information is available immediately; it’s just that your definition of “immediately” is different from the database’s.

It follows, then, that as an observer of the system you cannot violate causality in any way by learning information from the future while you are still in the past.

My understanding from Spanner is that when you ask for a read, you provide or are issued a timestamp which provides the causal information to determine what you are allowed to know.

The system can then maintain both a true and correct representation of what it knows, and an incomplete working memory of what it is trying to know next (the writes which need to be committed into multiple memories).

Memory being anything from RAM, SSD, carrier pigeon, lines in sand, etc.

I think where this breaks most of our brains is that it’s a universe simulation.

And both time and causality are fundamental invariants of the stability of the system, but are leaky abstractions that we need to deal with.

In CALM this is abstracted into what is effectively entropy. If your program never moves backward in time / loses entropy, it’s CALM (I think). In earlier works I think Lamport and vector clocks were used; in Spanner it’s based on very precise measurements of our own time, where the maximum speed of information (i.e. the speed of light) is the greater of the smallest unit of time we can measure (the precision of the GPS clock) and the time it takes for new data to become available in the system.
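
For reference, the Lamport clock mentioned above fits in a few lines; a minimal sketch of the classic rules (increment on local events and sends, and on receive jump past the sender’s stamp):

    class LamportClock:
        # Logical clock: orders events by causality, not wall time.
        def __init__(self):
            self.time = 0

        def local_event(self):
            self.time += 1
            return self.time

        def send(self):
            # Stamp outgoing messages with the incremented time.
            self.time += 1
            return self.time

        def receive(self, msg_time):
            # Jump past the sender's stamp so the receive event is
            # ordered after the send; causality is never violated.
            self.time = max(self.time, msg_time) + 1
            return self.time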

The other part where this differs from the real world is that the speed of information, the latency of a request, is different for reads and writes. Not true in the quantum world, where everything is a write (I think). Then, consider that in our own reference frame we can do a huge amount of work while waiting for a db read/write, something that would violate the speed of light if not in our virtualised world.

We cannot break causality in the world we live and breathe in, but we do it all the time in our simulated ones.

[0] https://arxiv.org/abs/1901.01930

shepherdjerred 9 hours ago
CAP doesn't omit latency. Availability essentially is latency.

PACELC only says that even during normal operation you still need to make a tradeoff between availability and latency, whereas CAP only deals with a tradeoff during a partition event.

tux3 3 hours ago
The A in CAP doesn't mean what people think it means: it has nothing to do with nodes being up, or crashes, or latency, or SLAs, or any abnormal dysfunction.

Availability in CAP is purely a software decision; it's an algorithmic property. Roughly speaking, it is the decision to continue accepting requests during a partition, possibly sacrificing Consistency. If you refuse requests during a partition, you preserve Consistency and lose Availability. If you keep accepting requests during a partition, it's the opposite.

High latency is considered a partition in CAP. A hardware error, network trouble, a bug, a crash, an unreachable machine: none of these is ever an Availability issue.
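
To make that concrete: the choice is literally a branch in the request handler. A toy sketch (the node API here is hypothetical):

    class ServiceUnavailable(Exception):
        pass

    def handle_read(node, key, mode="CP"):
        # "mode" is the design decision CAP forces during a partition.
        if node.partitioned_from_quorum():
            if mode == "CP":
                # Refuse the request: Consistency preserved, Availability lost.
                raise ServiceUnavailable("cannot reach quorum")
            # mode == "AP": answer from local, possibly stale, state.
            return node.local_store.get(key)
        # No partition: an ordinary consistent read.
        return node.quorum_read(key)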

shepherdjerred 32 minutes ago
> If you refuse requests during a partition, you preserve Consistency and lose Availability. If you keep accepting requests during a partition, it's the opposite.

I agree

> High latency is considered a partition in CAP.

Can you support this claim? I don't think this is true unless you're specifically talking about high latency between nodes rather than latency between the sender and the system.

rtpg 9 hours ago
Availability's definition:

> every request received by a non-failing node in the system must result in a response

I don't believe that encodes latency

Instead, here's consistency:

> any read operation that begins after a write operation completes must return that value, or the result of a later write operation

"after a write operation completes" feels like where latency kicks in? Because within that space you can play around with completing a write operation to get what you want.

shepherdjerred 9 hours ago
It's impossible to distinguish between high latency and unavailability. You can model unavailability as infinite latency.

In a real system you can wait for a partition event to resolve itself before replying, which preserves consistency at the cost of latency (availability).

nextaccountic 9 hours ago
> In a real system you can wait for a partition event to resolve itself before replying, which preserves consistency at the cost of latency (availability).

This is true, but PACELC states that even if there is no partition you still have a consistency vs latency tradeoff (because the processes that guarantee consistency eat up latency in the form of network roundtrips).
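
A back-of-the-envelope sketch of that cost (the numbers are made up for illustration):

    LOCAL_READ_MS = 0.5       # assumed: answered by the local replica alone
    INTER_NODE_RTT_MS = 30.0  # assumed: round trip to remote replicas

    def read_latency_ms(consistent):
        if not consistent:
            return LOCAL_READ_MS  # fast, but possibly stale
        # A quorum read contacts replicas in parallel and waits for a
        # majority, so it pays at least one inter-node round trip.
        return LOCAL_READ_MS + INTER_NODE_RTT_MS

    print(read_latency_ms(False))  # 0.5  -> latency-favoring (EL)
    print(read_latency_ms(True))   # 30.5 -> consistency-favoring (EC)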

shepherdjerred 9 hours ago
100%, I don't think what we're saying conflicts. There is _always_ a tradeoff between consistency and availability whether you're thinking of PAC or PACELC. PACELC just makes it explicit that this tradeoff exists whether you're partitioned or not.
mrfox321 2 hours ago
simultaneity is relative.

That's all.

stevebmark 9 hours ago
I prefer the walkthrough of the CAP theorem in “Designing Data Intensive Applications,” which says that the CAP “theorem” is an undefined and generally unhelpful concept in distributed systems. “Available” has no formal definition.

And it’s confusing that it’s three letters, because it’s not “choose two”. It’s not that a system is “partition tolerant”; it’s that if there’s a network partition, you choose availability or consistency. And obviously you choose availability for the distributed systems you most commonly encounter.

dijit 1 hour ago
> And obviously you choose availability for the distributed systems you most commonly encounter.

Failing safe (rather than failing open) is generally regarded as the most acceptable thing we can do as CRUD programmers; that’s true. However, it’s categorically untrue that this is a foregone conclusion.

There are many cases (especially in financial services) where failing safe is a much preferable option, and having retry logic in the application is much preferred.

yas_hmaheshwari 5 hours ago
I also like how he says that the theorem was a good theorem for its time but is now not that relevant.

Maybe PACELC is a better theorem, maybe not - but I guess a three-letter acronym will always rule over a six-letter one.

sethammons 1 hour ago
Esp since CAP is easily pronounced. "PACELC"? Is it "Pace-elk"? "Pak-Elk"? "Pacel-C"?
tajd 1 hour ago
For the interested reader, there's a paper talking about this by Martin Kleppmann (the author of DDIA): https://arxiv.org/abs/1509.05393
jeremyjh 1 hour ago
The formal definition for Available is given in TFA.
puzzledobserver 6 hours ago
One thing that I've always wondered about: Is the CAP theorem really making a non-trivial claim?

If a distributed system is consistent and available, it must return the last written value, and successfully do this all the time. How can it do this if the system is partitioned, and the node receiving the last write was unable to propagate this to other nodes?

The proof described in this website appears to confirm this. Am I missing something?

wesselbindt 5 hours ago
You're not missing something. In the original paper the proof is half a page long (and still worth a read, very low investment, decent payoff). I think the value of the CAP theorem is not so much in its statement (CP or AP, choose one), but rather in its proof. It helps you cook up similar arguments/reasonings for more interesting situations and constraints.
dijit 1 hour ago
I think it’s also very interesting to point out that very often people try to choose all three but end up choosing only one inadvertently.

We should also point out that it’s not a foregone conclusion that you will even be able to achieve two of three.

CAP theorem really is a best case scenario.

vzaliva 10 hours ago
The text left me wanting a more formal treatment of the theorem. I went ahead and found a paper which does just that:

https://dl.acm.org/doi/10.1145/564585.564601 PDF: https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.p...

fithisux 7 hours ago
I was always wondering the same. Thank you for the references. Above I asked for a mechanized proof.
magnio 10 hours ago
The "proof" is kind of weird. We assume there exists a system that has all three of CAP, but how can we assume that system has the layout in the post with two servers and one client?
sushibowl 9 hours ago
The proof is not really formal, but you could view the shown system as a minimal subset of an arbitrarily shaped large system.

For the system to be distributed it must have at least two nodes, and to be available all nodes must respond to requests. So however the rest of the system is shaped, the proof still holds.

dmurray 6 hours ago
The minimal subset is too small. It should have a second client, to avoid solutions like "the client stores a copy of all the messages it sent, so it can detect the stale value".
jeremyjh 58 minutes ago
The theorem has nothing to do with what the client knows. It’s only concerned with the servers’ responses.
evertedsphere 6 hours ago
what if the graph remains connected when you remove that link, so the two nodes can communicate through other nodes?
dmurray 6 hours ago
Partitioning is defined so that all, or at least arbitrarily many, of the communication links may be down.

In practice you're absolutely right and one approach to distributed systems is to make partition tolerance less important by having lots of redundant links instead.

jascha_eng 11 hours ago
This is trivial, but the problem is that consistency especially is not a binary choice. Heck, even non-distributed systems can give you varying consistency guarantees; that's the whole point of the isolation levels of most RDBMSes.

It's good to visualize that there is a trade-off to be made. However, that trade-off does not have to be binary. You can get a bit more consistency for a little less partition tolerance or availability. All of Designing Data Intensive Applications is about those trade-offs.

dilyevsky 9 hours ago
Consistency in CAP and Consistency in ACID have entirely separate meanings.
gpderetta 5 hours ago
The C in ACID has a different definition of Consistency, yet the combination of guarantees given by ACID should also imply Consistency in the distributed system sense, right? I.e. a distributed database that claims to be ACID cannot sacrifice consistency [edit: at least for some isolation levels].
anonymousDan 4 hours ago
Isolation (the I in ACID) is more closely related to the notion of consistency in the distributed system community.
anonymousDan 4 hours ago
Consistency in the case of CAP refers to linearizability.
whatshisface 12 hours ago
How about this:

1. A read is returned with a map of the network the replying node has become consistent with.

2. A client that is not satisfied with the map can read again from the other partition.

These two steps get around the proof by changing one of its assumptions (that the operation to read a value can only involve a request to one node).

anderskaseorg 12 hours ago
You’re certainly allowed to make the client a more active participant in your consensus protocol, but then it needs to play by the same rules if you want the system to have guarantees. For example, you need to handle network partitions between clients and some servers, and you need to be able to reconcile multiple reads from servers that might have seen different sets of writes. The CAP theorem still applies to the system as a whole.
Dylan16807 10 hours ago
Do you have a plan for doing writes in that scenario? It's not proper availability without writes.

If the partitions are configured for consistency, and they can't hear from each other, then at most one will accept writes. If the client relays metadata between the partitions to make the write happen in both, then you don't actually have a partition anymore.

Also in practical terms, the nodes almost always have better contact with each other than the client has to the nodes. The situation where the client can connect to both sides of a partition is unlikely enough that if you add code for it you're probably getting negative value. It wastes time and can add scary bugs.

fallingsquirrel 12 hours ago
Consider the following sequence* of events:

1. client A reads from partition X

2. client A is unsatisfied with the network map, and requests another read from partition Y

3. (meanwhile) client B writes a value to partition X

4. client A reads from partition Y, sees the value is the same as partition X's (stale) value, and accepts the value as consistent

This is the same kind of behavior you might get if your servers used e.g. a buggy version of Raft or something. You can't get around the proof by just relabeling some of your server nodes as client nodes.

* In the spirit of distributed systems, I use this term loosely :)
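
For anyone who wants to poke at it, the interleaving fits in a few lines (a toy model, with two dicts standing in for the partitions):

    # Toy model: each partition is a dict that cannot see the other.
    X = {"k": "v0"}  # partition X
    Y = {"k": "v0"}  # partition Y

    a_first = X["k"]            # 1. client A reads "v0" from X
                                # 2. A is unsatisfied, plans a read from Y
    X["k"] = "v1"               # 3. client B writes "v1" to X; never reaches Y
    a_second = Y["k"]           # 4. A reads "v0" from Y
    assert a_first == a_second  # the values match, so A accepts "v0"...
    # ...but the latest committed value is X["k"] == "v1": a stale read.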

Dylan16807 12 hours ago
The value is now stale, but it was correct at some point between the read starting and the read finishing, right?

That can happen with any read. Even without any partitions. Can't it?

In that case I don't see the problem.

lmm 11 hours ago
> The value is now stale, but it was correct at some point between the read starting and the read finishing, right?

Not necessarily. Maybe both versions of it were from partial writes that were never committed, so your invariants are violated (if we're talking about e.g. a credit account A and debit account B scenario).

> That can happen with any read. Even without any partitions. Can't it?

Depends on your isolation level. If your system has serializable transactions then it's supposed to give you a history equivalent to one where all transactions were executed serially, for example.

Dylan16807 10 hours ago
> Not necessarily. Maybe both versions of it were from partial writes that were never committed, so your invariants are violated (if we're talking about e.g. a credit account A and debit account B scenario).

I'm pretty sure the scenario above is looking at committed writes.

If you're reading uncommitted writes, you're not really in the market for consistency to begin with. (Or you could be handling consistency by waiting to see if the transaction succeeds or fails, making sure it would fail if the data you read got backed out. But in that situation nothing goes wrong here.)

> Depends on your isolation level. If your system has serializable transactions then it's supposed to give you a history equivalent to one where all transactions were executed serially, for example.

Even then, a new write can happen before your "read finished" packet arrives at the client, making the read stale. Your entire transaction is now doomed to fail, but you won't know until you try to start committing it.

For pure read operations, I'm not convinced it's a proper stale read unless the value was stale before the read operation started.

lmm 9 hours ago
> I'm pretty sure the scenario above is looking at committed writes.

> If you're reading uncommitted writes, you're not really in the market for consistency to begin with.

But what does "committed" mean when you're only reading from one partition in a partitioned scenario? You literally can't tell whether what you're reading is committed or not (or rather, you have to build your own protocol for when a write is considered committed).

> For pure read operations, I'm not convinced it's a proper stale read unless the value was stale before the read operation started.

I think you can get a read that is half from a prior stale operation and half from a subsequent uncommitted operation, or something on those lines.

Dylan16807 8 hours ago
> But what does "committed" mean when you're only reading from one partition in a partitioned scenario?

I would say you can't make new commits in that situation? I don't know; I didn't make up the scenario. I think you need to figure out your own answer and/or get clarification from fallingsquirrel if you want to talk about that kind of problem.

> I think you can get a read that is half from a prior stale operation and half from a subsequent uncommitted operation, or something on those lines.

What's the full timeline for that?

If you're specifically talking about the ABA problem, that's trivial to fix with a generation counter.
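
For concreteness, a minimal sketch of that fix: keep a counter next to the value, so A->B->A is distinguishable from "unchanged".

    class VersionedCell:
        # Value plus a generation counter.
        def __init__(self, value):
            self.value = value
            self.gen = 0

        def read(self):
            return self.value, self.gen

        def compare_and_swap(self, expected_gen, new_value):
            # Succeeds only if nothing was written since the generation
            # we read, even if the value itself has cycled back.
            if self.gen != expected_gen:
                return False
            self.value = new_value
            self.gen += 1
            return True

    cell = VersionedCell("A")
    _, g = cell.read()
    print(cell.compare_and_swap(g, "B"))  # True
    print(cell.compare_and_swap(g, "C"))  # False: the generation moved on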

lmm 10 minutes ago
> I think you need to figure out your own answer and/or get clarification from fallingsquirrel if you want to talk about that kind of problem.

Right. The point is that fallingsquirrel hasn't solved the hard part of the problem.

> If you're specifically talking about the ABA problem, that's trivial to fix with a generation counter.

Maybe, but you need to specify that that's what you're doing, and it may come with undesirable consequences.

satisfice 12 hours ago
How can it know what it is consistent with? Anything could have happened since the last update. All it knows is that it most recently got an update at a certain time. It may or may not be consistent.
dmitry-vsl 12 hours ago
So, it's impossible to transfer a value from one machine to another if there's no network connection between them? How did this extremely trivial observation become a well-known theorem?
crdrost 8 hours ago
So, when it was originally phrased, the primary thing that you would have learned about databases was that they enable ACID transactions. (And if you were a practitioner you would have learned about the various isolation levels and dirty reads and dirty writes.)

But if you wanted to go from this level to implementation, typically you could get a prototype working, but it would be slow. When things are slow, the WD-40 of the programming world is to add a cache. And this is where you get the quip that there are only two hard problems in computing, cache invalidation and naming things. (The “and off-by-one errors” is a later addition.) The problem is that cache invalidation shows up as consistency bugs in your database. Someone does a rare combination of commits and rollbacks and some cache doesn't get wiped quite as fast as it needs to, or is wiped overoptimistically causing a pull from uncommitted data, and your isolation level has dropped to READ UNCOMMITTED.

The CAP theorem was originally raised as a conjecture, something like, “once you shard the database, I don't think there's any way to solve these cache problems without one of the replicas just going silent for arbitrarily long pauses while it tries to at least partially synchronize its cache with the other shards.” Phrased that way, you can understand why it was a conjecture, it relies on nobody at MIT having a super clever way to deal with caches and cleverly routing the synchronization around the sharding.

BUT, you can make that statement for many different reasons, and this was not for Pedagogical Reasons, its point was rather Evangelism! The author was attempting to introduce the idea of Eventual Consistency, and gain adoption by ditching all of the wisdom about ACID transactions. This antagonism was deliberate, eventual consistency became the E in a rival acronym BASE. And so the argument was that we could explore a new corner of design-space.

It was later that someone decided they could prove it by coming up with a universal subgraph, “whatever connection you've got, it has to contain this: two nodes, fighting over one value, with a network connection possibly passing through other nodes but we can abstract all that away.” And then you have a proof, and then you have a bunch of people comparing the proof to the stated claims of various database vendors, and finding that over and over they claim to be both shardable with high availability among the shards, and to support ACID transactions that keep everything consistent. It turns out those statements are usually made assuming a happy path!

(You also get Paxos and Raft, “here is how to get consistency without arbitrary latency on two-phase commit via majority vote”, and the Jepsen blog “you said this had consistency level X, let’s fuzz it and see if we can generate a counterexample”, and some interesting exceptions like Datomic saying “this one part is not scalable and it's a single point of failure to sacrifice P for the CAP theorem’s sake, but in exchange we can simplify our C and A guarantees so that you can scale the reads of the system consistently.”)

lmm 11 hours ago
Because, sadly, a lot of system designers want to believe they have a way around it.
dang 10 hours ago
Related:

An Illustrated Proof of the CAP Theorem - https://news.ycombinator.com/item?id=17528817 - July 2018 (71 comments)

ljsprague 12 hours ago
Is availability sacrificed more often than the other two?
shepherdjerred 9 hours ago
It depends purely on the application you're writing. Many applications prefer availability over consistency, but the application is built to cope with inconsistent data.
anonymousDan 4 hours ago
The only way to sacrifice P is to have a single-node (i.e. non-replicated) system. So in practice it's a choice between C and A.
wmf 12 hours ago
Properties are only sacrificed during partitions. Due to Murphy's Law, if you assume partitions never happen, they will happen more often; but if you can tolerate partitions, they happen really rarely.
anderskaseorg 12 hours ago
There are plenty of systems that sacrifice consistency even while the network is fully connected, in the name of performance—for example, DNS, or any system with a caching proxy server.
wmf 11 hours ago
Yeah, CAP is about the best possible behavior a system can have but you can always do worse.
Groxx 11 hours ago
Not really. Both are all over because they address vastly different (literally incompatible) needs.

There is a fair bit of industry hype in the past decade or so around eventually consistent systems, because a lot of components of a business can use them, and that opens up a lot of extremely desirable performance options that can be fine tuned to the moon. But they're a real nightmare to use (i.e. often literally impossible) for anything that needs to be correct, so they are often just used in addition to the more classical consistency-oriented databases like postgres.

mrybczyn 12 hours ago
Thank you! That makes a lot more sense than several other descriptions of the proof I have read (and promptly forgotten!)
lysecret 9 hours ago
I always think of the capital asset pricing theorem first.
fithisux 7 hours ago
Is there a formalization of this proof? Lean, Coq, .... ?
peter_d_sherman 10 hours ago
>"G2 returns v0 to our client after the client had already written v1 to G1. This is inconsistent."

This is absolutely true IF AND ONLY IF client and servers (and data) have no notion of time...

If client and servers are, say, synchronized to a central clock, include a timestamp with any data updated, include timestamps with any response/"done" messages, and check those timestamps for accuracy (the client should perform timestamp checking as well as the servers), then the client could check the different servers it later connects with, to see if they had been updated appropriately or not.

If later-connected-to-servers that should have been updated were not appropriately updated, then the client (if client and server are exchanging data and timestamp information) could decide how to proceed at that point.

If a later-connected-to-server was determined to be inconsistent, then depending upon how the client code was written, it could do many things to mitigate the problem. It could notify other servers that it knows about of data inconsistency on the inconsistent server(s), for example...

Hmm, now that I think about it, on any distributed database, for any given row of data, there should not be one, but TWO timestamp fields, one for the time that the data was updated on the first server it was updated on, i.e., the one the client connected to.

The second timestamp field would be for the time that a secondary given distributed server received the data and processed it...

Those two fields could be separated by mere nanoseconds of time (if distributed servers are tightly coupled), but it could be up to weeks if a secondary server was knocked offline for a long enough time period...

But see, if the client software is engineered properly, and contains the client's history, then the client (or main server, or server-multiplexing proxy) can check the veracity of any arbitrary distributed server it connects to by asking it to "read me back this data", getting its timestamps, and comparing them with the local copies...

All of that, coupled with databases that do not delete/overwrite old records when data is updated, but rather keep a log of all updates with a timestamp (aka "append-only" aka "immutable" databases, such as InfluxDB, Apache Cassandra, CouchDB, SQL Server Temporal Tables, and I think that RethinkDB may have done something like that in the past), should equate to, at the very least, the client being able to know if a given server in a distributed system was updated properly (is consistent), or not...
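
A minimal sketch of the two-timestamp record and the "read me back this data" check described above (the replica API and field names are hypothetical):

    from dataclasses import dataclass

    @dataclass
    class Row:
        value: str
        origin_ts: float   # when the first server accepted the write
        applied_ts: float  # when this replica applied it (>= origin_ts)

    def client_verify(replica, key, local_history):
        # Read the row back and compare it against the client's own copy.
        row = replica.read(key)        # hypothetical replica API
        expected = local_history[key]  # the client's last-known write
        if row is None or row.origin_ts < expected.origin_ts:
            # The replica is behind: inconsistent. The client decides what
            # to do next (flag it, read elsewhere, or re-send the write if
            # it has the data and the authority to do so).
            return False
        return True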

If it were engineered properly, the client itself could determine what to do under such anomalous conditions...

It could even take it upon itself to update the server properly IF it contained a copy of the correct most up-to-date data AND had the authority to do so... which wouldn't be the case for some types of applications (i.e., banking, due to necessary security constraints), but might be the case for other types of applications (i.e., a distributed social media app, where the client is posting to the user's own account and has total permission to update any of the user's data as needed...)

Maybe a better question to ask when dealing with the CAP theorem is not so much "is it true", so much as "what kind of distributed database requires consistency such that the CAP theorem is required?" (i.e., banking), and "what kind of distributed database doesn't require consistency such that the CAP theorem isn't required?" (i.e., distributed social media over a plethora of distributed servers)...

If we see that CAP is required in only some specific types of database/application contexts, then perhaps those specific databases/applications/distributed systems should be individually separated from non-CAP-requiring ones...

This being said, I think Eric Brewer is a brilliant Computer Scientist, and I have tremendous respect for him and the CAP Theorem...

raincole 11 hours ago
What if the client just reads from every server and compares the timestamps?
googledocsftw 9 hours ago
Reading from every server is not necessarily possible unless we assume no network partitions.
af3d 10 hours ago
Unfortunately, timestamps are not immutable. The system clock can be miscalibrated or even deliberately manipulated. It is, however, something of an open question. CAP is still a bit lacking in terms of rigorous mathematical proof (albeit the arguments are pretty convincing).
coolThingsFirst 10 hours ago
It violates the consistency rule. It must get the value that was written.
DarkmSparks 9 hours ago
The problem with the example is it never actually visualises a partition, because the client can always reach both servers.

An actual partition is when clients and servers are partitioned and can't talk to each other.

Solving for the situation presented is fairly trivial, and is pretty much exactly what blockchain systems solve.