Understanding CAP and PACELC — Why Distributed Systems Can't Have It All
I have been working as a backend developer for a while now, mostly building Java microservices using Spring Boot.
In day-to-day work, I often come across terms like consistency, availability, and replication while designing APIs or integrating with databases.
Even though I’ve used distributed systems like Kafka, Redis, and MongoDB in projects, I realized I never really understood why they behave the way they do under network failures.
So, I decided to spend some time reading about these two important concepts — CAP and PACELC — and document my understanding here.
My goal is to explain them in simple, practical terms, the way I’d explain them to a fellow Spring Boot developer trying to level up in system design.
Why Do We Even Need CAP?
In a typical microservice architecture, we often have multiple services running on different nodes or containers.
Each service might store data in its own database or cache, and communicate with others through REST APIs or message queues.
Now, when systems are distributed across multiple nodes or regions, network issues are inevitable — some requests fail, some nodes get isolated, and sometimes two parts of the system don’t agree on the data.
The CAP theorem helps us understand what trade-offs systems make when such failures happen.
CAP Theorem — Pick Two, Not Three
CAP stands for Consistency, Availability, and Partition Tolerance.
It describes three desirable properties of a distributed system, but no system can fully guarantee all three at once. And since network partitions can't be prevented in practice, the real question is which of the other two you give up when one happens.
| Property | Description |
|---|---|
| Consistency (C) | Every read returns the latest write. All nodes see the same data. |
| Availability (A) | Every request gets a valid (non-error) response, even if some nodes are down. |
| Partition Tolerance (P) | The system continues working even if parts of it can’t talk to each other. |
When a network partition happens (for example, some services lose network connectivity), the system has to choose between:
- Consistency (wait until all nodes agree on the data), or
- Availability (serve data from available nodes, even if it’s stale).
That’s the essence of CAP — you can’t have it all during a network failure.
💬 Example: Chat Application Analogy
Imagine a chat app built with Spring Boot microservices deployed in two regions.
If the network between them fails:
- If you choose Consistency, you’ll delay message delivery until both regions sync.
- If you choose Availability, you’ll allow users to send messages instantly, even if the other region hasn’t caught up yet.
Both approaches are correct — it depends on what’s more important for your system.
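To make this concrete, here is a minimal Java sketch of the two send paths. The `LocalStore` and `RemoteRegionClient` interfaces are hypothetical, invented only for this illustration; they stand in for whatever persistence and cross-region replication you actually use.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical collaborators, defined only for this sketch.
interface LocalStore { void save(String message); }
interface RemoteRegionClient { void replicate(String message) throws RemoteRegionUnavailableException; }
class RemoteRegionUnavailableException extends Exception { }

public class MessageService {

    private final LocalStore localStore;
    private final RemoteRegionClient remoteRegion;
    // Messages waiting to be replicated once the partition heals (AP path only).
    private final Queue<String> replicationBacklog = new ConcurrentLinkedQueue<>();

    public MessageService(LocalStore localStore, RemoteRegionClient remoteRegion) {
        this.localStore = localStore;
        this.remoteRegion = remoteRegion;
    }

    // Consistency first: refuse to confirm the message unless both regions have it.
    public void sendConsistent(String message) throws RemoteRegionUnavailableException {
        remoteRegion.replicate(message); // throws during a partition, so the user sees an error
        localStore.save(message);
    }

    // Availability first: confirm immediately, replicate when the network allows.
    public void sendAvailable(String message) {
        localStore.save(message); // the sender sees the message right away
        try {
            remoteRegion.replicate(message);
        } catch (RemoteRegionUnavailableException e) {
            replicationBacklog.add(message); // the other region catches up later (eventual consistency)
        }
    }
}
```

The first path gives up availability (sends fail while the regions can't talk); the second gives up consistency (the other region is temporarily behind). Neither is wrong; they are just different answers to the same CAP question.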
Real-World Systems and Their Choices
| System | CAP classification | Why |
|---|---|---|
| Google Spanner | CP | Prioritizes consistency, accepting somewhat higher latency. |
| Cassandra | AP | Always responds, even if some replicas return stale data. |
| MongoDB (default) | AP (debated) | Commonly listed as AP, though its single-primary replica sets lean toward consistency; read and write concerns let you tune it either way. |
In practice, many microservice systems lean toward AP — because availability is often more important for user experience, while internal systems (like financial or inventory services) prefer CP for correctness.
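Those tuning knobs are not abstract. If you talk to MongoDB through the official Java sync driver, for example, read and write concerns are exactly where this trade-off surfaces. A minimal sketch, assuming a local MongoDB instance and made-up database and collection names:

```java
import com.mongodb.ReadConcern;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class ConsistencyTuningExample {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("shop").getCollection("orders");

            // Latency/availability-leaning: acknowledge on the primary only,
            // and read whatever the contacted node has, even if not majority-committed.
            MongoCollection<Document> fastOrders = orders
                    .withWriteConcern(WriteConcern.W1)
                    .withReadConcern(ReadConcern.LOCAL);

            // Consistency-leaning: wait for a majority of replicas on write,
            // and only read data that a majority has acknowledged.
            MongoCollection<Document> safeOrders = orders
                    .withWriteConcern(WriteConcern.MAJORITY)
                    .withReadConcern(ReadConcern.MAJORITY);

            fastOrders.insertOne(new Document("item", "book"));
            safeOrders.insertOne(new Document("item", "laptop"));
        }
    }
}
```

The same service can make different choices for different collections, or even individual operations, which is what "can be tuned" means in the table above.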
CAP Has a Blind Spot
CAP only describes what happens when the network fails.
But what about when everything is running smoothly?
Even in the absence of failures, there’s another trade-off — latency vs consistency.
This is where PACELC comes in.
PACELC — The Missing Piece
Daniel Abadi extended the CAP theorem into what’s called PACELC.
It adds another layer of understanding:
If there is a Partition (P), choose Availability (A) or Consistency (C). Else (E), choose Latency (L) or Consistency (C).
In short:
- When there’s a failure → you face the CAP trade-off.
- When everything is fine → you face a Latency vs Consistency trade-off.
Example Table
| System | When Partition | Else (Normal) | Behavior |
|---|---|---|---|
| DynamoDB | Availability | Low Latency | PA/EL |
| Spanner | Consistency | Consistency | PC/EC |
| Cassandra | Availability | Low Latency | PA/EL |
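Cassandra's PA/EL placement is a default posture, not a fixed rule: with the DataStax Java driver you pick the trade-off per statement through consistency levels. A rough sketch, assuming driver 4.x and a made-up `chat.messages` table:

```java
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ConsistencyLevelExample {

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("chat").build()) {

            // Latency first (the "EL" side): a single replica may answer, possibly with stale data.
            SimpleStatement fastRead = SimpleStatement
                    .newInstance("SELECT body FROM messages WHERE room_id = ?", "general")
                    .setConsistencyLevel(ConsistencyLevel.ONE);

            // Consistency first (the "EC" side): a majority of replicas must agree, which costs latency.
            SimpleStatement strongRead = SimpleStatement
                    .newInstance("SELECT body FROM messages WHERE room_id = ?", "general")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);

            ResultSet fast = session.execute(fastRead);
            ResultSet strong = session.execute(strongRead);
            System.out.println(fast.one() + " / " + strong.one());
        }
    }
}
```

Nothing about the schema or the query matters here; the point is that the EL-versus-EC decision can be made request by request rather than once per database.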
🛰 Example: Google Spanner
Spanner is a globally distributed database by Google that offers strong consistency.
It achieves this using TrueTime, a combination of GPS and atomic clocks that keeps time synchronized across data centers.
By using timestamps with bounded uncertainty, Spanner can ensure that reads and writes are globally ordered.
The trade-off is higher latency, but you get external consistency — which means transactions behave as if there’s just one global clock.
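We never call TrueTime from application code, but the idea behind it is small enough to model. Here is a toy Java sketch of my own (not Google's API, and the 5 ms uncertainty is an arbitrary assumption) showing an uncertainty interval and the commit-wait rule that follows from it:

```java
import java.time.Duration;
import java.time.Instant;

// Toy model: instead of a single "now", the clock returns an interval
// [earliest, latest] that is guaranteed to contain the true current time.
record TimeInterval(Instant earliest, Instant latest) { }

class ToyTrueTime {
    // Assumed clock uncertainty for this sketch; Spanner keeps the real value very small.
    private static final Duration EPSILON = Duration.ofMillis(5);

    static TimeInterval now() {
        Instant now = Instant.now();
        return new TimeInterval(now.minus(EPSILON), now.plus(EPSILON));
    }
}

public class CommitWaitSketch {

    public static void main(String[] args) throws InterruptedException {
        // Pick the commit timestamp at the "latest" edge of the current interval.
        Instant commitTimestamp = ToyTrueTime.now().latest();

        // Commit wait: hold the acknowledgement until even the "earliest" edge has
        // passed the commit timestamp, i.e. until the true time is definitely beyond it.
        // This small pause is the latency price of external consistency.
        while (ToyTrueTime.now().earliest().isBefore(commitTimestamp)) {
            Thread.sleep(1);
        }
        System.out.println("Safe to acknowledge commit at " + commitTimestamp);
    }
}
```

The wait grows with the clock uncertainty, which is why Spanner invests in GPS and atomic clocks to keep that uncertainty tiny.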
Why Should Spring Boot Developers Care?
As backend engineers, we often integrate with distributed databases, caches, and queues.
Understanding CAP and PACELC helps us make better choices about:
- Which database to pick (SQL vs NoSQL, strong vs eventual consistency)
- How to handle retries and fallbacks when services are unavailable
- How to reason about data accuracy vs response time
Even when you use frameworks like Spring Cloud, these concepts still matter — because every timeout, stale cache, or retry policy is a reflection of a trade-off between consistency, availability, and latency.
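To see what that means in code, here is a plain-Java sketch of a read with a latency budget and a stale-cache fallback. Everything in it (`ProductPriceClient`, `loadFromDatabase`, the 200 ms budget) is invented for illustration; the point is that the timeout and the fallback together encode a latency-versus-consistency decision, whether we write it down or not.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

public class ProductPriceClient {

    // Last known prices, possibly stale. Purely illustrative.
    private final Map<String, Double> staleCache = new ConcurrentHashMap<>();

    public double getPrice(String productId) {
        return CompletableFuture
                .supplyAsync(() -> loadFromDatabase(productId))               // authoritative but remote and slow
                .orTimeout(200, TimeUnit.MILLISECONDS)                        // our latency budget
                .exceptionally(ex -> staleCache.getOrDefault(productId, 0.0)) // serve stale data instead of failing
                .join();
    }

    private double loadFromDatabase(String productId) {
        // Placeholder for a real repository or REST call; refreshes the cache on success.
        double price = 42.0;
        staleCache.put(productId, price);
        return price;
    }
}
```

A tight timeout with a stale fallback is an availability-and-latency choice (PA/EL in PACELC terms); raising the timeout and failing the request instead would push the same endpoint toward consistency.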
My Takeaway
Before learning CAP and PACELC, I often took distributed system behavior for granted — “sometimes it’s slow,” “sometimes data lags.”
Now I understand these are not random problems, but expected consequences of design trade-offs.
Every system has to choose what it values more:
- Correctness (Consistency)
- Responsiveness (Availability / Latency)
There’s no “perfect” combination, only a balance that fits your business and technical goals.
Quick Reference
| Acronym | Meaning | Example |
|---|---|---|
| CP | Consistent + Partition-tolerant | Google Spanner |
| AP | Available + Partition-tolerant | Cassandra |
| PACELC | Partition → A/C, Else → L/C | General rule for distributed systems |
That’s all for now. If you’re exploring how distributed systems really work under the hood, I highly recommend revisiting CAP and PACELC.
It completely changes how you look at microservice architecture design.