
Why EKS Control Planes Don’t All Crash Together (and Yours Shouldn’t Either)
I don’t work for AWS, but I often wonder how they achieve near-zero downtime for services like Amazon EKS. AWS manages the control plane while customers manage the worker nodes, yet the EKS architecture is deliberately designed for resilience:
- The API server nodes run across at least two Availability Zones in every AWS region.
- The etcd server nodes, which store cluster state, are deployed in an Auto Scaling group spanning three AZs to ensure durability and fault tolerance.
This means a failure in one AZ doesn’t disrupt the entire cluster, which strikes an elegant balance between availability and consistency.
Now, when I think about this, I can’t help but wonder:
👉 Does AWS achieve part of this resilience by leveraging concepts similar to AWS Placement Groups?
It wouldn’t be surprising if, under the hood, EKS makes strategic use of Spread Placement Groups to separate critical control-plane components across distinct hardware, while worker nodes might benefit from Cluster Placement Groups for low-latency communication.
This idea brings us to a broader design principle that we, too, can apply in our own systems.
Understanding AWS Placement Groups
AWS provides three types of placement groups, each designed for different workload needs:
- Spread Placement Group – spreads instances across distinct underlying hardware. Best for critical nodes that must not fail together.
- Cluster Placement Group – places instances physically close together for low-latency, high-throughput networking. Ideal for tightly coupled compute workloads.
- Partition Placement Group – separates instances into logical partitions across racks, useful for large-scale distributed systems like Hadoop or Kafka.
For this discussion, we’ll focus on Spread and Cluster groups.
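If you want to experiment with these strategies yourself, the EC2 API exposes all three through a single call. Here is a minimal boto3 sketch; the group names, partition count, and region are placeholders I chose for illustration, not anything EKS actually uses:

```python
# Minimal sketch: creating one placement group per strategy with boto3.
# Group names and region are illustrative assumptions.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Spread: each instance lands on distinct underlying hardware.
ec2.create_placement_group(GroupName="control-plane-spread", Strategy="spread")

# Cluster: instances are packed close together for low-latency networking.
ec2.create_placement_group(GroupName="workers-cluster", Strategy="cluster")

# Partition: instances are divided into logical partitions on separate racks.
ec2.create_placement_group(
    GroupName="kafka-partition", Strategy="partition", PartitionCount=3
)
```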
Designing with Spread & Cluster Placement Groups
Imagine you’re building a Kubernetes-like system with critical management/control nodes and worker nodes. You want resilience for the control plane, but also fast, efficient communication for the workers.
🔒 Spread Placement Group for Management Nodes
Your management/control nodes (like EKS’s API servers or etcd) are the brains of the system. If they go down together, your entire platform becomes unavailable.
By placing them in a Spread Placement Group, AWS ensures these nodes are deployed on separate hardware racks, separate failure domains, and potentially across AZs.
- This minimizes correlated failure.
- A single hardware or rack issue won’t take down all your management nodes.
- Perfect for components where availability matters more than raw performance (see the provisioning sketch below).
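As a rough illustration of that idea, the sketch below launches one management node per AZ, all into the spread placement group created earlier, so they also land on distinct hardware within each AZ. The AMI, instance type, and subnet IDs are placeholder values, and the group name is the hypothetical one from the previous example:

```python
# Illustrative sketch: one management node per AZ, all in the same spread
# placement group. AMI, instance type, and subnet IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

subnets_by_az = ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"]  # one per AZ

for subnet_id in subnets_by_az:
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=1,
        SubnetId=subnet_id,
        Placement={"GroupName": "control-plane-spread"},  # spread strategy group
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "role", "Value": "control-plane"}],
        }],
    )
```

Because each run_instances call targets a different subnet, the nodes end up in different AZs as well as on different racks, which is exactly the correlated-failure protection we want for the control plane.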
⚡ Cluster Placement Group for Worker Nodes
Worker nodes, on the other hand, often need high-throughput communication with each other. Think of workloads like:
- Machine learning training clusters
- High-performance computing (HPC)
- Real-time analytics engines
Here, the priority isn’t resilience against correlated hardware failure but ultra-low latency networking. A Cluster Placement Group ensures that:
- Worker nodes are placed close together in the same rack.
- They enjoy up to 10 Gbps of single-flow throughput with minimal latency.
- The trade-off: if that rack fails, all your workers fail together.
And that’s okay, because unlike management/control nodes, workers can usually be replaced quickly with autoscaling tools I’m fond of, such as Cast AI and Karpenter.
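One simple way to wire this up, sketched below under assumed names, is to put the replaceable workers behind an Auto Scaling group that launches everything into the cluster placement group. The launch template name and subnet are placeholders; a cluster placement group keeps instances in a single AZ, so only one subnet is passed:

```python
# Illustrative sketch: an Auto Scaling group that launches workers into the
# cluster placement group. Launch template and subnet are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="worker-nodes",
    LaunchTemplate={"LaunchTemplateName": "worker-template", "Version": "$Latest"},
    MinSize=2,
    MaxSize=10,
    DesiredCapacity=4,
    PlacementGroup="workers-cluster",   # the cluster strategy group from earlier
    VPCZoneIdentifier="subnet-aaa111",  # single AZ: required for tight clustering
)
```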
Spread vs Cluster at a Glance
| Feature | Spread Placement Group | Cluster Placement Group |
|---|---|---|
| Goal | Maximize availability & minimize correlated failures | Maximize performance & minimize latency |
| How it works | Instances placed on separate hardware racks | Instances placed close together in the same rack |
| Best for | Control plane, critical management nodes | Worker nodes, HPC, ML training, data crunching |
| Failure impact | A single failure affects only one node | A rack failure may wipe out all nodes together |
| Trade-off | Slightly higher latency, less network performance | Reduced resilience but very high throughput |
Balancing Trade-offs
This design illustrates a key cloud architecture principle:
Not every tier of your system has the same availability vs. performance requirements.
- Control Plane (critical brains) → Use Spread Placement Group for resilience.
- Data Plane (worker muscle) → Use Cluster Placement Group for speed.
By intentionally mixing placement group strategies, you can design a system that is both highly available and highly performant.
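To sanity-check a mixed design like this, you can ask EC2 where your instances actually landed. The short sketch below filters on the hypothetical spread group name from the earlier examples and prints each node’s AZ and placement group:

```python
# Illustrative sketch: verify where the control-plane instances landed by
# filtering on the placement group name used earlier (a placeholder name).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.describe_instances(
    Filters=[{"Name": "placement-group-name", "Values": ["control-plane-spread"]}]
)

for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        placement = instance["Placement"]
        print(
            instance["InstanceId"],
            placement["AvailabilityZone"],
            placement["GroupName"],
        )
```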
Final Thoughts
AWS itself gives us inspiration through services like EKS, which are designed to tolerate failures without impacting customers. While we don’t know the exact implementation details, it’s possible AWS leverages concepts like Spread Placement Groups for resilience and Cluster Placement Groups for throughput under the hood.
By applying the same thinking in your own systems, you can balance durability with speed while designing architectures that gracefully handle failures without sacrificing performance.