System Design for Scale, Failure, and Evolution
“You’ll of course be able to design systems like YouTube by the end of this course — but that’s not the point. Our focus is on the underlying principles: consistency, concurrency, coordination, and the realities of failure. Once you master those, designing anything becomes a matter of structured thought — not memorized patterns.”
— Chiradip Mandal
Overview
- Duration: 4 Days × 2 Hours/day
- Format: 1.5h deep dive + 30m hands-on GitHub workshop
- Target: Staff+, Principal, L8–L10 engineers
- Pedagogy: Theory + application + architecture retrospectives
- Deliverables: GitHub architecture, design docs, book access
- Post-Cohort: Design circles, long-term cohort mentorship
Learning Goals
- Master architectural tradeoffs: consistency, concurrency, consensus
- Build systems under realistic latency, failure, and resource constraints
- Collaboratively evolve an architecture via GitHub PRs
- Join a durable network of peer reviewers, technologists, and system leaders
Day-by-Day Syllabus
Day 1 — Bounded Correctness
Theme: Designing for Correctness under Latency and Failure
- System boundaries, failure domains, topological constraints
- Consistency models: strong (linearizable), causal, eventual
- Tradeoffs: CAP, PACELC, consistency delay windows
- Workshop: Design a log-backed API with defined consistency contract
Readings:
- Lamport — Time, Clocks and the Ordering of Events
- Jepsen — Consistency Models Explained
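As a rough illustration of what a "defined consistency contract" for a log-backed API could look like, the sketch below has writes return a log offset and lets reads state the minimum offset they must observe. The names `Log`, `Replica`, `StaleReplicaError`, and `min_offset` are hypothetical and chosen for this example only; the workshop's actual API may differ.

```python
from dataclasses import dataclass, field


class StaleReplicaError(Exception):
    """The replica has not yet applied the offset the caller requires."""


@dataclass
class Log:
    """Append-only log; an offset is the count of entries written so far."""
    entries: list = field(default_factory=list)

    def append(self, key, value):
        self.entries.append((key, value))
        return len(self.entries)


@dataclass
class Replica:
    """Materializes key/value state by applying log entries in order."""
    log: Log
    applied: int = 0
    state: dict = field(default_factory=dict)

    def catch_up(self):
        # Apply every entry this replica has not seen yet.
        for key, value in self.log.entries[self.applied:]:
            self.state[key] = value
        self.applied = len(self.log.entries)

    def read(self, key, min_offset=0):
        # Contract: never serve a read older than the offset the client has seen.
        if self.applied < min_offset:
            raise StaleReplicaError(f"applied={self.applied} < required={min_offset}")
        return self.state.get(key)
```

A client that passes the offset returned by its own `append` as `min_offset` gets read-your-writes behavior: a lagging replica refuses the read instead of silently returning stale data. Reads with `min_offset=0` accept whatever the replica has, which corresponds to an eventual-consistency contract.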
Day 2 — Scalability Through Structure
Theme: Partitioning, Replication, and Isolation
- Sharding: consistent hashing, prefix-based, zone-aware
- Replication: quorum, async vs sync, follower lag
- Isolation levels: snapshot vs. serializable, and the write-skew anomaly
- Workshop: Implement sharded write path, simulate stale read and recovery
Readings:
- Spanner: Google's Globally-Distributed Database
- Dynamo: Amazon’s Highly Available Key-Value Store
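For the consistent-hashing topic in the sharding list above, a minimal ring with virtual nodes might look like the following. This is a sketch under assumptions: the class name `HashRing`, the `vnodes` parameter, and the choice of SHA-256 are illustrative and not taken from the course material.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Use the first 8 bytes of SHA-256 as a stable 64-bit position on the ring.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring; each node is placed at `vnodes` positions."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, key: str) -> str:
        # The owner is the first ring position clockwise from the key's hash,
        # wrapping around to index 0 past the last position.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property worth checking in a workshop setting is that adding a node only moves keys onto the new node: every key either keeps its previous owner or is reassigned to the newcomer, which is what keeps rebalancing cost bounded compared with `hash(key) % n`.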
Day 3 — Time, Coordination, and Recovery
Theme: Consensus, Clocks, Durable Recovery
- Raft, Paxos, Multi-Raft, Viewstamped Replication
- Clocks: vector, HLC, skew compensation, leases
- WALs, fencing tokens, log truncation, snapshotting
- Workshop: Add Raft-style leadership and recovery to cohort service
Readings:
- Ongaro — In Search of an Understandable Consensus Algorithm
- Viewstamped Replication Revisited
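The fencing-token idea from the topic list above can be sketched in a few lines: a lease service hands out monotonically increasing tokens, and the storage layer rejects any write carrying a token older than the highest it has seen. `LeaseService`, `FencedStore`, and `StaleTokenError` are hypothetical names for this illustration, and the example is single-process, so it shows the check only, not the distributed lease protocol around it.

```python
class StaleTokenError(Exception):
    """A write arrived with a token older than one already accepted."""


class LeaseService:
    """Issues a strictly increasing token each time leadership is granted."""

    def __init__(self):
        self._token = 0

    def acquire(self) -> int:
        self._token += 1
        return self._token


class FencedStore:
    """Accepts writes only from the holder of the newest token seen so far."""

    def __init__(self):
        self._highest_token = 0
        self.data = {}

    def write(self, token: int, key, value):
        if token < self._highest_token:
            # A deposed leader (e.g. one that paused past its lease) is fenced off.
            raise StaleTokenError(f"token {token} < {self._highest_token}")
        self._highest_token = token
        self.data[key] = value
```

If an old leader acquires token 1, stalls, and a new leader acquires token 2 and writes, the old leader's later write with token 1 is rejected rather than overwriting newer state.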
Day 4 — Architectural Evolution
Theme: Designing for Change and Long-Term Resilience
- Observability-first systems: metrics, traces, health budgets
- Backward compatibility: dual writes, API evolution
- Patterns: outbox, changelogs, control planes
- Capstone: Extend your system — add observability, snapshot versioning, or rolling upgrades
- Collaboration: Initiate a cohort-authored short paper on “System Thinking in Distributed Design” — synthesize cohort insights into a public artifact
Readings:
- Eventual Consistency & The Outbox Pattern
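To make the outbox pattern from the list above concrete, here is a small sketch using Python's built-in `sqlite3`. The table names (`orders`, `outbox`), the functions `place_order` and `relay_once`, and the event payload shape are assumptions made for this example; a production relay would also need batching, error handling, and a real message broker.

```python
import json
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
    db.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def place_order(db, order_id, item):
    # "with db" commits both inserts together, or rolls both back on error,
    # so the business row and its event can never diverge.
    with db:
        db.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_placed", "id": order_id, "item": item}),),
        )


def relay_once(db, publish):
    """Publish pending events in order and return how many were sent."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        # A crash after publish but before the UPDATE causes a re-send,
        # so delivery is at-least-once and consumers should be idempotent.
        publish(json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

The design choice to note is that the service never writes to the database and the broker in two separate steps (the dual-write hazard); it writes once, transactionally, and a separate relay turns outbox rows into messages.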
GitHub Architecture Project
This project spans all four days and simulates the evolution of a fault-tolerant distributed system through collaborative GitHub milestones. Each team will maintain:
- A design.md document with architecture decisions, consistency contracts, and tradeoff reasoning
- Annotated PRs referencing coordination logic, recovery strategies, and observability instrumentation
- Structured issues reflecting real-world failure modes, SLO constraints, and coordination boundaries
Day | Milestone |
---|---|
1 | Formalize system API surface with specified consistency contracts (linearizable, causal, etc.) |
2 | Apply structured sharding (range/hash) and implement write quorum coordination logic |
3 | Integrate Raft-style log coordination, leader election, and recovery fencing |
4 | Extend system with metrics, traces, snapshots; finalize capstone and contribute to cohort-authored paper |
Each team maintains a repo with versioned design.md and issue-based planning.
Instructor
Chiradip Mandal
Writer • Distributed Systems Researcher • Founder • Senior Architect • Principal Engineer
Chiradip has led architecture at both hyperscale companies and startups. His experience includes:
- Multi-time founder and systems advisor
- Author of the upcoming book Designing Ultra-Large-Scale Systems
- Senior Principal Engineer/Architect at Apple
“I’m not here to teach you how to design YouTube. I’m here to help you reason about consistency, concurrency, and failure — and use that reasoning to architect things that don’t fall apart.”
Bonuses & Cohort Continuity
- Early access to Designing Ultra-Large-Scale Systems (pre-release reviewer)
- Private Slack/Discord community for long-term mentorship
- Access to follow-up tracks: CRDTs, streaming architectures, infra control planes
- Peer design reviews + optional architecture office hours
- Opportunity to publish high-quality research papers through SystemDesignSchool.com and its partners
Expectations
- Attend all 4 days fully; every module builds on the last
- Push to GitHub daily and participate in peer reviews
- Be prepared to defend design tradeoffs in live discussions
- Build not for interviews — but for systems that survive failure and time