System Design for Scale, Failure, and Evolution

“You’ll of course be able to design systems like YouTube by the end of this course — but that’s not the point. Our focus is on the underlying principles: consistency, concurrency, coordination, and the realities of failure. Once you master those, designing anything becomes a matter of structured thought — not memorized patterns.”
Chiradip Mandal

Overview

  • Duration: 4 Days × 2 Hours/day
  • Format: 1.5h deep dive + 30m hands-on GitHub workshop
  • Target: Staff+, Principal, L8–L10 engineers
  • Pedagogy: Theory + application + architecture retrospectives
  • Deliverables: GitHub architecture, design docs, book access
  • Post-Cohort: Design circles, long-term cohort mentorship

Learning Goals

  • Master architectural tradeoffs: consistency, concurrency, consensus
  • Build systems under realistic latency, failure, and resource constraints
  • Collaboratively evolve an architecture via GitHub PRs
  • Join a durable network of peer reviewers, technologists, and system leaders

Day-by-Day Syllabus

Day 1 — Bounded Correctness

Theme: Designing for Correctness under Latency and Failure

  • System boundaries, failure domains, topological constraints
  • Consistency models: linearizable (strong), causal, eventual
  • Tradeoffs: CAP, PACELC, consistency delay windows
  • Workshop: Design a log-backed API with defined consistency contract

Readings:

  • Lamport — Time, Clocks, and the Ordering of Events in a Distributed System
  • Jepsen — Consistency Models Explained

Day 2 — Scalability Through Structure

Theme: Partitioning, Replication, and Isolation

  • Sharding: consistent hashing, prefix-based, zone-aware
  • Replication: quorum, async vs sync, follower lag
  • Isolation levels: snapshot, serializable; anomalies such as write skew
  • Workshop: Implement sharded write path, simulate stale read and recovery
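
As a warm-up for the sharding topic above, here is a minimal sketch of consistent hashing with virtual nodes. It is illustrative only and is not the workshop's reference implementation: the names `HashRing`, `ring_position`, and the `shard-a`/`shard-b`/`shard-c` labels are hypothetical, and SHA-256 is assumed purely as a convenient way to spread keys around the ring.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring; each node owns `vnodes` positions."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Insert one ring entry per virtual node, keeping the list sorted.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (ring_position(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        # Dropping a node removes only its own entries; others stay in place.
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def owner(self, key: str) -> str:
        """Return the node at the first ring position at or after the key."""
        if not self._ring:
            raise LookupError("ring has no nodes")
        idx = bisect.bisect_right(self._ring, (ring_position(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


if __name__ == "__main__":
    ring = HashRing(["shard-a", "shard-b", "shard-c"])
    keys = [f"user:{i}" for i in range(1000)]
    before = {k: ring.owner(k) for k in keys}
    ring.remove("shard-b")
    moved = sum(1 for k in keys if ring.owner(k) != before[k])
    print(f"{moved} of {len(keys)} keys moved after removing shard-b")
```

The property worth noticing is that removing a shard reassigns only the keys that shard owned; every other key keeps its owner, which is what separates this scheme from plain `hash(key) % n` placement.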

Readings:

  • Spanner: Google’s Globally-Distributed Database
  • Dynamo: Amazon’s Highly Available Key-Value Store

Day 3 — Time, Coordination, and Recovery

Theme: Consensus, Clocks, Durable Recovery

  • Raft, Paxos, Multi-Raft, Viewstamped Replication
  • Clocks: vector, HLC, skew compensation, leases
  • WALs, fencing tokens, log truncation, snapshotting
  • Workshop: Add Raft-style leadership and recovery to cohort service
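
To make the fencing-token idea from the list above concrete, the following is a small sketch under stated assumptions: a lease service hands out monotonically increasing tokens, and the storage layer rejects any write carrying a token older than one it has already seen. `LeaseService`, `FencedStore`, and `StaleTokenError` are hypothetical names for illustration; a real system would issue tokens from a consensus-backed service rather than an in-process counter.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already accepted."""


class LeaseService:
    """Hypothetical lease issuer: every new lease gets a larger token."""

    def __init__(self):
        self._last_token = 0
        self.holder = None

    def acquire(self, holder: str) -> int:
        self._last_token += 1
        self.holder = holder
        return self._last_token


class FencedStore:
    """Hypothetical storage that tracks the highest token it has accepted."""

    def __init__(self):
        self._highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> None:
        # A paused or partitioned ex-leader still holds its old, smaller token.
        if token < self._highest_token:
            raise StaleTokenError(
                f"token {token} is older than {self._highest_token}"
            )
        self._highest_token = token
        self.data[key] = value


if __name__ == "__main__":
    leases = LeaseService()
    store = FencedStore()
    old_token = leases.acquire("node-1")  # node-1 becomes leader, then stalls
    new_token = leases.acquire("node-2")  # lease expires; node-2 takes over
    store.write(new_token, "config", "written by node-2")
    try:
        store.write(old_token, "config", "late write from node-1")
    except StaleTokenError as err:
        print("rejected:", err)
```

The check lives in the store rather than in the client because a stalled leader cannot be trusted to know that its lease has expired.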

Readings:

  • Ongaro — In Search of an Understandable Consensus Algorithm
  • Viewstamped Replication Revisited

Day 4 — Architectural Evolution

Theme: Designing for Change and Long-Term Resilience

  • Observability-first systems: metrics, traces, health budgets
  • Backward compatibility: dual writes, API evolution
  • Patterns: outbox, changelogs, control planes
  • Capstone: Extend your system — add observability, snapshot versioning, or rolling upgrades
  • Collaboration: Initiate a cohort-authored short paper on “System Thinking in Distributed Design” — synthesize cohort insights into a public artifact

Readings:

  • Eventual Consistency & The Outbox Pattern

GitHub Architecture Project

This project spans all four days and simulates the evolution of a fault-tolerant distributed system through collaborative GitHub milestones. Each team will maintain:

  • A design.md document with architecture decisions, consistency contracts, and tradeoff reasoning
  • Annotated PRs referencing coordination logic, recovery strategies, and observability instrumentation
  • Structured issues reflecting real-world failure modes, SLO constraints, and coordination boundaries

Day | Milestone
1 | Formalize system API surface with specified consistency contracts (linearizability, causal, etc.)
2 | Apply structured sharding (range/hash) and implement write quorum coordination logic
3 | Integrate Raft-style log coordination, leader election, and recovery fencing
4 | Extend system with metrics, traces, snapshots; finalize capstone and contribute to cohort-authored paper

Each team maintains a repo with versioned design.md and issue-based planning.


Instructor

Chiradip Mandal

Writer • Distributed Systems Researcher • Founder • Senior Architect • Principal Engineer

Chiradip has led architecture at both hyperscale companies and startups. His experience includes:

  • Multi-time founder and systems advisor
  • Author of the upcoming book Designing Ultra-Large-Scale Systems
  • Senior Principal Engineer/Architect at Apple

“I’m not here to teach you how to design YouTube. I’m here to help you reason about consistency, concurrency, and failure — and use that reasoning to architect things that don’t fall apart.”


Bonuses & Cohort Continuity

  • Early access to Designing Ultra-Large-Scale Systems (pre-release reviewer)
  • Private Slack/Discord community for long-term mentorship
  • Access to follow-up tracks: CRDTs, streaming architectures, infra control planes
  • Peer design reviews + optional architecture office hours
  • Opportunity to publish research papers through SystemDesignSchool.com and its partners

Expectations

  • Attend all 4 days fully; every module builds on the last
  • Push to GitHub daily and participate in peer reviews
  • Be prepared to defend design tradeoffs in live discussions
  • Build for systems that survive failure and time, not for interviews