Reliability: Failures, Rate Limiting, Observability

Don't worry if these words sound scary — we'll explain each one along the way. In this lesson we learn how to talk in an interview about ways to keep a system stable even when something goes wrong: timeouts (giving up waiting for an answer after a set time, instead of getting stuck forever), retries

System Design (planning how to build big software that serves many people) is like planning a city: roads, storage, traffic lights, and maintenance crews so the city keeps running smoothly even during rush hour, when everyone is out at the same time.

Reliability and observability: The core idea of this lesson is how we keep the system reliable (still working) and observable (able to show us what's happening inside it). It covers: when to stop waiting (timeouts), when to try again (retries), when to block calls temporarily so one failing part doesn't drag down the rest (circuit breakers), how to limit how many requests are allowed (rate limits), and how to watch what's going on using metrics, logs, and tracing (following one request as it moves through the system).
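To make these ideas concrete, here is a minimal Python sketch of three of them: a rate limit, a circuit breaker, and retries after a timeout. It is only one way to do it, not a production implementation. The names TokenBucket, CircuitBreaker, retry, and ManualClock are made up for this lesson, and the sketch assumes the function being called raises TimeoutError when it gives up waiting.

```python
import time


class TokenBucket:
    """Rate limit: allow bursts up to `capacity`, refill `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Add tokens for the time that passed, never going above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class CircuitBreaker:
    """Block calls for `cooldown` seconds after `max_failures` failures in a row."""

    def __init__(self, max_failures, cooldown, clock=time.monotonic):
        self.max_failures, self.cooldown, self.clock = max_failures, cooldown, clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: call blocked")
            self.opened_at = None  # cooldown over: let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `func` again after a TimeoutError, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return func()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: report the failure
            sleep(base_delay * 2 ** attempt)


class ManualClock:
    """Demo clock we move forward by hand instead of really waiting."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = ManualClock()

# Rate limit: a burst of 2 is allowed, the 3rd request is refused.
bucket = TokenBucket(rate=1, capacity=2, clock=clock)
burst = [bucket.allow() for _ in range(3)]
clock.now = 1.0                      # one second later, one token is back
after_refill = bucket.allow()

# Retries: the call times out twice, then succeeds on the third attempt.
replies = iter([TimeoutError(), TimeoutError(), "ok"])
waits = []                           # record the waits instead of sleeping

def flaky():
    reply = next(replies)
    if isinstance(reply, Exception):
        raise reply
    return reply

answer = retry(flaky, attempts=3, base_delay=0.1, sleep=waits.append)

# Circuit breaker: two failures open the circuit, the third call is blocked.
breaker = CircuitBreaker(max_failures=2, cooldown=30, clock=clock)

def broken():
    raise ConnectionError("service down")

outcomes = []
for _ in range(3):
    try:
        breaker.call(broken)
    except ConnectionError:
        outcomes.append("failed")
    except RuntimeError:
        outcomes.append("blocked")

clock.now = 40.0                     # cooldown has passed
recovered = breaker.call(lambda: "ok")
```

The clock and the sleep function are passed in as parameters so the demo runs instantly and always gives the same result. In an interview, the point to make is the same as in the text above: each mechanism puts a limit on how much one slow or failing part can hurt everything else.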
Trade-off: A conscious choice where you gain one thing and pay for it with another, like picking fast food over a home-cooked meal, where you save time but lose some quality. In an interview you explain what you gained and what it cost.
Operational metric: A number that shows whether a decision really works when the system is live and serving real users (this is called production). For example: latency (how long it takes to get an answer), error rate (the share of requests that fail), queue lag (how many tasks are waiting in line), cache hit ratio (the share of lookups answered from fast memory), and more.
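As a small worked example, the Python sketch below computes three of these metrics from a handful of request records. The records are invented for illustration, and the percentile helper uses the simple nearest-rank method, which is one common choice among several.

```python
import math

# Hypothetical request records: (latency in ms, succeeded?, served from cache?)
requests = [
    (120, True, True),
    (95, True, True),
    (430, False, False),
    (110, True, False),
    (105, True, True),
]


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


latencies = [latency for latency, _, _ in requests]
p50 = percentile(latencies, 50)   # the typical request
p95 = percentile(latencies, 95)   # the slow tail that users still feel

error_rate = sum(1 for _, ok, _ in requests if not ok) / len(requests)
cache_hit_ratio = sum(1 for _, _, hit in requests if hit) / len(requests)

print(f"p50={p50} ms  p95={p95} ms  "
      f"errors={error_rate:.0%}  cache hits={cache_hit_ratio:.0%}")
```

Notice that the p50 latency (110 ms) looks healthy while the p95 latency (430 ms) exposes the one slow, failed request. That is why latency is usually discussed as percentiles and not as a single average.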