
[HLD] High-Level Design

High-Level Design (HLD) focuses on the overall architecture of a system. It involves defining the major components, their interactions, and the technologies used, without delving into the minute details of implementation. The goal is to create a blueprint that addresses key requirements like scalability, availability, performance, and fault tolerance. HLD typically includes block diagrams, data flow diagrams, and technology choices.


Step 1: Fundamentals

Understanding these core concepts is crucial before diving into complex system designs. They form the building blocks of most modern distributed systems.

  • Serverless vs Serverful:

    • Serverful (Traditional/Server-based):
      • Definition: You manage the underlying servers (physical or virtual machines). This includes provisioning, scaling, patching, and maintenance.
      • Pros: Full control over the environment, predictable performance (once configured), can be cost-effective for stable, high-traffic loads, easier for long-running tasks or stateful applications.
      • Cons: Responsibility for infrastructure management, potential for underutilized resources (paying for idle time), scaling can be slower and more manual (though auto-scaling groups help), upfront capacity planning needed.
      • Examples: EC2 instances on AWS, VMs on Azure/GCP, dedicated servers.
    • Serverless (e.g., FaaS - Functions as a Service, BaaS - Backend as a Service):
      • Definition: The cloud provider manages the server infrastructure. You deploy code (functions) or use managed backend services, and they run in response to events. You don't manage individual servers.
      • Pros: No server management, pay-per-use (cost-effective for sporadic traffic), automatic scaling (scales to zero), faster development cycles for certain applications.
      • Cons: Vendor lock-in, "cold starts" (latency for the first request after idle), execution duration limits, potential for complex debugging/monitoring across many small functions, stateless nature can complicate state management.
      • Examples: AWS Lambda, Azure Functions, Google Cloud Functions, AWS Fargate (for containers), Firebase.
    • Trade-offs: Consider operational overhead, cost model, scalability needs, control requirements, development complexity, and vendor lock-in.
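
To make the FaaS model concrete, here is a minimal AWS Lambda-style handler in Python: the platform provisions and scales the servers and invokes this function once per event. The event shape used here is an assumption for illustration.

```python
import json

# The standard AWS Lambda Python entry point: the platform handles
# provisioning, scaling (including to zero), and per-invocation billing;
# you supply only this function.
def lambda_handler(event, context):
    name = event.get("name", "world")  # illustrative event field
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```
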
  • Horizontal vs Vertical Scaling:

    • Vertical Scaling (Scaling Up):
      • Definition: Increasing the resources of a single server (e.g., adding more CPU, RAM, faster disk).
      • Pros: Simpler to implement initially, applications may not need modification.
      • Cons: Finite limit to how much you can upgrade a single machine, can be more expensive per unit of resource, often requires downtime for upgrades, still a single point of failure (SPOF).
    • Horizontal Scaling (Scaling Out):
      • Definition: Adding more servers to your pool of resources and distributing the load among them (typically using a load balancer).
      • Pros: Can scale to much larger capacities, improves availability and fault tolerance (if one server fails, others take over), can be more cost-effective for large loads.
      • Cons: More complex to manage, applications often need to be designed to be stateless or handle distributed state, data consistency can be a challenge.
    • Note: Most large-scale systems use horizontal scaling, but individual components might still benefit from some vertical scaling.
  • What are Threads?:

    • Definition: The smallest unit of execution within a process. A process can have multiple threads, all sharing the same memory space (code, data, heap segments) but each having its own stack and registers.
    • Concurrency: Threads allow a single process to perform multiple tasks seemingly simultaneously by rapidly switching between them (on a single CPU core).
    • Parallelism: If multiple CPU cores are available, threads can run truly in parallel, each on a different core.
    • Benefits: Improved responsiveness (e.g., UI thread remains active while background threads do work), efficient resource sharing within a process.
    • Challenges:
      • Race Conditions: Multiple threads accessing shared data, and the outcome depends on the unpredictable order of execution.
      • Deadlocks: Two or more threads waiting for each other to release resources, causing a standstill.
      • Synchronization Overhead: Using locks, mutexes, semaphores to protect shared data adds complexity and can impact performance.
    • Use Cases: Web servers handling multiple client requests, GUI applications, background processing.
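
A minimal sketch of the race condition and synchronization points above, using only Python's standard library: four threads increment a shared counter, and the lock makes the read-modify-write sequence atomic.

```python
import threading

counter = 0
lock = threading.Lock()

def worker():
    global counter
    for _ in range(100_000):
        with lock:          # without this lock, the read-modify-write
            counter += 1    # races and the final count may come up short

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 every time; remove the lock and it can vary
```
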
  • What are Pages?:

    • Definition: In the context of Operating Systems and memory management, a page is a fixed-size block of virtual memory. Physical memory (RAM) is also divided into fixed-size blocks called frames.
    • Virtual Memory: An OS technique that gives an application program the impression that it has a contiguous working memory (address space), while in reality, it may be physically fragmented and may even overflow onto disk storage.
    • Paging: The process of mapping virtual addresses to physical addresses. The OS maintains a page table for each process, which stores the mapping between virtual pages and physical frames.
    • Page Fault: Occurs when a program tries to access a page that is not currently in physical memory (RAM). The OS must then load the required page from secondary storage (e.g., hard disk) into a frame in RAM, potentially swapping out another page if RAM is full.
    • Role in OS & Performance: Paging enables efficient use of RAM, allows processes larger than physical memory, and provides memory protection. However, frequent page faults (thrashing) can severely degrade performance due to slow disk I/O.
  • How does the Internet Work?:

    • Client-Server Model: Most internet interactions involve a client (e.g., web browser) requesting resources from a server.
    • IP (Internet Protocol): A network layer protocol responsible for addressing (IP addresses) and routing packets of data from source to destination across networks.
    • TCP (Transmission Control Protocol): A transport layer protocol built on top of IP. It provides reliable, ordered, and error-checked delivery of a stream of bytes. Establishes a connection via a "three-way handshake."
    • UDP (User Datagram Protocol): Another transport layer protocol. It's connectionless, faster but less reliable than TCP (no guaranteed delivery or order).
    • DNS (Domain Name System): Translates human-readable domain names (e.g., www.google.com) into numerical IP addresses (e.g., 172.217.160.142) that computers use to locate each other.
      • Process: Browser checks local cache -> OS cache -> Router cache -> ISP's recursive DNS server -> Root DNS servers -> TLD (Top-Level Domain) servers -> Authoritative DNS servers.
    • HTTP (Hypertext Transfer Protocol): An application layer protocol used for transmitting hypermedia documents (e.g., HTML). It's the foundation of data communication for the World Wide Web.
    • HTTPS (HTTP Secure): HTTP over TLS/SSL, providing encryption, authentication, and integrity for secure communication.
    • Routers & Switches: Network devices that direct packets. Routers work at the network layer (IP addresses) to connect different networks. Switches work at the data link layer (MAC addresses) within a local network.
    • Firewalls: Network security systems that monitor and control incoming and outgoing network traffic based on predetermined security rules.
    • Load Balancers: Distribute incoming network traffic across multiple servers.
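
A small illustration of the DNS and TCP steps above using Python's standard library (it assumes network access): resolve a hostname to an IP address, then open a TCP connection, which performs the three-way handshake under the hood.

```python
import socket

# Resolve a hostname the way an application's network stack would.
ip = socket.gethostbyname("www.google.com")
print(ip)                                   # e.g. 172.217.160.142

# Opening a TCP connection to port 443 triggers the SYN / SYN-ACK / ACK
# handshake before any application data (HTTPS) flows.
conn = socket.create_connection((ip, 443), timeout=5)
print(conn.getpeername())
conn.close()
```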

Step 2: Databases

Databases are fundamental for storing, retrieving, and managing data. The choice of database significantly impacts system performance, scalability, and consistency.

  • SQL vs NoSQL:

    • SQL (Relational Databases):
      • Definition: Store data in tables with predefined schemas (rows and columns). Relationships are defined using foreign keys.
      • Examples: MySQL, PostgreSQL, Oracle, SQL Server, SQLite.
      • Characteristics:
        • ACID Properties: Atomicity, Consistency, Isolation, Durability ensure reliable transactions.
        • Schema-on-write: Schema is defined before data is written.
        • Strong Consistency: Data is consistent across all nodes.
        • Joins: Powerful for querying related data across multiple tables.
      • Pros: Mature technology, strong consistency, good for complex queries and structured data, well-understood.
      • Cons: Can be difficult to scale horizontally (often scaled vertically), less flexible with evolving schemas.
      • Use Cases: Financial systems, applications requiring complex transactions, systems with well-defined, structured data.
    • NoSQL (Non-Relational Databases):
      • Definition: A broad category of databases that don't use traditional relational table structures. Designed for scalability, flexibility, and high performance.
      • Types & Examples:
        • Key-Value Stores: (e.g., Redis, Memcached, Amazon DynamoDB) - Data stored as key-value pairs. Simple, fast.
        • Document Databases: (e.g., MongoDB, Couchbase) - Data stored in document formats like JSON or BSON. Flexible schema.
        • Column-Family Stores: (e.g., Cassandra, HBase) - Data stored in columns rather than rows. Good for write-heavy, wide-row data.
        • Graph Databases: (e.g., Neo4j, Amazon Neptune) - Data stored as nodes and edges, optimized for relationship-heavy data.
      • Characteristics:
        • BASE Properties (Often): Basically Available, Soft state, Eventual consistency.
        • Schema-on-read/Flexible Schema: Schema can evolve, or data can be schemaless.
        • Horizontal Scalability: Designed to scale out across many commodity servers.
        • CAP Theorem Trade-offs: Often prioritize Availability and Partition Tolerance over strong Consistency.
      • Pros: High scalability and availability, flexible data models, good for unstructured or semi-structured data, high performance for specific access patterns.
      • Cons: Eventual consistency can be complex to handle, less mature than SQL for some features (e.g., complex transactions across documents/collections), querying can be less powerful than SQL's JOINs for relational data.
      • Use Cases: Big Data applications, real-time systems, content management, IoT, applications with rapidly evolving requirements.
    • CAP Trade-offs: Discussed in Step 3. NoSQL databases often make different choices in the CAP triangle compared to traditional SQL databases.
  • In-Memory Databases:

    • Definition: Databases that store data primarily in main memory (RAM) instead of on disk.
    • Examples: Redis, Memcached, SAP HANA, VoltDB.
    • Pros:
      • Speed: Extremely fast read and write operations due to RAM access being orders of magnitude faster than disk access.
      • Reduced Latency: Significantly lowers data access times.
    • Cons:
      • Volatility: Data can be lost on power failure if not persisted (though many offer persistence options like snapshots or AOF for Redis).
      • Cost: RAM is more expensive per GB than disk storage.
      • Capacity Limits: Limited by available RAM.
    • Use Cases:
      • Caching: Most common use case to speed up access to frequently requested data from a slower primary database.
      • Session Management: Storing user session data for web applications.
      • Real-time Analytics: Processing rapidly changing data for immediate insights.
      • Leaderboards/Counters: High-speed updates and reads for gaming or social apps.
      • Message Queues/Brokers: Redis Streams or lists can act as lightweight message queues.
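
A common cache-aside sketch with Redis, assuming the third-party redis-py client and a Redis server on localhost; `load_user_from_db` is a hypothetical stand-in for a query against the slower primary database.

```python
import json
import redis  # assumes: pip install redis, and a local Redis server

r = redis.Redis(host="localhost", port=6379)

def load_user_from_db(user_id):
    # hypothetical stand-in for a real database query
    return {"id": user_id, "name": "Ada"}

def get_user(user_id):
    key = f"user:{user_id}"
    cached = r.get(key)                    # 1. try the cache first
    if cached is not None:
        return json.loads(cached)
    user = load_user_from_db(user_id)      # 2. cache miss: hit the database
    r.setex(key, 300, json.dumps(user))    # 3. cache with a 5-minute TTL
    return user
```
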
  • Data Replication & Migration:

    • Data Replication:
      • Definition: The process of creating and maintaining multiple copies of data on different database servers (replicas).
      • Purpose:
        • High Availability & Fault Tolerance: If one server fails, others can take over.
        • Read Scalability: Distribute read queries across replicas to reduce load on the primary server.
        • Disaster Recovery: Replicas in different geographical locations can protect against regional outages.
        • Reduced Latency: Users can access data from geographically closer replicas.
      • Types:
        • Master-Slave (Primary-Replica): Writes go to the master, which then replicates to slaves. Slaves handle reads. Most common.
        • Master-Master (Multi-Master): Writes can go to any master, which then replicates to other masters. More complex due to conflict resolution.
      • Modes:
        • Synchronous: Write is acknowledged only after it's committed to all (or a quorum of) replicas. Ensures strong consistency but higher latency.
        • Asynchronous: Write is acknowledged after committed to the primary, then replicated in the background. Lower latency but potential for data loss or stale reads if primary fails before replication.
    • Data Migration:
      • Definition: The process of moving data from one storage system, database, or format to another.
      • Reasons: Upgrading hardware/software, changing database vendors, consolidating databases, moving to the cloud, schema changes.
      • Strategies:
        • Big Bang: Stop the old system, migrate data, start the new system. Simple but involves downtime.
        • Trickle/Phased Migration: Migrate data in stages or continuously sync data while both systems run in parallel. More complex, less downtime. (e.g., using Change Data Capture - CDC).
      • Challenges: Downtime minimization, data consistency, data transformation, validation, rollback planning.
  • Data Partitioning (or Table Partitioning in SQL):

    • Definition: Dividing a large database table (or collection in NoSQL) into smaller, more manageable pieces (partitions), while still being treated as a single table logically for queries. The data itself is typically co-located (on the same server or server cluster).
    • Purpose: Improve query performance (scan smaller data sets), simplify maintenance (e.g., archiving old partitions), improve availability.
    • Types:
      • Horizontal Partitioning: Divides a table into multiple smaller tables based on row values (e.g., partitioning sales data by month or region). Each partition has the same schema but contains a different subset of rows. This is closely related to Sharding, but partitioning often implies the partitions reside on the same database server instance or a tightly coupled cluster.
      • Vertical Partitioning: Divides a table into multiple tables by splitting its columns. Some columns go into one table, others into another, linked by a common key. Useful when certain columns are accessed frequently and others (e.g., large BLOBs/TEXT) rarely.
      • Others: Range, List, Hash partitioning.
    • Note: Different from Sharding, where data is distributed across different database servers/nodes. Partitioning can be a prerequisite or complementary technique to sharding.
  • Sharding (a form of Horizontal Partitioning across servers):

    • Definition: A database architecture pattern where data is horizontally partitioned and distributed across multiple independent database servers (shards). Each shard is a separate database, holding a subset of the total data.
    • Purpose:
      • Scalability: Distribute read/write load, storage capacity, and processing power across many servers.
      • Performance: Queries can be directed to specific shards, reducing the amount of data scanned.
    • Shard Key: A specific column (or set of columns) whose value determines which shard a particular row/document resides on.
      • Choosing a good shard key is critical: Aims for even data distribution and query routing. Poor shard keys can lead to "hotspots" (some shards overloaded).
    • Strategies for Shard Key:
      • Range-based sharding: Data divided based on ranges of the shard key (e.g., User IDs 1-1000 on Shard A, 1001-2000 on Shard B).
      • Hash-based sharding: Shard key is hashed, and the hash value determines the shard. Ensures more even distribution but makes range queries difficult.
      • Directory-based sharding: A lookup service/table maps shard keys to shards. Offers flexibility but adds a point of indirection.
    • Challenges:
      • Cross-shard Joins: Difficult and inefficient. Often requires application-level joins or denormalization.
      • Hotspots: Uneven data distribution or access patterns.
      • Re-sharding: Adding/removing shards or rebalancing data can be complex.
      • Transactions: ACID transactions across shards are complex to implement (often requiring two-phase commit, which impacts performance and availability). Many sharded systems relax ACID guarantees.
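
A minimal sketch of hash-based shard routing; the shard names are illustrative. Note how changing the number of shards would remap most keys, which is the problem consistent hashing (Step 6) addresses.

```python
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    # Hash-based sharding: a stable hash of the shard key, modulo the
    # number of shards. This gives even distribution, but range queries
    # now span every shard, and resizing SHARDS remaps most keys.
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42"))  # always routes this key to the same shard
```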

Step 3: Consistency vs Availability

These concepts are at the heart of distributed system design, particularly when dealing with data replication and partitioning.

  • Data Consistency & Its Levels:

    • Definition: Ensures that all clients see the same view of the data. In a replicated system, a consistency model specifies what guarantees readers have about observing the most recent writes across replicas.
    • Levels:
      • Strong Consistency (Linearizability): The strongest form. All operations appear to occur instantaneously and in some global order. Any read will return the value of the most recent completed write. Often achieved using synchronous replication or consensus algorithms (e.g., Paxos, Raft).
        • Impact: Can increase latency as operations may need to coordinate across replicas.
      • Sequential Consistency: All operations appear to occur in some sequential order, and operations from any individual client appear in the order they were issued by that client. Weaker than linearizability as the global order might not match real-time.
      • Causal Consistency: If operation A happens-before operation B (e.g., A causes B), then all processes see A before B. Operations that are not causally related can be seen in different orders by different processes.
      • Eventual Consistency: A weaker model. If no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. There's no guarantee on when this will happen.
        • Variations:
          • Read-Your-Writes Consistency: After a client writes data, subsequent reads by that same client will see the written data (or a newer version).
          • Monotonic Reads Consistency: If a client reads a value, subsequent reads by that client will never see an older value.
          • Monotonic Writes Consistency: Writes from the same client are applied in the order they were issued.
      • Weak Consistency: Offers no guarantees; reads might return stale data.
    • Trade-off: Stronger consistency models are easier for developers to reason about but typically come at the cost of higher latency and/or lower availability.
  • Isolation & Its Levels (ACID's 'I'):

    • Definition: Controls how/when changes made by one transaction become visible to others. It prevents concurrent transactions from interfering with each other in undesirable ways.
    • Read Phenomena (Problems caused by lack of isolation):
      • Dirty Read: Transaction T1 modifies data, and T2 reads that modified data before T1 commits. If T1 rolls back, T2 has read "dirty" (invalid) data.
      • Non-repeatable Read: Transaction T1 reads data. T2 then modifies or deletes that data and commits. If T1 re-reads the data, it sees a different value or that the data is gone.
      • Phantom Read: Transaction T1 reads a set of rows satisfying a search condition. T2 then inserts new rows (or modifies existing ones) that satisfy T1's search condition and commits. If T1 re-executes its query, it sees new "phantom" rows.
    • SQL Isolation Levels (from weakest to strongest):
      • Read Uncommitted: Allows dirty reads, non-repeatable reads, and phantom reads. Lowest isolation, highest concurrency.
      • Read Committed: Prevents dirty reads. Non-repeatable reads and phantom reads can still occur. (Common default).
      • Repeatable Read: Prevents dirty reads and non-repeatable reads. Phantom reads can still occur. (MySQL's default with InnoDB).
      • Serializable: Prevents all three phenomena. Highest isolation; transactions appear to execute serially. Can significantly reduce concurrency. Often implemented with two-phase locking (2PL) or serializable snapshot isolation (SSI) on top of MVCC; note that plain snapshot isolation is slightly weaker (it permits write skew).
    • Note: Higher isolation levels generally reduce concurrency and can increase contention for resources (locks).
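
A hedged sketch of selecting an isolation level per transaction, assuming PostgreSQL with the psycopg2 driver and a hypothetical accounts table; the SET TRANSACTION statement itself is standard SQL.

```python
import psycopg2  # assumes PostgreSQL and: pip install psycopg2

conn = psycopg2.connect("dbname=shop")  # hypothetical DSN
with conn:  # psycopg2 wraps this block in a transaction
    with conn.cursor() as cur:
        # Must be the first statement of the transaction.
        cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
        cur.execute("SELECT balance FROM accounts WHERE id = %s", (42,))
        first = cur.fetchone()
        # Even if another session commits an UPDATE now, this
        # transaction re-reads the same snapshot: no non-repeatable read.
        cur.execute("SELECT balance FROM accounts WHERE id = %s", (42,))
        assert cur.fetchone() == first
```
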
  • CAP Theorem (Brewer's Theorem):

    • Definition: In a distributed data store, it is impossible to simultaneously provide more than two out of the following three guarantees:
      • Consistency (C): Every read receives the most recent write or an error. (This refers to strong consistency/linearizability in the context of CAP).
      • Availability (A): Every request receives a (non-error) response, without the guarantee that it contains the most recent write. The system remains operational even if some nodes fail.
      • Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes (i.e., a network partition).
    • The "Choose Two" Dilemma:
      • In modern distributed systems, network partitions are a fact of life. Therefore, Partition Tolerance (P) is usually a must-have.
      • So, the practical trade-off during a network partition is between Consistency and Availability:
        • CP (Consistency + Partition Tolerance): If a partition occurs, the system may become unavailable for some requests (e.g., writes, or reads from the partitioned side) to maintain consistency. When the partition heals, data is consistent.
          • Example: Banking systems where data accuracy is paramount.
        • AP (Availability + Partition Tolerance): If a partition occurs, all nodes remain available, but they might return stale data as they can't synchronize with other partitioned nodes. Consistency is sacrificed for availability. Data becomes eventually consistent once the partition heals.
          • Example: Social media feeds where showing something (even if slightly old) is better than showing an error.
    • Misconceptions:
      • CAP is not about choosing 2 out of 3 all the time. It's about the trade-off during a network partition. When the network is healthy, systems can often provide C and A.
      • The "C" in CAP is very specific (linearizability). Many systems offer weaker consistency models (like eventual consistency) and can still be highly available and partition tolerant.
      • It's not binary; it's a spectrum. Systems can make nuanced trade-offs.

Step 4: Cache

Caching is a technique to store frequently accessed data in a faster, closer storage layer to reduce latency, decrease load on backend systems, and improve performance.

  • What is Cache?:

    • Definition: A hardware or software component that stores data so that future requests for that data can be served faster. The data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere.
    • Role:
      • Reduce Latency: Accessing data from a fast cache (e.g., in-memory) is quicker than from a slower backend (e.g., disk-based database or remote service).
      • Reduce Load: Offloads requests from backend systems, preventing them from being overwhelmed.
      • Improve Throughput: Systems can handle more requests per second.
    • Where Caching Fits:
      • Client-side: Browser cache, mobile app cache.
      • CDN (Content Delivery Network): Distributed cache for static assets.
      • Load Balancer/Reverse Proxy Cache: Caching responses at the edge.
      • Application-level Cache: In-memory cache within an application instance (e.g., Guava Cache, Ehcache).
      • Distributed Cache: External, shared cache service (e.g., Redis, Memcached).
      • Database Cache: Database's internal buffer pool/cache.
    • Redis, Memcached:
      • Memcached: Simple, high-performance, distributed in-memory key-value store. Good for caching. Purely in-memory (no built-in persistence). Multithreaded.
      • Redis: More feature-rich in-memory data structure store (supports strings, hashes, lists, sets, sorted sets, streams, etc.). Can be used as a cache, database, or message broker. Offers persistence options. Mostly single-threaded (for commands) but uses background threads for I/O, etc.
  • Write Policies (how data is written to cache and backend storage):

    • Write-Through:
      • Process: Data is written to the cache AND the backend database simultaneously (or sequentially, but the operation completes only after both succeed).
      • Pros: High data consistency between cache and DB. Simpler to implement. No data loss on cache failure.
      • Cons: Higher write latency as it involves two write operations.
    • Write-Around:
      • Process: Data is written directly to the backend database, bypassing the cache. Cache is populated on read miss.
      • Pros: Avoids flooding the cache with write-intensive, infrequently read data. Write operations are fast (only to DB).
      • Cons: Higher read latency for recently written data (cache miss) until it's read and cached. Can lead to cache inconsistency if data is updated in DB but old version remains in cache (requires cache invalidation).
    • Write-Back (Write-Behind):
      • Process: Data is written only to the cache. The cache then asynchronously writes the data to the backend database after a delay or when certain conditions are met (e.g., cache block is evicted).
      • Pros: Lowest write latency (only to fast cache). High write throughput. Reduces load on DB.
      • Cons: Risk of data loss if the cache fails before data is persisted to the DB. More complex to implement (tracking dirty blocks). Potential for data inconsistency if other systems access DB directly.
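
A toy write-through cache over SQLite (standard library only) to make the policy concrete: every put updates the database and the cache before returning, so the two cannot diverge.

```python
import sqlite3

class WriteThroughCache:
    """Write-through: a write goes to the backend database AND the cache
    before the call returns, keeping them consistent at the cost of
    paying both write latencies."""
    def __init__(self, conn):
        self.cache = {}
        self.conn = conn

    def put(self, key, value):
        self.conn.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))
        self.conn.commit()          # durable in the DB first...
        self.cache[key] = value     # ...then visible in the cache

    def get(self, key):
        if key in self.cache:       # cache hit
            return self.cache[key]
        row = self.conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            return None
        self.cache[key] = row[0]    # populate cache on read miss
        return row[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
c = WriteThroughCache(conn)
c.put("user:1", "Ada")
print(c.get("user:1"))
```
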
  • Replacement Policies (Eviction Policies - what to remove when cache is full):

    • LFU (Least Frequently Used):
      • Algorithm: Evicts the item that has been accessed the fewest times. Requires tracking access counts for each item.
      • Pros: Good for data with stable popularity (some items always popular, some always not).
      • Cons: Can perform poorly if access patterns change. Recently added items might be evicted quickly if not accessed frequently enough initially. Higher overhead to maintain counts.
    • LRU (Least Recently Used):
      • Algorithm: Evicts the item that has not been accessed for the longest time. Requires tracking the last access time for each item.
      • Pros: Adapts well to changing access patterns. Relatively simple to implement (e.g., using a doubly linked list and a hash map).
      • Cons: Can suffer from "cache pollution" if a large scan of infrequently used items occurs, evicting genuinely useful, older items.
    • Segmented LRU (SLRU):
      • Algorithm: Divides the cache into segments (e.g., a probationary segment for new items and a protected segment for items accessed at least twice). New items go to probationary. A hit in probationary moves it to protected. Eviction typically happens from probationary first, then from the LRU end of protected.
      • Pros: Tries to combine benefits of LRU while mitigating cache pollution by giving new items a chance before being promoted.
      • Cons: More complex than simple LRU.
    • Others: FIFO (First-In, First-Out), LIFO (Last-In, First-Out), Random Replacement (RR), MRU (Most Recently Used).
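
A compact LRU sketch using OrderedDict from Python's standard library: get moves an item to the most-recently-used end, and put evicts from the least-recently-used end when over capacity.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()   # insertion order = recency order

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" becomes most recently used
cache.put("c", 3)      # evicts "b", the least recently used
print(cache.get("b"))  # None
```
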
  • CDNs (Content Delivery Networks):

    • Definition: A geographically distributed network of proxy servers and their data centers.
    • Purpose: Provide high availability and performance by distributing content closer to end-users.
    • How it Works:
      1. User requests content (e.g., image, video, JS/CSS file) from a website.
      2. DNS routes the request to the nearest CDN edge server (Point of Presence - PoP).
      3. If the edge server has the content cached, it serves it directly.
      4. If not (cache miss), the edge server fetches it from the origin server (or another CDN tier), caches it, and then serves it to the user.
    • Benefits:
      • Reduced Latency: Users fetch content from nearby servers.
      • Reduced Origin Load: Offloads traffic from the main web servers.
      • Increased Availability/Redundancy: If one edge server fails, traffic can be routed to others.
      • Scalability: Handles traffic spikes effectively.
      • Security: Can provide DDoS mitigation and other security features.
    • Use Cases: Static assets (images, CSS, JS), video streaming, software downloads, dynamic content acceleration (less common but possible).

Step 5: Networking

Understanding network protocols is vital for designing systems that communicate efficiently and reliably over the internet or internal networks.

  • TCP vs UDP:

    • TCP (Transmission Control Protocol):
      • Type: Connection-oriented.
      • Reliability: Guarantees reliable, ordered delivery of data. Uses acknowledgments, retransmissions for lost packets, and sequence numbers for ordering.
      • Connection Setup: Requires a "three-way handshake" (SYN, SYN-ACK, ACK) to establish a connection before data transfer.
      • Flow Control & Congestion Control: Manages data flow to prevent overwhelming the receiver and adapts to network congestion.
      • Overhead: Higher overhead due to connection setup, ACKs, and state management.
      • Use Cases: HTTP/HTTPS, FTP, SMTP (email), SSH – where reliability is crucial.
    • UDP (User Datagram Protocol):
      • Type: Connectionless.
      • Reliability: No guaranteed delivery, ordering, or error checking (beyond a basic checksum). "Fire and forget."
      • Connection Setup: No handshake; packets (datagrams) are sent without prior arrangement.
      • Flow Control & Congestion Control: None inherently; must be implemented by the application if needed.
      • Overhead: Lower overhead, faster.
      • Use Cases: DNS, DHCP, VoIP, online gaming, video/audio streaming – where speed is more critical than perfect reliability, or where occasional packet loss is tolerable or handled by the application layer.
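
A minimal UDP example with Python's standard sockets: no handshake and no acknowledgment, the sender just fires a datagram (here to a receiver in the same process).

```python
import socket

# Receiver: bind to a local port and wait for one datagram.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 9999))

# Sender: connectionless "fire and forget" — no handshake, no ACK.
send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send.sendto(b"hello", ("127.0.0.1", 9999))

data, addr = recv.recvfrom(1024)  # blocks until the datagram arrives
print(data, addr)
```
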
  • HTTP/1.1 vs HTTP/2 vs HTTP/3 & HTTPS:

    • HTTP/1.1 (1997):
      • Features: Persistent connections (Keep-Alive), pipelining (sending multiple requests without waiting for each response, but responses must be in order).
      • Limitations:
        • Head-of-Line (HOL) Blocking: A slow request/response blocks subsequent ones on the same TCP connection. Pipelining was poorly supported.
        • Text-based: Verbose headers, not compressed by default.
        • Typically uses multiple TCP connections per origin to achieve parallelism, which is resource-intensive.
    • HTTP/2 (2015):
      • Goal: Improve performance by addressing HTTP/1.1 limitations.
      • Key Features:
        • Binary Protocol: More efficient to parse, less error-prone.
        • Multiplexing: Multiple requests and responses can be interleaved over a single TCP connection, eliminating HOL blocking at the HTTP layer (but TCP HOL blocking can still occur).
        • Header Compression (HPACK): Reduces overhead of redundant headers.
        • Server Push: Server can proactively send resources to the client cache that it anticipates the client will need (in practice it saw little adoption and has since been removed from major browsers).
        • Stream Prioritization: Client can indicate priority for certain resources.
    • HTTP/3 (2022, built on QUIC):
      • Goal: Further performance improvements, especially for lossy networks.
      • Key Features:
        • Uses QUIC (Quick UDP Internet Connections) instead of TCP. QUIC runs over UDP.
        • Eliminates TCP Head-of-Line Blocking: Since QUIC streams are independent at the transport layer, packet loss in one stream doesn't block others.
        • Faster Connection Establishment: QUIC often combines transport and cryptographic (TLS 1.3) handshakes, reducing round trips (0-RTT or 1-RTT).
        • Improved Congestion Control: More advanced and pluggable congestion control algorithms.
        • Connection Migration: Maintains connections even if client's IP address changes (e.g., switching from Wi-Fi to cellular).
    • HTTPS (HTTP Secure):
      • Definition: HTTP layered over TLS (Transport Layer Security) or its predecessor SSL (Secure Sockets Layer).
      • Purpose: Provides:
        • Encryption: Protects data confidentiality (prevents eavesdropping).
        • Authentication: Verifies the identity of the server (and optionally client) via digital certificates.
        • Integrity: Ensures data has not been tampered with during transit using message authentication codes (MACs).
      • Ubiquitous: Essential for secure web communication. Modern browsers flag non-HTTPS sites as insecure.
  • WebSockets:

    • Definition: A communication protocol that provides full-duplex communication channels over a single TCP connection.
    • How it Works: Initiated via an HTTP "Upgrade" request. Once established, the connection allows bidirectional, persistent communication without HTTP overhead for each message.
    • Characteristics:
      • Bidirectional: Both client and server can send messages independently at any time.
      • Full-duplex: Data can flow in both directions simultaneously.
      • Low Latency: After initial handshake, messages are small and don't carry HTTP header overhead.
      • Stateful: The connection remains open.
    • Use Cases: Real-time applications like chat apps, live sports updates, collaborative editing tools, multiplayer online games, financial trading platforms.
    • Alternatives/Comparisons:
      • HTTP Long Polling: Client sends a request, server holds it open until there's data. Less efficient than WebSockets.
      • Server-Sent Events (SSE): Server pushes data to client (unidirectional server-to-client). Simpler than WebSockets if only server-to-client comms are needed.
  • WebRTC & Video Streaming:

    • WebRTC (Web Real-Time Communication):
      • Definition: An open-source project providing browsers and mobile applications with Real-Time Communication (RTC) capabilities via simple JavaScript APIs.
      • Purpose: Enables peer-to-peer (P2P) audio, video, and generic data communication directly between browsers, without requiring plugins.
      • Components:
        • getUserMedia: Access camera and microphone.
        • RTCPeerConnection: Establishes P2P audio/video connections. Handles NAT traversal (STUN/TURN servers).
        • RTCDataChannel: Bidirectional P2P data transfer.
      • Signaling: WebRTC itself doesn't define signaling. A separate mechanism (e.g., WebSockets, SIP over WebSockets) is needed for peers to discover each other and exchange metadata (session control messages, network configurations, media capabilities).
      • Use Cases: Video conferencing, voice calls, P2P file sharing, screen sharing.
    • Video Streaming (General Concepts & Protocols):
      • Purpose: Deliver video content over the internet efficiently.
      • Key Techniques:
        • Progressive Download: Download the entire file sequentially. Simple, but not good for seeking or live streams.
        • Adaptive Bitrate Streaming (ABS): Video is encoded at multiple bitrates. The player client dynamically switches between streams based on network conditions and device capabilities.
          • HLS (HTTP Live Streaming): Developed by Apple. Segments video into short (e.g., 10-second) chunks (MPEG-2 TS files) listed in an M3U8 playlist. Delivered over HTTP. Widely supported.
          • DASH (Dynamic Adaptive Streaming over HTTP): MPEG-DASH. An international standard, similar to HLS but codec-agnostic. Uses XML-based Media Presentation Description (MPD).
        • Real-time Transport Protocol (RTP): Often used with RTCP (RTP Control Protocol) for delivering real-time audio/video over UDP (e.g., in WebRTC, IPTV).
        • RTMP (Real-Time Messaging Protocol): Initially proprietary (Adobe), now open. TCP-based. Used for streaming audio, video, and data between Flash player and a server. Declining for playback, but still common for ingest (sending stream from encoder to media server).
      • Components: Encoders, media servers (for transcoding, packaging, distributing), CDNs, player clients.

Step 6: Load Balancers

Load balancers distribute incoming network traffic or computational workload across multiple backend servers or resources to optimize resource use, maximize throughput, minimize response time, and ensure high availability.

  • Load Balancing Algorithms:

    • Stateless Algorithms (don't consider server state):
      • Round Robin: Distributes requests sequentially to each server in a circular order. Simple, but doesn't account for server load or capacity.
      • Weighted Round Robin: Servers are assigned weights based on their capacity. Servers with higher weights receive proportionally more requests.
      • IP Hash: Uses the client's IP address (or part of it) to determine which server receives the request. Ensures a client is consistently routed to the same server (sticky sessions), which can be useful for stateful applications but can lead to uneven distribution if IP distribution is skewed.
      • Random Choice: Picks a server randomly.
    • Stateful/Dynamic Algorithms (consider server state):
      • Least Connections: Directs traffic to the server with the fewest active connections. Good for long-lived connections.
      • Weighted Least Connections: Combines least connections with server weights.
      • Least Response Time: Directs traffic to the server with the fastest response time and fewest active connections. Requires monitoring server health and response times.
      • Resource-based (e.g., CPU/Memory utilization): Directs traffic to servers with lower CPU/memory load.
    • Layers of Load Balancing:
      • Layer 4 (Transport Layer): Operates at the TCP/UDP level. Makes routing decisions based on IP addresses and ports. Faster, but less context about the traffic.
      • Layer 7 (Application Layer): Operates at the HTTP/HTTPS level. Can inspect content of requests (e.g., URLs, headers, cookies) to make more intelligent routing decisions. Slower but more flexible. Can handle SSL termination, content-based routing.
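
Round robin and weighted round robin are small enough to sketch directly; the server addresses and weights below are illustrative.

```python
import itertools

servers = ["app1:8080", "app2:8080", "app3:8080"]

# Round robin: hand out servers in circular order, ignoring load.
rr = itertools.cycle(servers)
print([next(rr) for _ in range(4)])   # app1, app2, app3, app1

# Weighted round robin: a server appears in the cycle once per unit
# of weight, so higher-capacity servers get proportionally more traffic.
weights = {"app1:8080": 3, "app2:8080": 1}
wrr = itertools.cycle([s for s, w in weights.items() for _ in range(w)])
print([next(wrr) for _ in range(4)])  # app1 three times, then app2
```
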
  • Consistent Hashing:

    • Problem with Simple Hashing: If you use hash(key) % N (where N is number of servers) to distribute requests/data, adding or removing a server changes N, causing most keys to be remapped to different servers. This is disastrous for caches (mass cache invalidation).
    • Definition: A hashing technique where adding or removing a server (or cache node) only requires a small fraction of keys to be remapped, minimizing disruption.
    • How it Works (Simplified):
      1. Servers (nodes) and keys are hashed onto a conceptual ring (e.g., values from 0 to 2^32 - 1).
      2. To find the server for a key, hash the key and walk clockwise on the ring until a server node is found.
      3. When a server is added, it takes over a segment of keys from its successor on the ring.
      4. When a server is removed, its keys are taken over by its successor.
      • Virtual Nodes: To improve distribution and balance, each physical server can be mapped to multiple virtual nodes on the ring.
    • Use Cases: Distributed caches (e.g., Memcached, Riak), load balancing to maintain session stickiness with resilience, sharded databases (e.g., Cassandra, DynamoDB).
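
A minimal hash ring with virtual nodes, standard library only; the node names and the MD5 choice are illustrative. Keys are looked up by walking clockwise to the next node position on the ring.

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = {}          # ring position -> physical node
        self.sorted_keys = []   # sorted ring positions
        for node in nodes:
            self.add(node, vnodes)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node, vnodes=100):
        # Virtual nodes: map each server to many ring positions so that
        # load spreads evenly and removals shift only nearby keys.
        for i in range(vnodes):
            pos = self._hash(f"{node}#{i}")
            self.ring[pos] = node
            bisect.insort(self.sorted_keys, pos)

    def get(self, key: str):
        # Walk clockwise: first ring position >= hash(key), wrapping at 0.
        pos = self._hash(key)
        idx = bisect.bisect(self.sorted_keys, pos) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get("user:42"))  # stable unless a nearby node changes
```
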
  • Proxy & Reverse Proxy:

    • Proxy (Forward Proxy):
      • Definition: An intermediary server that sits between client computers and the internet. Clients are configured to send requests through the proxy.
      • Direction: Client -> Proxy -> Internet Server.
      • Purpose:
        • Client Anonymity/Privacy: Hides client IP addresses from the destination server.
        • Caching: Can cache frequently accessed content to speed up access for multiple clients.
        • Filtering/Security: Block access to certain websites or content.
        • Bypass Geo-restrictions: Access content restricted to certain regions (if proxy is in that region).
    • Reverse Proxy:
      • Definition: An intermediary server that sits in front of one or more web servers (backend servers). It accepts requests from clients on behalf of the backend servers.
      • Direction: Client -> Internet -> Reverse Proxy -> Backend Server(s).
      • Purpose:
        • Load Balancing: Distribute client requests across multiple backend servers.
        • SSL/TLS Termination: Offloads SSL/TLS encryption/decryption from backend servers.
        • Caching: Caches static and dynamic content to reduce load on backends.
        • Compression: Compresses responses to reduce bandwidth.
        • Security (Web Application Firewall - WAF): Protects backend servers from attacks (e.g., DDoS, SQL injection).
        • Request Routing/URL Rewriting: Route requests to different backend services based on URL path, headers, etc.
        • API Gateway: A specialized reverse proxy acting as a single entry point for microservices, handling auth, rate limiting, etc.
      • Examples: Nginx, HAProxy, Apache (with mod_proxy), Cloudflare, AWS ELB/ALB.
  • Rate Limiting:

    • Definition: A mechanism to control the number of requests a client (or all clients) can make to a service within a specified time window.
    • Purpose:
      • Prevent Abuse/DoS Attacks: Limit impact from malicious bots or misbehaving clients.
      • Ensure Fair Usage: Prevent a single user from consuming too many resources.
      • Cost Control: Limit usage of paid APIs or resource-intensive operations.
      • System Stability: Protect backend services from being overwhelmed.
    • Implementation Strategies/Algorithms:
      • Token Bucket: A bucket holds tokens, refilled at a fixed rate. Each request consumes a token. If no tokens, request is denied/queued. Allows bursts.
      • Leaky Bucket: Requests are added to a queue (bucket). Processed at a fixed rate. If queue is full, new requests are denied. Smooths out traffic.
      • Fixed Window Counter: Count requests within a fixed time window (e.g., 100 requests/minute). Can allow bursts at window edges.
      • Sliding Window Log: Store timestamps of requests. Count requests in the past window. Accurate but high memory.
      • Sliding Window Counter: Combines fixed window's low memory with better edge behavior.
    • Where to Implement: API Gateways, load balancers, individual microservices, web servers.
    • Response to Exceeding Limit: HTTP 429 (Too Many Requests) error, queuing, throttling (slowing down).
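
A minimal token bucket sketch in Python: tokens refill at a fixed rate up to the bucket's capacity, so short bursts are allowed while the sustained rate is capped. The rate and capacity below are illustrative.

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # caller would respond with HTTP 429

bucket = TokenBucket(rate=5, capacity=10)  # 5 req/s, bursts up to 10
print(bucket.allow())
```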

Step 7: Message Queues

Message queues enable asynchronous communication between different parts of a system, promoting decoupling, scalability, and resilience.

  • Asynchronous Processing:

    • Definition: A model where tasks are initiated but the system doesn't wait for them to complete before moving on. The task is offloaded to another component (e.g., a worker service) for processing at a later time.
    • Message Queues Role: Act as an intermediary buffer. A "producer" service sends a message (task description) to the queue. A "consumer" service picks up messages from the queue and processes them.
    • Benefits:
      • Improved Responsiveness: The primary service (e.g., API) can quickly acknowledge a request after enqueuing the task, without waiting for the (potentially long) task to finish.
      • Increased Resilience: If a consumer fails, messages remain in the queue and can be processed when the consumer recovers.
      • Load Leveling/Buffering: Smooths out spikes in traffic. If producers generate messages faster than consumers can process, the queue absorbs the backlog.
      • Decoupling: Producers and consumers don't need to know about each other directly, only about the queue. They can be scaled, updated, and deployed independently.
    • Use Cases:
      • Background job processing (e.g., image resizing, report generation, sending emails).
      • Inter-service communication in microservices.
      • Data ingestion pipelines.
      • Delayed task execution.
    • Examples:
      • Kafka: High-throughput, distributed streaming platform. Often used for log aggregation, real-time analytics, event sourcing. Provides durable storage of messages.
      • RabbitMQ: Mature, feature-rich message broker supporting multiple messaging protocols (AMQP, MQTT, STOMP). Good for complex routing.
      • Amazon SQS (Simple Queue Service): Fully managed message queuing service.
      • Redis Streams: Can be used as a lightweight message queue/log.
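
An in-process sketch of the producer/consumer pattern, using Python's queue.Queue as a stand-in for an external broker such as SQS or RabbitMQ; the job names are illustrative.

```python
import queue
import threading

tasks = queue.Queue()  # stand-in for an external message broker

def producer():
    for i in range(5):
        tasks.put(f"resize-image-{i}")   # enqueue and return immediately

def consumer():
    while True:
        job = tasks.get()
        if job is None:                  # sentinel: shut down the worker
            break
        print(f"processing {job}")
        tasks.task_done()

worker = threading.Thread(target=consumer)
worker.start()
producer()
tasks.join()        # block until every enqueued job has been processed
tasks.put(None)
worker.join()
```
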
  • Publisher–Subscriber (Pub/Sub) Model:

    • Definition: A messaging pattern where "publishers" send messages without targeting specific "subscribers." Instead, messages are sent to "topics" or "channels." "Subscribers" express interest in one or more topics and receive messages sent to those topics.
    • Characteristics:
      • Decoupling: Publishers and subscribers are completely decoupled. They don't know about each other's existence.
      • Many-to-Many: Multiple publishers can send to a topic, and multiple subscribers can listen to the same topic.
      • Asynchronous: Communication is typically asynchronous.
    • Message Queue vs Pub/Sub:
      • Queue (Point-to-Point): A message is typically consumed by only one consumer. If multiple consumers listen to a queue, they compete for messages (work distribution).
      • Pub/Sub (Topic-based): A message sent to a topic is delivered to all active subscribers of that topic (broadcast).
      • Some systems (e.g., Kafka, RabbitMQ with fanout exchange) can support both patterns.
    • Use Cases:
      • Event-driven architectures: Notifying multiple services about an event (e.g., "new user registered," "order placed").
      • Real-time data dissemination (e.g., stock price updates, news feeds).
      • Fan-out operations where one event triggers multiple independent actions.
    • Event-Driven Architecture (EDA) Basics:
      • Systems built around producing, detecting, consuming, and reacting to events.
      • Events represent significant occurrences or changes in system state.
      • Services communicate by publishing events and subscribing to events they care about, often via a message broker or event bus.
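
A toy in-process event bus to contrast with the queue sketch above: publishing to a topic delivers the event to every subscriber (broadcast), rather than to one competing consumer.

```python
from collections import defaultdict

class EventBus:
    """Topic-based pub/sub: every subscriber to a topic receives every
    message, unlike a point-to-point queue where consumers compete."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)

bus = EventBus()
bus.subscribe("order.placed", lambda e: print("email service:", e))
bus.subscribe("order.placed", lambda e: print("analytics service:", e))
bus.publish("order.placed", {"order_id": 7})  # both subscribers fire
```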

Step 8: Monoliths vs Microservices

Architectural choices that significantly impact how applications are developed, deployed, scaled, and maintained.

  • Monoliths:

    • Definition: An application built as a single, unified unit. All components (UI, business logic, data access layer) are packaged and deployed together.
    • Pros: Simpler to develop initially, easier testing (all in one place), straightforward deployment (single artifact).
    • Cons: Can become large and complex (difficult to understand and maintain), scaling challenges (must scale the entire application even if only one part is a bottleneck), technology stack is fixed, a bug in one module can bring down the entire application, slower development cycles as codebase grows.
  • Microservices:

    • Definition: An architectural style where an application is composed of small, independent, and loosely coupled services. Each service is self-contained, focuses on a specific business capability, and can be developed, deployed, and scaled independently.
    • Why Microservices?:
      • Modularity: Services are small and focused, easier to understand and manage.
      • Independent Deployability: Changes to one service don't require redeploying the entire application. Faster release cycles.
      • Independent Scalability: Each service can be scaled independently based on its specific needs.
      • Technology Diversity: Different services can be built using different technologies/languages best suited for their task.
      • Resilience/Fault Isolation: Failure in one service (if designed well) might not bring down the entire application.
      • Team Autonomy: Small, focused teams can own individual services.
    • Challenges:
      • Distributed System Complexity: Managing multiple services introduces complexities of distributed systems (network latency, fault tolerance, message passing).
      • Operational Overhead: More services to deploy, monitor, and manage. Requires robust automation (CI/CD, infrastructure-as-code).
      • Inter-service Communication: Need to handle network calls, serialization, potential failures.
      • Distributed Transactions: Complex to implement; often avoided in favor of eventual consistency and sagas.
      • Testing: End-to-end testing becomes more complex.
      • Service Discovery: How services find each other.
      • Data Consistency: Maintaining consistency across services with separate databases.
  • Single Point of Failure (SPOF):

    • Definition: A component in a system whose failure will cause the entire system to fail.
    • Designing for High Availability:
      • Redundancy: Having multiple instances of critical components (servers, databases, load balancers, etc.). If one fails, others take over.
      • Failover: Mechanisms to automatically switch to a redundant component upon failure.
      • No SPOFs: Analyze the architecture to identify and eliminate any single points of failure. This applies to hardware, software, network paths, and even data centers (via multi-region deployments).
      • Microservices Context: While individual microservices can fail, the goal is that the overall application remains partially or fully functional. However, critical shared services like an API Gateway or service discovery can become SPOFs if not made redundant.
  • Avoiding Cascading Failures:

    • Definition: A failure in one part of a system triggers failures in other dependent parts, potentially leading to a widespread outage.
    • Techniques:
      • Circuit Breakers:
        • Pattern: Wraps calls to a remote service. If the service becomes unresponsive or returns too many errors, the circuit breaker "opens," and subsequent calls fail immediately (or return a fallback) without attempting to contact the failing service. After a timeout, it enters a "half-open" state to test if the service has recovered.
        • Benefits: Prevents overwhelming a struggling service, reduces client-side latency, allows services to recover.
      • Fallbacks: Provide a default response or alternative functionality if a service call fails (e.g., show cached data, a generic message).
      • Timeouts: Set aggressive timeouts for inter-service calls. Prevents threads from being blocked indefinitely waiting for a slow or unresponsive service.
      • Retries (with Exponential Backoff & Jitter): Retry failed calls, but with increasing delays between retries and some randomness (jitter) to avoid thundering herd problems.
      • Bulkheads: Isolate resources (e.g., thread pools, connection pools) used for different services. Failure or overload in one service interaction doesn't exhaust resources needed for others.
      • Rate Limiting/Throttling: (As discussed earlier) Prevents services from being overwhelmed.
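
A simplified circuit-breaker sketch (the thresholds are illustrative, and a production version would also need thread safety): failures past a limit open the circuit, subsequent calls fail fast to a fallback, and after a timeout one trial call probes for recovery.

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, fallback=None):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback     # open: fail fast, spare the service
            self.opened_at = None   # half-open: allow one trial call

        try:
            result = fn(*args)
            self.failures = 0       # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback
```
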
  • Containerization (e.g., Docker):

    • Definition: A lightweight form of virtualization that allows applications and their dependencies to be packaged into isolated units called containers. Containers run on a shared OS kernel but have their own isolated user space.
    • Docker: A popular platform for building, shipping, and running containers.
    • Benefits:
      • Consistency: "Works on my machine" problem solved. Same environment from development to production.
      • Portability: Containers can run on any system that supports the containerization engine.
      • Isolation: Processes in a container are isolated from others on the host and other containers.
      • Efficiency: Lightweight compared to VMs (share OS kernel, faster startup).
      • Rapid Deployment & Scalability: Easy to spin up new instances of services.
      • Microservices Enabler: Ideal for packaging and deploying individual microservices.
    • Container Orchestration (e.g., Kubernetes, Docker Swarm, AWS ECS): Tools for automating the deployment, scaling, management, and networking of containerized applications.
  • Migrating to Microservices (from Monolith):

    • Challenges: Risky, time-consuming, requires cultural and organizational changes.
    • Strategies:
      • Strangler Fig Pattern: Gradually build new functionality as microservices around the existing monolith. Route traffic to new services for these features. Over time, functionality is "strangled" out of the monolith until it can be retired.
      • Break out Bounded Contexts: Identify distinct business capabilities (bounded contexts in Domain-Driven Design) within the monolith and extract them as separate services one by one.
      • Start with New Features: Implement all new features as microservices from the outset, integrating them with the monolith as needed.
      • Anti-Corruption Layer: An adapter/facade between the monolith and new microservices to prevent tight coupling and translate data models.
    • Considerations: Define clear service boundaries, establish robust inter-service communication, implement comprehensive monitoring and logging, invest in CI/CD automation.

Step 9: Monitoring & Logging

Essential for understanding system behavior, diagnosing problems, ensuring performance, and detecting security incidents.

  • Logging & Metrics:

    • Logging:
      • Definition: Recording discrete events that occur within the system (e.g., errors, warnings, informational messages, requests).
      • What to Log:
        • Requests (URL, method, status code, latency, user ID).
        • Errors and exceptions (with stack traces).
        • Key business events (e.g., order placed, payment processed).
        • Security-relevant events (e.g., login attempts, permission changes).
        • Service lifecycle events (startup, shutdown).
      • Best Practices:
        • Structured Logging: Log messages in a consistent format (e.g., JSON) with key-value pairs. Easier to parse, search, and analyze (see the sketch after this list).
        • Log Levels: Use appropriate levels (DEBUG, INFO, WARN, ERROR, CRITICAL) to filter and prioritize.
        • Correlation IDs: Include a unique ID that traces a request across multiple services/components.
        • Centralized Logging: Ship logs from all services to a central logging system (e.g., ELK Stack - Elasticsearch, Logstash, Kibana; Splunk; Grafana Loki).
        • Avoid Sensitive Data: Don't log PII, passwords, API keys, etc.
    • Metrics (Monitoring):
      • Definition: Numerical measurements of system health and performance over time (time-series data).
      • What to Monitor:
        • System-level Metrics (Host/OS): CPU utilization, memory usage, disk I/O, disk space, network traffic.
        • Application-level Metrics: Request rate, error rate, latency (average, percentiles like p95, p99), throughput, queue lengths, connection pool usage.
        • Business Metrics: Number of active users, transactions per second, revenue, conversion rates.
        • Database Metrics: Query latency, connection count, replication lag.
        • Cache Metrics: Hit/miss ratio, evictions.
      • Tools:
        • Collection: Prometheus, Telegraf, StatsD.
        • Storage: Time-series databases (TSDBs) like Prometheus, InfluxDB, TimescaleDB.
        • Visualization & Alerting: Grafana, Kibana, Prometheus Alertmanager.
      • Dashboards: Visual displays of key metrics to provide an overview of system health.
      • Alerting: Automated notifications when metrics cross predefined thresholds or anomalies are detected.
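
The structured-logging sketch referenced above, using only Python's standard logging module: a formatter that emits JSON, with a correlation ID attached via the extra parameter. The field names are illustrative.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # ties one request's log lines together across services
            "correlation_id": getattr(record, "correlation_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order placed", extra={"correlation_id": "req-7f3a"})
```
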
  • Anomaly Detection:

    • Definition: The process of identifying data points, events, or observations that deviate significantly from the normal behavior of a system.
    • Purpose: Early detection of problems, security breaches, performance degradation, or unusual patterns.
    • In Logs:
      • Sudden increase in error logs.
      • Appearance of unusual log messages.
      • Changes in log volume or patterns.
    • In Metrics:
      • Spikes or drops in request rates, latency, error rates.
      • Unusual resource utilization (CPU, memory).
      • Deviations from historical trends or seasonal patterns.
    • Techniques:
      • Statistical Methods: Thresholding (static or dynamic), moving averages, standard deviations.
      • Machine Learning: Clustering, classification, forecasting models (e.g., ARIMA, Prophet) to identify deviations from predicted behavior.
      • Rule-based Systems: Define rules based on domain knowledge.
    • Challenges: Defining "normal," minimizing false positives (alert fatigue) and false negatives, adapting to evolving system behavior.
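
A minimal static-threshold example of the statistical approach above: flag a metric sample that sits more than a few standard deviations from the recent mean. The latency numbers are illustrative.

```python
import statistics

def is_anomalous(history, value, threshold=3.0):
    """Flag a sample more than `threshold` standard deviations from
    the mean of recent history (a static z-score rule)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold

latencies_ms = [102, 98, 105, 99, 101, 97, 103, 100]
print(is_anomalous(latencies_ms, 104))  # False: normal variation
print(is_anomalous(latencies_ms, 450))  # True: likely an incident
```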

Step 10: Security

Building secure systems is paramount to protect data, ensure user trust, and comply with regulations. Security should be considered at every stage of design and development ("Shift Left").

  • Tokens for Auth (Authentication & Authorization):

    • Authentication (AuthN): Verifying who a user or service is.
    • Authorization (AuthZ): Determining what an authenticated user or service is allowed to do.
    • Tokens: Pieces of data that carry identity and/or permission information, typically issued by an authentication server after successful login.
    • JWT (JSON Web Token):
      • Definition: An open standard (RFC 7519) for creating self-contained access tokens that assert claims as a JSON object. They are signed (e.g., HMAC or RSA) to ensure integrity and authenticity, and can be optionally encrypted.
      • Structure: Header (algorithm, token type), Payload (claims like user ID, roles, expiry), Signature.
      • Stateless: The server doesn't need to store session state for JWTs; it just validates the signature. Good for microservices (see the sketch below).
      • Pros: Stateless, self-contained, widely adopted.
      • Cons: Cannot be easily revoked before expiry (unless a blacklist is maintained, which adds state), can grow large if many claims are included.
    • Opaque Tokens (Reference Tokens):
      • Definition: Random, unique strings that don't contain user information directly. They act as a reference to session data stored on the server-side (e.g., in a database or cache).
      • Stateful: The server must look up the token in its store to validate it and retrieve associated user/session info.
      • Pros: Can be easily revoked, don't expose user info in the token itself, smaller size.
      • Cons: Requires server-side storage and lookup (adds latency, potential SPOF for the session store), less suitable for purely stateless architectures.
    • Usage: Typically sent in the Authorization HTTP header (e.g., Authorization: Bearer <token>).
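
A minimal sketch of issuing and validating a JWT with the PyJWT library; the secret, claims, and 15-minute lifetime are illustrative assumptions:

```python
import time

# pip install PyJWT
import jwt

SECRET = "change-me"   # in practice, load from a secrets manager

def issue_token(user_id: str, roles: list) -> str:
    # The payload carries the claims; the signature makes them tamper-evident.
    now = int(time.time())
    payload = {"sub": user_id, "roles": roles, "iat": now, "exp": now + 900}
    return jwt.encode(payload, SECRET, algorithm="HS256")

def validate_token(token: str) -> dict:
    # Stateless: no session lookup; raises jwt.ExpiredSignatureError or
    # jwt.InvalidSignatureError if the token is stale or tampered with.
    return jwt.decode(token, SECRET, algorithms=["HS256"])

token = issue_token("user-123", ["reader"])
print(validate_token(token)["sub"])   # -> user-123
```
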
  • SSO & OAuth:

    • SSO (Single Sign-On):
      • Definition: An authentication scheme that allows a user to log in with a single ID and password to gain access to multiple, independent software systems.
      • How it Works: User authenticates with a central Identity Provider (IdP). The IdP then issues an assertion (e.g., SAML assertion, OIDC ID Token) that Relying Parties (RPs or service providers) can trust to grant access.
      • Benefits: Improved user experience, centralized identity management, better security control.
      • Protocols: SAML (Security Assertion Markup Language), OpenID Connect (OIDC).
    • OAuth 2.0 (Open Authorization):
      • Definition: An authorization framework (not authentication) that enables third-party applications to obtain limited access to an HTTP service, either on behalf of a resource owner or by allowing the third-party application to obtain access on its own behalf.
      • Purpose: Delegated authorization. E.g., allowing a "photo printing app" to access your photos on "Google Photos" without giving it your Google password.
      • Roles:
        • Resource Owner: User who owns the data.
        • Client: Third-party application requesting access.
        • Authorization Server: Issues access tokens to the client after successful authentication of the resource owner and obtaining consent.
        • Resource Server: Hosts the protected resources (e.g., API). Validates access tokens.
      • Flows (Grant Types): Authorization Code (most common for web apps; see the sketch below), Implicit (legacy, less secure), Resource Owner Password Credentials (for trusted clients), Client Credentials (for machine-to-machine).
      • Access Tokens: Issued by the Authorization Server to the Client. Client uses it to access Resource Server.
    • OpenID Connect (OIDC):
      • Definition: An identity layer built on top of OAuth 2.0. It allows clients to verify the identity of the end-user based on the authentication performed by an Authorization Server, as well as to obtain basic profile information.
      • Adds Authentication to OAuth: Provides an ID Token (a JWT) in addition to an Access Token.
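
A minimal client-side sketch of the OAuth 2.0 Authorization Code flow; every endpoint, client ID, and scope below is a hypothetical placeholder, not any real provider's API:

```python
import secrets
from urllib.parse import urlencode

# pip install requests
import requests

# All endpoints and credentials below are hypothetical placeholders.
AUTH_URL = "https://auth.example.com/oauth/authorize"
TOKEN_URL = "https://auth.example.com/oauth/token"
CLIENT_ID = "photo-print-app"
CLIENT_SECRET = "client-secret-from-vault"
REDIRECT_URI = "https://printapp.example.com/callback"

# Step 1: redirect the resource owner to the Authorization Server for consent.
state = secrets.token_urlsafe(16)   # CSRF protection; verify it on the callback
login_url = AUTH_URL + "?" + urlencode({
    "response_type": "code",
    "client_id": CLIENT_ID,
    "redirect_uri": REDIRECT_URI,
    "scope": "photos:read",
    "state": state,
})

# Step 2: the callback receives ?code=...; exchange it (plus the client's own
# credentials) for an access token.
def exchange_code(code: str) -> str:
    resp = requests.post(TOKEN_URL, data={
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": REDIRECT_URI,
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
    })
    resp.raise_for_status()
    return resp.json()["access_token"]  # then: Authorization: Bearer <token>
```
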
  • Access Control Lists (ACLs) & Rule Engines:

    • Purpose: Enforce authorization – what actions an authenticated subject (user, service) can perform on specific objects (resources).
    • Access Control Lists (ACLs):
      • Definition: A list of permissions attached to an object. Each entry specifies a subject and the operations they are allowed (or denied) on that object.
      • Pros: Simple, fine-grained control.
      • Cons: Can be difficult to manage at scale (many objects, many users -> many ACL entries), hard to get an overview of a user's permissions.
    • Role-Based Access Control (RBAC):
      • Definition: Permissions are associated with roles, and users are assigned to roles. Users inherit permissions from their assigned roles.
      • Pros: Easier to manage than ACLs (manage roles, not individual user-object permissions), reflects organizational structures.
      • Cons: Can be less flexible for very fine-grained or dynamic permissions.
    • Attribute-Based Access Control (ABAC):
      • Definition: Access decisions are based on attributes of the subject, object, requested action, and environment (e.g., time of day, location). Policies (rules) define these conditions.
      • Pros: Highly flexible, fine-grained, context-aware. Can express complex authorization logic.
      • Cons: More complex to design and implement. Evaluating policies can have performance implications.
    • Rule Engines:
      • Definition: Software systems that execute one or more business rules in a runtime production environment. Can be used to implement ABAC or complex authorization logic.
      • How it Works: Rules (e.g., "IF user.department == 'Sales' AND resource.type == 'Contact' AND action == 'Read' THEN PERMIT") are defined, and the engine evaluates them against incoming requests and their attributes (a minimal sketch follows this list).
      • Examples: Drools, Open Policy Agent (OPA).
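
A minimal rule-engine sketch in plain Python, deny-by-default with allow rules; real engines such as OPA or Drools express rules in a dedicated policy language rather than host-language lambdas:

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    subject: dict                                  # user/service attributes
    resource: dict                                 # object attributes
    action: str
    context: dict = field(default_factory=dict)    # e.g., time of day, location

# Each rule is a predicate over the whole request; access is denied unless
# some rule explicitly permits it (deny-by-default).
RULES = [
    lambda r: (r.subject.get("department") == "Sales"
               and r.resource.get("type") == "Contact"
               and r.action == "read"),
    lambda r: "admin" in r.subject.get("roles", []),
]

def is_permitted(request: Request) -> bool:
    return any(rule(request) for rule in RULES)

req = Request(subject={"department": "Sales", "roles": []},
              resource={"type": "Contact"}, action="read")
print(is_permitted(req))   # -> True
```
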
  • Encryption:

    • Purpose: Protect data confidentiality and integrity.
    • TLS (Transport Layer Security) / SSL (Secure Sockets Layer):
      • Definition: Cryptographic protocols that provide secure communication over a computer network (e.g., for HTTPS, FTPS, SMTPS).
      • Ensures:
        • Encryption: Data is unreadable to eavesdroppers (data-in-transit).
        • Authentication: Verifies server identity (and optionally client identity) using certificates.
        • Integrity: Detects if data has been tampered with using MACs.
      • Handshake: A process to negotiate cipher suites, exchange keys, and authenticate.
    • Data-at-Rest Encryption:
      • Definition: Encrypting data that is stored on disk (e.g., in databases, file systems, backups).
      • Methods:
        • Full-Disk Encryption (FDE): Encrypts entire drive (e.g., BitLocker, LUKS). Protects if physical disk is stolen.
        • Database Encryption:
          • Transparent Data Encryption (TDE): Encrypts database files at rest. The DB handles encryption/decryption transparently.
          • Column-level/Field-level Encryption: Encrypts specific sensitive columns/fields within the database. The application might need to handle encryption/decryption itself (see the sketch at the end of this step).
        • File/Folder Encryption: Encrypts individual files or folders.
      • Key Management: Crucial. Securely storing, managing, and rotating encryption keys is vital (e.g., using Hardware Security Modules - HSMs, or managed services like AWS KMS, Azure Key Vault).
    • Data-in-Transit Encryption:
      • Definition: Encrypting data as it moves between systems (e.g., client to server, server to server).
      • Methods: Primarily TLS/SSL. Also VPNs, SSH, IPsec.
      • Internal Traffic: Important to encrypt data not just over the public internet, but also between services within your own network (defense in depth). Service meshes (e.g., Istio, Linkerd) can help automate mTLS (mutual TLS) for inter-service communication.
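
A minimal field-level (data-at-rest) encryption sketch using the cryptography library's Fernet recipe; generating the key inline is for illustration only, since real keys should come from a KMS or HSM:

```python
# pip install cryptography
from cryptography.fernet import Fernet

# For illustration only: real keys come from a KMS/HSM, never source code.
key = Fernet.generate_key()
fernet = Fernet(key)

# Field-level encryption: only the sensitive column is encrypted before the
# row is written; non-sensitive columns stay queryable in plaintext.
row = {"user_id": 42, "email": fernet.encrypt(b"alice@example.com")}

# Decrypting verifies integrity too: a tampered ciphertext raises InvalidToken.
print(fernet.decrypt(row["email"]).decode())   # -> alice@example.com
```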

Step 11: System Design Tradeoffs

Every system design decision involves trade-offs. There's rarely a single "best" solution; instead, solutions are optimized for specific requirements and constraints. Understanding these trade-offs is key to good design.

  • Push vs Pull Architecture:

    • Push: Server proactively sends data to clients when new information is available.
      • Pros: Real-time updates, lower latency for clients (no polling delay).
      • Cons: Can be more complex for the server to manage connections and client state; potentially higher server load if updates are frequent or there are many clients.
      • Examples: WebSockets, Server-Sent Events (SSE), push notifications.
    • Pull (Polling): Client periodically requests data from the server to check for updates.
      • Pros: Simpler for server, client controls update frequency.
      • Cons: Latency (data is only as fresh as the polling interval), can be inefficient (many requests that return no new data), and can overload the server if the polling interval is too short or there are many clients.
      • Variations: Long Polling (the client's request is held open by the server until data is available or a timeout occurs); see the sketch below.
      • Examples: Client periodically calling an API endpoint.
    • Trade-off: Real-time needs vs server complexity/load, client resource usage.
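
A minimal long-polling client sketch; the endpoint, its cursor/timeout parameters, and the response shape are hypothetical assumptions:

```python
# pip install requests
import requests

# Hypothetical endpoint: the server holds the request open for up to
# `timeout` seconds until new events exist, then responds immediately.
FEED_URL = "https://api.example.com/updates"

def long_poll(cursor=None):
    while True:
        resp = requests.get(
            FEED_URL,
            params={"cursor": cursor, "timeout": 30},
            timeout=35,                 # client timeout > server hold time
        )
        body = resp.json()
        for event in body.get("events", []):
            yield event
        cursor = body.get("cursor", cursor)   # resume where we left off

# Usage: for event in long_poll(): handle(event)
```
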
  • Consistency vs Availability (Revisited):

    • As per CAP Theorem, during a network partition, you must choose between maintaining strong consistency (potentially sacrificing availability) or maintaining availability (potentially sacrificing strong consistency, opting for eventual consistency).
    • Trade-off: How critical is it for all users to see the absolute latest data, versus how critical is it for the system to always be responsive, even if data is slightly stale? Varies by use case (e.g., banking vs social media).
  • SQL vs NoSQL DBs (Revisited):

    • SQL:
      • Pros: Strong consistency (ACID), mature, powerful querying (JOINs), good for structured, relational data.
      • Cons: Harder to scale horizontally, less flexible schema.
    • NoSQL:
      • Pros: High horizontal scalability, flexible schema, good for unstructured/semi-structured data, often higher availability.
      • Cons: Eventual consistency is common, limited JOIN support, varying levels of transactional support.
    • Trade-off: Data model structure, consistency requirements, scalability needs, query complexity, development velocity. Many systems use a polyglot persistence approach (multiple DB types).
  • Memory vs Latency:

    • General Principle: Using more memory (RAM) can often reduce latency.
    • Caching: Storing frequently accessed data in memory (a cache) is much faster than retrieving it from disk or a remote service.
    • In-Memory Databases: Offer the lowest latency by keeping the entire dataset in RAM.
    • Trade-off: Cost of RAM vs performance gains. How much data can/should be cached? How much memory can be allocated to databases?
    • Consideration: Data that is "hot" (frequently accessed) benefits most from being in memory (see the sketch below).
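
A minimal sketch of trading memory for latency with an in-process cache; the 50 ms "database" round trip is simulated:

```python
import time
from functools import lru_cache

@lru_cache(maxsize=1024)            # trades RAM for latency on repeat lookups
def get_user_profile(user_id: int) -> dict:
    time.sleep(0.05)                # simulated 50 ms database round trip
    return {"id": user_id, "name": f"user-{user_id}"}

for label in ("cold", "hot"):
    start = time.perf_counter()
    get_user_profile(42)            # first call pays the DB cost; second is RAM
    print(f"{label}: {(time.perf_counter() - start) * 1000:.3f} ms")
```
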
  • Throughput vs Latency:

    • Throughput: The amount of work done or data processed per unit of time (e.g., requests/second, messages/second).
    • Latency: The time taken for a single operation or request to complete (e.g., response time).
    • Relationship: Often inversely related, but not always.
      • Optimizing for low latency might involve processing each request immediately, potentially limiting overall throughput if resources are constrained.
      • Optimizing for high throughput might involve batching requests or parallel processing, which can sometimes increase the latency for individual requests within a batch.
    • Example: A message queue can increase throughput by decoupling producer and consumer, but it adds latency as messages wait in the queue (see the batching sketch below).
    • Trade-off: System's primary goal – handling many concurrent users/operations (throughput) or providing fast responses for individual operations (latency). Often a balance is needed.
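
A minimal micro-batching sketch of this trade: larger batches raise throughput (one downstream call per batch) at the cost of up to max_wait extra latency per item; the batch size and wait are arbitrary assumptions:

```python
import queue
import time

def process(batch):
    # Stand-in for one downstream write/RPC covering the whole batch.
    print(f"processed {len(batch)} items in one call")

def batch_consumer(q: queue.Queue, max_batch: int = 100, max_wait: float = 0.05):
    """Drain the queue into batches: up to max_batch items, or whatever
    arrived within max_wait seconds of the first item."""
    while True:
        batch = [q.get()]                       # block until the first item
        deadline = time.monotonic() + max_wait  # latency budget for this batch
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(q.get(timeout=remaining))
            except queue.Empty:
                break
        process(batch)
```
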
  • Accuracy vs Latency:

    • Context: Common in systems involving complex computations, machine learning, recommendations, or search.
    • Principle: More accurate results often require more computation or data, leading to higher latency. Faster results might involve approximations or less data, potentially reducing accuracy.
    • Examples:
      • Search Engines: A very precise search might take longer than a quicker, slightly less relevant one.
      • Recommendation Systems: A highly personalized recommendation model might be slower to compute than a simpler, more general one.
      • Data Analytics: Exact calculations on huge datasets vs. approximate results using sampling (see the sketch below).
    • Trade-off: User experience impact of latency vs. the value of higher accuracy. Can approximate results be "good enough"? Can computations be done offline/asynchronously to improve online latency?
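
A minimal sketch of the sampling trade-off: the approximate answer reads only 1% of the data and is usually within a fraction of a percent of the exact one for data of this shape; the dataset and sample size are arbitrary assumptions:

```python
import random
from statistics import mean

random.seed(7)
data = [random.gauss(100, 15) for _ in range(1_000_000)]

exact = mean(data)                          # touches every element: slow, exact
approx = mean(random.sample(data, 10_000))  # 1% sample: ~100x less work

print(f"exact={exact:.2f}  approx={approx:.2f}")
```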

Step 12: Practice, Practice, Practice

Applying the learned concepts by designing well-known systems is the best way to solidify understanding and develop design intuition. For each system, consider:

  1. Functional Requirements: What should the system do?
  2. Non-Functional Requirements (NFRs): Scalability, availability, latency, consistency, durability, security, cost, maintainability.
  3. Capacity Estimation/Constraints: Users, traffic, data storage.
  4. High-Level Design: Major components and their interactions (block diagram).
  5. API Design: Key API endpoints.
  6. Data Model Design: Choice of database(s), schema.
  7. Detailed Design/Deep Dive: Focus on specific challenging components (e.g., news feed generation, chat message delivery, real-time location tracking).
  8. Identify Bottlenecks & Scale: How to scale different parts of the system.
  9. Trade-offs: Discuss key design decisions and their implications.

Recommended systems to design:

  1. YouTube: Video uploading, processing, storage, streaming, recommendations, comments, search. (Key challenges: video storage/CDN, transcoding, recommendation engine, scaling).
  2. Twitter/X: Tweeting, timeline generation, followers, search, trending topics. (Key challenges: fan-out for tweets, real-time timeline, scaling reads/writes).
  3. WhatsApp/Messaging App: Real-time chat (1-1, group), presence (online/offline, last seen), media sharing, end-to-end encryption. (Key challenges: message routing/delivery, presence, scaling connections, E2EE).
  4. Uber/Ride-Sharing: Location tracking, driver-rider matching, pricing, payments, reviews. (Key challenges: real-time location, geospatial indexing, matching algorithm, dynamic pricing).
  5. Amazon/E-commerce: Product catalog, search, recommendations, shopping cart, orders, payments, inventory, reviews. (Key challenges: scaling product catalog, inventory consistency, recommendation, order processing).
  6. Dropbox or Google Drive/File Sync: File storage, synchronization across devices, sharing, versioning. (Key challenges: efficient sync algorithm, conflict resolution, large file handling, storage scalability).
  7. Netflix/Streaming Service: Video catalog, streaming, recommendations, user profiles, adaptive bitrate. (Key challenges: video encoding/DRM, CDN, recommendation, personalized experience, scaling for peak demand).
  8. Instagram/Photo Sharing: Image/video upload, feed generation, followers, stories, search, explore. (Key challenges: image/video processing/storage, feed ranking, scaling for viral content).
  9. Zoom/Video Conferencing: Real-time audio/video, screen sharing, chat, recording, large meetings/webinars. (Key challenges: low-latency AV streaming, media mixing/routing (SFU/MCU), scaling for many participants, NAT traversal).
  10. Booking.com or Airbnb/Reservation System: Search listings, availability checking, booking, payments, reviews. (Key challenges: managing inventory/availability in real-time, search with many filters, handling concurrent bookings).