# Scaling Reliably

How do we know if a system is scaling reliably? A system is considered reliable if it can consistently perform and meet requirements despite changes in its environment over time. Achieving this requires the system to detect failures, self-heal automatically, and scale according to demand.

The CAP theorem, formulated by computer scientist Eric Brewer, is a popular and fairly useful way to think about tradeoffs in the guarantees a distributed system design makes: of the following three properties, a system can fully provide at most two.

<figure><img src="https://1588585907-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MTwgToRvLjYdjfpAVgP%2Fuploads%2FwkVHSONMS5jVeaCJYA8J%2Fimage.png?alt=media&#x26;token=748ce16f-6b60-46f0-ad6f-518a5aac71f0" alt="" width="304"><figcaption><p>CAP Theorem</p></figcaption></figure>

* Consistency: all nodes see the same data at the same time.
* Availability: node failures do not prevent survivors from continuing to operate.
* Partition tolerance: the system continues to operate despite message loss due to network and/or node failure.

<figure><img src="https://1588585907-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MTwgToRvLjYdjfpAVgP%2Fuploads%2FGUsmkofuOX549fTQAY0B%2Fimage.png?alt=media&#x26;token=0c67cc4b-fdc0-4c59-98d8-ae6fd9de57c5" alt=""><figcaption></figcaption></figure>

## Configuration Management

### **Terraform**:

* Use to define infrastructure as code (IaC).
* **Terragrunt** to keep things DRY (Don't repeat yourself).

  <figure><img src="https://1588585907-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MTwgToRvLjYdjfpAVgP%2Fuploads%2Fys4BlBNnuOo9eGnMOdLI%2Fimage.png?alt=media&#x26;token=65844a98-584e-4c4a-890c-64726f687a86" alt=""><figcaption></figcaption></figure>
* Example: [Multi-region deployments with provider configurations](https://github.com/chrismarget/multi-region-terraform-example) (using **alias**).

  ```hcl
  # The default provider configuration; resources that begin with `aws_` will use
  # it as the default, and it can be referenced as `aws`.
  provider "aws" {
    region = "us-east-1"
  }

  # Additional provider configuration for west coast region; resources can
  # reference this as `aws.west`.
  provider "aws" {
    alias  = "west"
    region = "us-west-2"
  }
  ```

### **Ansible/Chef/Puppet**:

* Automate configuration and ensure consistency across environments.

### **Git**:

* Version control for infrastructure and configuration files.
* Example: GitOps CI/CD (merging to main deploys to one environment, Git tags deploy to another, etc.)

### **LaunchDarkly:**

* **User Segmentation:** Target specific users or user segments based on attributes such as location, role, or subscription tier.
* **Custom Conditions:** Use logic to roll out the feature only to users who meet specific criteria.
* **Percentage Rollouts:** Gradually enable the feature for a small percentage of users and incrementally increase the percentage.
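
Percentage rollouts are typically implemented with deterministic hash-based bucketing, so a given user stays in (or out of) the rollout as the percentage grows. This is a minimal sketch of the idea, not LaunchDarkly's actual algorithm; the flag key and user IDs are illustrative:

```python
import hashlib

def in_rollout(flag_key: str, user_id: str, pct: int) -> bool:
    """Deterministically bucket a user into [0, 100) and compare to pct."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < pct

# The same user always lands in the same bucket, so a user enabled at a
# 10% rollout stays enabled when the rollout grows to 50%.
enabled = in_rollout("new-checkout", "user-42", 10)
```

Because the bucket depends only on the flag key and user ID, increasing `pct` only ever adds users to the rollout; it never flips already-enabled users off.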

***

## Kubernetes (K8s)

### **Multi-region Clusters**:

* Use cluster federation for workload distribution (**KubeFed** was the official Kubernetes tool for this, though the project has since been archived).

<div align="left"><figure><img src="https://1588585907-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MTwgToRvLjYdjfpAVgP%2Fuploads%2FZIOnvrgQyO6rRafXfBAf%2Fimage.png?alt=media&#x26;token=d4663f95-e6b5-493c-8be3-4536c6875ca5" alt="" width="375"><figcaption></figcaption></figure></div>

* **What:** Cluster federation is a method in Kubernetes to combine multiple clusters across different regions into a single super-cluster that can be controlled through a single interface.
  * The Kubernetes Control Plane handles apps across a group of worker nodes. The Federated Control Plane does the same thing, but it does it across a group of clusters instead of nodes.
* **Why:** Allows you to schedule workloads (apps/pods) dynamically across clusters based on factors like geographic location, resource availability, or latency requirements.
  * Aims to provide **high availability**
* **Examples**: If you have users in North America and Europe, federation can ensure workloads are distributed to clusters closest to users, reducing latency.
  * **Namespace**: Creates a namespace with the same name in all clusters.
  * **ConfigMap**: Replicated in all clusters within the same namespace.
  * **Deployments/ReplicaSets**: Adjusted to distribute replicas fairly across clusters (can be customized).
    * A federated Deployment with 10 replicas spreads these 10 pods fairly across clusters (not 10 pods per cluster).
  * **Ingress**: Creates a unified access point across clusters, not individual Ingress objects.
* Set up regional failover with DNS routing (e.g., Route 53).

### **Scheduling Pods Across Nodes and AZs**

* This method helps distribute replicas across multiple nodes and availability zones (AZs) for better fault tolerance and availability.
* **Anti-Affinity**: Add `podAntiAffinity` to a **Deployment** resource to avoid placing replicas on the same node or availability zone.
* **Weight**: `weight: 100` increases the preference for spreading across AZs.
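
As a sketch, the anti-affinity described above might appear in a Deployment's pod template like this (the `app: web` label is illustrative, and the weights are tunable):

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100          # prefer spreading replicas across AZs
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: web
                topologyKey: topology.kubernetes.io/zone
            - weight: 50           # also prefer different nodes within an AZ
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: web
                topologyKey: kubernetes.io/hostname
```

Using `preferredDuringSchedulingIgnoredDuringExecution` (soft) rather than the `required` variant lets the scheduler still place pods when a zone or node is full.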

#### Kafka-Specific Deployments

* **Partitioning**:
  * Distribute partitions across brokers for load balancing.
* **Replication**:
  * Configure replication factor for fault tolerance.
  * Multi-region replication for disaster recovery.
* **Producers/Consumers**:
  * Optimize producer ACK settings for latency vs. durability.
  * Monitor consumer lag (e.g., with Burrow or the `kafka-consumer-groups` CLI).
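
Consumer lag itself is simple arithmetic: for each partition, the log-end offset (last message written) minus the consumer group's committed offset. A sketch with hypothetical offsets for a 3-partition topic:

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: messages written but not yet consumed."""
    return {
        p: log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    }

# Hypothetical snapshot: partition 2 is 240 messages behind. A lag that
# grows over time means consumers are not keeping up with producers.
lag = consumer_lag({0: 1500, 1: 980, 2: 2040}, {0: 1500, 1: 950, 2: 1800})
```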

***

## Observability

### Logging

* **Centralized Systems**:
  * ELK Stack (Elasticsearch, Logstash, Kibana).
  * Splunk for large-scale log management.
* **Structured Logging**:
  * Use JSON formatting for better parsing.
  * Include trace IDs for distributed tracing.
* **Retention Policies**:
  * Configure log rotation and archive policies to manage costs.

### Monitoring

#### **What to monitor**

* Four Golden Signals: Latency, Traffic, Error rates, Saturation.
* **Errors** golden signal measures the rate of requests that fail.
  * Monitoring systems usually focus on the error rate, calculated as the percentage of failing calls out of the total.
* **Latency** is defined as the time it takes to serve a request.
  * Use Histograms (Ex: P99)
  * Talk by Gil Tene: <https://youtu.be/lJ8ydIuPFeU?t=711>
    * "Typical user sessions involve 5 page loads with over 40 resources per page"
      * Given this, some percentiles like P95 might be useless to monitor
      * Instead of focusing solely on a few percentiles (e.g., P50, P90, P99), use **HDR histograms** (High Dynamic Range histograms) to capture the full distribution of latencies. These histograms can show the range of latencies across the system.
* Visualize **saturation** and understand the resources involved in your solution. What are the consequences if a resource gets depleted?
* **Traffic** measures the amount of use of your service per time unit.
  * The amount of users by time of day might be relevant for example
* **Healthchecks**:
  * Use Startup/Readiness/Liveness probes
    * Startup: For containers that take a long time to start
    * Readiness: the database is down, a mandatory auth dependency is down, or the app is already serving its maximum of X connections
      * In many cases, a `readinessProbe` is all we need
    * Liveness: Deadlocks, internal corruption, etc.
      * *This container is dead, we don't know how to fix it*
    * Types of probes:
      * httpGet, exec, tcpSocket, grpc, etc.
      * All probes return a binary result (success/fail)
  * You can add a “ping” endpoint for an HTTP service that returns a 200 when called.
* **Infrastructure Metrics** (CPU, Memory, IO, Network)
* **Business Metrics**: Things like the number of completed orders
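
Gil Tene's point about percentiles can be made concrete with a little arithmetic: if a page load fetches 40 resources and each one independently has a 5% chance of landing in the P95 tail (independence is a simplifying assumption), very few page loads escape the tail entirely:

```python
# Probability that ALL 40 resources on a page come in under the P95 latency,
# assuming each resource's latency is independent (a simplification).
p_all_fast = 0.95 ** 40
print(f"{p_all_fast:.1%} of page loads avoid the P95 tail entirely")
# Roughly 13%, i.e. most page loads hit at least one >P95 resource --
# which is why tail percentiles (or full HDR histograms) matter.
```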

#### **Tools**

* **Prometheus** is an open-source monitoring system including:
  * multiple *service discovery* backends to figure out which metrics to collect
  * a *scraper* to collect these metrics
  * an efficient *time series database* to store these metrics
  * a specific query language (PromQL) to query these time series
  * an *alert manager* to notify us according to metrics values or trends
* Grafana for visualization.
* Datadog/CloudWatch for cloud-native monitoring.
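
As an example of PromQL, the error-rate golden signal is commonly computed as the ratio of 5xx requests to all requests. Note that the `http_requests_total` metric name and `status` label are common conventions, not guarantees; they depend on how your services are instrumented:

```promql
# Fraction of requests failing over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
```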

**Key Metrics**:

* SLIs: Latency, Traffic, Error rates, Saturation.
* SLOs: Define acceptable thresholds (e.g., 99.9% uptime).

  * Identify what users care about most when interacting with the service. Common factors include:
    * **Availability**: Users expect services to be available when needed.
    * **Latency**: Users want fast responses.
    * **Correctness**: Users expect the service to function as intended without errors.

  <div align="left"><figure><img src="https://1588585907-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MTwgToRvLjYdjfpAVgP%2Fuploads%2FWwMPnLKNdG5SNCngurBG%2Fimage.png?alt=media&#x26;token=776f9d60-6375-4c6c-a0d0-f5c0cf814bc7" alt="" width="375"><figcaption></figcaption></figure></div>
* SLAs: Contracts outlining the consequences if SLOs aren’t met
* **Distributed Tracing**:
  * Tools: Jaeger, OpenTelemetry.
  * Trace requests across microservices to identify bottlenecks.

### Alerting

* **Strategies**:
  * Use anomaly detection for proactive alerts.
  * Tiered alerts to reduce noise (e.g., warnings vs. critical).
* **Tools**:
  * PagerDuty, Opsgenie for incident response.

***

## **Load Balancers**

Load Balancers are essential for horizontally scaling applications. They distribute incoming traffic across multiple backend resources.

<div align="left"><figure><img src="https://1588585907-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MTwgToRvLjYdjfpAVgP%2Fuploads%2FsqM9c4YtpIRSK6As5IAX%2Fimage.png?alt=media&#x26;token=72e9740e-587f-47ed-bfaa-f941b1d12e55" alt="" width="375"><figcaption><p>Difference between Load Balancer and Reverse Proxy</p></figcaption></figure></div>

#### **Advantages of Load Balancing:**

* Load balancers minimize server response time and maximize throughput.
* Load balancers ensure high availability and reliability by sending requests only to healthy, online servers.
* Load balancers run continuous health checks to monitor each server's capacity to handle requests.

**Types of Load Balancer Algorithms:**

<div align="left"><figure><img src="https://1588585907-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MTwgToRvLjYdjfpAVgP%2Fuploads%2FWvKJU9Iee5vBqw25AZZE%2Fimage.png?alt=media&#x26;token=6cfa9aac-c2d4-4a5f-992c-7a8f42281c95" alt="" width="375"><figcaption></figcaption></figure></div>

<div align="left"><figure><img src="https://1588585907-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MTwgToRvLjYdjfpAVgP%2Fuploads%2FpWiXWL7q0S1GbyvNbXRq%2Fimage.png?alt=media&#x26;token=239cc35e-248f-4105-afee-7919e361c186" alt="" width="375"><figcaption><p>Resource Based Load Balancer</p></figcaption></figure></div>
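
The figures above cover the standard algorithms; as a minimal sketch, round robin and least-connections selection can be expressed in a few lines (server names and connection counts are illustrative):

```python
from itertools import cycle

servers = ["app-1", "app-2", "app-3"]

# Round robin: rotate through servers regardless of load.
rr = cycle(servers)
order = [next(rr) for _ in range(4)]   # app-1, app-2, app-3, app-1

# Least connections: pick the server with the fewest active connections.
active = {"app-1": 12, "app-2": 3, "app-3": 7}
target = min(active, key=active.get)   # app-2
```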

**Types of Load Balancers:**

[**https://www.geeksforgeeks.org/layer-4-load-balancing-vs-layer-7-load-balancing/?ref=next\_article**](https://www.geeksforgeeks.org/layer-4-load-balancing-vs-layer-7-load-balancing/?ref=next_article)

1. **Network Load Balancer (NLB):**
   * **Layer**: Layer 4 (Transport)
   * **Function**: Forwards incoming traffic to backend targets (e.g., EC2 instances, pods) without modifying the packet headers, preserving the source IP.
   * **Use Case**: Ideal for **TCP/UDP traffic**.
2. **Application Load Balancer (ALB):**
   * **Layer**: Layer 7 (Application)
   * **Function**: Can inspect HTTP/HTTPS requests, enabling routing based on URL paths, headers, and other attributes. Essentially, it acts as a reverse proxy (e.g., HAProxy, NGINX).
   * **Use Case**: Best for **HTTP traffic**, with support for advanced routing rules.

***

#### **How This Applies to EKS (Kubernetes)**

In an EKS environment, AWS Load Balancers work with Kubernetes Services and Ingress objects to manage traffic routing.

* **Service of type LoadBalancer**: When you create a Service with this type in Kubernetes, the **AWS Load Balancer Controller** sets up an **NLB** to manage Layer 4 traffic.
* **Ingress Object**: If you create an Ingress object, the controller sets up an **ALB** to manage Layer 7 traffic, supporting features like path-based routing or hostname-based routing to specific services.
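
A sketch of the Service case (the annotation key is used by the AWS Load Balancer Controller, but supported values vary by controller version, and the `web` names are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    # Asks the AWS Load Balancer Controller (not the legacy in-tree
    # controller) to provision an NLB for this Service.
    service.beta.kubernetes.io/aws-load-balancer-type: external
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```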

***

#### **Multi-Region Load Balancing**

1. **High Availability**: Multi-region deployments ensure that your application can continue operating even if one region faces issues (e.g., outages, natural disasters).
2. **Latency Optimization**: Routing traffic to the closest region reduces latency, improving performance.
3. **Route 53 Latency-Based Routing**: AWS Route 53 can route traffic to the region with the lowest latency for users, enhancing the user experience.

***

#### **Reliability and Failover Mechanisms**

1. **Load Balancer Performance**: Access logs provide insights into traffic distribution, latency, and errors. This data is crucial for monitoring performance and troubleshooting.
2. **Security and Compliance**: Logs help in detecting security threats (e.g., DDoS attacks) and ensuring compliance with standards like PCI-DSS or HIPAA.
3. **Health Checks**: Both ALB and NLB use health checks to verify if a target (pod, EC2 instance) is healthy. Unhealthy targets are automatically excluded from traffic routing, ensuring reliability.
4. **Auto Scaling**: Kubernetes and AWS auto scaling mechanisms add more instances or pods when traffic increases, maintaining application performance.
5. **Cross-AZ Failover**: Multi-AZ deployments automatically reroute traffic to healthy resources in other Availability Zones if one AZ becomes unavailable, ensuring high availability.

## Databases

### **Performance Optimization**

* **Slow Queries**
  * Use query analyzers to identify slow queries.
  * Add custom tags or identifiers in SQL queries for tracking application-level operations.
  * Use indexes, avoid full table scans, and rewrite expensive queries.
* **Key Metrics to Monitor**
  * **CPU**: Track overall CPU usage and per-query CPU utilization for performance insights.
  * **IOPS**: Monitor read/write to evaluate workload distribution and storage performance.
    * **Optimize Data Access**: Reduce unnecessary reads/writes by denormalizing data, partitioning tables, or archiving unused data.
    * **Increase IOPS Capacity**: Use faster storage (e.g., SSDs over HDDs) or provision higher IOPS for better performance.
    * **Buffering/Batching**: Group writes into larger, less frequent operations to reduce I/O operations
* **Scaling**:
  * **Vertical Scaling**: Increase instance size for better resource allocation.
  * **Horizontal Scaling**: Add replicas or shards for improved database distribution.
  * **Caching**: Use Redis or Memcached to offload repeated queries and reduce load on the database.
  * **Change Data Capture (CDC):** CDC enables efficient synchronization of data between systems without requiring heavy full-database reads or disruptive batch updates.
    * Example of using CDC for a search service that uses ElasticSearch

      <div align="left"><figure><img src="https://1588585907-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MTwgToRvLjYdjfpAVgP%2Fuploads%2FnCym8ddMl4iyjLGfykBI%2Fimage.png?alt=media&#x26;token=5b62a604-5010-4b79-86fb-8497399ff9df" alt="" width="375"><figcaption></figcaption></figure></div>
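
The buffering/batching idea above can be sketched as a buffer that accumulates writes and flushes them in groups; here the `flush` callback stands in for a real bulk insert or bulk index call:

```python
class WriteBuffer:
    """Accumulate writes and flush them in batches to cut I/O operations."""

    def __init__(self, flush, batch_size=100):
        self.flush = flush          # e.g. a bulk INSERT / bulk index call
        self.batch_size = batch_size
        self.pending = []

    def write(self, row):
        self.pending.append(row)
        if len(self.pending) >= self.batch_size:
            self.flush(self.pending)
            self.pending = []

batches = []
buf = WriteBuffer(batches.append, batch_size=3)
for i in range(7):
    buf.write(i)
# 7 individual writes became 2 batched flushes; one row is still pending.
```

Real implementations also flush on a timer so a quiet period doesn't strand pending writes indefinitely.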

### **Reliability and Availability Enhancements**

* **Database Replication (Master-Slave Structure)**
  * Pros: Improves performance, reliability, and availability
  * Cons: Small delay to replicate data in slave DBs

    <div align="left"><figure><img src="https://1588585907-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MTwgToRvLjYdjfpAVgP%2Fuploads%2FMDHmxBsiZxZMDu2RDP9r%2Fimage.png?alt=media&#x26;token=97beedc7-f3d3-4dc3-9a48-5fdf62d42e1f" alt="" width="375"><figcaption><p>Database Replication (Master-Slave)</p></figcaption></figure></div>
  * **Master Database**: Handles all write operations and propagates changes to slaves.
  * **Slave Databases**: Handle read operations, helping distribute the load and improve reliability.
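
The master-slave split above implies a simple routing rule in the application: writes go to the primary, reads rotate across replicas. A minimal sketch with connection objects replaced by strings:

```python
from itertools import cycle

class ReplicatedRouter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = cycle(replicas)

    def route(self, sql: str) -> str:
        # Writes (and anything ambiguous) must hit the primary. Replicas may
        # lag slightly, so read-your-own-writes needs care in real systems.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self.replicas)
        return self.primary

router = ReplicatedRouter("primary", ["replica-1", "replica-2"])
router.route("INSERT INTO orders VALUES (1)")   # -> primary
router.route("SELECT * FROM orders")            # -> replica-1
```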

***

## Common Interview Scenarios and Responses

1. **Multi-Region Deployment**:
   * Use DNS-based routing (e.g., Route 53) for global traffic management.
   * Deploy services in active-active configuration across regions.
   * Synchronize data with replication (e.g., Kafka, database replicas).
2. **Database Scaling**:
   * Vertical scaling: Add more resources (CPU/RAM) temporarily.
   * Horizontal scaling: Add read replicas for increased throughput.
   * Monitor replica lag and implement query routing strategies.
3. **Logging and Monitoring**:
   * Centralize logs using ELK.
   * Set up Prometheus alerts for latency spikes.
   * Use Jaeger to trace high-latency requests.

## Troubleshooting Bottlenecks

### **Scenario**: High Latency in API Requests

* Check logs for errors or slow endpoints.
* Use APM to profile services.
* Add caching or optimize database queries.

### **Scenario**: Replica Lag in Database

* Analyze replication metrics.
* Offload read-heavy queries to replicas.
* Increase IOPS capacity if disk throughput is limited.

### **Scenario**: Traffic Spike in a Region

* Autoscale services.
* Redirect traffic using load balancers.
* Enable rate limiting to prevent overload.
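
Rate limiting is often implemented as a token bucket: tokens refill at a fixed rate, each request spends one, and requests that find the bucket empty are rejected. A sketch with timestamps passed in explicitly so the behavior is deterministic:

```python
class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1, capacity=2)
# A burst of 3 at t=0: the first two pass, the third is rejected.
burst = [bucket.allow(0), bucket.allow(0), bucket.allow(0)]
# One second later a token has refilled, so a request passes again.
later = bucket.allow(1.0)
```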
