

Scaling Reliably


Last updated 5 months ago


How do we know if a system is scaling reliably? A system is considered reliable if it can consistently perform and meet requirements despite changes in its environment over time. Achieving this requires the system to detect failures, self-heal automatically, and scale according to demand.

The CAP theorem, formulated by computer scientist Eric Brewer, is a popular and fairly useful way to think about the tradeoffs a system design makes among its guarantees.

  • Consistency: all nodes see the same data at the same time.

  • Availability: node failures do not prevent survivors from continuing to operate.

  • Partition tolerance: the system continues to operate despite message loss due to network and/or node failures.

Configuration Management

Terraform:

  • Use to define infrastructure as code (IaC).

  • Use Terragrunt to keep configurations DRY (Don't Repeat Yourself).

  • # The default provider configuration; resources that begin with `aws_` will use
    # it as the default, and it can be referenced as `aws`.
    provider "aws" {
      region = "us-east-1"
    }
    
    # Additional provider configuration for west coast region; resources can
    # reference this as `aws.west`.
    provider "aws" {
      alias  = "west"
      region = "us-west-2"
    }
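A resource opts into the aliased provider via the provider meta-argument; the default provider applies otherwise. A minimal sketch (the bucket names are hypothetical):

```hcl
# Uses the default us-east-1 provider.
resource "aws_s3_bucket" "east_logs" {
  bucket = "example-logs-east" # hypothetical name
}

# Explicitly selects the us-west-2 provider via its alias.
resource "aws_s3_bucket" "west_logs" {
  provider = aws.west
  bucket   = "example-logs-west" # hypothetical name
}
```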

Ansible/Chef/Puppet:

  • Automate configuration and ensure consistency across environments.

Git:

  • Version control for infrastructure and configuration files.

  • Example: GitOps CI/CD (merging to main deploys to one environment, Git tags deploy to another, etc.)

LaunchDarkly:

  • User Segmentation: Target specific users or user segments based on attributes such as location, role, or subscription tier.

  • Custom Conditions: Use logic to roll out the feature only to users who meet specific criteria.

  • Percentage Rollouts: Gradually enable the feature for a small percentage of users and incrementally increase the percentage.


Kubernetes (K8s)

Multi-region Clusters:

  • Use cluster federation for workload distribution (KubeFed was the official tool, though the project has since been archived).

  • What: Cluster federation is a method in Kubernetes to combine multiple clusters across different regions into a single super-cluster that can be controlled through a single interface.

    • The Kubernetes Control Plane handles apps across a group of worker nodes. The Federated Control Plane does the same thing, but it does it across a group of clusters instead of nodes.

  • Why: Allows you to schedule workloads (apps/pods) dynamically across clusters based on factors like geographic location, resource availability, or latency requirements.

    • Aims to provide high availability

  • Examples: If you have users in North America and Europe, federation can ensure workloads are distributed to clusters closest to users, reducing latency.

    • Namespace: Creates a namespace with the same name in all clusters.

    • ConfigMap: Replicated in all clusters within the same namespace.

    • Deployments/ReplicaSets: Adjusted to distribute replicas fairly across clusters (can be customized).

      • A federated Deployment with 10 replicas spreads these 10 pods fairly across clusters (not 10 pods per cluster).

    • Ingress: Creates a unified access point across clusters, not individual Ingress objects.

  • Set up regional failover with DNS routing (e.g., Route 53).

Scheduling Pods Across Nodes and AZs

  • This method helps distribute replicas across multiple nodes and availability zones (AZs) for better fault tolerance and availability.

  • Anti-Affinity: Add podAntiAffinity to a Deployment resource to avoid placing replicas on the same node or availability zone.

  • Weight: weight: 100 increases the preference for spreading across AZs.
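The anti-affinity described above can be sketched on a Deployment's pod template, assuming the pods carry a hypothetical app: web label:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100   # strong preference: spread replicas across AZs
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web          # hypothetical pod label
          topologyKey: topology.kubernetes.io/zone
      - weight: 50    # weaker preference: also spread across nodes
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web
          topologyKey: kubernetes.io/hostname
```

Using preferred (rather than required) rules means the scheduler still places pods when a perfect spread is impossible.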

Kafka-Specific Deployments

  • Partitioning:

    • Distribute partitions across brokers for load balancing.

  • Replication:

    • Configure replication factor for fault tolerance.

    • Multi-region replication for disaster recovery.

  • Producers/Consumers:

    • Optimize producer acks settings (0, 1, or all) to trade latency against durability.

    • Monitor consumer lag via consumer group offsets (e.g., the kafka-consumer-groups CLI or Burrow).
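The acks tradeoff above can be sketched as producer configuration (these are standard Kafka producer property names):

```properties
# Durability-leaning: wait for all in-sync replicas to acknowledge each write.
acks=all
# Avoid duplicate writes when retries happen.
enable.idempotence=true
# Latency-leaning alternative: acks=1 acknowledges after the leader write only,
# and acks=0 does not wait for any acknowledgment at all.
```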


Observability

Logging

  • Centralized Systems:

    • ELK Stack (Elasticsearch, Logstash, Kibana).

    • Splunk for large-scale log management.

  • Structured Logging:

    • Use JSON formatting for better parsing.

    • Include trace IDs for distributed tracing.

  • Retention Policies:

    • Configure log rotation and archive policies to manage costs.

Monitoring

What to monitor

  • Four Golden Signals: Latency, Traffic, Error rates, Saturation.

  • The Errors golden signal measures the rate of requests that fail.

    • Monitoring systems usually focus on the error rate, calculated as the percentage of failing calls out of the total.

  • Latency is defined as the time it takes to serve a request.

    • Use Histograms (Ex: P99)

      • "Typical user sessions involve 5 page loads with over 40 resources per page"

        • Given this, some percentiles like P95 might be useless to monitor

        • Instead of focusing solely on a few percentiles (e.g., P50, P90, P99), use HDR histograms (High Dynamic Range histograms) to capture the full distribution of latencies. These histograms can show the range of latencies across the system.

  • Visualize saturation and understand the resources involved in your solution. What are the consequences if a resource gets depleted?

  • Traffic measures the amount of use of your service per time unit.

    • The number of users by time of day might be relevant, for example

  • Healthchecks:

    • Use Startup/Readiness/Liveness probes

      • Startup: For containers that take a long time to start

      • Readiness: fail when a dependency is down (e.g., DB, mandatory auth) or the app has reached its max connection limit

        • In many cases, a readinessProbe is all we need

      • Liveness: Deadlocks, internal corruption, etc.

        • This container is dead, we don't know how to fix it

      • Types of probes:

        • httpGet, exec, tcpSocket, grpc, etc.

        • All probes return a binary result (success/fail)

    • You can add a “ping” endpoint for an HTTP service that returns a 200 when called.

  • Infrastructure Metrics (CPU, Memory, IO, Network)

  • Business Metrics: Things like the number of completed orders
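The probe types above can be sketched on a container spec. A minimal example, assuming the service exposes /healthz and /ready endpoints (hypothetical paths) on port 8080:

```yaml
containers:
  - name: api
    image: example/api:1.0     # hypothetical image
    ports:
      - containerPort: 8080
    startupProbe:              # for containers that take a long time to start
      httpGet: { path: /healthz, port: 8080 }
      failureThreshold: 30     # allow up to 30 * 10s = 5 min to start
      periodSeconds: 10
    readinessProbe:            # gate traffic on dependencies being available
      httpGet: { path: /ready, port: 8080 }
      periodSeconds: 5
    livenessProbe:             # restart on deadlock or internal corruption
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 10
```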

Tools

  • Prometheus is an open-source monitoring system including:

    • multiple service discovery backends to figure out which metrics to collect

    • a scraper to collect these metrics

    • an efficient time series database to store these metrics

    • a specific query language (PromQL) to query these time series

    • an alert manager to notify us according to metrics values or trends

  • Grafana for visualization.

  • Datadog/CloudWatch for cloud-native monitoring.
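Two PromQL sketches for the golden signals above, assuming a conventional http_request_duration_seconds histogram and an http_requests_total counter with a code label (metric names are assumptions, not universal):

```promql
# P99 latency over the last 5 minutes, computed from histogram buckets.
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate: percentage of 5xx responses out of all requests.
100 * sum(rate(http_requests_total{code=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
```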

Key Metrics:

  • SLIs: Latency, Traffic, Error rates, Saturation.

  • SLOs: Define acceptable thresholds (e.g., 99.9% uptime).

    • Identify what users care about most when interacting with the service. Common factors include:

      • Availability: Users expect services to be available when needed.

      • Latency: Users want fast responses.

      • Correctness: Users expect the service to function as intended without errors.

  • SLAs: Contracts outlining the consequences if SLOs aren’t met

  • Distributed Tracing:

    • Tools: Jaeger, OpenTelemetry.

    • Trace requests across microservices to identify bottlenecks.
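The SLO thresholds above imply an error budget; a quick back-of-the-envelope sketch:

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over a window.

    A 99.9% SLO over 30 days leaves a budget of about 43 minutes.
    """
    return window_days * 24 * 60 * (1 - slo)

print(round(allowed_downtime_minutes(0.999), 1))   # 43.2
print(round(allowed_downtime_minutes(0.9999), 2))  # 4.32
```

Each extra "nine" cuts the budget by a factor of ten, which is why tightening an SLO is expensive.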

Alerting

  • Strategies:

    • Use anomaly detection for proactive alerts.

    • Tiered alerts to reduce noise (e.g., warnings vs. critical).

  • Tools:

    • PagerDuty, Opsgenie for incident response.


Load Balancers

Load Balancers are essential for horizontally scaling applications. They distribute incoming traffic across multiple backend resources.

Advantages of Load Balancing:

  • Load balancers minimize server response time and maximize throughput.

  • Load balancers ensure high availability and reliability by sending requests only to servers that are online.

  • Load balancers run continuous health checks to monitor each server's ability to handle requests.

Types of Load Balancers:

  1. Network Load Balancer (NLB):

    • Layer: Layer 4 (Transport)

    • Function: Forwards incoming traffic to backend targets (e.g., EC2 instances, pods) without modifying the packet headers, preserving the source IP.

    • Use Case: Ideal for TCP/UDP traffic.

  2. Application Load Balancer (ALB):

    • Layer: Layer 7 (Application)

    • Function: Can inspect HTTP/HTTPS requests, enabling routing based on URL paths, headers, and other attributes. Essentially, it acts as a reverse proxy (e.g., HAProxy, NGINX).

    • Use Case: Best for HTTP traffic, with support for advanced routing rules.


How This Applies to EKS (Kubernetes)

In an EKS environment, AWS Load Balancers work with Kubernetes Services and Ingress objects to manage traffic routing.

  • Service of type LoadBalancer: When you create a Service with this type in Kubernetes, the AWS Load Balancer Controller sets up an NLB to manage Layer 4 traffic.

  • Ingress Object: If you create an Ingress object, the controller sets up an ALB to manage Layer 7 traffic, supporting features like path-based routing or hostname-based routing to specific services.
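The two routing styles above can be sketched as manifests, assuming the AWS Load Balancer Controller is installed (annotation keys follow its documented conventions, but verify them against your controller version; names and paths are hypothetical):

```yaml
# Layer 4: Service of type LoadBalancer -> NLB
apiVersion: v1
kind: Service
metadata:
  name: api            # hypothetical service name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080
---
# Layer 7: Ingress -> ALB with path-based routing
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
```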


Multi-Region Load Balancing

  1. High Availability: Multi-region deployments ensure that your application can continue operating even if one region faces issues (e.g., outages, natural disasters).

  2. Latency Optimization: Routing traffic to the closest region reduces latency, improving performance.

  3. Route 53 Latency-Based Routing: AWS Route 53 can route traffic to the region with the lowest latency for users, enhancing the user experience.
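Latency-based routing can be expressed in Terraform; a sketch with hypothetical zone and load balancer references:

```hcl
resource "aws_route53_record" "api_us_east" {
  zone_id        = aws_route53_zone.main.zone_id # hypothetical hosted zone
  name           = "api.example.com"
  type           = "A"
  set_identifier = "us-east-1"

  latency_routing_policy {
    region = "us-east-1"
  }

  alias {
    name                   = aws_lb.east.dns_name # hypothetical regional ALB
    zone_id                = aws_lb.east.zone_id
    evaluate_target_health = true
  }
}
# A matching record with set_identifier = "eu-west-1" and its own
# latency_routing_policy would serve European users from the closer region.
```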


Reliability and Failover Mechanisms

  1. Load Balancer Performance: Access logs provide insights into traffic distribution, latency, and errors. This data is crucial for monitoring performance and troubleshooting.

  2. Security and Compliance: Logs help in detecting security threats (e.g., DDoS attacks) and ensuring compliance with standards like PCI-DSS or HIPAA.

  3. Health Checks: Both ALB and NLB use health checks to verify if a target (pod, EC2 instance) is healthy. Unhealthy targets are automatically excluded from traffic routing, ensuring reliability.

  4. Auto Scaling: Kubernetes and AWS auto scaling mechanisms add more instances or pods when traffic increases, maintaining application performance.

  5. Cross-AZ Failover: Multi-AZ deployments automatically reroute traffic to healthy resources in other Availability Zones if one AZ becomes unavailable, ensuring high availability.

Databases

Performance Optimization

  • Slow Queries

    • Use query analyzers to identify slow queries.

    • Add custom tags or identifiers in SQL queries for tracking application-level operations.

    • Use indexes, avoid full table scans, and rewrite expensive queries.

  • Key Metrics to Monitor

    • CPU: Track overall CPU usage and per-query CPU utilization for performance insights.

    • IOPS: Monitor read/write to evaluate workload distribution and storage performance.

      • Optimize Data Access: Reduce unnecessary reads/writes by denormalizing data, partitioning tables, or archiving unused data.

      • Increase IOPS Capacity: Use faster storage (e.g., SSDs over HDDs) or provision higher IOPS for better performance.

      • Buffering/Batching: Group writes into larger, less frequent operations to reduce I/O operations

  • Scaling:

    • Vertical Scaling: Increase instance size for better resource allocation.

    • Horizontal Scaling: Add replicas or shards for improved database distribution.

    • Caching: Use Redis or Memcached to offload repeated queries and reduce load on the database.

    • Change Data Capture (CDC): CDC enables efficient synchronization of data between systems without requiring heavy full-database reads or disruptive batch updates.

      • Example of using CDC for a search service that uses ElasticSearch
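The slow-query tactics at the top of this section (query tags, avoiding full table scans, indexing) can be sketched in SQL (Postgres-flavored; table, column, and tag names are hypothetical):

```sql
-- Tag the query so APM tools and slow-query logs can trace it back to app code.
/* service:checkout action:list_orders */
EXPLAIN ANALYZE
SELECT id, total, created_at
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 20;

-- A composite index turns the full table scan into an index range scan
-- and serves the ORDER BY without an extra sort.
CREATE INDEX idx_orders_customer_created
    ON orders (customer_id, created_at DESC);
```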

Reliability and Availability Enhancements

  • Database Replication (Master-Slave Structure)

    • Pros: Improves performance, reliability, and availability

    • Cons: Small delay to replicate data in slave DBs

    • Master Database: Handles all write operations and propagates changes to slaves.

    • Slave Databases: Handle read operations, helping distribute the load and improve reliability.


Common Interview Scenarios and Responses

  1. Multi-Region Deployment:

    • Use DNS-based routing (e.g., Route 53) for global traffic management.

    • Deploy services in active-active configuration across regions.

    • Synchronize data with replication (e.g., Kafka, database replicas).

  2. Database Scaling:

    • Vertical scaling: Add more resources (CPU/RAM) temporarily.

    • Horizontal scaling: Add read replicas for increased throughput.

    • Monitor replica lag and implement query routing strategies.

  3. Logging and Monitoring:

    • Centralize logs using ELK.

    • Set up Prometheus alerts for latency spikes.

    • Use Jaeger to trace high-latency requests.

Troubleshooting Bottlenecks

Scenario: High Latency in API Requests

  • Check logs for errors or slow endpoints.

  • Use APM to profile services.

  • Add caching or optimize database queries.

Scenario: Replica Lag in Database

  • Analyze replication metrics.

  • Offload read-heavy queries to replicas.

  • Increase IOPS capacity if disk throughput is limited.

Scenario: Traffic Spike in a Region

  • Autoscale services.

  • Redirect traffic using load balancers.

  • Enable rate limiting to prevent overload.

References

  • Multi-region deployments with provider configurations (Terraform documentation)
  • Talk by Gil Tene: https://youtu.be/lJ8ydIuPFeU?t=711
  • Layer 4 vs. Layer 7 load balancing: https://www.geeksforgeeks.org/layer-4-load-balancing-vs-layer-7-load-balancing/?ref=next_article
  • CAP Theorem
  • Difference between Load Balancer and Reverse Proxy
  • Resource Based Load Balancer
  • Database Replication (Master-Slave)