Scaling Reliably
How do we know if a system is scaling reliably? A system is considered reliable if it can consistently perform and meet requirements despite changes in its environment over time. Achieving this requires the system to detect failures, self-heal automatically, and scale according to demand.
The CAP theorem, formulated by computer scientist Eric Brewer, is a popular and fairly useful way to think about the tradeoffs a system design makes among its guarantees.
Consistency: all nodes see the same data at the same time.
Availability: node failures do not prevent survivors from continuing to operate.
Partition tolerance: the system continues to operate despite message loss due to network and/or node failures.
Use Terraform to define infrastructure as code (IaC).
Use Terragrunt to keep configurations DRY (Don't Repeat Yourself).
Example: Multi-region deployments with provider configurations (using alias).
Automate configuration and ensure consistency across environments.
Version control for infrastructure and configuration files.
Example: GitOps CI/CD (merging to main deploys to one environment, git tags deploy to another, etc.); see the sketch below.
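A rough sketch of that flow, assuming GitHub Actions (the environment names and the deploy.sh script are hypothetical placeholders):

```yaml
name: deploy
on:
  push:
    branches: [main]   # merge to main -> deploy to staging
    tags: ['v*']       # pushing a version tag -> deploy to production
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy
        run: |
          # deploy.sh is a hypothetical wrapper around your IaC/manifests
          if [[ "$GITHUB_REF" == refs/tags/* ]]; then
            ./deploy.sh production
          else
            ./deploy.sh staging
          fi
```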
User Segmentation: Target specific users or user segments based on attributes such as location, role, or subscription tier.
Custom Conditions: Use logic to roll out the feature only to users who meet specific criteria.
Percentage Rollouts: Gradually enable the feature for a small percentage of users and incrementally increase the percentage.
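A hypothetical flag definition covering these three strategies (the schema is illustrative, not tied to any particular feature-flag product):

```yaml
flag: new-checkout-flow
enabled: true
rules:
  - segment:                     # user segmentation by attribute
      attribute: subscription_tier
      equals: premium
  - condition: "user.country == 'CA' && user.role == 'beta'"   # custom condition
  - percentage: 10               # percentage rollout, increased incrementally
```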
Use cluster federation for workload distribution (KubeFed is the official tool).
What: Cluster federation is a method in Kubernetes to combine multiple clusters across different regions into a single super-cluster that can be controlled through a single interface.
The Kubernetes Control Plane handles apps across a group of worker nodes. The Federated Control Plane does the same thing, but it does it across a group of clusters instead of nodes.
Why: Allows you to schedule workloads (apps/pods) dynamically across clusters based on factors like geographic location, resource availability, or latency requirements.
Aims to provide high availability
Examples: If you have users in North America and Europe, federation can ensure workloads are distributed to clusters closest to users, reducing latency.
Namespace: Creates a namespace with the same name in all clusters.
ConfigMap: Replicated in all clusters within the same namespace.
Deployments/ReplicaSets: Adjusted to distribute replicas fairly across clusters (can be customized).
A federated Deployment with 10 replicas spreads these 10 pods fairly across clusters (not 10 pods per cluster); see the sketch after this list.
Ingress: Creates a unified access point across clusters, not individual Ingress objects.
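A rough sketch of a federated workload, assuming the KubeFed FederatedDeployment API (cluster names are placeholders):

```yaml
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: web
  namespace: prod                  # must exist as a federated namespace
spec:
  template:                        # an ordinary Deployment spec
    spec:
      replicas: 10                 # how these are split across clusters can be
      selector:                    # tuned, e.g. with a ReplicaSchedulingPreference
        matchLabels: { app: web }
      template:
        metadata:
          labels: { app: web }
        spec:
          containers:
            - name: web
              image: nginx:1.25
  placement:
    clusters:                      # member clusters to schedule onto
      - name: us-east
      - name: eu-west
```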
Set up regional failover with DNS routing (e.g., Route 53).
Pod anti-affinity helps distribute replicas across multiple nodes and availability zones (AZs) for better fault tolerance and availability.
Anti-Affinity: Add podAntiAffinity to a Deployment resource to avoid placing replicas on the same node or availability zone.
Weight: weight: 100 increases the preference for spreading across AZs.
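A minimal sketch of that anti-affinity block on a Deployment's pod template (label values are placeholders):

```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100                    # strong preference: spread across AZs
              podAffinityTerm:
                labelSelector:
                  matchLabels: { app: web }
                topologyKey: topology.kubernetes.io/zone
            - weight: 50                     # softer preference: spread across nodes
              podAffinityTerm:
                labelSelector:
                  matchLabels: { app: web }
                topologyKey: kubernetes.io/hostname
```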
Partitioning:
Distribute partitions across brokers for load balancing.
Replication:
Configure the replication factor for fault tolerance (see the sketch after this list).
Multi-region replication for disaster recovery.
Producers/Consumers:
Optimize producer ACK settings for latency vs. durability.
Monitor consumer lag (e.g., with the kafka-consumer-groups tool or a dedicated lag exporter).
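A sketch of the partitioning and replication settings, assuming the cluster is managed by the Strimzi operator (on a plain cluster the same settings are passed to kafka-topics.sh):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  labels:
    strimzi.io/cluster: my-cluster     # placeholder Kafka cluster name
spec:
  partitions: 12                       # spread load across brokers and consumers
  replicas: 3                          # replication factor for fault tolerance
  config:
    min.insync.replicas: "2"           # pairs with producer acks=all for durability
```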
Centralized Systems:
ELK Stack (Elasticsearch, Logstash, Kibana).
Splunk for large-scale log management.
Structured Logging:
Use JSON formatting for better parsing.
Include trace IDs for distributed tracing (see the example after this list).
Retention Policies:
Configure log rotation and archive policies to manage costs.
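A hypothetical structured log line showing JSON formatting and a trace ID (field names are illustrative):

```json
{"timestamp": "2024-05-01T12:00:00Z", "level": "error", "service": "checkout",
 "trace_id": "4bf92f3577b34da6", "message": "payment provider timeout", "duration_ms": 2043}
```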
Four Golden Signals: Latency, Traffic, Error rates, Saturation.
The Errors signal measures the rate of requests that fail.
Monitoring systems usually focus on the error rate, calculated as the percentage of failing calls out of the total.
Latency is defined as the time it takes to serve a request.
Use Histograms (Ex: P99)
Talk by Gil Tene: https://youtu.be/lJ8ydIuPFeU?t=711
"Typical user sessions involve 5 page loads with over 40 resources per page"
Given this (roughly 200 requests per session), almost every user will experience latencies beyond P95, so monitoring a percentile like P95 on its own can be misleading.
Instead of focusing solely on a few percentiles (e.g., P50, P90, P99), use HDR histograms (High Dynamic Range histograms) to capture the full distribution of latencies. These histograms can show the range of latencies across the system.
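For example, a Prometheus recording rule that derives P99 from a latency histogram (the metric name http_request_duration_seconds_bucket is an assumption, though it follows the common convention):

```yaml
groups:
  - name: latency
    rules:
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          )
```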
Visualize saturation and understand the resources involved in your solution. What are the consequences if a resource gets depleted?
Traffic measures the amount of use of your service per time unit.
The number of users by time of day might be relevant, for example.
Healthchecks:
Use startup/readiness/liveness probes
Startup: For containers that take a long time to start
Readiness: the database is down, a mandatory auth dependency is down, the app can only serve X connections at most, etc.
In many cases, a readinessProbe is all we need
Liveness: Deadlocks, internal corruption, etc.
Signals "this container is dead and we don't know how to fix it," so it gets restarted
Types of probes:
httpGet, exec, tcpSocket, grpc, etc.
All probes return a binary result (success/fail)
You can add a “ping” endpoint for an HTTP service that returns a 200 when called.
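A minimal sketch of the three probes on a container (the /ping path, port, and timings are illustrative):

```yaml
containers:
  - name: api
    image: example/api:1.0                 # placeholder image
    startupProbe:                          # for containers that take long to start
      httpGet: { path: /ping, port: 8080 }
      failureThreshold: 30
      periodSeconds: 10
    readinessProbe:                        # stop routing traffic while dependencies are down
      httpGet: { path: /ping, port: 8080 }
      periodSeconds: 5
    livenessProbe:                         # restart on deadlocks / internal corruption
      httpGet: { path: /ping, port: 8080 }
      periodSeconds: 10
      failureThreshold: 3
```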
Infrastructure Metrics (CPU, Memory, IO, Network)
Business Metrics: Things like the number of completed orders
Prometheus is an open-source monitoring system including:
multiple service discovery backends to figure out which metrics to collect
a scraper to collect these metrics
an efficient time series database to store these metrics
a specific query language (PromQL) to query these time series
an alert manager to notify us according to metrics values or trends
Grafana for visualization.
Datadog/CloudWatch for cloud-native monitoring.
Key Metrics:
SLIs: Latency, Traffic, Error rates, Saturation.
SLOs: Define acceptable thresholds (e.g., 99.9% uptime); see the sketch after this list.
Identify what users care about most when interacting with the service. Common factors include:
Availability: Users expect services to be available when needed.
Latency: Users want fast responses.
Correctness: Users expect the service to function as intended without errors.
SLAs: Contracts outlining the consequences if SLOs aren’t met
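A sketch of an availability SLO expressed as a Prometheus alert (metric names and the 99.9% target are illustrative):

```yaml
groups:
  - name: slo
    rules:
      - alert: AvailabilitySLOBreach
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[30m]))
            /
          sum(rate(http_requests_total[30m])) > 0.001   # error budget for a 99.9% target
        for: 5m
        labels: { severity: critical }
```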
Distributed Tracing:
Tools: Jaeger, OpenTelemetry.
Trace requests across microservices to identify bottlenecks.
Strategies:
Use anomaly detection for proactive alerts.
Tiered alerts to reduce noise (e.g., warnings vs. critical).
Tools:
PagerDuty, Opsgenie for incident response.
Load Balancers are essential for horizontally scaling applications. They distribute incoming traffic across multiple backend resources.
Load balancers minimize server response time and maximize throughput.
Load balancers ensure high availability and reliability by sending requests only to servers that are online
Load balancers run continuous health checks to monitor each server's ability to handle requests.
Types of Load Balancer Algorithms: round robin, least connections, IP hash, etc.
Types of Load Balancers:
https://www.geeksforgeeks.org/layer-4-load-balancing-vs-layer-7-load-balancing/?ref=next_article
Network Load Balancer (NLB):
Layer: Layer 4 (Transport)
Function: Forwards incoming traffic to backend targets (e.g., EC2 instances, pods) without modifying the packet headers, preserving the source IP.
Use Case: Ideal for TCP/UDP traffic.
Application Load Balancer (ALB):
Layer: Layer 7 (Application)
Function: Can inspect HTTP/HTTPS requests, enabling routing based on URL paths, headers, and other attributes. Essentially, it acts as a reverse proxy (like HAProxy or NGINX).
Use Case: Best for HTTP traffic, with support for advanced routing rules.
In an EKS environment, AWS Load Balancers work with Kubernetes Services and Ingress objects to manage traffic routing.
Service of type LoadBalancer: When you create a Service with this type in Kubernetes, the AWS Load Balancer Controller sets up an NLB to manage Layer 4 traffic.
Ingress Object: If you create an Ingress object, the controller sets up an ALB to manage Layer 7 traffic, supporting features like path-based routing or hostname-based routing to specific services.
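Sketches of both objects, assuming the AWS Load Balancer Controller is installed on the EKS cluster (service and app names are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external       # provision an NLB
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  type: LoadBalancer
  selector: { app: api }
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing                 # provision an ALB
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api
                port: { number: 80 }
```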
High Availability: Multi-region deployments ensure that your application can continue operating even if one region faces issues (e.g., outages, natural disasters).
Latency Optimization: Routing traffic to the closest region reduces latency, improving performance.
Route 53 Latency-Based Routing: AWS Route 53 can route traffic to the region with the lowest latency for users, enhancing the user experience.
Load Balancer Performance: Access logs provide insights into traffic distribution, latency, and errors. This data is crucial for monitoring performance and troubleshooting.
Security and Compliance: Logs help in detecting security threats (e.g., DDoS attacks) and ensuring compliance with standards like PCI-DSS or HIPAA.
Health Checks: Both ALB and NLB use health checks to verify if a target (pod, EC2 instance) is healthy. Unhealthy targets are automatically excluded from traffic routing, ensuring reliability.
Auto Scaling: Kubernetes and AWS auto scaling mechanisms add more instances or pods when traffic increases, maintaining application performance.
Cross-AZ Failover: Multi-AZ deployments automatically reroute traffic to healthy resources in other Availability Zones if one AZ becomes unavailable, ensuring high availability.
Slow Queries
Use query analyzers to identify slow queries.
Add custom tags or identifiers in SQL queries for tracking application-level operations.
Use indexes, avoid full table scans, and rewrite expensive queries.
Key Metrics to Monitor
CPU: Track overall CPU usage and per-query CPU utilization for performance insights.
IOPS: Monitor read/write IOPS to evaluate workload distribution and storage performance.
Optimize Data Access: Reduce unnecessary reads/writes by denormalizing data, partitioning tables, or archiving unused data.
Increase IOPS Capacity: Use faster storage (e.g., SSDs over HDDs) or provision higher IOPS for better performance.
Buffering/Batching: Group writes into larger, less frequent operations to reduce the number of I/O operations.
Scaling:
Vertical Scaling: Increase instance size for better resource allocation.
Horizontal Scaling: Add replicas or shards for improved database distribution.
Caching: Use Redis or Memcached to offload repeated queries and reduce load on the database.
Change Data Capture (CDC): CDC enables efficient synchronization of data between systems without requiring heavy full-database reads or disruptive batch updates.
Example: using CDC to keep a search service's Elasticsearch index in sync with the primary database (see the sketch below).
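A rough sketch of the capture side, assuming Debezium running on Kafka Connect via the Strimzi operator (connection details are placeholders; credentials omitted). An Elasticsearch sink connector or a small consumer then indexes the change events:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: products-cdc
  labels:
    strimzi.io/cluster: my-connect-cluster   # placeholder Connect cluster
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 1
  config:
    database.hostname: products-db           # placeholder primary database
    database.port: "5432"
    database.dbname: shop
    topic.prefix: shop                       # change events land on shop.public.products
    table.include.list: public.products      # tables feeding the search index
```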
Database Replication (Master-Slave Structure)
Pros: Improves performance, reliability, and availability
Cons: Small delay before data is replicated to the slave DBs
Master Database: Handles all write operations and propagates changes to slaves.
Slave Databases: Handle read operations, helping distribute the load and improve reliability.
Multi-Region Deployment:
Use DNS-based routing (e.g., Route 53) for global traffic management.
Deploy services in active-active configuration across regions.
Synchronize data with replication (e.g., Kafka, database replicas).
Database Scaling:
Vertical scaling: Add more resources (CPU/RAM) temporarily.
Horizontal scaling: Add read replicas for increased throughput.
Monitor replica lag and implement query routing strategies.
Logging and Monitoring:
Centralize logs using ELK.
Set up Prometheus alerts for latency spikes.
Use Jaeger to trace high-latency requests.
Check logs for errors or slow endpoints.
Use APM to profile services.
Add caching or optimize database queries.
Analyze replication metrics.
Offload read-heavy queries to replicas.
Increase IOPS capacity if disk throughput is limited.
Autoscale services.
Redirect traffic using load balancers.
Enable rate limiting to prevent overload.