19. Observability and Monitoring
Difficulty: ⭐⭐⭐⭐
Overview
Observability is the ability to understand a system's internal state from its external outputs. This lesson covers the three pillars of observability (metrics, logs, and traces) along with practical tools and frameworks for monitoring distributed systems at scale.
Table of Contents
- Observability Fundamentals
- Metrics and Time-Series Data
- Logging at Scale
- Distributed Tracing
- Alerting and SLOs
- OpenTelemetry
- Practice Problems
1. Observability Fundamentals
1.1 Three Pillars of Observability
┌────────────────────────────────────────────────────────────┐
│               Three Pillars of Observability               │
│                                                            │
│    ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│    │   Metrics    │  │     Logs     │  │    Traces    │    │
│    │              │  │              │  │              │    │
│    │ "What is     │  │ "What        │  │ "What path   │    │
│    │ happening?"  │  │ happened?"   │  │ did it       │    │
│    │              │  │              │  │ take?"       │    │
│    ├──────────────┤  ├──────────────┤  ├──────────────┤    │
│    │ Numeric      │  │ Structured   │  │ Request-     │    │
│    │ time-series  │  │ events with  │  │ scoped       │    │
│    │ data         │  │ context      │  │ causality    │    │
│    ├──────────────┤  ├──────────────┤  ├──────────────┤    │
│    │ Low cost     │  │ Medium cost  │  │ Higher cost  │    │
│    │ per signal   │  │ per signal   │  │ per signal   │    │
│    ├──────────────┤  ├──────────────┤  ├──────────────┤    │
│    │ Prometheus   │  │ ELK Stack    │  │ Jaeger       │    │
│    │ Grafana      │  │ Loki         │  │ Zipkin       │    │
│    │ Datadog      │  │ Fluentd      │  │ Tempo        │    │
│    └──────────────┘  └──────────────┘  └──────────────┘    │
│                                                            │
│            Correlation via Trace IDs and Labels            │
│                                                            │
└────────────────────────────────────────────────────────────┘
1.2 Monitoring vs Observability
┌───────────────────────────┬───────────────────────────────────┐
│ Monitoring                │ Observability                     │
├───────────────────────────┼───────────────────────────────────┤
│ Known unknowns            │ Unknown unknowns                  │
│ "Is CPU > 90%?"           │ "Why is latency high?"            │
│ Dashboard-driven          │ Exploration-driven                │
│ Predefined alerts         │ Ad-hoc investigation              │
│ Reactive                  │ Proactive                         │
│ Works for simple systems  │ Essential for distributed systems │
└───────────────────────────┴───────────────────────────────────┘
2. Metrics and Time-Series Data
2.1 Metric Types
┌────────────────────────────────────────────────────────────┐
│  Four Golden Signals (Google SRE)                          │
│                                                            │
│  1. Latency      - Time to serve a request                 │
│  2. Traffic      - Demand on the system (RPS)              │
│  3. Errors       - Rate of failed requests                 │
│  4. Saturation   - How "full" the system is                │
│                                                            │
│  RED Method (for microservices)                            │
│                                                            │
│  1. Rate         - Requests per second                     │
│  2. Errors       - Failed requests per second              │
│  3. Duration     - Distribution of request latencies       │
│                                                            │
│  USE Method (for infrastructure)                           │
│                                                            │
│  1. Utilization  - % of resource busy                      │
│  2. Saturation   - Queue depth, pending work               │
│  3. Errors       - Error count                             │
│                                                            │
└────────────────────────────────────────────────────────────┘
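To make the RED method concrete, the following is a minimal sketch that derives Rate, Errors, and Duration from a batch of request records. The `Request` record and `red_summary` helper are hypothetical names chosen for illustration, and the sketch assumes that a 5xx status counts as an error and that a nearest-rank p95 is an acceptable summary of the duration distribution.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize Rate, Errors, and Duration for one service over a window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile; 0.0 when there was no traffic.
    p95 = durations[math.ceil(0.95 * total) - 1] if durations else 0.0
    return {
        "rate_rps": total / window_seconds,     # Rate
        "errors_rps": errors / window_seconds,  # Errors
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": p95,                          # Duration
    }


if __name__ == "__main__":
    sample = [Request(12.0, 200), Request(340.0, 500), Request(25.0, 200)]
    print(red_summary(sample, window_seconds=60.0))
```

In a real deployment these values would normally come from counters and histograms scraped by the metrics backend rather than from raw request lists.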
2.2 Prometheus Architecture
┌────────────────────────────────────────────────────────────┐
│                    Prometheus Ecosystem                    │
│                                                            │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐               │
│  │Application│  │Application│  │   Node    │               │
│  │ /metrics  │  │ /metrics  │  │ Exporter  │               │
│  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘               │
│        │              │              │                     │
│        └──────────────┼──────────────┘                     │
│                       │ scrape (pull)                      │
│                 ┌─────▼─────┐                              │
│                 │Prometheus │                              │
│                 │  Server   ├─▶ Alertmanager ─┬▶ PagerDuty │
│                 │  (TSDB)   │                 ├▶ Slack     │
│                 └─────┬─────┘                 └▶ Email     │
│                       │                                    │
│                 ┌─────▼─────┐                              │
│                 │  Grafana  │                              │
│                 │(Dashboard)│                              │
│                 └───────────┘                              │
│                                                            │
└────────────────────────────────────────────────────────────┘
2.3 Prometheus Metric Types
# Counter: monotonically increasing (requests, errors)
http_requests_total{method="GET", path="/api/users", status="200"} 12345
# Gauge: goes up and down (temperature, queue size)
process_memory_bytes 1073741824
# Histogram: observations counted in configurable, cumulative buckets
http_request_duration_seconds_bucket{le="0.1"} 1000
http_request_duration_seconds_bucket{le="0.5"} 1200
http_request_duration_seconds_bucket{le="1.0"} 1250
http_request_duration_seconds_bucket{le="+Inf"} 1280
http_request_duration_seconds_count 1280
http_request_duration_seconds_sum 320.5
# Summary: like a histogram, but exposes quantiles precomputed by the client
http_request_duration_seconds{quantile="0.5"} 0.042
http_request_duration_seconds{quantile="0.9"} 0.087
http_request_duration_seconds{quantile="0.99"} 0.235
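The histogram lines above are cumulative: each `le` bucket counts every observation at or below its bound. A minimal sketch of that bookkeeping follows. The `Histogram` class is a hypothetical illustration written for this lesson, not the Prometheus client library, and it only mimics the shape of the text exposition format.

```python
import math


class Histogram:
    """Toy cumulative-bucket histogram that mimics Prometheus exposition lines."""

    def __init__(self, name: str, bounds: list[float]) -> None:
        self.name = name
        # An implicit +Inf bucket always exists and equals the total count.
        self.bounds = sorted(bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        self.count += 1
        self.sum += value
        # Cumulative: increment every bucket whose upper bound covers the value.
        for i, upper in enumerate(self.bounds):
            if value <= upper:
                self.bucket_counts[i] += 1

    def expose(self) -> list[str]:
        lines = []
        for upper, n in zip(self.bounds, self.bucket_counts):
            le = "+Inf" if math.isinf(upper) else str(upper)
            lines.append(f'{self.name}_bucket{{le="{le}"}} {n}')
        lines.append(f"{self.name}_count {self.count}")
        lines.append(f"{self.name}_sum {self.sum}")
        return lines


if __name__ == "__main__":
    h = Histogram("http_request_duration_seconds", [0.1, 0.5, 1.0])
    for seconds in (0.05, 0.3, 0.7, 2.0):
        h.observe(seconds)
    print("\n".join(h.expose()))
```

Because buckets are cumulative, the server can aggregate them across instances by simple addition, which is the main reason histograms are usually preferred over summaries for fleet-wide percentiles.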
2.4 PromQL Queries
# Per-second request rate, averaged over the last 5 minutes
rate(http_requests_total[5m])
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# 95th percentile latency across all series (aggregate buckets by le first)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Top 5 endpoints by request rate
topk(5, sum by (path)(rate(http_requests_total[5m])))
# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes * 100
# Predict whether the disk fills within 4 hours, extrapolating from the last hour
predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0
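To show roughly what `histogram_quantile` does with bucket data, here is a sketch of the linear-interpolation idea applied to the bucket counts from section 2.3 (the `+Inf` bucket is taken to equal the `_count` of 1280). The function below is an approximation written for illustration; it assumes cumulative counts, a lowest bucket starting at 0, and raw counts rather than the per-second rates the real query uses.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == lower_count:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper - lower_bound) * fraction
        lower_bound, lower_count = upper, count
    return math.nan


if __name__ == "__main__":
    buckets = [(0.1, 1000), (0.5, 1200), (1.0, 1250), (math.inf, 1280)]
    print(histogram_quantile(0.95, buckets))
```

The estimate is only as precise as the bucket layout: with these bounds, any quantile above roughly the 97.7th percentile is reported as 1.0 because it lands in the `+Inf` bucket.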
3. Logging at Scale
3.1 Structured Logging
{
  "timestamp": "2026-02-15T10:30:00Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "user_id": "u-42",
  "message": "Failed to process order",
  "error": "PaymentDeclined",
  "order_id": "ord-789",
  "amount": 129.99,
  "duration_ms": 1250
}
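A log line like the one above can be produced with the standard `logging` module. The sketch below is one possible approach, assuming request-scoped fields are passed through a `context` entry in `extra`; the `JsonFormatter` class and the `context` key are hypothetical conventions, not part of the standard library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} become top-level keys.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service="order-service"))
    logger = logging.getLogger("orders")
    logger.addHandler(handler)
    logger.error(
        "Failed to process order",
        extra={"context": {"trace_id": "abc123def456", "order_id": "ord-789"}},
    )
```

Keeping `trace_id` as a top-level key is what lets a log backend join log lines with the corresponding trace.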
3.2 ELK Stack Architecture
┌────────────────────────────────────────────────────────────┐
│                 ELK Stack (Elastic Stack)                  │
│                                                            │
│  Applications                                              │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐                     │
│  │ Service │  │ Service │  │ Service │                     │
│  │    A    │  │    B    │  │    C    │                     │
│  └────┬────┘  └────┬────┘  └────┬────┘                     │
│       │            │            │                          │
│       └────────────┼────────────┘                          │
│                    ▼                                       │
│   ┌──────────────────────────────────┐                     │
│   │        Filebeat / Fluentd        │  Log Shippers       │
│   └────────────────┬─────────────────┘                     │
│                    ▼                                       │
│   ┌──────────────────────────────────┐                     │
│   │         Logstash / Kafka         │  Processing / Buffer│
│   │     (parse, filter, enrich)      │                     │
│   └────────────────┬─────────────────┘                     │
│                    ▼                                       │
│   ┌──────────────────────────────────┐                     │
│   │          Elasticsearch           │  Storage & Search   │
│   │    (index, full-text search)     │                     │
│   └────────────────┬─────────────────┘                     │
│                    ▼                                       │
│   ┌──────────────────────────────────┐                     │
│   │              Kibana              │  Visualization      │
│   │      (dashboards, queries)       │                     │
│   └──────────────────────────────────┘                     │
│                                                            │
└────────────────────────────────────────────────────────────┘
3.3 Grafana Loki (Lightweight Alternative)
┌────────────────────────────────────────────────────────────┐
│                     Grafana Loki Stack                     │
│                                                            │
│  • Does NOT index log content (only labels)                │
│  • Much cheaper storage than Elasticsearch                 │
│  • LogQL query language (similar to PromQL)                │
│  • Ideal for Kubernetes environments                       │
│                                                            │
│  Promtail ──▶ Loki ──▶ Grafana                             │
│  (agent)      (store)  (query/visualize)                   │
│                                                            │
└────────────────────────────────────────────────────────────┘
# LogQL examples
{service="order-service"} |= "error" # contains "error"
{service="order-service"} | json | status >= 500 # JSON parsing + filter
{service="order-service"} | json | line_format "{{.message}}"
rate({service="order-service"} |= "error" [5m]) # error rate
3.4 Log Levels and Best Practices
Level  │ When to Use
───────┼─────────────────────────────────────────
TRACE  │ Very fine-grained (usually disabled)
DEBUG  │ Development troubleshooting
INFO   │ Normal operations, business events
WARN   │ Recoverable issues, degraded service
ERROR  │ Failures requiring attention
FATAL  │ Application cannot continue
Best Practices:
- Use structured logging (JSON) over plain text
- Include correlation IDs (trace_id) in every log
- Log at appropriate levels (no INFO spam in production)
- Set retention policies (for example 7d hot, 30d warm, 90d cold)
- Avoid logging sensitive data (PII, credentials)
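The last practice, keeping sensitive data out of logs, is often enforced in code before an event reaches the logger. A minimal sketch follows, assuming a fixed deny-list of field names; the `SENSITIVE_KEYS` set and `redact` helper are hypothetical, and a real system would likely need pattern-based detection as well.

```python
# Field names treated as sensitive in this sketch (an assumption, not a standard).
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number", "email"}


def redact(event: dict) -> dict:
    """Return a copy of a structured log event with sensitive values masked."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            # Recurse so nested payloads are covered too.
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


if __name__ == "__main__":
    event = {
        "message": "Login failed",
        "user_id": "u-42",
        "password": "hunter2",
        "request": {"Authorization": "Bearer abc", "path": "/login"},
    }
    print(redact(event))
```

Returning a copy rather than mutating the event keeps the original payload usable by the request handler.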
4. Distributed Tracing
4.1 Trace Anatomy
┌────────────────────────────────────────────────────────────┐
│                 Distributed Trace Example                  │
│                                                            │
│  Trace ID: abc-123-def                                     │
│                                                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │ Span A: API Gateway (200ms)                          │  │
│  │ ┌─────────────────────┐                              │  │
│  │ │ Span B: Auth (30ms) │                              │  │
│  │ └─────────────────────┘                              │  │
│  │      ┌────────────────────────────────┐              │  │
│  │      │ Span C: Order Service (120ms)  │              │  │
│  │      │ ┌───────────────┐              │              │  │
│  │      │ │ Span D: DB    │              │              │  │
│  │      │ │ Query (15ms)  │              │              │  │
│  │      │ └───────────────┘              │              │  │
│  │      │        ┌──────────────┐        │              │  │
│  │      │        │ Span E:      │        │              │  │
│  │      │        │ Payment      │        │              │  │
│  │      │        │ (80ms)       │        │              │  │
│  │      │        └──────────────┘        │              │  │
│  │      └────────────────────────────────┘              │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                            │
│   0ms                     100ms                     200ms  │
│                                                            │
└────────────────────────────────────────────────────────────┘
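The nesting in the example trace maps naturally onto a tree of spans. The sketch below models that tree and computes each span's self time (its own duration minus its children's), which is one common way to spot where a request actually spends its time. The `Span` class and helper functions are hypothetical, and the calculation assumes child spans run sequentially rather than in parallel.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)


def self_time_ms(span: Span) -> float:
    """Time spent in the span itself, excluding its direct children."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)


def walk(span: Span, depth: int = 0) -> Iterator[tuple[int, Span]]:
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


# The example trace: Span A contains B and C; C contains D and E.
trace = Span("API Gateway", 200, [
    Span("Auth", 30),
    Span("Order Service", 120, [
        Span("DB Query", 15),
        Span("Payment", 80),
    ]),
])

if __name__ == "__main__":
    for depth, span in walk(trace):
        print(f"{'  ' * depth}{span.name}: "
              f"{span.duration_ms}ms total, {self_time_ms(span)}ms self")
```

With these numbers the Payment span has the largest self time, so it would be the first place to look when optimizing the request.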
4.2 Tracing Systems
┌────────────────┬──────────────────────────────────────────────────┐
│ Tool           │ Description                                      │
├────────────────┼──────────────────────────────────────────────────┤
│ Jaeger         │ Open source, Uber-originated, CNCF project       │
│ Zipkin         │ Open source, Twitter-originated                  │
│ Grafana Tempo  │ Cost-efficient, object storage, trace-ID lookup  │
│ AWS X-Ray      │ Managed service for AWS workloads                │
│ Datadog APM    │ Commercial, integrated with metrics/logs         │
└────────────────┴──────────────────────────────────────────────────┘
4.3 Context Propagation
┌────────────────────────────────────────────────────────────────────────┐
│                       W3C Trace Context Headers                        │
│                                                                        │
│  HTTP Request:                                                         │
│  ┌──────────────────────────────────────────────────┐                  │
│  │ traceparent: 00-{trace-id}-{span-id}-{flags}     │                  │
│  │ tracestate: vendor1=value1,vendor2=value2        │                  │
│  └──────────────────────────────────────────────────┘                  │
│                                                                        │
│  Example:                                                              │
│  traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01  │
│  ┌──────────┬──────────────────────────────────────┐                   │
│  │ version  │ 00                                   │                   │
│  │ trace-id │ 4bf92f3577b34da6a3ce929d0e0e4736     │                   │
│  │ span-id  │ 00f067aa0ba902b7 (parent-id in spec) │                   │
│  │ flags    │ 01 (sampled)                         │                   │
│  └──────────┴──────────────────────────────────────┘                   │
│                                                                        │
│  trace-id is 32 hex characters; span-id is 16 hex characters           │
│                                                                        │
│  Service A ──(traceparent)──▶ Service B ──(traceparent)──▶ C           │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
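As an illustration of how a service might read and forward the `traceparent` header, here is a small sketch of a parser and a child-header builder. It handles only the version `00` layout (four dash-separated lowercase hex fields: 2, 32, 16, and 2 characters), and the `TraceContext`, `parse_traceparent`, and `child_traceparent` names are hypothetical. The sample header value in the demo is the example used in the W3C Trace Context specification.

```python
import re
from typing import NamedTuple, Optional

# version(2 hex) - trace-id(32 hex) - parent/span-id(16 hex) - flags(2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    span_id: str
    flags: int

    @property
    def sampled(self) -> bool:
        # The lowest flag bit marks the trace as sampled.
        return bool(self.flags & 0x01)


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs and version "ff" are invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(version, trace_id, span_id, int(flags, 16))


def child_traceparent(parent: TraceContext, new_span_id: str) -> str:
    """Build the header to send downstream: same trace, new span ID."""
    return f"{parent.version}-{parent.trace_id}-{new_span_id}-{parent.flags:02x}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    ctx = parse_traceparent(incoming)
    print(ctx, ctx.sampled)
    print(child_traceparent(ctx, "b7ad6b7169203331"))
```

In practice an instrumentation library performs this step automatically; the sketch only shows what travels between services.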
4.4 Sampling Strategies
┌──────────────────────────────────────────────────────────────────┐
│                       Sampling Strategies                        │
│                                                                  │
│  1. Head-based sampling                                          │
│     Decision at trace start: sample 10% of all requests          │
│     + Simple, low overhead                                       │
│     - May miss important traces                                  │
│                                                                  │
│  2. Tail-based sampling                                          │
│     Decision after trace completes: keep errors + slow traces    │
│     + Captures interesting traces                                │
│     - Higher memory usage (buffer all spans)                     │
│                                                                  │
│  3. Rate-limited sampling                                        │
│     Keep N traces per second per service                         │
│     + Predictable cost                                           │
│     - May miss bursts                                            │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
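The difference between head-based and tail-based decisions can be sketched in a few lines. The functions below are illustrative only: `head_sample` assumes a 32-character hex trace ID and derives a deterministic decision from its lower 64 bits, so every service reaches the same verdict without coordination, while `tail_sample` assumes the complete list of spans is already buffered and that each span is a dict with hypothetical `start_ms`, `end_ms`, and optional `error` keys.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at trace start, using only the trace ID."""
    bucket = int(trace_id[-16:], 16)  # lower 64 bits of the trace ID
    return bucket < ratio * 2**64


def tail_sample(spans: list[dict], slow_ms: float = 1000.0) -> bool:
    """Decide after the trace completes: keep errors and slow traces."""
    has_error = any(span.get("error") for span in spans)
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or duration >= slow_ms


if __name__ == "__main__":
    print(head_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.10))
    print(tail_sample([
        {"start_ms": 0, "end_ms": 200},
        {"start_ms": 10, "end_ms": 90, "error": True},
    ]))
```

The memory cost of tail-based sampling is visible in the signature: `tail_sample` needs every span of the trace, whereas `head_sample` needs only an ID.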
5. Alerting and SLOs
5.1 SLI / SLO / SLA
┌────────────────────────────────────────────────────────────┐
│                      SLI → SLO → SLA                       │
│                                                            │
│  SLI (Service Level Indicator)                             │
│  ├── What you measure                                      │
│  ├── "Proportion of requests < 200ms"                      │
│  └── "Proportion of requests returning 2xx"                │
│                                                            │
│  SLO (Service Level Objective)                             │
│  ├── Internal target                                       │
│  ├── "99.9% of requests < 200ms over 30 days"              │
│  └── "99.95% availability per month"                       │
│                                                            │
│  SLA (Service Level Agreement)                             │
│  ├── External contract with consequences                   │
│  └── "99.9% uptime or service credits issued"              │
│                                                            │
│  Rule: SLO should be stricter than SLA                     │
│  (e.g., SLO = 99.95% when SLA = 99.9%)                     │
│                                                            │
└────────────────────────────────────────────────────────────┘
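An SLI is just a ratio of good events to all events, and an SLO is a threshold on that ratio. A minimal sketch of the latency example follows; the `latency_sli` and `meets_slo` helpers and their default thresholds are illustrative choices taken from the examples in the box.

```python
def latency_sli(durations_ms: list[float], threshold_ms: float = 200.0) -> float:
    """SLI: proportion of requests served faster than the threshold."""
    if not durations_ms:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for d in durations_ms if d < threshold_ms)
    return good / len(durations_ms)


def meets_slo(sli: float, target: float = 0.999) -> bool:
    """SLO check: is the measured SLI at or above the internal target?"""
    return sli >= target


if __name__ == "__main__":
    window = [120.0] * 999 + [450.0]  # one slow request out of 1000
    sli = latency_sli(window)
    print(f"SLI = {sli:.4f}, SLO met: {meets_slo(sli)}")
```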
5.2 Error Budgets
┌────────────────────────────────────────────────────────────────┐
│                      Error Budget Concept                      │
│                                                                │
│  SLO = 99.9% availability                                      │
│  Error Budget = 100% - 99.9% = 0.1%                            │
│                                                                │
│  Per 30 days: 0.1% × 30 × 24 × 60 = 43.2 minutes of downtime   │
│                                                                │
│  Error Budget Remaining:                                       │
│  ┌──────────────────────────────────────────────────┐          │
│  │███████████████████████████████████░░░░░░░░░░░░░░░│          │
│  │ 70% remaining                       30% consumed │          │
│  └──────────────────────────────────────────────────┘          │
│                                                                │
│  Policy:                                                       │
│  • Budget > 50%:  Deploy freely, experiment                    │
│  • Budget 20-50%: Careful deployments, extra testing           │
│  • Budget < 20%:  Freeze features, focus on reliability        │
│  • Budget = 0%:   Emergency freeze until budget replenishes    │
│                                                                │
└────────────────────────────────────────────────────────────────┘
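The arithmetic and the policy table above translate directly into code. The sketch below reproduces the 43.2-minute budget and maps a remaining-budget fraction to the policy tiers; the function names and the returned policy strings are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    return 1 - downtime_minutes / error_budget_minutes(slo, window_days)


def budget_policy(remaining: float) -> str:
    """Map the remaining budget fraction to the policy tiers."""
    if remaining <= 0:
        return "emergency freeze"
    if remaining < 0.2:
        return "freeze features"
    if remaining <= 0.5:
        return "careful deployments"
    return "deploy freely"


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))       # budget for a 99.9% SLO
    left = budget_remaining(0.999, downtime_minutes=12.96)
    print(round(left, 2), budget_policy(left))
```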
5.3 Alerting Best Practices
# Prometheus alerting rules example
groups:
  - name: slo-alerts
    rules:
      # Multi-window, multi-burn-rate alert (99.9% SLO, so the budget is 0.001)
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > 14.4 * 0.001  # 14.4x burn rate for 5m window
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > 14.4 * 0.001  # same burn rate confirmed over the 1h window
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds SLO burn rate"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 1s"
Alerting Anti-Patterns:
- Alert fatigue: too many non-actionable alerts
- Missing runbooks: alerts without remediation steps
- No owner: alerts routed to "everyone"
- Threshold-only: static thresholds without trend analysis
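The `14.4` factor in the rule above is a burn rate: how many times faster than "exactly on budget" the service is consuming its error budget. A short sketch of the arithmetic follows, assuming a 30-day SLO period; the helper names are hypothetical.

```python
def burn_rate(budget_fraction: float, window_hours: float,
              slo_period_hours: float = 30 * 24) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * slo_period_hours / window_hours


def error_ratio_threshold(slo: float, rate: float) -> float:
    """Error ratio that corresponds to a given burn rate for this SLO."""
    return rate * (1 - slo)


if __name__ == "__main__":
    # Spending 2% of a 30-day budget within 1 hour.
    fast = burn_rate(0.02, window_hours=1)
    print(round(fast, 1), round(error_ratio_threshold(0.999, fast), 4))
```

A burn rate of 1 means the budget would last exactly the full SLO period, so slower-burning conditions can be routed to tickets instead of pages.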
6. OpenTelemetry
6.1 OpenTelemetry Architecture
┌────────────────────────────────────────────────────────────┐
│             OpenTelemetry (OTel) Architecture              │
│                                                            │
│  Application                                               │
│  ┌──────────────────────────────────────────────────┐      │
│  │                     OTel SDK                     │      │
│  │    ┌──────────┐   ┌──────────┐   ┌──────────┐    │      │
│  │    │  Traces  │   │ Metrics  │   │   Logs   │    │      │
│  │    │   API    │   │   API    │   │   API    │    │      │
│  │    └────┬─────┘   └────┬─────┘   └────┬─────┘    │      │
│  │         └──────────────┼──────────────┘          │      │
│  │                        ▼                         │      │
│  │                  OTLP Exporter                   │      │
│  └────────────────────────┬─────────────────────────┘      │
│                           ▼                                │
│  ┌──────────────────────────────────────────────────┐      │
│  │                  OTel Collector                  │      │
│  │  ┌───────────┐  ┌─────────────┐  ┌────────────┐  │      │
│  │  │ Receivers │  │ Processors  │  │ Exporters  │  │      │
│  │  │ OTLP      │─▶│ Batch       │─▶│ Jaeger     │  │      │
│  │  │ Zipkin    │  │ Filter      │  │ Prometheus │  │      │
│  │  │ Kafka     │  │ Tail-sample │  │ Loki       │  │      │
│  │  └───────────┘  └─────────────┘  └────────────┘  │      │
│  └──────────────────────────────────────────────────┘      │
│                                                            │
└────────────────────────────────────────────────────────────┘
6.2 OTel Collector Configuration
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
  # Recent Collector releases no longer ship a dedicated jaeger exporter.
  # Jaeger accepts OTLP natively, so traces are sent with the otlp exporter.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # Contrib-only exporter; Loki 3.x can also ingest OTLP directly.
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
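6.3 Instrumentation Example (Python)
import time

from opentelemetry import metrics, trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup tracing: batch spans and send them over OTLP/gRPC to a local Collector
provider = TracerProvider()
provider.add_span_processor(
    # insecure=True because the local Collector endpoint is plaintext gRPC
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Setup metrics: without a registered MeterProvider the instruments are no-ops.
# Attach a metric reader/exporter to the provider to actually ship measurements.
metrics.set_meter_provider(MeterProvider())
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter(
    "http.requests", description="Total HTTP requests"
)
request_duration = meter.create_histogram(
    "http.request.duration", unit="ms", description="Request duration in ms"
)

# Usage (validate() and charge() are application functions defined elsewhere)
@tracer.start_as_current_span("process_order")
def process_order(order_id):
    attrs = {"endpoint": "/orders", "method": "POST"}
    start = time.perf_counter()
    request_counter.add(1, attrs)
    with tracer.start_as_current_span("validate_order") as span:
        span.set_attribute("order.id", order_id)
        validate(order_id)
    with tracer.start_as_current_span("charge_payment"):
        charge(order_id)
    # Record the elapsed time so the histogram declared above is populated
    request_duration.record((time.perf_counter() - start) * 1000, attrs)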
7. Practice Problems
Problem 1: Design Monitoring for a Microservices Platform
You are designing the observability stack for an e-commerce platform with 20 microservices.
Key considerations:
- What metrics would you collect from each service?
- How would you correlate logs across services?
- What sampling strategy would you use for traces?
- Define SLOs for the checkout service.
Example approach:
Metrics (RED for each service):
  - Rate: http_requests_total by service, method, status
  - Errors: http_requests_total{status=~"5.."} / total
  - Duration: http_request_duration_seconds histogram
Logging:
  - Structured JSON with trace_id in every log
  - Centralized via Loki or Elasticsearch
  - Retention: 7d hot, 30d warm, 90d archive
Tracing:
  - Tail-based sampling: keep all errors + p99 latencies
  - Head-based: 10% sample for normal traffic
  - Jaeger or Tempo as backend
Checkout SLOs:
  - Availability: 99.95% success rate (30-day window)
  - Latency: p99 < 2s, p50 < 500ms
  - Error budget: ~21.6 min/month
Problem 2: Alert Design
Design an alerting strategy that avoids alert fatigue.
Example answer:
Multi-burn-rate alerting:
  - 2% budget consumed in 1 hour → page (critical)
  - 5% budget consumed in 6 hours → page (warning)
  - 10% budget consumed in 3 days → ticket (low)
Routing:
  - Critical → PagerDuty → on-call engineer
  - Warning → Slack #alerts → team lead
  - Low → Jira ticket → backlog
Every alert must have:
  - Runbook link
  - Dashboard link
  - Expected impact
  - Suggested remediation
Problem 3: Observability Cost Optimization
Your team spends $50K/month on observability. Reduce costs by 40%.
Example answer:
1. Metrics: Drop unused metrics (audit dashboards)
   - Reduce cardinality (fewer label values)
   - Increase scrape intervals for non-critical services (15s → 60s)
2. Logs: Switch from ELK to Loki
   - Stop indexing full log content
   - Reduce log verbosity (DEBUG → INFO in production)
   - Shorter retention (90d → 30d for non-regulated data)
3. Traces: Implement tail-based sampling
   - Keep 100% of errors and slow traces
   - Sample 1% of successful traces
   - Consider Grafana Tempo (object storage backend, often cheaper than Jaeger at scale)
4. Architecture:
   - Self-host OpenTelemetry Collector
   - Use object storage (S3) for cold data
   - Aggregate metrics at collector level