Lesson 13: DevOps and CI/CD

Lesson 13: DevOps and CI/CD

Previous: Process Improvement | Next: Technical Documentation


DevOps has transformed how software teams build and deliver software over the past decade. But DevOps is not a tool, a product, or even a job title — it is a culture and set of practices that breaks down the traditional wall between software development and IT operations. This lesson explores what DevOps truly means, the principles behind continuous integration and continuous delivery, and how mature engineering organizations measure and improve the speed and reliability of their delivery process.

Difficulty: ⭐⭐⭐

Prerequisites: - Configuration Management — branching strategies, version control workflows - Process Improvement — metrics, retrospectives, improvement cycles - Basic familiarity with build systems and automated testing concepts

Learning Objectives: - Explain what DevOps is and distinguish it from traditional operations and SRE - Describe the CALMS framework and why culture is the foundation of DevOps - Design a CI pipeline with meaningful stages and quality gates - Distinguish between Continuous Delivery and Continuous Deployment - Compare deployment strategies: rolling, blue-green, canary, and feature flags - Explain the three pillars of observability and why they matter - Define the four DORA metrics and interpret what they reveal about team health - Describe blameless postmortems and why psychological safety enables learning


1. What Is DevOps?

DevOps emerged around 2008–2009, born from frustration with the chronic dysfunction between development teams and operations teams. Developers were incentivized to ship new features quickly; operations teams were incentivized to keep production stable. These opposing incentives created a wall — features piled up waiting for deployment, incidents were blamed on "bad code" or "ops mistakes," and nobody owned the full delivery pipeline.

DevOps is the practice of development and operations engineers working together throughout the entire service lifecycle — from design and development through deployment and production support — applying the principles of lean manufacturing and systems thinking to software delivery.

The term was popularized by Patrick Debois and Andrew Shafer, and the "State of DevOps" research (starting at Puppet Labs in 2013, later continued by DORA) provided the first rigorous evidence that DevOps practices correlate with significantly better organizational outcomes.

DevOps vs Traditional Operations

Dimension Traditional DevOps
Team structure Dev and Ops separate silos Shared responsibility, cross-functional teams
Deployments Large batches, infrequent, high risk Small, frequent, incremental
Change approval Change Advisory Boards, weeks of lead time Automated gates, minutes to hours
Incident response Blame-finding, escalation Blameless postmortems, shared on-call
Feedback loop Bugs found after release, slow feedback Monitoring, alerts, fast recovery
Knowledge Ops "owns" production; dev doesn't have access Developers understand and can operate their services

DevOps vs SRE

Site Reliability Engineering (SRE) is Google's implementation of DevOps principles. Where DevOps describes a philosophy, SRE describes a job function and set of practices.

SRE introduces concrete mechanisms: - Service Level Objectives (SLOs): Quantified reliability targets (e.g., 99.9% availability) - Error budgets: The tolerable amount of downtime — once exhausted, feature work stops until reliability improves - Toil reduction: SREs are expected to automate away repetitive manual work - Postmortems: Structured learning from incidents

The relationship: "SRE is what you get when you ask a software engineer to design an operations function." DevOps is the broader cultural movement; SRE is one rigorous way to implement it.


2. The CALMS Framework

CALMS is the most widely used framework for understanding DevOps. Each letter represents a dimension of organizational capability.

C — Culture

Culture is the foundation. Without it, all the tooling in the world will fail. A DevOps culture means: - Shared ownership: Both developers and operations own reliability and deployment success - Blame-free environment: Failures are opportunities to improve systems, not to punish people - Transparency: Metrics, incidents, and decisions are visible across teams - Customer focus: The ultimate measure is value delivered to users

The culture shift is the hardest part. It requires leadership support, trust-building, and often organizational restructuring. Many "DevOps transformations" fail not because of tool selection but because the underlying culture did not change.

A — Automation

Automation eliminates manual, error-prone steps from the delivery pipeline: - Code builds and compilation - Test execution at all levels - Infrastructure provisioning - Deployment to all environments - Configuration management - Security scanning

The goal of automation is not just speed — it is consistency and reproducibility. A manual process that works 90% of the time is a reliability problem.

L — Lean

Lean principles from manufacturing apply directly to software delivery: - Eliminate waste: Unnecessary approval steps, waiting for environments, manual handoffs - Amplify feedback: Make problems visible quickly so they can be fixed - Small batch sizes: Deploy smaller changes more frequently to reduce risk and accelerate feedback - Flow: Optimize for the throughput of the whole system, not individual stages

The key insight from lean: a defect discovered after deployment costs ten times more to fix than one caught in code review, and a hundred times more than one prevented by good design.

M — Measurement

You cannot improve what you cannot measure. DevOps teams instrument everything: - Deployment frequency and lead time - System health: error rates, latency, saturation - Pipeline health: build success rate, test coverage trends - Business outcomes: conversion rates, user engagement

Measurement requires tooling (monitoring, APM, logging), but more importantly it requires a culture that uses data to make decisions and learns from failure data rather than hiding it.

S — Sharing

Knowledge silos are the enemy of DevOps. Sharing means: - Open postmortems visible to the whole organization - Runbooks and playbooks accessible to everyone - Architecture Decision Records explaining why systems are built a certain way - Internal tech talks, communities of practice, and mentoring - Contributing back to open source when possible


3. Continuous Integration

Continuous Integration (CI) is the practice of integrating code changes into a shared repository frequently — ideally multiple times per day — and verifying each integration with an automated build and test suite.

The Problem CI Solves

Before CI, it was common for teams to work on separate branches for weeks or months, then attempt to merge everything together before a release. The result was integration hell: massive conflicts, broken tests, and days of debugging. Martin Fowler and Kent Beck popularized CI as part of Extreme Programming to address this directly.

CI Principles

  1. Maintain a single source repository: All code lives in one main branch (or is merged to it frequently)
  2. Automate the build: Anyone should be able to check out the code and build it with a single command
  3. Make the build self-testing: The build runs tests; a passing build means the code is correct
  4. Every commit triggers a build: The pipeline runs automatically on every push
  5. Fix broken builds immediately: A broken main branch is the highest priority — not "I'll fix it later"
  6. Keep the build fast: A CI pipeline that takes 45 minutes provides poor feedback — aim for under 10 minutes for the core pipeline
  7. Test in a production-like environment: Differences between test and production environments introduce risk
  8. Make results visible: A dashboard showing build status is a team health indicator

CI Pipeline Stages

A typical CI pipeline progresses through stages of increasing depth and cost:

[Commit]
   
   
[Checkout & Compile]          ~30 seconds
   │  Syntax errors, missing dependencies
   
[Unit Tests]                  ~2–5 minutes
   │  Fast, isolated, high coverage
   
[Static Analysis & Linting]   ~1–2 minutes
   │  Code style, complexity, security patterns (SAST)
   
[Integration Tests]           ~5–10 minutes
   │  Tests across component boundaries, database interactions
   
[Build Artifact]              ~1–3 minutes
   │  Docker image, JAR, binary, package
   
[Security Scanning]           ~2–5 minutes
   │  Dependency vulnerabilities (SCA), container image scanning
   
[PASS] → Artifact stored in registry
[FAIL] → Build blocked, developer notified

Quality gates: Each stage acts as a gate. A stage failure stops the pipeline and prevents a broken artifact from progressing. This "fail fast" principle is essential — the earlier a problem is caught, the cheaper it is to fix.

Test Pyramid in CI

The test pyramid (Mike Cohn) guides what types of tests to run and when:

        /\
       /  \
      / E2E \         Few, slow, brittle — run less frequently
     /--------\
    / Integration\    Moderate — run on every commit
   /--------------\
  /   Unit Tests   \  Many, fast, reliable — run on every commit
 /------------------\

Unit tests should dominate the pyramid. They run in milliseconds, have no external dependencies, and give precise feedback. A pipeline with too many slow integration tests will be ignored because feedback takes too long.


4. Continuous Delivery vs Continuous Deployment

These terms are related but distinct, and the distinction matters.

Continuous Delivery

Definition: Every successful build produces an artifact that could be deployed to production at any time. Deployment to production is a manual decision.

Continuous Delivery guarantees that: - The software is always in a deployable state - Deployment is a low-risk, routine operation - The business can choose when to release (not just if the release is technically possible)

This decoupling of "technically releasable" from "business decision to release" is powerful. A team can complete a feature that's not ready for users, keep the code merged to main, and release when the marketing team is ready.

Continuous Deployment

Definition: Every successful build is automatically deployed to production without human approval.

Continuous Deployment requires: - Exceptionally high test coverage and pipeline confidence - Robust monitoring and alerting to catch problems post-deployment - Fast rollback capability (automated or one-click) - Feature flags to decouple code deployment from feature visibility

Continuous Delivery Continuous Deployment
Deployment to production Manual trigger Automatic
Human approval step Yes (at the end) No
Risk tolerance Moderate Requires high confidence
Typical use case Enterprise, regulated industries High-velocity web services
Examples Netflix (mostly), banks Flickr (10+ deploys/day in 2009), many SaaS companies

Most organizations practice Continuous Delivery; Continuous Deployment is appropriate for teams with mature CI and strong monitoring culture.


5. Deployment Strategies

How you release changes is as important as whether you release them. Different strategies manage risk differently.

Rolling Deployment

Update instances one at a time (or in small batches). At any moment, some instances run the old version and some run the new.

Before:  [v1] [v1] [v1] [v1] [v1]
Step 1:  [v2] [v1] [v1] [v1] [v1]
Step 2:  [v2] [v2] [v1] [v1] [v1]
Step 3:  [v2] [v2] [v2] [v1] [v1]
...
After:   [v2] [v2] [v2] [v2] [v2]

Pros: Gradual rollout, no extra infrastructure needed, easy to pause. Cons: Mixed versions in production simultaneously (requires backward compatibility), slower than other strategies.

Blue-Green Deployment

Maintain two identical production environments. Route all traffic to "blue" (current). Deploy to "green" (idle). Switch traffic all at once.

         [Load Balancer]
              
     ┌────────┴────────┐
                      
  [Blue: v1]       [Green: v2]
   traffic            idle

  (after switch)

  [Blue: v1]       [Green: v2]
    idle             traffic

Pros: Instant rollback (switch back to blue), no mixed versions. Cons: Requires double the infrastructure, database schema changes are tricky.

Canary Deployment

Route a small percentage of traffic to the new version, observe, then gradually increase.

[Load Balancer]
      
  ┌───┴──────────────┐
  ▼ (5%)             ▼ (95%)
[v2: canary]      [v1: stable]

Pros: Real production traffic validates the new version with limited exposure; statistical confidence before full rollout. Cons: Requires traffic routing capability, monitoring must be precise enough to detect problems at low traffic volume.

Feature Flags

Feature flags (also: feature toggles, feature gates) decouple code deployment from feature visibility. Code ships to production in a disabled state; the feature is enabled for specific users, percentages, or regions via configuration.

# Example: server-side feature flag check (pseudocode)
if feature_flags.is_enabled("new_checkout_flow", user=current_user):
    return render_new_checkout()
else:
    return render_old_checkout()

Feature flags enable: - A/B testing: Compare two versions for business metrics - Gradual rollouts: Enable for 1% → 10% → 50% → 100% - Kill switches: Instantly disable a problematic feature without deploying - Beta programs: Enable only for opted-in users

Feature flags must be managed carefully: old flags accumulate as technical debt if not cleaned up after full rollout.


6. Infrastructure as Code

Infrastructure as Code (IaC) is the practice of managing and provisioning infrastructure through machine-readable configuration files rather than manual processes or interactive tools.

Core Principles

Idempotency: Applying the same configuration multiple times produces the same result. Running a provisioning script twice should not create duplicate resources.

Version control: Infrastructure configuration lives in the same repository as application code. Changes are reviewed, tested, and audited.

Declarative vs Imperative: - Declarative: Describe the desired end state; the tool figures out how to get there (Terraform, CloudFormation, Kubernetes manifests) - Imperative: Describe the steps to take (Ansible playbooks, shell scripts)

Modern IaC tends toward declarative because it is easier to reason about state and idempotency.

Testing infrastructure: IaC configurations should be tested just like application code — unit tests for modules, integration tests that provision real infrastructure in a test environment, and policy-as-code checks (e.g., "no S3 buckets are publicly readable").

IaC Categories

Category Purpose Examples
Provisioning Create infrastructure resources Terraform, Pulumi, CloudFormation
Configuration management Configure software on existing servers Ansible, Chef, Puppet
Container orchestration Manage container workloads Kubernetes, Nomad
Image building Create reusable machine images Packer, Docker
Policy as code Enforce compliance rules OPA, Sentinel, Checkov

7. Monitoring and Observability

Once software is deployed, you need to know whether it is working correctly. Observability is the degree to which the internal state of a system can be inferred from its external outputs.

The Three Pillars

Metrics

Metrics are numerical measurements sampled over time. They are cheap to store, easy to aggregate, and good for alerting.

Categories (USE method for resources, RED method for services): - Utilization: How busy is the resource? (CPU %, memory %) - Saturation: How much work is queued? (request queue depth) - Errors: What is the error rate? (HTTP 5xx rate) - Rate: How many requests per second? - Duration: How long do requests take? (p50, p95, p99 latency)

Logs

Logs are timestamped records of discrete events. They provide detail that metrics cannot — the exact error message, the request ID, the user affected.

Good logging practices: - Structured logging: Emit JSON rather than free text so logs are machine-parseable - Correlation IDs: Every request gets a unique ID; all log lines for that request carry it - Log levels: DEBUG, INFO, WARN, ERROR — use levels consistently; don't log everything at ERROR - Avoid logging sensitive data: Passwords, tokens, PII should never appear in logs

Traces

Distributed tracing tracks a request as it flows through multiple services. Each service records a span (start time, end time, service name, operation name); spans are connected by the same trace ID.

Traces answer: "Why was this request slow? Which service was the bottleneck?"

Alerting Principles

Not all problems need an immediate human response. Good alerting: - Alert on symptoms, not causes: "Error rate > 1%" rather than "CPU > 80%" (high CPU may not affect users) - Set thresholds on SLO burn rates: Alert when you're consuming error budget faster than sustainable - Minimize noise: Alert fatigue makes engineers ignore alerts — every alert that wakes someone at 3 AM should be actionable - Include runbook links: Every alert should link to a runbook explaining how to diagnose and resolve it


8. Incident Management

Even well-engineered systems fail. How an organization responds to and learns from incidents defines its reliability culture.

On-Call Practices

  • Rotation: Distribute on-call burden fairly across team members
  • Escalation paths: Define clear escalation chains so responders know who to call if they cannot resolve an incident
  • Runbooks: Pre-written procedures for known failure modes — on-call engineers should not have to improvise basic diagnostics
  • Incident severity levels: Define severity criteria (e.g., SEV1 = complete outage, SEV2 = degraded service) so responses are proportionate

Blameless Postmortems

A postmortem is a written document that records what happened, why, and what will be done to prevent recurrence. The "blameless" modifier is critical: the goal is systemic improvement, not individual punishment.

Why blameless? When people fear punishment, they hide information. A blameless culture means engineers will honestly report what happened — including their own mistakes — because the organization treats incidents as systems problems, not people problems. This transparency is the only way to actually improve.

Standard postmortem structure:

Incident Summary
  - Date, duration, severity, affected systems

Timeline
  - Chronological sequence of events (UTC timestamps)
  - Include detection, escalation, mitigation, resolution

Root Cause Analysis
  - What was the triggering cause?
  - What were the contributing factors?
  - Use "5 Whys" or fault tree analysis

Impact
  - Users affected, revenue impact, SLO burn

Action Items
  - Each item: description, owner, due date, priority
  - Distingush: immediate mitigations vs. long-term fixes

Lessons Learned
  - What went well? (detection, response)
  - What could be improved?
  - What surprised us?

Action items from postmortems must be tracked and completed. A postmortem that generates a list of improvements but nothing changes is worse than no postmortem — it signals that the organization does not actually care about improvement.


9. DORA Metrics

The DevOps Research and Assessment (DORA) program, led by Dr. Nicole Forsgren, Jez Humble, and Gene Kim, is the largest ongoing scientific study of software delivery performance. Their research identifies four key metrics that distinguish high-performing from low-performing engineering organizations.

The Four DORA Metrics

Metric Definition Elite Performers Low Performers
Deployment Frequency How often code is deployed to production Multiple times per day Less than once per month
Lead Time for Changes Time from code commit to running in production Less than 1 hour 1–6 months
Change Failure Rate Percentage of deployments causing a production incident 0–15% 46–60%
Time to Restore Service (MTTR) Time to recover from a production failure Less than 1 hour 1 week to 1 month

What the Metrics Reveal

Deployment Frequency and Lead Time measure throughput — how fast value flows from idea to production.

Change Failure Rate and MTTR measure stability — how reliable the system is and how quickly it recovers.

The counterintuitive finding from DORA research: high throughput and high stability are positively correlated, not trade-offs. Elite teams deploy more frequently and have fewer failures. This is because frequent small changes are inherently less risky than infrequent large changes, and fast feedback loops catch problems earlier.

Using DORA Metrics

  • Use them as diagnostic signals, not performance reviews. Do not use deployment frequency to rank individual engineers.
  • Trend over time is more useful than absolute values. Is the team improving?
  • DORA metrics are lagging indicators — they reflect outcomes. To improve them, focus on leading practices (CI, test coverage, deployment automation, observability).

10. DevOps Maturity Model

Organizations adopt DevOps progressively. A maturity model helps teams understand where they are and what to work on next.

Level Name Characteristics
1 Initial Manual deployments, limited automation, siloed teams, incidents are crises
2 Developing Basic CI pipeline, automated testing, some shared ownership, post-incident reviews happen
3 Defined Continuous Delivery, IaC, monitoring with SLOs, blameless postmortems, deployment frequency weekly+
4 Managed Continuous Deployment, full observability, DORA metrics tracked, error budgets used
5 Optimizing Experimentation culture, chaos engineering, elite DORA metrics, contributing practices back to the industry

Most organizations are at Level 2–3. Moving from Level 3 to 4 typically requires cultural change more than additional tooling.


11. Cultural Aspects of DevOps

Technology is the easy part of DevOps. Culture is the hard part.

Breaking Down Silos

Traditional organizations separate developers, QA, security, and operations into separate departments with different managers, different incentives, and different vocabularies. DevOps requires that these groups work as a unified team toward shared outcomes.

Breaking silos requires: - Shared goals: A unified team has one set of OKRs, not separate department goals that conflict - Embedded expertise: Security and operations expertise lives inside product teams, not in a separate approval queue - Job rotation and shadowing: Developers spend time on-call; ops engineers contribute to CI pipeline development - Shared tools: A common observability platform, a common deployment platform

Shared Responsibility

"You build it, you run it" — Werner Vogels, Amazon CTO, 2006. The team that builds a service is responsible for operating it in production. This forces developers to care about operability: instrumentation, graceful degradation, runbooks, and alerting.

Shared responsibility does not mean every developer is on-call 24/7. It means: - Developers participate in on-call rotations for their services - Engineers design systems to be operable (not just functional) - Reliability is a feature, not an afterthought

Learning from Failure

High-performing DevOps organizations treat every failure as a learning opportunity. This requires psychological safety — the belief that one can speak up, admit mistakes, and ask questions without fear of punishment.

Leaders enable psychological safety by: - Conducting blameless postmortems themselves and modeling the behavior - Celebrating when engineers detect and report problems early - Never using incident data to evaluate individual performance - Acknowledging their own mistakes publicly


Summary

DevOps is a culture and set of practices — not a tool — that bridges the gap between software development and operations to enable fast, reliable software delivery.

Key concepts: - CALMS (Culture, Automation, Lean, Measurement, Sharing) describes the dimensions of DevOps maturity - CI ensures that every code change is verified by an automated build and test pipeline; broken builds are fixed immediately - Continuous Delivery means software is always deployable; Continuous Deployment means it is automatically deployed - Deployment strategies (rolling, blue-green, canary, feature flags) manage risk during releases - IaC treats infrastructure configuration with the same discipline as application code - Observability (metrics, logs, traces) provides the visibility needed to understand production systems - Blameless postmortems turn failures into learning opportunities without punishing individuals - DORA metrics provide evidence-based measures of engineering performance; high throughput and high stability are correlated, not trade-offs


Practice Exercises

  1. Pipeline Design: Sketch a CI/CD pipeline for a web application with a database backend. List every stage, what it checks, approximately how long it takes, and what constitutes a failure at that stage. Justify your stage ordering.

  2. Deployment Strategy Selection: A team runs a payment processing service with 99.99% uptime SLO. They need to deploy a new version that changes the database schema. Which deployment strategy would you recommend and why? What complications does the schema change introduce?

  3. DORA Assessment: A team deploys to production once per month, takes 3 weeks to get a commit to production, has a 30% change failure rate, and takes 2 days on average to recover from incidents. Using the DORA maturity categories, classify this team. Which metric would you target first for improvement, and what specific practices would you recommend?

  4. Postmortem Writing: An e-commerce site experienced a 45-minute outage on Black Friday when a database connection pool exhausted after a traffic spike. Write a postmortem outline: timeline (make up plausible details), root cause analysis using "5 Whys," impact estimate, and at least four action items with owners and priorities.

  5. Blameless Culture Analysis: A team lead says: "Our postmortems always identify the person who made the mistake. We need accountability." Write a response explaining the difference between accountability and blame, and why blameless postmortems actually lead to better outcomes. Reference psychological safety and what happens to information flow when people fear punishment.


Further Reading

  • Books:
  • The Phoenix Project — Gene Kim, Kevin Behr, George Spafford (novel-style introduction to DevOps principles)
  • Accelerate: The Science of Lean Software and DevOps — Nicole Forsgren, Jez Humble, Gene Kim (the research behind DORA metrics)
  • The DevOps Handbook — Gene Kim, Jez Humble, Patrick Debois, John Willis (comprehensive practice guide)
  • Site Reliability Engineering — Google (free online at sre.google) — SRE practices in detail

  • Research and Reports:

  • DORA State of DevOps Report (annual, dora.dev) — benchmark data on engineering performance
  • "Psychological Safety and Learning Behavior in Work Teams" — Amy Edmondson, ASQ 1999

  • Articles:

  • "Continuous Delivery" — Martin Fowler (martinfowler.com/bliki/ContinuousDelivery.html)
  • "Feature Toggles" — Pete Hodgson (martinfowler.com)
  • "CALMS: A DevOps Framework" — Atlassian documentation
  • "Blameless PostMortems and a Just Culture" — John Allspaw, Etsy Engineering Blog

Previous: Process Improvement | Next: Technical Documentation

to navigate between lessons