Table of Contents

  1. Introduction

    1. Chapter 1 Introduction

      1. The Sysadmin Approach to Service Management
      2. Google’s Approach to Service Management: Site Reliability Engineering
      3. Tenets of SRE
      4. The End of the Beginning
    2. Chapter 2 The Production Environment at Google, from the Viewpoint of an SRE

      1. Hardware
      2. System Software That “Organizes” the Hardware
      3. Other System Software
      4. Our Software Infrastructure
      5. Our Development Environment
      6. Shakespeare: A Sample Service
  2. Principles

    1. Chapter 3 Embracing Risk

      1. Managing Risk
      2. Measuring Service Risk
      3. Risk Tolerance of Services
      4. Motivation for Error Budgets
    2. Chapter 4 Service Level Objectives

      1. Service Level Terminology
      2. Indicators in Practice
      3. Objectives in Practice
      4. Agreements in Practice
    3. Chapter 5 Eliminating Toil

      1. Toil Defined
      2. Why Less Toil Is Better
      3. What Qualifies as Engineering?
      4. Is Toil Always Bad?
      5. Conclusion
    4. Chapter 6 Monitoring Distributed Systems

      1. Definitions
      2. Why Monitor?
      3. Setting Reasonable Expectations for Monitoring
      4. Symptoms Versus Causes
      5. Black-Box Versus White-Box
      6. The Four Golden Signals
      7. Worrying About Your Tail (or, Instrumentation and Performance)
      8. Choosing an Appropriate Resolution for Measurements
      9. As Simple as Possible, No Simpler
      10. Tying These Principles Together
      11. Monitoring for the Long Term
      12. Conclusion
    5. Chapter 7 The Evolution of Automation at Google

      1. The Value of Automation
      2. The Value for Google SRE
      3. The Use Cases for Automation
      4. Automate Yourself Out of a Job: Automate ALL the Things!
      5. Soothing the Pain: Applying Automation to Cluster Turnups
      6. Borg: Birth of the Warehouse-Scale Computer
      7. Reliability Is the Fundamental Feature
      8. Recommendations
    6. Chapter 8 Release Engineering

      1. The Role of a Release Engineer
      2. Philosophy
      3. Continuous Build and Deployment
      4. Configuration Management
      5. Conclusions
    7. Chapter 9 Simplicity

      1. System Stability Versus Agility
      2. The Virtue of Boring
      3. I Won’t Give Up My Code!
      4. The “Negative Lines of Code” Metric
      5. Minimal APIs
      6. Modularity
      7. Release Simplicity
      8. A Simple Conclusion
  3. Practices

    1. Chapter 10 Practical Alerting from Time-Series Data

      1. The Rise of Borgmon
      2. Instrumentation of Applications
      3. Collection of Exported Data
      4. Storage in the Time-Series Arena
      5. Rule Evaluation
      6. Alerting
      7. Sharding the Monitoring Topology
      8. Black-Box Monitoring
      9. Maintaining the Configuration
      10. Ten Years On…
    2. Chapter 11 Being On-Call

      1. Introduction
      2. Life of an On-Call Engineer
      3. Balanced On-Call
      4. Feeling Safe
      5. Avoiding Inappropriate Operational Load
      6. Conclusions
    3. Chapter 12 Effective Troubleshooting

      1. Theory
      2. In Practice
      3. Negative Results Are Magic
      4. Case Study
      5. Making Troubleshooting Easier
      6. Conclusion
    4. Chapter 13 Emergency Response

      1. What to Do When Systems Break
      2. Test-Induced Emergency
      3. Change-Induced Emergency
      4. Process-Induced Emergency
      5. All Problems Have Solutions
      6. Learn from the Past. Don’t Repeat It.
      7. Conclusion
    5. Chapter 14 Managing Incidents

      1. Unmanaged Incidents
      2. The Anatomy of an Unmanaged Incident
      3. Elements of Incident Management Process
      4. A Managed Incident
      5. When to Declare an Incident
      6. In Summary
    6. Chapter 15 Postmortem Culture: Learning from Failure

      1. Google’s Postmortem Philosophy
      2. Collaborate and Share Knowledge
      3. Introducing a Postmortem Culture
      4. Conclusion and Ongoing Improvements
    7. Chapter 16 Tracking Outages

      1. Escalator
      2. Outalator
    8. Chapter 17 Testing for Reliability

      1. Types of Software Testing
      2. Creating a Test and Build Environment
      3. Testing at Scale
      4. Conclusion
    9. Chapter 18 Software Engineering in SRE

      1. Why Is Software Engineering Within SRE Important?
      2. Auxon Case Study: Project Background and Problem Space
      3. Intent-Based Capacity Planning
      4. Fostering Software Engineering in SRE
      5. Conclusions
    10. Chapter 19 Load Balancing at the Frontend

      1. Power Isn’t the Answer
      2. Load Balancing Using DNS
      3. Load Balancing at the Virtual IP Address
    11. Chapter 20 Load Balancing in the Datacenter

      1. The Ideal Case
      2. Identifying Bad Tasks: Flow Control and Lame Ducks
      3. Limiting the Connections Pool with Subsetting
      4. Load Balancing Policies
    12. Chapter 21 Handling Overload

      1. The Pitfalls of “Queries per Second”
      2. Per-Customer Limits
      3. Client-Side Throttling
      4. Criticality
      5. Utilization Signals
      6. Handling Overload Errors
      7. Load from Connections
      8. Conclusions
    13. Chapter 22 Addressing Cascading Failures

      1. Causes of Cascading Failures and Designing to Avoid Them
      2. Preventing Server Overload
      3. Slow Startup and Cold Caching
      4. Triggering Conditions for Cascading Failures
      5. Testing for Cascading Failures
      6. Immediate Steps to Address Cascading Failures
      7. Closing Remarks
    14. Chapter 23 Managing Critical State: Distributed Consensus for Reliability

      1. Motivating the Use of Consensus: Distributed Systems Coordination Failure
      2. How Distributed Consensus Works
      3. System Architecture Patterns for Distributed Consensus
      4. Distributed Consensus Performance
      5. Deploying Distributed Consensus-Based Systems
      6. Monitoring Distributed Consensus Systems
      7. Conclusion
    15. Chapter 24 Distributed Periodic Scheduling with Cron

      1. Cron
      2. Cron Jobs and Idempotency
      3. Cron at Large Scale
      4. Building Cron at Google
      5. Summary
    16. Chapter 25 Data Processing Pipelines

      1. Origin of the Pipeline Design Pattern
      2. Initial Effect of Big Data on the Simple Pipeline Pattern
      3. Challenges with the Periodic Pipeline Pattern
      4. Trouble Caused By Uneven Work Distribution
      5. Drawbacks of Periodic Pipelines in Distributed Environments
      6. Introduction to Google Workflow
      7. Stages of Execution in Workflow
      8. Ensuring Business Continuity
      9. Summary and Concluding Remarks
    17. Chapter 26 Data Integrity: What You Read Is What You Wrote

      1. Data Integrity’s Strict Requirements
      2. Google SRE Objectives in Maintaining Data Integrity and Availability
      3. How Google SRE Faces the Challenges of Data Integrity
      4. Case Studies
      5. General Principles of SRE as Applied to Data Integrity
      6. Conclusion
    18. Chapter 27 Reliable Product Launches at Scale

      1. Launch Coordination Engineering
      2. Setting Up a Launch Process
      3. Developing a Launch Checklist
      4. Selected Techniques for Reliable Launches
      5. Development of LCE
      6. Conclusion
  4. Management

    1. Chapter 28 Accelerating SREs to On-Call and Beyond

      1. You’ve Hired Your Next SRE(s), Now What?
      2. Initial Learning Experiences: The Case for Structure Over Chaos
      3. Creating Stellar Reverse Engineers and Improvisational Thinkers
      4. Five Practices for Aspiring On-Callers
      5. On-Call and Beyond: Rites of Passage, and Practicing Continuing Education
      6. Closing Thoughts
    2. Chapter 29 Dealing with Interrupts

      1. Managing Operational Load
      2. Factors in Determining How Interrupts Are Handled
      3. Imperfect Machines
    3. Chapter 30 Embedding an SRE to Recover from Operational Overload

      1. Phase 1: Learn the Service and Get Context
      2. Phase 2: Sharing Context
      3. Phase 3: Driving Change
      4. Conclusion
    4. Chapter 31 Communication and Collaboration in SRE

      1. Communications: Production Meetings
      2. Collaboration within SRE
      3. Case Study of Collaboration in SRE: Viceroy
      4. Collaboration Outside SRE
      5. Case Study: Migrating DFP to F1
      6. Conclusion
    5. Chapter 32 The Evolving SRE Engagement Model

      1. SRE Engagement: What, How, and Why
      2. The PRR Model
      3. The SRE Engagement Model
      4. Production Readiness Reviews: Simple PRR Model
      5. Evolving the Simple PRR Model: Early Engagement
      6. Evolving Services Development: Frameworks and SRE Platform
      7. Conclusion
  5. Conclusions

    1. Chapter 33 Lessons Learned from Other Industries

      1. Meet Our Industry Veterans
      2. Preparedness and Disaster Testing
      3. Postmortem Culture
      4. Automating Away Repetitive Work and Operational Overhead
      5. Structured and Rational Decision Making
      6. Conclusions
    2. Chapter 34 Conclusion

    3. Appendix Availability Table

    4. Appendix A Collection of Best Practices for Production Services

      1. Fail Sanely
      2. Progressive Rollouts
      3. Define SLOs Like a User
      4. Error Budgets
      5. Monitoring
      6. Postmortems
      7. Capacity Planning
      8. Overloads and Failure
      9. SRE Teams
    5. Appendix Example Incident State Document

    6. Appendix Example Postmortem

      1. Lessons Learned
      2. Timeline
      3. Supporting Information
    7. Appendix Launch Coordination Checklist

    8. Appendix Example Production Meeting Minutes