Seeking SRE

Book description

Organizations big and small have started to realize just how crucial system and application reliability is to their business. They’ve also learned just how difficult it is to maintain that reliability while iterating at the speed demanded by the marketplace. Site Reliability Engineering (SRE) is a proven approach to this challenge.

SRE is a large and rich topic to discuss. Google led the way with Site Reliability Engineering, the wildly successful O’Reilly book that described Google’s creation of the discipline and the implementation that’s allowed them to operate at a planetary scale. Inspired by that earlier work, this book explores a very different part of the SRE space. The more than two dozen chapters in Seeking SRE bring you into some of the important conversations going on in the SRE world right now.

Listen as engineers and other leaders in the field discuss:

  • Different ways of implementing SRE and SRE principles in a wide variety of settings
  • How SRE relates to other approaches such as DevOps
  • Specialties on the cutting edge that will soon be commonplace in SRE
  • Best practices and technologies that make practicing SRE easier
  • The important but rarely explored human side of SRE

David N. Blank-Edelman is the book’s curator and editor.

Table of contents

  1. Introduction
    1. And So It Begins...
    2. Origin Story
    3. Voices
    4. Forward in All Directions!
    5. Acknowledgments
  2. I. SRE Implementation
  3. 1. Context Versus Control in SRE
  4. 2. Interviewing Site Reliability Engineers
    1. Interviewing 101
      1. Who Is Involved
      2. Industry Versus University
      3. Biases
      4. The Funnel
    2. SRE Funnels
      1. Phone Screens
      2. The Onsite Interview
      3. Take-Home Questions
      4. Advice for Hiring Managers
    3. Final Thoughts on Interviewing SREs
    4. Further Reading
  5. 3. So, You Want to Build an SRE Team?
    1. Choose SRE for the Right Reasons
    2. Orienting to a Data-Driven Approach
    3. Commitment to SRE
    4. Making a Decision About SRE
  6. 4. Using Incident Metrics to Improve SRE at Scale
    1. The Virtuous Cycle to the Rescue: If You Don’t Measure It…
    2. Metrics Review: If a Metric Falls in the Forest…
    3. Surrogate Metrics
    4. Repair Debt
    5. Virtual Repair Debt: Exorcising the Ghost in the Machine
    6. Real-Time Dashboards: The Bread and Butter of SRE
    7. Learnings: TL;DR
    8. Further Reading
  7. 5. Working with Third Parties Shouldn’t Suck
    1. Build, Buy, or Adopt?
      1. Establish Importance
      2. Identify Stakeholders
      3. Make a Decision
      4. Acknowledge Reality
    2. Third Parties as First-Class Citizens
      1. When They’re Down, You’re Down
      2. Running the Black Box Like a Service
      3. Service-Level Indicators, Service-Level Objectives, and SLAs
      4. Playbook: From Staging to Production
    3. Closing Thoughts
  8. 6. How to Apply SRE Principles Without Dedicated SRE Teams
    1. SREs to the Rescue! (and How They Failed)
      1. A Matter of Scale in Terms of Headcount
      2. The Embedded SRE
    2. You Build It, You Run It
      1. The Deployment Platform
      2. Closing the Loop: Take Your Own Pager
      3. Introducing Production Engineering
    3. Some Implementation Details
      1. Developers’ Productivity and Health Versus the Pager
      2. Resolving Cross-Team Reliability Issues by Using Postmortems
      3. Uniform Infrastructure and Tooling Versus Autonomy and Innovation
      4. Getting Buy-In
    4. Conclusion
    5. Further Reading
  9. 7. SRE Without SRE: The Spotify Case Study
    1. Tabula Rasa: 2006–2007
      1. Prelude
      2. Key Learnings
    2. Beta and Release: 2008–2009
      1. Prelude
      2. Bringing Scalability and Reliability to the Forefront
      3. Key Learnings
    3. The Curse of Success: 2010
      1. Prelude
      2. A New Ownership Model
      3. Formalizing Core Services
      4. Blessed Deployment Time Slots
      5. On-Call and Alerting
      6. Spawning Off Internal Office Support
      7. Addressing the Remaining Top Concerns
      8. Creating Detectives
      9. Key Learnings
    4. Pets and Cattle, and Agile: 2011
      1. Prelude
      2. Forming Bad Habits
      3. Breaking Those Bad Habits
      4. Key Learnings
    5. A System That Didn’t Scale: 2012
      1. Prelude
      2. Manual Work Hits a Cliff
      3. Key Learnings
    6. Introducing Ops-in-Squads: 2013–2015
      1. Prelude
      2. Building on Trust
      3. Driving the Paradigm Shift
      4. Key Learnings
    7. Autonomy Versus Consistency: 2015–2017
      1. Prelude
      2. Benefits
      3. Trade-Offs
      4. Key Learnings
    8. The Future: Speed at Scale, Safely
  10. 8. Introducing SRE in Large Enterprises
    1. Background
    2. Introducing SRE
      1. Defining Current State
      2. Identifying and Educating Stakeholders
      3. Presenting the Business Case
      4. Implementing the SRE Team
      5. Lessons Learned
      6. Sample Implementation Roadmap
    3. Closing Thoughts
    4. Further Reading
  11. 9. From SysAdmin to SRE in 8,963 Words
    1. Clarifying Terminology
      1. Service-Level Indicator
      2. SLA
      3. Service-Level Objective
    2. Establishing SLAs for Internal Components
    3. Understanding External Dependencies
    4. Nontechnical Solutions
    5. Tracking Availability Level
    6. Dealing with Corner Cases
    7. Conclusion
  12. 10. Clearing the Way for SRE in the Enterprise
    1. Toil, the Enemy of SRE
    2. Toil in the Enterprise
    3. Silos, Queues, and Tickets
      1. Silos Get in the Way
      2. Ticket-Driven Request Queues Are Expensive
    4. Take Action Now
    5. Start by Leaning on Lean
    6. Get Rid of as Many Handoffs as Possible
    7. Replace Remaining Handoffs with Self-Service
      1. Self-Service Is More Than a Button
      2. Self-Service Helps SREs in Multiple Ways
      3. Operations as a Service
    8. Error Budgets, Toil Limits, and Other Tools for Empowering Humans
      1. Error Budgets
      2. Toil Limits
      3. Leverage Existing Enthusiasm for DevOps
      4. Unify Backlogs and Protect Capacity
      5. Psychological Safety and Human Factors
    9. Join the Movement
  13. 11. SRE Patterns Loved by DevOps People Everywhere
    1. Pattern 1: Birth of Automated Testing at Google
    2. Pattern 2: Launch and Handoff Readiness Review at Google
    3. Pattern 3: Create a Shared Source Code Repository
    4. Conclusion
    5. Further Reading and Source Material
  14. 12. DevOps and SRE: Voices from the Community
    1. Background
    2. Method
    3. Results
    4. Replies
  15. 13. Production Engineering at Facebook
  16. II. Near Edge SRE
  17. 14. In the Beginning, There Was Chaos
    1. The Problem with Systems
    2. Economic Pillars of Complexity
    3. Beginning Chaos
    4. Navigating Complexity for Safety
    5. Chaos Goes Big
    6. Formalization
    7. Advanced Principles
    8. Frequently Asked Questions
    9. Conclusion
  18. 15. The Intersection of Reliability and Privacy
    1. The Intersection of Reliability and Privacy
    2. The General Landscape of Privacy Engineering
    3. Privacy and SRE: Common Approaches
      1. Reducing Toil
      2. Efficient and Deliberate Problem Solving
      3. Relationship Management
      4. Early Intervention and Education Through Evangelism
    4. Nuances, Differences, and Trade-Offs
    5. Conclusion
    6. Further Reading
  19. 16. Database Reliability Engineering
    1. Guiding Principles of the Database Reliability Engineer
      1. Protect the Data
      2. Self-Service for Scale
      3. Databases Are Not Special
    2. A Culture of Database Reliability Engineering
    3. Recoverability
      1. Considerations for Recovery
      2. Anatomy of a Recovery Strategy
      3. Building Block 1: Detection
      4. Building Block 2: Diverse Storage
      5. Building Block 3: A Varied Toolbox
      6. Building Block 4: Testing
      7. Championing Recovery Reliability
    4. Continuous Delivery: From Development to Production
      1. Education and Collaboration
    5. Collaboration
    6. Deployment
      1. Migrations and Versioning
      2. Impact Analysis
      3. Migration Patterns
      4. Championing CD
    7. Making the Case for DBRE
    8. Further Reading
  20. 17. Engineering for Data Durability
    1. Replication Is Table Stakes
      1. Backups
      2. Replication
    2. Real-World Durability
      1. Isolation
    3. Protection
      1. Testing
      2. Safeguards
      3. Recovery
    4. Verification
      1. The Power of Zero
      2. Verification Coverage
      3. Watching the Watchers
    5. Automation
      1. Window of Vulnerability
      2. Operator Fatigue
      3. Reliability
    6. Conclusion
  21. 18. Introduction to Machine Learning for SRE
    1. Why Use Machine Learning for SRE?
    2. Why and How Should My Company Be Engaging in This?
      1. Some SRE Problems Machine Learning Can Help Solve
    3. The Awakening of Applied AI
    4. What Is Machine Learning?
      1. What Do We Mean by Learning?
      2. From Chess to Go: How Deep Can We Dive?
      3. Why Now? What Changed for Us?
    5. What Are Neural Networks?
      1. Neurons and Neural Networks
      2. How and When Should We Apply Neural Networks?
      3. What Kinds of Data Can We Use?
    6. Practical Machine Learning
      1. Popular Libraries for Neural Networks
      2. Practical Machine Learning Examples
    7. Success Stories
    8. Further Reading
      1. My GitHub Repository
      2. Recommended Books
  22. III. SRE Best Practices and Technologies
  23. 19. Do Docs Better: Integrating Documentation into the Engineering Workflow
    1. Defining Quality: What Do Good Docs Look Like?
      1. Functional Requirements for SRE Documentation
    2. Integrating Docs into the Engineering Workflow
      1. The Google Experience: g3doc and EngPlay
      2. What We Learned
    3. Doing Docs Better: Best Practices
      1. Create Templates for Each Documentation Type
      2. Better > Best: Set Realistic Standards for Quality
      3. Require Docs as Part of Code Review
      4. Ruthlessly Prune Your Docs
      5. Recognize and Reward Documentation
    4. Communicating the Value of Documentation
    5. Further Reading
  24. 20. Active Teaching and Learning
    1. Active Learning
      1. Active Learning Example: Wheel of Misfortune
      2. Active Learning Example: Incident Manager (a Card Game)
      3. Active Learning Example: SRE Classroom
    2. The Costs of Failing to Learn
    3. Learning Habits of Effective SRE Teams
      1. Production Meetings
      2. Postmortems
    4. A Call to Action: Ditch the Boring Slides
  25. 21. The Art and Science of the Service-Level Objective
    1. Why Set Goals?
    2. Availability
      1. Time Quanta
      2. Transactions
      3. Transactions over Time Quanta
    3. On Evaluating SLOs
    4. Histograms
    5. Where Percentiles Fall Down (and Histograms Step Up)
    6. Parting Thought: Looking at SLOs Upside Down
    7. Further Reading
  26. 22. SRE as a Success Culture
    1. Where Did SRE Come From?
    2. Key Values for SRE
      1. Keeping the Site Up
      2. Empowering Teams to “Do the Right Thing”
      3. Approaching Operations as an Engineering Problem
      4. Achieving Business Success Through Promises (Service Levels)
    3. Critical Enabling Functions of SRE
      1. Monitoring, Metrics, and KPIs
      2. Incident Management and Emergency Response
      3. Capacity Planning and Demand Forecasting
      4. Performance Analysis and Optimization
      5. Provisioning, Change Management, and Velocity
    4. Phases of SRE Execution
      1. Phase 1: Firefighting/Reactive
      2. Phase 2: Gatekeepers
      3. Phase 3: Advocates/Partners
      4. Phase 4: Catalytic
      5. Complications of Differing Phases
    5. Focus on the Details of Success
    6. Further Reading
  27. 23. SRE Antipatterns
    1. Antipattern 1: Site Reliability Operations
    2. Antipattern 2: Humans Staring at Screens
    3. Antipattern 3: Mob Incident Response
    4. Antipattern 4: Root Cause = Human Error
    5. Antipattern 5: Passing the Pager
    6. Antipattern 6: Magic Smoke Jumping!
    7. Antipattern 7: Alert Reliability Engineering
    8. Antipattern 8: Hiring a Dog-Walker to Tend Your Pets
    9. Antipattern 9: Speed-Bump Engineering
    10. Antipattern 10: Design Chokepoints
    11. Antipattern 11: Too Much Stick, Not Enough Carrot
    12. Antipattern 12: Postponing Production
    13. Antipattern 13: Optimizing Failure Avoidance Rather Than Recovery Time (MTTF > MTTR)
    14. Antipattern 14: Dependency Hell
    15. Antipattern 15: Ungainly Governance
    16. Antipattern 16: Ill-Considered SLOh-Ohs
    17. Antipattern 17: Tossing Your API Over the Firewall
    18. Antipattern 18: Fixing the Ops Team
    19. So, That’s It, Then?
  28. 24. Immutable Infrastructure and SRE
    1. Scalability, Reliability, and Performance
    2. Failure Recovery
    3. Simpler Operations
    4. Faster Startup Times
    5. Known State
    6. Continuous Integration/Continuous Deployment with Confidence
    7. Security
    8. Multiregion Operations
    9. Release Engineering
    10. Building the Base Image
    11. Deploying Applications
    12. Disadvantages
    13. Conclusion
  29. 25. Scriptable Load Balancers
    1. Scriptable Load Balancers: The New Kid on the Block
      1. Why Scriptable Load Balancers?
    2. Making the Difficult Easy
      1. Shard-Aware Routing
      2. Harnessing Potential
      3. Case Study: Intermission
    3. Service-Level Middleware
      1. Middleware to the Rescue
      2. APIs of Service-Level Middleware
      3. Case Study: WAF/Bot Mitigation
    4. Avoiding Disaster
      1. Getting Clever with State
      2. Case Study: Checkout Queue
    5. Looking to the Future and Further Reading
  30. 26. The Service Mesh: Wrangler of Your Microservices?
    1. Ready to Get Rid of the Monolith?
    2. Current State of Microservice Networking
    3. Service Mesh to the Rescue
      1. The Benefits of a Sidecar Proxy
      2. Eventually Consistent Service Discovery
      3. Observability and Alarming
      4. Sidecar Performance Implications
      5. Thin Libraries and Context Propagation
      6. Configuration Management (Control Plane Versus Data Plane)
    4. The Service Mesh in Practice
      1. The Origin and Development of Envoy at Lyft
      2. Operating Envoy at Lyft
    5. The Future of the Service Mesh
    6. Further Reading
  31. IV. The Human Side of SRE
  32. 27. Psychological Safety in SRE
    1. The Primary Indicator of a Successful Team
      1. How to Build Psychological Safety into Your Own Team
    2. Further Reading
  33. 28. SRE Cognitive Work
    1. Introduction
    2. What Do SRE People Do?
    3. Why Should We Care About Practitioner Cognition?
      1. Critical Decisions Made Under Uncertainty and Time Pressure Cannot Be Scripted
      2. Human Performance in Modern Complex Systems: The Main Themes
    4. Observations on SRE Cognitive Work Around Incidents
      1. Every Incident Could Have Been Worse
      2. Sacrifice Decisions Take Place Under Uncertainty
      3. Repairs to Functional Systems
      4. Special Knowledge About Complex Systems
      5. Managing the Costs of Coordination
      6. SREs Are Cognitive Agents Working in a Joint Cognitive System
    5. The Calibration Problem
      1. Mental Models
      2. Incidents Trigger Individual Recalibration
      3. Incidents Are Opportunities for Collective Recalibration
    6. What Are the Implications of All This?
      1. Incidents Will Continue
      2. Incidents Will Impose Costs
      3. Incident Patterns Will Change
      4. Incidents Point to Specific Calibration Problems and Locations
    7. What Should Happen Next?
      1. Build a Corpus of Cases
      2. Focus on Making Automation a Team Player in SRE Work
      3. Address the Calibration Problem
    8. What Can You Do?
    9. Conclusion
    10. References
  34. 29. Beyond Burnout
    1. Defining Mental Disorders
    2. Mental Disorders Are Missing from the Diversity Conversation
    3. Sanity Isn’t a Business Requirement
    4. Thoughts and Prayers Aren’t Scalable
    5. Full-Stack Inclusivity
      1. Application
      2. Interviewing
      3. Compensation
      4. Benefits
      5. Onboarding
      6. Working Conditions
      7. Job Duties
      8. Training
      9. Promotion
      10. Leaving
    6. Inclusivity for Anyone Helps Everyone
    7. Mental Disorder Resources
  35. 30. Against On-Call: A Polemic
    1. The Rationale for On-Call
      1. First, Do No Harm
      2. Parallels with SRE
      3. Differences with SRE
      4. Underlying Assumptions Driving On-Call for Engineers
      5. On-Call Is Emergency Medicine Instead of Ward Medicine
      6. Counterarguments
    2. The Cost to Humans of Doing On-Call
      1. We don’t need another hero
    3. Actual Solutions
      1. Training
      2. Prioritization
      3. Improving On-the-Job Performance
    4. We Need a Fundamental Change in Approach
      1. Strong-Anti-On-Call
      2. Weak-Anti-On-Call
      3. A Union of the Two
    5. Conclusion
  36. 31. Elegy for Complex Systems
    1. The Computer and Human Systems Cannot Be Separated
    2. Decoherence and Cascading Failure
    3. Always in a State of Partial Failure
    4. Novelty Priority Inversion
    5. Nobody Anticipates the Overhead of Coordination
    6. Your healthcare.gov Is Out There
      1. To Get Involved
    7. Further Reading
  37. 32. Intersections Between Operations and Social Activism
    1. Before, During, After
      1. Creating the Perfect Plan
      2. Principles of Organizing
      3. Managing Crisis: Responding When Things Break Down
      4. Writing Our Own History: Making Sense of What Went Down
    2. The Long Tail: Turning Action into Change
      1. Activism and Change Within a Company
    3. Conclusion
  38. 33. Conclusion
  39. Index

Product information

  • Title: Seeking SRE
  • Author(s): David N. Blank-Edelman
  • Release date: September 2018
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491978863