Table of Contents

  1. Chapter 1 Introduction

    1. Monitoring, Alerting, and What They Can Do for You

    2. Monitoring and Alerting in a Nutshell

    3. The Challenges

    4. Important Terms

  2. Chapter 2 Monitoring

    1. The Building Blocks

    2. Drawing Conclusions from Timeseries Plots

  3. Chapter 3 Alerting

    1. The Challenge

    2. Prerequisites

    3. Understanding Failure and Its Impact

    4. Anatomy of an Alarm

    5. Case Study: A Data Pipeline

    6. Types of Alerts

    7. Setting Up Alarms

    8. Alerting Suggestions

  4. Chapter 4 At Scale

    1. Implications of Scale

    2. Composition of Large-Scale Systems

    3. Commonalities of Large-Scale Alerting Configurations

    4. Monitoring Coverage

    5. Managing Large Alerting Configurations

  5. Chapter 5 Monitoring in System Automation

    1. Choosing Appropriate Maintenance Times Automatically

    2. Controlling the Rate of Upgrade

    3. Recovery-Oriented Admission Control

    4. Automated Deployment and Rollback

  6. Chapter 6 The Work Environment

    1. Keeping an Audit Trail

    2. Working with Tickets

    3. Dealing with Anomalies

    4. Learning from Outages

    5. Using Checklists

    6. Creating Dashboards

    7. Service-Level Agreements

    8. Preventing the Ironies of Automation

    9. Culture

  7. Chapter 7 Measuring Success

    1. The Feedback Loop

    2. Ticket Reporting

    3. Measuring Detectability

    4. Transition to Automated Alarms

    5. Maintenance Overhead

    6. How (Not) to Measure

  8. Chapter 8 The Principles

    1. Get in the Habit of Measuring

    2. Draw Conclusions Reliably

    3. Monitor Extensively

    4. Alarm Selectively

    5. Work Smart, Not Hard

  1. Appendix Setting Up OpenTSDB

    1. The Software

    2. First Steps

    3. Gathering Data System-Wide

    4. Timeseries Plots

    5. Get Involved