A web application involves many specialists, but it takes people in web ops to ensure that everything works together throughout an application's lifetime. It's the expertise you need when your start-up gets an unexpected spike in web traffic, or when a new feature causes your mature application to fail. In this collection of essays and interviews, web veterans such as Theo Schlossnagle, Baron Schwartz, and Alistair Croll offer insights into this evolving field. You'll learn stories from the trenches--from builders of some of the biggest sites on the Web--on what's necessary to help a site thrive.
Learn the skills needed in web operations, and why they're gained through experience rather than schooling
Understand why it's important to gather metrics from both your application and infrastructure
Consider common approaches to database architectures and the pitfalls that come with increasing scale
Learn how to handle the human side of outages and degradations
Find out how one company avoided disaster after a huge traffic deluge
Discover how to work out what went wrong after a problem occurs, and how to prevent it from happening again
Contributors include:
John Allspaw
Heather Champ
Michael Christian
Richard Cook
Alistair Croll
Patrick Debois
Eric Florenzano
Paul Hammond
Justin Huff
Adam Jacob
Jacob Loomis
Matt Massie
Brian Moon
Anoop Nagwani
Sean Power
Eric Ries
Theo Schlossnagle
Baron Schwartz
Andrew Shafer
Chapter 1 Web Operations: The Career
Why Does Web Operations Have It Tough?
From Apprentice to Master
Conclusion
Chapter 2 How Picnik Uses Cloud Computing: Lessons Learned
Where the Cloud Fits (and Why!)
Where the Cloud Doesn't Fit (for Picnik)
Conclusion
Chapter 3 Infrastructure and Application Metrics
Time Resolution and Retention Concerns
Locality of Metrics Collection and Storage
Layers of Metrics
Providing Context for Anomaly Detection and Alerts
Log Lines Are Metrics, Too
Correlation with Change Management and Incident Timelines
Making Metrics Available to Your Alerting Mechanisms
Using Metrics to Guide Load-Feedback Mechanisms
A Metrics Collection System, Illustrated: Ganglia
Conclusion
Chapter 4 Continuous Deployment
Small Batches Mean Faster Feedback
Small Batches Mean Problems Are Instantly Localized
Small Batches Reduce Risk
Small Batches Reduce Overhead
The Quality Defenders' Lament
Getting Started
Continuous Deployment Is for Mission-Critical Applications
Conclusion
Chapter 5 Infrastructure As Code
Service-Oriented Architecture
Conclusion
Chapter 6 Monitoring
Story: "The Start of a Journey"
Step 1: Understand What You Are Monitoring
Step 2: Understand Normal Behavior
Step 3: Be Prepared and Learn
Conclusion
Chapter 7 How Complex Systems Fail
How Complex Systems Fail
Further Reading
Chapter 8 Community Management and Web Operations
Chapter 9 Dealing with Unexpected Traffic Spikes
How It All Started
Alarms Abound
Putting Out the Fire
Surviving the Weekend
Preparing for the Future
CDN to the Rescue
Proxy Servers
Corralling the Stampede
Streamlining the Codebase
How Do We Know It Works?
The Real Test
Lessons Learned
Improvements Since Then
Chapter 10 Dev and Ops Collaboration and Cooperation
Deployment
Shared, Open Infrastructure
Trust
On-call Developers
Avoiding Blame
Conclusion
Chapter 11 How Your Visitors Feel: User-Facing Metrics
Why Collect User-Facing Metrics?
What Makes a Site Slow?
Measuring Delay
Building an SLA
Visitor Outcomes: Analytics
Other Metrics Marketing Cares About
How User Experience Affects Web Ops
The Future of Web Monitoring
Conclusion
Chapter 12 Relational Database Strategy and Tactics for the Web
Requirements for Web Databases
How Typical Web Databases Grow
The Yearning for a Cluster
Database Strategy
Database Tactics
Conclusion
Chapter 13 How to Make Failure Beautiful: The Art and Science of Postmortems
The Worst Postmortem
What Is a Postmortem?
When to Conduct a Postmortem
Who to Invite to a Postmortem
Running a Postmortem
Postmortem Follow-Up
Conclusion
Chapter 14 Storage
Data Asset Inventory
Data Protection
Capacity Planning
Storage Sizing
Operations
Conclusion
Chapter 15 Nonrelational Databases
NoSQL Database Overview
Some Systems in Detail
Conclusion
Chapter 16 Agile Infrastructure
Agile Infrastructure
So, What's the Problem?
Communities of Interest and Practice
Trading Zones and Apologies
Conclusion
Chapter 17 Things That Go Bump in the Night (and How to Sleep Through Them)
John Allspaw is currently Operations Engineering Manager at Flickr, the popular photo site. He has had extensive experience working with growing web sites since 1999. These include online news magazines (Salon.com, InfoWorld.com, Macworld.com) and social networking sites that experienced extreme growth (Friendster and Flickr). During his time at Friendster, traffic increased 5X. He was responsible for their transition from a couple dozen servers in a failing data center to over 400 machines across two data centers, and the complete redesign of the backing infrastructure. When he joined Flickr, they had 10 servers in a tiny data center in Vancouver; they are now located in multiple data centers across the US. Prior to his web experience, Allspaw worked in modeling and simulation as a mechanical engineer doing car crash simulations for the NHTSA.
Jesse Robbins is passionate about infrastructure, emergency management, and technology that helps people be safe, happy, and free. He serves as co-chair of the Velocity Performance & Operations Conference and is part of the O'Reilly Radar. Jesse currently advises companies in Seattle and San Francisco. He previously worked at Amazon.com where his title was "Master of Disaster" and where he was responsible for Website Availability. Jesse is a volunteer Firefighter/EMT & Emergency Manager, and led a task force deployed in Operation Hurricane Katrina.
Think of this book as a post-graduate level "Introduction to Internet Support". The authors advocate all those things experienced technicians know make the real difference: metrics, disaster planning, cross-team communication... the list goes on and on.
If you're a technician, read this book and start working the practices. Graph some performance, spend time with the coders, think through how you might deal with double or triple your current traffic or server load. You will become the "go to" person when there are questions and your career will get a lot more fun!
At the (Project) Manager level? Buy copies for everyone on your team and start enabling them. Focus on one or two avenues and break down the barriers to efficiency. Demonstrate the advantages to your senior management so they green-light bigger, more challenging tasks. Find those one or two folks whose minds are open to the possibilities and give them a copy of Making Things Happen: Mastering Project Management (Theory in Practice, O'Reilly). Expect others to look to you for advice.
This isn't a "Try this code" sort of book! There's a bit of challenge if you go to work, ask about metrics, and get blank stares. Challenge...opportunity...options. Read the book, find what really excites you, and go make things better.