Building Scalable Web Sites

Book description

Learn the tricks of the trade so you can build and architect applications that scale quickly--without all the high-priced headaches and service-level agreements associated with enterprise app servers and proprietary programming and database products. Culled from the experience of the Flickr.com lead developer, Building Scalable Web Sites offers techniques for creating fast sites that your visitors will find a pleasure to use.

Creating popular sites requires much more than fast hardware with lots of memory and hard drive space. It requires thinking about how to grow over time, how to make the same resources accessible to audiences with different expectations, and how to have a team of developers work on a site without creating new problems for visitors and for each other.

Topics covered include:

* Presenting information to visitors from all over the world
* Integrating email with your web applications
* Planning hardware purchases and hosting options to have as much as you need without breaking your wallet
* Partitioning and distributing databases to support large datasets and simultaneous transactions
* Monitoring your applications to find and clear bottlenecks
* Providing services APIs and using services from other providers to increase your site's reach and capabilities

Whether you're starting a small web site with hopes of growing big or you already have a large system that needs maintenance, you'll find Building Scalable Web Sites to be a library of ideas for making things work.

Table of contents

  1. A Note Regarding Supplemental Files
  2. Preface
    1. What This Book Is About
    2. What You Need to Know
    3. Conventions Used in This Book
    4. Using Code Examples
    5. Safari® Enabled
    6. How to Contact Us
    7. Acknowledgments
  3. 1. Introduction
    1. 1.1. What Is a Web Application?
    2. 1.2. How Do You Build Web Applications?
    3. 1.3. What Is Architecture?
    4. 1.4. How Do I Get Started?
  4. 2. Web Application Architecture
    1. 2.1. Layered Software Architecture
    2. 2.2. Layered Technologies
    3. 2.3. Software Interface Design
    4. 2.4. Getting from A to B
    5. 2.5. The Software/Hardware Divide
    6. 2.6. Hardware Platforms
      1. 2.6.1. Shared Hardware
      2. 2.6.2. Dedicated Hardware
      3. 2.6.3. Co-Located Hardware
      4. 2.6.4. Self-Hosting
    7. 2.7. Hardware Platform Growth
      1. 2.7.1. Availability and Lead Times
      2. 2.7.2. Importing, Shipping, and Staging
      3. 2.7.3. Space
      4. 2.7.4. Power
      5. 2.7.5. NOC Facilities
      6. 2.7.6. Connectivity
    8. 2.8. Hardware Redundancy
    9. 2.9. Networking
    10. 2.10. Languages, Technologies, and Databases
  5. 3. Development Environments
    1. 3.1. The Three Rules
    2. 3.2. Use Source Control
      1. 3.2.1. What Is Source Control?
        1. 3.2.1.1. Versioning
        2. 3.2.1.2. Rollback
        3. 3.2.1.3. Logs
        4. 3.2.1.4. Diffs
        5. 3.2.1.5. Multiuser editing and merging
        6. 3.2.1.6. Annotation (blame)
        7. 3.2.1.7. The locking debate
        8. 3.2.1.8. Projects and modules
        9. 3.2.1.9. Tagging
        10. 3.2.1.10. Branching
        11. 3.2.1.11. Merging
      2. 3.2.2. Utilities—the “Nice to Haves”
        1. 3.2.2.1. Shell and editor integration
        2. 3.2.2.2. Web interfaces
        3. 3.2.2.3. Commit-log mailing list
        4. 3.2.2.4. Commit-log RSS feed
        5. 3.2.2.5. Commit database
        6. 3.2.2.6. Commit hooks
      3. 3.2.3. Source-Control Products
      4. 3.2.4. The Revision Control System (RCS)
        1. 3.2.4.1. The Concurrent Versions System (CVS)
          1. 3.2.4.1.1. Client availability
          2. 3.2.4.1.2. Web interfaces
          3. 3.2.4.1.3. Mailing list and RSS feed
          4. 3.2.4.1.4. Commit database
        2. 3.2.4.2. Subversion (SVN)
          1. 3.2.4.2.1. Client availability
          2. 3.2.4.2.2. Web interfaces
          3. 3.2.4.2.3. Mailing list and RSS feed
          4. 3.2.4.2.4. Commit database
        3. 3.2.4.3. Perforce
          1. 3.2.4.3.1. Client availability
          2. 3.2.4.3.2. Web interfaces
          3. 3.2.4.3.3. Mailing list and RSS feed
          4. 3.2.4.3.4. Commit database
        4. 3.2.4.4. Visual Source Safe (VSS)
          1. 3.2.4.4.1. Client availability
          2. 3.2.4.4.2. Web interfaces
          3. 3.2.4.4.3. Mailing list and RSS feed
          4. 3.2.4.4.4. Commit database
        5. 3.2.4.5. And the rest . . .
        6. 3.2.4.6. Summary
      5. 3.2.5. What to Put in Source Control
        1. 3.2.5.1. Documentation
        2. 3.2.5.2. Software configurations
        3. 3.2.5.3. Build tools
      6. 3.2.6. What Not to Put in Source Control
    3. 3.3. One-Step Build
      1. 3.3.1. Editing Live
      2. 3.3.2. Creating a Work Environment
        1. 3.3.2.1. Development
          1. 3.3.2.1.1. Personal development environments
        2. 3.3.2.2. Staging
          1. 3.3.2.2.1. Sub-staging
        3. 3.3.2.3. Production
        4. 3.3.2.4. Beta production
      3. 3.3.3. The Release Process
      4. 3.3.4. Build Tools
      5. 3.3.5. Release Management
      6. 3.3.6. What Not to Automate
        1. 3.3.6.1. Database schema changes
        2. 3.3.6.2. Software and hardware configuration changes
    4. 3.4. Issue Tracking
      1. 3.4.1. The Minimal Feature Set
      2. 3.4.2. Issue-Tracking Software
        1. 3.4.2.1. FogBugz
        2. 3.4.2.2. Mantis Bug Tracker
        3. 3.4.2.3. Request Tracker (RT)
        4. 3.4.2.4. Bugzilla
        5. 3.4.2.5. Trac
      3. 3.4.3. What to Track
        1. 3.4.3.1. Bugs
        2. 3.4.3.2. Features
        3. 3.4.3.3. Operations
        4. 3.4.3.4. Support requests
      4. 3.4.4. Issue Management Strategy
        1. 3.4.4.1. High-level categorization
      5. 3.4.5. CADT
    5. 3.5. Scaling the Development Model
    6. 3.6. Coding Standards
    7. 3.7. Testing
      1. 3.7.1. Regression Testing
      2. 3.7.2. Manual Testing
  6. 4. i18n, L10n, and Unicode
    1. 4.1. Internationalization and Localization
      1. 4.1.1. Internationalization in Web Applications
      2. 4.1.2. Localization in Web Applications
        1. 4.1.2.1. String substitution
        2. 4.1.2.2. Multiple template sets
        3. 4.1.2.3. Multiple frontends
    2. 4.2. Unicode in a Nutshell
    3. 4.3. Unicode Encodings
      1. 4.3.1. Code Points and Characters, Glyphs and Graphemes
      2. 4.3.2. Byte Order Mark
    4. 4.4. The UTF-8 Encoding
    5. 4.5. UTF-8 Web Applications
      1. 4.5.1. Handling Output
      2. 4.5.2. Handling Input
    6. 4.6. Using UTF-8 with PHP
    7. 4.7. Using UTF-8 with Other Languages
    8. 4.8. Using UTF-8 with MySQL
    9. 4.9. Using UTF-8 with Email
    10. 4.10. Using UTF-8 with JavaScript
    11. 4.11. Using UTF-8 with APIs
  7. 5. Data Integrity and Security
    1. 5.1. Data Integrity Policies
    2. 5.2. Good, Valid, and Invalid
    3. 5.3. Filtering UTF-8
    4. 5.4. Filtering Control Characters
    5. 5.5. Filtering HTML
      1. 5.5.1. Why Use HTML?
      2. 5.5.2. HTML Input Filtering
      3. 5.5.3. Blacklists and Whitelists
      4. 5.5.4. Balancing
      5. 5.5.5. Dealing with HTML
    6. 5.6. Cross-Site Scripting (XSS)
      1. 5.6.1. The Canonical Hole
      2. 5.6.2. User Input Holes
      3. 5.6.3. Tag and Bracket Balancing
      4. 5.6.4. Protocol Filtering
    7. 5.7. SQL Injection Attacks
      1. 5.7.1. Mitigating SQL Injection Attacks
      2. 5.7.2. Avoiding SQL Injection Attacks
  8. 6. Email
    1. 6.1. Receiving Email
    2. 6.2. Injecting Email into Your Application
      1. 6.2.1. An Alternative Approach
    3. 6.3. The MIME Format
    4. 6.4. Parsing Simple MIME Emails
    5. 6.5. Parsing UU Encoded Attachments
    6. 6.6. TNEF Attachments
    7. 6.7. Wireless Carriers Hate You
    8. 6.8. Character Sets and Encodings
    9. 6.9. Recognizing Your Users
    10. 6.10. Unit Testing
  9. 7. Remote Services
    1. 7.1. Remote Services Club
    2. 7.2. Sockets
    3. 7.3. Using HTTP
      1. 7.3.1. The HTTP Request and Response Cycle
      2. 7.3.2. HTTP Authentication
      3. 7.3.3. Making an HTTP Request
    4. 7.4. Remote Services Redundancy
    5. 7.5. Asynchronous Systems
    6. 7.6. Exchanging XML
      1. 7.6.1. Parsing XML
      2. 7.6.2. REST
      3. 7.6.3. XML-RPC
      4. 7.6.4. SOAP
    7. 7.7. Lightweight Protocols
      1. 7.7.1. Memory Usage
      2. 7.7.2. Network Speed
      3. 7.7.3. Parsing Speed
      4. 7.7.4. Writing Speed
      5. 7.7.5. Downsides
      6. 7.7.6. Rolling Your Own
  10. 8. Bottlenecks
    1. 8.1. Identifying Bottlenecks
      1. 8.1.1. Application Areas by Software Component
      2. 8.1.2. Application Areas by Hardware Component
      3. 8.1.3. CPU Usage
      4. 8.1.4. Code Profiling
      5. 8.1.5. Opcode Caching
      6. 8.1.6. Speeding Up Templates
      7. 8.1.7. General Solutions
      8. 8.1.8. I/O
      9. 8.1.9. Disk I/O
      10. 8.1.10. Network I/O
      11. 8.1.11. Memory I/O
      12. 8.1.12. Memory and Swap
    2. 8.2. External Services and Black Boxes
      1. 8.2.1. Databases
      2. 8.2.2. Query Spot Checks
      3. 8.2.3. Query Profiling
      4. 8.2.4. Query and Index Optimization
      5. 8.2.5. Caching
      6. 8.2.6. Denormalization
  11. 9. Scaling Web Applications
    1. 9.1. The Scaling Myth
      1. 9.1.1. What Is Scalability?
      2. 9.1.2. Scaling a Hardware Platform
      3. 9.1.3. Vertical Scaling
      4. 9.1.4. Horizontal Scaling
      5. 9.1.5. Ongoing Work
      6. 9.1.6. Redundancy
    2. 9.2. Scaling the Network
      1. 9.2.1. Scaling PHP
    3. 9.3. Load Balancing
      1. 9.3.1. Load Balancing with Hardware
      2. 9.3.2. Load Balancing with Software
      3. 9.3.3. Layer 4
      4. 9.3.4. Layer 7
      5. 9.3.5. Huge-Scale Balancing
      6. 9.3.6. Balancing Non-HTTP Traffic
    4. 9.4. Scaling MySQL
      1. 9.4.1. Storage Backends
    5. 9.5. MyISAM
      1. 9.5.1. InnoDB
      2. 9.5.2. BDB
      3. 9.5.3. Heap
    6. 9.6. MySQL Replication
      1. 9.6.1. Master-Slave Replication
      2. 9.6.2. Tree Replication
      3. 9.6.3. Master-Master Replication
      4. 9.6.4. Replication Failure
      5. 9.6.5. Replication Lag
    7. 9.7. Database Partitioning
      1. 9.7.1. Clustering
      2. 9.7.2. Federation
    8. 9.8. Scaling Large Database
    9. 9.9. Scaling Storage
      1. 9.9.1. Filesystems
      2. 9.9.2. Protocols
      3. 9.9.3. RAID
      4. 9.9.4. Federation
      5. 9.9.5. Caching
      6. 9.9.6. Caching Data
      7. 9.9.7. Caching HTTP Requests
      8. 9.9.8. Scaling in a Nutshell
  12. 10. Statistics, Monitoring, and Alerting
    1. 10.1. Tracking Web Statistics
      1. 10.1.1. Server Logfiles
      2. 10.1.2. Analysis
      3. 10.1.3. Using Beacons
      4. 10.1.4. Spread
      5. 10.1.5. Load Balancers
      6. 10.1.6. Tracking Custom Metrics
    2. 10.2. Application Monitoring
      1. 10.2.1. Bandwidth Monitoring
      2. 10.2.2. Long-Term System Statistics
        1. 10.2.2.1. MySQL statistics
        2. 10.2.2.2. Apache statistics
        3. 10.2.2.3. memcached statistics
        4. 10.2.2.4. Squid statistics
      3. 10.2.3. Custom Visualizations
    3. 10.3. Alerting
      1. 10.3.1. Uptime Checks
      2. 10.3.2. Resource-Level Monitoring
      3. 10.3.3. Threshold Checks
      4. 10.3.4. Low-Watermark Checks
  13. 11. APIs
    1. 11.1. Data Feeds
      1. 11.1.1. RSS
      2. 11.1.2. RDF
      3. 11.1.3. Atom
      4. 11.1.4. The Others
      5. 11.1.5. Feed Auto-Discovery
      6. 11.1.6. Feed Templating
      7. 11.1.7. OPML
      8. 11.1.8. Feed Authentication
    2. 11.2. Mobile Content
      1. 11.2.1. The Wireless Application Protocol (WAP)
      2. 11.2.2. XHTML Mobile Profile
    3. 11.3. Web Services
    4. 11.4. API Transports
      1. 11.4.1. REST
      2. 11.4.2. XML-RPC
      3. 11.4.3. SOAP
      4. 11.4.4. Transport Abstraction
    5. 11.5. API Abuse
      1. 11.5.1. Monitoring with API Keys
      2. 11.5.2. Throttling
      3. 11.5.3. Caching
    6. 11.6. Authentication
      1. 11.6.1. None at All
      2. 11.6.2. Plain Text
      3. 11.6.3. Message Authentication Code (MAC)
      4. 11.6.4. Token-Based Systems
    7. 11.7. The Future
  14. Index
  15. About the Author
  16. Colophon
  17. Copyright

Product information

  • Title: Building Scalable Web Sites
  • Author(s): Cal Henderson
  • Release date: May 2006
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9780596102357