Mining the Social Web, 2nd Edition

Book description

How can you tap into the wealth of social web data to discover who’s making connections with whom, what they’re talking about, and where they’re located? With this expanded and thoroughly revised edition, you’ll learn how to acquire, analyze, and summarize data from all corners of the social web, including Facebook, Twitter, LinkedIn, Google+, GitHub, email, websites, and blogs.

  • Employ the Natural Language Toolkit, NetworkX, and other scientific computing tools to mine popular social web sites
  • Apply advanced text-mining techniques, such as clustering and TF-IDF, to extract meaning from human language data
  • Bootstrap interest graphs from GitHub by discovering affinities among people, programming languages, and coding projects
  • Build interactive visualizations with D3.js, an extraordinarily flexible HTML5 and JavaScript toolkit
  • Take advantage of more than two dozen Twitter recipes, presented in O’Reilly’s popular “problem/solution/discussion” cookbook format

The example code for this unique data science book is maintained in a public GitHub repository. It’s designed to be easily accessible through a turnkey virtual machine that facilitates interactive learning with an easy-to-use collection of IPython Notebooks.

Table of contents

  1. Dedication
  2. Preface
    1. README.1st
    2. Managing Your Expectations
    3. Python-Centric Technology
    4. Improvements Specific to the Second Edition
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Books Online
    8. How to Contact Us
    9. Acknowledgments for the Second Edition
    10. Acknowledgments from the First Edition
  3. I. A Guided Tour of the Social Web
    1. Prelude
    2. 1. Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More
      1. Overview
      2. Why Is Twitter All the Rage?
      3. Exploring Twitter’s API
        1. Fundamental Twitter Terminology
        2. Creating a Twitter API Connection
        3. Exploring Trending Topics
        4. Searching for Tweets
      4. Analyzing the 140 Characters
        1. Extracting Tweet Entities
        2. Analyzing Tweets and Tweet Entities with Frequency Analysis
        3. Computing the Lexical Diversity of Tweets
        4. Examining Patterns in Retweets
        5. Visualizing Frequency Data with Histograms
      5. Closing Remarks
      6. Recommended Exercises
      7. Online Resources
    3. 2. Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More
      1. Overview
      2. Exploring Facebook’s Social Graph API
        1. Understanding the Social Graph API
        2. Understanding the Open Graph Protocol
      3. Analyzing Social Graph Connections
        1. Analyzing Facebook Pages
          1. Analyzing this book’s Facebook page
          2. Analyzing Coke vs Pepsi Facebook pages
        2. Examining Friendships
          1. Analyzing things your friends “like”
          2. Analyzing mutual friendships with directed graphs
          3. Visualizing directed graphs of mutual friendships
      4. Closing Remarks
      5. Recommended Exercises
      6. Online Resources
    4. 3. Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More
      1. Overview
      2. Exploring the LinkedIn API
        1. Making LinkedIn API Requests
        2. Downloading LinkedIn Connections as a CSV File
      3. Crash Course on Clustering Data
        1. Clustering Enhances User Experiences
        2. Normalizing Data to Enable Analysis
          1. Normalizing and counting companies
          2. Normalizing and counting job titles
          3. Normalizing and counting locations
          4. Visualizing locations with cartograms
        3. Measuring Similarity
        4. Clustering Algorithms
          1. Greedy clustering
            1. Runtime analysis
          2. Hierarchical clustering
          3. k-means clustering
          4. Visualizing geographic clusters with Google Earth
      4. Closing Remarks
      5. Recommended Exercises
      6. Online Resources
    5. 4. Mining Google+: Computing Document Similarity, Extracting Collocations, and More
      1. Overview
      2. Exploring the Google+ API
        1. Making Google+ API Requests
      3. A Whiz-Bang Introduction to TF-IDF
        1. Term Frequency
        2. Inverse Document Frequency
        3. TF-IDF
      4. Querying Human Language Data with TF-IDF
        1. Introducing the Natural Language Toolkit
        2. Applying TF-IDF to Human Language
        3. Finding Similar Documents
          1. The theory behind vector space models and cosine similarity
          2. Clustering posts with cosine similarity
          3. Visualizing document similarity with a matrix diagram
        4. Analyzing Bigrams in Human Language
          1. Contingency tables and scoring functions
        5. Reflections on Analyzing Human Language Data
      5. Closing Remarks
      6. Recommended Exercises
      7. Online Resources
    6. 5. Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More
      1. Overview
      2. Scraping, Parsing, and Crawling the Web
        1. Breadth-First Search in Web Crawling
      3. Discovering Semantics by Decoding Syntax
        1. Natural Language Processing Illustrated Step-by-Step
        2. Sentence Detection in Human Language Data
        3. Document Summarization
          1. Analysis of Luhn’s summarization algorithm
      4. Entity-Centric Analysis: A Paradigm Shift
        1. Gisting Human Language Data
      5. Quality of Analytics for Processing Human Language Data
      6. Closing Remarks
      7. Recommended Exercises
      8. Online Resources
    7. 6. Mining Mailboxes: Analyzing Who’s Talking to Whom About What, How Often, and More
      1. Overview
      2. Obtaining and Processing a Mail Corpus
        1. A Primer on Unix Mailboxes
        2. Getting the Enron Data
        3. Converting a Mail Corpus to a Unix Mailbox
        4. Converting Unix Mailboxes to JSON
        5. Importing a JSONified Mail Corpus into MongoDB
          1. The MongoDB shell
        6. Programmatically Accessing MongoDB with Python
      3. Analyzing the Enron Corpus
        1. Querying by Date/Time Range
        2. Analyzing Patterns in Sender/Recipient Communications
        3. Writing Advanced Queries
        4. Searching Emails by Keywords
      4. Discovering and Visualizing Time-Series Trends
      5. Analyzing Your Own Mail Data
        1. Accessing Your Gmail with OAuth
        2. Fetching and Parsing Email Messages with IMAP
        3. Visualizing Patterns in Gmail with the “Graph Your Inbox” Chrome Extension
      6. Closing Remarks
      7. Recommended Exercises
      8. Online Resources
    8. 7. Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More
      1. Overview
      2. Exploring GitHub’s API
        1. Creating a GitHub API Connection
        2. Making GitHub API Requests
      3. Modeling Data with Property Graphs
      4. Analyzing GitHub Interest Graphs
        1. Seeding an Interest Graph
        2. Computing Graph Centrality Measures
        3. Extending the Interest Graph with “Follows” Edges for Users
          1. Application of centrality measures
          2. Adding more repositories to the interest graph
          3. Computational Considerations
        4. Using Nodes as Pivots for More Efficient Queries
        5. Visualizing Interest Graphs
      5. Closing Remarks
      6. Recommended Exercises
      7. Online Resources
    9. 8. Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing over RDF, and More
      1. Overview
      2. Microformats: Easy-to-Implement Metadata
        1. Geocoordinates: A Common Thread for Just About Anything
        2. Using Recipe Data to Improve Online Matchmaking
          1. Retrieving recipe reviews
        3. Accessing LinkedIn’s 200 Million Online Résumés
      3. From Semantic Markup to Semantic Web: A Brief Interlude
      4. The Semantic Web: An Evolutionary Revolution
        1. Man Cannot Live on Facts Alone
          1. Open-world versus closed-world assumptions
        2. Inferencing About an Open World
      5. Closing Remarks
      6. Recommended Exercises
      7. Online Resources
  4. II. Twitter Cookbook
    1. 9. Twitter Cookbook
      1. Accessing Twitter’s API for Development Purposes
        1. Problem
        2. Solution
        3. Discussion
      2. Doing the OAuth Dance to Access Twitter’s API for Production Purposes
        1. Problem
        2. Solution
        3. Discussion
      3. Discovering the Trending Topics
        1. Problem
        2. Solution
        3. Discussion
      4. Searching for Tweets
        1. Problem
        2. Solution
        3. Discussion
      5. Constructing Convenient Function Calls
        1. Problem
        2. Solution
        3. Discussion
      6. Saving and Restoring JSON Data with Text Files
        1. Problem
        2. Solution
        3. Discussion
      7. Saving and Accessing JSON Data with MongoDB
        1. Problem
        2. Solution
        3. Discussion
      8. Sampling the Twitter Firehose with the Streaming API
        1. Problem
        2. Solution
        3. Discussion
      9. Collecting Time-Series Data
        1. Problem
        2. Solution
        3. Discussion
      10. Extracting Tweet Entities
        1. Problem
        2. Solution
        3. Discussion
      11. Finding the Most Popular Tweets in a Collection of Tweets
        1. Problem
        2. Solution
        3. Discussion
      12. Finding the Most Popular Tweet Entities in a Collection of Tweets
        1. Problem
        2. Solution
        3. Discussion
      13. Tabulating Frequency Analysis
        1. Problem
        2. Solution
        3. Discussion
      14. Finding Users Who Have Retweeted a Status
        1. Problem
        2. Solution
        3. Discussion
      15. Extracting a Retweet’s Attribution
        1. Problem
        2. Solution
        3. Discussion
      16. Making Robust Twitter Requests
        1. Problem
        2. Solution
        3. Discussion
      17. Resolving User Profile Information
        1. Problem
        2. Solution
        3. Discussion
      18. Extracting Tweet Entities from Arbitrary Text
        1. Problem
        2. Solution
        3. Discussion
      19. Getting All Friends or Followers for a User
        1. Problem
        2. Solution
        3. Discussion
      20. Analyzing a User’s Friends and Followers
        1. Problem
        2. Solution
        3. Discussion
      21. Harvesting a User’s Tweets
        1. Problem
        2. Solution
        3. Discussion
      22. Crawling a Friendship Graph
        1. Problem
        2. Solution
        3. Discussion
      23. Analyzing Tweet Content
        1. Problem
        2. Solution
        3. Discussion
      24. Summarizing Link Targets
        1. Problem
        2. Solution
        3. Discussion
      25. Analyzing a User’s Favorite Tweets
        1. Problem
        2. Solution
        3. Discussion
      26. Closing Remarks
      27. Recommended Exercises
      28. Online Resources
  5. III. Appendixes
    1. A. Information About This Book’s Virtual Machine Experience
    2. B. OAuth Primer
      1. Overview
        1. OAuth 1.0A
        2. OAuth 2.0
    3. C. Python and IPython Notebook Tips & Tricks
  6. Index
  7. About the Author
  8. Colophon
  9. Copyright

Product information

  • Title: Mining the Social Web, 2nd Edition
  • Author(s): Matthew A. Russell
  • Release date: October 2013
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781449367619