Learning Data Mining with Python - Second Edition

Book description

Harness the power of Python to develop data mining applications, analyze data, delve into machine learning, explore object detection using Deep Neural Networks, and create insightful predictive models.

About This Book

  • Use a wide variety of Python libraries for practical data mining purposes.
  • Learn how to find, manipulate, analyze, and visualize data using Python.
  • Follow step-by-step instructions on data mining techniques with Python that have real-world applications.

Who This Book Is For

If you are a Python programmer who wants to get started with data mining, then this book is for you. If you are a data analyst who wants to leverage the power of Python to perform data mining efficiently, this book will also help you. No previous experience with data mining is expected.

What You Will Learn

  • Apply data mining concepts to real-world problems
  • Predict the outcome of sports matches based on past results
  • Determine the author of a document based on their writing style
  • Use APIs to download datasets from social media and other online services
  • Find and extract good features from difficult datasets
  • Create models that solve real-world problems
  • Design and develop data mining applications using a variety of datasets
  • Perform object detection in images using Deep Neural Networks
  • Find meaningful insights from your data through intuitive visualizations
  • Compute on big data, including real-time data from the internet

In Detail

This book teaches you to design and develop data mining applications using a variety of datasets, starting with basic classification and affinity analysis. It covers a large number of libraries and tools available for Python, including the Jupyter Notebook, pandas, scikit-learn, and NLTK.

You will gain hands-on experience with complex data types, including text, images, and graphs. You will also discover object detection using Deep Neural Networks, currently one of the large and difficult areas of machine learning.

With restructured examples and code samples updated for the latest version of Python, each chapter of this book introduces you to new algorithms and techniques. By the end of the book, you will have gained insight into using Python for data mining, along with an understanding of the algorithms and their implementations.

Style and approach

This book is a comprehensive guide to learning the various data mining techniques and implementing them in Python. A variety of real-world datasets are used to explain data mining techniques in a crisp, easy-to-understand manner.

Table of contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Errata
      3. Piracy
      4. Questions
  2. Getting Started with Data Mining
    1. Introducing data mining
    2. Using Python and the Jupyter Notebook
      1. Installing Python
      2. Installing Jupyter Notebook
      3. Installing scikit-learn
    3. A simple affinity analysis example
      1. What is affinity analysis?
    4. Product recommendations
      1. Loading the dataset with NumPy
        1. Downloading the example code
      2. Implementing a simple ranking of rules
      3. Ranking to find the best rules
    5. A simple classification example
    6. What is classification?
      1. Loading and preparing the dataset
      2. Implementing the OneR algorithm
      3. Testing the algorithm
    7. Summary
  3. Classifying with scikit-learn Estimators
    1. scikit-learn estimators
      1. Nearest neighbors
      2. Distance metrics
      3. Loading the dataset
      4. Moving towards a standard workflow
      5. Running the algorithm
      6. Setting parameters
    2. Preprocessing
      1. Standard pre-processing
      2. Putting it all together
    3. Pipelines
    4. Summary
  4. Predicting Sports Winners with Decision Trees
    1. Loading the dataset
      1. Collecting the data
      2. Using pandas to load the dataset
      3. Cleaning up the dataset
      4. Extracting new features
    2. Decision trees
      1. Parameters in decision trees
      2. Using decision trees
    3. Sports outcome prediction
      1. Putting it all together
    4. Random forests
      1. How do ensembles work?
      2. Setting parameters in Random Forests
      3. Applying random forests
      4. Engineering new features
    5. Summary
  5. Recommending Movies Using Affinity Analysis
    1. Affinity analysis
      1. Algorithms for affinity analysis
      2. Overall methodology
    2. Dealing with the movie recommendation problem
      1. Obtaining the dataset
        1. Loading with pandas
        2. Sparse data formats
    3. Understanding the Apriori algorithm and its implementation
      1. Looking into the basics of the Apriori algorithm
      2. Implementing the Apriori algorithm
      3. Extracting association rules
      4. Evaluating the association rules
    4. Summary
  6. Features and scikit-learn Transformers
    1. Feature extraction
      1. Representing reality in models
      2. Common feature patterns
      3. Creating good features
    2. Feature selection
      1. Selecting the best individual features
    3. Feature creation
    4. Principal Component Analysis
    5. Creating your own transformer
      1. The transformer API
      2. Implementing a Transformer
    6. Unit testing
    7. Putting it all together
    8. Summary
  7. Social Media Insight using Naive Bayes
    1. Disambiguation
    2. Downloading data from a social network
      1. Loading and classifying the dataset
      2. Creating a replicable dataset from Twitter
    3. Text transformers
      1. Bag-of-words models
      2. n-gram features
      3. Other text features
    4. Naive Bayes
      1. Understanding Bayes' theorem
      2. Naive Bayes algorithm
      3. How it works
    5. Applying Naive Bayes
      1. Extracting word counts
      2. Converting dictionaries to a matrix
      3. Putting it all together
      4. Evaluation using the F1-score
    6. Getting useful features from models
    7. Summary
  8. Follow Recommendations Using Graph Mining
    1. Loading the dataset
      1. Classifying with an existing model
    2. Getting follower information from Twitter
      1. Building the network
    3. Creating a graph
      1. Creating a similarity graph
    4. Finding subgraphs
      1. Connected components
      2. Optimizing criteria
    5. Summary
  9. Beating CAPTCHAs with Neural Networks
    1. Artificial neural networks
      1. An introduction to neural networks
    2. Creating the dataset
      1. Drawing basic CAPTCHAs
      2. Splitting the image into individual letters
      3. Creating a training dataset
    3. Training and classifying
      1. Back-propagation
    4. Predicting words
      1. Improving accuracy using a dictionary
      2. Ranking mechanisms for word similarity
      3. Putting it all together
    5. Summary
  10. Authorship Attribution
    1. Attributing documents to authors
      1. Applications and use cases
      2. Authorship attribution
    2. Getting the data
    3. Using function words
      1. Counting function words
      2. Classifying with function words
    4. Support Vector Machines
      1. Classifying with SVMs
      2. Kernels
    5. Character n-grams
      1. Extracting character n-grams
    6. The Enron dataset
      1. Accessing the Enron dataset
      2. Creating a dataset loader
    7. Putting it all together
    8. Evaluation
    9. Summary
  11. Clustering News Articles
    1. Trending topic discovery
      1. Using a web API to get data
      2. Reddit as a data source
      3. Getting the data
    2. Extracting text from arbitrary websites
      1. Finding the stories in arbitrary websites
      2. Extracting the content
    3. Grouping news articles
    4. The k-means algorithm
      1. Evaluating the results
      2. Extracting topic information from clusters
      3. Using clustering algorithms as transformers
    5. Clustering ensembles
      1. Evidence accumulation
      2. How it works
      3. Implementation
    6. Online learning
      1. Implementation
    7. Summary
  12. Object Detection in Images using Deep Neural Networks
    1. Object classification
      1. Use cases
    2. Application scenario
    3. Deep neural networks
      1. Intuition
      2. Implementing deep neural networks
    4. An Introduction to TensorFlow
    5. Using Keras
      1. Convolutional Neural Networks
    6. GPU optimization
      1. When to use GPUs for computation
      2. Running our code on a GPU
      3. Setting up the environment
    7. Application
      1. Getting the data
      2. Creating the neural network
      3. Putting it all together
    8. Summary
  13. Working with Big Data
    1. Big data
      1. Applications of big data
    2. MapReduce
      1. The intuition behind MapReduce
        1. A word count example
      2. Hadoop MapReduce
    3. Applying MapReduce
      1. Getting the data
    4. Naive Bayes prediction
      1. The mrjob package
    5. Extracting the blog posts
    6. Training Naive Bayes
    7. Putting it all together
    8. Training on Amazon's EMR infrastructure
    9. Summary
  14. Next Steps...
    1. Getting Started with Data Mining
      1. Scikit-learn tutorials
      2. Extending the Jupyter Notebook
      3. More datasets
      4. Other Evaluation Metrics
      5. More application ideas
    2. Classifying with scikit-learn Estimators
      1. Scalability with the nearest neighbor
      2. More complex pipelines
      3. Comparing classifiers
      4. Automated Learning
    3. Predicting Sports Winners with Decision Trees
      1. More complex features
      2. Dask
      3. Research
    4. Recommending Movies Using Affinity Analysis
      1. New datasets
      2. The Eclat algorithm
      3. Collaborative Filtering
    5. Extracting Features with Transformers
      1. Adding noise
      2. Vowpal Wabbit
      3. word2vec
    6. Social Media Insight Using Naive Bayes
      1. Spam detection
      2. Natural language processing and part-of-speech tagging
    7. Discovering Accounts to Follow Using Graph Mining
      1. More complex algorithms
        1. NetworkX
    8. Beating CAPTCHAs with Neural Networks
      1. Better (worse?) CAPTCHAs
      2. Deeper networks
      3. Reinforcement learning
    9. Authorship Attribution
      1. Increasing the sample size
      2. Blogs dataset
      3. Local n-grams
    10. Clustering News Articles
      1. Clustering Evaluation
      2. Temporal analysis
      3. Real-time clusterings
    11. Classifying Objects in Images Using Deep Learning
      1. Mahotas
      2. Magenta
    12. Working with Big Data
      1. Courses on Hadoop
      2. Pydoop
      3. Recommendation engine
      4. W.I.L.L
    13. More resources
      1. Kaggle competitions
        1. Coursera

Product information

  • Title: Learning Data Mining with Python - Second Edition
  • Author(s): Robert Layton
  • Release date: April 2017
  • Publisher(s): Packt Publishing
  • ISBN: 9781787126787