Python Data Science Handbook

Book description

For many researchers, Python is a first-class tool mainly because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the Python Data Science Handbook do you get them all—IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and other related tools.

Working scientists and data crunchers familiar with reading and writing Python code will find this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python.

With this handbook, you’ll learn how to use:

  • IPython and Jupyter, which provide computational environments for data scientists using Python
  • NumPy, which includes the ndarray for efficient storage and manipulation of dense data arrays in Python
  • Pandas, which features the DataFrame for efficient storage and manipulation of labeled/columnar data in Python
  • Matplotlib, which includes capabilities for a flexible range of data visualizations in Python
  • Scikit-Learn, which provides efficient and clean Python implementations of the most important and established machine learning algorithms
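
As a rough illustration of the two data structures named in the list above, the following minimal sketch builds a NumPy ndarray and a Pandas DataFrame. It is not taken from the book; the sample values, the row labels, and the variable names are all made up for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: a dense, fixed-type array; arithmetic applies element-wise.
heights_cm = np.array([170.0, 182.5, 165.0])  # made-up sample values
heights_m = heights_cm / 100.0
mean_height_m = heights_m.mean()

# Pandas: a DataFrame adds row and column labels on top of array data.
people = pd.DataFrame(
    {"height_cm": heights_cm, "weight_kg": [65.0, 80.0, 58.0]},
    index=["ana", "ben", "chloe"],  # hypothetical row labels
)

# Column-wise arithmetic keeps the row labels aligned.
people["bmi"] = people["weight_kg"] / (people["height_cm"] / 100.0) ** 2

# Label-based lookup: the index label of the tallest person.
tallest = people["height_cm"].idxmax()
```

The ndarray stores values of a single type contiguously, which is what makes element-wise operations efficient; the DataFrame wraps such columns with labels so that results can be selected by name rather than by position.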

Table of contents

  1. Preface
    1. What Is Data Science?
    2. Who Is This Book For?
    3. Why Python?
      1. Python 2 Versus Python 3
    4. Outline of This Book
    5. Using Code Examples
    6. Installation Considerations
    7. Conventions Used in This Book
    8. O’Reilly Safari
    9. How to Contact Us
  2. 1. IPython: Beyond Normal Python
    1. Shell or Notebook?
      1. Launching the IPython Shell
      2. Launching the Jupyter Notebook
    2. Help and Documentation in IPython
      1. Accessing Documentation with ?
      2. Accessing Source Code with ??
      3. Exploring Modules with Tab Completion
    3. Keyboard Shortcuts in the IPython Shell
      1. Navigation Shortcuts
      2. Text Entry Shortcuts
      3. Command History Shortcuts
      4. Miscellaneous Shortcuts
    4. IPython Magic Commands
      1. Pasting Code Blocks: %paste and %cpaste
      2. Running External Code: %run
      3. Timing Code Execution: %timeit
      4. Help on Magic Functions: ?, %magic, and %lsmagic
    5. Input and Output History
      1. IPython’s In and Out Objects
      2. Underscore Shortcuts and Previous Outputs
      3. Suppressing Output
      4. Related Magic Commands
    6. IPython and Shell Commands
      1. Quick Introduction to the Shell
      2. Shell Commands in IPython
      3. Passing Values to and from the Shell
    7. Shell-Related Magic Commands
    8. Errors and Debugging
      1. Controlling Exceptions: %xmode
      2. Debugging: When Reading Tracebacks Is Not Enough
    9. Profiling and Timing Code
      1. Timing Code Snippets: %timeit and %time
      2. Profiling Full Scripts: %prun
      3. Line-by-Line Profiling with %lprun
      4. Profiling Memory Use: %memit and %mprun
    10. More IPython Resources
      1. Web Resources
      2. Books
  3. 2. Introduction to NumPy
    1. Understanding Data Types in Python
      1. A Python Integer Is More Than Just an Integer
      2. A Python List Is More Than Just a List
      3. Fixed-Type Arrays in Python
      4. Creating Arrays from Python Lists
      5. Creating Arrays from Scratch
      6. NumPy Standard Data Types
    2. The Basics of NumPy Arrays
      1. NumPy Array Attributes
      2. Array Indexing: Accessing Single Elements
      3. Array Slicing: Accessing Subarrays
      4. Reshaping of Arrays
      5. Array Concatenation and Splitting
    3. Computation on NumPy Arrays: Universal Functions
      1. The Slowness of Loops
      2. Introducing UFuncs
      3. Exploring NumPy’s UFuncs
      4. Advanced Ufunc Features
      5. Ufuncs: Learning More
    4. Aggregations: Min, Max, and Everything in Between
      1. Summing the Values in an Array
      2. Minimum and Maximum
      3. Example: What Is the Average Height of US Presidents?
    5. Computation on Arrays: Broadcasting
      1. Introducing Broadcasting
      2. Rules of Broadcasting
      3. Broadcasting in Practice
    6. Comparisons, Masks, and Boolean Logic
      1. Example: Counting Rainy Days
      2. Comparison Operators as ufuncs
      3. Working with Boolean Arrays
      4. Boolean Arrays as Masks
    7. Fancy Indexing
      1. Exploring Fancy Indexing
      2. Combined Indexing
      3. Example: Selecting Random Points
      4. Modifying Values with Fancy Indexing
      5. Example: Binning Data
    8. Sorting Arrays
      1. Fast Sorting in NumPy: np.sort and np.argsort
      2. Partial Sorts: Partitioning
      3. Example: k-Nearest Neighbors
    9. Structured Data: NumPy’s Structured Arrays
      1. Creating Structured Arrays
      2. More Advanced Compound Types
      3. RecordArrays: Structured Arrays with a Twist
      4. On to Pandas
  4. 3. Data Manipulation with Pandas
    1. Installing and Using Pandas
    2. Introducing Pandas Objects
      1. The Pandas Series Object
      2. The Pandas DataFrame Object
      3. The Pandas Index Object
    3. Data Indexing and Selection
      1. Data Selection in Series
      2. Data Selection in DataFrame
    4. Operating on Data in Pandas
      1. Ufuncs: Index Preservation
      2. UFuncs: Index Alignment
      3. Ufuncs: Operations Between DataFrame and Series
    5. Handling Missing Data
      1. Trade-Offs in Missing Data Conventions
      2. Missing Data in Pandas
      3. Operating on Null Values
    6. Hierarchical Indexing
      1. A Multiply Indexed Series
      2. Methods of MultiIndex Creation
      3. Indexing and Slicing a MultiIndex
      4. Rearranging Multi-Indices
      5. Data Aggregations on Multi-Indices
    7. Combining Datasets: Concat and Append
      1. Recall: Concatenation of NumPy Arrays
      2. Simple Concatenation with pd.concat
    8. Combining Datasets: Merge and Join
      1. Relational Algebra
      2. Categories of Joins
      3. Specification of the Merge Key
      4. Specifying Set Arithmetic for Joins
      5. Overlapping Column Names: The suffixes Keyword
      6. Example: US States Data
    9. Aggregation and Grouping
      1. Planets Data
      2. Simple Aggregation in Pandas
      3. GroupBy: Split, Apply, Combine
    10. Pivot Tables
      1. Motivating Pivot Tables
      2. Pivot Tables by Hand
      3. Pivot Table Syntax
      4. Example: Birthrate Data
    11. Vectorized String Operations
      1. Introducing Pandas String Operations
      2. Tables of Pandas String Methods
      3. Example: Recipe Database
    12. Working with Time Series
      1. Dates and Times in Python
      2. Pandas Time Series: Indexing by Time
      3. Pandas Time Series Data Structures
      4. Frequencies and Offsets
      5. Resampling, Shifting, and Windowing
      6. Where to Learn More
      7. Example: Visualizing Seattle Bicycle Counts
    13. High-Performance Pandas: eval() and query()
      1. Motivating query() and eval(): Compound Expressions
      2. pandas.eval() for Efficient Operations
      3. DataFrame.eval() for Column-Wise Operations
      4. DataFrame.query() Method
      5. Performance: When to Use These Functions
    14. Further Resources
  5. 4. Visualization with Matplotlib
    1. General Matplotlib Tips
      1. Importing matplotlib
      2. Setting Styles
      3. show() or No show()? How to Display Your Plots
      4. Saving Figures to File
    2. Two Interfaces for the Price of One
    3. Simple Line Plots
      1. Adjusting the Plot: Line Colors and Styles
      2. Adjusting the Plot: Axes Limits
      3. Labeling Plots
    4. Simple Scatter Plots
      1. Scatter Plots with plt.plot
      2. Scatter Plots with plt.scatter
      3. plot Versus scatter: A Note on Efficiency
    5. Visualizing Errors
      1. Basic Errorbars
      2. Continuous Errors
    6. Density and Contour Plots
      1. Visualizing a Three-Dimensional Function
    7. Histograms, Binnings, and Density
      1. Two-Dimensional Histograms and Binnings
    8. Customizing Plot Legends
      1. Choosing Elements for the Legend
      2. Legend for Size of Points
      3. Multiple Legends
    9. Customizing Colorbars
      1. Customizing Colorbars
      2. Example: Handwritten Digits
    10. Multiple Subplots
      1. plt.axes: Subplots by Hand
      2. plt.subplot: Simple Grids of Subplots
      3. plt.subplots: The Whole Grid in One Go
      4. plt.GridSpec: More Complicated Arrangements
    11. Text and Annotation
      1. Example: Effect of Holidays on US Births
      2. Transforms and Text Position
      3. Arrows and Annotation
    12. Customizing Ticks
      1. Major and Minor Ticks
      2. Hiding Ticks or Labels
      3. Reducing or Increasing the Number of Ticks
      4. Fancy Tick Formats
      5. Summary of Formatters and Locators
    13. Customizing Matplotlib: Configurations and Stylesheets
      1. Plot Customization by Hand
      2. Changing the Defaults: rcParams
      3. Stylesheets
    14. Three-Dimensional Plotting in Matplotlib
      1. Three-Dimensional Points and Lines
      2. Three-Dimensional Contour Plots
      3. Wireframes and Surface Plots
      4. Surface Triangulations
    15. Geographic Data with Basemap
      1. Map Projections
      2. Drawing a Map Background
      3. Plotting Data on Maps
      4. Example: California Cities
      5. Example: Surface Temperature Data
    16. Visualization with Seaborn
      1. Seaborn Versus Matplotlib
      2. Exploring Seaborn Plots
      3. Example: Exploring Marathon Finishing Times
    17. Further Resources
      1. Matplotlib Resources
      2. Other Python Graphics Libraries
  6. 5. Machine Learning
    1. What Is Machine Learning?
      1. Categories of Machine Learning
      2. Qualitative Examples of Machine Learning Applications
      3. Summary
    2. Introducing Scikit-Learn
      1. Data Representation in Scikit-Learn
      2. Scikit-Learn’s Estimator API
      3. Application: Exploring Handwritten Digits
      4. Summary
    3. Hyperparameters and Model Validation
      1. Thinking About Model Validation
      2. Selecting the Best Model
      3. Learning Curves
      4. Validation in Practice: Grid Search
      5. Summary
    4. Feature Engineering
      1. Categorical Features
      2. Text Features
      3. Image Features
      4. Derived Features
      5. Imputation of Missing Data
      6. Feature Pipelines
    5. In Depth: Naive Bayes Classification
      1. Bayesian Classification
      2. Gaussian Naive Bayes
      3. Multinomial Naive Bayes
      4. When to Use Naive Bayes
    6. In Depth: Linear Regression
      1. Simple Linear Regression
      2. Basis Function Regression
      3. Regularization
      4. Example: Predicting Bicycle Traffic
    7. In-Depth: Support Vector Machines
      1. Motivating Support Vector Machines
      2. Support Vector Machines: Maximizing the Margin
      3. Example: Face Recognition
      4. Support Vector Machine Summary
    8. In-Depth: Decision Trees and Random Forests
      1. Motivating Random Forests: Decision Trees
      2. Ensembles of Estimators: Random Forests
      3. Random Forest Regression
      4. Example: Random Forest for Classifying Digits
      5. Summary of Random Forests
    9. In Depth: Principal Component Analysis
      1. Introducing Principal Component Analysis
      2. PCA as Noise Filtering
      3. Example: Eigenfaces
      4. Principal Component Analysis Summary
    10. In-Depth: Manifold Learning
      1. Manifold Learning: “HELLO”
      2. Multidimensional Scaling (MDS)
      3. MDS as Manifold Learning
      4. Nonlinear Embeddings: Where MDS Fails
      5. Nonlinear Manifolds: Locally Linear Embedding
      6. Some Thoughts on Manifold Methods
      7. Example: Isomap on Faces
      8. Example: Visualizing Structure in Digits
    11. In Depth: k-Means Clustering
      1. Introducing k-Means
      2. k-Means Algorithm: Expectation–Maximization
      3. Examples
    12. In Depth: Gaussian Mixture Models
      1. Motivating GMM: Weaknesses of k-Means
      2. Generalizing E–M: Gaussian Mixture Models
      3. GMM as Density Estimation
      4. Example: GMM for Generating New Data
    13. In-Depth: Kernel Density Estimation
      1. Motivating KDE: Histograms
      2. Kernel Density Estimation in Practice
      3. Example: KDE on a Sphere
      4. Example: Not-So-Naive Bayes
    14. Application: A Face Detection Pipeline
      1. HOG Features
      2. HOG in Action: A Simple Face Detector
      3. Caveats and Improvements
    15. Further Machine Learning Resources
      1. Machine Learning in Python
      2. General Machine Learning
  7. Index

Product information

  • Title: Python Data Science Handbook
  • Author(s): Jake VanderPlas
  • Release date: November 2016
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491912058