Clojure for Data Science

Book description

Statistics, big data, and machine learning for Clojure programmers

About This Book

  • Write code using Clojure to harness the power of your data
  • Discover the libraries and frameworks that will help you succeed
  • A practical guide to understanding how the Clojure programming language can be used to derive insights from data

Who This Book Is For

This book is aimed at developers who are already productive in Clojure but who are overwhelmed by the breadth and depth of understanding required to be effective in the field of data science. Whether you're tasked with delivering a specific analytics project or simply suspect that you could be deriving more value from your data, this book will inspire you with the opportunities, and inform you of the risks, that exist in data of all shapes and sizes.

What You Will Learn

  • Perform hypothesis testing and understand feature selection and statistical significance to interpret your results with confidence
  • Implement the core machine learning techniques of regression, classification, clustering, and recommendation
  • Understand the value of simple statistics and distributions in exploratory data analysis
  • Scale algorithms to web-sized datasets efficiently using distributed programming models on Hadoop and Spark
  • Apply suitable analytic approaches for text, graph, and time series data
  • Interpret the terminology that you will encounter in technical papers
  • Import libraries from other JVM languages such as Java and Scala
  • Communicate your findings clearly and convincingly to nontechnical colleagues

In Detail

The term "data science" has been widely used to define this new profession that is expected to interpret vast datasets and translate them into improved decision-making and performance. Clojure is a powerful language that combines the interactivity of a scripting language with the speed of a compiled language. Together with its rich ecosystem of native libraries and an extremely simple and consistent functional approach to data manipulation, which maps closely to mathematical formulae, it is an ideal, practical, and flexible language to meet a data scientist's diverse needs.

Taking you on a journey from simple summary statistics to sophisticated machine learning algorithms, this book shows how the Clojure programming language can be used to derive insights from data. Data scientists often forge a novel path, and you'll see how to make use of Clojure's Java interoperability capabilities to access libraries such as Mahout and MLlib for which Clojure wrappers don't yet exist. Even seasoned Clojure developers will develop a deeper appreciation for their language's flexibility!

You'll learn how to apply statistical thinking to your own data and use Clojure to explore, analyze, and visualize it in a technically and statistically robust way. You can also use Incanter for local data processing and ClojureScript to present interactive visualizations. You'll understand how distributed platforms such as Hadoop and Spark's MapReduce and GraphX's BSP solve the challenges of data analysis at scale, and how to explain algorithms using those programming models.

Above all, by following the explanations in this book, you'll learn not just how to be effective using the current state-of-the-art methods in data science, but why such methods work so that you can continue to be productive as the field evolves into the future.

Style and approach

This is a practical guide to data science that teaches theory by example through the libraries and frameworks accessible from the Clojure programming language.

Table of contents

  1. Clojure for Data Science
    1. Table of Contents
    2. Clojure for Data Science
    3. Credits
    4. About the Author
    5. Acknowledgments
    6. About the Reviewer
    7. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    9. 1. Statistics
      1. Downloading the sample code
      2. Running the examples
      3. Downloading the data
      4. Inspecting the data
      5. Data scrubbing
      6. Descriptive statistics
        1. The mean
        2. Interpreting mathematical notation
        3. The median
      7. Variance
      8. Quantiles
      9. Binning data
      10. Histograms
      11. The normal distribution
        1. The central limit theorem
      12. Poincaré's baker
        1. Generating distributions
      13. Skewness
        1. Quantile-quantile plots
      14. Comparative visualizations
        1. Box plots
        2. Cumulative distribution functions
      15. The importance of visualizations
        1. Visualizing electorate data
      16. Adding columns
        1. Adding derived columns
      17. Comparative visualizations of electorate data
      18. Visualizing the Russian election data
      19. Comparative visualizations
        1. Probability mass functions
        2. Scatter plots
        3. Scatter transparency
      20. Summary
    10. 2. Inference
      1. Introducing AcmeContent
      2. Download the sample code
      3. Load and inspect the data
      4. Visualizing the dwell times
      5. The exponential distribution
        1. The distribution of daily means
      6. The central limit theorem
      7. Standard error
      8. Samples and populations
      9. Confidence intervals
        1. Sample comparisons
        2. Bias
      10. Visualizing different populations
      11. Hypothesis testing
        1. Significance
      12. Testing a new site design
        1. Performing a z-test
        2. Student's t-distribution
        3. Degrees of freedom
      13. The t-statistic
      14. Performing the t-test
        1. Two-tailed tests
      15. One-sample t-test
      16. Resampling
      17. Testing multiple designs
        1. Calculating sample means
      18. Multiple comparisons
        1. Introducing the simulation
        2. Compile the simulation
      19. The browser simulation
      20. jStat
      21. B1
        1. Scalable Vector Graphics
      22. Plotting probability densities
      23. State and Reagent
        1. Updating state
        2. Binding the interface
      24. Simulating multiple tests
      25. The Bonferroni correction
      26. Analysis of variance
      27. The F-distribution
      28. The F-statistic
      29. The F-test
      30. Effect size
        1. Cohen's d
      31. Summary
    11. 3. Correlation
      1. About the data
      2. Inspecting the data
      3. Visualizing the data
      4. The log-normal distribution
        1. Visualizing correlation
        2. Jittering
      5. Covariance
      6. Pearson's correlation
        1. Sample r and population rho
      7. Hypothesis testing
      8. Confidence intervals
      9. Regression
        1. Linear equations
        2. Residuals
      10. Ordinary least squares
        1. Slope and intercept
        2. Interpretation
        3. Visualization
        4. Assumptions
      11. Goodness-of-fit and R-square
      12. Multiple linear regression
      13. Matrices
        1. Dimensions
        2. Vectors
        3. Construction
        4. Addition and scalar multiplication
        5. Matrix-vector multiplication
        6. Matrix-matrix multiplication
        7. Transposition
        8. The identity matrix
        9. Inversion
      14. The normal equation
        1. More features
      15. Multiple R-squared
      16. Adjusted R-squared
        1. Incanter's linear model
          1. The F-test of model significance
        2. Categorical and dummy variables
        3. Relative power
      17. Collinearity
        1. Multicollinearity
      18. Prediction
        1. The confidence interval of a prediction
        2. Model scope
        3. The final model
      19. Summary
    12. 4. Classification
      1. About the data
      2. Inspecting the data
      3. Comparisons with relative risk and odds
      4. The standard error of a proportion
        1. Estimation using bootstrapping
      5. The binomial distribution
        1. The standard error of a proportion formula
      6. Significance testing proportions
        1. Adjusting standard errors for large samples
      7. Chi-squared multiple significance testing
        1. Visualizing the categories
        2. The chi-squared test
        3. The chi-squared statistic
        4. The chi-squared test
      8. Classification with logistic regression
        1. The sigmoid function
        2. The logistic regression cost function
        3. Parameter optimization with gradient descent
        4. Gradient descent with Incanter
        5. Convexity
      9. Implementing logistic regression with Incanter
        1. Creating a feature matrix
        2. Evaluating the logistic regression classifier
        3. The confusion matrix
        4. The kappa statistic
      10. Probability
        1. Bayes theorem
        2. Bayes theorem with multiple predictors
      11. Naive Bayes classification
        1. Implementing a naive Bayes classifier
        2. Evaluating the naive Bayes classifier
          1. Comparing the logistic regression and naive Bayes approaches
      12. Decision trees
        1. Information
        2. Entropy
        3. Information gain
        4. Using information gain to identify the best predictor
        5. Recursively building a decision tree
        6. Using the decision tree for classification
        7. Evaluating the decision tree classifier
      13. Classification with clj-ml
        1. Loading data with clj-ml
        2. Building a decision tree in clj-ml
      14. Bias and variance
        1. Overfitting
        2. Cross-validation
        3. Addressing high bias
      15. Ensemble learning and random forests
        1. Bagging and boosting
      16. Saving the classifier to a file
      17. Summary
    13. 5. Big Data
      1. Downloading the code and data
        1. Inspecting the data
        2. Counting the records
      2. The reducers library
        1. Parallel folds with reducers
        2. Loading large files with iota
        3. Creating a reducers processing pipeline
        4. Curried reductions with reducers
        5. Statistical folds with reducers
        6. Associativity
        7. Calculating the mean using fold
        8. Calculating the variance using fold
      3. Mathematical folds with Tesser
        1. Calculating covariance with Tesser
        2. Commutativity
        3. Simple linear regression with Tesser
        4. Calculating a correlation matrix
      4. Multiple regression with gradient descent
        1. The gradient descent update rule
        2. The gradient descent learning rate
        3. Feature scaling
        4. Feature extraction
        5. Creating a custom Tesser fold
          1. Creating a matrix-sum fold
        6. Calculating the total model error
          1. Creating a matrix-mean fold
        7. Applying a single step of gradient descent
        8. Running iterative gradient descent
      5. Scaling gradient descent with Hadoop
        1. Gradient descent on Hadoop with Tesser and Parkour
          1. Parkour distributed sources and sinks
          2. Running a feature scale fold with Hadoop
          3. Running gradient descent with Hadoop
          4. Preparing our code for a Hadoop cluster
          5. Building an uberjar
          6. Submitting the uberjar to Hadoop
      6. Stochastic gradient descent
        1. Stochastic gradient descent with Parkour
          1. Defining a mapper
          2. Parkour shaping functions
          3. Defining a reducer
          4. Specifying Hadoop jobs with Parkour graph
          5. Chaining mappers and reducers with Parkour graph
      7. Summary
    14. 6. Clustering
      1. Downloading the data
      2. Extracting the data
      3. Inspecting the data
      4. Clustering text
        1. Set-of-words and the Jaccard index
        2. Tokenizing the Reuters files
          1. Applying the Jaccard index to documents
          2. The bag-of-words and Euclidean distance
        3. Representing text as vectors
        4. Creating a dictionary
      5. Creating term frequency vectors
        1. The vector space model and cosine distance
        2. Removing stop words
        3. Stemming
      6. Clustering with k-means and Incanter
        1. Clustering the Reuters documents
      7. Better clustering with TF-IDF
        1. Zipf's law
        2. Calculating the TF-IDF weight
        3. k-means clustering with TF-IDF
        4. Better clustering with n-grams
      8. Large-scale clustering with Mahout
        1. Converting text documents to a sequence file
        2. Using Parkour to create Mahout vectors
        3. Creating distributed unique IDs
        4. Distributed unique IDs with Hadoop
        5. Sharing data with the distributed cache
        6. Building Mahout vectors from input documents
      9. Running k-means clustering with Mahout
        1. Viewing k-means clustering results
        2. Interpreting the clustered output
      10. Cluster evaluation measures
        1. Inter-cluster density
        2. Intra-cluster density
        3. Calculating the root mean square error with Parkour
          1. Loading clustered points and centroids
        4. Calculating the cluster RMSE
        5. Determining optimal k with the elbow method
        6. Determining optimal k with the Dunn index
        7. Determining optimal k with the Davies-Bouldin index
      11. The drawbacks of k-means
        1. The Mahalanobis distance measure
      12. The curse of dimensionality
      13. Summary
    15. 7. Recommender Systems
      1. Download the code and data
      2. Inspect the data
      3. Parse the data
      4. Types of recommender systems
        1. Collaborative filtering
      5. Item-based and user-based recommenders
      6. Slope One recommenders
        1. Calculating the item differences
        2. Making recommendations
        3. Practical considerations for user and item recommenders
      7. Building a user-based recommender with Mahout
      8. k-nearest neighbors
      9. Recommender evaluation with Mahout
        1. Evaluating distance measures
          1. The Pearson correlation similarity
          2. Spearman's rank similarity
        2. Determining optimum neighborhood size
        3. Information retrieval statistics
          1. Precision
          2. Recall
        4. Mahout's information retrieval evaluator
          1. F-measure and the harmonic mean
          2. Fall-out
          3. Normalized discounted cumulative gain
          4. Plotting the information retrieval results
        5. Recommendation with Boolean preferences
          1. Implicit versus explicit feedback
      10. Probabilistic methods for large sets
        1. Testing set membership with Bloom filters
      11. Jaccard similarity for large sets with MinHash
        1. Reducing pair comparisons with locality-sensitive hashing
          1. Bucketing signatures
      12. Dimensionality reduction
        1. Plotting the Iris dataset
        2. Principal component analysis
        3. Singular value decomposition
      13. Large-scale machine learning with Apache Spark and MLlib
        1. Loading data with Sparkling
        2. Mapping data
        3. Distributed datasets and tuples
        4. Filtering data
        5. Persistence and caching
      14. Machine learning on Spark with MLlib
        1. Movie recommendations with alternating least squares
        2. ALS with Spark and MLlib
        3. Making predictions with ALS
        4. Evaluating ALS
        5. Calculating the sum of squared errors
      15. Summary
    16. 8. Network Analysis
      1. Download the data
        1. Inspecting the data
        2. Visualizing graphs with Loom
      2. Graph traversal with Loom
        1. The seven bridges of Königsberg
      3. Breadth-first and depth-first search
      4. Finding the shortest path
        1. Minimum spanning trees
        2. Subgraphs and connected components
        3. SCC and the bow-tie structure of the web
      5. Whole-graph analysis
      6. Scale-free networks
      7. Distributed graph computation with GraphX
        1. Creating RDGs with Glittering
        2. Measuring graph density with triangle counting
          1. GraphX partitioning strategies
        3. Running the built-in triangle counting algorithm
        4. Implement triangle counting with Glittering
          1. Step one – collecting neighbor IDs
          2. Steps two, three, and four – aggregate messages
          3. Step five – dividing the counts
        5. Running the custom triangle counting algorithm
        6. The Pregel API
        7. Connected components with the Pregel API
          1. Step one – map vertices
          2. Steps two and three – the message function
          3. Step four – update the attributes
          4. Step five – iterate to convergence
        8. Running connected components
        9. Calculating the size of the largest connected component
        10. Detecting communities with label propagation
          1. Step one – map vertices
          2. Step two – send the vertex attribute
          3. Step three – aggregate value
          4. Step four – vertex function
          5. Step five – set the maximum iterations count
        11. Running label propagation
        12. Measuring community influence using PageRank
        13. The flow formulation
          1. Implementing PageRank with Glittering
          2. Sort by highest influence
        14. Running PageRank to determine community influencers
      8. Summary
    17. 9. Time Series
      1. About the data
        1. Loading the Longley data
      2. Fitting curves with a linear model
      3. Time series decomposition
        1. Inspecting the airline data
          1. Visualizing the airline data
        2. Stationarity
        3. De-trending and differencing
      4. Discrete time models
        1. Random walks
        2. Autoregressive models
        3. Determining autocorrelation in AR models
        4. Moving-average models
        5. Determining autocorrelation in MA models
        6. Combining the AR and MA models
        7. Calculating partial autocorrelation
          1. Autocovariance
          2. PACF with Durbin-Levinson recursion
          3. Plotting partial autocorrelation
          4. Determining ARMA model order with ACF and PACF
        8. ACF and PACF of airline data
        9. Removing seasonality with differencing
      5. Maximum likelihood estimation
        1. Calculating the likelihood
        2. Estimating the maximum likelihood
          1. Nelder-Mead optimization with Apache Commons Math
        3. Identifying better models with Akaike Information Criterion
      6. Time series forecasting
        1. Forecasting with Monte Carlo simulation
      7. Summary
    18. 10. Visualization
      1. Download the code and data
      2. Exploratory data visualization
        1. Representing a two-dimensional histogram
      3. Using Quil for visualization
        1. Drawing to the sketch window
        2. Quil's coordinate system
        3. Plotting the grid
        4. Specifying the fill color
        5. Color and fill
        6. Outputting an image file
      4. Visualization for communication
        1. Visualizing wealth distribution
        2. Bringing data to life with Quil
        3. Drawing bars of differing widths
        4. Adding a title and axis labels
        5. Improving the clarity with illustrations
        6. Adding text to the bars
        7. Incorporating additional data
        8. Drawing complex shapes
        9. Drawing curves
        10. Plotting compound charts
        11. Output to PDF
      5. Summary
    19. Index

Product information

  • Title: Clojure for Data Science
  • Author(s): Henry Garner
  • Release date: September 2015
  • Publisher(s): Packt Publishing
  • ISBN: 9781784397180