Book description
Scala will be a valuable tool to have on hand during your data science journey, for everything from data cleaning to cutting-edge machine learning.
About This Book
Build data science and data engineering solutions with ease
An in-depth look at each stage of the data analysis process — from reading and collecting data to distributed analytics
Explore a broad variety of data processing, machine learning, and genetic algorithms through diagrams, mathematical formulations, and source code
Who This Book Is For
This learning path is perfect for those who are comfortable with Scala programming and now want to enter the field of data science. Some knowledge of statistics is expected.
What You Will Learn
Transform and filter tabular data to extract features for machine learning
Read, clean, transform, and write data to both SQL and NoSQL databases
Build Scala web applications that couple with JavaScript libraries such as D3 to create compelling interactive visualizations
Load data from HDFS and Hive with ease
Run streaming and graph analytics in Spark for exploratory analysis
Bundle and scale up Spark jobs by deploying them into a variety of cluster managers
Build dynamic workflows for scientific computing
Leverage open source libraries to extract patterns from time series
Master probabilistic models for sequential data
In Detail
Scala is well suited to analyzing large data sets because its libraries and frameworks are built to scale, so growing the task need not come at a large cost in performance. Scala's powerful functional libraries can interact with databases and build scalable frameworks, resulting in robust data pipelines.
The first module introduces you to Scala libraries to ingest, store, manipulate, process, and visualize data. Using real-world examples, you will learn how to design scalable architecture to process and model data, starting from simple concurrency constructs and progressing to actor systems and Apache Spark. After this, you will also learn how to build interactive visualizations with web frameworks.
Once you have become familiar with all the tasks involved in data science, you will explore data analytics with Scala in the second module. You'll see how Scala can be used to make sense of data through easy-to-follow recipes. You will learn about Bokeh bindings for exploratory data analysis and quintessential machine learning algorithms with the Spark ML library. You'll gain a working understanding of Spark Streaming, machine learning for streaming data, and Spark GraphX.
Armed with a firm understanding of data analysis, you will be ready to explore the most cutting-edge aspect of data science: machine learning. The final module teaches you the A to Z of machine learning with Scala. You'll explore Scala for dependency injection and implicits, which are used to write machine learning algorithms. You'll also explore machine learning topics such as clustering, dimensionality reduction, Naïve Bayes, regression models, SVMs, neural networks, and more.
This learning path combines some of the best that Packt has to offer into one complete, curated package. It includes content from the following Packt products:
Scala for Data Science, Pascal Bugnion
Scala Data Analysis Cookbook, Arun Manivannan
Scala for Machine Learning, Patrick R. Nicolas
Style and approach
A complete package with all the information necessary to start building useful data engineering and data science solutions straight away. It contains a diverse set of recipes that cover the full spectrum of interesting data analysis tasks and will help you revolutionize your data analysis skills using Scala.
Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code files e-mailed directly to you.
Table of contents
- Scala: Guide for Data Science Professionals
- Table of Contents
- Scala: Guide for Data Science Professionals
- Credits
- Preface
- I. Module 1
- 1. Scala and Data Science
- 2. Manipulating Data with Breeze
- Code examples
- Installing Breeze
- Getting help on Breeze
- Basic Breeze data types
- Vectors
- Dense and sparse vectors and the vector trait
- Matrices
- Building vectors and matrices
- Advanced indexing and slicing
- Mutating vectors and matrices
- Matrix multiplication, transposition, and the orientation of vectors
- Data preprocessing and feature engineering
- Breeze – function optimization
- Numerical derivatives
- Regularization
- An example – logistic regression
- Towards re-usable code
- Alternatives to Breeze
- Summary
- References
- 3. Plotting with breeze-viz
- 4. Parallel Collections and Futures
- 5. Scala and SQL through JDBC
- Interacting with JDBC
- First steps with JDBC
- JDBC summary
- Functional wrappers for JDBC
- Safer JDBC connections with the loan pattern
- Enriching JDBC statements with the "pimp my library" pattern
- Wrapping result sets in a stream
- Looser coupling with type classes
- Creating a data access layer
- Summary
- References
- 6. Slick – A Functional Interface for SQL
- 7. Web APIs
- 8. Scala and MongoDB
- 9. Concurrency with Akka
- GitHub follower graph
- Actors as people
- Hello world with Akka
- Case classes as messages
- Actor construction
- Anatomy of an actor
- Follower network crawler
- Fetcher actors
- Routing
- Message passing between actors
- Queue control and the pull pattern
- Accessing the sender of a message
- Stateful actors
- Follower network crawler
- Fault tolerance
- Custom supervisor strategies
- Life-cycle hooks
- What we have not talked about
- Summary
- References
- 10. Distributed Batch Processing with Spark
- 11. Spark SQL and DataFrames
- DataFrames – a whirlwind introduction
- Aggregation operations
- Joining DataFrames together
- Custom functions on DataFrames
- DataFrame immutability and persistence
- SQL statements on DataFrames
- Complex data types – arrays, maps, and structs
- Interacting with data sources
- Standalone programs
- Summary
- References
- 12. Distributed Machine Learning with MLlib
- 13. Web APIs with Play
- Client-server applications
- Introduction to web frameworks
- Model-View-Controller architecture
- Single page applications
- Building an application
- The Play framework
- Dynamic routing
- Actions
- Interacting with JSON
- Querying external APIs and consuming JSON
- Creating APIs with Play: a summary
- REST APIs: best practice
- Summary
- References
- 14. Visualization with D3 and the Play Framework
- A. Pattern Matching and Extractors
- II. Module 2
- 1. Getting Started with Breeze
- Introduction
- Getting Breeze – the linear algebra library
- Working with vectors
- Getting ready
- How to do it...
- Creating vectors
- Constructing a vector from values
- Creating a vector out of a function
- Creating a vector of linearly spaced values
- Creating a vector with values in a specific range
- Creating an entire vector with a single value
- Slicing a sub-vector from a bigger vector
- Creating a Breeze Vector from a Scala Vector
- Vector arithmetic
- Scalar operations
- Calculating the dot product of two vectors
- Creating a new vector by adding two vectors together
- Appending vectors and converting a vector of one type to another
- Concatenating two vectors
- Standard deviation
- Find the largest value in a vector
- Finding the sum, square root and log of all the values in the vector
- Working with matrices
- Vectors and matrices with randomly distributed values
- How it works...
- Creating vectors with uniformly distributed random values
- Creating vectors with normally distributed random values
- Creating vectors with random values that have a Poisson distribution
- Creating a matrix with uniformly random values
- Creating a matrix with normally distributed random values
- Creating a matrix with random values that has a Poisson distribution
- How it works...
- Reading and writing CSV files
- 2. Getting Started with Apache Spark DataFrames
- Introduction
- Getting Apache Spark
- Creating a DataFrame from CSV
- Manipulating DataFrames
- Creating a DataFrame from Scala case classes
- 3. Loading and Preparing Data – DataFrame
- 4. Data Visualization
- 5. Learning from Data
- Introduction
- Supervised and unsupervised learning
- Gradient descent
- Predicting continuous values using linear regression
- Binary classification using LogisticRegression and SVM
- Binary classification using LogisticRegression with Pipeline API
- How to do it...
- Importing and splitting data as test and training sets
- Construct the participants of the Pipeline
- Preparing a pipeline and training a model
- Predicting against test data
- Evaluating a model without cross-validation
- Constructing parameters for cross-validation
- Constructing cross-validator and fit the best model
- Evaluating the model with cross-validation
- How to do it...
- Clustering using K-means
- Feature reduction using principal component analysis
- How to do it...
- Dimensionality reduction of data for supervised learning
- Mean-normalizing the training data
- Extracting the principal components
- Preparing the labeled data
- Preparing the test data
- Classify and evaluate the metrics
- Dimensionality reduction of data for unsupervised learning
- Mean-normalizing the training data
- Extracting the principal components
- Arriving at the number of components
- Evaluating the metrics
- How to do it...
- 6. Scaling Up
- 7. Going Further
- 1. Getting Started with Breeze
- III. Module 3
- 1. Getting Started
- 2. Hello World!
- 3. Data Preprocessing
- 4. Unsupervised Learning
- 5. Naïve Bayes Classifiers
- 6. Regression and Regularization
- 7. Sequential Data Models
- 8. Kernel Models and Support Vector Machines
- 9. Artificial Neural Networks
- Feed-forward neural networks (FFNN)
- The multilayer perceptron (MLP)
- The activation function
- The network architecture
- Software design
- Model definition
- Training cycle/epoch
- Training strategies and classification
- Evaluation
- Benefits and limitations
- Summary
- 10. Genetic Algorithms
- 11. Reinforcement Learning
- 12. Scalable Frameworks
- B. Basic Concepts
- C. Bibliography
- Index
Product information
- Title: Scala: Guide for Data Science Professionals
- Author(s): Pascal Bugnion, Arun Manivannan, Patrick R. Nicolas
- Release date: February 2017
- Publisher(s): Packt Publishing
- ISBN: 9781787282858