Scala: Guide for Data Science Professionals

Book description

Scala will be a valuable tool to have on hand during your data science journey for everything from data cleaning to cutting-edge machine learning.

About This Book

  • Build data science and data engineering solutions with ease

  • An in-depth look at each stage of the data analysis process — from reading and collecting data to distributed analytics

  • Explore a broad variety of data processing, machine learning, and genetic algorithms through diagrams, mathematical formulations, and source code

Who This Book Is For

This learning path is perfect for those who are comfortable with Scala programming and now want to enter the field of data science. Some knowledge of statistics is expected.

What You Will Learn

  • Transfer and filter tabular data to extract features for machine learning

  • Read, clean, transform, and write data to both SQL and NoSQL databases

  • Create Scala web applications that couple with JavaScript libraries such as D3 to create compelling interactive visualizations

  • Load data from HDFS and Hive with ease

  • Run streaming and graph analytics in Spark for exploratory analysis

  • Bundle and scale up Spark jobs by deploying them into a variety of cluster managers

  • Build dynamic workflows for scientific computing

  • Leverage open source libraries to extract patterns from time series

  • Master probabilistic models for sequential data

In Detail

Scala is especially good for analyzing large data sets, as the scale of the task has little impact on performance. Scala’s powerful functional libraries can interact with databases and build scalable frameworks, resulting in robust data pipelines.

The first module introduces you to Scala libraries to ingest, store, manipulate, process, and visualize data. Using real-world examples, you will learn how to design scalable architectures to process and model data, starting from simple concurrency constructs and progressing to actor systems and Apache Spark. After this, you will also learn how to build interactive visualizations with web frameworks.

Once you have become familiar with all the tasks involved in data science, you will explore data analytics with Scala in the second module. You’ll see how Scala can be used to make sense of data through easy-to-follow recipes. You will learn about Bokeh bindings for exploratory data analysis and quintessential machine learning algorithms with the Spark ML library. You’ll get a sufficient understanding of Spark Streaming, machine learning for streaming data, and Spark GraphX.

Armed with a firm understanding of data analysis, you will be ready to explore the most cutting-edge aspect of data science: machine learning. The final module teaches you the A to Z of machine learning with Scala. You’ll explore Scala for dependency injection and implicits, which are used to write machine learning algorithms. You’ll also explore machine learning topics such as clustering, dimensionality reduction, Naïve Bayes, regression models, SVMs, neural networks, and more.

This learning path combines some of the best that Packt has to offer into one complete, curated package. It includes content from the following Packt products:

  • Scala for Data Science, Pascal Bugnion

  • Scala Data Analysis Cookbook, Arun Manivannan

  • Scala for Machine Learning, Patrick R. Nicolas

Style and approach

This is a complete package with all the information necessary to start building useful data engineering and data science solutions straight away. It contains a diverse set of recipes that cover the full spectrum of interesting data analysis tasks and will help you revolutionize your data analysis skills using Scala.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of contents

    1. Scala: Guide for Data Science Professionals
      1. Table of Contents
      2. Scala: Guide for Data Science Professionals
      3. Scala: Guide for Data Science Professionals
      4. Credits
      5. Preface
        1. What this learning path covers
        2. What you need for this learning path
        3. Who this learning path is for
        4. Reader feedback
        5. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      6. I. Module 1
        1. 1. Scala and Data Science
          1. Data science
          2. Programming in data science
          3. Why Scala?
            1. Static typing and type inference
            2. Scala encourages immutability
            3. Scala and functional programs
            4. Null pointer uncertainty
            5. Easier parallelism
            6. Interoperability with Java
          4. When not to use Scala
          5. Summary
          6. References
        2. 2. Manipulating Data with Breeze
          1. Code examples
          2. Installing Breeze
          3. Getting help on Breeze
          4. Basic Breeze data types
            1. Vectors
            2. Dense and sparse vectors and the vector trait
            3. Matrices
            4. Building vectors and matrices
            5. Advanced indexing and slicing
            6. Mutating vectors and matrices
            7. Matrix multiplication, transposition, and the orientation of vectors
            8. Data preprocessing and feature engineering
            9. Breeze – function optimization
            10. Numerical derivatives
            11. Regularization
          5. An example – logistic regression
          6. Towards re-usable code
          7. Alternatives to Breeze
          8. Summary
          9. References
        3. 3. Plotting with breeze-viz
          1. Diving into Breeze
          2. Customizing plots
          3. Customizing the line type
          4. More advanced scatter plots
          5. Multi-plot example – scatterplot matrix plots
          6. Managing without documentation
          7. Breeze-viz reference
          8. Data visualization beyond breeze-viz
          9. Summary
        4. 4. Parallel Collections and Futures
          1. Parallel collections
            1. Limitations of parallel collections
            2. Error handling
            3. Setting the parallelism level
            4. An example – cross-validation with parallel collections
          2. Futures
            1. Future composition – using a future's result
            2. Blocking until completion
            3. Controlling parallel execution with execution contexts
            4. Futures example – stock price fetcher
          3. Summary
          4. References
        5. 5. Scala and SQL through JDBC
          1. Interacting with JDBC
          2. First steps with JDBC
            1. Connecting to a database server
            2. Creating tables
            3. Inserting data
            4. Reading data
          3. JDBC summary
          4. Functional wrappers for JDBC
          5. Safer JDBC connections with the loan pattern
          6. Enriching JDBC statements with the "pimp my library" pattern
          7. Wrapping result sets in a stream
          8. Looser coupling with type classes
            1. Type classes
            2. Coding against type classes
            3. When to use type classes
            4. Benefits of type classes
          9. Creating a data access layer
          10. Summary
          11. References
        6. 6. Slick – A Functional Interface for SQL
          1. FEC data
            1. Importing Slick
            2. Defining the schema
            3. Connecting to the database
            4. Creating tables
            5. Inserting data
            6. Querying data
          2. Invokers
          3. Operations on columns
          4. Aggregations with "Group by"
          5. Accessing database metadata
          6. Slick versus JDBC
          7. Summary
          8. References
        7. 7. Web APIs
          1. A whirlwind tour of JSON
          2. Querying web APIs
          3. JSON in Scala – an exercise in pattern matching
            1. JSON4S types
            2. Extracting fields using XPath
          4. Extraction using case classes
          5. Concurrency and exception handling with futures
          6. Authentication – adding HTTP headers
            1. HTTP – a whirlwind overview
            2. Adding headers to HTTP requests in Scala
          7. Summary
          8. References
        8. 8. Scala and MongoDB
          1. MongoDB
          2. Connecting to MongoDB with Casbah
            1. Connecting with authentication
          3. Inserting documents
          4. Extracting objects from the database
          5. Complex queries
          6. Casbah query DSL
          7. Custom type serialization
          8. Beyond Casbah
          9. Summary
          10. References
        9. 9. Concurrency with Akka
          1. GitHub follower graph
          2. Actors as people
          3. Hello world with Akka
          4. Case classes as messages
          5. Actor construction
          6. Anatomy of an actor
          7. Follower network crawler
          8. Fetcher actors
          9. Routing
          10. Message passing between actors
          11. Queue control and the pull pattern
          12. Accessing the sender of a message
          13. Stateful actors
          14. Follower network crawler
          15. Fault tolerance
          16. Custom supervisor strategies
          17. Life-cycle hooks
          18. What we have not talked about
          19. Summary
          20. References
        10. 10. Distributed Batch Processing with Spark
          1. Installing Spark
          2. Acquiring the example data
          3. Resilient distributed datasets
            1. RDDs are immutable
            2. RDDs are lazy
            3. RDDs know their lineage
            4. RDDs are resilient
            5. RDDs are distributed
            6. Transformations and actions on RDDs
            7. Persisting RDDs
            8. Key-value RDDs
            9. Double RDDs
          4. Building and running standalone programs
            1. Running Spark applications locally
            2. Reducing logging output and Spark configuration
            3. Running Spark applications on EC2
          5. Spam filtering
          6. Lifting the hood
          7. Data shuffling and partitions
          8. Summary
          9. Reference
        11. 11. Spark SQL and DataFrames
          1. DataFrames – a whirlwind introduction
          2. Aggregation operations
          3. Joining DataFrames together
          4. Custom functions on DataFrames
          5. DataFrame immutability and persistence
          6. SQL statements on DataFrames
          7. Complex data types – arrays, maps, and structs
            1. Structs
            2. Arrays
            3. Maps
          8. Interacting with data sources
            1. JSON files
            2. Parquet files
          9. Standalone programs
          10. Summary
          11. References
        12. 12. Distributed Machine Learning with MLlib
          1. Introducing MLlib – Spam classification
          2. Pipeline components
            1. Transformers
            2. Estimators
          3. Evaluation
          4. Regularization in logistic regression
          5. Cross-validation and model selection
          6. Beyond logistic regression
          7. Summary
          8. References
        13. 13. Web APIs with Play
          1. Client-server applications
          2. Introduction to web frameworks
          3. Model-View-Controller architecture
          4. Single page applications
          5. Building an application
          6. The Play framework
          7. Dynamic routing
          8. Actions
            1. Composing the response
            2. Understanding and parsing the request
          9. Interacting with JSON
          10. Querying external APIs and consuming JSON
            1. Calling external web services
            2. Parsing JSON
            3. Asynchronous actions
          11. Creating APIs with Play: a summary
          12. REST APIs: best practice
          13. Summary
          14. References
        14. 14. Visualization with D3 and the Play Framework
          1. GitHub user data
          2. Do I need a backend?
          3. JavaScript dependencies through web-jars
          4. Towards a web application: HTML templates
          5. Modular JavaScript through RequireJS
          6. Bootstrapping the applications
          7. Client-side program architecture
            1. Designing the model
            2. The event bus
            3. AJAX calls through jQuery
            4. Response views
          8. Drawing plots with NVD3
          9. Summary
          10. References
        15. A. Pattern Matching and Extractors
          1. Pattern matching in for comprehensions
          2. Pattern matching internals
          3. Extracting sequences
          4. Summary
          5. Reference
      7. II. Module 2
        1. 1. Getting Started with Breeze
          1. Introduction
          2. Getting Breeze – the linear algebra library
            1. How to do it...
            2. There's more...
              1. The org.scalanlp.breeze dependency
              2. The org.scalanlp.breeze-natives package
          3. Working with vectors
            1. Getting ready
            2. How to do it...
              1. Creating vectors
              2. Constructing a vector from values
                1. Creating a zero vector
              3. Creating a vector out of a function
              4. Creating a vector of linearly spaced values
              5. Creating a vector with values in a specific range
              6. Creating an entire vector with a single value
              7. Slicing a sub-vector from a bigger vector
              8. Creating a Breeze Vector from a Scala Vector
              9. Vector arithmetic
              10. Scalar operations
              11. Calculating the dot product of two vectors
              12. Creating a new vector by adding two vectors together
              13. Appending vectors and converting a vector of one type to another
              14. Concatenating two vectors
                1. Converting a vector of Int to a vector of Double
                2. Computing basic statistics
                3. Mean and variance
              15. Standard deviation
              16. Find the largest value in a vector
              17. Finding the sum, square root and log of all the values in the vector
                1. The Sqrt function
                2. The Log function
          4. Working with matrices
            1. How to do it...
              1. Creating matrices
                1. Creating a matrix from values
                2. Creating a zero matrix
                3. Creating a matrix out of a function
                4. Creating an identity matrix
                5. Creating a matrix from random numbers
                6. Creating from a Scala collection
              2. Matrix arithmetic
                1. Addition
                2. Multiplication
              3. Appending and conversion
                1. Concatenating matrices – vertically
                2. Concatenating matrices – horizontally
                3. Converting a matrix of Int to a matrix of Double
              4. Data manipulation operations
                1. Getting column vectors out of the matrix
                2. Getting row vectors out of the matrix
                3. Getting values inside the matrix
                4. Getting the inverse and transpose of a matrix
              5. Computing basic statistics
                1. Mean and variance
                2. Standard deviation
                3. Finding the largest value in a matrix
                4. Finding the sum, square root and log of all the values in the matrix
                5. Sqrt
                6. Log
                7. Calculating the eigenvectors and eigenvalues of a matrix
            2. How it works...
          5. Vectors and matrices with randomly distributed values
            1. How it works...
              1. Creating vectors with uniformly distributed random values
              2. Creating vectors with normally distributed random values
              3. Creating vectors with random values that have a Poisson distribution
              4. Creating a matrix with uniformly random values
              5. Creating a matrix with normally distributed random values
              6. Creating a matrix with random values that has a Poisson distribution
          6. Reading and writing CSV files
            1. How it works...
        2. 2. Getting Started with Apache Spark DataFrames
          1. Introduction
          2. Getting Apache Spark
            1. How to do it...
          3. Creating a DataFrame from CSV
            1. How to do it...
            2. How it works...
            3. There's more…
          4. Manipulating DataFrames
            1. How to do it...
              1. Printing the schema of the DataFrame
              2. Sampling the data in the DataFrame
              3. Selecting DataFrame columns
              4. Filtering data by condition
              5. Sorting data in the frame
              6. Renaming columns
              7. Treating the DataFrame as a relational table
              8. Joining two DataFrames
                1. Inner join
                2. Right outer join
                3. Left outer join
              9. Saving the DataFrame as a file
          5. Creating a DataFrame from Scala case classes
            1. How to do it...
            2. How it works...
        3. 3. Loading and Preparing Data – DataFrame
          1. Introduction
          2. Loading more than 22 features into classes
            1. How to do it...
            2. How it works...
            3. There's more…
          3. Loading JSON into DataFrames
            1. How to do it…
              1. Reading a JSON file using SQLContext.jsonFile
              2. Reading a text file and converting it to JSON RDD
              3. Explicitly specifying your schema
            2. There's more…
          4. Storing data as Parquet files
            1. How to do it…
              1. Load a simple CSV file, convert it to case classes, and create a DataFrame from it
              2. Save it as a Parquet file
              3. Install Parquet tools
              4. Using the tools to inspect the Parquet file
              5. Enable compression for the Parquet file
          5. Using the Avro data model in Parquet
            1. How to do it…
              1. Creation of the Avro model
              2. Generation of Avro objects using the sbt-avro plugin
              3. Constructing an RDD of our generated object from Students.csv
              4. Saving RDD[StudentAvro] in a Parquet file
              5. Reading the file back for verification
              6. Using Parquet tools for verification
          6. Loading from RDBMS
            1. How to do it…
          7. Preparing data in DataFrames
            1. How to do it...
        4. 4. Data Visualization
          1. Introduction
          2. Visualizing using Zeppelin
            1. How to do it...
              1. Installing Zeppelin
              2. Customizing Zeppelin's server and websocket port
              3. Visualizing data on HDFS – parameterizing inputs
              4. Running custom functions
              5. Adding external dependencies to Zeppelin
              6. Pointing to an external Spark cluster
          3. Creating scatter plots with Bokeh-Scala
            1. How to do it...
              1. Preparing our data
              2. Creating Plot and Document objects
              3. Creating a marker object
              4. Setting the X and Y axes' data range for the plot
              5. Drawing the x and the y axes
              6. Viewing flower species with varying colors
              7. Adding grid lines
              8. Adding a legend to the plot
          4. Creating a time series MultiPlot with Bokeh-Scala
            1. How to do it...
              1. Preparing our data
              2. Creating a plot
              3. Creating a line that joins all the data points
              4. Setting the x and y axes' data range for the plot
              5. Drawing the axes and the grids
              6. Adding tools
              7. Adding a legend to the plot
              8. Multiple plots in the document
        5. 5. Learning from Data
          1. Introduction
          2. Supervised and unsupervised learning
          3. Gradient descent
          4. Predicting continuous values using linear regression
            1. How to do it...
              1. Importing the data
              2. Converting each instance into a LabeledPoint
              3. Preparing the training and test data
              4. Scaling the features
              5. Training the model
              6. Predicting against test data
              7. Evaluating the model
              8. Regularizing the parameters
              9. Mini batching
          5. Binary classification using LogisticRegression and SVM
            1. How to do it...
              1. Importing the data
              2. Tokenizing the data and converting it into LabeledPoints
              3. Factoring the inverse document frequency
              4. Prepare the training and test data
              5. Constructing the algorithm
              6. Training the model and predicting the test data
              7. Evaluating the model
          6. Binary classification using LogisticRegression with Pipeline API
            1. How to do it...
              1. Importing and splitting data as test and training sets
              2. Construct the participants of the Pipeline
              3. Preparing a pipeline and training a model
              4. Predicting against test data
              5. Evaluating a model without cross-validation
              6. Constructing parameters for cross-validation
              7. Constructing cross-validator and fit the best model
              8. Evaluating the model with cross-validation
          7. Clustering using K-means
            1. How to do it...
              1. KMeans.RANDOM
              2. KMeans.PARALLEL
                1. K-means++
                2. K-means||
              3. Max iterations
              4. Epsilon
              5. Importing the data and converting it into a vector
              6. Feature scaling the data
              7. Deriving the number of clusters
              8. Constructing the model
              9. Evaluating the model
          8. Feature reduction using principal component analysis
            1. How to do it...
              1. Dimensionality reduction of data for supervised learning
              2. Mean-normalizing the training data
              3. Extracting the principal components
              4. Preparing the labeled data
              5. Preparing the test data
              6. Classify and evaluate the metrics
              7. Dimensionality reduction of data for unsupervised learning
              8. Mean-normalizing the training data
              9. Extracting the principal components
              10. Arriving at the number of components
              11. Evaluating the metrics
        6. 6. Scaling Up
          1. Introduction
          2. Building the Uber JAR
            1. How to do it...
              1. Transitive dependency stated explicitly in the SBT dependency
                1. Two different libraries depend on the same external library
          3. Submitting jobs to the Spark cluster (local)
            1. How to do it...
              1. Downloading Spark
              2. Running HDFS on Pseudo-clustered mode
              3. Running the Spark master and slave locally
              4. Pushing data into HDFS
              5. Submitting the Spark application on the cluster
          4. Running the Spark Standalone cluster on EC2
            1. How to do it...
              1. Creating the AccessKey and pem file
              2. Setting the environment variables
              3. Running the launch script
              4. Verifying installation
              5. Making changes to the code
              6. Transferring the data and job files
              7. Loading the dataset into HDFS
              8. Running the job
              9. Destroying the cluster
          5. Running the Spark Job on Mesos (local)
            1. How to do it...
              1. Installing Mesos
              2. Starting the Mesos master and slave
              3. Uploading the Spark binary package and the dataset to HDFS
              4. Running the job
          6. Running the Spark Job on YARN (local)
            1. How to do it...
              1. Installing the Hadoop cluster
              2. Starting HDFS and YARN
              3. Pushing Spark assembly and dataset to HDFS
              4. Running a Spark job in yarn-client mode
              5. Running Spark job in yarn-cluster mode
        7. 7. Going Further
          1. Introduction
          2. Using Spark Streaming to subscribe to a Twitter stream
            1. How to do it...
          3. Using Spark as an ETL tool
            1. How to do it...
          4. Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream
            1. How to do it...
          5. Using GraphX to analyze Twitter data
            1. How to do it...
      8. III. Module 3
        1. 1. Getting Started
          1. Mathematical notation for the curious
          2. Why machine learning?
            1. Classification
            2. Prediction
            3. Optimization
            4. Regression
          3. Why Scala?
            1. Abstraction
            2. Scalability
            3. Configurability
            4. Maintainability
            5. Computation on demand
          4. Model categorization
          5. Taxonomy of machine learning algorithms
            1. Unsupervised learning
              1. Clustering
              2. Dimension reduction
            2. Supervised learning
              1. Generative models
              2. Discriminative models
            3. Reinforcement learning
          6. Tools and frameworks
            1. Java
            2. Scala
            3. Apache Commons Math
              1. Description
              2. Licensing
              3. Installation
            4. JFreeChart
              1. Description
              2. Licensing
              3. Installation
            5. Other libraries and frameworks
          7. Source code
            1. Context versus view bounds
            2. Presentation
            3. Primitives and implicits
              1. Primitive types
              2. Type conversions
              3. Operators
            4. Immutability
            5. Performance of Scala iterators
          8. Let's kick the tires
            1. Overview of computational workflows
            2. Writing a simple workflow
              1. Selecting a dataset
              2. Loading the dataset
              3. Preprocessing the dataset
                1. Basic statistics
                2. Normalization and Gauss distribution
                3. Plotting data
              4. Creating a model (learning)
              5. Classify the data
          9. Summary
        2. 2. Hello World!
          1. Modeling
            1. A model by any other name
            2. Model versus design
            3. Selecting a model's features
            4. Extracting features
          2. Designing a workflow
            1. The computational framework
            2. The pipe operator
            3. Monadic data transformation
            4. Dependency injection
            5. Workflow modules
            6. The workflow factory
            7. Examples of workflow components
              1. The preprocessing module
              2. The clustering module
          3. Assessing a model
            1. Validation
              1. Key metrics
              2. Implementation
            2. K-fold cross-validation
            3. Bias-variance decomposition
            4. Overfitting
          4. Summary
        3. 3. Data Preprocessing
          1. Time series
          2. Moving averages
            1. The simple moving average
            2. The weighted moving average
            3. The exponential moving average
          3. Fourier analysis
            1. Discrete Fourier transform (DFT)
            2. DFT-based filtering
            3. Detection of market cycles
          4. The Kalman filter
            1. The state space estimation
              1. The transition equation
              2. The measurement equation
            2. The recursive algorithm
              1. Prediction
              2. Correction
              3. Kalman smoothing
              4. Experimentation
          5. Alternative preprocessing techniques
          6. Summary
        4. 4. Unsupervised Learning
          1. Clustering
            1. K-means clustering
              1. Measuring similarity
              2. Overview of the K-means algorithm
              3. Step 1 – cluster configuration
                1. Defining clusters
                2. Defining K-means
                3. Initializing clusters
              4. Step 2 – cluster assignment
              5. Step 3 – iterative reconstruction
              6. Curse of dimensionality
              7. Experiment
              8. Tuning the number of clusters
              9. Validation
            2. Expectation-maximization (EM) algorithm
              1. Gaussian mixture model
              2. EM overview
              3. Implementation
              4. Testing
              5. Online EM
          2. Dimension reduction
            1. Principal components analysis (PCA)
              1. Algorithm
              2. Implementation
              3. Test case
              4. Evaluation
            2. Other dimension reduction techniques
          3. Performance considerations
            1. K-means
            2. EM
            3. PCA
          4. Summary
        5. 5. Naïve Bayes Classifiers
          1. Probabilistic graphical models
          2. Naïve Bayes classifiers
            1. Introducing the multinomial Naïve Bayes
              1. Formalism
              2. The frequentist perspective
              3. The predictive model
              4. The zero-frequency problem
            2. Implementation
              1. Software design
              2. Training
              3. Classification
              4. Labeling
              5. Results
          3. Multivariate Bernoulli classification
            1. Model
            2. Implementation
          4. Naïve Bayes and text mining
            1. Basics of information retrieval
            2. Implementation
              1. Extraction of terms
              2. Scoring of terms
            3. Testing
              1. Retrieving textual information
              2. Evaluation
          5. Pros and cons
          6. Summary
        6. 6. Regression and Regularization
          1. Linear regression
            1. One-variate linear regression
              1. Implementation
              2. Test case
            2. Ordinary least squares (OLS) regression
              1. Design
              2. Implementation
              3. Test case 1 – trending
              4. Test case 2 – features selection
          2. Regularization
            1. Ln roughness penalty
            2. The ridge regression
              1. Implementation
              2. The test case
          3. Numerical optimization
          4. The logistic regression
            1. The logit function
            2. Binomial classification
            3. Software design
            4. The training workflow
              1. Configuring the least squares optimizer
              2. Computing the Jacobian matrix
              3. Defining the exit conditions
              4. Defining the least squares problem
              5. Minimizing the loss function
              6. Test
            5. Classification
          5. Summary
        7. 7. Sequential Data Models
          1. Markov decision processes
            1. The Markov property
            2. The first-order discrete Markov chain
          2. The hidden Markov model (HMM)
            1. Notation
            2. The lambda model
            3. HMM execution state
            4. Evaluation (CF-1)
              1. Alpha class (the forward variable)
              2. Beta class (the backward variable)
            5. Training (CF-2)
              1. Baum-Welch estimator (EM)
            6. Decoding (CF-3)
              1. The Viterbi algorithm
            7. Putting it all together
            8. Test case
            9. The hidden Markov model for time series analysis
          3. Conditional random fields
            1. Introduction to CRF
            2. Linear chain CRF
          4. CRF and text analytics
            1. The feature functions model
            2. Software design
            3. Implementation
              1. Building the training set
              2. Generating tags
              3. Extracting data sequences
              4. CRF control parameters
              5. Putting it all together
            4. Tests
              1. The training convergence profile
              2. Impact of the size of the training set
              3. Impact of the L2 regularization factor
          5. Comparing CRF and HMM
          6. Performance considerations
          7. Summary
        8. 8. Kernel Models and Support Vector Machines
          1. Kernel functions
            1. Overview
            2. Common discriminative kernels
          2. The support vector machine (SVM)
            1. The linear SVM
              1. The separable case (hard margin)
              2. The nonseparable case (soft margin)
            2. The nonlinear SVM
              1. Max-margin classification
              2. The kernel trick
          3. Support vector classifier (SVC)
            1. The binary SVC
              1. LIBSVM
              2. Software design
              3. Configuration parameters
                1. SVM formulation
                2. The SVM kernel function
                3. SVM execution
              4. SVM implementation
              5. C-penalty and margin
              6. Kernel evaluation
              7. Application to risk analysis
                1. Features and labels
          4. Anomaly detection with one-class SVC
          5. Support vector regression (SVR)
            1. Overview
            2. SVR versus linear regression
          6. Performance considerations
          7. Summary
        9. 9. Artificial Neural Networks
          1. Feed-forward neural networks (FFNN)
            1. The biological background
            2. The mathematical background
          2. The multilayer perceptron (MLP)
            1. The activation function
            2. The network architecture
            3. Software design
            4. Model definition
              1. Layers
              2. Synapses
              3. Connections
            5. Training cycle/epoch
              1. Step 1 – input forward propagation
                1. The computational model
                2. Objective
                3. Softmax
              2. Step 2 – sum of squared errors
              3. Step 3 – error backpropagation
                1. Error propagation
                2. The computational model
              4. Step 4 – synapse/weights adjustment
                1. Momentum factor for gradient descent
                2. Implementation
              5. Step 5 – convergence criteria
              6. Configuration
              7. Putting it all together
            6. Training strategies and classification
              1. Online versus batch training
              2. Regularization
              3. Model instantiation
              4. Prediction
          3. Evaluation
            1. Impact of learning rate
            2. Impact of the momentum factor
            3. Test case
              1. Implementation
              2. Model evaluation
              3. Impact of hidden layers architecture
          4. Benefits and limitations
          5. Summary
        10. 10. Genetic Algorithms
          1. Evolution
            1. The origin
            2. NP problems
            3. Evolutionary computing
          2. Genetic algorithms and machine learning
          3. Genetic algorithm components
            1. Encodings
              1. Value encoding
              2. Predicate encoding
              3. Solution encoding
              4. The encoding scheme
                1. Flat encoding
                2. Hierarchical encoding
            2. Genetic operators
              1. Selection
              2. Crossover
              3. Mutation
            3. Fitness score
          4. Implementation
            1. Software design
            2. Key components
            3. Selection
            4. Controlling population growth
            5. GA configuration
            6. Crossover
              1. Population
              2. Chromosomes
              3. Genes
            7. Mutation
              1. Population
              2. Chromosomes
              3. Genes
            8. The reproduction cycle
          5. GA for trading strategies
            1. Definition of trading strategies
              1. Trading operators
              2. The cost/unfitness function
              3. Trading signals
              4. Trading strategies
              5. Signal encoding
            2. Test case
              1. Data extraction
              2. Initial population
              3. Configuration
              4. GA instantiation
              5. GA execution
              6. Tests
                1. The unweighted score
                2. The weighted score
          6. Advantages and risks of genetic algorithms
          7. Summary
        11. 11. Reinforcement Learning
          1. Introduction
            1. The problem
            2. A solution – Q-learning
              1. Terminology
              2. Concept
              3. Value of policy
              4. Bellman optimality equations
              5. Temporal difference for model-free learning
              6. Action-value iterative update
            3. Implementation
              1. Software design
              2. States and actions
              3. Search space
              4. Policy and action-value
              5. The Q-learning training
              6. Tail recursion to the rescue
              7. Prediction
            4. Option trading using Q-learning
              1. Option property
              2. Option model
              3. Function approximation
              4. Constrained state-transition
              5. Putting it all together
            5. Evaluation
            6. Pros and cons of reinforcement learning
          2. Learning classifier systems
            1. Introduction to LCS
            2. Why LCS
            3. Terminology
            4. Extended learning classifier systems (XCS)
            5. XCS components
              1. Application to portfolio management
              2. XCS core data
              3. XCS rules
              4. Covering
              5. Example of implementation
            6. Benefits and limitations of learning classifier systems
          3. Summary
        12. 12. Scalable Frameworks
          1. Overview
          2. Scala
            1. Controlling object creation
            2. Parallel collections
              1. Processing a parallel collection
              2. Benchmark framework
              3. Performance evaluation
          3. Scalability with Actors
            1. The Actor model
            2. Partitioning
            3. Beyond actors – reactive programming
          4. Akka
            1. Master-workers
              1. Message exchange
              2. Worker actors
              3. The workflow controller
              4. The master Actor
              5. Master with routing
              6. Distributed discrete Fourier transform
              7. Limitations
            2. Futures
              1. The Actor life cycle
              2. Blocking on futures
              3. Handling future callbacks
              4. Putting it all together
          5. Apache Spark
            1. Why Spark
            2. Design principles
              1. In-memory persistency
              2. Laziness
              3. Transforms and Actions
              4. Shared variables
            3. Experimenting with Spark
              1. Deploying Spark
              2. Using Spark shell
              3. MLlib
              4. RDD generation
              5. K-means using Spark
            4. Performance evaluation
              1. Tuning parameters
              2. Tests
              3. Performance considerations
            5. Pros and cons
            6. 0xdata Sparkling Water
          6. Summary
        13. B. Basic Concepts
          1. Scala programming
            1. List of libraries
            2. Format of code snippets
            3. Encapsulation
            4. Class constructor template
            5. Companion objects versus case classes
            6. Enumerations versus case classes
            7. Overloading
            8. Design template for classifiers
            9. Data extraction
            10. Data sources
            11. Extraction of documents
            12. Matrix class
          2. Mathematics
            1. Linear algebra
              1. QR decomposition
              2. LU factorization
              3. LDL decomposition
              4. Cholesky factorization
              5. Singular value decomposition
              6. Eigenvalue decomposition
              7. Algebraic and numerical libraries
            2. First order predicate logic
            3. Jacobian and Hessian matrices
            4. Summary of optimization techniques
              1. Gradient descent methods
                1. Steepest descent
                2. Conjugate gradient
                3. Stochastic gradient descent
              2. Quasi-Newton algorithms
                1. BFGS
                2. L-BFGS
              3. Nonlinear least squares minimization
                1. Gauss-Newton
                2. Levenberg-Marquardt
              4. Lagrange multipliers
            5. Overview of dynamic programming
          3. Finances 101
            1. Fundamental analysis
            2. Technical analysis
              1. Terminology
              2. Trading signals and strategy
              3. Price patterns
            3. Options trading
            4. Financial data sources
          4. Suggested online courses
          5. References
        14. C. Bibliography
      9. Index

    Product information

    • Title: Scala: Guide for Data Science Professionals
    • Author(s): Pascal Bugnion, Arun Manivannan, Patrick R. Nicolas
    • Release date: February 2017
    • Publisher(s): Packt Publishing
    • ISBN: 9781787282858