Practical Big Data Analytics

Book description

Get command of your organizational Big Data using the power of data science and analytics

About This Book

  • A perfect companion to boost your Big Data storage, processing, and analysis skills and help you make informed business decisions
  • Work with the best tools such as Apache Hadoop, R, Python, and Spark for NoSQL platforms to perform massive online analyses
  • Get expert tips on statistical inference, machine learning, mathematical modeling, and data visualization for Big Data

Who This Book Is For

The book is intended for existing and aspiring Big Data professionals who wish to become the go-to person in their organization when it comes to Big Data architecture, analytics, and governance. While no prior knowledge of Big Data or related technologies is assumed, it will be helpful to have some programming experience.

What You Will Learn

  • Get a 360-degree view into the world of Big Data, data science and machine learning
  • Explore a broad range of technical and business Big Data analytics topics that cater to the interests of technical experts as well as corporate IT executives
  • Get hands-on experience with industry-standard Big Data and machine learning tools such as Hadoop, Spark, MongoDB, kdb+, and R
  • Create production-grade machine learning BI Dashboards using R and R Shiny with step-by-step instructions
  • Learn how to combine open-source Big Data, machine learning and BI Tools to create low-cost business analytics applications
  • Understand corporate strategies for successful Big Data and data science projects
  • Go beyond general-purpose analytics to develop cutting-edge Big Data applications using emerging technologies

In Detail

Big Data analytics relates to the strategies organizations use to collect, organize, and analyze large amounts of data to uncover valuable business insights that cannot otherwise be obtained through traditional systems. Crafting an enterprise-scale, cost-efficient Big Data and machine learning solution to uncover insights and value from your organization's data is a challenge. Today, with hundreds of new Big Data systems, machine learning packages, and BI tools, selecting the right combination of technologies is an even greater challenge. This book will help you do that.

With the help of this guide, you will be able to bridge the gap between the theoretical world of technology and the practical reality of building corporate Big Data and data science platforms. You will get hands-on exposure to Hadoop and Spark, build machine learning dashboards using R and R Shiny, create web-based apps using NoSQL databases such as MongoDB, and even learn how to write R code for neural networks.

By the end of the book, you will have a clear and concrete understanding of what Big Data analytics means, how it drives revenue for organizations, and how you can develop your own Big Data analytics solution using the tools and methods described in this book.

Style and approach

This book equips you with knowledge of various NoSQL tools, R and Python programming, cloud platforms, and related techniques so you can use them to store and analyze your data and deliver meaningful insights from it.

Table of contents

  1. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  2. Too Big or Not Too Big
    1. What is big data?
      1. A brief history of data
        1. Dawn of the information age
        2. Dr. Alan Turing and modern computing
        3. The advent of the stored-program computer
        4. From magnetic devices to SSDs
    2. Why we are talking about big data now if data has always existed
      1. Definition of big data
        1. Building blocks of big data analytics
    3. Types of Big Data
      1. Structured
      2. Unstructured
      3. Semi-structured
    4. Sources of big data
      1. The 4Vs of big data
    5. When do you know you have a big data problem and where do you start your search for the big data solution?
    6. Summary
  3. Big Data Mining for the Masses
    1. What is big data mining?
      1. Big data mining in the enterprise
        1. Building the case for a Big Data strategy
        2. Implementation life cycle
        3. Stakeholders of the solution
        4. Implementing the solution
    2. Technical elements of the big data platform
      1. Selection of the hardware stack
      2. Selection of the software stack
    3. Summary
  4. The Analytics Toolkit
    1. Components of the Analytics Toolkit
    2. System recommendations
      1. Installing on a laptop or workstation
      2. Installing on the cloud
    3. Installing Hadoop
      1. Installing Oracle VirtualBox
      2. Installing CDH in other environments
    4. Installing Packt Data Science Box
    5. Installing Spark
    6. Installing R
      1. Steps for downloading and installing Microsoft R Open
    7. Installing RStudio
    8. Installing Python
    9. Summary
  5. Big Data With Hadoop
    1. The fundamentals of Hadoop
      1. The fundamental premise of Hadoop
      2. The core modules of Hadoop
        1. Hadoop Distributed File System - HDFS
        2. Data storage process in HDFS
      3. Hadoop MapReduce
        1. An intuitive introduction to MapReduce
        2. A technical understanding of MapReduce
        3. Block size and number of mappers and reducers
      4. Hadoop YARN
        1. Job scheduling in YARN
        2. Other topics in Hadoop
          1. Encryption
          2. User authentication
          3. Hadoop data storage formats
        3. New features expected in Hadoop 3
    2. The Hadoop ecosystem
    3. Hands-on with CDH
      1. WordCount using Hadoop MapReduce
      2. Analyzing oil import prices with Hive
        1. Joining tables in Hive
    4. Summary
  6. Big Data Mining with NoSQL
    1. Why NoSQL?
      1. The ACID, BASE, and CAP properties
        1. ACID and SQL
        2. The BASE property of NoSQL
        3. The CAP theorem
      2. The need for NoSQL technologies
        1. Google Bigtable
        2. Amazon Dynamo
    2. NoSQL databases
      1. In-memory databases
      2. Columnar databases
      3. Document-oriented databases
      4. Key-value databases
      5. Graph databases
      6. Other NoSQL types and summary of other types of databases
    3. Analyzing Nobel Laureates data with MongoDB
      1. JSON format
      2. Installing and using MongoDB
    4. Tracking physician payments with real-world data
      1. Installing kdb+, R, and RStudio
        1. Installing kdb+
        2. Installing R
        3. Installing RStudio
    5. The CMS Open Payments Portal
      1. Downloading the CMS Open Payments data
      2. Creating the Q application
        1. Loading the data
        2. The backend code
      3. Creating the frontend web portal
    6. R Shiny platform for developers
      1. Putting it all together - The CMS Open Payments application
      2. Applications
    7. Summary
  7. Spark for Big Data Analytics
    1. The advent of Spark
      1. Limitations of Hadoop
      2. Overcoming the limitations of Hadoop
      3. Theoretical concepts in Spark
        1. Resilient distributed datasets
        2. Directed acyclic graphs
        3. SparkContext
        4. Spark DataFrames
        5. Actions and transformations
        6. Spark deployment options
        7. Spark APIs
      4. Core components in Spark
        1. Spark Core
        2. Spark SQL
        3. Spark Streaming
        4. GraphX
        5. MLlib
      5. The architecture of Spark
      6. Spark solutions
    2. Spark practicals
      1. Signing up for Databricks Community Edition
    3. Spark exercise - hands-on with Spark (Databricks)
    4. Summary
  8. An Introduction to Machine Learning Concepts
    1. What is machine learning?
      1. The evolution of machine learning
    2. Factors that led to the success of machine learning
    3. Machine learning, statistics, and AI
    4. Categories of machine learning
      1. Supervised and unsupervised machine learning
        1. Supervised machine learning
          1. Vehicle Mileage, Number Recognition and other examples
        2. Unsupervised machine learning
    5. Subdividing supervised machine learning
    6. Common terminologies in machine learning
    7. The core concepts in machine learning
      1. Data management steps in machine learning
        1. Pre-processing and feature selection techniques
          1. Centering and scaling
        2. The near-zero variance function
        3. Removing correlated variables
        4. Other common data transformations
        5. Data sampling
        6. Data imputation
        7. The importance of variables
      2. The train, test splits, and cross-validation concepts
        1. Splitting the data into train and test sets
        2. The cross-validation parameter
          1. Creating the model
    8. Leveraging multicore processing in the model
    9. Summary
  9. Machine Learning Deep Dive
    1. The bias, variance, and regularization properties
    2. The gradient descent and VC Dimension theories
    3. Popular machine learning algorithms
      1. Regression models
      2. Association rules
        1. Confidence
        2. Support
        3. Lift
      3. Decision trees
      4. The Random forest extension
      5. Boosting algorithms
      6. Support vector machines
      7. The K-Means machine learning technique
      8. The neural networks related algorithms
    4. Tutorial - associative rules mining with CMS data
      1. Downloading the data
      2. Writing the R code for Apriori
      3. Shiny (R Code)
      4. Using custom CSS and fonts for the application
      5. Running the application
    5. Summary
  10. Enterprise Data Science
    1. Enterprise data science overview
    2. A roadmap to enterprise analytics success
    3. Data science solutions in the enterprise
      1. Enterprise data warehouse and data mining
      2. Traditional data warehouse systems
        1. Oracle Exadata, Exalytics, and TimesTen
        2. HP Vertica
        3. Teradata
        4. IBM data warehouse systems (formerly Netezza appliances)
        5. PostgreSQL
        6. Greenplum
        7. SAP Hana
      3. Enterprise and open source NoSQL Databases
        1. Kdb+
        2. MongoDB
        3. Cassandra
        4. Neo4j
      4. Cloud databases
        1. Amazon Redshift, Redshift Spectrum, and Athena databases
        2. Google BigQuery and other cloud services
        3. Azure CosmosDB
      5. GPU databases
        1. Brytlyt
        2. MapD
      6. Other common databases
    4. Enterprise data science – machine learning and AI
      1. The R programming language
      2. Python
      3. OpenCV, Caffe, and others
      4. Spark
      5. Deep learning
      6. H2O and Driverless AI
      7. Datarobot
      8. Command-line tools
      9. Apache MADlib
      10. Machine learning as a service
    5. Enterprise infrastructure solutions
      1. Cloud computing
      2. Virtualization
      3. Containers – Docker, Kubernetes, and Mesos
      4. On-premises hardware
      5. Enterprise Big Data
    6. Tutorial – using RStudio in the cloud
    7. Summary
  11. Closing Thoughts on Big Data
    1. Corporate big data and data science strategy
    2. Ethical considerations
    3. Silicon Valley and data science
    4. The human factor
      1. Characteristics of successful projects
    5. Summary
  12. External Data Science Resources
    1. Big data resources
    2. NoSQL products
    3. Languages and tools
    4. Creating dashboards
    5. Notebooks
    6. Visualization libraries
    7. Courses on R
    8. Courses on machine learning
    9. Machine learning and deep learning links
    10. Web-based machine learning services
    11. Movies
    12. Machine learning books from Packt
    13. Books for leisure reading
  13. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think

Product information

  • Title: Practical Big Data Analytics
  • Author(s): Nataraj Dasgupta
  • Release date: January 2018
  • Publisher(s): Packt Publishing
  • ISBN: 9781783554393