Simulation for Data Science with R

Book description

Harness actionable insights from your data with computational statistics and simulations using R

About This Book

  • Learn five different simulation techniques (Monte Carlo, Discrete Event Simulation, System Dynamics, Agent-Based Modeling, and Resampling) in depth using real-world case studies
  • A unique book that teaches you the essential and fundamental concepts in statistical modeling and simulation

Who This Book Is For

This book is for users who are familiar with computational methods. If you want to learn about the advanced features of R, including computer-intensive Monte Carlo methods and computational tools for statistical simulation, then this book is for you. Good knowledge of R programming is assumed.

What You Will Learn

  • Explore advanced R features to simulate data and extract insights from it
  • Get to know the advanced features of R, including high-performance computing and advanced data manipulation
  • See random number simulation used to simulate distributions, data sets, and populations
  • Simulate close-to-reality populations as the basis for agent-based micro-, model-, and design-based simulations
  • Design statistical solutions with R for solving scientific and real-world problems
  • Work with several R packages, such as boot, simPop, VIM, data.table, dplyr, parallel, StatDA, simecol, simecolModels, deSolve, and many more

In Detail

Simulation for Data Science with R aims to teach you how to begin performing data science tasks by taking advantage of R's powerful ecosystem of packages. R is among the most widely used programming languages for data science, and the combination can be a powerful way to handle the complexities of varied real-world data sets.

This book provides a computational and methodological framework for statistical simulation. Through it, you will get to grips with the software environment R. After getting to know the background of popular methods in computational statistics, you will see applications in R that help you understand the methods and gain experience working with real-world data and real-world problems. The book helps uncover the large-scale patterns in complex systems where interdependencies and variation are critical. An effective simulation is driven by data-generating processes that accurately reflect real physical populations. You will learn how to plan and structure a simulation project to aid the decision-making process and the presentation of results.

By the end of this book, you will be familiar with the software environment R, know the background of popular methods in the area, and have worked through applications in R on real-world data and real-world problems.

Style and approach

This book takes a practical, hands-on approach to explaining statistical computing methods, gives advice on their usage, and provides computational tools to help you solve common problems in statistical simulation and computer-intensive methods.

Table of contents

  1. Simulation for Data Science with R
    1. Table of Contents
    2. Simulation for Data Science with R
    3. Credits
    4. About the Author
    5. About the Reviewer
    6. www.PacktPub.com
      1. eBooks, discount offers, and more
        1. Why subscribe?
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    8. 1. Introduction
      1. What is simulation and where is it applied?
      2. Why use simulation?
      3. Simulation and big data
      4. Choosing the right simulation technique
      5. Summary
      6. References
    9. 2. R and High-Performance Computing
      1. The R statistical environment
        1. Basics in R
        2. Some very basic stuff about R
        3. Installation and updates
        4. Help
        5. The R workspace and the working directory
        6. Data types
          1. Vectors in R
          2. Factors in R
          3. list
          4. data.frame
          5. array
        7. Missing values
      2. Generic functions, methods, and classes
      3. Data manipulation in R
        1. Apply and friends with basic R
        2. Basic data manipulation with the dplyr package
          1. dplyr – creating a local data frame
          2. dplyr – selecting lines
          3. dplyr – order
          4. dplyr – selecting columns
          5. dplyr – uniqueness
          6. dplyr – creating variables
          7. dplyr – grouping and aggregates
          8. dplyr – window functions
        3. Data manipulation with the data.table package
          1. data.table – variable construction
          2. data.table – indexing or subsetting
          3. data.table – keys
          4. data.table – fast subsetting
          5. data.table – calculations in groups
      4. High performance computing
        1. Profiling to detect computationally slow functions in code
          1. Further benchmarking
        2. Parallel computing
        3. Interfaces to C++
      5. Visualizing information
        1. The graphics system in R
        2. The graphics package
          1. Warm-up example – a high-level plot
          2. Control of graphics parameters
        3. The ggplot2 package
      6. References
    10. 3. The Discrepancy between Pencil-Driven Theory and Data-Driven Computational Solutions
      1. Machine numbers and rounding problems
        1. Example – the 64-bit representation of numbers
        2. Convergence in the deterministic case
        3. Example – convergence
      2. Condition of problems
      3. Summary
      4. References
    11. 4. Simulation of Random Numbers
      1. Real random numbers
      2. Simulating pseudo random numbers
        1. Congruential generators
        2. Linear and multiplicative congruential generators
        3. Lagged Fibonacci generators
        4. More generators
      3. Simulation of non-uniform distributed random variables
        1. The inversion method
        2. The alias method
        3. Estimation of counts in tables with log-linear models
        4. Rejection sampling
          1. Simulating values from a normal distribution
          2. Simulating random numbers from a Beta distribution
        5. Truncated distributions
        6. Metropolis-Hastings algorithm
          1. A few words on Markov chains
          2. The Metropolis sampler
        7. The Gibbs sampler
          1. The two-phase Gibbs sampler
          2. The multiphase Gibbs sampler
          3. Application in linear regression
        8. The diagnosis of MCMC samples
      4. Tests for random numbers
        1. The evaluation of random numbers – an example of a test
      5. Summary
      6. References
    12. 5. Monte Carlo Methods for Optimization Problems
      1. Numerical optimization
        1. Gradient ascent/descent
        2. Newton-Raphson methods
        3. Further general-purpose optimization methods
      2. Dealing with stochastic optimization
        1. Simplified procedures (Star Trek, Spaceballs, and Spaceballs princess)
        2. Metropolis-Hastings revisited
        3. Gradient-based stochastic optimization
      3. Summary
      4. References
    13. 6. Probability Theory Shown by Simulation
      1. Some basics on probability theory
      2. Probability distributions
        1. Discrete probability distributions
        2. Continuous probability distributions
      3. Winning the lottery
      4. The weak law of large numbers
        1. Emperor penguins and your boss
          1. Limits and convergence of random variables
          2. Convergence of the sample mean – weak law of large numbers
          3. Showing the weak law of large numbers by simulation
      5. The central limit theorem
      6. Properties of estimators
        1. Properties of estimators
        2. Confidence intervals
        3. A note on robust estimators
      7. Summary
      8. References
    14. 7. Resampling Methods
      1. The bootstrap
        1. A motivating example with odds ratios
        2. Why the bootstrap works
        3. A closer look at the bootstrap
        4. The plug-in principle
      2. Estimation of standard errors with bootstrapping
        1. An example of a complex estimation using the bootstrap
      3. The parametric bootstrap
      4. Estimating bias with bootstrap
        1. Confidence intervals by bootstrap
      5. The jackknife
        1. Disadvantages of the jackknife
        2. The delete-d jackknife
        3. Jackknife after bootstrap
      6. Cross-validation
        1. The classical linear regression model
        2. The basic concept of cross validation
        3. Classical cross validation – 70/30 method
        4. Leave-one-out cross validation
        5. k-fold cross validation
      7. Summary
      8. References
    15. 8. Applications of Resampling Methods and Monte Carlo Tests
      1. The bootstrap in regression analysis
        1. Motivation to use the bootstrap
          1. The most popular but often worst method
          2. Bootstrapping by draws from residuals
      2. Proper variance estimation with missing values
      3. Bootstrapping in time series
      4. Bootstrapping in the case of complex sampling designs
      5. Monte Carlo tests
        1. A motivating example
        2. The permutation test as a special kind of MC test
        3. A Monte Carlo test for multiple groups
        4. Hypothesis testing using a bootstrap
        5. A test for multivariate normality
        6. Size of the test
        7. Power comparisons
      6. Summary
        1. References
    16. 9. The EM Algorithm
      1. The basic EM algorithm
        1. Some prerequisites
        2. Formal definition of the EM algorithm
        3. Introductory example for the EM algorithm
      2. The EM algorithm by example of k-means clustering
      3. The EM algorithm for the imputation of missing values
      4. Summary
      5. References
    17. 10. Simulation with Complex Data
      1. Different kinds of simulation and software
      2. Simulating data using complex models
        1. A model-based simple example
        2. A model-based example with mixtures
        3. Model-based approach to simulate data
        4. An example of simulating high-dimensional data
        5. Simulating finite populations with cluster or hierarchical structures
      3. Model-based simulation studies
        1. Latent model example continued
        2. A simple example of model-based simulation
        3. A model-based simulation study
      4. Design-based simulation
        1. An example with complex survey data
        2. Simulation of the synthetic population
        3. Estimators of interest
        4. Defining the sampling design
        5. Using stratified sampling
        6. Adding contamination
        7. Performing simulations separately on different domains
      5. Inserting missing values
      6. Summary
        1. References
    18. 11. System Dynamics and Agent-Based Models
      1. Agent-based models
      2. Dynamics in love and hate
      3. Dynamic systems in ecological modeling
      4. Summary
      5. References
    19. Index

Product information

  • Title: Simulation for Data Science with R
  • Author(s): Matthias Templ
  • Release date: June 2016
  • Publisher(s): Packt Publishing
  • ISBN: 9781785881169