Practical Synthetic Data Generation

Book description

Building and testing machine learning models requires access to large, diverse datasets. But where can you find usable datasets without running into privacy issues? This practical book introduces techniques for generating synthetic data (fake data generated from real data) that you can use for secondary analysis: conducting research, understanding customer behavior, developing new products, or generating new revenue.

Data scientists will learn how synthetic data generation provides a way to make such data broadly available for secondary purposes while addressing many privacy concerns. Analysts will learn the principles and steps for generating synthetic data from real datasets. And business leaders will see how synthetic data can help accelerate time to a product or solution.

This book describes:

  • Steps for generating synthetic data using multivariate normal distributions
  • Methods for distribution fitting covering different goodness-of-fit metrics
  • How to replicate the simple structure of original data
  • An approach for modeling data structure to consider complex relationships
  • Multiple approaches and metrics you can use to assess data utility
  • How analysis performed on real data can be replicated with synthetic data
  • Privacy implications of synthetic data and methods to assess identity disclosure
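
As a rough illustration of the first item in the list above, the sketch below fits a bivariate normal distribution to a handful of records and samples synthetic records from it. It is not taken from the book: the toy `real_rows` values (age, income in thousands), the function names, and the hand-written 2x2 Cholesky step are all assumptions made for this example, and a real project would use far more data and more careful modeling.

```python
import math
import random
import statistics


def fit_bivariate_normal(rows):
    """Estimate the means and sample covariance terms from (x, y) pairs."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    var_x = statistics.variance(xs)  # sample variance (n - 1 denominator)
    var_y = statistics.variance(ys)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return (mean_x, mean_y), (var_x, cov_xy, var_y)


def sample_bivariate_normal(mean, cov, n, rng):
    """Draw n synthetic (x, y) pairs using a 2x2 Cholesky factor."""
    mean_x, mean_y = mean
    var_x, cov_xy, var_y = cov
    # Cholesky factor of [[var_x, cov_xy], [cov_xy, var_y]]
    l11 = math.sqrt(var_x)
    l21 = cov_xy / l11
    l22 = math.sqrt(max(var_y - l21 ** 2, 0.0))
    samples = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        samples.append((mean_x + l11 * z1, mean_y + l21 * z1 + l22 * z2))
    return samples


# Hypothetical "real" records: (age, income in thousands). Illustrative only.
real_rows = [
    (23, 31.0), (35, 48.5), (41, 52.0), (29, 39.0),
    (52, 70.5), (47, 61.0), (38, 50.0), (60, 82.0),
]

real_mean, real_cov = fit_bivariate_normal(real_rows)
synthetic_rows = sample_bivariate_normal(real_mean, real_cov, 1000, random.Random(0))
synth_mean, synth_cov = fit_bivariate_normal(synthetic_rows)
```

Comparing `synth_mean` and `synth_cov` with `real_mean` and `real_cov` gives a first, very coarse sense of utility. A sketch like this says nothing about privacy, which needs its own assessment.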

Table of contents

  1. Preface
    1. Conventions Used in This Book
    2. O’Reilly Online Learning
    3. How to Contact Us
    4. Acknowledgments
  2. 1. Introducing Synthetic Data Generation
    1. Defining Synthetic Data
      1. Synthesis from Real Data
      2. Synthesis Without Real Data
      3. Synthesis and Utility
    2. The Benefits of Synthetic Data
      1. Efficient Access to Data
      2. Enabling Better Analytics
      3. Synthetic Data as a Proxy
      4. Learning to Trust Synthetic Data
    3. Synthetic Data Case Studies
      1. Manufacturing and Distribution
      2. Healthcare
      3. Financial Services
      4. Transportation
    4. Summary
  3. 2. Implementing Data Synthesis
    1. When to Synthesize
    2. Identifiability Spectrum
    3. Trade-Offs in Selecting PETs to Enable Data Access
      1. Decision Criteria
      2. PETs Considered
      3. Decision Framework
      4. Examples of Applying the Decision Framework
    4. Data Synthesis Projects
      1. Data Synthesis Steps
      2. Data Preparation
    5. The Data Synthesis Pipeline
    6. Synthesis Program Management
    7. Summary
  4. 3. Getting Started: Distribution Fitting
    1. Framing Data
    2. How Data Is Distributed
    3. Fitting Distributions to Real Data
    4. Generating Synthetic Data from a Distribution
      1. Measuring How Well Synthetic Data Fits a Distribution
      2. The Overfitting Dilemma
      3. A Little Light Weeding
    5. Summary
  5. 4. Evaluating Synthetic Data Utility
    1. Synthetic Data Utility Framework: Replication of Analysis
    2. Synthetic Data Utility Framework: Utility Metrics
      1. Comparing Univariate Distributions
      2. Comparing Bivariate Statistics
      3. Comparing Multivariate Prediction Models
      4. Distinguishability
    3. Summary
  6. 5. Methods for Synthesizing Data
    1. Generating Synthetic Data from Theory
      1. Sampling from a Multivariate Normal Distribution
      2. Inducing Correlations with Specified Marginal Distributions
      3. Copulas with Known Marginal Distributions
    2. Generating Realistic Synthetic Data
      1. Fitting Real Data to Known Distributions
      2. Using Machine Learning to Fit the Distributions
    3. Hybrid Synthetic Data
    4. Machine Learning Methods
    5. Deep Learning Methods
    6. Synthesizing Sequences
    7. Summary
  7. 6. Identity Disclosure in Synthetic Data
    1. Types of Disclosure
      1. Identity Disclosure
      2. Learning Something New
      3. Attribute Disclosure
      4. Inferential Disclosure
      5. Meaningful Identity Disclosure
      6. Defining Information Gain
      7. Bringing It All Together
      8. Unique Matches
    2. How Privacy Law Impacts the Creation and Use of Synthetic Data
      1. Issues Under the GDPR
      2. Issues Under the CCPA
      3. Issues Under HIPAA
      4. Article 29 Working Party Opinion
    3. Summary
  8. 7. Practical Data Synthesis
    1. Managing Data Complexity
      1. For Every Pre-Processing Step There Is a Post-Processing Step
      2. Field Types
      3. The Need for Rules
      4. Not All Fields Have to Be Synthesized
      5. Synthesizing Dates
      6. Synthesizing Geography
      7. Lookup Fields and Tables
      8. Missing Data and Other Data Characteristics
      9. Partial Synthesis
    2. Organizing Data Synthesis
      1. Computing Capacity
      2. A Toolbox of Techniques
      3. Synthesizing Cohorts Versus Full Datasets
      4. Continuous Data Feeds
      5. Privacy Assurance as Certification
      6. Performing Validation Studies to Get Buy-In
      7. Motivated Intruder Tests
      8. Who Owns Synthetic Data?
    3. Conclusions
  9. Index

Product information

  • Title: Practical Synthetic Data Generation
  • Author(s): Khaled El Emam, Lucy Mosquera, Richard Hoptroff
  • Release date: May 2020
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492072744