Data Architecture: A Primer for the Data Scientist, 2nd Edition

Book description

Over the past 5 years, the concept of big data has matured, data science has grown exponentially, and data architecture has become a standard part of organizational decision-making. Throughout all this change, the basic principles that shape the architecture of data have remained the same. There remains a need for people to take a look at the "bigger picture" and to understand where their data fit into the grand scheme of things.

Data Architecture: A Primer for the Data Scientist, Second Edition, addresses the larger architectural picture of how big data fits within the existing information infrastructure or data warehousing systems. This is an essential topic not only for data scientists, analysts, and managers but also for researchers and engineers who increasingly need to deal with large and complex sets of data. Until data are gathered and can be placed into an existing framework or architecture, they cannot be used to their full potential. Drawing upon years of practical experience and using numerous examples and case studies from across various industries, the authors seek to explain this larger picture into which big data fits, giving data scientists the necessary context for how pieces of the puzzle should fit together.

  • New case studies include expanded coverage of textual management and analytics
  • New chapters on visualization and big data
  • Discussion of new visualizations of the end-state architecture

Table of contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Dedication
  6. Chapter 1.1: An Introduction to Data Architecture
    1. Abstract
    2. Subdividing Data
    3. Repetitive/Nonrepetitive Unstructured Data
    4. The Great Divide of Data
    5. Textual/Nontextual Data
    6. The Different Forms of Data
    7. Business Value
  7. Chapter 1.2: The Data Infrastructure
    1. Abstract
    2. Two Types of Repetitive Data
    3. Repetitive Structured Data
    4. Repetitive Big Data
    5. The Two Infrastructures
    6. What's Being Optimized?
    7. Comparing the Two Infrastructures
  8. Chapter 1.3: The “Great Divide”
    1. Abstract
    2. Classifying Corporate Data
    3. The “Great Divide”
    4. Repetitive Unstructured Data
    5. Nonrepetitive Unstructured Data
    6. Different Worlds
  9. Chapter 1.4: Demographics of Corporate Data
    1. Abstract
  10. Chapter 1.5: Corporate Data Analysis
    1. Abstract
  11. Chapter 1.6: The Life Cycle of Data: Understanding Data Over Time
    1. Abstract
  12. Chapter 1.7: A Brief History of Data
    1. Abstract
    2. Paper Tape and Punch Cards
    3. Magnetic Tapes
    4. Disk Storage
    5. Data Base Management System (DBMS)
    6. Coupled Processors
    7. Online Transaction Processing
    8. Data Warehouse
    9. Parallel Data Management
    10. Data Vault
    11. Big Data
    12. The Great Divide
  13. Chapter 2.1: The End-State Architecture—The “World Map”
    1. Abstract
    2. Architectural Components
    3. Different Kinds of Data in the End State Architecture
    4. Shaping the Data Through Models
    5. Where Is the Data Warehouse?
    6. Where Different Types of Questions Are Answered Across the End State Architecture
    7. Data in the Data Lake
    8. Metadata in the End State Architecture
    9. Networked Metadata
    10. An Evolutionary Experience
    11. The Data Lake Architecture
  14. Chapter 3.1: Transformations in the End-State Architecture
    1. Abstract
    2. Redundant Data
    3. Transformations
    4. Customizing Data
    5. Transforming Text
    6. Transforming Application Data
    7. Transforming Data Into a Customized State
    8. Transforming Data Into Bulk Storage
    9. Transforming Data Generated Automatically
    10. Transforming Bulk Data
    11. Transformation and Redundancy
  15. Chapter 4.1: A Brief History of Big Data
    1. Abstract
    2. An Analogy—Taking the High Ground
    3. Taking the High Ground
    4. Standardization With the 360
    5. Online Transaction Processing
    6. Enter Teradata and MPP Processing
    7. Then Came Hadoop and Big Data
    8. IBM and Hadoop
    9. Holding the High Ground
  16. Chapter 4.2: What Is Big Data?
    1. Abstract
    2. Another Definition
    3. Large Volumes
    4. Inexpensive Storage
    5. The Roman Census Approach
    6. Unstructured Data
    7. Data in Big Data
    8. Context in Repetitive Data
    9. Nonrepetitive Data
    10. Context in Nonrepetitive Data
  17. Chapter 4.3: Parallel Processing
    1. Abstract
  18. Chapter 4.4: Unstructured Data
    1. Abstract
    2. Textual Information—Everywhere
    3. Decisions Based on Structured Data
    4. The Business Value Proposition
    5. Repetitive and Nonrepetitive Unstructured Information
    6. Ease of Analysis
    7. Contextualization
    8. Some Approaches to Contextualization
    9. Map Reduce
    10. Manual Analysis
  19. Chapter 4.5: Contextualizing Repetitive Unstructured Data
    1. Abstract
    2. Parsing Repetitive Unstructured Data
    3. Recasting the Output Data
  20. Chapter 4.6: Textual Disambiguation
    1. Abstract
    2. From Narrative Into an Analytical Data Base
    3. Input Into Textual Disambiguation
    4. Mapping
    5. Input/Output
    6. Document Fracturing/Named Value Processing
    7. Preprocessing a Document
    8. E-mails—A Special Case
    9. Spreadsheets
    10. Report Decompilation
  21. Chapter 4.7: Taxonomies
    1. Abstract
    2. Data Models/Taxonomies
    3. Applicability of Taxonomies
    4. What Is a Taxonomy?
    5. Taxonomies in Multiple Languages
    6. Commercial or Private Taxonomies?
    7. Dynamics of Taxonomies and Textual Disambiguation
    8. Taxonomies and Textual Disambiguation—Separate Technologies
    9. Different Types of Taxonomies
    10. Taxonomies—Maintenance Over Time
  22. Chapter 5.1: The Siloed Application Environment
    1. Abstract
    2. The Challenge of Siloed Applications
    3. Building Siloed Applications
    4. What Does a Siloed Application Look Like?
    5. Current Valued Data
    6. Minimal Historical Data
    7. High Availability
    8. Overlap Between Siloed Applications
    9. Frozen Business Requirements
    10. Dismantling Siloed Applications
  23. Chapter 6.1: Introduction to Data Vault 2.0
    1. Abstract
    2. Data Vault Origins and Background
    3. What Is Data Vault 2.0 Modeling?
    4. How Is Data Vault 2.0 Methodology Defined?
    5. Why Do We Need a Data Vault 2.0 Architecture?
    6. Where Does Data Vault 2.0 Implementation Fit?
    7. What Are the Business Benefits of Data Vault 2.0?
    8. What Is Data Vault 1.0?
  24. Chapter 6.2: Introduction to Data Vault Modeling
    1. Abstract
    2. What Is a Data Vault Model Concept?
    3. Data Vault Model Defined
    4. Components of a Data Vault Model
    5. What Makes Business Keys So Interesting?
    6. What Does This Have to Do With Data Vault and Data Warehousing?
    7. How Does This Translate to Data Vault Modeling?
    8. Why Restructure the Data From the Staging Area?
    9. What Are the Basic Rules of the Data Vault Model?
    10. Why Do We Need Many to Many Link Structures?
    11. Primary Key Options for Data Vault 2.0
  25. Chapter 6.3: Introduction to Data Vault Architecture
    1. Abstract
    2. What Is a Data Vault 2.0 Architecture?
    3. How Does NoSQL Fit in to the Architecture?
    4. What Are the Objectives of the Data Vault 2.0 Architecture?
    5. What Is the Objective of the Data Vault 2.0 Model?
    6. What Are Hard and Soft Business Rules?
    7. How Does Managed Self Service BI Fit in the Architecture?
  26. Chapter 6.4: Introduction to Data Vault Methodology
    1. Abstract
    2. Data Vault 2.0 Methodology Overview
    3. How Does CMMI Contribute to the Methodology?
    4. If CMMI Is So Great, Why Should We Care About Agility Then?
    5. Why Include PMP, SDLC If CMMI and Agile Should Be All That's Needed?
    6. So Then, What Does Six Sigma Contribute to the Data Vault 2 Methodology?
    7. Where Does TQM (Total Quality Management) Fit in to All of This?
  27. Chapter 6.5: Introduction to Data Vault Implementation
    1. Abstract
    2. Implementation Overview
    3. What's So Important About Patterns?
    4. Why Does Reengineering Happen Because of Big Data?
    5. Why Do We Need to Virtualize Our Data Marts?
    6. What Is Managed Self-Service BI?
  28. Chapter 7.1: The Operational Environment: A Short History
    1. Abstract
    2. Commercial Uses of the Computer
    3. The First Applications
    4. Ed Yourdon and the Structured Revolution
    5. The SDLC
    6. Disk Technology
    7. Enter the DBMS
    8. Response Time and Availability
    9. Corporate Computing Today
  29. Chapter 7.2: The Standard Work Unit
    1. Abstract
    2. Elements of Response Time
    3. An Hourglass Analogy
    4. The Racetrack Analogy
    5. Your Vehicle Runs as Fast as the Vehicle in Front of It
    6. The Standard Work Unit
    7. The SLA
  30. Chapter 7.3: Data Modeling for the Structured Environment
    1. Abstract
    2. The Purpose of the Roadmap
    3. Granular Data Only
    4. The ERD
    5. The DIS
    6. Physical Data Base Design
    7. Relating the Different Levels of the Data Model
    8. An Example of the Linkage
    9. Generic Data Models
    10. Operational Data Models/Data Warehouse Data Models
  31. Chapter 8.1: A Brief History of Data Architecture
    1. Abstract
  32. Chapter 8.2: Big Data/Existing System Interface
    1. Abstract
    2. The Big Data/Existing Systems Interface
    3. The Repetitive Raw Big Data/Existing Systems Interface
    4. Exception Based Data
    5. The Nonrepetitive Raw Big Data/Existing Systems Interface
    6. Into the Existing Systems Environment
    7. The “Context Enriched” Big Data Environment
    8. Analyzing Structured Data/Unstructured Data Together
  33. Chapter 8.3: The Data Warehouse/Operational Environment Interface
    1. Abstract
    2. The Operational/Data Warehouse Interface
    3. The Classical ETL Interface
    4. The ODS and the ETL Interface
    5. The Staging Area
    6. Changed Data Capture
    7. Inline Transformation
    8. ELT Processing
  34. Chapter 8.4: Data Architecture: A High-Level Perspective
    1. Abstract
    2. A High Level Perspective
    3. Redundancy
    4. The System of Record
    5. Different Types of Questions
    6. Different Communities
  35. Chapter 9.1: Repetitive Analytics: Some Basics
    1. Abstract
    2. Different Kinds of Analysis
    3. Looking for Patterns
    4. Heuristic Processing
    5. Freezing Data
    6. The Sandbox
    7. The “Normal” Profile
    8. Distillation, Filtering
    9. Subsetting Data
    10. Bias of the Sample
    11. Filtering Data
    12. Repetitive Data and Context
    13. Linking Repetitive Records
    14. Log Tape Records
    15. Analyzing Points of Data
    16. Outliers
    17. Data Over Time
  36. Chapter 9.2: Analyzing Repetitive Data
    1. Abstract
    2. Log Data
    3. Active/Passive Indexing of Data
    4. Summary/Detailed Data
    5. Metadata in Big Data
    6. Linking Data
  37. Chapter 9.3: Repetitive Analysis
    1. Abstract
    2. Internal, External Data
    3. Universal Identifiers
    4. Security
    5. Filtering, Distillation
    6. Archiving Results
    7. Metrics
  38. Chapter 10.1: Nonrepetitive Data
    1. Abstract
    2. Inline Contextualization
    3. Taxonomy/Ontology Processing
    4. Custom Variables
    5. Homographic Resolution
    6. Acronym Resolution
    7. Negation Analysis
    8. Numeric Tagging
    9. Date Tagging
    10. Date Standardization
    11. List Processing
    12. Associative Word Processing
    13. Stop Word Processing
    14. Word Stemming
    15. Document Metadata
    16. Document Classification
    17. Proximity Analysis
    18. Functional Sequencing Within Textual ETL
    19. Internal Referential Integrity
    20. Preprocessing, Postprocessing
  39. Chapter 10.2: Mapping
    1. Abstract
  40. Chapter 10.3: Analytics From Nonrepetitive Data
    1. Abstract
    2. Call Center Information
    3. Medical Records
  41. Chapter 11.1: Operational Analytics: Response Time
    1. Abstract
    2. Transaction Response Time
  42. Chapter 12.1: Operational Analytics
    1. Abstract
    2. Different Perspectives of Data
    3. Data Marts
    4. The Operational Data Store—ODS
  43. Chapter 13.1: Personal Analytics
    1. Abstract
  44. Chapter 14.1: Data Models Across the End-State Architecture
    1. Abstract
    2. The Different Data Models
    3. Functional Decomposition and Data Flow Diagrams
    4. The Corporate Data Model
    5. The Star Join/Dimensional Data Model
    6. Taxonomies/Ontologies
    7. The Selective Subdivision of Data
    8. Proactive/Reactive Data Models
  45. Chapter 15.1: The System of Record
    1. Abstract
    2. The End User Cycle of Awareness
    3. The System of Record
    4. The System of Record in the End State Architecture
    5. The Role of Age in the System of Record
    6. A Simple Example
    7. The Flow of Data in the System of Record
    8. Other Data Than the System of Record
    9. Is Data Updated in the System of Record?
    10. Detailed and Summary Data in the System of Record
    11. Auditing Data and the System of Record
    12. Text and the System of Record
  46. Chapter 16.1: Business Value and the End-State Architecture
    1. Abstract
    2. The Evolution of the End State Architecture
    3. What Is Meant by “Business Value”
    4. Tactical Business Value/Strategic Business Value
    5. Volume of Data Versus Business Value
    6. The “Million in One” Syndrome
    7. Where Business Value Occurs
    8. Data Relevancy Over Time
    9. Where Tactical Decisions Are Made
  47. Chapter 17.1: Managing Text
    1. Abstract
    2. The Challenge of Text
    3. The Challenge of Context
    4. The Processing Components of Textual ETL
    5. Secondary Analysis
    6. Visualization
    7. Merging Text Based Data and Structured Data
  48. Chapter 18.1: An Introduction to Data Visualizations
    1. Abstract
    2. Introduction to Data Visualizations—Overview
    3. Purpose and Context
    4. Visualization—A Science and an Art
    5. Visualization Framework
    6. Step 1: Define
    7. Step 2: Data
    8. Step 3: Design
    9. Step 4: Distribute
    10. Data Visualization Tools and Software
    11. Summary
  49. Glossary
  50. Index

Product information

  • Title: Data Architecture: A Primer for the Data Scientist, 2nd Edition
  • Author(s): W.H. Inmon, Daniel Linstedt, Mary Levins
  • Release date: April 2019
  • Publisher(s): Academic Press
  • ISBN: 9780128169179