Book description
Over the past 5 years, the concept of big data has matured, data science has grown exponentially, and data architecture has become a standard part of organizational decision-making. Throughout all this change, the basic principles that shape the architecture of data have remained the same. There remains a need for people to take a look at the "bigger picture" and to understand where their data fit into the grand scheme of things.
Data Architecture: A Primer for the Data Scientist, Second Edition addresses the larger architectural picture of how big data fits within the existing information infrastructure or data warehousing systems. This is an essential topic not only for data scientists, analysts, and managers but also for researchers and engineers who increasingly need to deal with large and complex sets of data. Until data are gathered and can be placed into an existing framework or architecture, they cannot be used to their full potential. Drawing upon years of practical experience and using numerous examples and case studies from across various industries, the authors seek to explain this larger picture into which big data fits, giving data scientists the necessary context for how pieces of the puzzle should fit together.
- New case studies include expanded coverage of textual management and analytics
- New chapters on visualization and big data
- Discussion of new visualizations of the end-state architecture
Table of contents
- Cover image
- Title page
- Table of Contents
- Copyright
- Dedication
- Chapter 1.1: An Introduction to Data Architecture
- Chapter 1.2: The Data Infrastructure
- Chapter 1.3: The “Great Divide”
- Chapter 1.4: Demographics of Corporate Data
- Chapter 1.5: Corporate Data Analysis
- Chapter 1.6: The Life Cycle of Data: Understanding Data Over Time
- Chapter 1.7: A Brief History of Data
- Chapter 2.1: The End-State Architecture—The “World Map”
  - Abstract
  - Architectural Components
  - Different Kinds of Data in the End State Architecture
  - Shaping the Data Through Models
  - Where Is the Data Warehouse?
  - Where Different Types of Questions Are Answered Across the End State Architecture
  - Data in the Data Lake
  - Metadata in the End State Architecture
  - Networked Metadata
  - An Evolutionary Experience
  - The Data Lake Architecture
- Chapter 3.1: Transformations in the End-State Architecture
- Chapter 4.1: A Brief History of Big Data
- Chapter 4.2: What Is Big Data?
- Chapter 4.3: Parallel Processing
- Chapter 4.4: Unstructured Data
- Chapter 4.5: Contextualizing Repetitive Unstructured Data
- Chapter 4.6: Textual Disambiguation
- Chapter 4.7: Taxonomies
  - Abstract
  - Data Models/Taxonomies
  - Applicability of Taxonomies
  - What Is a Taxonomy?
  - Taxonomies in Multiple Languages
  - Commercial or Private Taxonomies?
  - Dynamics of Taxonomies and Textual Disambiguation
  - Taxonomies and Textual Disambiguation—Separate Technologies
  - Different Types of Taxonomies
  - Taxonomies—Maintenance Over Time
- Chapter 5.1: The Siloed Application Environment
- Chapter 6.1: Introduction to Data Vault 2.0
- Chapter 6.2: Introduction to Data Vault Modeling
  - Abstract
  - What Is a Data Vault Model Concept?
  - Data Vault Model Defined
  - Components of a Data Vault Model
  - What Makes Business Keys So Interesting?
  - What Does This Have to Do With Data Vault and Data Warehousing?
  - How Does This Translate to Data Vault Modeling?
  - Why Restructure the Data From the Staging Area?
  - What Are the Basic Rules of the Data Vault Model?
  - Why Do We Need Many to Many Link Structures?
  - Primary Key Options for Data Vault 2.0
- Chapter 6.3: Introduction to Data Vault Architecture
- Chapter 6.4: Introduction to Data Vault Methodology
  - Abstract
  - Data Vault 2.0 Methodology Overview
  - How Does CMMI Contribute to the Methodology?
  - If CMMI Is So Great, Why Should We Care About Agility Then?
  - Why Include PMP, SDLC If CMMI and Agile Should Be All That's Needed?
  - So Then, What Does Six Sigma Contribute to the Data Vault 2 Methodology?
  - Where Does TQM (Total Quality Management) Fit in to All of This?
- Chapter 6.5: Introduction to Data Vault Implementation
- Chapter 7.1: The Operational Environment: A Short History
- Chapter 7.2: The Standard Work Unit
- Chapter 7.3: Data Modeling for the Structured Environment
- Chapter 8.1: A Brief History of Data Architecture
- Chapter 8.2: Big Data/Existing System Interface
  - Abstract
  - The Big Data/Existing Systems Interface
  - The Repetitive Raw Big Data/Existing Systems Interface
  - Exception Based Data
  - The Nonrepetitive Raw Big Data/Existing Systems Interface
  - Into the Existing Systems Environment
  - The “Context Enriched” Big Data Environment
  - Analyzing Structured Data/Unstructured Data Together
- Chapter 8.3: The Data Warehouse/Operational Environment Interface
- Chapter 8.4: Data Architecture: A High-Level Perspective
- Chapter 9.1: Repetitive Analytics: Some Basics
  - Abstract
  - Different Kinds of Analysis
  - Looking for Patterns
  - Heuristic Processing
  - Freezing Data
  - The Sandbox
  - The “Normal” Profile
  - Distillation, Filtering
  - Subsetting Data
  - Bias of the Sample
  - Filtering Data
  - Repetitive Data and Context
  - Linking Repetitive Records
  - Log Tape Records
  - Analyzing Points of Data
  - Outliers
  - Data Over Time
- Chapter 9.2: Analyzing Repetitive Data
- Chapter 9.3: Repetitive Analysis
- Chapter 10.1: Nonrepetitive Data
  - Abstract
  - Inline Contextualization
  - Taxonomy/Ontology Processing
  - Custom Variables
  - Homographic Resolution
  - Acronym Resolution
  - Negation Analysis
  - Numeric Tagging
  - Date Tagging
  - Date Standardization
  - List Processing
  - Associative Word Processing
  - Stop Word Processing
  - Word Stemming
  - Document Metadata
  - Document Classification
  - Proximity Analysis
  - Functional Sequencing Within Textual ETL
  - Internal Referential Integrity
  - Preprocessing, Postprocessing
- Chapter 10.2: Mapping
- Chapter 10.3: Analytics From Nonrepetitive Data
- Chapter 11.1: Operational Analytics: Response Time
- Chapter 12.1: Operational Analytics
- Chapter 13.1: Personal Analytics
- Chapter 14.1: Data Models Across the End-State Architecture
- Chapter 15.1: The System of Record
  - Abstract
  - The End User Cycle of Awareness
  - The System of Record
  - The System of Record in the End State Architecture
  - The Role of Age in the System of Record
  - A Simple Example
  - The Flow of Data in the System of Record
  - Other Data Than the System of Record
  - Is Data Updated in the System of Record?
  - Detailed and Summary Data in the System of Record
  - Auditing Data and the System of Record
  - Text and the System of Record
- Chapter 16.1: Business Value and the End-State Architecture
- Chapter 17.1: Managing Text
- Chapter 18.1: An Introduction to Data Visualizations
- Glossary
- Index
Product information
- Title: Data Architecture: A Primer for the Data Scientist, 2nd Edition
- Author(s):
- Release date: April 2019
- Publisher(s): Academic Press
- ISBN: 9780128169179