Hadoop Security

Book description

As more corporations turn to Hadoop to store and process their most valuable data, the risk of a breach of those systems grows. This practical book shows Hadoop administrators and security architects not only how to protect Hadoop data from unauthorized access, but also how to limit an attacker’s ability to corrupt or modify data in the event of a security breach.

Authors Ben Spivey and Joey Echeverria provide in-depth information about the security features available in Hadoop, and organize them according to common computer security concepts. You’ll also get real-world examples that demonstrate how you can apply these concepts to your use cases.

  • Understand the challenges of securing distributed systems, particularly Hadoop
  • Use best practices for preparing Hadoop cluster hardware as securely as possible
  • Get an overview of the Kerberos network authentication protocol
  • Delve into authorization and accounting principles as they apply to Hadoop
  • Learn how to use mechanisms to protect data in a Hadoop cluster, both in transit and at rest
  • Integrate Hadoop data ingest into enterprise-wide security architecture
  • Ensure that security architecture reaches all the way to end-user access

Table of contents

  1. Foreword
  2. Preface
    1. Audience
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
    6. Acknowledgments
      1. From Joey
      2. From Ben
      3. From Eddie
    7. Disclaimer
  3. 1. Introduction
    1. Security Overview
      1. Confidentiality
      2. Integrity
      3. Availability
      4. Authentication, Authorization, and Accounting
    2. Hadoop Security: A Brief History
    3. Hadoop Components and Ecosystem
      1. Apache HDFS
      2. Apache YARN
      3. Apache MapReduce
      4. Apache Hive
      5. Cloudera Impala
      6. Apache Sentry (Incubating)
      7. Apache HBase
      8. Apache Accumulo
      9. Apache Solr
      10. Apache Oozie
      11. Apache ZooKeeper
      12. Apache Flume
      13. Apache Sqoop
      14. Cloudera Hue
    4. Summary
  4. I. Security Architecture
  5. 2. Securing Distributed Systems
    1. Threat Categories
      1. Unauthorized Access/Masquerade
      2. Insider Threat
      3. Denial of Service
      4. Threats to Data
    2. Threat and Risk Assessment
      1. User Assessment
      2. Environment Assessment
    3. Vulnerabilities
    4. Defense in Depth
    5. Summary
  6. 3. System Architecture
    1. Operating Environment
    2. Network Security
      1. Network Segmentation
      2. Network Firewalls
      3. Intrusion Detection and Prevention
    3. Hadoop Roles and Separation Strategies
      1. Master Nodes
      2. Worker Nodes
      3. Management Nodes
      4. Edge Nodes
    4. Operating System Security
      1. Remote Access Controls
      2. Host Firewalls
      3. SELinux
    5. Summary
  7. 4. Kerberos
    1. Why Kerberos?
    2. Kerberos Overview
    3. Kerberos Workflow: A Simple Example
    4. Kerberos Trusts
    5. MIT Kerberos
      1. Server Configuration
      2. Client Configuration
    6. Summary
  8. II. Authentication, Authorization, and Accounting
  9. 5. Identity and Authentication
    1. Identity
      1. Mapping Kerberos Principals to Usernames
      2. Hadoop User to Group Mapping
      3. Provisioning of Hadoop Users
    2. Authentication
      1. Kerberos
      2. Username and Password Authentication
      3. Tokens
      4. Impersonation
      5. Configuration
    3. Summary
  10. 6. Authorization
    1. HDFS Authorization
      1. HDFS Extended ACLs
    2. Service-Level Authorization
    3. MapReduce and YARN Authorization
      1. MapReduce (MR1)
      2. YARN (MR2)
    4. ZooKeeper ACLs
    5. Oozie Authorization
    6. HBase and Accumulo Authorization
      1. System, Namespace, and Table-Level Authorization
      2. Column- and Cell-Level Authorization
    7. Summary
  11. 7. Apache Sentry (Incubating)
    1. Sentry Concepts
    2. The Sentry Service
      1. Sentry Service Configuration
    3. Hive Authorization
      1. Hive Sentry Configuration
    4. Impala Authorization
      1. Impala Sentry Configuration
    5. Solr Authorization
      1. Solr Sentry Configuration
    6. Sentry Privilege Models
      1. SQL Privilege Model
      2. Solr Privilege Model
    7. Sentry Policy Administration
      1. SQL Commands
      2. SQL Policy File
      3. Solr Policy File
      4. Policy File Verification and Validation
      5. Migrating From Policy Files
    8. Summary
  12. 8. Accounting
    1. HDFS Audit Logs
    2. MapReduce Audit Logs
    3. YARN Audit Logs
    4. Hive Audit Logs
    5. Cloudera Impala Audit Logs
    6. HBase Audit Logs
    7. Accumulo Audit Logs
    8. Sentry Audit Logs
    9. Log Aggregation
    10. Summary
  13. III. Data Security
  14. 9. Data Protection
    1. Encryption Algorithms
    2. Encrypting Data at Rest
      1. Encryption and Key Management
      2. HDFS Data-at-Rest Encryption
      3. MapReduce2 Intermediate Data Encryption
      4. Impala Disk Spill Encryption
      5. Full Disk Encryption
      6. Filesystem Encryption
      7. Important Data Security Consideration for Hadoop
    3. Encrypting Data in Transit
      1. Transport Layer Security
      2. Hadoop Data-in-Transit Encryption
    4. Data Destruction and Deletion
    5. Summary
  15. 10. Securing Data Ingest
    1. Integrity of Ingested Data
    2. Data Ingest Confidentiality
      1. Flume Encryption
      2. Sqoop Encryption
    3. Ingest Workflows
    4. Enterprise Architecture
    5. Summary
  16. 11. Data Extraction and Client Access Security
    1. Hadoop Command-Line Interface
    2. Securing Applications
    3. HBase
      1. HBase Shell
      2. HBase REST Gateway
      3. HBase Thrift Gateway
    4. Accumulo
      1. Accumulo Shell
      2. Accumulo Proxy Server
    5. Oozie
    6. Sqoop
    7. SQL Access
      1. Impala
      2. Hive
    8. WebHDFS/HttpFS
    9. Summary
  17. 12. Cloudera Hue
    1. Hue HTTPS
    2. Hue Authentication
      1. SPNEGO Backend
      2. SAML Backend
      3. LDAP Backend
    3. Hue Authorization
    4. Hue SSL Client Configurations
    5. Summary
  18. IV. Putting It All Together
  19. 13. Case Studies
    1. Case Study: Hadoop Data Warehouse
      1. Environment Setup
      2. User Experience
      3. Summary
    2. Case Study: Interactive HBase Web Application
      1. Design and Architecture
      2. Security Requirements
      3. Cluster Configuration
      4. Implementation Notes
      5. Summary
  20. Afterword
    1. Unified Authorization
    2. Data Governance
    3. Native Data Protection
    4. Final Thoughts
  21. Index

Product information

  • Title: Hadoop Security
  • Author(s): Ben Spivey, Joey Echeverria
  • Release date: June 2015
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491901342