Intel Xeon Phi Processor High Performance Programming, 2nd Edition

Book description

Intel Xeon Phi Processor High Performance Programming is an all-in-one source of information for programming the second-generation Intel Xeon Phi product family, also called Knights Landing. The authors provide timely Knights Landing-specific details, programming advice, and real-world examples. They distill their years of Xeon Phi programming experience, coupled with insights from many expert customers, Intel Field Engineers, Application Engineers, and Technical Consulting Engineers, to create this authoritative book on the essentials of programming for Intel Xeon Phi products.

Intel® Xeon Phi™ Processor High Performance Programming is useful even before you program a system with an Intel Xeon Phi processor. To help ensure that your applications run at maximum efficiency, the authors emphasize key techniques for programming any modern parallel computing system, whether based on Intel Xeon processors, Intel Xeon Phi processors, or other high-performance microprocessors. Applying these techniques will generally increase your program's performance on any system and better prepare you for Intel Xeon Phi processors.

  • A practical guide to the essentials for programming Intel Xeon Phi processors
  • Definitive coverage of the Knights Landing architecture
  • Presents best practices for portable, high-performance computing and a familiar, proven threads-and-vectors programming model
  • Includes real-world code examples that highlight uses of the unique aspects of this new, highly parallel, high-performance computational product
  • Covers use of MCDRAM, AVX-512, Intel® Omni-Path fabric, many cores (up to 72), and many threads (4 per core)
  • Covers software developer tools, libraries and programming models
  • Covers using Knights Landing as a processor and a coprocessor

Table of contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Acknowledgments
  6. Foreword
    1. Extending the Sports Car Analogy to Higher Performance
    2. What Exactly Is The Unfair Advantage?
    3. Peak Performance Versus Drivable/Usable Performance
    4. How Does The Unfair Advantage Relate to This Book?
    5. Closing Comments
  7. Preface
    1. Sports Car Tutorial: Introduction for Many-Core Is Online
    2. Parallelism Pearls: Inspired by Many Cores
    3. Organization
    4. Structured Parallel Programming
    5. What’s New?
    6. lotsofcores.com
  8. Section I: Knights Landing
    1. Introduction
    2. Chapter 1: Introduction
      1. Abstract
      2. Introduction to Many-Core Programming
      3. Trend: More Parallelism
      4. Why Intel® Xeon Phi™ Processors Are Needed
      5. Processors Versus Coprocessor
      6. Measuring Readiness for Highly Parallel Execution
      7. What About GPUs?
      8. Enjoy the Lack of Porting Needed but Still Tune!
      9. Transformation for Performance
      10. Hyper-Threading Versus Multithreading
      11. Programming Models
      12. Why We Could Skip To Section II Now
      13. For More Information
    3. Chapter 2: Knights Landing overview
      1. Abstract
      2. Overview
      3. Instruction Set
      4. Architecture Overview
      5. Motivation: Our Vision and Purpose
      6. Summary
      7. For More Information
    4. Chapter 3: Programming MCDRAM and Cluster modes
      1. Abstract
      2. Programming for Cluster Modes
      3. Programming for Memory Modes
      4. Query Memory Mode and MCDRAM Available
      5. SNC Performance Implications of Allocation and Threading
      6. How to Not Hard Code the NUMA Node Numbers
      7. Approaches to Determining What to Put in MCDRAM
      8. Why Rebooting Is Required to Change Modes
      9. BIOS
      10. Summary
      11. For More Information
    5. Chapter 4: Knights Landing architecture
      1. Abstract
      2. Tile Architecture
      3. Cluster Modes
      4. Memory Interleaving
      5. Memory Modes
      6. Interactions of Cluster and Memory Modes
      7. Summary
      8. For More Information
    6. Chapter 5: Intel Omni-Path Fabric
      1. Abstract
      2. Overview
      3. Performance and Scalability
      4. Transport Layer APIs
      5. Quality of Service
      6. Virtual Fabrics
      7. Unicast Address Resolution
      8. Multicast Address Resolution
      9. Summary
      10. For More Information
    7. Chapter 6: μarch optimization advice
      1. Abstract
      2. Best Performance From 1, 2, or 4 Threads Per Core, Rarely 3
      3. Memory Subsystem
      4. μarch Nuances (tile)
      5. Direct Mapped MCDRAM Cache
      6. Advice: Use AVX-512
      7. Summary
      8. For More Information
  9. Section II: Parallel Programming
    1. Introduction
    2. Chapter 7: Programming overview for Knights Landing
      1. Abstract
      2. To Refactor, or Not to Refactor, That Is the Question
      3. Evolutionary Optimization of Applications
      4. Revolutionary Optimization of Applications
      5. Know When to Hold’em and When to Fold’em
      6. For More Information
    3. Chapter 8: Tasks and threads
      1. Abstract
      2. OpenMP
      3. Fortran 2008
      4. Intel TBB
      5. hStreams
      6. Summary
      7. For More Information
    4. Chapter 9: Vectorization
      1. Abstract
      2. Why Vectorize?
      3. How to Vectorize
      4. Three Approaches to Achieving Vectorization
      5. Six-Step Vectorization Methodology
      6. Streaming Through Caches: Data Layout, Alignment, Prefetching, and so on
      7. Compiler Tips
      8. Compiler Options
      9. Compiler Directives
      10. Use Array Sections to Encourage Vectorization
      11. Look at What the Compiler Created: Assembly Code Inspection
      12. Numerical Result Variations With Vectorization
      13. Summary
      14. For More Information
    5. Chapter 10: Vectorization advisor
      1. Abstract
      2. Getting Started With Intel Advisor for Knights Landing
      3. Enabling and Improving AVX-512 Code With the Survey Report
      4. Memory Access Pattern Report
      5. AVX-512 Gather/Scatter Profiler
      6. Mask Utilization and FLOPs Profiler
      7. Advisor Roofline Report
      8. Explore AVX-512 Code Characteristics Without AVX-512 Hardware
      9. Example — Analysis of a Computational Chemistry Code
      10. Summary
      11. For More Information
    6. Chapter 11: Vectorization with SDLT
      1. Abstract
      2. What Is SDLT?
      3. Getting Started
      4. SDLT Basics
      5. Example Normalizing 3d Points With SIMD
      6. What Is Wrong With AOS Memory Layout and SIMD?
      7. SIMD Prefers Unit-Stride Memory Accesses
      8. Alpha-Blended Overlay Reference
      9. Alpha-Blended Overlay With SDLT
      10. Additional Features
      11. Summary
      12. For More Information
    7. Chapter 12: Vectorization with AVX-512 intrinsics
      1. Abstract
      2. What Are Intrinsics?
      3. AVX-512 Overview
      4. Migrating From Knights Corner
      5. AVX-512 Detection
      6. Learning AVX-512 Instructions
      7. Learning AVX-512 Intrinsics
      8. Step-by-Step Example Using AVX-512 Intrinsics
      9. Results Using Our Intrinsics Code
      10. For More Information
    8. Chapter 13: Performance libraries
      1. Abstract
      2. Intel Performance Library Overview
      3. Intel Math Kernel Library Overview
      4. Intel Data Analytics Library Overview
      5. Together: MKL and DAAL
      6. Intel Integrated Performance Primitives Library Overview
      7. Intel Performance Libraries and Intel Compilers
      8. Native (Direct) Library Usage
      9. Offloading to Knights Landing While Using a Library
      10. Precision Choices and Variations
      11. Performance Tip for Faster Dynamic Libraries
      12. For More Information
    9. Chapter 14: Profiling and timing
      1. Abstract
      2. Introduction to Knights Landing Tuning
      3. Event-Monitoring Registers
      4. Efficiency Metrics
      5. Potential Performance Issues
      6. Intel VTune Amplifier XE Product
      7. Performance Application Programming Interface
      8. MPI Analysis: ITAC
      9. HPCToolkit
      10. Tuning and Analysis Utilities
      11. Timing
      12. Summary
      13. For More Information
    10. Chapter 15: MPI
      1. Abstract
      2. Internode Parallelism
      3. MPI on Knights Landing
      4. MPI Overview
      5. How to Run MPI Applications
      6. Analyzing MPI Application Runs
      7. Tuning of MPI Applications
      8. Heterogeneous Clusters
      9. Recent Trends in MPI Coding
      10. Putting it All Together
      11. Summary
      12. For More Information
    11. Chapter 16: PGAS programming models
      1. Abstract
      2. To Share or Not to Share
      3. Why use PGAS on Knights Landing?
      4. Programming with PGAS
      5. Performance Evaluation
      6. Beyond PGAS
      7. Summary
      8. For More Information
    12. Chapter 17: Software-defined visualization
      1. Abstract
      2. Motivation for Software-Defined Visualization
      3. Software-Defined Visualization Architecture
      4. OpenSWR: OpenGL Raster-Graphics Software Rendering
      5. Embree: High-performance Ray Tracing Kernel Library
      6. OSPRay: Scalable Ray Tracing Framework
      7. Summary
      8. Image Attributions
      9. For More Information
    13. Chapter 18: Offload to Knights Landing
      1. Abstract
      2. Offload Programming Model—Using With Knights Landing
      3. Processors Versus Coprocessor
      4. Offload Model Considerations
      5. OpenMP Target Directives
      6. Concurrent Host and Target Execution
      7. Offload Over Fabric
      8. Summary
      9. For More Information
    14. Chapter 19: Power analysis
      1. Abstract
      2. Power Demand Gates Exascale
      3. Power 101
      4. Hardware-Based Power Analysis Techniques
      5. Software-Based Knights Landing Power Analyzer
      6. ManyCore Platform Software Package Power Tools
      7. Running Average Power Limit
      8. Performance Profiling on Knights Landing
      9. Intel Remote Management Module
      10. Summary
      11. For More Information
  10. Section III: Pearls
    1. Introduction
    2. Chapter 20: Optimizing classical molecular dynamics in LAMMPS
      1. Abstract
      2. Acknowledgment
      3. Molecular Dynamics
      4. LAMMPS
      5. Knights Landing Processors
      6. LAMMPS Optimizations
      7. Data Alignment
      8. Data Types and Layout
      9. Vectorization
      10. Neighbor List
      11. Long-Range Electrostatics
      12. MPI and OpenMP Parallelization
      13. Performance Results
      14. System, Build, and Run Configurations
      15. Workloads
      16. Organic Photovoltaic Molecules
      17. Hydrocarbon Mixtures
      18. Rhodopsin Protein in Solvated Lipid Bilayer
      19. Coarse Grain Liquid Crystal Simulation
      20. Coarse-Grain Water Simulation
      21. Summary
      22. For More Information
    3. Chapter 21: High performance seismic simulations
      1. Abstract
      2. High-Order Seismic Simulations
      3. Numerical Background
      4. Application Characteristics
      5. Intel Architecture as Compute Engine
      6. Highly-efficient Small Matrix Kernels
      7. Sparse Matrix Kernel Generation and Sparse/Dense Kernel Selection
      8. Dense Matrix Kernel Generation: AVX2
      9. Dense Matrix Kernel Generation: AVX-512
      10. Kernel Performance Benchmarking
      11. Incorporating Knights Landing’s Different Memory Subsystems
      12. Performance Evaluation
      13. Mount Merapi
      14. 1992 Landers
      15. Summary and Take-Aways
      16. For More Information
    4. Chapter 22: Weather research and forecasting (WRF)
      1. Abstract
      2. WRF Overview
      3. WRF Execution Profile: Relatively Flat
      4. History of WRF on Intel Many-Core (Intel Xeon Phi Product Line)
      5. Our Early Experiences With WRF on Knights Landing
      6. Compiling WRF for Intel Xeon and Intel Xeon Phi Systems
      7. WRF CONUS12km Benchmark Performance
      8. MCDRAM Bandwidth
      9. Vectorization: Boost of AVX-512 Over AVX2
      10. Core Scaling
      11. Summary
      12. For More Information
    5. Chapter 23: N-Body simulation
      1. Abstract
      2. Parallel Programming for Noncomputer Scientists
      3. Step-by-Step Improvements
      4. N-Body simulation
      5. optimization
      6. Initial Implementation (Optimization Step 0)
      7. Thread parallelism (optimization step 1)
      8. Scalar Performance Tuning (Optimization Step 2)
      9. Vectorization with SOA (optimization step 3)
      10. Memory traffic (optimization step 4)
      11. Impact of MCDRAM on Performance
      12. Summary
      13. For More Information
    6. Chapter 24: Machine learning
      1. Abstract
      2. Convolutional Neural Networks
      3. OverFeat-FAST Results
      4. For More Information
    7. Chapter 25: Trinity workloads
      1. Abstract
      2. Out of the Box Performance
      3. Optimizing MiniGhost OpenMP Performance
      4. Summary
      5. For More Information
    8. Chapter 26: Quantum chromodynamics
      1. Abstract
      2. LQCD
      3. The QPhiX Library and Code Generator
      4. Wilson-Dslash Operator
      5. Configuring the QPhiX Code Generator
      6. The Experimental Setup
      7. Results
      8. Conclusion
      9. For More Information
  11. Contributors
  12. Glossary
  13. Index

Product information

  • Title: Intel Xeon Phi Processor High Performance Programming, 2nd Edition
  • Author(s): James Jeffers, James Reinders, Avinash Sodani
  • Release date: May 2016
  • Publisher(s): Morgan Kaufmann
  • ISBN: 9780128091951