C++ AMP

Book description

Capitalize on the faster GPU processors in today’s computers with the C++ AMP code library—and bring massive parallelism to your project. With this practical book, experienced C++ developers will learn parallel programming fundamentals with C++ AMP through detailed examples, code snippets, and case studies. Learn the advantages of parallelism and get best practices for harnessing this technology in your applications.

Discover how to:

  • Gain greater code performance using graphics processing units (GPUs)
  • Choose accelerators that enable you to write code for GPUs
  • Apply thread tiles, tile barriers, and tile static memory
  • Debug C++ AMP code with Microsoft Visual Studio®
  • Use profiling tools to track the performance of your code
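As a taste of the programming style the book teaches, here is the canonical C++ AMP element-wise array addition, adapted from Microsoft's introductory documentation. The function name `AddArrays` is illustrative; the code assumes Visual C++ 2012 or later on Windows, since the `restrict(amp)` language extension and `<amp.h>` header are not portable:

```cpp
#include <amp.h>
using namespace concurrency;

// Add two arrays of n ints on the default accelerator (typically the GPU).
void AddArrays(int n, const int* pA, const int* pB, int* pSum) {
    // Wrap the host data so the runtime can copy it to and from the accelerator.
    array_view<const int, 1> a(n, pA);
    array_view<const int, 1> b(n, pB);
    array_view<int, 1> sum(n, pSum);
    sum.discard_data();  // output only; skip the copy to the accelerator

    // One lightweight GPU thread per element.
    parallel_for_each(sum.extent, [=](index<1> idx) restrict(amp) {
        sum[idx] = a[idx] + b[idx];
    });
    sum.synchronize();   // copy results back into pSum
}
```

The `array_view` wrappers, `parallel_for_each` invocation, and `restrict(amp)` marker shown here are the core vocabulary covered in Chapter 3, "C++ AMP Fundamentals."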

Table of contents

  1. Dedication
  2. Foreword
  3. Introduction
    1. Who Should Read This Book
      1. Assumptions
    2. Who Should Not Read This Book
    3. Organization of This Book
    4. Conventions and Features in This Book
    5. System Requirements
    6. Code Samples
      1. Installing the Code Samples
      2. Using the Code Samples
    7. Acknowledgments
    8. Errata & Book Support
    9. We Want to Hear from You
    10. Stay in Touch
  4. 1. Overview and C++ AMP Approach
    1. Why GPGPU? What Is Heterogeneous Computing?
      1. History of Performance Improvements
      2. Heterogeneous Platforms
      3. GPU Architecture
      4. Candidates for Performance Improvement through Parallelism
    2. Technologies for CPU Parallelism
      1. Vectorization
      2. OpenMP
      3. Concurrency Runtime (ConcRT) and Parallel Patterns Library
      4. Task Parallel Library
      5. WARP—Windows Advanced Rasterization Platform
    3. Technologies for GPU Parallelism
    4. Requirements for Successful Parallelism
    5. The C++ AMP Approach
      1. C++ AMP Brings GPGPU (and More) into the Mainstream
      2. C++ AMP Is C++, Not C
      3. C++ AMP Leverages Tools You Know
      4. C++ AMP Is Almost All Library
      5. C++ AMP Makes Portable, Future-Proof Executables
    6. Summary
  5. 2. NBody Case Study
    1. Prerequisites for Running the Example
    2. Running the NBody Sample
    3. Structure of the Example
    4. CPU Calculations
      1. Data Structures
      2. The wWinMain Function
      3. The OnFrameMove Callback
      4. The OnD3D11CreateDevice Callback
      5. The OnGUIEvent Callback
      6. The OnD3D11FrameRender Callback
    5. The CPU NBody Classes
      1. NBodySimpleInteractionEngine
      2. NBodySimpleSingleCore
      3. NBodySimpleMultiCore
      4. NBodySimpleInteractionEngine::BodyBodyInteraction
    6. C++ AMP Calculations
      1. Data Structures
      2. CreateTasks
    7. The C++ AMP NBody Classes
      1. NBodyAmpSimple::Integrate
      2. BodyBodyInteraction
    8. Summary
  6. 3. C++ AMP Fundamentals
    1. array<T, N>
    2. accelerator and accelerator_view
    3. index<N>
    4. extent<N>
    5. array_view<T, N>
    6. parallel_for_each
    7. Functions Marked with restrict(amp)
    8. Copying between CPU and GPU
    9. Math Library Functions
    10. Summary
  7. 4. Tiling
    1. Purpose and Benefit of Tiling
    2. tile_static Memory
    3. tiled_extent
    4. tiled_index<N1, N2, N3>
    5. Modifying a Simple Algorithm into a Tiled One
      1. Using tile_static memory
      2. Tile Barriers and Synchronization
      3. Completing the Modification of Simple into Tiled
    6. Effects of Tile Size
    7. Choosing Tile Size
    8. Summary
  8. 5. Tiled NBody Case Study
    1. How Much Does Tiling Boost Performance for NBody?
    2. Tiling the n-body Algorithm
      1. The NBodyAmpTiled Class
      2. NBodyAmpTiled::Integrate
    3. Using the Concurrency Visualizer
    4. Choosing Tile Size
    5. Summary
  9. 6. Debugging
    1. First Steps
      1. Choosing GPU or CPU Debugging
      2. The Reference Accelerator
    2. GPU Debugging Basics
      1. Familiar Windows and Tips
      2. The Debug Location Toolbar
      3. Detecting Race Conditions
    3. Seeing Threads
      1. Thread Markers
      2. GPU Threads Window
      3. Parallel Stacks Window
      4. Parallel Watch Window
      5. Flagging, Grouping, and Filtering Threads
    4. Taking More Control
      1. Freezing and Thawing Threads
      2. Run Tile to Cursor
    5. Summary
  10. 7. Optimization
    1. An Approach to Performance Optimization
    2. Analyzing Performance
      1. Measuring Kernel Performance
      2. Using the Concurrency Visualizer
      3. Using the Concurrency Visualizer SDK
    3. Optimizing Memory Access Patterns
      1. Aliasing and parallel_for_each Invocations
        1. Performance Impact of Aliasing
      2. Efficient Data Copying to and from the GPU
        1. Removing Unnecessary Copies
        2. Overlapping Asynchronous Copies
        3. Leaving Data on the GPU
        4. Using Staging Arrays
      3. Efficient Accelerator Global Memory Access
      4. Array of Structures vs. Structure of Arrays
      5. Efficient Tile Static Memory Access
      6. Constant Memory
      7. Texture Memory
      8. Occupancy and Registers
    4. Optimizing Computation
      1. Avoiding Divergent Code
      2. Choosing the Appropriate Precision
        1. Precise Math Functions
        2. Fast Math Functions
        3. Precise and Fast Compiler Flags
      3. Costing Mathematical Operations
      4. Loop Unrolling
      5. Barriers
        1. Performance Impact of Barriers and Fences
        2. Using Barriers Correctly
      6. Queuing Modes
    5. Summary
  11. 8. Performance Case Study—Reduction
    1. The Problem
      1. A Small Disclaimer
    2. Case Study Structure
      1. Initializations and Workload
      2. Concurrency Visualizer Markers
      3. TimeFunc()
      4. Overhead
    3. CPU Algorithms
      1. Sequential
      2. Parallel
    4. C++ AMP Algorithms
      1. Simple
      2. Simple with array_view
      3. Simple Optimized
      4. Naïvely Tiled
      5. Tiled with Shared Memory
      6. Minimizing Divergence
      7. Eliminating Bank Conflicts
      8. Reducing Stalled Threads
      9. Loop Unrolling
      10. Cascading Reductions
      11. Cascading Reductions with Loop Unrolling
    5. Summary
  12. 9. Working with Multiple Accelerators
    1. Choosing Accelerators
      1. Enumerating Accelerators
      2. The Default Accelerator
    2. Using More Than One GPU
    3. Swapping Data among Accelerators
    4. Dynamic Load Balancing
    5. Braided Parallelism
    6. Falling Back to the CPU
    7. Summary
  13. 10. Cartoonizer Case Study
    1. Prerequisites
    2. Running the Sample
    3. Structure of the Sample
    4. The Pipeline
      1. Data Structures
      2. The CartoonizerDlg::OnBnClickedButtonStart() Method
      3. The ImagePipeline Class
    5. The Pipeline Cartoonizing Stage
      1. The ImageCartoonizerAgent Class
      2. The IFrameProcessor Implementations
        1. The FrameProcessorCpu and FrameProcessorCpuMulti Classes
        2. The FrameProcessorAmpSingle Class
    6. Using Multiple C++ AMP Accelerators
      1. The FrameProcessorAmpMulti Class
      2. The Forked Pipeline
      3. The ImageCartoonizerAgentParallel Class
    7. Cartoonizer Performance
    8. Summary
  14. 11. Graphics Interop
    1. Fundamentals
      1. norm and unorm
      2. Short Vector Types
        1. Accessing Vector Components
        2. Template Metaprogramming
      3. texture<T, N>
        1. Data Storage
        2. Copying Data to and from Textures
        3. Reading from Textures
        4. Writing to Textures
        5. Read-Write Textures
      4. writeonly_texture_view<T, N>
      5. Textures vs. Arrays
    2. Using Textures and Short Vectors
    3. HLSL Intrinsic Functions
    4. DirectX Interop
      1. Accelerator View and Direct3D Device Interop
      2. Array and Direct3D Buffer Interop
      3. Texture and Direct3D Texture Resource Interop
      4. Using Graphics Interop
    5. Summary
  15. 12. Tips, Tricks, and Best Practices
    1. Dealing with Tile Size Mismatches
      1. Padding Tiles
      2. Truncating Tiles
        1. Handling Truncated Elements with Edge Threads
        2. Handling Truncated Elements with Sections
      3. Comparing Approaches
    2. Initializing Arrays
    3. Function Objects vs. Lambdas
    4. Atomic Operations
    5. Additional C++ AMP Features on Windows 8
    6. Time-Out Detection and Recovery
      1. Avoiding TDRs
      2. Disabling TDR on Windows 8
      3. Detecting and Recovering from a TDR
    7. Double-Precision Support
      1. Limited Double Precision
      2. Full Double Precision
    8. Debugging on Windows 7
      1. Configure the Remote Machine
      2. Configure Your Project
      3. Deploy and Debug Your Project
    9. Additional Debugging Functions
    10. Deployment
      1. Deploying Your Application
      2. Running C++ AMP on Servers
        1. Enumerating C++ AMP-Capable Devices
        2. Running with XPDM Graphics Devices Present
        3. Running without a Connected Display
        4. Running on True Headless Servers
        5. Running as a Service or under Session 0
    11. C++ AMP and Windows 8 Windows Store Apps
    12. Using C++ AMP from Managed Code
      1. From a .NET Application, Windows 8 Windows Store App, or Library
      2. From a C++ CLR Application
      3. From within a C++ CLR Project
    13. Summary
  16. A. Other Resources
    1. More from the Authors
    2. Microsoft Online Resources
    3. Download C++ AMP Guides
    4. Code and Support
    5. Training
  17. Index
  18. About the Authors
  19. Copyright

Product information

  • Title: C++ AMP
  • Author(s): Ade Miller, Kate Gregory
  • Release date: September 2012
  • Publisher(s): Microsoft Press
  • ISBN: 9780735664739