Appendix G lists and discusses the contents of our cx support header files. The cx.h header includes basic definitions for type and thrust support and is likely to be used by most programs. The other header files provide more specialised support for timers, binary IO and CUDA textures. This appendix is where detailed information on the use of CUDA textures can be found; it is thought that some version of Morton indexing is used for the 2D texture layout, and this is described.
Appendix A provides a history of the evolution of NVIDIA GPUs and CUDA. It explains NVIDIA’s compute capability (CC) scheme for tracking the hardware capabilities for each GPU generation and discusses the evolution of CUDA software over successive releases of the CUDA SDK.
Chapter 2 gives a more formal account of the ideas introduced in Chapter 1. We discuss the requirements for writing CUDA kernel code and explain the syntax in detail. We encourage the reader to start thinking in parallel by introducing some key coding ideas, including methods for summing a large number of values in parallel, the so-called reduction operation. This chapter also introduces GPU shared memory, illustrated with a tiled matrix multiplication example. We demonstrate how the __restrict keyword applied to kernel pointer arguments can speed up your code. In some sense this is our most conventional chapter for a book on CUDA, and the reduction operation is revisited in a number of later chapters to help introduce new CUDA features. However, many of our other examples go well beyond what you can find elsewhere.
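As a flavour of the kernel code developed in this chapter, the sketch below shows a minimal block-level reduction using shared memory and a grid-stride loop; the kernel and buffer names are illustrative rather than taken from the book's examples, and a block size of 256 threads is assumed.

    // minimal sketch of a shared-memory reduction, assuming 256 threads per block
    __global__ void sum_kernel(const float* __restrict data, float* __restrict out, int n)
    {
        __shared__ float partial[256];                 // one slot per thread in the block
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        float sum = 0.0f;                              // grid-stride loop: each thread sums its share
        for (int i = gid; i < n; i += gridDim.x * blockDim.x) sum += data[i];
        partial[tid] = sum;
        __syncthreads();

        // tree reduction within the thread block
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride) partial[tid] += partial[tid + stride];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = partial[0];    // one partial sum per block
    }

A second, very short kernel (or a host loop) then combines the per-block partial sums.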
Appendix D discusses the parallel vector computational abilities of Intel CPUs using versions of the AVX instruction set. The Intel compiler is available as an alternative to the Microsoft Visual C++ compiler used by default in Visual Studio. A combination of the Intel intrinsics library and OpenMP allows us to increase the performance of a simple saxpy calculation on a 4-core Haswell CPU from about 1 GFlop/sec to 70 GFlops/sec. For comparison, the GPU delivers about 8000 GFlops/sec.
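As a rough, hedged illustration of the host-side vectorisation discussed here (not the appendix's actual code), a saxpy loop combining AVX2/FMA intrinsics with OpenMP threading might look like this; n is assumed to be a multiple of 8 for brevity.

    // sketch of saxpy (y = a*x + y) using AVX2 FMA intrinsics plus OpenMP threading
    #include <immintrin.h>

    void saxpy_avx(float a, const float* x, float* y, int n)
    {
        __m256 va = _mm256_set1_ps(a);            // broadcast a to all 8 lanes
    #pragma omp parallel for
        for (int i = 0; i < n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);   // load 8 floats from x
            __m256 vy = _mm256_loadu_ps(y + i);   // load 8 floats from y
            vy = _mm256_fmadd_ps(va, vx, vy);     // fused multiply-add a*x + y
            _mm256_storeu_ps(y + i, vy);
        }
    }

The code must be compiled with AVX2/FMA and OpenMP enabled (for example /arch:AVX2 /openmp with the Microsoft compiler).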
Chapter 5 continues the theme of digital image manipulation and considers transformations such as rotation or scaling, which require pixel interpolation to produce the most accurate final result. The GPU hardware texture units are used for this and their features are discussed. The cx utilities provided with our code include wrappers that significantly simplify the creation of CUDA textures. Curiously, these hardware texture units are rarely discussed in other CUDA tutorial material for scientific applications, but we find they can give a 5-fold performance boost. We show how OpenCV can be used to provide a simple GUI for viewing the transformed images with very little coding effort. We end the chapter with a fully working 3D image registration program using affine transformations applied to volumetric MRI data sets. The 3D affine transformations are about 1500 times faster on the GPU than on the host CPU, and a full registration between two MRI images of size 256 × 256 × 256 takes about one second.
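To give a sense of what the texture hardware provides (the cx wrappers hide these details), a bare-bones image rotation kernel reading a 2D texture object with hardware bilinear interpolation might look like the sketch below; the kernel and variable names are illustrative, and the texture is assumed to have been created with cudaFilterModeLinear.

    // illustrative kernel: rotate an image by sampling a 2D texture with hardware interpolation
    __global__ void rotate_image(cudaTextureObject_t tex, float* out,
                                 int nx, int ny, float angle)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= nx || y >= ny) return;

        // rotate the sampling position about the image centre
        float xc = x - 0.5f * nx, yc = y - 0.5f * ny;
        float xs = xc * cosf(angle) - yc * sinf(angle) + 0.5f * nx;
        float ys = xc * sinf(angle) + yc * cosf(angle) + 0.5f * ny;

        // tex2D performs the bilinear interpolation in dedicated hardware
        out[y * nx + x] = tex2D<float>(tex, xs + 0.5f, ys + 0.5f);
    }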
Chapter 8 demonstrates in detail the simulation of data acquisition and analysis in a large experiment. The case chosen is the simulation of event detection in a clinical PET scanner and the subsequent reconstruction of the activity distribution in the subject. This is directly relevant to people working in medical imaging, but more generally it is an example of how to approach a large simulation. We present examples illustrating random event generation, ray tracing in both simple and more complex geometries, and the subsequent analysis to find the detector response in the form of a system matrix. Reconstruction of simulated patient data is then performed using the MLEM algorithm. Numerous optimisation details are discussed, including the use of polar coordinates to fully exploit the symmetry of the detector system. The calculations involved are substantial and the GPU is very effective, with speed-ups of over 1000 for the simulation, while the MLEM iteration time is reduced to a few seconds. At the end of the chapter, Richardson–Lucy deconvolution of some blurred text is demonstrated as a different application of the MLEM method. The method converges slowly and we find that deblurring continues to improve even after 1 000 000 iterations.
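For readers unfamiliar with MLEM, a single iteration of the update can be sketched as below; this dense-matrix host version is purely illustrative, whereas the chapter's GPU implementation uses a compressed system matrix and dedicated kernels.

    // illustrative host-side MLEM update: A is the system matrix (nlors x nvox),
    // y the measured counts per line of response, lam the current activity estimate
    #include <vector>

    void mlem_iteration(const std::vector<std::vector<float>>& A,
                        const std::vector<float>& y, std::vector<float>& lam)
    {
        size_t nlors = A.size(), nvox = lam.size();
        std::vector<float> fwd(nlors, 0.0f);                  // forward projection A*lam
        for (size_t i = 0; i < nlors; i++)
            for (size_t j = 0; j < nvox; j++) fwd[i] += A[i][j] * lam[j];

        std::vector<float> corr(nvox, 0.0f), norm(nvox, 0.0f);
        for (size_t i = 0; i < nlors; i++)
            for (size_t j = 0; j < nvox; j++) {
                norm[j] += A[i][j];                           // voxel sensitivity
                if (fwd[i] > 0.0f) corr[j] += A[i][j] * y[i] / fwd[i];  // back-project measured/expected ratio
            }
        for (size_t j = 0; j < nvox; j++)
            if (norm[j] > 0.0f) lam[j] *= corr[j] / norm[j];  // multiplicative update
    }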
Chapter 10 describes the various tools available for both profiling kernel performance and debugging code. For profiling we discuss both the older CUDA nvprof command line profiler with its associated NVVP GUI and the newer Nsight Systems and Nsight Compute profilers, which have many options. In particular, Nsight Compute can give details of the performance within an individual kernel, which was not possible before. Our discussion of debugging is based on the tools in Microsoft Visual Studio, both for conventional C++ debugging and for the CUDA plugins that extend it to kernel debugging. The CUDA (Next-Gen) toolset allows line-by-line monitoring of individual threads during kernel execution.
CUDA uses the NVCC compiler to generate GPU code. This appendix discusses some of the important options that users can set to tune the performance of their code.
Chapter 9 discusses how to share a single calculation between multiple GPUs on a workstation. CUDA provides a number of tools both to manage individual devices and to manage memory so that multiple devices can see a common shared memory pool; CUDA unified virtual addressing (UVA) is an example of this. Transfers of data between the host and GPU memory can also be automated or eliminated using unified memory (UM) or zero-copy memory. To scale beyond a single workstation, the well-known message passing interface (MPI) library is often used, and this is described with a simple example.
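A minimal sketch of the multi-GPU idea, splitting work on a managed (unified memory) array between the available devices, is shown below; the trivial kernel and variable names are illustrative only, and n is assumed to divide evenly between the GPUs.

    // sketch: divide work on a managed array between all visible GPUs
    __global__ void scale(float* x, int n, float a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main()
    {
        int ngpu = 0;  cudaGetDeviceCount(&ngpu);
        int n = 1 << 24;
        float* x = nullptr;
        cudaMallocManaged(&x, n * sizeof(float));     // visible to the host and all devices
        for (int i = 0; i < n; i++) x[i] = 1.0f;

        int chunk = n / ngpu;                         // assume n is divisible by ngpu
        for (int d = 0; d < ngpu; d++) {
            cudaSetDevice(d);                         // subsequent calls target device d
            scale<<<(chunk + 255) / 256, 256>>>(x + d * chunk, chunk, 2.0f);
        }
        for (int d = 0; d < ngpu; d++) { cudaSetDevice(d); cudaDeviceSynchronize(); }
        cudaFree(x);
        return 0;
    }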
The key to parallel programming is sharing a task between many cooperating threads running in parallel. A chart is presented showing how, since 2003, the Moore’s law growth in computing performance has depended on parallel computing. This chapter includes a simple introductory CUDA example which performs numerical integration using 1 000 000 000 threads. Using CUDA gives a speed-up of about 1000 compared to a single CPU thread. Key CUDA concepts including thread blocks, thread grids and warps are introduced. The hardware differences between conventional CPU architectures and GPUs are then discussed. Optimisations in memory caching on GPUs are also explained, as memory access time is often a key performance constraint for many programs. The use of OpenMP to share a single task across all cores of a multicore CPU is also discussed.
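To give a concrete flavour of such a kernel (a simplified sketch, not the chapter's actual example), a midpoint-rule integration of sin(x) over [0, π] in which each thread handles one sample point might look like this; for brevity the contributions are combined with atomicAdd rather than the reduction techniques developed in Chapter 2.

    // simplified sketch: one thread per sample point, results combined with atomicAdd
    // (double-precision atomicAdd needs compute capability 6.0 or later)
    #include <cstdio>

    __global__ void integrate_sin(double* result, int steps)
    {
        long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
        if (i >= steps) return;
        double dx = 3.141592653589793 / steps;
        double x  = (i + 0.5) * dx;                   // midpoint rule
        atomicAdd(result, sin(x) * dx);
    }

    int main()
    {
        int steps = 1 << 20;                          // scale towards 10^9 on real hardware
        double* result;  cudaMallocManaged(&result, sizeof(double));
        *result = 0.0;
        integrate_sin<<<(steps + 255) / 256, 256>>>(result, steps);
        cudaDeviceSynchronize();
        printf("integral of sin over [0,pi] = %f (exact value 2)\n", *result);
        cudaFree(result);
        return 0;
    }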
Chapter 7 explores the ability of GPUs to perform multiple tasks simultaneously, including overlapping IO with computation and the simultaneous running of multiple kernels. CUDA streams and events are advanced features that allow users to manage multiple asynchronous tasks running on the GPU. Examples are given and the NVIDIA visual profiler (NVVP) is used to visualise the timeline of tasks in multiple CUDA streams. Asynchronous disk IO on the host PC can also be performed, and examples using the C++ <thread> library are given. Finally, the new CUDA graphs feature is introduced. This provides a wrapper for efficiently launching large numbers of kernel calls in complex workloads.
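A minimal, self-contained sketch of the stream idea is given below; the kernel and buffer names are illustrative and error checking is omitted. Pinned host memory (cudaMallocHost) is needed for the copies to be truly asynchronous.

    // sketch: overlap host<->device copies with kernel execution using two CUDA streams
    __global__ void process(float* x, int n)          // illustrative kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = 2.0f * x[i];
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *host_buf[2], *dev_buf[2];
        cudaStream_t s[2];
        for (int i = 0; i < 2; i++) {
            cudaMallocHost(&host_buf[i], bytes);      // pinned host memory
            cudaMalloc(&dev_buf[i], bytes);
            cudaStreamCreate(&s[i]);
        }
        // queue copy -> kernel -> copy-back in each stream; the two streams can overlap
        for (int i = 0; i < 2; i++) {
            cudaMemcpyAsync(dev_buf[i], host_buf[i], bytes, cudaMemcpyHostToDevice, s[i]);
            process<<<(n + 255) / 256, 256, 0, s[i]>>>(dev_buf[i], n);
            cudaMemcpyAsync(host_buf[i], dev_buf[i], bytes, cudaMemcpyDeviceToHost, s[i]);
        }
        cudaDeviceSynchronize();                      // wait for all queued work
        for (int i = 0; i < 2; i++) {
            cudaStreamDestroy(s[i]); cudaFree(dev_buf[i]); cudaFreeHost(host_buf[i]);
        }
        return 0;
    }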
Appendix B discusses the role of atomic operations in parallel computing and the atomic functions available in CUDA. An example is provided showing the use of atomicCAS to implement another atomic operation.
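The best-known example of this technique (adapted from the CUDA C++ Programming Guide; the appendix's own example may differ) builds a double-precision atomicAdd out of atomicCAS for GPUs that lack a native one:

    // double-precision atomicAdd implemented with atomicCAS
    __device__ double myAtomicAdd(double* address, double val)
    {
        unsigned long long* address_as_ull = (unsigned long long*)address;
        unsigned long long old = *address_as_ull, assumed;
        do {
            assumed = old;
            // try to swap in old+val; retry if another thread changed the value first
            old = atomicCAS(address_as_ull, assumed,
                            __double_as_longlong(val + __longlong_as_double(assumed)));
        } while (assumed != old);
        return __longlong_as_double(old);             // return the previous value, like atomicAdd
    }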
This chapter discusses the tensor core hardware available on newer GPUs. This hardware is designed to perform fast mixed-precision matrix multiplications and is intended for applications in AI. However, CUDA exposes their use to programmers with the warp matrix function library. These functions support tiled matrix multiplication using 16 × 16 tiles. We provide examples of their use to improve on the earlier matrix multiplication example in Chapter 2. We also show how reduction operations can be performed using tensor cores as a potential non-AI application.
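A minimal sketch of the warp matrix (wmma) functions, computing a single 16 × 16 × 16 tile product with one warp, is shown below; the chapter's examples build full tiled multiplications on top of this. The kernel must be compiled for compute capability 7.0 or later and launched with at least one full warp of 32 threads.

    // one warp computes a single 16x16 tile C = A*B using tensor cores
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    __global__ void wmma_tile(const half* a, const half* b, float* c)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);               // clear the accumulator tile
        wmma::load_matrix_sync(a_frag, a, 16);           // leading dimension 16
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A*B on the tensor cores
        wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
    }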
Chapter 6 explains the CUDA random number generators provided by the cuRAND library. The CUDA XORWOW generator was found to be the fastest generator in the cuRAND library. The classic calculation of pi by generating random numbers inside a square is used as a test case for the various possibilities on both the host CPU and the GPU. A kernel using separate generators for each thread is able to generate about 10^12 random numbers per second and is about 20 000 times faster than the simplest host CPU version running on a single core. The inverse transform method for generating random numbers from any distribution is explained. A 3D Ising model calculation is presented as a more interesting application of random numbers. The Ising example has a simple interactive GUI based on OpenCV.
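A rough sketch of the per-thread generator approach using the cuRAND device API is shown below (illustrative names, not the chapter's code): each thread owns a curandState, initialised once with its own subsequence, and then draws many samples for the pi calculation.

    // sketch of per-thread cuRAND generators used for the Monte Carlo pi estimate
    #include <curand_kernel.h>

    __global__ void init_rng(curandState* states, unsigned long long seed)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        curand_init(seed, id, 0, &states[id]);        // seed, subsequence, offset
    }

    __global__ void pi_hits(curandState* states, unsigned int* hits, int samples)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        curandState rs = states[id];                  // copy state into registers
        unsigned int count = 0;
        for (int k = 0; k < samples; k++) {
            float x = curand_uniform(&rs);            // uniform in (0,1]
            float y = curand_uniform(&rs);
            if (x * x + y * y <= 1.0f) count++;       // point inside the quarter circle
        }
        hits[id] = count;                             // pi is estimated as 4*total_hits/total_samples
        states[id] = rs;                              // save state for later reuse
    }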