Some highlights of my basic understanding of NPP and CUDA. The description below does not cover the graphics-related aspects of CUDA; there are additional topics on textures, obtaining resources from other 3D APIs, and more.
General notions
Host - CPU, Device - GPU
Thread, Block, Grid, Core
Kernel - a single function that operates on an array of data.
NPP - NVIDIA Performance Primitives (NPP_Library.pdf)
It is a set of C primitives that operate on arrays of data. Typical program flow: allocate memory on the device, copy the input array to the device, call the NPP functions, and copy the result array back from device memory (see the sketch after the list below).
There are two sets of APIs. The first set operates on 1D arrays - signals. The other set operates on 2D arrays - images.
- Signal functions (nppsXXX) : arithmetic, set, shift, logical, reduction.
- Image functions (nppiXXX) : arithmetic, set/copy, stats (min, max, mean,...), histogram, transforms (warp, affine, perspective, domain), color-space conversion, filtering, etc.
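A minimal sketch of that flow using one of the signal-arithmetic primitives (nppsAdd_32f); error checking is omitted and the setup is assumed to be a single CUDA-capable device with the NPP headers and libraries available:

    #include <npps.h>          /* NPP signal functions (nppsXXX)        */
    #include <cuda_runtime.h>  /* cudaMemcpy for host <-> device copies */
    #include <stdio.h>

    int main(void)
    {
        const int n = 1024;
        float hostA[1024], hostB[1024], hostSum[1024];
        for (int i = 0; i < n; ++i) { hostA[i] = (float)i; hostB[i] = 2.0f * i; }

        /* 1. Allocate memory on the device */
        Npp32f *devA   = nppsMalloc_32f(n);
        Npp32f *devB   = nppsMalloc_32f(n);
        Npp32f *devSum = nppsMalloc_32f(n);

        /* 2. Copy the input arrays to the device */
        cudaMemcpy(devA, hostA, n * sizeof(Npp32f), cudaMemcpyHostToDevice);
        cudaMemcpy(devB, hostB, n * sizeof(Npp32f), cudaMemcpyHostToDevice);

        /* 3. Call the NPP primitive: devSum[i] = devA[i] + devB[i] */
        nppsAdd_32f(devA, devB, devSum, n);

        /* 4. Copy the result array back from device memory */
        cudaMemcpy(hostSum, devSum, n * sizeof(Npp32f), cudaMemcpyDeviceToHost);
        printf("hostSum[10] = %f\n", hostSum[10]);

        nppsFree(devA); nppsFree(devB); nppsFree(devSum);
        return 0;
    }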
NVCC - NVIDIA compiler
The detailed compile flow is sophisticated. NVCC splits the input source (.cu) into the portions that run on the host and on the device. It delegates the host code to the compiler responsible for the host application. The device portion of the code is compiled to intermediate code (.ptx) or to architecture-specific code (.cubin), depending on the compiler options. Either way, the compiled device code is embedded in the application binary. At application launch time, PTX code is compiled to an architecture-specific image and downloaded to device memory. PTX is the output for a 'virtual architecture': an abstract version of a device characterized by its compute-capability index (1.0, 1.1, ..., 2.0, ...). The NVIDIA runtime library finds out what the actual device hardware is at execution time and compiles the PTX code accordingly.
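A rough illustration of the two output forms using nvcc's -gencode option (exact flags vary by toolkit version):

    # Embed PTX for the virtual architecture compute_20;
    # it is JIT-compiled for the actual device at launch time.
    nvcc -gencode arch=compute_20,code=compute_20 app.cu -o app

    # Embed an architecture-specific binary (cubin) for the real architecture sm_20;
    # no JIT step is needed on matching hardware.
    nvcc -gencode arch=compute_20,code=sm_20 app.cu -o app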
CUDA - C extensions, Parallelizing Framework (CUDA_C_Programming_Guide.pdf)
It has nothing to do with NPP (?). It primarily lets host applications perform parallel data processing on the GPU cores (SIMT). It defines a set of C extensions (Appendix B) so that the programmer can define how code and data are placed and executed on the device. The framework supplies a Runtime API and a Driver API. The Driver API is a lot like the Runtime API; it allows finer control in some cases, e.g. pushing/popping contexts.
A device context is similar to a CPU process.
Driver API - cuXXXX()
Typical programming flow: initialize the device - create a context - load a module (PTX or architecture-specific binary) - choose a kernel (function) from the current context - execute it.
Each host thread keeps a stack of contexts. The top of the stack is the 'current' context. Creating a context for a device automatically pushes it onto the top of the stack. A context can be popped from the stack; it remains valid, and any thread can pick it up and use it. Omitted from the simplified flow above: the calls that copy data to and from the device around the kernel execution (see the sketch below).
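A minimal sketch of this flow with the Driver API. The module file vecAdd.ptx and the kernel name vecAdd are assumptions for illustration (a kernel taking a device pointer and an element count); every cuXXXX call returns a CUresult that a real program should check:

    #include <cuda.h>   /* Driver API: cuXXXX() */

    int main(void)
    {
        CUdevice    dev;
        CUcontext   ctx;
        CUmodule    mod;
        CUfunction  kernel;
        CUdeviceptr dData;
        float hostData[256] = {0};
        int   n = 256;

        cuInit(0);                                   /* initialize the driver API    */
        cuDeviceGet(&dev, 0);                        /* first CUDA device            */
        cuCtxCreate(&ctx, 0, dev);                   /* create (and push) a context  */
        cuModuleLoad(&mod, "vecAdd.ptx");            /* load PTX or arch-specific binary */
        cuModuleGetFunction(&kernel, mod, "vecAdd"); /* pick a kernel by name        */

        cuMemAlloc(&dData, n * sizeof(float));             /* device memory          */
        cuMemcpyHtoD(dData, hostData, n * sizeof(float));  /* host -> device copy    */

        void *args[] = { &dData, &n };
        cuLaunchKernel(kernel,
                       1, 1, 1,          /* grid dimensions            */
                       256, 1, 1,        /* block dimensions           */
                       0, NULL,          /* shared memory size, stream */
                       args, NULL);      /* kernel arguments           */

        cuMemcpyDtoH(hostData, dData, n * sizeof(float));  /* device -> host copy    */

        cuMemFree(dData);
        cuCtxDestroy(ctx);
        return 0;
    }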
Runtime API - cudaXXXX()
Concept - Single Function Many Data
The function is called a kernel. It is defined by prefixing __global__ to a C function in a CUDA source file (.cu). These functions can only be called from code in CUDA source files. Typically a kernel computes one element of the output array.
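For example, a kernel where each thread computes one element of the output array (a sketch; the names are mine):

    /* Each thread computes one element: out[i] = a[i] + b[i] */
    __global__ void vecAdd(const float *a, const float *b, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global element index        */
        if (i < n)                                      /* the last block may be partial */
            out[i] = a[i] + b[i];
    }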
Each core runs a block of threads by time-slicing (?). A GPU has N cores; a GeForce 310M has 16.
The workload of processing an array is spread across the available cores by grouping threads into blocks. The NVIDIA runtime decides how these blocks are scheduled onto the available cores. CUDA 4 supports up to 3 array dimensions (x, y, z); see the 2D sketch below.
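A sketch of how a 2D workload might be split into blocks (names are illustrative; one thread per pixel, grouped into 16x16 blocks):

    __global__ void scale2D(float *img, int width, int height, float s)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   /* column */
        int y = blockIdx.y * blockDim.y + threadIdx.y;   /* row    */
        if (x < width && y < height)
            img[y * width + x] *= s;
    }

    /* Host side: round the grid up so partial blocks cover the edges */
    void launchScale2D(float *devImg, int width, int height)
    {
        dim3 block(16, 16);
        dim3 grid((width  + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        scale2D<<<grid, block>>>(devImg, width, height, 2.0f);
    }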
Program Flow
The typical program flow is similar to the one above: allocate host and/or device memory - copy/map data to the device - launch the kernel with the locations of the (input, output) arrays, the number of blocks, and the block size - copy the result data back to the host.
Error handling needs special care because of the asynchronous nature of kernel launches. There are 2 types of error checking: at-entry (parameter checking) and at-finish (when the kernel function returns).
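A minimal sketch of this flow with the Runtime API, reusing the vecAdd kernel from above (error checking omitted here; see the next paragraph):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void vecAdd(const float *a, const float *b, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 4096;
        size_t bytes = n * sizeof(float);
        float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { hA[i] = (float)i; hB[i] = 2.0f * i; }

        /* Allocate device memory */
        float *dA, *dB, *dC;
        cudaMalloc((void**)&dA, bytes);
        cudaMalloc((void**)&dB, bytes);
        cudaMalloc((void**)&dC, bytes);

        /* Copy the input data to the device */
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        /* Launch: enough blocks of 256 threads to cover n elements */
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

        /* Copy the result back to the host (waits for the kernel to finish) */
        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        printf("hC[100] = %f\n", hC[100]);

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA); free(hB); free(hC);
        return 0;
    }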
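Continuing the Runtime API sketch above, the two checks around a kernel launch could look like this:

    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    /* At-entry check: did the launch itself fail (bad configuration, etc.)? */
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    /* At-finish check: wait for the kernel and report errors raised while it ran */
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("execution error: %s\n", cudaGetErrorString(err));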
Syntax
__global__ defines a function as a kernel.
__device__ and __constant__ place a variable in device global memory and constant memory, respectively.
__shared__ places a variable in per-block shared memory.
Device code only supports a subset of C++; see Appendix B of CUDA_C_Programming_Guide.pdf.
A pair of triple-angle-bracket operators <<< >>> specifies the thread-block configuration and launches the kernel with its data arguments. The launch is ASYNCHRONOUS! (See the sketch below.)
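A sketch that puts these qualifiers together: a per-block sum using a __constant__ scale factor, a __device__ global output array, and a __shared__ scratch buffer (names are mine; it assumes at most 64 blocks):

    #include <cuda_runtime.h>

    __constant__ float scale;            /* constant memory, filled from the host   */
    __device__   float partialSums[64];  /* device global memory                    */

    __global__ void blockSum(const float *in, int n)
    {
        __shared__ float cache[256];     /* shared among the threads of one block   */
        int i   = blockIdx.x * blockDim.x + threadIdx.x;
        int tid = threadIdx.x;

        cache[tid] = (i < n) ? in[i] * scale : 0.0f;
        __syncthreads();                 /* wait for the whole block                */

        /* Tree reduction within the block */
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                cache[tid] += cache[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            partialSums[blockIdx.x] = cache[0];
    }

    /* Host side: set the constant, then launch; <<< >>> returns immediately */
    void launchBlockSum(const float *devIn, int n)
    {
        float s = 0.5f;
        cudaMemcpyToSymbol(scale, &s, sizeof(float));   /* fill __constant__ memory */
        blockSum<<<(n + 255) / 256, 256>>>(devIn, n);
    }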
Other aspects
- Device Memory Hierarchy - Global, Block(shared), Thread
- Efficient use of memory - Access to global memory is much slower than access to shared memory. Copy data that is reused within a block into __shared__ variables and work on it there.
- Concurrency - Streams (a sequence of asynchronous commands). Data copies can be made asynchronous with the 'Async' variants of the copy functions. Concurrent data transfer and concurrent kernel execution depend on the GPU's capability.
- Synchronization - Events, explicit CUDA API calls, implicit synchronization (device-to-host data copies, and others). See the sketch below.
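A sketch of queueing asynchronous copies and a kernel on one stream, then synchronizing on an event (asynchronous copies require page-locked host memory; how much actually overlaps depends on the GPU's capability):

    #include <cuda_runtime.h>

    /* A trivial kernel, just to have device work to overlap with the copies */
    __global__ void inc(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h;                                 /* page-locked (pinned) host memory */
        cudaMallocHost((void**)&h, bytes);
        float *d;
        cudaMalloc((void**)&d, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaEvent_t done;
        cudaEventCreate(&done);

        /* All commands are queued on the stream and return immediately */
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
        inc<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(done, stream);            /* marker after the last command    */

        cudaEventSynchronize(done);               /* explicit synchronization point   */

        cudaEventDestroy(done);
        cudaStreamDestroy(stream);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }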