Friday, August 26, 2011

Convert RAW YUV File to AVI with AVISynth and VirtualDubMod

Here is a straightforward way of placing a raw video file in YUV420 format into an AVI container.
  • Download and install AviSynth 2.5.8.
  • Download the RawSource AviSynth plug-in and place it in the AviSynth plugins directory.
  • Download and install VirtualDubMod.
  • Now prepare a simple AviSynth script file called importYUV.avs with the following lines:
file_1 = rawsource("overbridge.yuv", 1920, 1088, "I420")
return file_1
  • Run VirtualDubMod.
  • Open importYUV.avs as a video file.
    The first frame of the video should now appear in the VirtualDubMod window.
  • Choose File->Save As to save the imported YUV file as AVI.
    • Choose 'Direct stream copy' as the "Video mode".
    • Press the 'Save' button.
  • Ta-da!
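
A raw YUV file has no header, which is why the script has to spell out the width, height and pixel format. A quick sanity check before importing is that the file size is a whole multiple of the frame size. Below is a minimal sketch of that arithmetic for I420, assuming even frame dimensions; the function names are my own and not part of AviSynth or RawSource.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one I420 (planar YUV 4:2:0) frame: a full-resolution Y plane
// followed by U and V planes, each subsampled by 2 in both directions.
std::uint64_t i420FrameBytes(std::uint32_t width, std::uint32_t height) {
    assert(width % 2 == 0 && height % 2 == 0);  // this sketch assumes even dimensions
    const std::uint64_t luma = static_cast<std::uint64_t>(width) * height;
    const std::uint64_t chromaPerPlane = luma / 4;
    return luma + 2 * chromaPerPlane;
}

// Number of whole frames in a headerless file of the given size.
std::uint64_t i420FrameCount(std::uint64_t fileBytes,
                             std::uint32_t width, std::uint32_t height) {
    return fileBytes / i420FrameBytes(width, height);
}
```

For the 1920x1088 clip above, i420FrameBytes(1920, 1088) gives 3133440 bytes per frame. If the file size is not a multiple of that, the dimensions or the format passed to rawsource are probably wrong.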

Tuesday, August 2, 2011


Some highlights of my primitive understanding of NPP and CUDA. The description below does not cover the graphics aspects of CUDA, such as textures and getting resources from other 3D APIs.

General notions
Host - CPU, Device - GPU
Thread, Block, Grid, Core
Kernel - a single function to operate on data-array.

NPP - NVIDIA Performance Primitives (NPP_Library.pdf)
It is a set of C primitives that operate on arrays of data. Typical program flow: allocate memory on the device, copy the input array to the device, call the NPP functions, and copy the result array back from device memory.
There are 2 sets of APIs. The first set operates on 1D arrays - Signals. The other set operates on 2D arrays - Images.

  • Signal functions (nppsXXX) : arithmetic, set, shift, logical, reduction.
  • Image functions (nppiXXX) : arithmetic, set/copy, stats (min, max, mean,...), histogram, transforms (warp, affine, perspective, domain), color-space conversion, filtering, etc.
NVCC - NVIDIA compiler
The detailed compile flow is sophisticated. NVCC separates the input source (.cu) into code that runs on the host and code that runs on the device. It delegates the host code to the compiler that is responsible for the host application. The device portion of the code is compiled to intermediate code (.ptx) or architecture-specific code (.cubin), depending on the compiler options. Either way, the compiled device code is embedded in the application binary. At application launch time, any PTX code is compiled to an architecture-specific image and loaded into device memory. PTX is the output for a 'virtual architecture': an abstract version of a device that is characterized by its compute-capability index (1.0, 1.1, .., 2.0,...). The NVIDIA runtime finds out what the actual device hardware is at execution time and compiles the PTX code accordingly.

CUDA - C extensions, Parallelizing Framework ( CUDA_C_Programming_Guide.pdf )
It is a separate layer from NPP (?); NPP is a library that runs on top of it. It primarily lets host applications perform parallel data processing using GPU cores (SIMT). It defines a set of C extensions (Appendix B) so that the programmer can define how code and data are placed and executed on the device. The framework supplies a Runtime API and a Driver API. The Driver API is a lot like the Runtime API, but it allows finer control in some cases, e.g. pushing / popping contexts.
A device context is similar to a CPU process.
Driver API - cuXXXX()
Typical programming flow: Initialize device - Create contexts - Load module ( PTX or arch-specific-binary) - Choose Kernel (function) from current context - Execute it.
Each host thread keeps a stack of contexts. The top of the stack is the 'current' context. Creating a context for a device automatically pushes it on top of the stack. A context can be popped from the stack and it remains valid: any thread can pick it up and make it current. Omitted from the simplified flow above: calls to copy data to and from the device around the kernel execution.
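
To make the stack behaviour concrete, here is a toy model in plain C++. It is not the Driver API: the real calls are cuCtxCreate, cuCtxPushCurrent and cuCtxPopCurrent, and the Context and ContextStack types below are hypothetical stand-ins I made up for illustration. One ContextStack object plays the role of one host thread's stack.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device context.
struct Context {
    int device;
};
using ContextPtr = std::shared_ptr<Context>;

// Toy model of the per-host-thread context stack.
class ContextStack {
public:
    // Creating a context also pushes it, so it becomes current.
    ContextPtr create(int device) {
        ContextPtr ctx = std::make_shared<Context>(Context{device});
        stack_.push_back(ctx);
        return ctx;
    }
    void push(ContextPtr ctx) { stack_.push_back(std::move(ctx)); }
    // Popping detaches the context from this stack; the context stays valid.
    ContextPtr pop() {
        assert(!stack_.empty());
        ContextPtr ctx = stack_.back();
        stack_.pop_back();
        return ctx;
    }
    // The top of the stack is the current context (nullptr if none).
    ContextPtr current() const {
        return stack_.empty() ? nullptr : stack_.back();
    }

private:
    std::vector<ContextPtr> stack_;
};
```

With two ContextStack objects standing in for two host threads, a context created on the first can be popped there and pushed on the second, where it becomes current.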
Runtime API - cudaXXXX()
Concept - Single Function Many Data
A Function is called Kernel. It is defined by prefixing __global__ to a C function in CUDA source (.cu). These functions could only be called by functions defined in CUDA source. Typically a Kernel calculates one element of output array.
Each core runs a block of threads by time-slicing(?). A GPU has N cores. GeForce 310M has 16.
The workload of processing an array is spread across the available cores by grouping threads into blocks. The NVIDIA runtime decides how these blocks are scheduled onto the available cores. CUDA 4 supports up to 3 dimensions (x, y, z) for indexing threads and blocks.
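
The "one thread computes one output element" idea and the block grouping come down to simple index arithmetic. The sketch below emulates a 1D launch sequentially on the CPU in plain C++; it is not CUDA code, and blocksFor, addOneElement and emulateLaunch are names I made up. In a real kernel the two loop counters would correspond to blockIdx.x and threadIdx.x, and the block size to blockDim.x.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed so that blocks * blockSize >= n (ceiling division).
std::size_t blocksFor(std::size_t n, std::size_t blockSize) {
    assert(blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// CPU stand-in for a kernel body: computes exactly one output element.
void addOneElement(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& out, std::size_t i) {
    out[i] = a[i] + b[i];
}

// Sequential emulation of a 1D launch over all blocks and threads.
std::vector<float> emulateLaunch(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 std::size_t blockSize) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<float> out(n, 0.0f);
    const std::size_t numBlocks = blocksFor(n, blockSize);
    for (std::size_t block = 0; block < numBlocks; ++block) {
        for (std::size_t thread = 0; thread < blockSize; ++thread) {
            const std::size_t i = block * blockSize + thread;  // global element index
            if (i < n) {  // the last block may be only partially used
                addOneElement(a, b, out, i);
            }
        }
    }
    return out;
}
```

The bounds check matters whenever the array length is not a multiple of the block size: 1000 elements with 256 threads per block need 4 blocks, and the last block has 24 idle threads.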
Program Flow
Typical program flow is similar to aforementioned - Allocate host and/or device memory - Copy/Map data to device - Launch the kernel with the (in, out) array data locations, and number of blocks and block-size - Copy back the result data to host.
Error handling needs special care because of the asynchronous nature of kernel launches. There are 2 types of error checking: at-entry (parameter checking) and at-finish (when the kernel function returns).
__global__ defines the function as Kernel.
__device__, __constant__ define the variable in the device global and constant memory areas.
__shared__ defines the variable to be placed in thread-block memory.
Device code supports only a subset of C++. See Appendix B of CUDA_C_Programming_Guide.pdf.
A pair of triple angle brackets <<< >>> specifies the grid and thread-block configuration and launches the kernel; the data locations are passed as ordinary function arguments. The launch is ASYNCHRONOUS!
Other aspects
  • Device Memory Hierarchy - Global, Block(shared), Thread
  • Efficient use of memory - Accessing global memory is expensive. Copy data that is reused within a block into __shared__ memory.
  • Concurrency - Streams ( a sequence of asynchronous commands ). Data copy can be made asynchronous with the 'async' variants of the data copy functions. Concurrent data transfer and concurrent kernel execution depend on GPU capability.
  • Synchronization - Events, Explicit CUDA API, Implicit (device data copy to host, and others).

Monday, August 1, 2011

OpenCV 2.3.0 GPU speed-up with CUDA 4

Now it's time to build OpenCV 2.3.0 with GPU enabled.

Configuration and Build -
Follow the steps described in the OPENCV_GPU page for Visual Studio 64-bit build.
module-gpu build error: Configuration(null)
Cause - vcvars64.bat is missing from the Windows SDK amd64 directory. Solution - create it by following the simple instructions here
I was surprised at first because I was already able to build 64-bit OpenCV. I suspect it has to do with the NVIDIA compiler (nvcc). It probably opens a Windows shell to do the compilation, and that shell would not have the 64-bit environment set up without this vcvars64.bat.

Test GPU build by running module-gpu-test suite from VS 2010 Express
See "Implementing tests" section of
Setting up Test Data
Test-Data is required by the gpu-test-suite (and others too). Download a snapshot of the opencv-extra package that is tagged for OpenCV 2.3.0 release from WillowGarage. There is a "Download Zip" link in the source browsing page that makes it convenient.
Set the environment variable OPENCV_TEST_DATA_PATH to point to the testdata directory.
Run the project module-gpu-test
This resulted in 3 types of failures
  1. My NVIDIA hardware has a compute capability of 1.2. 1 test case requires 1.3.
  2. Crash in meanShift and meanShiftProc. The stack trace shows that it dies at the point where GpuMat variable is being released.
  3. Assertion error in NVidia.TestHaarCascadeAppl. (Didn't investigate further).
The other tests run OK.

Learned to use the gtest_ command-line arguments (each is passed with a leading --) - see code comments above ParseGoogleTestFlagsOnlyImpl()
  • gtest_list_tests : lists the tests selected to run and quits
  • gtest_filter= : select the tests to run / or not to run by matching a specified pattern against test name. Pattern for negative matching begins with minus sign.
  • gtest_output=xml[: directory name / file-name ] : output a summary of tests results in XML. Details see ts_gtest.cpp (search for GTEST_DEFINE_string_)
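
The filter syntax is easier to remember with a small model of the matching rules: patterns are separated by ':', '*' matches any string, '?' matches any single character, and everything after the first '-' is a list of negative patterns. The code below is my own simplified re-implementation of those rules, not the code from ts_gtest.cpp; wildcardMatch, matchesAny and selectedByFilter are made-up names.

```cpp
#include <cstddef>
#include <string>

// Wildcard match: '*' matches any sequence, '?' matches any single character.
bool wildcardMatch(const char* pattern, const char* text) {
    if (*pattern == '\0') return *text == '\0';
    if (*pattern == '*') {
        // Either '*' matches nothing, or it swallows one more character.
        return wildcardMatch(pattern + 1, text) ||
               (*text != '\0' && wildcardMatch(pattern, text + 1));
    }
    return *text != '\0' && (*pattern == '?' || *pattern == *text) &&
           wildcardMatch(pattern + 1, text + 1);
}

// True if the name matches any pattern in a ':'-separated list.
bool matchesAny(const std::string& patterns, const std::string& name) {
    std::size_t start = 0;
    while (start <= patterns.size()) {
        std::size_t end = patterns.find(':', start);
        if (end == std::string::npos) end = patterns.size();
        const std::string pattern = patterns.substr(start, end - start);
        if (wildcardMatch(pattern.c_str(), name.c_str())) return true;
        start = end + 1;
    }
    return false;
}

// A test runs if it matches a positive pattern and no negative pattern.
bool selectedByFilter(const std::string& filter, const std::string& name) {
    const std::size_t dash = filter.find('-');
    std::string positive = filter.substr(0, dash);
    const std::string negative =
        (dash == std::string::npos) ? std::string() : filter.substr(dash + 1);
    if (positive.empty()) positive = "*";  // "-X" alone means "everything except X"
    const bool excluded = !negative.empty() && matchesAny(negative, name);
    return matchesAny(positive, name) && !excluded;
}
```

Under this model a filter of -NVidia.* skips NVidia.TestHaarCascadeAppl and keeps everything else, which is a handy way to leave a known failure out of a run.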
OpenCV GPU module
The library implements accelerated versions of other areas of OpenCV - image processing, image filtering, matrix calculations, features-2D, object detection, and camera calibration. The API and data structures are defined in the nested namespace cv::gpu::. The acceleration makes use of both the NPP API and CUDA parallelization.

Ran a few OpenCV GPU samples that could readily be compared with non-GPU ones
  • surf_keypoint_matcher vs matcher_simple: speed up from 46 secs to 6 secs with the graffiti image from VGG set.
  • morfology vs morphology2 : not very obvious in my quick test. Still noticeable when changing the element shape with Open/Close set at 17 iterations.
  • hog_gpu vs peopledetect : speed up from 67 to 17 secs with my 5M-pixel test image.
  • cascadeclassifier_nvidia_api vs cascadeclassifier (GPU) vs facedetect (no-nested-cascade) : overall (secs): 5.1 / 4.8 / 4.5; detection-only (secs): 1 / 1 / 3.1
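
To put the timings above on a common scale, the speed-up factor is just the non-GPU time divided by the GPU time. A minimal sketch, with speedup being a helper name of my own:

```cpp
#include <cassert>

// Speed-up factor: how many times faster the GPU run is than the CPU run.
double speedup(double cpuSeconds, double gpuSeconds) {
    assert(gpuSeconds > 0.0);
    return cpuSeconds / gpuSeconds;
}
```

With the numbers above, that is roughly 7.7x for SURF matching (46 / 6), about 3.9x for HOG people detection (67 / 17) and 3.1x for the detection-only part of face detection (3.1 / 1). The overall face detection times (4.5 vs 4.8 or 5.1 secs) give a factor just below 1, so the GPU versions do not win end to end there.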