Saturday, July 30, 2011

CUDA 4 installation

OpenCV requires the CUDA 4 Toolkit. Moreover, CUDA 4 supports VS 2010, while CUDA 3.2 only supports up to VS 2008.

Installation - README_SDK_Release_Notes.txt
Requires NVIDIA driver version 270 or later. Upgraded the driver for my GeForce 310M.
Installed the 64-bit CUDA Toolkit 4.0.
Installed the GPU Computing SDK.
Also downloaded the BuildCustomization fix just in case.

Verify the installation - CUDA_C_Getting_Started_Windows.pdf
Simply follow the Getting Started guide.
Some hiccups to resolve -
  1. Error building 64-bit 'cutil' - Change the Platform Toolset configuration to Windows SDK 7.1 so that the $(WindowsSdkDir) macro points to the 64-bit SDK instead of the 32-bit one. http://stackoverflow.com/questions/3599079/windowssdkdir-is-not-set-correctly-in-visual-studio-2010
  2. Error building shrUtils - the source and include files are misplaced. Copied the required sources, as listed in the VC++ project file, from "C/Common/src". Added "C/Common/inc" to the shrUtils "Additional Include Directories" to pick up the misplaced headers. http://forums.nvidia.com/index.php?showtopic=197097
Now able to build and run the tests suggested in the Getting Started guide.

Applied the BuildCustomization Fix
Not essential for the steps above, but I definitely came across the problem in later experiments, so the fix is still needed for this toolkit. The problem is described in the README file that comes with the patch. Quite straightforward, actually.

Friday, July 29, 2011

Install OpenCV 2.3.0 for CUDA

Decided to try out what OpenCV + CUDA is like. Prefer to start with 2.3.0 - quite a few fixes since 2.2.0 in this area.

Simple Build and Run of OpenCV 2.3.0 Release
Download OpenCV-2.3.0-win-src package.
Typical CMake configuration for VS2010 Express, with C samples.
Build 32-bit - run video-starter to check video-capture is working - YES.
Build 64-bit - run video-starter to see video-capture is working - YES.

Build API documentation to understand GPU module
Downloaded and installed MiKTeX 2.9 Portable to C:\Program Files (x86).
Configure CMake: BUILD_DOCS=yes and set MIKTEX_BINARY_PATH = < miktex/bin directory >
CMake needs the Sphinx Python module (sphinx-build.exe) to satisfy its Build Documentation configuration:
  • Install Python 2.6.5 (Win32)
  • Install setuptools
  • Download sphinx python egg
  • Run easy_install (from setuptools) on the egg.
  • Run the CMake configuration again and specify the exact path to sphinx-build. The entry appears now that the Python installation is detected.
Open OpenCV.sln with VS2010 Express - the ALL_BUILD project currently excludes the 'docs' and 'html_docs' projects.
Compile 'docs' first - error at the end(?) saying "pdflatex.exe: Access is denied". It seems pdflatex is trying to write some file inside the MiKTeX program directory. Next time I should move the MiKTeX Portable directory to C:\ProgramData instead.
Build 'html_docs' - finished OK; the HTML API docs now appear in the build directory.

Also noticed:
a WITH_OPENNI option in CMake - and there is a kinect_sample.

Thursday, July 28, 2011

Speed up with Intel Integrated Performance Primitives (IPP)

Current OpenCV support
OpenCV 2.3 : IPP 7.0
OpenCV 2.2 : IPP 5 - 6.1
The directory structure of IPP 7.0 and IPP 6.1 is different. And the OpenCV 2.2 CMakefile only checks for IPP library versions up to 6.1.

Licensing
IPP has both commercial and non-commercial licenses. A single-user commercial license costs $199, plus $80 for annual renewal. The non-commercial version only supports Linux.

Installation
Tried the 30-day evaluation of non-commercial version. Surprised it requires 2GB disk space for installation. The package itself is about 237MB.

What is IPP?
Relationship with OpenCV according to Intel - outdated with respect to current OpenCV code.
http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-open-source-computer-vision-library-opencv-faq
IPP uses OpenMP for parallelization
http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-intel-ipp-threaded-functions/
Did not spend time going into details. My impression is that IPP provides counterparts for a lot of OpenCV routines in image processing, camera calibration and optical flow, and quite a few of them map 1-to-1 to the corresponding IPP function. Surprised to see so few areas actually use IPP now. Based on what I read in the discussion forum, IPP sped up OpenCV 1.x a lot, before OpenCV had its own SSE acceleration.
SSE is instruction-set-level hardware acceleration. IPP is a library that implements algorithms that take advantage of SSE.

How is it used
Did a quick search for 'HAVE_IPP' in the OpenCV 2.2 source code. These are the areas where IPP currently appears to be relevant: dxt.cpp (e.g. DFT), haar.cpp, hog.cpp, LK-optical-flow, HS-optical-flow.
Something I noticed is how the IPP code is selected at run time. There are a few linking methods to choose from at build time. I think OpenCV 2.2 uses what is called Static Linking with Dispatching - ippStaticInit(). At build time OpenCV links in a static library that provides a 'jump table' to the processor-specific implementations. The IPP routines are optimized differently across processors based on their capabilities, and ippStaticInit() chooses the suitable version based on the processor it is running on. See the IPP User Guide for details.
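To make the 'jump table' idea concrete, here is a minimal sketch of init-time dispatch. Everything in it is hypothetical (CpuLevel, initDispatch(), the two sum routines); it does not use IPP and is not how IPP is implemented internally, it only illustrates picking an implementation once and calling through a function pointer afterwards.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical capability levels; a real library would query the CPU itself.
enum CpuLevel { CPU_GENERIC = 0, CPU_SSE2 = 1, CPU_LEVEL_COUNT = 2 };

// Plain reference implementation.
inline float sumGeneric(const float* data, std::size_t n)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}

// Stand-in for a processor-tuned variant (two accumulators, no real SSE here).
inline float sumUnrolled(const float* data, std::size_t n)
{
    float a = 0.0f, b = 0.0f;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a += data[i];
        b += data[i + 1];
    }
    if (i < n)
        a += data[i];  // odd element left over
    return a + b;
}

typedef float (*SumFn)(const float*, std::size_t);

// The jump table: one entry per capability level.
static const SumFn kSumTable[CPU_LEVEL_COUNT] = { sumGeneric, sumUnrolled };

// Current target; stays on the generic version until initDispatch() is called.
static SumFn g_sum = sumGeneric;

// Plays the role of the one-time init call: select the implementation.
inline void initDispatch(CpuLevel level)
{
    assert(level >= CPU_GENERIC && level < CPU_LEVEL_COUNT);
    g_sum = kSumTable[level];
}

// What callers use; they never see which variant runs.
inline float dispatchedSum(const float* data, std::size_t n)
{
    return g_sum(data, n);
}
```

Callers would invoke initDispatch() once at start-up and then only ever call dispatchedSum(), which is roughly the shape of calling ippStaticInit() before any other IPP function.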

CMakeFile Configuration
Downloaded and installed IPP 6.1 Build 6 on Linux.
Point IPP_PATH CMake variable to the IPP lib directory, not IPP root directory.

Comparison
dft sample - the time taken by the dft() function dropped from 7XX ms to 3XX ms for a 5M-pixel image file.
facedetect sample - 5% speed-up of the face-detection function using haarcascade_frontalface_alt.xml. Some classifiers do not trigger the IPP-enhanced code. See haar.cpp for the conditions required.

Wednesday, July 27, 2011

Parallelizing Loops with Intel Threading Building Blocks

According to OpenCV Release Notes, the code will use Intel TBB (2.2+) instead of OpenMP.

Overview of TBB.

It is basically a library of C++ templates for concurrency. It covers concurrent containers (random-access and even sequential ones) and parallel iteration. It also provides a concurrent memory allocator (reducing cache-line collisions). Moreover, it has a scheduler for lightweight tasks instead of the 'bulkier' threads.

Paraphrasing the TBB FAQ on choosing between the two: use OpenMP for C programs and "large, predictable data parallel problems"; use TBB for C++ and "less-structured and consistent parallelism". And of course OpenMP works at the compiler level, meaning it relies on compiler support.

Licensing

Intel uses dual licensing for the TBB library: a commercial license with support and other goodies, and a no-frills open-source license. The commercial license costs $299 for the first year and $120 for renewal. The open-source license is GPLv2 with a C++ Runtime Exception clause. I think it means that instantiating TBB templates does not require you to open-source your program, unless you modify TBB itself.

Use in OpenCV 2.2 - (2.3 should have more in ML module)

internal.hpp: Implements a tiny subset of TBB in serial fashion when HAVE_TBB is undefined: the inline function templates cv::parallel_for(), cv::parallel_do() and cv::parallel_reduce(), plus ConcurrentVector, Split and BlockedRange.
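A minimal sketch of what such a serial fallback can look like. This is not copied from internal.hpp; the 'sketch' namespace and the DoubleValues body are made up for illustration. The point is that a body written against the range interface runs unchanged, just on one thread.

```cpp
#include <cassert>
#include <vector>

namespace sketch {  // hypothetical stand-ins, not the OpenCV definitions

// Half-open index range [begin, end), in the role of tbb::blocked_range.
class BlockedRange
{
public:
    BlockedRange(int b, int e) : begin_(b), end_(e) { assert(b <= e); }
    int begin() const { return begin_; }
    int end() const { return end_; }
private:
    int begin_, end_;
};

// Serial fallback: hand the whole range to the body in a single call.
// With TBB the same body would be invoked on many sub-ranges concurrently.
template <typename Body>
inline void parallel_for(const BlockedRange& range, const Body& body)
{
    body(range);
}

}  // namespace sketch

// Example body: doubles every element inside the range it is given.
struct DoubleValues
{
    explicit DoubleValues(std::vector<int>& v) : values(&v) {}
    void operator()(const sketch::BlockedRange& r) const
    {
        for (int i = r.begin(); i < r.end(); ++i)
            (*values)[i] *= 2;
    }
    std::vector<int>* values;
};
```

Calling sketch::parallel_for(sketch::BlockedRange(0, n), DoubleValues(v)) then processes all n elements in one pass.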

Summary by template:
  • parallel_for: boost, cascadedetect, distransform, haar, hog, lkpyramid, stereobm, surf
  • parallel_reduce: rtrees.cpp, tree, features2d::evaluation.cpp
  • parallel_do: stereobm; does not finish until all items in the list are processed, even when new items are added inside operator(); downside - no random access to the input, which limits concurrency.
  • ConcurrentRectVector: cascadedetect, haar, hog
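The parallel_do behaviour noted in the list above (the loop only ends once every item is processed, including items added while it runs) can be sketched serially as below. Feeder, serialDo() and HalveUntilOne are hypothetical names for illustration; TBB's real counterpart of the feeder is tbb::parallel_do_feeder, and the real thing runs the bodies concurrently.

```cpp
#include <cassert>
#include <deque>
#include <iterator>
#include <vector>

// Hypothetical feeder: lets a body queue more work while the loop is running.
template <typename Item>
class Feeder
{
public:
    explicit Feeder(std::deque<Item>& q) : queue_(&q) {}
    void add(const Item& item) { queue_->push_back(item); }
private:
    std::deque<Item>* queue_;
};

// Serial sketch: keep going until the work list is empty,
// including anything the body added along the way.
template <typename Iterator, typename Body>
void serialDo(Iterator first, Iterator last, const Body& body)
{
    typedef typename std::iterator_traits<Iterator>::value_type Item;
    std::deque<Item> pending(first, last);
    Feeder<Item> feeder(pending);
    while (!pending.empty()) {
        Item item = pending.front();
        pending.pop_front();
        body(item, feeder);  // may call feeder.add(...)
    }
}

// Example body: records each value, then feeds value / 2 back in while value > 1.
struct HalveUntilOne
{
    explicit HalveUntilOne(std::vector<int>& out) : visited(&out) {}
    void operator()(int value, Feeder<int>& feeder) const
    {
        assert(value >= 1);
        visited->push_back(value);
        if (value > 1)
            feeder.add(value / 2);
    }
    std::vector<int>* visited;
};
```

Starting from the single item 8, the body above ends up visiting 8, 4, 2 and 1 before the loop finishes.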


Configure and Build - OpenCV 2.2 Win32 + TBB 3.0

  • Picked up some changes from OpenCV trunk to add TBB support for VS2010 in CMakeFile.
  • The rest is straightforward. http://opencv.willowgarage.com/wiki/TBB
  • Run sample facedetect (CascadeClassifier):
    Shortens time required to detect faces and mouths (nested) from a selected picture: 34 sec -> 14 sec
  • Run sample peopledetect (HOG descriptor):
    Roughly halves the time taken: 1.5 -> 0.7 secs

Configure and Build - OpenCV 2.2 Linux GCC 4 + TBB 3.0
  • Must specify TBB_LIB_DIR all the way to the exact location (e.g. <tbb-top-level-lib-dir>/ia32/<gcc-version>)
  • traincascade finishes in a little over half the time: 2:14:13 -> 1:13:08.
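For comparing the timings above as speed-up factors, a small helper sketch (toSeconds() and speedup() are my own names, nothing from OpenCV or TBB):

```cpp
#include <cassert>

// Converts an h:mm:ss duration to seconds.
inline long toSeconds(long hours, long minutes, long seconds)
{
    assert(hours >= 0 && minutes >= 0 && minutes < 60 && seconds >= 0 && seconds < 60);
    return hours * 3600 + minutes * 60 + seconds;
}

// Speed-up factor = time before / time after.
inline double speedup(double before, double after)
{
    assert(after > 0.0);
    return before / after;
}
```

With the numbers recorded here: traincascade goes from 8053 s to 4388 s, a factor of about 1.84; facedetect (34 -> 14 sec) is about 2.43; peopledetect (1.5 -> 0.7 secs) is about 2.14.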






Tuesday, July 26, 2011

Parallelizing Loops with OpenMP

According to the OpenCV Release Notes, OpenMP has not been actively supported since OpenCV 2.1. It has been replaced by Threading Building Blocks (TBB).

OpenMP relies on #pragma directives that tell the compiler to parallelize loops / code blocks. The changes to existing code are small compared to other methods.

_OPENMP is defined by compilers that support OpenMP and have it enabled.
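A minimal sketch of both points, not taken from the OpenCV sources: addScaled() and availableThreads() are made-up names, and the loop is just an example of independent iterations. Built with OpenMP enabled (e.g. -fopenmp on GCC) the loop is split across threads; built without it, the same code runs serially.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>  // only include the header when OpenMP is enabled
#endif

// Threads the loop below may use; 1 when compiled without OpenMP.
inline int availableThreads()
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Hypothetical workload: dst[i] += factor * src[i].
// Every iteration writes only its own element, so iterations are independent.
inline void addScaled(const std::vector<float>& src, std::vector<float>& dst, float factor)
{
    assert(src.size() == dst.size());
    const long n = static_cast<long>(src.size());
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (long i = 0; i < n; ++i)
        dst[i] += factor * src[i];
}
```

The only change to the original serial loop is the directive above it, which is why retrofitting OpenMP onto existing code is cheap.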

Searching the source code for _OPENMP turns up the current 'leftover' implementations here:
contrib
  • selfsimilarity.cpp (disabled with #if 0 block)
  • spinimages.cpp ( parallel for in computeSpinImages() )
features2D
  • stardetector.cpp ( parallel for in icvStarDetectorComputeResponses() )
core
  • system.cpp (getNumThreads(),...)
opencv_haartraining
  • cvboost.cpp ( parallel private in cvCreateMTStumpClassifier() )
  • cvhaartraining ( uses CV_OPENMP instead of _OPENMP. Moreover, it's only enabled with MSVC and ICC compilers, not GCC )
legacy
  • blobtrackanalysisior.cpp: parallel CvBlobTrackAnalysisIOR::Process()
  • blobtrackingmsfg.cpp: parallel UpdateWeightsMS(), UpdateWeightsCC()


Experiment with haartraining
  1. Used the Linux build to take advantage of GCC 4 OpenMP support. There are OpenMP options in MSVC 2010 Express, but the Microsoft webpage states there is no support. (http://msdn.microsoft.com/en-us/library/tt15eb9t.aspx)
  2. Recovered ENABLE_OPENMP in the OpenCV 2.2 CMakeFiles by un-commenting its occurrences.
  3. Used CMake-GUI to configure the build with ENABLE_OPENMP turned on.
  4. Called cvGetNumThreads() in haartraining.cpp to see if both CPU cores are available for use.
Result
  • Took half the time to perform the same training. The results are basically the same. The noticeable difference is that the node-split threshold values differ from the 5th decimal place on.
Note
  • cvhaartraining.cpp uses a variant of the OpenMP define - CV_OPENMP. Training terminated with a segfault with that turned on.