Wednesday, July 27, 2011

Parallelizing Loops with Intel Threading Building Blocks

According to the OpenCV release notes, the code now uses Intel TBB (2.2+) instead of OpenMP.

Overview of TBB

TBB is essentially a library of C++ templates for concurrency. It provides concurrent containers (both random-access and sequential) and parallel iteration algorithms. It also offers concurrent memory allocators, which reduce cache-line contention between threads (false sharing). Moreover, its scheduler works with lightweight tasks rather than the 'bulkier' threads.

Paraphrasing the TBB FAQ: use OpenMP for C programs and "large, predictable data parallel problems"; use TBB for C++ and "less-structured and consistent parallelism". Also, OpenMP works at the compiler level, so it depends on compiler support, whereas TBB is an ordinary library.


Intel dual-licenses the TBB library: one option is a commercial license with support and other goodies, the other is a no-frills open-source license. The commercial license costs $299 for the first year and $120 per renewal. The open-source license is GPLv2 with the C++ runtime exception. As I understand it, this means that instantiating TBB templates does not require you to open-source your program, unless you modify TBB itself.

Use in OpenCV 2.2 (2.3 should have more in the ML module)

internal.hpp: when HAVE_TBB is undefined, implements a tiny subset of TBB in serial fashion: the inline function templates cv::parallel_for(), cv::parallel_do() and cv::parallel_reduce(), plus ConcurrentVector, Split and BlockedRange.
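
To make the idea of a serial fallback concrete, here is a minimal sketch of what such a stand-in could look like. It is not the code from internal.hpp: the sketch namespace, SquareBody and squares_up_to are hypothetical names made up for illustration, and only the overall shape (a range class plus a parallel_for that simply calls the body once) follows the description above.

```cpp
#include <cassert>
#include <vector>

namespace sketch {  // hypothetical namespace, not OpenCV's

// Half-open index range [begin, end), shaped like tbb::blocked_range<int>.
class BlockedRange {
public:
    BlockedRange(int b, int e) : begin_(b), end_(e) { assert(b <= e); }
    int begin() const { return begin_; }
    int end() const { return end_; }
private:
    int begin_, end_;
};

// Serial stand-in: no splitting, the whole range goes to the body in one call.
template <typename Body>
inline void parallel_for(const BlockedRange& range, const Body& body) {
    body(range);
}

}  // namespace sketch

// Example body (hypothetical): squares every element in the range it is given.
// operator() is const, as TBB requires for parallel_for bodies.
struct SquareBody {
    std::vector<int>* data;
    void operator()(const sketch::BlockedRange& r) const {
        for (int i = r.begin(); i != r.end(); ++i)
            (*data)[i] *= (*data)[i];
    }
};

// Hypothetical helper showing a call site: fills 0..n-1, then squares in place.
inline std::vector<int> squares_up_to(int n) {
    std::vector<int> v(n);
    for (int i = 0; i < n; ++i) v[i] = i;
    sketch::parallel_for(sketch::BlockedRange(0, n), SquareBody{&v});
    return v;
}
```

Because the body only ever sees a range, the same SquareBody would work unchanged if the stand-in were swapped for a real parallel implementation that splits the range into chunks.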

Summary by template:
  • parallel_for: boost, cascadedetect, distransform, haar, hog, lkpyramid, stereobm, surf
  • parallel_reduce: rtrees.cpp, tree, features2d::evaluation.cpp
  • parallel_do: stereobm; does not finish until every item in the list has been processed, including new items added from within operator(); downside: no random access to the input, so fetching work items is serialized and concurrency is limited.
  • ConcurrentRectVector: cascadedetect, haar, hog
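
The parallel_do behaviour noted in the list (the loop keeps going until items added from inside operator() are processed too) can be sketched serially as follows. This is an illustration only, not TBB's or OpenCV's implementation: Feeder, serial_do, CountDown and count_calls_for_seeds_2_and_1 are hypothetical names, with Feeder loosely modelled on tbb::parallel_do_feeder.

```cpp
#include <cassert>
#include <deque>
#include <iterator>
#include <list>

// Hypothetical serial stand-in for tbb::parallel_do_feeder<T>.
template <typename T>
class Feeder {
public:
    explicit Feeder(std::deque<T>& pending) : pending_(pending) {}
    // Queue one more work item; it will be processed before the loop ends.
    void add(const T& item) { pending_.push_back(item); }
private:
    std::deque<T>& pending_;
};

// Processes [first, last) plus every item the body adds through the feeder.
// Only forward iteration over the input is needed, never random access.
template <typename Iter, typename Body>
void serial_do(Iter first, Iter last, const Body& body) {
    typedef typename std::iterator_traits<Iter>::value_type T;
    std::deque<T> pending(first, last);
    Feeder<T> feeder(pending);
    while (!pending.empty()) {
        T item = pending.front();
        pending.pop_front();
        body(item, feeder);
    }
}

// Example body (hypothetical): counts its calls and feeds n - 1 while n > 0.
struct CountDown {
    int* calls;
    void operator()(int n, Feeder<int>& feeder) const {
        assert(n >= 0);
        ++*calls;
        if (n > 0) feeder.add(n - 1);
    }
};

// Seeds 2 and 1 expand to the items 2, 1, 1, 0, 0, so the body runs 5 times.
inline int count_calls_for_seeds_2_and_1() {
    std::list<int> seeds;
    seeds.push_back(2);
    seeds.push_back(1);
    int calls = 0;
    serial_do(seeds.begin(), seeds.end(), CountDown{&calls});
    return calls;
}
```

The input here is a std::list, which only supports sequential traversal. That is the situation described in the parallel_do bullet: work items have to be fetched one at a time, which is what limits how well the real parallel version scales.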

Configure and Build - OpenCV 2.2 Win32 + TBB 3.0

  • Picked up some changes from the OpenCV trunk to add TBB support for VS2010 to the CMake files.
  • The rest is straightforward.
  • Run sample facedetect (CascadeClassifier):
    Shortens time required to detect faces and mouths (nested) from a selected picture: 34 sec -> 14 sec
  • Run sample peopledetect (HOG descriptor):
    Roughly halves the time taken: 1.5 sec -> 0.7 sec

Configure and Build - OpenCV 2.2 Linux GCC 4 + TBB 3.0
  • Must specify TBB_LIB_DIR all the way to the exact location (e.g. <tbb-top-level-lib-dir>/ia32/<gcc-version>)
  • traincascade takes only about half the time to finish: 2:14:13 -> 1:13:08.
