Thursday, July 28, 2011

Speed up with Intel Integrated Performance Primitives (IPP)

Current OpenCV support
OpenCV 2.3 : IPP 7.0
OpenCV 2.2 : IPP 5 - 6.1
The directory structures of IPP 7.0 and IPP 6.1 differ, and the OpenCV 2.2 CMake script only checks for IPP library versions up to 6.1.

Licensing
IPP is available under both commercial and non-commercial licenses. A single-user commercial license costs $199, plus $80 for annual renewal. The non-commercial version only supports Linux.

Installation
I tried the 30-day evaluation of the non-commercial version. I was surprised that it requires 2GB of disk space for installation, given that the package itself is only about 237MB.

What is IPP?
Intel's description of the relationship between IPP and OpenCV (outdated with respect to the current OpenCV code):
http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-open-source-computer-vision-library-opencv-faq
IPP uses OpenMP for parallelization:
http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-intel-ipp-threaded-functions/
I did not spend time going into the details. My impression is that it provides a lot of routines that OpenCV needs in image processing, camera calibration, and optical flow, and quite a few OpenCV functions have a 1-to-1 mapping to a corresponding function in IPP. I was surprised to see that so few areas actually use IPP now. Based on what I read in the discussion forum, IPP sped up OpenCV 1.x a lot, before OpenCV had its own SSE acceleration.
SSE is instruction-set-level hardware acceleration. IPP is a library that implements algorithms that take advantage of SSE.

How is it used
I did a quick search for 'HAVE_IPP' in the OpenCV 2.2 source code. These are the areas where IPP currently appears to be relevant: dxt.cpp (e.g. DFT), haar.cpp, hog.cpp, LK optical flow, and HS optical flow.
One thing I noticed is how the IPP library is loaded at run time. There are several linking methods to choose from at build time. I think OpenCV 2.2 uses what is called Static Linking with Dispatching, via ippStaticInit(). At build time, OpenCV links in a static library that provides a 'jump table' to the actual dynamic library at run time. The IPP routines are optimized differently for different processors, based on their capabilities, and ippStaticInit() chooses the suitable library to load based on the processor it is running on. See the IPP User Guide for details.
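The 'jump table' idea can be illustrated with a small sketch. This is not IPP code and does not call IPP: CpuLevel, static_init(), sum_generic(), sum_unrolled() and dispatched_sum() are all hypothetical names, and the CPU level is passed in by the caller rather than detected. It only shows how a single entry point can forward through a table that is filled in once at start-up.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical capability levels; a real library distinguishes more targets.
enum class CpuLevel { Generic, Sse2 };

using SumFn = float (*)(const float*, std::size_t);

// Plain reference implementation.
float sum_generic(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Stand-in for a processor-specific variant (4-way unrolled, no intrinsics).
float sum_unrolled(const float* p, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    for (; i < n; ++i) s0 += p[i];  // leftover elements
    return (s0 + s1) + (s2 + s3);
}

// The "jump table": one slot per routine, filled in once at start-up.
struct JumpTable { SumFn sum; };
static JumpTable g_table = { nullptr };

// Plays the role this post attributes to ippStaticInit(): pick a variant.
void static_init(CpuLevel level) {
    g_table.sum = (level == CpuLevel::Sse2) ? sum_unrolled : sum_generic;
}

// Callers always use this entry point and never name a variant directly.
float dispatched_sum(const float* p, std::size_t n) {
    assert(g_table.sum != nullptr && "call static_init() first");
    return g_table.sum(p, n);
}
```

The calling code stays the same whichever variant is selected, which is the property that makes this kind of dispatching attractive for a library that targets many processors.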

CMakeFile Configuration
I downloaded and installed IPP 6.1 Build 6 on Linux.
Point the IPP_PATH CMake variable to the IPP lib directory, not the IPP root directory.

Comparison
dft sample - the time taken to complete the dft() function dropped from 7XX ms to 3XX ms for a 5M-pixel image file.
facedetect sample - a 5% speed-up of the face-detection function using haarcascade_frontalface_alt.xml. Some classifiers will not trigger the IPP-enhanced code. See haar.cpp for the conditions required.
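For anyone who wants to repeat this kind of comparison, here is a minimal timing sketch using only the C++ standard library. It is an assumption about how such a measurement could be made, not the code the samples use. elapsed_ms() and best_of_ms() are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename Work>
double elapsed_ms(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Runs `work` `runs` times and returns the fastest run, which is less
// sensitive to one-off noise such as cold caches or library loading.
template <typename Work>
double best_of_ms(int runs, Work&& work) {
    assert(runs > 0);
    double best = elapsed_ms(work);
    for (int i = 1; i < runs; ++i) {
        best = std::min(best, elapsed_ms(work));
    }
    return best;
}
```

A call such as best_of_ms(5, [&] { cv::dft(src, dst); }), with src and dst as placeholder matrices, would time the same function in a build with IPP and a build without it.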

4 comments:

  1. Do you know if the IPP version of Haar classifiers can be used without image rescaling? I noticed the sample code for IPP seems to scale the image before doing the classification, instead of scaling the classifier. Why would this be faster?

  2. I cannot find the difference you mentioned in the code. Could you let me know exactly where it is? I am using OpenCV 2.2.

  3. In haar.cpp, the only time that IPP is used in cvHaarDetectObjects is when "flags & CV_HAAR_SCALE_IMAGE" is true. So this means that IPP is only relevant when the image is scaled (instead of scaling the classifier). This sounds like it would be slower, since all candidate regions have to be resized before running them through the Haar classifier. What are your thoughts on this?

  4. When we scale the classifier, specifically the feature vectors, I suppose it involves scaling the associated weights, integral sums, and thresholds too. I don't think it's always true that one way is faster than the other. The feature vectors could overlap, and the area could end up pretty big. And don't forget that it takes a significant amount of time to scan the feature window across the image at the various scale levels. I would like to think that scanning takes up the majority of the computation time. Do you agree?
