Meanshift object tracking with C

2014-01-21 本文已影响399人 ifcode

Object tracking is all about locating objects in a video sequence or in real time(real time tracking) and following the movement of the object.

In this project, I will present you the meanshift tracking method and its improvement camshift with the CUDA acceleration.

1. Meanshift and Camshift

Briefly speaking, the classical meanshift method contains three steps:

calculate meanshift vector
move the density estimation window
repeat the above process until it converge

As you can see that the whole process is about finding the densest center. I will not go through the mathematical detail behind, you can find a good reference here: Introducation to Mean shift.

Follow this method, I built my first version of object tracking(without CUDA):

It looks just OK, but I had found a big problem that is my program lost his target very easily. I found that the problem is deal to the fixed searching window size, which means the searching window was big so that takes some background into calculation:

Or it was too small that just contains a subset of object:

The tracking is not constant in both case because we didn't really taking the density of object. And this caused another problem which is the we can't change the depth level. If we move the object forwards or backwards to the camera, the changing object size will not fit the fixed searching window.

By solving this problem, I get to camshift (continuous adaptive meanshift) which is an improvement of meanshift introduced by Gary Bradski in 1998.

Briefly speaking, camshift applied meanshift first. Once the meanshift converged, it recalculate the size of searching window. And again it starts the meanshift with the new window and previous location. This process ends until it meets certain precision.

By recalculating the window size in each iteration, we could find the best fit searching window for target object. This insured that the stable and constant tracking:

2. Parallel programming with CUDA

A typical CUDA program contains the following 4 steps:

CPU allocates storage on GPU
CPU copies input data to GPU
CPU launches kernels on GPU to process the data
CPU copies results back form GPU to CPU

The principle of writing a CUDA program is firstly write your program as if it runs on one single thread and than the GPU will run that program on many threads.

But in case of parallelism, we have to face some unusual problems in procedural programming. The first problem is how the threads are communicating with each other through memory.

Parallel communication models

The simplest model is call "map", which means that tasks read from and write to specific data elements:

The second model is "gather" because the threads gather many input data:

And we also have "scatter" where the task computes where to write output:

Next is "stencil" where tasks read input from a fixed neighborhood in an array:

And finally we have "transpose" that re-order the data element:

Let's make a little conclusion:

map: one-to-one
gather: many-to-one
scatter: one-to-many
stencil: several-to-one
transpose: one-to-one

Memory model

CUDA has a 3 level memory model. A thread has it's own local memory. A block of threads has a shared memory which could be access by all threads inside the block. And the whole kernel has a global memory that is accessible for all contained threads.

Synchronization

As we define the block which contains many threads to compute part of the problem, we are facing the synchronization problem. In certain point of the program, threads should wait until others reached.

A simple technique CUDA applied is barrier, which the point that threads should stop ant wait. (syncthreads)

We've go through all important aspect of prallel programming, let's look at our Camshfit acceleration.

3. Camshift acceleration with CUDA

So the firstly thing we have to do, is to analyse which part of the program could be parallel. The process of Camshift is following:

construct histogram
back projection to obtain the color distribution
meanshift util it converges
recalculate the window size for next frame
continue the process until no more frame

The histogram could be built simultaneously. Instead of iterate pixel by pixel, we could just launch many threads to compute the bin in the same time.

__global__ void rgb_to_xyY(
   float* d_r,
   float* d_g,
   float* d_b,
   float* d_x,
   float* d_y,
   float* d_log_Y,
   float  delta,
   int    num_pixels_y,
   int    num_pixels_x )
{
 int  ny             = num_pixels_y;
 int  nx             = num_pixels_x;
 int2 image_index_2d = make_int2( ( blockIdx.x * blockDim.x ) + threadIdx.x, ( blockIdx.y * blockDim.y ) + threadIdx.y );
 int  image_index_1d = ( nx * image_index_2d.y ) + image_index_2d.x;

 if ( image_index_2d.x < nx && image_index_2d.y < ny )
 {
   float r = d_r[ image_index_1d ];
   float g = d_g[ image_index_1d ];
   float b = d_b[ image_index_1d ];

   float X = ( r * 0.4124f ) + ( g * 0.3576f ) + ( b * 0.1805f );
   float Y = ( r * 0.2126f ) + ( g * 0.7152f ) + ( b * 0.0722f );
   float Z = ( r * 0.0193f ) + ( g * 0.1192f ) + ( b * 0.9505f );

   float L = X + Y + Z;
   float x = X / L;
   float y = Y / L;

   float log_Y = log10f( delta + Y );

   d_x[ image_index_1d ]     = x;
   d_y[ image_index_1d ]     = y;
   d_log_Y[ image_index_1d ] = log_Y;
 }
}

The moment computing process is also parallelism, but in my project I didn't get this right. Maybe I will get this done later.

So let's look at the result:

4. Conclusion

In this project, I applied Camshift object tracking method with (part of) CUDA acceleration. The result is convincing, but there are still problems for future work:

The color space is not a robust feature to extract the object. The program could get lost with similar background color.
The parallel programming is complex, so I didn't success all the aspect in this program.
Deal to time limit, an performance evaluation is not held in this project.