Foreground-Background Separation on GPU Using Order-Based Approaches

Raj Gupta, Sailaja Reddy M., Swagatika Panda, Sushant Sharma and Anurag Mittal
Indian Institute of Technology Madras, Chennai, INDIA

Problem Statement:
Foreground-background separation using order-based approaches, with speedup using GPUs.

Application Domain:
Foreground-background separation.
GPU implementation.

Contributions:
Transform from intensity space to order space for illumination and noise invariance.
Use of the Stable Monotonic Change Invariant Feature Descriptor [1].
Implementation on a GPU (NVIDIA Tesla C1060 processor) to achieve up to 25x speedup compared to the CPU.

[1] R. Gupta and A. Mittal, SMD: A locally stable monotonic change invariant feature descriptor. In Computer Vision – ECCV

Open Issues Addressed:
Increased robustness of foreground-background separation w.r.t. illumination and noise.
Improved speed and stability of the process.
Offloading the CPU for subsequent processes (tracking, object recognition, etc.).

Extremal Regions
Computation of extremal regions, the regions $(R^+, R^-)$ that have intensities above or below a given threshold:
$R^+ = \mathrm{Thresh}^+(I, T_1)$
$R^- = \mathrm{Thresh}^-(I, T_2)$
where the thresholds $T_1$, $T_2$ differ by a small value $\delta_I$.

Fig. Overlapped patches in an image.
Fig. An example of pairs extracted in a patch.

Matching
Final weighted matching score $M$ = weighted sum of order flips, where $I'(p)$ is the intensity of the second patch at the point corresponding to $p$ in the first patch.

Foreground Background Classification
Total stability measure: the sum of squares of the stability factors of all pixels in a patch.
Homogeneous patches: patches with a low total stability measure. Their average intensities are compared for matching.
Non-homogeneous patches: patches with a high total stability measure. Their matching score $M$ is compared for matching.

Augmenting results with patch information
If either of two overlapping patches is foreground, their common region is declared foreground; otherwise the common region is declared background.
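The matching and classification rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the poster does not give the exact formula for the matching score, so the stability-weighted fraction of flipped pairs used here is an assumed form, and the function names and all three thresholds (`homog_thresh`, `mean_thresh`, `score_thresh`) are hypothetical.

```python
import numpy as np

def total_stability(stability_factors):
    """Total stability measure: sum of squares of the stability factors."""
    s = np.asarray(stability_factors, dtype=np.float64)
    return float(np.sum(s ** 2))

def weighted_flip_score(bg_patch, cur_patch, pairs):
    """ASSUMED form of the matching score: stability-weighted fraction of
    point pairs whose intensity order differs between the two patches.
    `pairs` is an iterable of ((r1, c1), (r2, c2), s)."""
    flipped = 0.0
    total = 0.0
    for p1, p2, s in pairs:
        bg_order = bg_patch[p1] > bg_patch[p2]
        cur_order = cur_patch[p1] > cur_patch[p2]
        total += s
        if bg_order != cur_order:  # the order of the pair has flipped
            flipped += s
    return flipped / total if total > 0 else 0.0

def patch_is_foreground(bg_patch, cur_patch, pairs, stability_factors,
                        homog_thresh=1.0, mean_thresh=20.0, score_thresh=0.5):
    """Homogeneous patches compare average intensity; others compare the score.
    All thresholds are hypothetical placeholder values."""
    if total_stability(stability_factors) < homog_thresh:
        diff = abs(float(cur_patch.mean()) - float(bg_patch.mean()))
        return diff > mean_thresh
    return weighted_flip_score(bg_patch, cur_patch, pairs) > score_thresh

def overlap_is_foreground(patch_a_fg, patch_b_fg):
    """Common region of two overlapping patches is foreground if either is."""
    return patch_a_fg or patch_b_fg
```

Because only the order of each pair is compared, a monotonic brightness change such as `cur = 0.5 * bg + 30` leaves the score at zero, which is the invariance the order-space transform is meant to provide.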
Computation of Point Pairs
Compute point pairs (one point each from $R^+$ and $R^-$) using the distance transform.
Select the most stable point pairs based on the distance transform.
Stability factor ($s_i$) of each point pair: the minimum of the distance values of the two points in the pair.
$\{(p_i^1, p_i^2, s_i),\ i = 1 \dots n\}$ forms the feature descriptor for matching the patches of the two frames (the current frame and the background model).

Fig. Stage labels: Extremal regions; Extract stable points; Form point pairs for comparison at different thresholds.

Learning the Background
The background model is dynamically updated as follows:
$bk = bk \cdot (1 - \alpha) + fg \cdot \alpha$
where
$\alpha$: learning rate
$bk$: pixel intensity of the existing background model
$fg$: pixel intensity of the current frame

Results
R1, R2: indoor scenes from the PETS database.
R3: outdoor scene on a sunny day (with shadows and leaf movements).
R4: outdoor scene on a cloudy day with less illumination.
(a) Input image; background subtraction using (b) LTP, (c) LBP, (d) GMM, (e) the monotonic change invariant method.

Conclusion
Foreground-background separation method using the order of intensities.
Robust to noise, to distortion in the spatial domain of pixels, and to fluctuations in the background (shadows, clouds, weather change, etc.).
Good results in both indoor and outdoor environments.
Implemented on a GPU to achieve high throughput (speedup of 25x for 960x720 image resolution).

Patch-level parallelization:
Total number of patches: $(W_i/W_p \cdot H_i/W_p) + ((W_i/W_p - 1) \cdot (H_i/W_p - 1))$, where $W_i \times H_i$ is the image size and $W_p$ is the patch width.
Each patch is processed independently. Each block has 256 threads, and each thread processes one patch.
Each patch is shared by at most 4 regions.
Total number of overlapped regions: $4 \cdot (H_i/W_p - 1) \cdot (W_i/W_p - 1)$.
Declare a grid of $(H_i/W_p - 1)$ blocks with $4 \cdot (W_i/W_p - 1)$ threads each.
Keep the data in shared memory for speed.
This layout facilitates concurrent copying of the binary image data from device to host.

NVIDIA Tesla C1060 follows the 10-series NVIDIA architecture and has 30 multiprocessors.
Each multiprocessor has 8 cores, a double-precision unit and an on-chip shared memory. The device is interfaced through CUDA.

Performance Optimization:
Maximizing parallel execution
Optimizing memory transfer
Optimizing memory usage
Optimizing instruction usage

GPU Implementation
Fig. Physical memory layout of Tesla processors.
Fig. An overview of memory management and program flow on the GPU. Copy command: host-device memory transfer. Red arrow: concurrent transfer and kernel execution. Synchronize command: synchronizes all data transfers and kernel executions before further execution.
Fig. Thread batching.
Fig. (a) Parallelizing over overlapped regions. Each thread in a block processes one region (yellow). (b) Timeline for serial execution and copy vs. concurrent asynchronous copy.

Both the grid and the block are declared as 2D:
Kernel grid: 2D group of blocks of size $(W_i/W_p, H_i/W_p)$, one block per patch.
Block: 2D group of threads of size $(W_p, W_p)$, one thread per pixel.
Each thread makes coalesced access to the pixel information.
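To make the launch geometry concrete, the following Python sketch evaluates the grid and block sizes together with the patch-count formulas given earlier. It is a worked-arithmetic illustration only: the function name `launch_geometry` is hypothetical, and the patch width of 16 pixels used in the example is an assumption (the poster does not state it), chosen because a 16 x 16 block gives the 256 threads per block mentioned in the text.

```python
def launch_geometry(image_w, image_h, patch_w):
    """Grid/block sizes and patch counts for square patches of width patch_w.
    Assumes the image dimensions are exact multiples of the patch width."""
    if image_w % patch_w or image_h % patch_w:
        raise ValueError("image size must be a multiple of the patch width")
    cols = image_w // patch_w  # W_i / W_p
    rows = image_h // patch_w  # H_i / W_p
    return {
        # one block per patch, one thread per pixel of the patch
        "grid": (cols, rows),
        "block": (patch_w, patch_w),
        "threads_per_block": patch_w * patch_w,
        # regular patches plus the patches shifted by half a patch
        "total_patches": cols * rows + (cols - 1) * (rows - 1),
        # each shifted patch overlaps up to 4 regular patches
        "overlapped_regions": 4 * (rows - 1) * (cols - 1),
    }

# Example: the 960x720 resolution from the poster, with an ASSUMED patch width.
geometry = launch_geometry(960, 720, 16)
print(geometry)
```

For 960x720 and the assumed patch width of 16, this gives a 60 x 45 grid of 16 x 16 blocks, 5296 patches in total and 10384 overlapped regions.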