A Parallel Implementation of MSER Detection
GPGPU Final Project
Lin Cao
Review
MSER (maximally stable extremal regions) denotes a set of stable connected components detected in a gray-scale image.
MSERs are invariant to affine transformations such as rotation, translation, and scale change.
Review
An MSER is a stable connected component of a thresholded image.
All pixels inside an MSER have either higher or lower intensity than the pixels in the surrounding region.
Regions are selected for being stable over a range of intensity thresholds.
Sequential and Parallel Approach
Sequential {
    bucketSort();
    Find();
    Union();
    Update();
    GetRegion();
    computeVariation();
    leastVariation();
}
Parallel {
    buildDirectedGraph();
    blockReduction();
    parentCompression();
    // regions are already available here
    computeVariation();
    findRoot();
    leastVariation();
}
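The sequential column can be sketched roughly as below. This is a minimal C++ sketch, not the project's code: it assumes an 8-bit row-major image and 4-connectivity, and the names `DisjointSet`, `sortPixelsByIntensity`, and `areaAtThreshold` are hypothetical. It covers only the bucketSort, Find, and Union steps, and reports the area of one component at one threshold.

```cpp
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical union-find that also tracks component area (Find / Union steps).
struct DisjointSet {
    std::vector<int> parent, area;
    explicit DisjointSet(int n) : parent(n), area(n, 1) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // path halving
            x = parent[x];
        }
        return x;
    }
    void unite(int a, int b) {
        a = find(a);
        b = find(b);
        if (a == b) return;
        if (area[a] < area[b]) std::swap(a, b);
        parent[b] = a;  // attach the smaller component to the larger one
        area[a] += area[b];
    }
};

// bucketSort step: counting sort of pixel indices by 8-bit intensity.
std::vector<int> sortPixelsByIntensity(const std::vector<std::uint8_t>& img) {
    std::vector<int> start(257, 0);
    for (std::uint8_t v : img) ++start[v + 1];
    for (int i = 0; i < 256; ++i) start[i + 1] += start[i];
    std::vector<int> order(img.size());
    for (int p = 0; p < static_cast<int>(img.size()); ++p) order[start[img[p]]++] = p;
    return order;
}

// Adds pixels in increasing intensity up to `threshold`, merging each new pixel
// with its already-added 4-neighbours. Returns the area of the component that
// contains `seed`, or 0 if `seed` is brighter than the threshold.
int areaAtThreshold(const std::vector<std::uint8_t>& img, int width, int height,
                    int threshold, int seed) {
    const int n = width * height;
    DisjointSet ds(n);
    std::vector<bool> added(n, false);
    for (int p : sortPixelsByIntensity(img)) {
        if (img[p] > threshold) break;
        added[p] = true;
        const int x = p % width, y = p / width;
        if (x > 0 && added[p - 1]) ds.unite(p, p - 1);
        if (x + 1 < width && added[p + 1]) ds.unite(p, p + 1);
        if (y > 0 && added[p - width]) ds.unite(p, p - width);
        if (y + 1 < height && added[p + width]) ds.unite(p, p + width);
    }
    return added[seed] ? ds.area[ds.find(seed)] : 0;
}
```

The strict intensity ordering is what makes this version hard to parallelize: every merge depends on all darker pixels having been processed first.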
buildDirectedGraph
Each pixel's parent must have a value no less than the pixel's own value.
Local memory: visited, members
Shared memory
buildDirectedGraph
Memory usage:
Local memory: visited, members
Shared memory
Edges are also processed here for the next step.
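The parent rule on the buildDirectedGraph slides can be illustrated with a short sequential C++ sketch. The slides only state that a parent's value is no less than the pixel's own value; the specific choice made here (the lowest-ranked 4-neighbour that ranks above the pixel, with ties broken by index so that no cycles form) is an assumption, and `buildDirectedGraphSketch` is a hypothetical name. On the GPU, each loop iteration would correspond to one thread.

```cpp
#include <cstdint>
#include <vector>

// Returns parent[p] for every pixel of a row-major 8-bit image.
// Invariant: img[parent[p]] >= img[p]. A pixel with no valid neighbour
// becomes its own parent (a root).
std::vector<int> buildDirectedGraphSketch(const std::vector<std::uint8_t>& img,
                                          int width, int height) {
    const int n = width * height;
    std::vector<int> parent(n);
    for (int p = 0; p < n; ++p) {
        const int x = p % width, y = p / width;
        const int nbr[4] = {x > 0 ? p - 1 : -1, x + 1 < width ? p + 1 : -1,
                            y > 0 ? p - width : -1, y + 1 < height ? p + width : -1};
        int best = p;
        for (int q : nbr) {
            if (q < 0) continue;
            // Assumed ranking: higher value first, then higher index on ties.
            // Parents must rank strictly above p, which rules out cycles.
            const bool above = img[q] > img[p] || (img[q] == img[p] && q > p);
            if (!above) continue;
            // Among valid candidates keep the lowest-ranked one.
            if (best == p || img[q] < img[best] || (img[q] == img[best] && q < best)) {
                best = q;
            }
        }
        parent[p] = best;
    }
    return parent;
}
```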
Block Reduction 16*16, 8*8
Block Reduction
A total of 3 iterations is needed: log2(4) + log2(2) = 2 + 1 = 3
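The iteration count can be reproduced with a few lines of C++. This sketch assumes that the two logarithms refer to a grid of 4 by 2 blocks merged pairwise along each axis; the slide does not state the grid shape, and both function names are hypothetical.

```cpp
// Pairwise merge rounds needed to reduce `count` blocks to one along one axis,
// i.e. ceil(log2(count)).
int mergeRounds(int count) {
    int rounds = 0;
    for (int span = 1; span < count; span *= 2) ++rounds;
    return rounds;
}

// Total block-reduction iterations for a grid of blocksX by blocksY blocks.
int blockReductionIterations(int blocksX, int blocksY) {
    return mergeRounds(blocksX) + mergeRounds(blocksY);
}
```

Under that assumption, `blockReductionIterations(4, 2)` gives the 3 iterations quoted on the slide.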
Block Reduction
if (horizontal_pixelUpdate)
    load edge information into each pixel
Block Reduction History buffer
Parent Compression
Shared memory is used, relying on parent locality.
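The slide gives no detail on the compression itself. One common reading, assumed here, is that parent chains are shortened so that every pixel points directly to the representative of its parent's intensity level. The sequential C++ sketch below illustrates that reading only; `compressParents` is a hypothetical name, and the project's kernel may work differently.

```cpp
#include <cstdint>
#include <vector>

// For every pixel, replaces parent[p] with the last node on the chain starting
// at parent[p] that still has the same value as parent[p]. Roots are nodes
// that are their own parent. Updating in place is safe because the walk only
// skips nodes of equal value.
void compressParents(const std::vector<std::uint8_t>& value, std::vector<int>& parent) {
    for (int p = 0; p < static_cast<int>(parent.size()); ++p) {
        int q = parent[p];
        while (parent[q] != q && value[parent[q]] == value[q]) {
            q = parent[q];  // skip intermediate nodes on the same level
        }
        parent[p] = q;
    }
}
```

After compression, nearby pixels of the same level tend to share one parent, which fits the slide's remark about parent locality.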
FindRegion
FindRoot, so that each region's tree can be processed separately.
Find each region's parent and child based on delta, so that the variation can be computed:
var = (area(parent) - area(child)) / area(current region)
Send the region information to the CPU.
Scan every region's tree and find the regions with minimal variation; these are the MSER regions.
Filter the regions.
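The variation formula and the minimum search can be sketched in C++ as follows. The sketch assumes that `area[i]` holds the area of the region at intensity level `i` along a single path of one region tree, so that the parent of level `i` is the region at `i + delta` and the child is the region at `i - delta`. That layout and the function names are assumptions, not the project's data structures.

```cpp
#include <cstddef>
#include <vector>

// var = (area(parent) - area(child)) / area(current region)
double variation(const std::vector<int>& area, std::size_t i, std::size_t delta) {
    return static_cast<double>(area[i + delta] - area[i - delta]) / area[i];
}

// Returns the level with the least variation along the path, or -1 when the
// path is too short for the given delta.
int leastVariation(const std::vector<int>& area, std::size_t delta) {
    int best = -1;
    double bestVar = 0.0;
    for (std::size_t i = delta; i + delta < area.size(); ++i) {
        const double v = variation(area, i, delta);
        if (best < 0 || v < bestVar) {
            best = static_cast<int>(i);
            bestVar = v;
        }
    }
    return best;
}
```

For example, with areas `{10, 20, 22, 23, 24, 60, 100}` and `delta = 1`, the region at level 3 barely changes between its child (22) and its parent (24), so it has the least variation.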
Performance Analysis: 256*256 image
Performance Analysis: 1024*768 image
Performance Analysis
Why is 8*8 better than 16*16?
local memory usage
number of recursions
block execution
number of block reductions
parent locality
Performance Analysis
GPU vs CPU timing
intermediate values
synchronization
record information
memory transfer
Conclusion
The problem has very heavy data dependencies, but it can still be solved on the GPU.
It should be well suited to multicore microprocessors, whose individual cores are stronger than a single GPU thread.
The bottleneck is still memory.
Future Work
More efficient block reduction (decoder and encoder)
Random memory access
GPU code efficiency