Histogram Equalization with Cell Broadband Engine™


1 Histogram Equalization with Cell Broadband Engine™
Created by Jay Kruemcke 09/20/05 IBM Confidential 1/17/2019

2 Content
Overview: Histogram Equalization
Definitions
Assumptions, Highlights
Approach: Histogram Computation
Approach: Transform Image
Performance Results

3 Overview: Histogram Equalization
One of the most significant parts of image processing
Improves contrast by redistributing the intensity distribution
Aims for a uniform output histogram
Three stages: Compute, Normalize, Transform

4 Definitions
First Stage: Computing the histogram
    Parse the input image.
    Count each distinct pixel value in the image.
    Ex. for 8-bit pixels, the maximum pixel value is 255 and the array size is 256.
Second Stage: Computing the normalized sum of the histogram
    Store the running sum of the histogram values.
    Normalize by multiplying each element by (maximum-pixel-value / number of pixels).
Third Stage: Transforming the input image into the output image
    Use the normalized array as a lookup table that maps each input pixel value to its new value.
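The three stages above can be sketched in portable scalar C. This is a minimal illustration rather than the deck's code: the function names `build_equalization_lut` and `apply_lut` and the constant `MAX_PIXEL_VALUE` are hypothetical, and 8-bit pixels are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PIXEL_VALUE 255   /* assumption: 8-bit pixels, so 256 histogram bins */

/* Stages 1 and 2: build the lookup table for an image of n pixels. */
static void build_equalization_lut(const uint8_t *src, size_t n, uint8_t lut[256])
{
    uint32_t counts[256] = {0};
    assert(n > 0);

    /* Stage 1: count each distinct pixel value. */
    for (size_t i = 0; i < n; i++)
        counts[src[i]]++;

    /* Stage 2: running sum, scaled by (maximum pixel value / number of pixels). */
    double scale = (double)MAX_PIXEL_VALUE / (double)n;
    uint32_t sum = 0;
    for (int v = 0; v < 256; v++) {
        sum += counts[v];
        long mapped = (long)(scale * sum + 0.5);   /* round to nearest */
        if (mapped > MAX_PIXEL_VALUE)
            mapped = MAX_PIXEL_VALUE;              /* clamp to 0-255 */
        lut[v] = (uint8_t)mapped;
    }
}

/* Stage 3: map every input pixel through the lookup table. */
static void apply_lut(const uint8_t *src, uint8_t *dest, size_t n,
                      const uint8_t lut[256])
{
    for (size_t i = 0; i < n; i++)
        dest[i] = lut[src[i]];
}
```

For a four-pixel image {10, 10, 20, 30} the scale factor is 255/4 = 63.75, so the running sums 2, 3, 4 map to 128, 191, 255 and the output image is {128, 128, 191, 255}.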

5 Assumptions, Highlights
Assumptions for demo:
    8-bit color scale
Approach Highlights:
    Parallelize
    Reduce dependencies
    Loop unroll
    SIMDize the code using vectors and SPE intrinsics

6 Scalar Code Flow
    #define ROUND(v) (int)((v) + 0.5)   //!-- Round it to the closest integer
    #define __min(a,b) ( ((a) < (b)) ? (a) : (b) )
    #define __max(a,b) ( ((a) > (b)) ? (a) : (b) )
    #define BOUND(v) (unsigned char)(__min(255, __max((v), 0)))   // 0-255

    {
        int size = PIXEL_DATA_SIZE;
        unsigned char map[256];      /* one entry per possible pixel value */
        unsigned char src[size];
        unsigned char dest[size];
        unsigned int counts[256];
        double sc;
        long v;
        int i;
        unsigned int sum = 0;

        /* Clear the histogram and fill the source with random test pixels */
        for (i = 0; i < 256; i++)
            counts[i] = 0;
        for (i = 0; i < size; i++)
            src[i] = random() & 0xFF;

        /* Compute Histogram */
        for (i = 0; i < size; i++) {
            counts[src[i]]++;
        }

        /* Normalized sum of Histogram */
        sc = PIXEL_MAX_VALUE / (double) IMAGE_SIZE;
        for (i = 0; i < 256; i++) {
            sum += counts[i];
            v = ROUND(sc * sum);
            map[i] = BOUND(v);
        }

        /* Transform Histogram */
        for (i = 0; i < size; i++)
            dest[i] = map[src[i]];
    }

7 Histogram Computation
Input is read as vector unsigned char: 16 bytes (byte 0 to byte F) are loaded at a time to use the 128-bit register boundary.
Each input byte is split into two fields:
    The upper 6 bits determine which element of the 64-element array the count goes to.
    The lower 2 bits decide which slot of that element to go into.
    For ex. Counter0[48], slot '10' (the 3rd slot).
Counter 0, Counter 1, Counter 2, Counter 3: four arrays of 64 vector unsigned int (128-bit) elements, each element containing four 32-bit counter slots.
Four arrays are created to enable parallel computation and loop unrolling.
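The index/slot split on this slide can be emulated in plain C to check the bookkeeping. This is a scalar sketch under assumptions, not SPE code: the `vec4u` struct stands in for a 128-bit `vector unsigned int`, the name `histogram_split` is hypothetical, and spreading consecutive bytes across the four counter copies by `i % 4` is an illustrative choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_COPIES  4   /* Counter 0..3 on the slide */
#define NUM_VECTORS 64  /* 64 vectors per counter array */
#define NUM_SLOTS   4   /* four 32-bit counters per 128-bit vector */

/* Scalar stand-in for one "vector unsigned int": four 32-bit slots. */
typedef struct { uint32_t slot[NUM_SLOTS]; } vec4u;

static void histogram_split(const uint8_t *data, size_t n, uint32_t counts[256])
{
    vec4u cnts[NUM_COPIES][NUM_VECTORS];
    assert(data != NULL || n == 0);
    memset(cnts, 0, sizeof cnts);

    for (size_t i = 0; i < n; i++) {
        unsigned pixel = data[i];
        unsigned idx   = pixel >> 2;  /* upper 6 bits: which of the 64 vectors */
        unsigned slot  = pixel & 3;   /* lower 2 bits: which slot in that vector */
        /* Neighbouring bytes update different copies, so adjacent
         * increments never depend on the same counter. */
        cnts[i % NUM_COPIES][idx].slot[slot]++;
    }

    /* Roll the four copies into one 256-entry histogram. */
    for (unsigned idx = 0; idx < NUM_VECTORS; idx++) {
        for (unsigned slot = 0; slot < NUM_SLOTS; slot++) {
            uint32_t total = 0;
            for (unsigned c = 0; c < NUM_COPIES; c++)
                total += cnts[c][idx].slot[slot];
            counts[idx * NUM_SLOTS + slot] = total;
        }
    }
}
```

With this split, pixel value 194 (binary 110000 10) lands in vector 48, slot 2, which is consistent with the slide's Counter0[48], slot '10' example.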

8 Code sections for Histogram computation
    unsigned int idx_0, idx_1, idx_2, idx_3;
    int slot_0, slot_1, slot_2, slot_3;
    vector unsigned char in;
    vector unsigned char *vdata;
    vector unsigned int *vcounts;
    vector unsigned int in_0, in_1, in_2, in_3;
    vector unsigned int cnts_0[64];
    vector unsigned int cnts_1[64];
    vector unsigned int cnts_2[64];
    vector unsigned int cnts_3[64];
    /* `one` is not declared in this excerpt; presumably a vector unsigned int
       holding 1 in its first slot. */

    vdata = (vector unsigned char *)(data);
    for (i=15; i<size; i+=16) {
        in = *vdata++;

        //!-- Loop Unroll 1:
        //!-- Handle the first 16 bytes from the input string
        in_0 = spu_and((vector unsigned int)(in), 0xFF);
        in_1 = spu_and(spu_rlmask((vector unsigned int)(in), -8), 0xFF);
        in_2 = spu_and(spu_rlmask((vector unsigned int)(in), -16), 0xFF);
        in_3 = spu_rlmask((vector unsigned int)(in), -24);

        idx_0 = spu_extract(in_0, 0);
        idx_1 = spu_extract(in_1, 0);
        idx_2 = spu_extract(in_2, 0);
        idx_3 = spu_extract(in_3, 0);

        slot_0 = (0 - idx_0) << 2;
        slot_1 = (0 - idx_1) << 2;
        slot_2 = (0 - idx_2) << 2;
        slot_3 = (0 - idx_3) << 2;

        idx_0 >>= 2;
        idx_1 >>= 2;
        idx_2 >>= 2;
        idx_3 >>= 2;

        cnts_0[idx_0] = spu_add(cnts_0[idx_0], spu_rlqwbyte(one, slot_0));
        cnts_1[idx_1] = spu_add(cnts_1[idx_1], spu_rlqwbyte(one, slot_1));
        cnts_2[idx_2] = spu_add(cnts_2[idx_2], spu_rlqwbyte(one, slot_2));
        cnts_3[idx_3] = spu_add(cnts_3[idx_3], spu_rlqwbyte(one, slot_3));

        //!-- Repeat for 1, 2, 3,
        //!-- Loop Unroll 2: - - -
    }

    /* Roll the counters into the overall (external) count array. */
    for (i=0; i<64; i+=4) {
        vector unsigned int sum0, sum1, sum2, sum3;

        sum0 = spu_add(cnts_0[i], cnts_1[i]);
        sum1 = spu_add(cnts_0[i+1], cnts_1[i+1]);
        sum2 = spu_add(cnts_0[i+2], cnts_1[i+2]);
        sum3 = spu_add(cnts_0[i+3], cnts_1[i+3]);

        sum0 = spu_add(sum0, cnts_2[i]);
        sum1 = spu_add(sum1, cnts_2[i+1]);
        sum2 = spu_add(sum2, cnts_2[i+2]);
        sum3 = spu_add(sum3, cnts_2[i+3]);

        vcounts[i] = spu_add(sum0, cnts_3[i]);
        vcounts[i+1] = spu_add(sum1, cnts_3[i+1]);
        vcounts[i+2] = spu_add(sum2, cnts_3[i+2]);
        vcounts[i+3] = spu_add(sum3, cnts_3[i+3]);
    }

Notes:
    The unrolled section in the first loop is repeated four times.
    The second loop rolls the 4 counters into one counter.

9 Normalized Sum
    float sc = PIXEL_MAX_VALUE / (float) IMAGE_SIZE;
    vector float vc = spu_splats((float)sc);
    float scr = 0.5;
    vector float vr = spu_splats((float) scr);
    vector float vf1, vf2;
    vector unsigned char splat0 = (vector unsigned char){0,1,2,3, 0,1,2,3, 0,1,2,3, 0,1,2,3};
    vector unsigned char splat1 = (vector unsigned char){128,128,128,128, 4,5,6,7, 4,5,6,7, 4,5,6,7};
    vector unsigned char splat2 = (vector unsigned char){128,128,128,128, 128,128,128,128, 8,9,10,11, 8,9,10,11};
    vector unsigned char splat3 = (vector unsigned char){12,13,14,15, 12,13,14,15, 12,13,14,15, 12,13,14,15};
    vector unsigned int mask3 = (vector unsigned int){0,0,0,-1};

    //!-- TODO: Convert it so the computation is pipelined.
    TRACE("Print the final character map: \n");
    for (i=0; i<size; i++) {
        v = counts[i];
        sum = spu_shuffle(sum, sum, splat3);   /* broadcast the previous running total */
        v0 = spu_shuffle(v, v, splat0);
        v1 = spu_shuffle(v, v, splat1);
        v2 = spu_shuffle(v, v, splat2);
        v3 = spu_and(v, mask3);
        sum = spu_add(spu_add(spu_add(sum, v3), v2), spu_add(v1, v0));

        //!-- Normalize, round it
        vf2 = spu_convtf(sum, 0);
        vf1 = spu_madd(vf2, vc, vr);
        mapvi[i] = spu_convtu(vf1, 0);

        for (j=0; j<4; j++) {
            var = spu_extract(mapvi[i], j);
            map[k] = BOUND(var);   //!-- TODO vectorize this
            TRACE("%d ", map[k]);
            k++;
        }
    }

Normalized Sum steps:
    1. Compute the sum for the 64 vector entries
    2. Multiply with the normalization constant
    3. Clamp it to be 0-255
    4. Store in a character map LUT

Running sum for v = count[i] (X marks a zeroed slot):
        v0  v0  v0  v0
    +   X   v1  v1  v1
    +   X   X   v2  v2
    +   X   X   X   v3
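The shuffle-based running sum on this slide can be checked with a scalar emulation. The sketch below is an illustration under assumptions: `vec4u` stands in for a `vector unsigned int`, the names `prefix_step` and `normalized_sum` are hypothetical, and the clamp is written out by hand instead of using the deck's BOUND macro.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for one "vector unsigned int": four 32-bit slots. */
typedef struct { uint32_t s[4]; } vec4u;

/* One step of the running sum. `carry` is the previous result vector. */
static vec4u prefix_step(vec4u carry, vec4u v)
{
    /* Rows of the slide diagram; 0 marks the slots shown as X. */
    const uint32_t rows[5][4] = {
        { carry.s[3], carry.s[3], carry.s[3], carry.s[3] }, /* splat3 of old sum */
        { v.s[0],     v.s[0],     v.s[0],     v.s[0]     }, /* v0 */
        { 0,          v.s[1],     v.s[1],     v.s[1]     }, /* v1 */
        { 0,          0,          v.s[2],     v.s[2]     }, /* v2 */
        { 0,          0,          0,          v.s[3]     }, /* v3 */
    };
    vec4u out = {{0, 0, 0, 0}};
    for (int r = 0; r < 5; r++)
        for (int lane = 0; lane < 4; lane++)
            out.s[lane] += rows[r][lane];
    return out;
}

/* Build the 256-entry map from a 256-bin histogram, four bins at a time. */
static void normalized_sum(const uint32_t counts[256], uint32_t num_pixels,
                           uint8_t map[256])
{
    assert(num_pixels > 0);
    float scale = 255.0f / (float)num_pixels;
    vec4u sum = {{0, 0, 0, 0}};

    for (int i = 0; i < 64; i++) {
        vec4u v;
        for (int lane = 0; lane < 4; lane++)
            v.s[lane] = counts[4 * i + lane];
        sum = prefix_step(sum, v);

        for (int lane = 0; lane < 4; lane++) {
            /* Multiply-add with 0.5, then truncate, as in the slide's
             * spu_madd / spu_convtu pair. */
            uint32_t m = (uint32_t)((float)sum.s[lane] * scale + 0.5f);
            map[4 * i + lane] = (uint8_t)(m > 255 ? 255 : m);
        }
    }
}
```

For example, a previous total of 10 followed by counts {1, 2, 3, 4} gives running sums {11, 13, 16, 20}: each slot receives the carried total plus every count up to and including its own.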

10 Transform the image
[Diagram: lookup-table vectors labelled 0 - 15 feed byte shuffles that use the MSB 5 bits of each index; the shuffle results (labelled 0 to 7) then pass through three select stages: select using index bit 2, select using index bit 1, select using index bit 0.]
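One possible reading of this diagram is a 256-entry table lookup done as eight 32-byte shuffles followed by a three-level select tree. The scalar sketch below follows that reading; it is an interpretation, not the deck's code. The sub-table layout (entries grouped by their low three index bits) and the names `build_subtables`, `select_bit`, `lookup`, and `transform` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rearrange the map into 8 sub-tables of 32 bytes, stored back to back.
 * Sub-table k holds the entries whose low three index bits equal k. */
static void build_subtables(const uint8_t map[256], uint8_t sub[256])
{
    for (unsigned p = 0; p < 256; p++)
        sub[(p & 7) * 32 + (p >> 3)] = map[p];
}

/* Return b if the given bit of index is set, else a (a one-byte select). */
static uint8_t select_bit(uint8_t a, uint8_t b, unsigned index, unsigned bit)
{
    return ((index >> bit) & 1) ? b : a;
}

static uint8_t lookup(const uint8_t sub[256], uint8_t pixel)
{
    uint8_t c[8];

    /* Byte shuffle: the 5 most significant bits pick one byte per sub-table. */
    for (unsigned k = 0; k < 8; k++)
        c[k] = sub[k * 32 + (pixel >> 3)];

    /* Select using index bit 2: 8 candidates -> 4. */
    uint8_t d0 = select_bit(c[0], c[4], pixel, 2);
    uint8_t d1 = select_bit(c[1], c[5], pixel, 2);
    uint8_t d2 = select_bit(c[2], c[6], pixel, 2);
    uint8_t d3 = select_bit(c[3], c[7], pixel, 2);

    /* Select using index bit 1: 4 candidates -> 2. */
    uint8_t e0 = select_bit(d0, d2, pixel, 1);
    uint8_t e1 = select_bit(d1, d3, pixel, 1);

    /* Select using index bit 0: 2 candidates -> 1. */
    return select_bit(e0, e1, pixel, 0);
}

/* Stage 3: map every input pixel through the shuffled table. */
static void transform(const uint8_t *src, uint8_t *dest, size_t n,
                      const uint8_t map[256])
{
    uint8_t sub[256];
    assert(src != NULL && dest != NULL);
    build_subtables(map, sub);
    for (size_t i = 0; i < n; i++)
        dest[i] = lookup(sub, src[i]);
}
```

On a SIMD unit the same tree would process 16 pixels per instruction, since every shuffle and select works on a whole vector of indices at once; the scalar version only demonstrates that the tree returns `map[pixel]` for every index.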

11 Performance Results
Environment:
    The benchmark was written in C and compiled with the xlc compiler.
    IBM Systemsim and a Cell Blade were used to collect performance numbers.
    Sample grayscale image (pieh2.pgm)
Configuration:
    The Cell blade is running at 3.2GHz.
    DMA operations are not counted in the calculation.
    Performance numbers are derived from the cycle counts collected on a single SPE.
Performance numbers:
    Histogram computation & image mapping (stages 1, 2, 3) combined at 0.50 Gigapixels/second for 100K
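As a rough sanity check, the throughput figure can be converted into cycles per pixel. The helper below is a hypothetical illustration; it assumes the 3.2GHz clock quoted for the blade is the clock behind the single-SPE cycle counts.

```c
#include <assert.h>

/* Cycles spent per pixel for a given clock rate and pixel throughput. */
static double cycles_per_pixel(double clock_hz, double pixels_per_second)
{
    assert(pixels_per_second > 0.0);
    return clock_hz / pixels_per_second;
}
```

Under that assumption, `cycles_per_pixel(3.2e9, 0.50e9)` comes to about 6.4 cycles per pixel for the three stages combined, with DMA excluded as stated on the slide.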

