The Perception Processor Binu Mathew Advisor: Al Davis
What is Perception Processing? Ubiquitous computing needs natural human interfaces; processor support for perceptual applications: gesture recognition; object detection, recognition, tracking; speech recognition; speaker identification. Applications: multi-modal human-friendly interfaces, intelligent digital assistants, robotics, unmanned vehicles, perception prosthetics.
The Problem with Perception Processing
The Problem with Perception Processing Too slow, too much power for the embedded space! 2.4 GHz Pentium 4 ~ 60 Watts; 400 MHz XScale ~ 800 mW; 10x or more difference in performance. Inadequate memory bandwidth: Sphinx requires 1.2 GB/s of memory bandwidth; XScale delivers 64 MB/s, ~1/19th. Approach: characterize the applications to find the problem, then derive an acceleration architecture. The history of FPUs is an analogy.
High Level Architecture Processor Coprocessor Interface Memory Controller Input SRAMs Custom Accelerator Output SRAM DRAM Interface Scratch SRAMS
Thesis Statement It is possible to design programmable processors that can handle sophisticated perception workloads in real-time at power budgets suitable for embedded devices.
The FaceRec Application
FaceRec In Action Rob Evans
Application Structure (pipeline): Flesh tone Image → Segment Image → Rowley Face Detector and Viola & Jones Face Detector → Neural Net Eye Locator → Eigenfaces Face Recognizer → Identity, Coordinates. Flesh toning: Soriano et al, Bertran et al. Segmentation: textbook approach. Rowley detector, voter: Henry Rowley, CMU. Viola & Jones' detector: published algorithm + Carbonetto, UBC. Eigenfaces: re-implementation by Colorado State University.
FaceRec Characterization ML-RSIM out-of-order processor simulator; SPARC V8 ISA, unmodified SunOS binaries. Out-of-order processor similar to a 2 GHz Intel Pentium 4: 1-4 ALUs, 1-4 FPUs; max 4 issue; max 4 graduations/cycle; 16 KB 2-way L1 I-cache; 16-64 KB 2-way L1 D-cache; 256 KB-2 MB 2-way L2 cache; 600 MHz, 64-bit DDR memory interface. In-order processor similar to a 400 MHz Intel XScale: 1 ALU, 1 FPU; max 1 issue; max 1 graduation/cycle; 32 KB 32-way L1 I-cache; 32 KB 32-way L1 D-cache; no L2 cache; 100 MHz, 32-bit SDR memory interface.
Application Profile
Memory System Characteristics – L1 D Cache
Memory System Characteristics – L2 Cache
IPC
Why is IPC low? Neural network evaluation: Sum = Σ_{i=0}^{n} Weight[i] * Image[Input[i]]; Result = Tanh(Sum). Dependences, e.g. no single-cycle floating point accumulate. Indirect accesses; several array accesses per operator; load/store ports saturate. Need architectures that can move data efficiently.
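The bottleneck is visible in a few lines of Python (a behavioral sketch following the formula above, not the actual FaceRec source):

```python
import math

def evaluate_neuron(weight, inputs, image):
    """Weighted sum over indirectly addressed pixels, then tanh.

    Each term needs three loads: weight[i], inputs[i], and the
    dependent load image[inputs[i]], so several array accesses feed
    a single multiply-add and the load/store ports saturate.
    """
    total = 0.0
    for i in range(len(weight)):
        total += weight[i] * image[inputs[i]]  # indirect access
    return math.tanh(total)

# Tiny example: 3 weights; the indirection table picks pixels 2, 0, 1
print(evaluate_neuron([0.5, -1.0, 0.25], [2, 0, 1], [4.0, 8.0, 2.0]))
```

The serial dependence through `total` is exactly the missing single-cycle floating point accumulate.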
Real Time Performance
Example App: CMU Sphinx 3.2 Speech recognition engine Speaker and language independent Acoustic model: Triphone based, continuous Hidden Markov Model (HMM) based Grammar: Trigram with back-off Open source HUB4 speech model Broadcast news model (ABC news, NPR etc) 64000 word vocabulary
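The trigram-with-back-off grammar works roughly as follows (a minimal sketch with hypothetical probability tables, not Sphinx's actual data structures):

```python
def trigram_prob(w1, w2, w3, trigrams, bigrams, unigrams, backoff):
    """Back-off lookup: use the trigram probability if the history was
    seen in training, else a scaled bigram, else a scaled unigram."""
    if (w1, w2, w3) in trigrams:
        return trigrams[(w1, w2, w3)]
    if (w2, w3) in bigrams:
        return backoff.get((w1, w2), 1.0) * bigrams[(w2, w3)]
    return backoff.get((w2,), 1.0) * unigrams.get(w3, 1e-9)

# Hypothetical tables: the trigram "the cat sat" was seen in training,
# "a cat sat" was not, so the second query backs off to the bigram.
trigrams = {("the", "cat", "sat"): 0.6}
bigrams = {("cat", "sat"): 0.3}
unigrams = {"sat": 0.05}
backoff = {("a", "cat"): 0.4}

print(trigram_prob("the", "cat", "sat", trigrams, bigrams, unigrams, backoff))
print(trigram_prob("a", "cat", "sat", trigrams, bigrams, unigrams, backoff))
```

With a 64000-word vocabulary, these table lookups over large sparse models are a major source of the memory traffic profiled on the following slides.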
CMU Sphinx 3.2 Profile
L1 D-cache Miss Rate
L2 Cache Miss Rate
DRAM Bandwidth
IPC
High Level Architecture Processor Coprocessor Interface Memory Controller Input SRAMs Custom Accelerator Output SRAM DRAM Interface Scratch SRAMS
ASIC Accelerator Design: Matrix Multiply

    def matrix_multiply(A, B, C):
        # C is the result matrix
        for i in range(0, 16):
            for j in range(0, 16):
                C[i][j] = inner_product(A, B, i, j)

    def inner_product(A, B, row, col):
        sum = 0.0
        for i in range(0, 16):
            sum = sum + A[row][i] * B[i][col]
        return sum

The same code, viewed three ways:
Control pattern: the loop nest (the for statements and their bounds).
Access pattern: the array references A[row][i], B[i][col], C[i][j].
Compute pattern: the multiply-accumulate sum = sum + A[row][i] * B[i][col].

The floating point accumulate has a 7 cycle latency, so keeping the pipeline full requires interleaving >= 7 inner products, which complicates address generation.
How can we generalize? Decompose a loop into: control pattern, access pattern, compute pattern. Programmable h/w acceleration for each pattern.
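A toy software analogue of the decomposition (the run_loop interface here is hypothetical, purely to illustrate the idea; the real accelerator implements each pattern in programmable hardware):

```python
def run_loop(control, access, compute, state):
    """Toy analogue of the decomposition: the control pattern generates
    index tuples, the access pattern turns them into operands, and the
    compute pattern consumes operands and updates state."""
    for idx in control():
        operands = access(idx)
        state = compute(state, operands)
    return state

# An inner product of two 4-vectors expressed in the three patterns
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
result = run_loop(
    control=lambda: range(4),                    # control: i = 0..3
    access=lambda i: (a[i], b[i]),               # access: stream a[i], b[i]
    compute=lambda s, ops: s + ops[0] * ops[1],  # compute: multiply-accumulate
    state=0.0,
)
print(result)  # 300.0
```

Because the three patterns are separated, each can be accelerated by its own dedicated unit (loop unit, address generators, function units) instead of competing for a general-purpose pipeline.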
The Perception Processor Architecture Family
Perception Processor Pipeline
Function Unit Organization
Interconnect
Loop Unit
Address Generator A[(i+k1)<<k2+k3][(j+k4)<<k5+k6] A[B[i]]
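Behaviorally, the two addressing modes on this slide amount to the following (a sketch; the constants k1..k6 and the row size are illustrative):

```python
def affine_address(base, i, j, k1, k2, k3, k4, k5, k6, row_size):
    """Flat address for A[((i+k1)<<k2) + k3][((j+k4)<<k5) + k6].
    Row and column offsets need only shifts and adds; the row_size
    multiply is also a shift when row_size is a power of two."""
    row = ((i + k1) << k2) + k3
    col = ((j + k4) << k5) + k6
    return base + row * row_size + col

def gather(A, B, i):
    """Indirect pattern A[B[i]]: the loaded value B[i] feeds the
    address generation for the next load."""
    return A[B[i]]

# With all constants zero this reduces to base + i*row_size + j
print(affine_address(0, 2, 3, 0, 0, 0, 0, 0, 0, 16))  # 35
print(gather([5, 6, 7], [2, 0, 1], 0))                # 7
```

Generating these addresses in dedicated hardware removes the per-element address arithmetic that saturates the load/store path on a conventional pipeline.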
Inner Product Micro-code

    i_loop = LoopContext(start_count=0, end_count=15, increment=1, II=7)
    A_ri = AddressContext(port=inq.a_port, loop0=row_loop, rowsize=16, loop1=i_loop, base=0)
    B_ic = AddressContext(port=inq.b_port, loop0=i_loop, rowsize=16, loop1=Constant, base=256)

    for i in LOOP(i_loop):
        t0 = LOAD(fpu0.a_reg, A_ri)
        for k in range(0, 7):  # Will be unrolled 7x
            AT(t0 + k)
            t1 = LOAD(fpu0.b_reg, B_ic, loop1_constant=k)
            AT(t1)
            t2 = fpu0.mult(fpu0.a_reg, fpu0.b_reg)
            AT(t2)
            t3 = TRANSFER(fpu1.b_reg, fpu0)
            AT(t3)
            fpu1.add(fpu1, fpu1.b_reg)
Loop Scheduling
Unroll and Software Pipeline
Modulo Scheduling
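The effect of modulo scheduling can be illustrated numerically. Assuming the 7-cycle accumulate latency from the matrix-multiply example and an initiation interval (II) of one cycle, seven iterations are in flight at once (this is a toy cycle count, not the scheduler itself):

```python
def modulo_schedule(n_iters, latency, ii):
    """(start, finish) cycles when a new iteration is initiated every
    `ii` cycles and each iteration's dependent chain takes `latency`."""
    return [(k * ii, k * ii + latency) for k in range(n_iters)]

def max_in_flight(schedule):
    """Peak number of simultaneously active iterations."""
    end = schedule[-1][1]
    return max(sum(1 for s, f in schedule if s <= t < f) for t in range(end))

# 16 inner-product iterations, 7-cycle accumulate, one started per cycle
sched = modulo_schedule(n_iters=16, latency=7, ii=1)
print(max_in_flight(sched))  # 7 iterations overlap
```

Those seven overlapping iterations are exactly why the address registers of in-flight iterations collide, the problem the next slides address.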
Modulo Scheduling - Problem [figure: overlapped iterations i,j through i+3,j in flight simultaneously]
Traditional Solution Generate multiple copies of address calculation instructions Use register rotation to fix dependences
Array Variable Renaming [figure: in-flight array accesses distinguished by tags 0-3]
Array Variable Renaming
Experimental Method Measure processor power on: 2.4 GHz Pentium 4 (0.13u process); 400 MHz XScale (0.18u process); Perception Processor at 1 GHz, 0.13u process (Berkeley Predictive Technology Model). Verilog and MCL HDLs, synthesized using Synopsys Design Compiler with fanout-based heuristic wire loads. Spice (Nanosim) simulation yields a current waveform; numerical integration calculates energy. ASICs in 0.25u process. Normalize the 0.18u and 0.25u energy and delay numbers.
Benchmarks Visual feature recognition: Erode, Dilate (image segmentation operators); Fleshtone (NCC flesh tone detector); Viola, Rowley (face detectors). Speech recognition: HMM (5-state Hidden Markov Model); GAU (39-element, 8-mixture Gaussian). DSP: FFT (128-point, complex-to-complex, floating point); FIR (32-tap, integer). Encryption: Rijndael (128-bit key, 576-byte packets).
Results: IPC Mean IPC = 3.3x R14K
Results: Throughput Mean Throughput = 1.75x Pentium 0.41x ASIC
Results: Energy Mean Energy/packet = 7.4% of XScale 5x of ASIC
Results: Clock Gating Synergy Mean Power Savings = 39.5%
Results: Energy Delay Product Mean EDP advantage = 159x vs. XScale, 1/12 vs. ASIC
The Cost of Generality: PP+ Intel XScale Processor Coprocessor Interface Memory Controller Input SRAMs Custom Accelerator Output SRAM DRAM Interface Scratch SRAMS
Results: Energy of PP+ Mean Energy/packet = 18.2% of XScale 12.4x of ASIC
Results: Energy Delay Product of PP+ Mean EDP advantage = 64x vs. XScale, 1/30 vs. ASIC
Results: Summary 41% of ASIC’s performance But programmable! 1.75 times the Pentium 4’s throughput But 7.4% of the energy of an XScale!
Related Work Johnny Pihl's PDF coprocessor. Anantharaman and Bisiani: beam search optimization for the CMU recognizer. SPERT, MultiSPERT (UC Berkeley). Corporaal et al's MOVE processor: transport triggered architecture. Vector chaining (Cray 1). MIT RAW machine (Agarwal). Stanford Imagine (Dally). Bit-reversed addressing modes in DSPs.
Contributions Programmable architecture for perception and stream computations Energy efficient, custom flows w/o register files Drastic reductions in power while simultaneously improving ILP Pattern oriented loop accelerators for improved data delivery and throughput Array variable renaming generalizes register rotation Compiler directed data-flow generalizes vector chaining Rapid semi-automated generation of application specific processors Makes real-time low-power perception possible!
Future Work Loop pattern accelerators for more scheduling regimes and data structures. Programming language support. Automated architecture exploration. Generic stream processors. Architectures for list comprehension, map(), reduce(), filter() in h/w? e.g.: B = [(K1*i, K2*i*i) for i in A if i % 2 != 0]
Thanks! Questions ?
Future Work
Flesh Toning
Image Segmentation Erosion operator: 3 x 3 matrix; remove a pixel if not all of its neighbors are set; removes false connections between objects. Dilation operator: 5 x 5 matrix; set a pixel if any neighbor is set; smooths boundaries and fills holes in objects. Connected components: cut the image into rectangles.
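The two morphological operators can be sketched directly from their definitions (binary image as a list of lists; border pixels are simply left unset here):

```python
def erode(img, w, h):
    """3x3 erosion: keep a pixel only if it and all 8 neighbors are
    set; removes false connections between objects."""
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(img[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out

def dilate(img, w, h, r=2):
    """5x5 dilation (r=2): set a pixel if any pixel in the window is
    set; smooths boundaries and fills holes in objects."""
    out = [[0] * w for _ in range(h)]
    for y in range(r, h - r):
        for x in range(r, w - r):
            out[y][x] = int(any(img[y + dy][x + dx]
                                for dy in range(-r, r + 1)
                                for dx in range(-r, r + 1)))
    return out

# A 3x3 block of ones: erosion keeps only the fully surrounded center
img = [[0, 0, 0, 0, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 0, 0, 0, 0]]
print(sum(map(sum, erode(img, 5, 5))))  # 1: only the center pixel survives
```

Each output pixel is an independent window reduction over a fixed neighborhood, which is why these operators map so well onto a streaming accelerator.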
Rowley Detector Neural network based, 30 x 30 window. Specialized neurons for horizontal and vertical strips. Multiple independent networks for accuracy. Typically 100 neurons, 100-150 inputs each. Henry Rowley's implementation provided by CMU. Output: face or not face?
Viola and Jones' Detector Feature/wavelet based, 30 x 30 window. AdaBoost boosting algorithm combines weak heuristics to make stronger ones. Feature = sum/difference of rectangles; ~100 features. Integral image representation. Our implementation based on the published algorithm.
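The integral image is what makes rectangle features cheap; a minimal sketch of the standard construction (not the thesis code):

```python
def integral_image(img, w, h):
    """ii[y][x] = sum of all pixels above and to the left of (y, x),
    inclusive; an extra zero row and column simplify the lookups."""
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of any rectangle in 4 lookups, so a sum/difference-of-
    rectangles feature costs a handful of adds regardless of size."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

img = [[1, 2],
       [3, 4]]
ii = integral_image(img, 2, 2)
print(rect_sum(ii, 0, 0, 2, 2))  # 10: whole image
print(rect_sum(ii, 1, 0, 1, 2))  # 6: right column (2 + 4)
```

A feature's value is then just rect_sum of one rectangle minus rect_sum of another, independent of the rectangles' areas.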
Face Detection Example
Eigenfaces – Face Recognizer Known faces stored as “face space” representation Test image is projected to face space and distance from known face computed Closest distance gives identity of person Matrix multiply and transpose operations, Eigen values Eye co-ordinates provided by neural net Original algorithm by Pentland, MIT Re-implemented by researchers at Colorado State University
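The recognition step is a projection followed by a nearest-neighbor search; a toy sketch with made-up eigenvectors and gallery coordinates (real face spaces have far more dimensions):

```python
def project(image, mean, eigenvectors):
    """Project a mean-subtracted image onto the face space: one dot
    product per eigenvector, i.e. a matrix-vector multiply."""
    centered = [p - m for p, m in zip(image, mean)]
    return [sum(e * c for e, c in zip(ev, centered)) for ev in eigenvectors]

def nearest_face(coords, gallery):
    """Identity = gallery entry at the smallest squared distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(gallery, key=lambda name: dist2(coords, gallery[name]))

# Hypothetical 4-pixel "images" and a 2-dimensional face space
mean = [1.0, 1.0, 1.0, 1.0]
eigenvectors = [[0.5, 0.5, 0.5, 0.5],
                [0.5, 0.5, -0.5, -0.5]]
gallery = {"alice": [2.0, 0.0], "bob": [-1.0, 3.0]}
coords = project([3.0, 2.0, 1.0, 1.0], mean, eigenvectors)
print(coords, nearest_face(coords, gallery))
```

The dominant cost is the projection's dense dot products, which is why this stage reduces to the matrix multiply and transpose operations named on the slide.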
L1 Cache Hit Rate - Explanation 320 x 200 color image = approximately 180 KB; gray scale version = 64 KB. Only flesh toning touches the color image, one pixel at a time. The detectors work at the 30 x 30 scale: Viola needs 5.2 KB of tables and image rows; Rowley needs approximately 80 KB per neural net, but accesses it in streaming mode.
A Brief Introduction to Speech Recognition
Sphinx 3 : Profile