Presentation transcript: "The Perception Processor"

1 The Perception Processor
Binu Mathew. Advisor: Al Davis.

2 What is Perception Processing?
Ubiquitous computing needs natural human interfaces.
Processor support for perceptual applications:
- Gesture recognition
- Object detection, recognition, tracking
- Speech recognition
- Speaker identification
Applications:
- Multi-modal, human-friendly interfaces
- Intelligent digital assistants
- Robotics, unmanned vehicles
- Perception prosthetics

3 The Problem with Perception Processing

4 The Problem with Perception Processing
Too slow, too much power for the embedded space!
- 2.4 GHz Pentium 4: ~60 Watts
- 400 MHz XScale: ~800 mW
- 10x or more difference in performance between the two
Inadequate memory bandwidth:
- Sphinx requires 1.2 GB/s of memory bandwidth
- XScale delivers 64 MB/s, roughly 1/19th of that
Approach:
- Characterize the applications to find the problem
- Derive an acceleration architecture
- The history of FPUs is an analogy

5 High Level Architecture
(Block diagram) Processor, Coprocessor Interface, Memory Controller, DRAM Interface, Custom Accelerator, Input SRAMs, Output SRAM, Scratch SRAMs

6 Thesis Statement It is possible to design programmable processors that can handle sophisticated perception workloads in real time at power budgets suitable for embedded devices.

7 The FaceRec Application

8 FaceRec In Action (demo subject: Rob Evans)

9 Application Structure
(Block diagram) Pipeline: Flesh tone image → Segment image → Viola & Jones face detector / Rowley face detector → Neural net eye locator → Eigenfaces face recognizer → Identity, coordinates
Credits:
- Flesh toning: Soriano et al., Bertran et al.
- Segmentation: textbook approach
- Rowley detector, voter: Henry Rowley, CMU
- Viola & Jones detector: published algorithm + Carbonetto, UBC
- Eigenfaces: re-implementation by Colorado State University

10 FaceRec Characterization
ML-RSIM out-of-order processor simulator: SPARC V8 ISA, unmodified SunOS binaries.
Out-of-order configuration (similar to a 2 GHz Intel Pentium 4):
- 1-4 ALUs, 1-4 FPUs
- Max 4 issue, max 4 graduations/cycle
- 16 KB 2-way L1 I-cache
- 16-64 KB 2-way L1 D-cache
- 256 KB-2 MB 2-way L2 cache
- 600 MHz, 64-bit DDR memory interface
In-order configuration (similar to a 400 MHz Intel XScale):
- 1 ALU, 1 FPU
- Max 1 issue, max 1 graduation/cycle
- 32 KB 32-way L1 I-cache
- 32 KB 32-way L1 D-cache
- No L2 cache
- 100 MHz, 32-bit SDR memory interface

11 Application Profile

12 Memory System Characteristics – L1 D Cache

13 Memory System Characteristics – L2 Cache

14 IPC

15 Why is IPC low?
Neural network evaluation:
    Sum = Σ_{i=0..n} Weight[i] * Image[Input[i]]
    Result = tanh(Sum)
- Dependences: no single-cycle floating-point accumulate, for example
- Indirect accesses
- Several array accesses per operator
- Load/store ports saturate
Need architectures that can move data efficiently.
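A minimal Python sketch of the evaluation above (function and variable names are illustrative, not from the Rowley sources), showing both bottlenecks: two loads per term, one of them indirect, feeding a serial accumulation chain.

import math

def evaluate_neuron(weight, image, input_idx):
    # Each term needs two loads: the weight and an indirect image access.
    # The accumulation is a serial dependence chain, so every add must
    # wait for the previous one to leave the multi-cycle FP adder.
    total = 0.0
    for i in range(len(weight)):
        total += weight[i] * image[input_idx[i]]  # indirect array access
    return math.tanh(total)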

16 Real Time Performance

17 Example App: CMU Sphinx 3.2
- Speech recognition engine, speaker- and language-independent
- Acoustic model: triphone-based, continuous Hidden Markov Model (HMM)
- Grammar: trigram with back-off
- Open source
- HUB4 speech model: broadcast news (ABC News, NPR, etc.), 64,000-word vocabulary

18 CMU Sphinx 3.2 Profile

19 L1 D-cache Miss Rate

20 L2 Cache Miss Rate

21 DRAM Bandwidth

22 IPC

23 High Level Architecture
(Block diagram) Processor, Coprocessor Interface, Memory Controller, DRAM Interface, Custom Accelerator, Input SRAMs, Output SRAM, Scratch SRAMs

24 ASIC Accelerator Design: Matrix Multiply
def matrix_multiply(A, B, C):  # C is the result matrix
    for i in range(0, 16):
        for j in range(0, 16):
            C[i][j] = inner_product(A, B, i, j)

def inner_product(A, B, row, col):
    sum = 0.0
    for i in range(0, 16):
        sum = sum + A[row][i] * B[i][col]
    return sum
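A quick smoke test of the routine above (the test matrices are illustrative, not from the slides):

A = [[float(i == j) for j in range(16)] for i in range(16)]  # identity matrix
B = [[i * 16.0 + j for j in range(16)] for i in range(16)]
C = [[0.0] * 16 for _ in range(16)]
matrix_multiply(A, B, C)
assert C == B  # I * B == B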

25 ASIC Accelerator Design: Matrix Multiply
Control pattern: the nested loop structure (for i / for j / inner for i) of the matrix_multiply code on slide 24.

26 ASIC Accelerator Design: Matrix Multiply
Access pattern: the array references (A[row][i], B[i][col], C[i][j]) in the same code.

27 ASIC Accelerator Design: Matrix Multiply
Compute pattern: the multiply-accumulate, sum = sum + A[row][i] * B[i][col].

28 ASIC Accelerator Design: Matrix Multiply

29 ASIC Accelerator Design: Matrix Multiply
7-cycle latency: the floating-point accumulate chain in the inner product (code on slide 24) completes only one term every 7 cycles.

30 ASIC Accelerator Design: Matrix Multiply
Interleave >= 7 inner products to hide that latency. Doing so complicates address generation.

31 How can we generalize?
Decompose the loop into:
- Control pattern
- Access pattern
- Compute pattern
Programmable h/w acceleration for each pattern (see the sketch after this list).
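To make the decomposition concrete, here is the inner product from slide 24 pulled apart along those three lines (a minimal sketch; the function names are mine, not the thesis's):

def control():
    # Control pattern: iteration sequencing, handled by the loop unit.
    for i in range(16):
        yield i

def access(A, B, row, col, i):
    # Access pattern: the locations touched each iteration,
    # handled by the address generators.
    return A[row][i], B[i][col]

def compute(acc, a, b):
    # Compute pattern: the arithmetic, handled by the function units.
    return acc + a * b

def inner_product(A, B, row, col):
    acc = 0.0
    for i in control():
        a, b = access(A, B, row, col, i)
        acc = compute(acc, a, b)
    return acc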

32 The Perception Processor Architecture Family

33 Perception Processor Pipeline

34 Function Unit Organization

35 Interconnect

36 Loop Unit

37 Address Generator
Affine pattern: A[((i + k1) << k2) + k3][((j + k4) << k5) + k6]
Indirect pattern: A[B[i]]
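A minimal software model of the two address computations (the base and rowsize parameters follow the AddressContext fields in the microcode on the next slide; the exact hardware datapath is not shown here):

def affine_address(base, rowsize, i, j, k1, k2, k3, k4, k5, k6):
    # Row and column indices are each ((index + offset) << shift) + offset,
    # so the index arithmetic needs only adders and shifters.
    row = ((i + k1) << k2) + k3
    col = ((j + k4) << k5) + k6
    # In hardware the row * rowsize multiply would itself be a shift,
    # assuming rowsize is a power of two (e.g. 16 in the microcode below).
    return base + row * rowsize + col

def indirect_address(base, B, i):
    # The table-lookup form used for patterns like A[B[i]].
    return base + B[i]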

38 Inner Product Micro-code
i_loop = LoopContext(start_count=0, end_count=15, increment=1, II=7)
A_ri = AddressContext(port=inq.a_port, loop0=row_loop, rowsize=16,
                      loop1=i_loop, base=0)
B_ic = AddressContext(port=inq.b_port, loop0=i_loop, rowsize=16,
                      loop1=Constant, base=256)

for i in LOOP(i_loop):
    t0 = LOAD(fpu0.a_reg, A_ri)                    # fetch A[row][i]
    for k in range(0, 7):                          # will be unrolled 7x
        AT(t0 + k)
        t1 = LOAD(fpu0.b_reg, B_ic, loop1_constant=k)  # fetch B[i][col]
        AT(t1)
        t2 = fpu0.mult(fpu0.a_reg, fpu0.b_reg)     # multiply
        AT(t2)
        t3 = TRANSFER(fpu1.b_reg, fpu0)            # forward product to the adder
        AT(t3)
        fpu1.add(fpu1, fpu1.b_reg)                 # accumulate

39 Loop Scheduling

40 Unroll and Software Pipeline

41 Modulo Scheduling

42 Modulo Scheduling - Problem
(Figure: four overlapped iterations, (i, j) through (i+3, j), in flight at once.)

43 Traditional Solution
- Generate multiple copies of the address calculation instructions
- Use register rotation to fix the dependences
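A toy software model of register rotation (the depth and addresses are assumed for illustration; in real hardware, e.g. Itanium-style rotating register files, the remapping is done by the register file itself):

NUM_ROT = 4                          # rotating register depth (assumed)
regs = [None] * NUM_ROT
base = 0

def rotate():
    global base
    base = (base - 1) % NUM_ROT      # the base moves once per iteration

def write(r, value):
    regs[(base + r) % NUM_ROT] = value

def read(r):
    return regs[(base + r) % NUM_ROT]

for i in range(8):
    write(0, 0x1000 + 4 * i)         # address produced by iteration i into r0
    if i >= 3:
        # A consumer scheduled 3 iterations later sees the same value as r3,
        # so no extra instruction copies are needed.
        assert read(3) == 0x1000 + 4 * (i - 3)
    rotate()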


45 Array Variable Renaming
(Figure: in-flight array accesses distinguished by tags tag=0 through tag=3.)

46 Array Variable Renaming

47 Array Variable Renaming

48 Experimental Method
Measure processor power on:
- 2.4 GHz Pentium 4, 0.13 µm process
- 400 MHz XScale, 0.18 µm process
- Perception Processor: 1 GHz, 0.13 µm process (Berkeley Predictive Technology Model)
Perception Processor flow:
- Verilog and MCL HDLs, synthesized using Synopsys Design Compiler
- Fanout-based heuristic wire loads
- Spice (Nanosim) simulation yields the current waveform; numerical integration calculates energy
- ASICs in a 0.25 µm process
- 0.18 µm and 0.25 µm energy and delay numbers normalized for process

49 Benchmarks
Visual feature recognition:
- Erode, Dilate: image segmentation operators
- Fleshtone: NCC flesh tone detector
- Viola, Rowley: face detectors
Speech recognition:
- HMM: 5-state Hidden Markov Model
- GAU: 39-element, 8-mixture Gaussian
DSP:
- FFT: 128-point, complex-to-complex, floating point
- FIR: 32-tap, integer
Encryption:
- Rijndael: 128-bit key, 576-byte packets

50 Results: IPC
Mean IPC = 3.3x that of the R14K

51 Results: Throughput
Mean throughput = 1.75x Pentium 4, 0.41x ASIC

52 Results: Energy
Mean energy/packet = 7.4% of XScale, 5x of ASIC

53 Results: Clock Gating Synergy
Mean Power Savings = 39.5%

54 Results: Energy Delay Product
Mean EDP = 159x better than XScale's, 12x the ASIC's

55 The Cost of Generality: PP+
(Block diagram) Intel XScale processor, Coprocessor Interface, Memory Controller, DRAM Interface, Custom Accelerator, Input SRAMs, Output SRAM, Scratch SRAMs

56 Results: Energy of PP+
Mean energy/packet = 18.2% of XScale, 12.4x of ASIC

57 Results: Energy Delay Product of PP+
Mean EDP = 64x better than XScale's, 30x the ASIC's

58 Results: Summary
- 41% of the ASIC's performance, but programmable!
- 1.75x the Pentium 4's throughput, but only 7.4% of the XScale's energy!

59 Related Work
- Johnny Pihl's PDF coprocessor
- Anantharaman and Bisiani: beam search optimization for the CMU recognizer
- SPERT, MultiSPERT (UC Berkeley)
- Corporaal et al.'s MOVE processor: transport-triggered architecture
- Vector chaining (Cray-1)
- MIT RAW machine (Agarwal)
- Stanford Imagine (Dally)
- Bit-reversed addressing modes in DSPs

60 Contributions Programmable architecture for perception and stream computations Energy efficient, custom flows w/o register files Drastic reductions in power while simultaneously improving ILP Pattern oriented loop accelerators for improved data delivery and throughput Array variable renaming generalizes register rotation Compiler directed data-flow generalizes vector chaining Rapid semi-automated generation of application specific processors Makes real-time low-power perception possible!

61 Future Work
- Loop pattern accelerators for more scheduling regimes and data structures
- Programming language support
- Automated architecture exploration
- Generic stream processors: architectures for list comprehension, map(), reduce(), filter() in h/w?
  e.g.: B = [ (K1*i, K2*i*i) for i in A if i % 2 != 0 ]  (decomposed below)
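For reference, the example above rephrased with the primitives the slide names (K1, K2, and A chosen arbitrarily):

K1, K2 = 3, 5
A = range(10)
B = list(map(lambda i: (K1 * i, K2 * i * i),      # the compute step
             filter(lambda i: i % 2 != 0, A)))    # the selection step
assert B == [(K1 * i, K2 * i * i) for i in A if i % 2 != 0]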

62 Thanks! Questions?

63 Future Work

64 Flesh Toning

65 Image Segmentation
Erosion operator:
- 3 x 3 matrix
- Remove a pixel unless all of its neighbors are set
- Removes false connections between objects
Dilation operator:
- 5 x 5 matrix
- Set a pixel if any neighbor is set
- Smooths out and fills holes in objects
Connected components: cut the image into rectangles
Both operators are sketched below.
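A scalar sketch of the two operators (border handling simplified; binary images represented as lists of 0/1 rows):

def erode(img, h, w):
    # 3x3 erosion: keep a pixel only if every cell in its window is set.
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(img[y + dy][x + dx]
                                for dy in (-1, 0, 1)
                                for dx in (-1, 0, 1)))
    return out

def dilate(img, h, w):
    # 5x5 dilation: set a pixel if any cell in its window is set.
    out = [[0] * w for _ in range(h)]
    for y in range(2, h - 2):
        for x in range(2, w - 2):
            out[y][x] = int(any(img[y + dy][x + dx]
                                for dy in (-2, -1, 0, 1, 2)
                                for dx in (-2, -1, 0, 1, 2)))
    return out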

66 Rowley Detector
- 30 x 30 window
- Neural network based
- Specialized neurons for horizontal and vertical strips
- Multiple independent networks for accuracy
- Typically 100 neurons, with many inputs each
- Henry Rowley's implementation, provided by CMU
- Output: face or not face?

67 Viola and Jones' Detector
- 30 x 30 window
- Feature/wavelet based
- The AdaBoost boosting algorithm combines weak heuristics into stronger ones
- Feature = sum/difference of rectangles
- 100 features
- Integral image representation (sketched below)
- Our implementation is based on the published algorithm
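A compact sketch of the integral image trick (the standard formulation, not the thesis code): once the table is built, any feature's rectangle sums cost four lookups each.

def integral_image(img, h, w):
    # ii[y][x] = sum of img over the rectangle [0, y) x [0, x)
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii

def rect_sum(ii, top, left, bottom, right):
    # Sum over rows top..bottom-1 and cols left..right-1, in O(1).
    return (ii[bottom][right] - ii[top][right]
            - ii[bottom][left] + ii[top][left])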

68 Face Detection Example

69 Eigenfaces – Face Recognizer
- Known faces are stored in a "face space" representation
- The test image is projected into face space and its distance from each known face is computed
- The closest distance gives the identity of the person
- Matrix multiply and transpose operations, eigenvalues
- Eye coordinates are provided by the neural net
- Original algorithm by Pentland, MIT; re-implemented by researchers at Colorado State University
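A linear algebra sketch of the recognition step (NumPy for brevity; the array names are assumed, and the eigenface basis is taken as precomputed):

import numpy as np

def identify(test_img, mean_face, eigenfaces, known_coeffs, names):
    # Project the flattened test image into face space (a matrix multiply)...
    coeffs = eigenfaces @ (test_img - mean_face)
    # ...then the smallest distance to a stored projection gives the identity.
    dists = np.linalg.norm(known_coeffs - coeffs, axis=1)
    return names[int(np.argmin(dists))]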

70 L1 Cache Hit Rate - Explanation
- 320 x 200 color image = approximately 180 KB; the grayscale version = 64 KB
- Only flesh toning touches the color image, one pixel at a time
- Detectors work at the 30 x 30 scale
- Viola: 5.2 KB of tables and image rows
- Rowley: approx. 80 KB per neural net, but accessed in stream mode

71 A Brief Introduction to Speech Recognition

72 Sphinx 3: Profile

