1
The Perception Processor
Binu Mathew
Advisor: Al Davis
2
What is Perception Processing?
Ubiquitous computing needs natural human interfaces
Processor support for perceptual applications:
  Gesture recognition
  Object detection, recognition, tracking
  Speech recognition
  Speaker identification
Applications:
  Multi-modal human friendly interfaces
  Intelligent digital assistants
  Robotics, unmanned vehicles
  Perception prosthetics
3
The Problem with Perception Processing
4
The Problem with Perception Processing
Too slow, too much power for the embedded space!
  2.4 GHz Pentium 4 ~ 60 Watts
  400 MHz XScale ~ 800 mW
  10x or more difference in performance
Inadequate memory bandwidth:
  Sphinx requires 1.2 GB/s of memory bandwidth
  XScale delivers 64 MB/s, about 1/19th of that
Characterize the applications to find the problem
Derive an acceleration architecture
The history of FPUs is an analogy
5
High Level Architecture
[Block diagram: processor, coprocessor interface, memory controller, input SRAMs, custom accelerator, output SRAM, DRAM interface, scratch SRAMs]
6
Thesis Statement: It is possible to design programmable processors that can handle sophisticated perception workloads in real time at power budgets suitable for embedded devices.
7
The FaceRec Application
8
FaceRec In Action Rob Evans
9
Application Structure
[Pipeline diagram: flesh tone image, segmented image, Viola & Jones face detector, Rowley face detector, neural net eye locator, Eigenfaces face recognizer; output: identity, coordinates]
Flesh toning: Soriano et al., Bertran et al.
Segmentation: textbook approach
Rowley detector, voter: Henry Rowley, CMU
Viola & Jones' detector: published algorithm + Carbonetto, UBC
Eigenfaces: re-implementation by Colorado State University
10
FaceRec Characterization
ML-RSIM out-of-order processor simulator: SPARC V8 ISA, unmodified SunOS binaries
Out-of-order configuration, similar to a 2 GHz Intel Pentium 4:
  1-4 ALUs, 1-4 FPUs
  Max 4 issue, max 4 graduations/cycle
  16 KB 2-way L1 I-cache, 16-64 KB 2-way L1 D-cache
  256 KB-2 MB 2-way L2 cache
  600 MHz, 64-bit DDR memory interface
In-order configuration, similar to a 400 MHz Intel XScale:
  1 ALU, 1 FPU
  Max 1 issue, max 1 graduation/cycle
  32 KB 32-way L1 I-cache, 32 KB 32-way L1 D-cache
  No L2 cache
  100 MHz, 32-bit SDR memory interface
11
Application Profile
12
Memory System Characteristics – L1 D Cache
13
Memory System Characteristics – L2 Cache
14
IPC
15
Why is IPC low?
Neural network evaluation:
  Sum = Σ (i = 0 to n) Weight[i] * Image[Input[i]]
  Result = Tanh(Sum)
Dependences, e.g. no single-cycle floating-point accumulate
Indirect accesses
Several array accesses per operator
Load/store ports saturate
Need architectures that can move data efficiently
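A minimal Python sketch of this evaluation (illustrative only; the array names follow the formula above and tanh comes from Python's math module) makes the serial accumulate chain and the indirect Image[Input[i]] access explicit:

import math

def evaluate_neuron(weight, inputs, image):
    # Two loads per term (weight[i] and the indirectly addressed pixel),
    # and the running sum is a serial floating-point dependence chain.
    total = 0.0
    for i in range(len(weight)):
        total += weight[i] * image[inputs[i]]   # indirect access
    return math.tanh(total)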
16
Real Time Performance
17
Example App: CMU Sphinx 3.2
Speech recognition engine
Speaker and language independent
Acoustic model: triphone based, continuous Hidden Markov Model (HMM) based
Grammar: trigram with back-off
Open source
HUB4 speech model:
  Broadcast news model (ABC news, NPR, etc.)
  64,000 word vocabulary
18
CMU Sphinx 3.2 Profile
19
L1 D-cache Miss Rate
20
L2 Cache Miss Rate
21
DRAM Bandwidth
22
IPC
23
High Level Architecture
[Block diagram: processor, coprocessor interface, memory controller, input SRAMs, custom accelerator, output SRAM, DRAM interface, scratch SRAMs]
24
ASIC Accelerator Design: Matrix Multiply
def matrix_multiply(A, B, C):   # C is the result matrix
    for i in range(0, 16):
        for j in range(0, 16):
            C[i][j] = inner_product(A, B, i, j)

def inner_product(A, B, row, col):
    sum = 0.0
    for i in range(0, 16):
        sum = sum + A[row][i] * B[i][col]
    return sum
25
ASIC Accelerator Design: Matrix Multiply
Control Pattern: the for i / for j loop nest and the for i counter in inner_product (same matrix multiply code as above)
26
ASIC Accelerator Design: Matrix Multiply
Access Pattern: the array references A[row][i] and B[i][col] (same matrix multiply code as above)
27
ASIC Accelerator Design: Matrix Multiply
Compute Pattern: the multiply-accumulate sum = sum + A[row][i] * B[i][col] (same matrix multiply code as above)
28
ASIC Accelerator Design: Matrix Multiply
29
ASIC Accelerator Design: Matrix Multiply
7 cycle latency in the floating-point pipeline: the dependent accumulation in inner_product must wait that long on every iteration (same matrix multiply code as above)
30
ASIC Accelerator Design: Matrix Multiply
Interleave >= 7 independent inner products to keep the pipeline full
Complicates address generation
(same matrix multiply code as above)
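A rough Python sketch of the interleaving idea (the loop structure is an illustration, not the actual micro-code): seven or more independent inner products each keep their own partial sum, so an accumulation can complete every cycle, but each iteration now needs several different addresses generated.

def interleaved_inner_products(A, B, row, cols):
    # cols holds 7 or more column indices; each partial sum is independent,
    # so back-to-back accumulations never wait on the 7-cycle add latency.
    sums = [0.0] * len(cols)
    for i in range(0, 16):
        for k, col in enumerate(cols):
            sums[k] = sums[k] + A[row][i] * B[i][col]
    return sums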
31
How can we generalize? Decompose the loop into:
  Control pattern
  Access pattern
  Compute pattern
Programmable hardware acceleration for each pattern
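As a concrete illustration (my own labeling of the inner product from the earlier matrix multiply slides), each line of the loop body belongs to one of the three patterns:

def inner_product(A, B, row, col):
    sum = 0.0
    for i in range(0, 16):       # control pattern: counter, bound, increment
        a = A[row][i]            # access pattern: strided load from A
        b = B[i][col]            # access pattern: strided load from B
        sum = sum + a * b        # compute pattern: multiply-accumulate
    return sum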
32
The Perception Processor Architecture Family
33
Perception Processor Pipeline
34
Function Unit Organization
35
Interconnect
36
Loop Unit
37
Address Generator
Generates addresses of the form A[(i+k1)<<k2+k3][(j+k4)<<k5+k6] (strided/affine) and A[B[i]] (indirect)
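A small Python sketch of the two supported forms (the grouping ((i+k1)<<k2)+k3 and the flattening to a linear address are my assumptions; the slide only gives the index expressions):

def strided_address(i, j, k1, k2, k3, k4, k5, k6, row_size):
    # Affine/strided form: row and column are shifted, offset loop counters
    row = ((i + k1) << k2) + k3
    col = ((j + k4) << k5) + k6
    return row * row_size + col

def indirect_address(B, i):
    # Indirect form A[B[i]]: the index itself is loaded from memory
    return B[i]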
38
Inner Product Micro-code
i_loop = LoopContext(start_count=0, end_count=15, increment=1, II=7)
A_ri = AddressContext(port=inq.a_port, loop0=row_loop, rowsize=16, loop1=i_loop, base=0)
B_ic = AddressContext(port=inq.b_port, loop0=i_loop, rowsize=16, loop1=Constant, base=256)

for i in LOOP(i_loop):
    t0 = LOAD(fpu0.a_reg, A_ri)
    for k in range(0, 7):    # Will be unrolled 7x
        AT(t0 + k)
        t1 = LOAD(fpu0.b_reg, B_ic, loop1_constant=k)
        AT(t1)
        t2 = fpu0.mult(fpu0.a_reg, fpu0.b_reg)
        AT(t2)
        t3 = TRANSFER(fpu1.b_reg, fpu0)
        AT(t3)
        fpu1.add(fpu1, fpu1.b_reg)
39
Loop Scheduling
40
Unroll and Software Pipeline
41
Modulo Scheduling
42
Modulo Scheduling - Problem
[Diagram: overlapped loop iterations (i, j), (i+1, j), (i+2, j), (i+3, j) in flight at once]
43
Traditional Solution
Generate multiple copies of address calculation instructions
Use register rotation to fix dependences
45
Array Variable Renaming
[Diagram: in-flight iterations distinguished by tags tag=0, tag=1, tag=2, tag=3]
46
Array Variable Renaming
47
Array Variable Renaming
48
Experimental Method
Measure processor power on:
  2.4 GHz Pentium 4, 0.13u process
  400 MHz XScale, 0.18u process
  Perception Processor: 1 GHz, 0.13u process (Berkeley Predictive Tech Model)
    Verilog, MCL HDLs
    Synthesized using Synopsys Design Compiler
    Fanout-based heuristic wire loads
    Spice (Nanosim) simulation yields the current waveform
    Numerical integration to calculate energy
  ASICs in 0.25u process
Normalize 0.18u and 0.25u energy and delay numbers
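A hedged sketch of the last step (the supply voltage and the trapezoidal rule are assumptions; the slide only says the Spice current waveform is numerically integrated to get energy):

def energy_from_current(times, currents, vdd=1.2):
    # E = Vdd * integral of I(t) dt, trapezoidal rule over the waveform samples
    charge = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        charge += 0.5 * (currents[k] + currents[k - 1]) * dt
    return vdd * charge   # Joules, if times are in seconds and currents in amps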
49
Benchmarks
Visual feature recognition:
  Erode, Dilate: image segmentation operators
  Fleshtone: NCC flesh tone detector
  Viola, Rowley: face detectors
Speech recognition:
  HMM: 5-state Hidden Markov Model
  GAU: 39-element, 8-mixture Gaussian
DSP:
  FFT: 128-point, complex-to-complex, floating point
  FIR: 32-tap, integer
Encryption:
  Rijndael: 128-bit key, 576-byte packets
50
Results: IPC
Mean IPC = 3.3x the R14K's
51
Results: Throughput
Mean throughput = 1.75x the Pentium, 0.41x the ASIC
52
Results: Energy
Mean energy/packet = 7.4% of the XScale's, 5x the ASIC's
53
Results: Clock Gating Synergy
Mean Power Savings = 39.5%
54
Results: Energy Delay Product
Mean energy-delay product: 159x better than the XScale's, about 12x worse than the ASIC's
55
The Cost of Generality: PP+
[Block diagram: Intel XScale processor, coprocessor interface, memory controller, input SRAMs, custom accelerator, output SRAM, DRAM interface, scratch SRAMs]
56
Results: Energy of PP+
Mean energy/packet = 18.2% of the XScale's, 12.4x the ASIC's
57
Results: Energy Delay Product of PP+
Mean energy-delay product: 64x better than the XScale's, about 30x worse than the ASIC's
58
Results: Summary
41% of the ASIC's performance, but programmable!
1.75x the Pentium 4's throughput, but only 7.4% of the XScale's energy!
59
Related Work
Johnny Pihl's PDF coprocessor
Anantharaman and Bisiani: beam search optimization for the CMU recognizer
SPERT, MultiSPERT (UC Berkeley)
Corporaal et al.'s MOVE processor: transport triggered architecture
Vector chaining (Cray 1)
MIT RAW machine (Agarwal)
Stanford Imagine (Dally)
Bit-reversed addressing modes in DSPs
60
Contributions
Programmable architecture for perception and stream computations
Energy efficient, custom flows without register files
Drastic reductions in power while simultaneously improving ILP
Pattern oriented loop accelerators for improved data delivery and throughput
Array variable renaming generalizes register rotation
Compiler directed data flow generalizes vector chaining
Rapid, semi-automated generation of application specific processors
Makes real-time, low-power perception possible!
61
Future Work
Loop pattern accelerators for more scheduling regimes and data structures
Programming language support
Automated architecture exploration
Generic stream processors
Architectures for list comprehension, map(), reduce(), filter() in hardware?
  e.g.: B = [ (K1*i, K2*i*i) for i in A if i % 2 != 0 ]
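For reference, the example comprehension written directly with map() and filter() (A, K1 and K2 are given illustrative values only so the snippet runs):

A = [1, 2, 3, 4, 5]       # illustrative input
K1, K2 = 2, 3             # illustrative constants
B_comp = [(K1 * i, K2 * i * i) for i in A if i % 2 != 0]
B_map = list(map(lambda i: (K1 * i, K2 * i * i),
                 filter(lambda i: i % 2 != 0, A)))
assert B_comp == B_map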
62
Thanks! Questions?
63
Future Work
64
Flesh Toning
65
Image Segmentation
Erosion operator:
  3 x 3 matrix
  Remove a pixel if all its neighbors are not set
  Removes false connections between objects
Dilation operator:
  5 x 5 matrix
  Set a pixel if any neighbor is set
  Smoothes out and fills holes in objects
Connected components:
  Cut the image into rectangles
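A minimal Python sketch of the two operators on a binary image (boundary handling, and reading the erosion rule as "keep a pixel only when every pixel in the window is set", are assumptions; the slide gives only the window sizes and one-line rules):

def erode(img, radius=1):
    # 3 x 3 erosion: clear a pixel unless every pixel in its window is set
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = 1
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < h and 0 <= nx < w and img[ny][nx]):
                        out[y][x] = 0
    return out

def dilate(img, radius=2):
    # 5 x 5 dilation: set a pixel if any pixel in its window is set
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx]:
                        out[y][x] = 1
    return out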
66
Rowley Detector
Neural network based, 30 x 30 window
Specialized neurons for horizontal and vertical strips
Multiple independent networks for accuracy
Typically 100 neurons, each with many inputs
Output: face or not a face?
Henry Rowley's implementation provided by CMU
67
Viola and Jones’ Detector
30 x 30 window
Feature/wavelet based
AdaBoost boosting algorithm combines weak heuristics to make stronger ones
Feature = sum/difference of rectangles; 100 features
Integral image representation
Our implementation is based on the published algorithm
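A short Python sketch of the integral image and a rectangle sum (function names are mine; the slide only states the representation and that features are sums/differences of rectangles):

def integral_image(img):
    # ii[y][x] = sum of img over rows 0..y-1 and columns 0..x-1
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    # Sum over the rectangle [y0, y1) x [x0, x1) with four table lookups
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]

With this table, each rectangle feature costs a handful of lookups regardless of the rectangle's size.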
68
Face Detection Example
69
Eigenfaces – Face Recognizer
Known faces are stored in a "face space" representation
The test image is projected into face space and its distance from each known face is computed
The closest distance gives the identity of the person
Matrix multiply and transpose operations, eigenvalues
Eye coordinates are provided by the neural net
Original algorithm by Pentland, MIT
Re-implemented by researchers at Colorado State University
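A simplified Python sketch of the recognition step (plain Python rather than the CSU implementation; the mean face, basis rows, and Euclidean distance are standard eigenfaces ingredients, not details taken from the slide):

def project(basis, mean, image):
    # Project a flattened image into face space: coeff[k] = basis[k] . (image - mean)
    centered = [p - m for p, m in zip(image, mean)]
    return [sum(b * c for b, c in zip(row, centered)) for row in basis]

def identify(known, probe_coeffs):
    # known maps a person's name to stored face-space coefficients;
    # the closest stored face gives the identity.
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(known, key=lambda name: dist2(known[name], probe_coeffs))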
70
L1 Cache Hit Rate - Explanation
320 x 200 color image ≈ 180 KB (320 x 200 x 3 bytes); gray scale version = 64 KB
Only flesh toning touches the color image, one pixel at a time
Detectors work at a 30 x 30 scale
Viola: 5.2 KB of tables and image rows
Rowley: approx. 80 KB per neural net, but accessed in stream mode
71
A Brief Introduction to Speech Recognition
72
Sphinx 3: Profile