Presentation transcript: "The Perception Processor"

1 The Perception Processor
Binu Mathew. Advisor: Al Davis.

2 What is Perception Processing?
Ubiquitous computing needs natural human interfaces.
Processor support for perceptual applications:
- Gesture recognition
- Object detection, recognition, tracking
- Speech recognition
- Speaker identification
Applications:
- Multi-modal, human-friendly interfaces
- Intelligent digital assistants
- Robotics, unmanned vehicles
- Perception prosthetics

3 The Problem with Perception Processing

4 The Problem with Perception Processing
Too slow, too much power for the embedded space!
- 2.4 GHz Pentium 4: ~60 Watts
- 400 MHz XScale: ~800 mW
- 10x or more difference in performance between the two
Inadequate memory bandwidth:
- Sphinx requires 1.2 GB/s of memory bandwidth
- XScale delivers 64 MB/s, roughly 1/19th of that
Approach:
- Characterize the applications to find the problem
- Derive an acceleration architecture
- The history of FPUs is an analogy

5 High Level Architecture
(Block diagram) Processor, Coprocessor Interface, Memory Controller, DRAM Interface, Custom Accelerator, Input SRAMs, Output SRAM, Scratch SRAMs

6 Thesis Statement It is possible to design programmable processors that can handle sophisticated perception workloads in real time at power budgets suitable for embedded devices.

7 The FaceRec Application

8 FaceRec In Action (demo subject: Rob Evans)

9 Application Structure
(Block diagram) Pipeline: Flesh tone image → Segment image → Viola & Jones face detector / Rowley face detector → Neural net eye locator → Eigenfaces face recognizer → Identity, coordinates
Credits:
- Flesh toning: Soriano et al., Bertran et al.
- Segmentation: textbook approach
- Rowley detector, voter: Henry Rowley, CMU
- Viola & Jones detector: published algorithm + Carbonetto, UBC
- Eigenfaces: re-implementation by Colorado State University

10 FaceRec Characterization
ML-RSIM out-of-order processor simulator: SPARC V8 ISA, unmodified SunOS binaries.
Out-of-order configuration (similar to a 2 GHz Intel Pentium 4):
- 1-4 ALUs, 1-4 FPUs
- Max 4 issue, max 4 graduations/cycle
- 16 KB 2-way L1 I-cache
- 16-64 KB 2-way L1 D-cache
- 256 KB-2 MB 2-way L2 cache
- 600 MHz, 64-bit DDR memory interface
In-order configuration (similar to a 400 MHz Intel XScale):
- 1 ALU, 1 FPU
- Max 1 issue, max 1 graduation/cycle
- 32 KB 32-way L1 I-cache
- 32 KB 32-way L1 D-cache
- No L2 cache
- 100 MHz, 32-bit SDR memory interface

11 Application Profile

12 Memory System Characteristics – L1 D Cache

13 Memory System Characteristics – L2 Cache

14 IPC

15 Why is IPC low?
Neural network evaluation:
    Sum = Σ_{i=0..n} Weight[i] * Image[Input[i]]
    Result = tanh(Sum)
- Dependences: no single-cycle floating-point accumulate, for example
- Indirect accesses
- Several array accesses per operator
- Load/store ports saturate
Need architectures that can move data efficiently.
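A minimal Python sketch of the evaluation above (function and variable names are illustrative, not from the Rowley sources), showing both bottlenecks: two loads per term, one of them indirect, feeding a serial accumulation chain.

import math

def evaluate_neuron(weight, image, input_idx):
    # Each term needs two loads: the weight and an indirect image access.
    # The accumulation is a serial dependence chain, so every add must
    # wait for the previous one to leave the multi-cycle FP adder.
    total = 0.0
    for i in range(len(weight)):
        total += weight[i] * image[input_idx[i]]  # indirect array access
    return math.tanh(total)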

16 Real Time Performance

17 Example App: CMU Sphinx 3.2
- Speech recognition engine, speaker- and language-independent
- Acoustic model: triphone-based, continuous Hidden Markov Model (HMM)
- Grammar: trigram with back-off
- Open source
- HUB4 speech model: broadcast news (ABC News, NPR, etc.), 64,000-word vocabulary

18 CMU Sphinx 3.2 Profile

19 L1 D-cache Miss Rate

20 L2 Cache Miss Rate

21 DRAM Bandwidth

22 IPC

23 High Level Architecture
(Block diagram) Processor, Coprocessor Interface, Memory Controller, DRAM Interface, Custom Accelerator, Input SRAMs, Output SRAM, Scratch SRAMs

24 ASIC Accelerator Design: Matrix Multiply
def matrix_multiply(A, B, C):  # C is the result matrix
    for i in range(0, 16):
        for j in range(0, 16):
            C[i][j] = inner_product(A, B, i, j)

def inner_product(A, B, row, col):
    sum = 0.0
    for i in range(0, 16):
        sum = sum + A[row][i] * B[i][col]
    return sum
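A quick smoke test of the routine above (the test matrices are illustrative, not from the slides):

A = [[float(i == j) for j in range(16)] for i in range(16)]  # identity matrix
B = [[i * 16.0 + j for j in range(16)] for i in range(16)]
C = [[0.0] * 16 for _ in range(16)]
matrix_multiply(A, B, C)
assert C == B  # I * B == B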

25 ASIC Accelerator Design: Matrix Multiply
Control pattern: the nested loop structure (for i / for j / inner for i) of the matrix_multiply code on slide 24.

26 ASIC Accelerator Design: Matrix Multiply
Access pattern: the array references (A[row][i], B[i][col], C[i][j]) in the same code.

27 ASIC Accelerator Design: Matrix Multiply
Compute pattern: the multiply-accumulate, sum = sum + A[row][i] * B[i][col].

28 ASIC Accelerator Design: Matrix Multiply

29 ASIC Accelerator Design: Matrix Multiply
7-cycle latency: the floating-point accumulate chain in the inner product (code on slide 24) completes only one term every 7 cycles.

30 ASIC Accelerator Design: Matrix Multiply
Interleave >= 7 inner products to hide that latency. Doing so complicates address generation.

31 How can we generalize?
Decompose the loop into:
- Control pattern
- Access pattern
- Compute pattern
Programmable h/w acceleration for each pattern (see the sketch after this list).
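To make the decomposition concrete, here is the inner product from slide 24 pulled apart along those three lines (a minimal sketch; the function names are mine, not the thesis's):

def control():
    # Control pattern: iteration sequencing, handled by the loop unit.
    for i in range(16):
        yield i

def access(A, B, row, col, i):
    # Access pattern: the locations touched each iteration,
    # handled by the address generators.
    return A[row][i], B[i][col]

def compute(acc, a, b):
    # Compute pattern: the arithmetic, handled by the function units.
    return acc + a * b

def inner_product(A, B, row, col):
    acc = 0.0
    for i in control():
        a, b = access(A, B, row, col, i)
        acc = compute(acc, a, b)
    return acc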

32 The Perception Processor Architecture Family

33 Perception Processor Pipeline

34 Function Unit Organization

35 Interconnect

36 Loop Unit

37 Address Generator
Affine pattern: A[((i + k1) << k2) + k3][((j + k4) << k5) + k6]
Indirect pattern: A[B[i]]
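A minimal software model of the two address computations (the base and rowsize parameters follow the AddressContext fields in the microcode on the next slide; the exact hardware datapath is not shown here):

def affine_address(base, rowsize, i, j, k1, k2, k3, k4, k5, k6):
    # Row and column indices are each ((index + offset) << shift) + offset,
    # so the index arithmetic needs only adders and shifters.
    row = ((i + k1) << k2) + k3
    col = ((j + k4) << k5) + k6
    # In hardware the row * rowsize multiply would itself be a shift,
    # assuming rowsize is a power of two (e.g. 16 in the microcode below).
    return base + row * rowsize + col

def indirect_address(base, B, i):
    # The table-lookup form used for patterns like A[B[i]].
    return base + B[i]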

38 Inner Product Micro-code
i_loop = LoopContext(start_count=0, end_count=15, increment=1, II=7)
A_ri = AddressContext(port=inq.a_port, loop0=row_loop, rowsize=16,
                      loop1=i_loop, base=0)
B_ic = AddressContext(port=inq.b_port, loop0=i_loop, rowsize=16,
                      loop1=Constant, base=256)

for i in LOOP(i_loop):
    t0 = LOAD(fpu0.a_reg, A_ri)                    # fetch A[row][i]
    for k in range(0, 7):                          # will be unrolled 7x
        AT(t0 + k)
        t1 = LOAD(fpu0.b_reg, B_ic, loop1_constant=k)  # fetch B[i][col]
        AT(t1)
        t2 = fpu0.mult(fpu0.a_reg, fpu0.b_reg)     # multiply
        AT(t2)
        t3 = TRANSFER(fpu1.b_reg, fpu0)            # forward product to the adder
        AT(t3)
        fpu1.add(fpu1, fpu1.b_reg)                 # accumulate

39 Loop Scheduling

40 Unroll and Software Pipeline

41 Modulo Scheduling

42 Modulo Scheduling - Problem
(Figure: four overlapped iterations, (i, j) through (i+3, j), in flight at once.)

43 Traditional Solution
- Generate multiple copies of the address calculation instructions
- Use register rotation to fix the dependences
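A toy software model of register rotation (the depth and addresses are assumed for illustration; in real hardware, e.g. Itanium-style rotating register files, the remapping is done by the register file itself):

NUM_ROT = 4                          # rotating register depth (assumed)
regs = [None] * NUM_ROT
base = 0

def rotate():
    global base
    base = (base - 1) % NUM_ROT      # the base moves once per iteration

def write(r, value):
    regs[(base + r) % NUM_ROT] = value

def read(r):
    return regs[(base + r) % NUM_ROT]

for i in range(8):
    write(0, 0x1000 + 4 * i)         # address produced by iteration i into r0
    if i >= 3:
        # A consumer scheduled 3 iterations later sees the same value as r3,
        # so no extra instruction copies are needed.
        assert read(3) == 0x1000 + 4 * (i - 3)
    rotate()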


45 Array Variable Renaming
(Figure: in-flight array accesses distinguished by tags tag=0 through tag=3.)

46 Array Variable Renaming

47 Array Variable Renaming

48 Experimental Method
Measure processor power on:
- 2.4 GHz Pentium 4, 0.13 µm process
- 400 MHz XScale, 0.18 µm process
- Perception Processor: 1 GHz, 0.13 µm process (Berkeley Predictive Technology Model)
Perception Processor flow:
- Verilog and MCL HDLs, synthesized using Synopsys Design Compiler
- Fanout-based heuristic wire loads
- Spice (Nanosim) simulation yields the current waveform; numerical integration calculates energy
- ASICs in a 0.25 µm process
- 0.18 µm and 0.25 µm energy and delay numbers normalized for process

49 Benchmarks
Visual feature recognition:
- Erode, Dilate: image segmentation operators
- Fleshtone: NCC flesh tone detector
- Viola, Rowley: face detectors
Speech recognition:
- HMM: 5-state Hidden Markov Model
- GAU: 39-element, 8-mixture Gaussian
DSP:
- FFT: 128-point, complex-to-complex, floating point
- FIR: 32-tap, integer
Encryption:
- Rijndael: 128-bit key, 576-byte packets

50 Results: IPC
Mean IPC = 3.3x that of the R14K

51 Results: Throughput
Mean throughput = 1.75x Pentium 4, 0.41x ASIC

52 Results: Energy
Mean energy/packet = 7.4% of XScale, 5x of ASIC

53 Results: Clock Gating Synergy
Mean Power Savings = 39.5%

54 Results: Energy Delay Product
Mean EDP = 159x better than XScale's, 12x the ASIC's

55 The Cost of Generality: PP+
(Block diagram) Intel XScale processor, Coprocessor Interface, Memory Controller, DRAM Interface, Custom Accelerator, Input SRAMs, Output SRAM, Scratch SRAMs

56 Results: Energy of PP+
Mean energy/packet = 18.2% of XScale, 12.4x of ASIC

57 Results: Energy Delay Product of PP+
Mean EDP = 64x better than XScale's, 30x the ASIC's

58 Results: Summary
- 41% of the ASIC's performance, but programmable!
- 1.75x the Pentium 4's throughput, but only 7.4% of the XScale's energy!

59 Related Work
- Johnny Pihl's PDF coprocessor
- Anantharaman and Bisiani: beam search optimization for the CMU recognizer
- SPERT, MultiSPERT (UC Berkeley)
- Corporaal et al.'s MOVE processor: transport-triggered architecture
- Vector chaining (Cray-1)
- MIT RAW machine (Agarwal)
- Stanford Imagine (Dally)
- Bit-reversed addressing modes in DSPs

60 Contributions Programmable architecture for perception and stream computations Energy efficient, custom flows w/o register files Drastic reductions in power while simultaneously improving ILP Pattern oriented loop accelerators for improved data delivery and throughput Array variable renaming generalizes register rotation Compiler directed data-flow generalizes vector chaining Rapid semi-automated generation of application specific processors Makes real-time low-power perception possible!

61 Future Work
- Loop pattern accelerators for more scheduling regimes and data structures
- Programming language support
- Automated architecture exploration
- Generic stream processors: architectures for list comprehension, map(), reduce(), filter() in h/w?
  e.g.: B = [ (K1*i, K2*i*i) for i in A if i % 2 != 0 ]  (decomposed below)
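For reference, the example above rephrased with the primitives the slide names (K1, K2, and A chosen arbitrarily):

K1, K2 = 3, 5
A = range(10)
B = list(map(lambda i: (K1 * i, K2 * i * i),      # the compute step
             filter(lambda i: i % 2 != 0, A)))    # the selection step
assert B == [(K1 * i, K2 * i * i) for i in A if i % 2 != 0]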

62 Thanks! Questions?

63 Future Work

64 Flesh Toning

65 Image Segmentation
Erosion operator:
- 3 x 3 matrix
- Remove a pixel unless all of its neighbors are set
- Removes false connections between objects
Dilation operator:
- 5 x 5 matrix
- Set a pixel if any neighbor is set
- Smooths out and fills holes in objects
Connected components: cut the image into rectangles
Both operators are sketched below.
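A scalar sketch of the two operators (border handling simplified; binary images represented as lists of 0/1 rows):

def erode(img, h, w):
    # 3x3 erosion: keep a pixel only if every cell in its window is set.
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(img[y + dy][x + dx]
                                for dy in (-1, 0, 1)
                                for dx in (-1, 0, 1)))
    return out

def dilate(img, h, w):
    # 5x5 dilation: set a pixel if any cell in its window is set.
    out = [[0] * w for _ in range(h)]
    for y in range(2, h - 2):
        for x in range(2, w - 2):
            out[y][x] = int(any(img[y + dy][x + dx]
                                for dy in (-2, -1, 0, 1, 2)
                                for dx in (-2, -1, 0, 1, 2)))
    return out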

66 Rowley Detector
- 30 x 30 window
- Neural network based
- Specialized neurons for horizontal and vertical strips
- Multiple independent networks for accuracy
- Typically 100 neurons, with many inputs each
- Henry Rowley's implementation, provided by CMU
- Output: face or not face?

67 Viola and Jones' Detector
- 30 x 30 window
- Feature/wavelet based
- The AdaBoost boosting algorithm combines weak heuristics into stronger ones
- Feature = sum/difference of rectangles
- 100 features
- Integral image representation (sketched below)
- Our implementation is based on the published algorithm
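A compact sketch of the integral image trick (the standard formulation, not the thesis code): once the table is built, any feature's rectangle sums cost four lookups each.

def integral_image(img, h, w):
    # ii[y][x] = sum of img over the rectangle [0, y) x [0, x)
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii

def rect_sum(ii, top, left, bottom, right):
    # Sum over rows top..bottom-1 and cols left..right-1, in O(1).
    return (ii[bottom][right] - ii[top][right]
            - ii[bottom][left] + ii[top][left])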

68 Face Detection Example

69 Eigenfaces – Face Recognizer
- Known faces are stored in a "face space" representation
- The test image is projected into face space and its distance from each known face is computed
- The closest distance gives the identity of the person
- Matrix multiply and transpose operations, eigenvalues
- Eye coordinates are provided by the neural net
- Original algorithm by Pentland, MIT; re-implemented by researchers at Colorado State University
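A linear algebra sketch of the recognition step (NumPy for brevity; the array names are assumed, and the eigenface basis is taken as precomputed):

import numpy as np

def identify(test_img, mean_face, eigenfaces, known_coeffs, names):
    # Project the flattened test image into face space (a matrix multiply)...
    coeffs = eigenfaces @ (test_img - mean_face)
    # ...then the smallest distance to a stored projection gives the identity.
    dists = np.linalg.norm(known_coeffs - coeffs, axis=1)
    return names[int(np.argmin(dists))]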

70 L1 Cache Hit Rate - Explanation
- 320 x 200 color image = approximately 180 KB; the grayscale version = 64 KB
- Only flesh toning touches the color image, one pixel at a time
- Detectors work at the 30 x 30 scale
- Viola: 5.2 KB of tables and image rows
- Rowley: approx. 80 KB per neural net, but accessed in stream mode

71 A Brief Introduction to Speech Recognition

72 Sphinx 3: Profile

