The Perception Processor Binu Mathew Advisor: Al Davis
What is Perception Processing? Ubiquitous computing needs natural human interfaces; processor support for perceptual applications: gesture recognition; object detection, recognition, tracking; speech recognition; speaker identification. Applications: multi-modal human-friendly interfaces, intelligent digital assistants, robotics, unmanned vehicles, perception prosthetics.
The Problem with Perception Processing
The Problem with Perception Processing Too slow, too much power for the embedded space! 2.4 GHz Pentium 4 ~ 60 Watts; 400 MHz XScale ~ 800 mW; 10x or more difference in performance. Inadequate memory bandwidth: Sphinx requires 1.2 GB/s of memory bandwidth; XScale delivers 64 MB/s, ~1/19th. Approach: characterize the applications to find the problem, then derive an acceleration architecture. The history of FPUs is an analogy.
High Level Architecture Processor Coprocessor Interface Memory Controller Input SRAMs Custom Accelerator Output SRAM DRAM Interface Scratch SRAMS
Thesis Statement It is possible to design programmable processors that can handle sophisticated perception workloads in real-time at power budgets suitable for embedded devices.
The FaceRec Application
FaceRec In Action Rob Evans
Application Structure (pipeline): Flesh tone Image → Segment Image → Rowley Face Detector and Viola & Jones Face Detector → Neural Net Eye Locator → Eigenfaces Face Recognizer → Identity, Coordinates. Flesh toning: Soriano et al, Bertran et al. Segmentation: textbook approach. Rowley detector, voter: Henry Rowley, CMU. Viola & Jones' detector: published algorithm + Carbonetto, UBC. Eigenfaces: re-implementation by Colorado State University.
FaceRec Characterization ML-RSIM out-of-order processor simulator; SPARC V8 ISA, unmodified SunOS binaries. Out-of-order processor similar to a 2 GHz Intel Pentium 4: 1-4 ALUs, 1-4 FPUs; max 4 issue; max 4 graduations/cycle; 16 KB 2-way L1 I-cache; 16-64 KB 2-way L1 D-cache; 256 KB-2 MB 2-way L2 cache; 600 MHz, 64-bit DDR memory interface. In-order processor similar to a 400 MHz Intel XScale: 1 ALU, 1 FPU; max 1 issue; max 1 graduation/cycle; 32 KB 32-way L1 I-cache; 32 KB 32-way L1 D-cache; no L2 cache; 100 MHz, 32-bit SDR memory interface.
Application Profile
Memory System Characteristics – L1 D Cache
Memory System Characteristics – L2 Cache
IPC
Why is IPC low? Neural network evaluation: Sum = Σ_{i=0}^{n} Weight[i] * Image[Input[i]]; Result = Tanh(Sum). Dependences, e.g. no single-cycle floating point accumulate. Indirect accesses; several array accesses per operator; load/store ports saturate. Need architectures that can move data efficiently.
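The bottleneck is visible in a few lines of Python (a behavioral sketch following the formula above, not the actual FaceRec source):

```python
import math

def evaluate_neuron(weight, inputs, image):
    """Weighted sum over indirectly addressed pixels, then tanh.

    Each term needs three loads: weight[i], inputs[i], and the
    dependent load image[inputs[i]], so several array accesses feed
    a single multiply-add and the load/store ports saturate.
    """
    total = 0.0
    for i in range(len(weight)):
        total += weight[i] * image[inputs[i]]  # indirect access
    return math.tanh(total)

# Tiny example: 3 weights; the indirection table picks pixels 2, 0, 1
print(evaluate_neuron([0.5, -1.0, 0.25], [2, 0, 1], [4.0, 8.0, 2.0]))
```

The serial dependence through `total` is exactly the missing single-cycle floating point accumulate.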
Real Time Performance
Example App: CMU Sphinx 3.2 Speech recognition engine Speaker and language independent Acoustic model: Triphone based, continuous Hidden Markov Model (HMM) based Grammar: Trigram with back-off Open source HUB4 speech model Broadcast news model (ABC news, NPR etc) 64000 word vocabulary
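The trigram-with-back-off grammar works roughly as follows (a minimal sketch with hypothetical probability tables, not Sphinx's actual data structures):

```python
def trigram_prob(w1, w2, w3, trigrams, bigrams, unigrams, backoff):
    """Back-off lookup: use the trigram probability if the history was
    seen in training, else a scaled bigram, else a scaled unigram."""
    if (w1, w2, w3) in trigrams:
        return trigrams[(w1, w2, w3)]
    if (w2, w3) in bigrams:
        return backoff.get((w1, w2), 1.0) * bigrams[(w2, w3)]
    return backoff.get((w2,), 1.0) * unigrams.get(w3, 1e-9)

# Hypothetical tables: the trigram "the cat sat" was seen in training,
# "a cat sat" was not, so the second query backs off to the bigram.
trigrams = {("the", "cat", "sat"): 0.6}
bigrams = {("cat", "sat"): 0.3}
unigrams = {"sat": 0.05}
backoff = {("a", "cat"): 0.4}

print(trigram_prob("the", "cat", "sat", trigrams, bigrams, unigrams, backoff))
print(trigram_prob("a", "cat", "sat", trigrams, bigrams, unigrams, backoff))
```

With a 64000-word vocabulary, these table lookups over large sparse models are a major source of the memory traffic profiled on the following slides.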
CMU Sphinx 3.2 Profile
L1 D-cache Miss Rate
L2 Cache Miss Rate
DRAM Bandwidth
IPC
High Level Architecture Processor Coprocessor Interface Memory Controller Input SRAMs Custom Accelerator Output SRAM DRAM Interface Scratch SRAMS
ASIC Accelerator Design: Matrix Multiply

    def matrix_multiply(A, B, C):
        # C is the result matrix
        for i in range(0, 16):
            for j in range(0, 16):
                C[i][j] = inner_product(A, B, i, j)

    def inner_product(A, B, row, col):
        sum = 0.0
        for i in range(0, 16):
            sum = sum + A[row][i] * B[i][col]
        return sum

The same code, viewed three ways:
Control pattern: the loop nest (the for statements and their bounds).
Access pattern: the array references A[row][i], B[i][col], C[i][j].
Compute pattern: the multiply-accumulate sum = sum + A[row][i] * B[i][col].

The floating point accumulate has a 7 cycle latency, so keeping the pipeline full requires interleaving >= 7 inner products, which complicates address generation.
How can we generalize? Decompose a loop into: control pattern, access pattern, compute pattern. Programmable h/w acceleration for each pattern.
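A toy software analogue of the decomposition (the run_loop interface here is hypothetical, purely to illustrate the idea; the real accelerator implements each pattern in programmable hardware):

```python
def run_loop(control, access, compute, state):
    """Toy analogue of the decomposition: the control pattern generates
    index tuples, the access pattern turns them into operands, and the
    compute pattern consumes operands and updates state."""
    for idx in control():
        operands = access(idx)
        state = compute(state, operands)
    return state

# An inner product of two 4-vectors expressed in the three patterns
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
result = run_loop(
    control=lambda: range(4),                    # control: i = 0..3
    access=lambda i: (a[i], b[i]),               # access: stream a[i], b[i]
    compute=lambda s, ops: s + ops[0] * ops[1],  # compute: multiply-accumulate
    state=0.0,
)
print(result)  # 300.0
```

Because the three patterns are separated, each can be accelerated by its own dedicated unit (loop unit, address generators, function units) instead of competing for a general-purpose pipeline.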
The Perception Processor Architecture Family
Perception Processor Pipeline
Function Unit Organization
Interconnect
Loop Unit
Address Generator A[(i+k1)<<k2+k3][(j+k4)<<k5+k6] A[B[i]]
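Behaviorally, the two addressing modes on this slide amount to the following (a sketch; the constants k1..k6 and the row size are illustrative):

```python
def affine_address(base, i, j, k1, k2, k3, k4, k5, k6, row_size):
    """Flat address for A[((i+k1)<<k2) + k3][((j+k4)<<k5) + k6].
    Row and column offsets need only shifts and adds; the row_size
    multiply is also a shift when row_size is a power of two."""
    row = ((i + k1) << k2) + k3
    col = ((j + k4) << k5) + k6
    return base + row * row_size + col

def gather(A, B, i):
    """Indirect pattern A[B[i]]: the loaded value B[i] feeds the
    address generation for the next load."""
    return A[B[i]]

# With all constants zero this reduces to base + i*row_size + j
print(affine_address(0, 2, 3, 0, 0, 0, 0, 0, 0, 16))  # 35
print(gather([5, 6, 7], [2, 0, 1], 0))                # 7
```

Generating these addresses in dedicated hardware removes the per-element address arithmetic that saturates the load/store path on a conventional pipeline.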
Inner Product Micro-code

    i_loop = LoopContext(start_count=0, end_count=15, increment=1, II=7)
    A_ri = AddressContext(port=inq.a_port, loop0=row_loop, rowsize=16, loop1=i_loop, base=0)
    B_ic = AddressContext(port=inq.b_port, loop0=i_loop, rowsize=16, loop1=Constant, base=256)

    for i in LOOP(i_loop):
        t0 = LOAD(fpu0.a_reg, A_ri)
        for k in range(0, 7):  # Will be unrolled 7x
            AT(t0 + k)
            t1 = LOAD(fpu0.b_reg, B_ic, loop1_constant=k)
            AT(t1)
            t2 = fpu0.mult(fpu0.a_reg, fpu0.b_reg)
            AT(t2)
            t3 = TRANSFER(fpu1.b_reg, fpu0)
            AT(t3)
            fpu1.add(fpu1, fpu1.b_reg)
Loop Scheduling
Unroll and Software Pipeline
Modulo Scheduling
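The effect of modulo scheduling can be illustrated numerically. Assuming the 7-cycle accumulate latency from the matrix-multiply example and an initiation interval (II) of one cycle, seven iterations are in flight at once (this is a toy cycle count, not the scheduler itself):

```python
def modulo_schedule(n_iters, latency, ii):
    """(start, finish) cycles when a new iteration is initiated every
    `ii` cycles and each iteration's dependent chain takes `latency`."""
    return [(k * ii, k * ii + latency) for k in range(n_iters)]

def max_in_flight(schedule):
    """Peak number of simultaneously active iterations."""
    end = schedule[-1][1]
    return max(sum(1 for s, f in schedule if s <= t < f) for t in range(end))

# 16 inner-product iterations, 7-cycle accumulate, one started per cycle
sched = modulo_schedule(n_iters=16, latency=7, ii=1)
print(max_in_flight(sched))  # 7 iterations overlap
```

Those seven overlapping iterations are exactly why the address registers of in-flight iterations collide, the problem the next slides address.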
Modulo Scheduling - Problem [figure: overlapped iterations i,j through i+3,j in flight simultaneously]
Traditional Solution Generate multiple copies of address calculation instructions Use register rotation to fix dependences
Array Variable Renaming [figure: in-flight array accesses distinguished by tags 0-3]
Array Variable Renaming
Experimental Method Measure processor power on: 2.4 GHz Pentium 4 (0.13u process); 400 MHz XScale (0.18u process); Perception Processor at 1 GHz, 0.13u process (Berkeley Predictive Technology Model). Verilog and MCL HDLs, synthesized using Synopsys Design Compiler with fanout-based heuristic wire loads. Spice (Nanosim) simulation yields a current waveform; numerical integration calculates energy. ASICs in 0.25u process. Normalize the 0.18u and 0.25u energy and delay numbers.
Benchmarks Visual feature recognition: Erode, Dilate (image segmentation operators); Fleshtone (NCC flesh tone detector); Viola, Rowley (face detectors). Speech recognition: HMM (5-state Hidden Markov Model); GAU (39-element, 8-mixture Gaussian). DSP: FFT (128-point, complex-to-complex, floating point); FIR (32-tap, integer). Encryption: Rijndael (128-bit key, 576-byte packets).
Results: IPC Mean IPC = 3.3x R14K
Results: Throughput Mean Throughput = 1.75x Pentium 0.41x ASIC
Results: Energy Mean Energy/packet = 7.4% of XScale 5x of ASIC
Results: Clock Gating Synergy Mean Power Savings = 39.5%
Results: Energy Delay Product Mean EDP advantage = 159x vs. XScale, 1/12 vs. ASIC
The Cost of Generality: PP+ Intel XScale Processor Coprocessor Interface Memory Controller Input SRAMs Custom Accelerator Output SRAM DRAM Interface Scratch SRAMS
Results: Energy of PP+ Mean Energy/packet = 18.2% of XScale 12.4x of ASIC
Results: Energy Delay Product of PP+ Mean EDP advantage = 64x vs. XScale, 1/30 vs. ASIC
Results: Summary 41% of ASIC’s performance But programmable! 1.75 times the Pentium 4’s throughput But 7.4% of the energy of an XScale!
Related Work Johnny Pihl's PDF coprocessor. Anantharaman and Bisiani: beam search optimization for the CMU recognizer. SPERT, MultiSPERT (UC Berkeley). Corporaal et al's MOVE processor: transport triggered architecture. Vector chaining (Cray 1). MIT RAW machine (Agarwal). Stanford Imagine (Dally). Bit-reversed addressing modes in DSPs.
Contributions Programmable architecture for perception and stream computations Energy efficient, custom flows w/o register files Drastic reductions in power while simultaneously improving ILP Pattern oriented loop accelerators for improved data delivery and throughput Array variable renaming generalizes register rotation Compiler directed data-flow generalizes vector chaining Rapid semi-automated generation of application specific processors Makes real-time low-power perception possible!
Future Work Loop pattern accelerators for more scheduling regimes and data structures. Programming language support. Automated architecture exploration. Generic stream processors. Architectures for list comprehension, map(), reduce(), filter() in h/w? e.g.: B = [(K1*i, K2*i*i) for i in A if i % 2 != 0]
Thanks! Questions ?
Future Work
Flesh Toning
Image Segmentation Erosion operator: 3 x 3 matrix; remove a pixel if not all of its neighbors are set; removes false connections between objects. Dilation operator: 5 x 5 matrix; set a pixel if any neighbor is set; smooths boundaries and fills holes in objects. Connected components: cut the image into rectangles.
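The two morphological operators can be sketched directly from their definitions (binary image as a list of lists; border pixels are simply left unset here):

```python
def erode(img, w, h):
    """3x3 erosion: keep a pixel only if it and all 8 neighbors are
    set; removes false connections between objects."""
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(img[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out

def dilate(img, w, h, r=2):
    """5x5 dilation (r=2): set a pixel if any pixel in the window is
    set; smooths boundaries and fills holes in objects."""
    out = [[0] * w for _ in range(h)]
    for y in range(r, h - r):
        for x in range(r, w - r):
            out[y][x] = int(any(img[y + dy][x + dx]
                                for dy in range(-r, r + 1)
                                for dx in range(-r, r + 1)))
    return out

# A 3x3 block of ones: erosion keeps only the fully surrounded center
img = [[0, 0, 0, 0, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 0, 0, 0, 0]]
print(sum(map(sum, erode(img, 5, 5))))  # 1: only the center pixel survives
```

Each output pixel is an independent window reduction over a fixed neighborhood, which is why these operators map so well onto a streaming accelerator.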
Rowley Detector Neural network based, 30 x 30 window. Specialized neurons for horizontal and vertical strips. Multiple independent networks for accuracy. Typically 100 neurons, 100-150 inputs each. Henry Rowley's implementation provided by CMU. Output: face or not face?
Viola and Jones' Detector Feature/wavelet based, 30 x 30 window. AdaBoost boosting algorithm combines weak heuristics to make stronger ones. Feature = sum/difference of rectangles; ~100 features. Integral image representation. Our implementation based on the published algorithm.
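The integral image is what makes rectangle features cheap; a minimal sketch of the standard construction (not the thesis code):

```python
def integral_image(img, w, h):
    """ii[y][x] = sum of all pixels above and to the left of (y, x),
    inclusive; an extra zero row and column simplify the lookups."""
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of any rectangle in 4 lookups, so a sum/difference-of-
    rectangles feature costs a handful of adds regardless of size."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

img = [[1, 2],
       [3, 4]]
ii = integral_image(img, 2, 2)
print(rect_sum(ii, 0, 0, 2, 2))  # 10: whole image
print(rect_sum(ii, 1, 0, 1, 2))  # 6: right column (2 + 4)
```

A feature's value is then just rect_sum of one rectangle minus rect_sum of another, independent of the rectangles' areas.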
Face Detection Example
Eigenfaces – Face Recognizer Known faces stored as “face space” representation Test image is projected to face space and distance from known face computed Closest distance gives identity of person Matrix multiply and transpose operations, Eigen values Eye co-ordinates provided by neural net Original algorithm by Pentland, MIT Re-implemented by researchers at Colorado State University
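The recognition step is a projection followed by a nearest-neighbor search; a toy sketch with made-up eigenvectors and gallery coordinates (real face spaces have far more dimensions):

```python
def project(image, mean, eigenvectors):
    """Project a mean-subtracted image onto the face space: one dot
    product per eigenvector, i.e. a matrix-vector multiply."""
    centered = [p - m for p, m in zip(image, mean)]
    return [sum(e * c for e, c in zip(ev, centered)) for ev in eigenvectors]

def nearest_face(coords, gallery):
    """Identity = gallery entry at the smallest squared distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(gallery, key=lambda name: dist2(coords, gallery[name]))

# Hypothetical 4-pixel "images" and a 2-dimensional face space
mean = [1.0, 1.0, 1.0, 1.0]
eigenvectors = [[0.5, 0.5, 0.5, 0.5],
                [0.5, 0.5, -0.5, -0.5]]
gallery = {"alice": [2.0, 0.0], "bob": [-1.0, 3.0]}
coords = project([3.0, 2.0, 1.0, 1.0], mean, eigenvectors)
print(coords, nearest_face(coords, gallery))
```

The dominant cost is the projection's dense dot products, which is why this stage reduces to the matrix multiply and transpose operations named on the slide.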
L1 Cache Hit Rate - Explanation 320 x 200 color image = approximately 180 KB; gray scale version = 64 KB. Only flesh toning touches the color image, one pixel at a time. The detectors work at the 30 x 30 scale: Viola needs 5.2 KB of tables and image rows; Rowley needs approximately 80 KB per neural net, but accesses it in streaming mode.
A Brief Introduction to Speech Recognition
Sphinx 3 : Profile