Chris Savarese, Yashesh Shroff, Greg Lawrence

MAP ART: Mapping Architectural Properties to an Algorithm for Redundant Triangulation
Advisor: Dr. Jan Rabaey
April 27, 2000, CS252

Outline
- Introduction
- Background
- Time and Energy Profiling
- Parallel Architectures
- Conclusions: Our Dream Architecture
- Future Work

Introduction
Goal: Given a basic localization algorithm, explore architectural alternatives that minimize energy consumption.
- The concept of localization
- Energy-saving techniques
- What we did…

Outline (current section: Background)

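The Localization Algorithm
An unknown node U at (x, y, z) is located from reference nodes N1(x1, y1, z1), N2(x2, y2, z2), N3(x3, y3, z3), …, Nn(xn, yn, zn) by solving the linear system

$$
\underbrace{\begin{bmatrix}
x_1 - x_n & y_1 - y_n & z_1 - z_n \\
\vdots & \vdots & \vdots \\
x_{n-1} - x_n & y_{n-1} - y_n & z_{n-1} - z_n
\end{bmatrix}}_{A_{m \times 3}}
\underbrace{\begin{bmatrix} x \\ y \\ z \end{bmatrix}}_{U_{3 \times 1}}
=
\underbrace{\begin{bmatrix} b_1 \\ \vdots \\ b_{n-1} \end{bmatrix}}_{b_{(n-1) \times 1}},
\qquad m = n - 1.
$$

A is factored as A = QR, with Q of size m × 3 and R of size 3 × 3 (upper triangular). The position is then

$$
U = R^{-1} Q^{T} b,
$$

computed by QRdcmp().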
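The sketch below is a minimal pure-Python version of the solve on this slide, written under several assumptions. The slide does not show how b is built. Here it is assumed to come from the usual multilateration step: subtract the n-th node's range equation from each of the others, which gives b_i = ½(‖N_i‖² − ‖N_n‖² − d_i² + d_n²), where d_i is the measured distance to N_i. The QR factorization uses modified Gram-Schmidt, which may differ from what QRdcmp() does. The function names `qr_mgs`, `back_substitute`, and `localize` are hypothetical.

```python
import math

def qr_mgs(A):
    """Thin QR factorization A = QR of an m x k matrix (list of rows) using
    modified Gram-Schmidt. Returns Q as a list of k orthonormal columns
    (each of length m) and R as a k x k upper-triangular list of rows."""
    m, k = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(k)]  # work column-wise
    Q = []
    R = [[0.0] * k for _ in range(k)]
    for j in range(k):
        R[j][j] = math.sqrt(sum(v * v for v in cols[j]))
        q = [v / R[j][j] for v in cols[j]]
        Q.append(q)
        # Remove the new orthonormal direction from every remaining column.
        for l in range(j + 1, k):
            R[j][l] = sum(qi * vi for qi, vi in zip(q, cols[l]))
            cols[l] = [vi - R[j][l] * qi for vi, qi in zip(cols[l], q)]
    return Q, R

def back_substitute(R, y):
    """Solve R x = y for an upper-triangular R."""
    k = len(y)
    x = [0.0] * k
    for i in range(k - 1, -1, -1):
        s = y[i] - sum(R[i][j] * x[j] for j in range(i + 1, k))
        x[i] = s / R[i][i]
    return x

def localize(anchors, dists):
    """Estimate U = (x, y, z) from n >= 4 non-coplanar anchors and ranges d_i.
    The last anchor plays the role of N_n (hypothetical b construction)."""
    xn, yn, zn = anchors[-1]
    dn = dists[-1]
    norm_n = xn * xn + yn * yn + zn * zn
    A, b = [], []
    for (xi, yi, zi), di in zip(anchors[:-1], dists[:-1]):
        A.append([xi - xn, yi - yn, zi - zn])  # row i of A
        b.append(0.5 * (xi * xi + yi * yi + zi * zi - norm_n - di * di + dn * dn))
    Q, R = qr_mgs(A)                                               # A = QR
    qtb = [sum(qi * bi for qi, bi in zip(col, b)) for col in Q]   # Q^T b
    return back_substitute(R, qtb)                                 # U = R^-1 Q^T b

# Usage with exact (noise-free) ranges to a node at (3, 4, 5).
anchors = [(0, 0, 0), (10, 0, 0), (0, 10, 0), (0, 0, 10), (10, 10, 10)]
dists = [math.dist(a, (3.0, 4.0, 5.0)) for a in anchors]
estimate = localize(anchors, dists)  # should be close to [3.0, 4.0, 5.0]
```

With noisy ranges, the same code returns the least-squares estimate. That is the reason for using QR on the overdetermined system rather than inverting a square one.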

The StrongARM Architecture
- Power: 200 mW, 0.25 µm, 1.5 V
- Clock speed: 200 MHz
- Cache: 16 KB I-cache and 8 KB D-cache, 32-way set-associative with round-robin replacement; 512 B, 2-way minicache
- 31 general-purpose 32-bit registers (16 visible at a time)
- Auto-increment addressing
- No floating-point unit
- MAC (multiply-accumulate) unit

The Tensilica Xtensa Architecture (processor configuration)
- Power: 200 mW, 0.25 µm, 1.5 V
- Clock speed: 170 MHz
- Cache: 16 KB I-cache and 16 KB D-cache, direct-mapped
- 32 registers (32-bit)
- Xtensibility: custom TIE (Tensilica Instruction Extension) instructions
- No floating-point unit
- Zero-overhead loops

Outline (current section: Time and Energy Profiling)

Profiling Results
Profiler output (StrongARM processor, floating point):

Function         Total     Self   Children    Calls
_fmul           18.21%   18.21%     0.00%    188000
lubksb          15.27%    5.17%    10.10%     10000
  called from lubksb:
  _fneq                   0.37%     0.00%     14000
  _fdiv                   4.23%     0.00%     30000
  _fmul                   5.03%     0.00%     52000
  _frsb                   0.46%     0.00%     52000

Floating-point energy: StrongARM processor 68 J; Xtensa processor 144 J.
Energy = nominal core power × #cycles × clock period
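The energy model on this slide can be written as a one-line function. Plugging in the power and clock figures from the two architecture slides gives the energy each core spends per cycle. The function name `energy_joules` is hypothetical, and the cycle count in the last line is a placeholder, not a measured number.

```python
def energy_joules(core_power_w, cycles, clock_hz):
    """Energy = nominal core power x number of cycles x clock period."""
    clock_period_s = 1.0 / clock_hz
    return core_power_w * cycles * clock_period_s

# Energy per cycle implied by the slide specs: 200 mW at 200 MHz vs. 170 MHz.
strongarm_per_cycle = energy_joules(0.200, 1, 200e6)  # about 1 nJ per cycle
xtensa_per_cycle = energy_joules(0.200, 1, 170e6)     # about 1.18 nJ per cycle

# Placeholder workload of one million cycles on the StrongARM: about 1 mJ.
example = energy_joules(0.200, 1_000_000, 200e6)
```

Both cores have the same nominal power, so under this model the Xtensa's lower clock costs about 18% more energy per cycle. Any difference in total energy beyond that comes from the number of cycles each core needs.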

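Fixed Point Arithmetic
Floating point vs. fixed point:
- Add/subtract are straightforward.
- Multiply/divide require shifting.
Why can we use it for localization?
- Low accuracy requirements
- Limited range in measurements (< 10 m)
- Small matrices → small error propagation
Formats: fixed point 0000.0000 with 16 integer bits and 16 fraction bits; IEEE single-precision floating point with 1 sign bit (S), 8 exponent bits (E), and 23 mantissa bits.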
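Here is a minimal sketch of the 16.16 fixed-point operations this slide describes, emulating a signed 32-bit register. Add is a plain integer add. Multiply needs a right shift, and divide needs the dividend shifted left first. For contrast, the last function splits an IEEE single into its 1/8/23-bit fields. All names are hypothetical. The multiply truncates by shifting; a production routine might round instead. Assuming values are stored in meters, 16.16 covers about ±32768 m at a resolution of 1/65536 m (about 15 µm), which comfortably spans the < 10 m range.

```python
import struct

FRAC_BITS = 16            # 16.16 format: 16 integer bits, 16 fraction bits
ONE = 1 << FRAC_BITS

def wrap32(v):
    """Emulate a signed 32-bit register (two's-complement wraparound)."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v & 0x80000000 else v

def to_fixed(x):
    return wrap32(int(round(x * ONE)))

def to_float(f):
    return f / ONE

def fx_add(a, b):
    return wrap32(a + b)  # same scale on both sides, so a plain integer add

def fx_mul(a, b):
    # The raw product has 32 fraction bits; shift right to get back to 16.
    return wrap32((a * b) >> FRAC_BITS)

def fx_div(a, b):
    # Pre-shift the dividend so the quotient keeps 16 fraction bits.
    # Truncate toward zero, as C integer division does.
    q = abs(a << FRAC_BITS) // abs(b)
    return wrap32(q if (a < 0) == (b < 0) else -q)

def float32_fields(x):
    """Split an IEEE-754 single into (sign, biased exponent, 23-bit mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

product = to_float(fx_mul(to_fixed(1.5), to_fixed(2.25)))  # 3.375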

Fixed Point Profiling Results
Energy = nominal core power × #cycles × clock period

Processor    Floating point   Fixed point   Reduction
StrongARM    68 J             43 J          37% less
Xtensa       144 J            69 J          52% less
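The percentages on this slide can be checked directly from its own energy figures. The helper name `pct_reduction` is hypothetical.

```python
def pct_reduction(float_energy, fixed_energy):
    """Percentage of energy saved by moving from floating to fixed point."""
    return 100.0 * (float_energy - fixed_energy) / float_energy

strongarm = pct_reduction(68, 43)  # about 36.8, i.e. the slide's 37%
xtensa = pct_reduction(144, 69)    # about 52.1, i.e. the slide's 52%
```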

Outline (current section: Parallel Architectures)

Parallel Architectures
- Write sequential code in Matlab
- Extract data dependencies
- Workload analysis
[Figure: blocks labeled P, CP1, CP2, and CP3]
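This slide names two steps: extracting data dependencies and analyzing the workload. The sketch below applies both to a coarse, hypothetical task graph for one localization solve. It computes each task's earliest (ASAP) step, so tasks on the same step could run on separate processors. It also computes the ratio of total work to critical-path length, which bounds the achievable speedup. The task names and costs are placeholders, not measurements from this project.

```python
def asap_levels(deps):
    """Earliest step at which each task can run, given task -> prerequisites."""
    levels = {}
    def level(task):
        if task not in levels:
            levels[task] = 1 + max((level(p) for p in deps[task]), default=-1)
        return levels[task]
    for task in deps:
        level(task)
    return levels

def critical_path(deps, cost):
    """Length of the longest dependency chain, weighted by per-task cost."""
    finish = {}
    def f(task):
        if task not in finish:
            finish[task] = cost[task] + max((f(p) for p in deps[task]), default=0)
        return finish[task]
    return max(f(task) for task in deps)

# Hypothetical task graph for one solve (names are illustrative only).
deps = {
    "build_A": [],
    "build_b": [],
    "qr": ["build_A"],
    "qtb": ["qr", "build_b"],
    "backsub": ["qr", "qtb"],
}
# Placeholder costs in arbitrary units, NOT measured numbers.
cost = {"build_A": 1, "build_b": 1, "qr": 4, "qtb": 2, "backsub": 2}

levels = asap_levels(deps)       # build_A, build_b: 0; qr: 1; qtb: 2; backsub: 3
total_work = sum(cost.values())  # 10
span = critical_path(deps, cost) # 1 + 4 + 2 + 2 = 9
max_speedup = total_work / span  # about 1.1 for this chain-like graph
```

A ratio close to 1 means the dependencies form an almost sequential chain, so extra processors help little unless the work inside a task (such as the QR step) is itself split up.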

Outline (current section: Conclusions: Our Dream Architecture)

Our Dream Architecture
- Floating-point hardware
- MAC hardware
- Zero-overhead loops
- Auto-increment addressing
- Register file size
- Cache: direct-mapped
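One reason MAC hardware pairs well with the 16.16 fixed-point format from the earlier slide is shown below. A MAC unit with a wide accumulator can sum full-precision products and rescale once at the end. A plain multiply routine truncates every product before adding. The sketch assumes such a wide accumulator (Python integers are unbounded, so the width is not modeled), and all names are hypothetical.

```python
FRAC_BITS = 16  # 16.16 fixed-point format

def to_fixed(x):
    return int(round(x * (1 << FRAC_BITS)))

def fx_dot_mac(a, b):
    """Dot product of 16.16 vectors using a wide accumulator: each raw product
    keeps 32 fraction bits and the sum is shifted back to 16.16 only once."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y            # multiply-accumulate at full precision
    return acc >> FRAC_BITS

def fx_dot_naive(a, b):
    """Same dot product, but truncating each product before adding it."""
    return sum((x * y) >> FRAC_BITS for x, y in zip(a, b))

# Four products of the smallest 16.16 value (1/65536) with 0.5: each product
# truncates to 0 on its own, but the accumulated sum is 2/65536.
tiny = [1, 1, 1, 1]
halves = [to_fixed(0.5)] * 4
mac_result = fx_dot_mac(tiny, halves)      # 2 (raw units)
naive_result = fx_dot_naive(tiny, halves)  # 0 (raw units)
```

The inner products in the localization solve (forming Qᵀb, back-substitution) are exactly this multiply-accumulate pattern.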

Future Work
- FPGA implementation
- Xtensa customizations: TIE instructions, floating-point coprocessor
- Realistic algorithm for PicoRadio

Many Thanks To…
- Dr. Bart Kienhuis, EECS Post Doc: Ptolemy and other tools; parallel issues
- Fred Burghardt, BWRC Technical Staff: PicoRadio testbed
- Marlene Wan, BWRC Student: StrongARM energy profiling
- Vandana Prabhu, BWRC Student: Tensilica tools
- The Berkeley Wireless Research Center