1 Power-Aware System on a Chip A. Laffely, J. Liang, R. Tessier, C. A. Moritz, W. Burleson University of Massachusetts Amherst Boston Area Architecture.

Presentation transcript:

1 Power-Aware System on a Chip
A. Laffely, J. Liang, R. Tessier, C. A. Moritz, W. Burleson
University of Massachusetts Amherst
Boston Area Architecture Conference, 30 Jan 2003
{alaffely, jliang, tessier, moritz,
This material is based upon work supported by the National Science Foundation under Grant No
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

2 Motivation
Problem: Need low-power architectures for wireless DSP
- How to support dynamic clock and voltage scaling in heterogeneous systems with data-dependent workloads (granularity, overhead, control)
A solution: Use the modularity of SoC; apply scaling at the IP core level
- Apply discrete frequency and voltage scaling
- Use interconnect utilization measures and data rate requirements to dynamically control scaling

3 Overview
- Adaptive System-on-a-Chip
- Implementation
- Approach
- Preliminary Results
- Conclusions and Challenges

4 Adaptive System-on-a-Chip
- Tiled architecture with mesh interconnect
- Point-to-point communication pipeline
- Allows for heterogeneous cores
  - Differing sizes, clock rates, voltages
- Low-overhead core interface
- On-chip bus substitute for streaming applications
- Based on static scheduling
  - Fast and predictable
[Figure: a tile consists of a core and a communication interface with North, South, East, and West ports; example cores shown are Proc, Multiplier, and FPGA, plus a ctrl block]

5 aSoC Implementation
[...] technology
Full custom

6 Some Results
- 9- and 16-core systems tested for IIR, MPEG encoding, and image processing applications
- ~2x the performance of the CoreConnect bus (burst and hierarchical)
- ~1.5x the performance of an oblivious routing network [1] (dynamic routing)
- Max speedup is 5x
[1] W. Dally and H. Aoki, "Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels", IEEE Transactions on Parallel and Distributed Systems, April 1993

7 Dynamic Properties of Statically Routed System??
- Dynamically parameterizable cores proposed to save power
  - Motion Estimation core (by P. Jain, UMASS) changes from 256 cycles/pixel to 16 cycles/pixel based on input data
- Streams within the scheduled communication pipeline can be blocked and back up, or go unused
- Inefficient to simply run at the fastest rate
[Figure: ME core feeding DCT core over scheduled communications; latency changes with data]

8 Key Features for Dynamic Power Reduction
- SoC modularity
  - Sets a manageable granularity for voltage scaling
- Heterogeneous cores
  - Multiple on-chip clocks and voltages already supported
  - Core interface already handles synchronization and level conversion
- Statically scheduled interconnect
  - Traffic indicates system bottlenecks

9 Approach
- Stream-based cores
  - Limited buffering
- Core-ports
  - Single buffer for each stream to cross the clock/voltage barrier between core and interface
  - Reading/writing success rates indicate core utilization
  - Input blocked: core too slow
  - Output blocked: core too fast
- Controller
  - Interprets core-port success rates to adjust local clock and voltage
[Figure: processing pipeline in which interconnect buffers feed an input core-port, the core, and an output core-port; blocked signals drive a clock and supply controller that sets the local Vdd and local clock]

10 Power-Aware System: Core Utilization Measurement
- Accumulate failures at each core-port to control clock change
  - Blocked: add 1
  - Success: subtract 1
- Threshold and compare input and output failure counts
  - Many input, few output: increase frequency
  - Many output, few input: decrease frequency
  - Many or few of both: do nothing
[Figure: blocked signals from the input and output core-ports feed two counters; a compare-and-threshold block outputs "increase or decrease local clock"]
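The counting and threshold rules on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the authors' hardware: the class name `PortCounter`, the function `clock_decision`, the saturation limit, and the threshold value are all assumptions, since the slides give no counter width or threshold.

```python
from dataclasses import dataclass


@dataclass
class PortCounter:
    """Failure counter for one core-port (hypothetical model)."""
    count: int = 0
    max_count: int = 15  # assumed saturation limit, not from the slides

    def record(self, blocked: bool) -> None:
        # Blocked transfer: add 1. Successful transfer: subtract 1.
        if blocked:
            self.count = min(self.count + 1, self.max_count)
        else:
            self.count = max(self.count - 1, 0)


def clock_decision(in_count: int, out_count: int, threshold: int = 8) -> int:
    """Return +1 (increase frequency), -1 (decrease), or 0 (do nothing).

    The threshold of 8 is an assumed value for illustration.
    """
    many_in = in_count >= threshold
    many_out = out_count >= threshold
    if many_in and not many_out:
        return 1   # input backs up: core too slow
    if many_out and not many_in:
        return -1  # output backs up: core too fast
    return 0       # many or few of both: hold the current clock
```

A controller built this way would call `record` once per attempted transfer on each port and evaluate `clock_decision` at whatever interval the application tolerates.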

11 Power-Aware System: Local Clock Selection
- Derived from high-frequency global clock
- 8 possible values (Global Clock / 2^n): /1, /2, /4, /8, /16, /32, /64, /128
- Move one step up or down at each transition
[Figure: the count from rate measurement selects one divided version of the global clock as the core's local clock]
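As a rough sketch of the divider selection described here, the local clock can be modeled as the global clock divided by 2^n, with n moving one step per transition. The function names and the clamping at the ends of the range are assumptions made for illustration.

```python
MAX_EXPONENT = 7  # n = 0..7 gives the 8 dividers /1 ... /128


def step_divider(n: int, decision: int) -> int:
    """Move the divider exponent n by at most one step.

    decision > 0 requests a faster clock (smaller divider),
    decision < 0 requests a slower clock (larger divider).
    Clamping at the range ends is an assumed behaviour.
    """
    if decision > 0:
        n -= 1
    elif decision < 0:
        n += 1
    return max(0, min(MAX_EXPONENT, n))


def local_clock_hz(global_hz: float, n: int) -> float:
    """Local clock frequency for divider exponent n."""
    return global_hz / (2 ** n)
```

For example, starting at /8 (n = 3), a "decrease frequency" decision moves the core to /16 (n = 4); a second decision is needed to reach /32.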

12 Power-Aware System: Voltage Selection System
- Choose one of 4 supply voltages
- Look-up table (LUT) used to match voltage to frequency setting for a specific core
- Using cascading buffers, core Vdd can change within 30 ns (250 nm technology)
[Figure: the clock selector output indexes a LUT that switches one of the supplies V1 to V4 onto the core's local supply]
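The per-core look-up table can be pictured as a simple mapping from clock setting to supply rail. The sketch below is hypothetical: the table contents are invented for illustration (the slides do not give them), and the rails are kept as the symbolic names V1 to V4 rather than voltages.

```python
# Symbolic supply rails from the slide; actual voltages are not given.
SUPPLIES = ("V1", "V2", "V3", "V4")  # assumed ordering: V1 highest, V4 lowest

# Hypothetical LUT for one core: divider exponent n (0..7) -> index into SUPPLIES.
# Full speed uses the highest rail; this example reuses the lowest rail
# for every divider of /8 and slower.
CORE_LUT = (0, 1, 2, 3, 3, 3, 3, 3)


def supply_for(n: int, lut=CORE_LUT, supplies=SUPPLIES) -> str:
    """Return the supply rail matched to divider exponent n for this core."""
    return supplies[lut[n]]
```

Because the table is per core, a core with a longer critical path could carry a different `CORE_LUT` that holds the higher rails for more of the clock settings.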

13 Vdd Selection Criteria
- As Vdd decreases, delay increases exponentially
- Use the curve to match available clock frequencies to voltages
- The voltage drop reduces power by 70%, 84%, and 89%
- $P = \alpha C V_{dd}^{2} f$
[Figure: normalized core critical path delay vs. Vdd, with the max speed, 1/2 speed, 1/4 speed, and 1/8 speed operating points marked]
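To see how frequency and voltage scaling combine, take the slide's dynamic power relation in its standard form, with $\alpha$ the switching activity factor (assumed here to be the symbol lost from the transcript) and $C$ the switched capacitance. The scaling factors $k$ and $r$ below are illustrative symbols, not values from the slides.

```latex
\begin{align*}
P &= \alpha C V_{dd}^{2} f \\
P' &= \alpha C \,(r V_{dd})^{2}\, \frac{f}{k}
    \qquad \text{(supply scaled by } r \le 1,\ \text{clock divided by } k \ge 1\text{)} \\
\frac{P'}{P} &= \frac{r^{2}}{k}
\end{align*}
```

With the supply held fixed ($r = 1$), halving the clock only halves the power; any further saving at a given clock setting comes from the quadratic $r^{2}$ term, which is why each frequency step is paired with a lower supply voltage.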

14 Power Savings
- Two-core system
  - ME chooses 3 different algorithms based on input data
  - DCT runs at a constant rate
- Core power from Synopsys RTL simulation
[Figure: ME core feeding DCT core]

15 Test System Results
- Simple test case
  - Core 1 starts 16x too fast
  - Core 2 starts 8x too slow
[Figure: relative clock frequency vs. number of clock cycles for Core1 and Core2]

16 Key Issues
- Count value required to control frequency shifting?
  - May be application and core dependent
- Core characterization
  - Not easy, data dependent
  - Some tools exist for StrongARM (JouleTrack, A. Sinha, MIT)
- Benchmark development
  - A bit tedious

17 Conclusions
- SoC: a good candidate platform for voltage scaling implementation
  - Convenient granularity
  - Low overhead
  - Easily measurable control mechanism
- Hardware
- Preliminary results
- Now test real benchmarks and data