Procedure Hopping: a Low Overhead Solution to Mitigate Variability in Shared-L1 Processor Clusters Abbas Rahimi.

Presentation transcript:

Procedure Hopping: a Low Overhead Solution to Mitigate Variability in Shared-L1 Processor Clusters Abbas Rahimi ‡, Luca Benini †, and Rajesh Gupta ‡ ‡ CSE, UC San Diego † DEIS, Università di Bologna International Symposium on Low-Power Electronics and Design micrel.deis.unibo.it

Main Point: Procedure Hopping to Mitigate Variability

Sources of Device Variation
[Figure: clock period split into actual circuit delay and guardband; guardband sources: across-wafer frequency, V_CC droop, temperature, other uncertainty.]
10% V_CC, ~160˚C temperature, 40% V_TH
Variations are more challenging in a many-core platform!

Outline
Sources of Variations
Variation-tolerant Shared-L1 Processor Cluster
1. Process Variation → Variation-aware V_DD-hopping
2. Dynamic Voltage Variation → Procedure hopping
Methodology for PLV
– Design time characterization
– Compile time PLV metadata generation
– Runtime preventive compensation
Experimental Results

Shared-L1 Processor Clusters *
Each cluster consists of:
– 16 LEON-3 cores
– An intra-cluster shared L1 I$
– An on-chip multi-banked tightly coupled data memory (TCDM)
– Two single-cycle logarithmic interconnections, one each for the instruction and data sides
– A hardware synchronization handler module (SHM) to coordinate and synchronize cores accessing shared data on the TCDM
– V_DD-hopping per core
[Figure: shared-L1 TCDM cluster template, shown as a 4x8 cluster with 4 PEs and an 8-bank TCDM.]
* D. Melpignano, L. Benini, et al., “Platform 2012, a many-core computing accelerator for embedded SoCs: performance evaluation of visual analytics applications”, DAC’12

V_DD-hopping to Compensate Process Variation
Three cores (f4, f8, f9) cannot meet the target frequency of 830MHz.
[Figure: per-core frequency maps of the cluster for V_DD = 0.81V, V_DD = 0.99V, and VA-V_DD-hopping = (0.81V, 0.99V).]
All cores of the same cluster meet the target frequency of 830MHz. VA-V_DD-hopping can accordingly tune the cores' voltage based on their delay reported by CPMs.

V_DD-hopping to Compensate Process Variation
– Every core has its own voltage domain.
– All cores work at the same frequency.
– V_DD-hopping tunes the voltage of each core based on its CPM reading.
– Each core increases its voltage if its delay is high.
The process variation is compensated, but the cluster will have various voltage/temperature islands!
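The per-core tuning rule above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the discrete voltage levels, the `pick_vdd` helper, and the per-core frequency readings are all hypothetical; only the 830MHz target comes from the slides.

```python
# Hypothetical sketch of variation-aware V_DD selection per core.
TARGET_MHZ = 830.0               # target frequency stated in the slides
VDD_LEVELS = (0.81, 0.90, 0.99)  # assumed discrete supply levels, in volts


def pick_vdd(max_freq_by_vdd):
    """Return the lowest V_DD at which the core's reported maximum
    frequency (standing in for a CPM delay reading) meets the target."""
    for vdd in sorted(VDD_LEVELS):
        if max_freq_by_vdd[vdd] >= TARGET_MHZ:
            return vdd
    raise ValueError("core cannot meet the target at any available V_DD")


# Made-up readings for two cores: maximum frequency (MHz) at each level.
cores = {
    "fast_core": {0.81: 850.0, 0.90: 940.0, 0.99: 1020.0},
    "slow_core": {0.81: 790.0, 0.90: 860.0, 0.99: 930.0},
}

# Each core gets its own voltage domain; all cores run at TARGET_MHZ.
assignment = {name: pick_vdd(freqs) for name, freqs in cores.items()}
print(assignment)
```

A slow core ends up in a higher voltage island than a fast one, which is the source of the voltage/temperature islands noted above.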

Fast Dynamic IR-drop within Cluster
The IR-drop of executing FIR on cores at various operating corners. FIR does not face any voltage emergency (IR-drop < 4%) at the corners with voltages of 0.81V to 0.9V, due to their lower power densities.
(Vol., Temp.): 0.99V, 125C; 0.90V, 25C; 0.81V, 125C; 0.81V, -40C
Power density (μW/μm²): 0.66; …
Max IR-drop: 44 mV; < 35 mV; < 31 mV
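As a quick check of the 4% criterion, the reported drops can be expressed as a fraction of the supply voltage. The sketch below assumes that 44 mV belongs to the 0.99V corner, the 35 mV bound to 0.90V, and the 31 mV bound to 0.81V; that pairing is a reading of the flattened table, not something the slide states explicitly, and the helper names are hypothetical.

```python
# Hypothetical helpers; the 4% threshold is the one quoted on the slide.
EMERGENCY_THRESHOLD = 0.04


def ir_drop_fraction(drop_mv, vdd_v):
    """IR-drop as a fraction of the nominal supply voltage."""
    return (drop_mv / 1000.0) / vdd_v


def is_voltage_emergency(drop_mv, vdd_v):
    """True when the drop reaches the emergency threshold."""
    return ir_drop_fraction(drop_mv, vdd_v) >= EMERGENCY_THRESHOLD


# (V_DD in volts, IR-drop in mV); the pairing of values is an assumption.
# The 35 mV and 31 mV entries are upper bounds in the source.
corners = [(0.99, 44.0), (0.90, 35.0), (0.81, 31.0)]
for vdd, drop in corners:
    print(f"{vdd:.2f} V: {100 * ir_drop_fraction(drop, vdd):.2f}% "
          f"emergency={is_voltage_emergency(drop, vdd)}")
```

Under this pairing only the 0.99V corner crosses 4%, which agrees with the statement that FIR is emergency-free at 0.81V to 0.9V.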

Procedure Hopping to Compensate Voltage Variation
Procedure hopping facilitates fast and proactive migration of procedures within a cluster to prevent voltage variation, thanks to the shared I$ and TCDM resources.
Each procedure hops from one core to another if it causes voltage variation.

Procedure-level Vulnerability (PLV)
The notion of PLV to fast dynamic voltage variation is defined.
The design time stage analyzes the dynamic voltage droops/rises for every Proc_X under full operating conditions → generating PLV_X metadata.
[Figure: int Proc_X (…) { … } runs on Core_i at (V_i, T_j); IR-drops are observed.]
(V,T) → PLV_X
V1,T1 → 0.75
V2,T2 → 0.35
V3,T3 → 0.01
… → …

Characterization of PLV to IR-drop: Compile time + Runtime
At compile time, the PLV_X metadata of Proc_X is attached to the procedure.
At runtime, the discretized (V,T) point indexes the corresponding characterized PLV metadata to assess the vulnerability of Proc_X at the current (V,T).
If PLV_X ≥ PLV_threshold, Proc_X is hopped from the caller core to a favorable callee core.
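The runtime check described above amounts to a table lookup followed by a threshold comparison. The following is a minimal sketch under assumptions: the grid of (V,T) points, the threshold value, the callee selection rule, and every name in it are hypothetical, and the PLV numbers only reuse the example values from the PLV slide for illustration.

```python
# Hypothetical sketch of the runtime PLV check; not the authors' code.
PLV_THRESHOLD = 0.5        # assumed threshold
V_LEVELS = (0.81, 0.99)    # assumed characterized voltages (V)
T_LEVELS = (25, 125)       # assumed characterized temperatures (C)

# PLV metadata attached to each procedure at compile time (illustrative).
PLV_METADATA = {
    "proc_x": {
        (0.99, 125): 0.75,
        (0.99, 25): 0.35,
        (0.81, 125): 0.01,
        (0.81, 25): 0.01,
    }
}


def nearest(levels, value):
    """Snap a measured value to the closest characterized level."""
    return min(levels, key=lambda level: abs(level - value))


def plv_at(proc, vdd, temp_c):
    """Look up the PLV of proc at the discretized (V,T) point."""
    point = (nearest(V_LEVELS, vdd), nearest(T_LEVELS, temp_c))
    return PLV_METADATA[proc][point]


def should_hop(proc, vdd, temp_c):
    """Hop when the procedure is too vulnerable at the caller's (V,T)."""
    return plv_at(proc, vdd, temp_c) >= PLV_THRESHOLD


def pick_callee(proc, caller, cores):
    """Pick the other core with the lowest PLV, if it is below threshold."""
    candidates = [(plv_at(proc, *vt), name)
                  for name, vt in cores.items() if name != caller]
    plv, name = min(candidates)
    return name if plv < PLV_THRESHOLD else None


cores = {"core0": (0.98, 118.0), "core1": (0.82, 30.0)}
if should_hop("proc_x", *cores["core0"]):
    print("hop proc_x to", pick_callee("proc_x", "core0", cores))
```

Because the metadata is precomputed, the runtime cost of the decision is one discretization and one lookup per call.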

Max Voltage Variation Across Corners and Procedures
[Table: max voltage droop (%) for each procedure (a2time, FIR, IFFT, bitmnp, cacheb, IDCT, matrix, pntrch, PWM, sspeed, tblook, ttsprk) at each corner (0.99V, 125°C; 0.90V, 25°C; 0.81V, 125°C; 0.81V, -40°C).]
Most procedures running on cores at 0.99V have voltage emergencies.
At 0.9V, only four procedures (IFFT, IDCT, matrix, ttsprk) face voltage emergencies.
No voltage emergency occurs at 0.81V.
Procedure hopping avoids the voltage emergency for all procedures by hopping them from a high-voltage core to a low-voltage core.

Cost of Procedure Hopping
The total round-trip overhead of hopping a procedure from the caller core and returning the results from the callee core is less than 800 cycles. This overhead is less than 1% of the total cycles needed to execute any of the characterized procedures in the EEMBC benchmark.
During procedure hopping no voltage emergency can occur, even at (0.99V, 125˚C), in either the caller or the callee core.

                 Caller hopping   Caller not hopping   Callee service   Callee no service
Latency          218 cycles       88 cycles            575 cycles       342 cycles
Voltage droop    1.3%             0.6%                 2.9%             1.8%
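The two overhead claims can be cross-checked with simple arithmetic. The sketch below assumes the round trip is the sum of the "caller hopping" and "callee service" latencies; the slide does not spell out that decomposition, and the helper name is hypothetical.

```python
# Latencies from the table above, in cycles.
CALLER_HOPPING = 218
CALLEE_SERVICE = 575
ROUND_TRIP_BOUND = 800  # stated upper bound on the round-trip overhead

# Assumed decomposition of the round trip (not stated on the slide).
round_trip = CALLER_HOPPING + CALLEE_SERVICE
print(round_trip, round_trip < ROUND_TRIP_BOUND)

# For the bound to stay below 1%, a procedure must run for more than
# 100 times the bound.
min_procedure_cycles = ROUND_TRIP_BOUND * 100
print(min_procedure_cycles)


def overhead_fraction(procedure_cycles, hop_cycles=ROUND_TRIP_BOUND):
    """Hopping overhead relative to the procedure's own execution time."""
    return hop_cycles / procedure_cycles


# Illustrative procedure length (hypothetical, not from the slides).
print(overhead_fraction(100_000))
```

Under this assumption the round trip is 793 cycles, consistent with the stated bound, and the 1% claim implies that every characterized procedure runs for more than 80,000 cycles.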

Conclusion
The notion of procedure-level vulnerability to fast dynamic voltage variation is defined.
Based on PLV metadata, a fully-software, low-cost procedure hopping technique is proposed that guarantees voltage-emergency-free migration of all procedures, quickly and proactively, within a shared-L1 processor cluster.
Full post-P&R results in 45nm TSMC technology confirm that procedure hopping avoids voltage emergencies across a variability-affected cluster, while imposing only an amortized cost of less than 1% latency for any of the characterized embedded procedures.

Thank you!
Acknowledgment: NSF Variability Expedition, ERC Multitherman Project

HW/SW Collaborative Architecture to Support Intra-cluster Procedure Hopping
– The code is easily accessible via the shared L1 I$.
– The data and parameters are passed through the shared stack in the TCDM.
– A procedure hopping information table (PHIT) keeps the status of a migrated procedure.
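A toy model can make the roles of the three shared resources concrete. Everything in this sketch is hypothetical: the slides do not describe the PHIT fields, the status values, or the handshake, so the `PhitEntry` layout and the three helper functions are assumptions chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class HopStatus(Enum):
    IDLE = 0
    REQUESTED = 1
    DONE = 2


@dataclass
class PhitEntry:
    """Assumed PHIT record: status of one migrated procedure."""
    status: HopStatus = HopStatus.IDLE
    callee: Optional[int] = None
    result: Any = None


shared_stack = []                                  # stands in for the TCDM stack
shared_code = {"proc_x": lambda a, b: a + b}       # stands in for the shared I$
phit = {}                                          # procedure name -> PhitEntry


def caller_hop(proc, args, callee):
    """Caller pushes the parameters and records the request in the PHIT."""
    shared_stack.append(args)
    phit[proc] = PhitEntry(HopStatus.REQUESTED, callee)


def callee_service(proc):
    """Callee pops the parameters, runs the shared code, stores the result."""
    entry = phit[proc]
    if entry.status is HopStatus.REQUESTED:
        args = shared_stack.pop()
        entry.result = shared_code[proc](*args)
        entry.status = HopStatus.DONE


def caller_collect(proc):
    """Caller reads the result back and frees the PHIT entry."""
    entry = phit[proc]
    if entry.status is not HopStatus.DONE:
        raise RuntimeError("procedure has not completed on the callee")
    entry.status = HopStatus.IDLE
    return entry.result


caller_hop("proc_x", (2, 3), callee=1)
callee_service("proc_x")
print(caller_collect("proc_x"))
```

No code or data is copied between cores in this model: only the PHIT entry and the stack contents change, which is why sharing the I$ and TCDM keeps migration cheap.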

Intra-procedure Peak Power Variation
– A maximum of 1.28× intra-corner peak power variation occurs between the IFFT and tblook procedures at (0.81V, 125C).
– The maximum inter-corner peak power variation is 3.5×, for FIR.
– A maximum of 4.1× peak power variation occurs across corners and procedures: a2time at (0.81V, -40C) and IFFT at (0.99V, 125C).