Illusionist: Transforming Lightweight Cores into Aggressive Cores on Demand — I2PC, March 28, 2013. Amin Ansari, Shuguang Feng, Shantanu Gupta, Josep Torrellas, and Scott Mahlke.

Presentation transcript:

Illusionist: Transforming Lightweight Cores into Aggressive Cores on Demand
I2PC, March 28, 2013
Amin Ansari (1), Shuguang Feng (2), Shantanu Gupta (3), Josep Torrellas (1), and Scott Mahlke (4)
(1) University of Illinois, Urbana-Champaign; (2) Northrop Grumman Corp.; (3) Intel Corp.; (4) University of Michigan, Ann Arbor

Adapting to Application Demands
 The number of threads to execute is not constant
  o Many threads available: a system with many lightweight cores achieves better throughput
  o Few threads available: a system with aggressive cores achieves better throughput
  o Single-thread performance is always better with aggressive cores
 Asymmetric Chip Multiprocessors (ACMPs):
  o Adapt to the variability in the number of threads
  o Limited in that there is no dynamic adaptation
 To provide dynamic adaptation, we use core coupling
[Figure: two cores coupled by a communication channel, trading off performance]

Core Coupling
 Typically configured as leader/follower cores, where the leader runs ahead and attempts to accelerate the follower
  o Slipstream
  o Master/Slave Speculation
  o Flea Flicker
  o Dual-core Execution
  o Paceline
  o DIVA
 The leader runs ahead by executing a "pruned" version of the application
 The leader speculates on long-latency operations
 The leader is aggressively frequency scaled (reduced safety margins)
 A smaller follower core simplifies the verification of the leader core

Extending Core Coupling
[Figure: a 9-core ACMP system. Throughput configuration: 7 lightweight cores plus a coupled pair — an aggressive core (AC) feeding hints to a lightweight core (LWC). This is the starting point for Illusionist.]

Illusionist vs. Prior Work
[Figure: one aggressive core sending hints to eight lightweight cores]
 Higher single-thread performance for all LWCs
  o By using a single aggressive core
  o Giving the appearance of 8 semi-aggressive cores

Illusionist vs. Prior Work
[Figure: Master/Slave Parallelization [Zilles'02] — a master core runs distilled segments A', B', C' while slave cores 1-3 execute the corresponding full segments A, B, C]

Providing Hints for Many Cores
 The original IPC of the aggressive core is ~2X that of a LWC
 We want an AC to keep up with a large number of LWCs
  o We need to substantially reduce the amount of work the aggressive core does for each thread running on a LWC
 We need to run fewer instructions per thread
  o We distill the program that the aggressive core needs to run
  o We limit execution of the program to only the most fruitful parts
 The main challenge is to preserve the effectiveness of the hints while removing instructions

Program Distillation
 Objective: reduce the size of the program while preserving the effectiveness of the original hints (branch prediction and cache hits)
 Distillation techniques
  o Aggressive instruction removal (on average, 77%)
    - Remove instructions which do not contribute to hint generation
    - Remove highly biased branches and their backward slices
    - Remove memory instructions accessing the same cache line
  o Select the most promising program phases
    - A predictor that uses performance counters
    - A regression model based on IPC, cache, and branch-predictor miss rates

Example of Instruction Removal (179.art)

Original code:

    if (high <= low) return;
    srand(10);
    for (i = low; i < high; i++) {
      for (j = 0; j < numf1s; j++) {
        if (i % low) {
          tds[j][i] = tds[j][0];
          tds[j][i] = bus[j][0];
        } else {
          tds[j][i] = tds[j][1];
          tds[j][i] = bus[j][1];
        }
      }
    }
    for (i = low; i < high; i++) {
      for (j = 0; j < numf1s; j++) {
        noise1 = (double)(rand() & 0xffff);
        noise2 = noise1 / (double)0xffff;
        tds[j][i] += noise2;
        bus[j][i] += noise2;
      }
    }
    ...

The noise loop, unrolled by a factor of 4:

    for (i = low; i < high; i = i + 4) {
      for (j = 0; j < numf1s; j++) {
        noise1 = (double)(rand() & 0xffff);
        noise2 = noise1 / (double)0xffff;
        tds[j][i] += noise2;
        bus[j][i] += noise2;
      }
      for (j = 0; j < numf1s; j++) {
        noise1 = (double)(rand() & 0xffff);
        noise2 = noise1 / (double)0xffff;
        tds[j][i+1] += noise2;
        bus[j][i+1] += noise2;
      }
      for (j = 0; j < numf1s; j++) {
        noise1 = (double)(rand() & 0xffff);
        noise2 = noise1 / (double)0xffff;
        tds[j][i+2] += noise2;
        bus[j][i+2] += noise2;
      }
      for (j = 0; j < numf1s; j++) {
        noise1 = (double)(rand() & 0xffff);
        noise2 = noise1 / (double)0xffff;
        tds[j][i+3] += noise2;
        bus[j][i+3] += noise2;
      }
    }

Distilled code:

    srand(10);
    for (i = low; i < high; i = i + 4) {
      for (j = 0; j < numf1s; j++) {
        tds[j][i] = tds[j][1];
        tds[j][i] = bus[j][1];
      }
    }
    for (i = low; i < high; i = i + 4) {
      for (j = 0; j < numf1s; j++) {
        tds[j][i] = noise2;
        bus[j][i] = noise2;
      }
    }

Hint Phases
[Figure: Performance(accelerated LWC) / Performance(original LWC), plotted per group of 10K instructions]
If we can predict these phases without actually running the program on both the lightweight and aggressive cores, we can limit dual-core execution to only the most useful phases.

Phase Prediction
 The phase predictor:
  o does a decent job of predicting the IPC trend
  o can sit in either the hypervisor or the operating system, reading the performance counters while the threads run
 The aggressive core runs the thread that will benefit the most

Illusionist: Core Coupling Architecture
[Figure: the aggressive core's pipeline (FET, DEC, REN, DIS, EXE, MEM, COM) feeds a hint-gathering unit; hints pass through a head/tail queue to the hint-distribution unit of the lightweight core's shorter pipeline (FE, DE, RE, DI, EX, ME, CO). Both cores share the L2 cache; the lightweight core has read-only access to the aggressive core's L1-data cache; a cache fingerprint drives hint disabling, and a resynchronization signal carries hint-disabling information back to the aggressive core.]

Illusionist System
[Figure: four clusters connected through a data switch to L2 cache banks; each cluster contains one aggressive core with its hint-gathering unit and several lightweight cores, each attached through its own queue]

Experimental Methodology
 Performance: heavily modified SimAlpha
  o Instruction removal and phase-based program pruning
  o SPEC-CPU-2K with SimPoint
 Power: Wattch, HotLeakage, and CACTI
 Area: Synopsys toolchain + 90nm TSMC

Performance After Acceleration
[Figure: per-benchmark speedup of an accelerated LWC] On average, 43% speedup compared to a LWC.

Instruction Type Breakdown
[Figure: per-benchmark instruction-type breakdown; b = before distillation, a = after distillation] In most benchmarks, the breakdowns are similar.

Area-Neutral Comparison of Alternatives
[Figure: area-neutral comparison against the alternative of more lightweight cores; annotated 34% and 2X]

Conclusion
 On-demand acceleration of lightweight cores
  o using a few aggressive cores
 An aggressive core keeps up with many LWCs by
  o Aggressive instruction removal with a minimal impact on the hints
  o Phase-based program pruning based on hint effectiveness
 Illusionist provides an interesting design point
  o Compared to a CMP with only lightweight cores: 35% better single-thread performance per thread
  o Compared to a CMP with only aggressive cores: 2X better system throughput