Configurational Workload Characterization Hashem H. Najaf-abadi Eric Rotenberg.

Presentation transcript:

Configurational Workload Characterization Hashem H. Najaf-abadi Eric Rotenberg

Heterogeneity A Single Core: [Figure: Program 1 and Program 2 on one Processor.]

Heterogeneity Multiple Cores: [Figure: Program 1 and Program 2, each on its own Processor.]

Heterogeneity Heterogeneous Cores: [Figure: Program 1 and Program 2 on heterogeneous processor cores.]

Heterogeneous CMP Design Must determine: 1) Best processor configuration for a group of workloads. 2) Best way to group workloads together.

The Challenge: Communal Customization [Figure: a workload space containing workloads A through N, mapped to the best core configurations, Core 1 and Core 2.]

Existing Approaches Regression models: Enable speedy exploration. Subsetting: Reduce workloads to a representative subset based on characteristics.

The Argument Subsetting isn’t a valid substitute or facilitator for communal customization. Reason: complex interdependencies between different architectural units.

Ties that bind 1) The global clock intertwines the sizing of different architectural units. 2) The burden of compromise in one unit can be passed on to another.

Example: The Global Clock [Figure: cache and issue-queue delays against clock periods of 1 ns and 0.66 ns. Solid line: delay of the issue queue. Dashed line: access delay of the cache. Annotations: slack; less slack; pipeline too deep; small issue queue; needlessly large cache.]

Example: The Global Clock The clock period, issue-queue size and cache size cannot be optimized independently of each other. [Figure: cache and issue-queue delays at clock periods of 1 ns and 0.66 ns.]
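The coupling can be illustrated with a small sketch. The delay values below are hypothetical and are not taken from the slides; the only assumption is that a unit occupies a whole number of pipeline stages, so any time left over in its last stage is wasted as slack.

```python
import math

def stages_and_slack(unit_delay_ns, clock_ns):
    """Return (pipeline stages, slack in ns) for one unit at a given clock period."""
    # A unit needs enough whole stages to cover its delay; the small epsilon
    # guards against floating-point noise when the delay divides exactly.
    stages = math.ceil(unit_delay_ns / clock_ns - 1e-9)
    slack = stages * clock_ns - unit_delay_ns
    return stages, slack

# Hypothetical unit delays in ns (illustrative only).
units = {"issue_queue": 0.9, "cache": 1.3}

for clock_ns in (1.0, 0.66):
    for name, delay in units.items():
        stages, slack = stages_and_slack(delay, clock_ns)
        print(f"clock={clock_ns}ns {name}: {stages} stage(s), slack={slack:.2f}ns")
```

With these numbers the cache wastes most of its second stage at 1 ns but fits almost exactly at 0.66 ns, while the issue queue goes from one stage to two. Changing the clock to suit one unit therefore changes the best size or depth of the other.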

Ties that bind 1) The global clock intertwines the sizing of different architectural units. 2) The burden of compromise in one unit can be passed on to another.

Example: Passing on the Burden Workload characteristics, all normalized to a scale of 0~10: A) Working-set size, B) Branch predictability, C) Density of dependence chains, D) Frequency of loads, E) Frequency of conditional branches. [Figure: characteristic profiles of three workloads, α, β and γ.]

Example: Passing on the Burden [Figure: customized architectures for workloads α, β and γ, with the speed of the core and of the cache each marked on an L-to-H scale.]

A More Accurate Solution Represent workloads by their customized architectural configurations. This allows direct and accurate evaluation of how well different workloads do on each other's customized configurations. We call this Configurational Workload Characterization.

Design Process Overview
How not to do it: Important workloads -> (select representative workloads based on workload behavior) -> Rep. workloads -> (search for opt. core combination) -> Optimal core combination.
How to do it: Important workloads -> (customize a core for each workload: configurational characterization) -> Customized architectures -> (search for opt. core combination) -> Optimal core combination.

Pros & Cons
- More costly to determine.
+ Provides a design solution closer to optimal.
+ Provides a systematic approach.
+ Can be performed prior to the design phase that is critical for time-to-market.

XP-SCALAR A superscalar design-space exploration framework (www4.ncsu.edu/~hhashem/xpscalar.htm). Uses SimpleScalar to perform cycle-accurate simulations. Uses the CACTI model to approximate the access latency of the different units.

XP-SCALAR What parameters are varied: Clock period, Processor width, Size of the issue queue, Size of the register-file, Size of the load-store queue, Size of the L1 and L2 caches

XP-SCALAR How they are varied: a) Clock period is varied, and architecture parameters are adjusted to make latencies fit within pipeline stages. b) Number of pipeline stages of a unit is varied and its configuration appropriately adjusted.
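Step (a) can be sketched as follows. This is not XP-SCALAR code: `toy_latency_ns` is a made-up stand-in for a CACTI latency estimate, and `largest_fitting_size` is a hypothetical helper that picks the largest unit size whose latency still fits in the stages allotted to it.

```python
import math

def toy_latency_ns(entries):
    """Hypothetical latency model (NOT CACTI): grows with log2 of the unit size."""
    return 0.2 + 0.1 * math.log2(entries)

def largest_fitting_size(clock_ns, stages, candidate_sizes):
    """Largest candidate size whose latency fits in `stages` cycles, or None."""
    budget_ns = clock_ns * stages
    # Epsilon tolerates floating-point noise when a latency equals the budget.
    fitting = [s for s in candidate_sizes if toy_latency_ns(s) <= budget_ns + 1e-9]
    return max(fitting) if fitting else None

sizes = [16, 32, 64, 128, 256]  # illustrative candidate entry counts

for clock_ns, stages in [(1.0, 1), (0.66, 1), (0.66, 2)]:
    size = largest_fitting_size(clock_ns, stages, sizes)
    print(f"clock={clock_ns}ns, stages={stages}: largest fitting size = {size}")
```

Under this toy model a 0.66 ns clock forces the unit down to 16 entries in one stage, while adding a second stage, as in step (b), admits the 256-entry option again.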

Determining the Best Cores Execute all benchmarks on each other's customized configurations. From the results, determine the best grouping through a complete search.
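A minimal sketch of such a complete search is given here. The IPT numbers are invented for illustration and are not results from the presentation. The sketch also assumes that, for a given set of cores, each benchmark runs on whichever core in the set serves it best.

```python
from itertools import combinations
from statistics import harmonic_mean, mean

# ipt[b][c]: hypothetical IPT of benchmark b on the core customized for c.
ipt = {
    "gcc":   {"gcc": 2.4, "mcf": 1.0, "twolf": 1.6},
    "mcf":   {"gcc": 0.4, "mcf": 1.0, "twolf": 0.6},
    "twolf": {"gcc": 1.5, "mcf": 0.9, "twolf": 1.8},
}

def best_core_set(ipt, num_cores, summarize):
    """Exhaustively pick the set of customized cores with the best summary IPT."""
    cores = sorted(ipt)

    def score(core_set):
        # Assumption: each benchmark uses the best core available in the set.
        return summarize([max(ipt[b][c] for c in core_set) for b in ipt])

    return max(combinations(cores, num_cores), key=score)

for k in (1, 2):
    print(k, "core(s), avg. IPT:", best_core_set(ipt, k, mean))
    print(k, "core(s), har. IPT:", best_core_set(ipt, k, harmonic_mean))
```

In this toy data the single best core differs by metric: the average favors the gcc core, while the harmonic mean favors the twolf core because it penalizes the poor mcf result on the gcc core.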

Best Core Results [Table; columns: customized core(s), avg. IPT, har. IPT.]
best config for avg. & har. IPT: gcc
best configs for avg. IPT: parser, twolf
best configs for har. IPT: gcc, mcf
best configs for avg. IPT: crafty, parser, twolf
best configs for har. IPT: crafty, mcf, twolf
best configs for avg. & har. IPT: crafty, mcf, parser, twolf
each benchmark on its own customized architecture

The effect of subsetting Subsetting of a single pair of benchmarks results in the extraction of a totally different set of best cores.

Representation [Figure: dendrogram.]

Conclusions There are interdependencies between architectural units in how they are customized. In the design of a heterogeneous CMP, subsetting can lead to performance degradation.