CMP Design Choices Finding Parameters that Impact CMP Performance Sam Koblenski and Peter McClone

Outline
- Introduction
- Assumptions
- Plackett & Burman Analysis
  - Simulation methods
  - Statistical Design
  - Plackett & Burman Results
- Mean Value Analysis
  - MVA Implementation
  - MVA Results
  - AMVA Implementation
  - AMVA Results
- Complementary Results
- Conclusions

Introduction
- Two-part study: the design space is huge, so how can we reduce it?
- Method 1: Plackett & Burman (PB) Analysis
  - Finds the critical parameters
  - Each design uses extreme values of the parameters
  - Detailed architecture design can then focus on a few parameters

Introduction (cont.)
- Method 2: a Mean Value Analysis (MVA) model of a CMP
  - A deliberately simple model designed to compute throughput
  - Design choices can be narrowed down quickly
  - Builds intuition and identifies patterns and parameter relationships

Assumptions - PB Design
- In-order cores are approximated as out-of-order cores with a small instruction window
- Die size = 300 mm² (16 MB of cache at 65 nm)
- L2 cache size is expanded to fill the die
  - Discrete sizes: 4, 8, 12 MB
  - Associativity can be non-power-of-2
- Core size is measured in Cache Byte Equivalents (CBE); see the sketch below:

Pipeline       Width   CBE
In-Order       1       50 kB
In-Order       4       100 kB
Out-of-Order   1       75 kB
Out-of-Order   4       250 kB
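
As a worked illustration of these area assumptions (a hypothetical sketch: the function name and the fitting rule are ours, not the authors'), the following picks the largest discrete L2 size that fits the 300 mm² / 16 MB die budget once the cores take their Cache Byte Equivalents:

```python
# Hypothetical sketch: derive the L2 size left on the die for a given core
# configuration, using the Cache Byte Equivalent (CBE) table above.
# Assumes the 300 mm^2 die holds 16 MB (16384 kB) of cache-equivalent area.

CBE_KB = {("in-order", 1): 50, ("in-order", 4): 100,
          ("out-of-order", 1): 75, ("out-of-order", 4): 250}

DIE_BUDGET_KB = 16 * 1024          # 16 MB die budget, in kB
L2_SIZES_KB = [4096, 8192, 12288]  # discrete L2 sizes: 4, 8, 12 MB

def l2_size_kb(num_cores, pipeline, width):
    """Largest discrete L2 that fits after the cores take their share."""
    core_area = num_cores * CBE_KB[(pipeline, width)]
    remaining = DIE_BUDGET_KB - core_area
    fitting = [s for s in L2_SIZES_KB if s <= remaining]
    if not fitting:
        raise ValueError("cores alone exceed the die budget")
    return max(fitting)

# Example: 16 wide out-of-order cores still leave room for a 12 MB L2.
print(l2_size_kb(16, "out-of-order", 4))  # -> 12288
```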

Simulation Methodology
- Simics with Ruby & Opal
- 16-processor simulations used cache warmup files
- 2-processor simulations ran for more transactions
- Attempted the OLTP and JBB benchmarks

Benchmark   Processors   Transactions
OLTP        2            200
OLTP        16           100
JBB         2            –
JBB         16           –

Plackett & Burman Design
- Motivation
  - Narrow a huge design space
  - Minimize the number of simulation runs (experiments)
- Preliminaries
  - Choose a performance measure
  - Choose extreme (high/low) values for each parameter
  - Fix the number of parameters N (a PB design with X runs, where X is a multiple of 4, handles up to N = X − 1 parameters; e.g., 11 parameters fit in 12 runs)

PB Design Example
[Design matrix with seven parameters (A–G), one row of high/low settings per run, and the measured Time for each run; the matrix values were not preserved in this transcript.]
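
The matrix itself is lost, but the standard 8-run Plackett & Burman construction for seven parameters conveys what such a slide shows. This is a minimal sketch: the seed row is the published PB generator for 8 runs, and the Time values are made-up placeholders, not measured results.

```python
# Standard 8-run Plackett & Burman design for seven parameters (A-G).
seed = [+1, +1, +1, -1, +1, -1, -1]  # published PB generator for 8 runs

# Rows 1-7 are cyclic shifts of the seed; row 8 is all-low (-1).
matrix = [seed[i:] + seed[:i] for i in range(7)] + [[-1] * 7]

# Illustrative placeholder runtimes, one per run (NOT real measurements).
times = [9.1, 7.4, 8.8, 6.9, 8.2, 7.7, 9.5, 6.1]

# The effect of each parameter is the dot product of its +/-1 column with
# the response vector; a larger |effect| means the parameter matters more.
effects = {}
for j, name in enumerate("ABCDEFG"):
    effects[name] = sum(matrix[i][j] * times[i] for i in range(8))

for name, effect in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(name, effect)
```

Ranking parameters by |effect| is exactly how a PB analysis narrows the design space: only the top few columns warrant detailed simulation.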

PB Design Parameter Values

Parameter               Low Value (-)    High Value (+)
Number of Cores         2                16
Pipeline Organization   In-Order         Out-of-Order
Pipeline Width          1                4
L1 Cache Size           16 kB            128 kB
L1 Associativity        Direct Mapped    32-Way
L2 Cache Size           Die Area – Core Area
L2 Associativity        Direct Mapped    32-Way
L2 Banks                2                32
L2 Latency              50 Cycles        12 Cycles
L2 Directory Latency    25 Cycles        6 Cycles
Pin Bandwidth           –                –
Memory Latency          300 Cycles       100 Cycles

PB Results
- The extreme parameter values stressed the simulator
- Have not yet completed an entire set of runs
- It may be necessary to build a custom L2 network for each run

PB Results for JBB

Assumptions - MVA
- The distribution of time between memory requests is exponential
- All processor cores exhibit the same average behavior with respect to their service times and miss rates
- Doubling the cache size reduces the miss rate by a factor of 1/√2
- An in-order core takes approximately the same area as 50 kB of cache
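
As a quick numeric illustration of the square-root rule above (a rule of thumb the slide assumes, applied here to made-up numbers), the miss rate at cache size C scales as m(C) = m(C₀) · √(C₀/C):

```python
import math

def scaled_miss_rate(base_miss_rate, base_size_kb, new_size_kb):
    """Square-root rule: doubling the cache cuts the miss rate by 1/sqrt(2)."""
    return base_miss_rate * math.sqrt(base_size_kb / new_size_kb)

# Example: a 5% miss rate at 4 MB drops to ~3.5% at 8 MB and 2.5% at 16 MB.
print(scaled_miss_rate(0.05, 4096, 8192))   # ~0.0354
print(scaled_miss_rate(0.05, 4096, 16384))  # 0.025
```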

MVA Design
- A simple closed queueing model: each core's memory requests circulate through the processor/L1, L2, and memory service centers

MVA Design
- Two phases of the model design
  - First: use the exact MVA equations
    - Use the average time between memory accesses as an application parameter
    - Solve for throughput
  - Second: use approximate MVA (AMVA)
    - Use an iterative method to converge on the processor service time
    - Solve for throughput

Exact MVA
- To solve the MVA equations, we determine the mean residence time at each service center:
  - R_p – processor/L1 residence time
  - R_L2 – L2 residence time
  - R_M – memory residence time
- The case with one core is trivial; use it to solve recursively for additional cores:
  - R_{n,p} = D_p * (1 + Q_{n-1,p})
- Here D_p is the service demand at center p and Q_{n-1,p} is the mean queue length there with n−1 cores (see the sketch below)
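
A minimal sketch of this recursion for a closed network with the three service centers named above; the service demands in the example are illustrative placeholders, not values from the study:

```python
# Exact MVA for a closed single-class network with three service centers
# (processor/L1, L2, memory), following R_{n,k} = D_k * (1 + Q_{n-1,k}).

def exact_mva(demands, num_cores):
    """Return per-center residence times and system throughput.

    demands[k] is the service demand D_k of one memory request at center k.
    Iterates the MVA recursion from 1 core up to num_cores.
    """
    queues = [0.0] * len(demands)  # Q_{0,k} = 0: empty system
    for n in range(1, num_cores + 1):
        resid = [d * (1 + q) for d, q in zip(demands, queues)]
        throughput = n / sum(resid)               # Little's law on the cycle
        queues = [throughput * r for r in resid]  # Q_{n,k} = X_n * R_{n,k}
    return resid, throughput

# Example: D_p = 100 cycles, D_L2 = 12 cycles, D_M = 30 cycles per request.
resid, x = exact_mva([100.0, 12.0, 30.0], num_cores=16)
print(x, resid)
```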

Exact MVA Results
- Throughput was calculated using data from the simulation runs (miss rates, number of memory requests)
- The results are erratic and not consistent with the simulation results
- The most likely source of the problem is the processor service time!

Approximate MVA Design
- An iterative method can be used to converge on the processor service time
  - Uses the total residence time R, measured in simulation, as an input parameter
- The iterative method works well with approximate MVA
- The goal is to match the total average residence time of a memory request (see the sketch below)
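
A sketch of how such a calibration loop could look. It pairs Schweitzer's fixed-point approximation (a standard AMVA technique; the slides do not name which approximation was used) with a bisection on the processor demand D_p until the model's total residence time matches a measured target. All numbers are illustrative:

```python
# AMVA (Schweitzer: Q_{n-1,k} ~ Q_{n,k} * (n-1)/n) inside a bisection that
# adjusts the processor demand D_p to hit a measured total residence time.

def amva_total_residence(demands, n, iters=200):
    """Approximate total residence time for n cores via Schweitzer AMVA."""
    queues = [n / len(demands)] * len(demands)  # initial guess
    for _ in range(iters):
        resid = [d * (1 + q * (n - 1) / n) for d, q in zip(demands, queues)]
        x = n / sum(resid)
        queues = [x * r for r in resid]
    return sum(resid)

def calibrate_dp(target_r, other_demands, n, lo=1.0, hi=10000.0):
    """Bisect on the processor demand D_p to match the measured total R."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if amva_total_residence([mid] + other_demands, n) < target_r:
            lo = mid  # model too fast: raise D_p
        else:
            hi = mid
    return (lo + hi) / 2

# Example: match a measured 450-cycle total residence time with 8 cores,
# given placeholder L2 and memory demands of 12 and 30 cycles.
print(calibrate_dp(450.0, [12.0, 30.0], n=8))
```

Note that when the target lies below the smallest residence time the model can produce at any D_p, the bisection pins against its lower bound without converging, which mirrors the behavior reported on the next slide.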

Approximate MVA Results
- Convergence of the AMVA iteration does not always occur
- The total measured residence time cannot be reached with this model and parameter set
- Varying the input values without achieving convergence implies flaws in the model structure
- There is a complex relationship, which the model must capture, between the memory system and the rate at which a core issues requests

Complementary Results
- The initial goal was to use the PB results to find the parameters the MVA model should focus on
- Results from the two approaches could then cross-verify each other's correctness

Conclusions
- Simics has a STEEP learning curve
  - Less than 5 weeks is not enough time to obtain valid results (or any results)
- Refining a PB design leads to long lead times before valid results
- CMPs complicate the relationship between the cores and the memory subsystem
- Design methodologies that focus simulation runs on the parameters that matter are necessary
- More results and conclusions to follow

Questions?