CMP Design Choices Finding Parameters that Impact CMP Performance


CMP Design Choices: Finding Parameters that Impact CMP Performance
Sam Koblenski and Peter McClone

Outline
- Introduction
- Assumptions
- Plackett & Burman Analysis
  - Simulation methods
  - Statistical design
  - Plackett & Burman results
- Mean Value Analysis
  - MVA implementation
  - MVA results
  - AMVA implementation
  - AMVA results
- Complementary Results
- Conclusions

Introduction
- Two-part study
- The design space is huge; how can we reduce it?
- Method 1: Plackett & Burman (PB) analysis
  - Finds the critical parameters
  - Design uses extreme values of the parameters
  - Detailed architecture design can then focus on a few parameters

Introduction (cont.)
- Method 2: Mean Value Analysis (MVA)
  - Model of a CMP
  - Deliberately simple; designed to compute throughput
  - Design choices can be narrowed down quickly
  - Intuition is gained and patterns/parameter relationships are identified

Assumptions - PB Design
- In-order cores approximated as out-of-order (OoO) cores with a small window
- Die size = 300 mm² (16 MB cache @ 65 nm)
- L2 cache size expanded to fill the die
  - Discrete sizes: 4, 8, 12 MB
  - Associativity can be non-power-of-2
- Core size measured in Cache Byte Equivalents (CBE):

| Pipeline     | Width | CBE    |
|--------------|-------|--------|
| In-Order     | 1     | 50 kB  |
| In-Order     | 4     | 100 kB |
| Out-of-Order | 1     | 75 kB  |
| Out-of-Order | 4     | 250 kB |
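The area bookkeeping on this slide can be sketched in a few lines. This is only an illustration: the CBE values are read from the core-size table, the 16 MB die capacity comes from the die-size bullet, and the rule of picking the largest listed L2 size that still fits next to the cores is an assumption, since the slides do not state how the discrete size was chosen. The function name `l2_size_mb` is hypothetical.

```python
# Core area in cache byte equivalents (kB), as read from the slide's table.
CBE_KB = {
    ("in-order", 1): 50,
    ("in-order", 4): 100,
    ("out-of-order", 1): 75,
    ("out-of-order", 4): 250,
}
DIE_KB = 16 * 1024        # die holds the equivalent of 16 MB of cache
L2_SIZES_MB = (4, 8, 12)  # discrete L2 sizes listed on the slide


def l2_size_mb(cores, organization, width):
    """Largest listed L2 size that fits beside the cores (assumed rule)."""
    free_kb = DIE_KB - cores * CBE_KB[(organization, width)]
    fitting = [size for size in L2_SIZES_MB if size * 1024 <= free_kb]
    if not fitting:
        raise ValueError("cores leave less than 4 MB of area for the L2")
    return max(fitting)


if __name__ == "__main__":
    for cores in (2, 16, 32):
        print(cores, "wide OoO cores ->", l2_size_mb(cores, "out-of-order", 4), "MB L2")
```

Under these assumptions even 16 wide out-of-order cores occupy only about 4 MB of cache-equivalent area, so the smaller L2 sizes come into play only at higher core counts.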

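Simulation Methodology
- Simics with Ruby & Opal
- 16-processor (16P) simulations used cache warmup files
- 2-processor (2P) simulations ran for more transactions
- Attempted OLTP and JBB benchmarks

| Benchmark | Processors | Transactions |
|-----------|------------|--------------|
| OLTP      | 2          | 200          |
| OLTP      | 16         | 100          |
| JBB       | 2          | 20000        |
| JBB       | 16         | 10000        |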

Plackett & Burman Design
- Motivation
  - Narrow a huge design space
  - Minimize simulation runs (experiments)
- Preliminaries
  - Performance measure
  - Extreme parameter values
  - Number of parameters (N < 4 × n - 1)

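PB Design Example

| Run    | A   | B  | C   | D   | E  | F  | G   | Time |
|--------|-----|----|-----|-----|----|----|-----|------|
| 1      | +   | +  | +   | -   | +  | -  | -   | 9    |
| 2      | -   | +  | +   | +   | -  | +  | -   | 11   |
| 3      | -   | -  | +   | +   | +  | -  | +   | 2    |
| 4      | +   | -  | -   | +   | +  | +  | -   | 1    |
| 5      | -   | +  | -   | -   | +  | +  | +   | 9    |
| 6      | +   | -  | +   | -   | -  | +  | +   | 74   |
| 7      | +   | +  | -   | +   | -  | -  | +   | 7    |
| 8      | -   | -  | -   | -   | -  | -  | -   | 4    |
| 9      | -   | -  | -   | +   | -  | +  | +   | 17   |
| 10     | +   | -  | -   | -   | +  | -  | +   | 76   |
| 11     | +   | +  | -   | -   | -  | +  | -   | 6    |
| 12     | -   | +  | +   | -   | -  | -  | +   | 31   |
| 13     | +   | -  | +   | +   | -  | -  | -   | 19   |
| 14     | -   | +  | -   | +   | +  | -  | -   | 33   |
| 15     | -   | -  | +   | -   | +  | +  | -   | 6    |
| 16     | +   | +  | +   | +   | +  | +  | +   | 112  |
| Effect | 191 | 19 | 111 | -13 | 79 | 55 | 239 |      |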
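How the effect of each parameter falls out of a PB design can be sketched as follows. The transcript does not preserve the sign matrix, so this sketch assumes the standard 8-run PB generator row (+ + + - + - -) with foldover, and it assumes the response times listed in the code (the values on the slide, with repeated values restored). Under those assumptions the computed effects reproduce the effect values shown in the example. The function names are illustrative.

```python
def pb_matrix_with_foldover(first_row):
    """Cyclic right shifts of first_row, then an all-minus row,
    followed by the sign-flipped (foldover) copy of those rows."""
    n = len(first_row)
    rows = [first_row[-i:] + first_row[:-i] for i in range(n)]
    rows.append([-1] * n)
    return rows + [[-s for s in row] for row in rows]


def effects(matrix, responses):
    """Effect of each column: sum over runs of (sign * response)."""
    columns = len(matrix[0])
    return [sum(row[j] * y for row, y in zip(matrix, responses))
            for j in range(columns)]


# Assumed generator row for an 8-run design (7 parameters, A..G).
FIRST_ROW = [+1, +1, +1, -1, +1, -1, -1]
# Assumed response times for the 16 runs (8 runs plus foldover).
TIMES = [9, 11, 2, 1, 9, 74, 7, 4, 17, 76, 6, 31, 19, 33, 6, 112]

if __name__ == "__main__":
    result = effects(pb_matrix_with_foldover(FIRST_ROW), TIMES)
    # Rank parameters by the magnitude of their effect; the sign is ignored.
    ranked = sorted(zip("ABCDEFG", result), key=lambda p: -abs(p[1]))
    print(ranked)
```

Only the magnitude of an effect matters for ranking, so in this example G, A, and C would be the parameters to study in detail.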

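PB Design Parameter Values

| Parameter             | Low Value (-)        | High Value (+)       |
|-----------------------|----------------------|----------------------|
| Number of Cores       | 2                    | 16                   |
| Pipeline Organization | In-Order             | Out-of-Order         |
| Pipeline Width        | 1                    | 4                    |
| L1 Cache Size         | 16 kB                | 128 kB               |
| L1 Associativity      | Direct Mapped        | 32-Way               |
| L2 Cache Size         | Die Area - Core Area | Die Area - Core Area |
| L2 Associativity      |                      |                      |
| L2 Banks              |                      | 32                   |
| L2 Latency            | 50 Cycles            | 12 Cycles            |
| L2 Directory Latency  | 25 Cycles            | 6 Cycles             |
| Pin Bandwidth         | 400                  | 10000                |
| Memory Latency        | 300 Cycles           | 100 Cycles           |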

PB Results
- Extreme values stressed the simulator
- Have not yet completed an entire set of runs
- It may be necessary to build a custom L2 network for each run

PB Results for JBB

Assumptions - MVA
- The time between memory requests is exponentially distributed
- All processor cores exhibit the same average behavior with respect to service times and miss rates
- Doubling the size of the cache multiplies the miss rate by 1/√2
- An in-order core takes approximately the same area as 50 kB of cache
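The cache-size assumption amounts to a square-root rule: the miss rate scales with (base size / new size)^(1/2). A minimal sketch, with a hypothetical function name and made-up example numbers:

```python
import math


def scaled_miss_rate(base_miss_rate, base_size_kb, size_kb):
    """Miss rate after resizing a cache, assuming each doubling of
    capacity multiplies the miss rate by 1/sqrt(2)."""
    return base_miss_rate * math.sqrt(base_size_kb / size_kb)


if __name__ == "__main__":
    # Hypothetical baseline: 4% miss rate at 1 MB.
    for size_kb in (1024, 2048, 4096):
        print(size_kb, "kB ->", scaled_miss_rate(0.04, 1024, size_kb))
```

Quadrupling the cache therefore halves the miss rate, which is what lets the model trade core area (in 50 kB units) against L2 capacity.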

MVA Design
- Simple closed model [diagram not preserved in the transcript]

MVA Design
- Two phases of the model design
- First: use the exact MVA equations
  - Use the average time between memory accesses as an application parameter
  - Solve for throughput
- Second: use Approximate MVA (AMVA)
  - Use an iterative method to converge on this service time
  - Solve for throughput

Exact MVA
- To solve the MVA equations, we determine the mean residence time at each service center:
  - R_p: processor/L1 residence time
  - R_L2: L2 residence time
  - R_M: memory residence time
- The case with one core is trivial; use it to solve for additional cores:
  R_{n,p} = D_p * (1 + Q_{n-1,p})
  where D_p is the service demand at the processor and Q_{n-1,p} is the mean queue length there with n-1 cores in the system
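The recursion can be sketched as the textbook single-class exact MVA for a closed network. This is not the authors' code: it treats all three centers (processor/L1, L2, memory) as ordinary queueing centers, and the demand values in the example are made up for illustration.

```python
def exact_mva(demands, n_customers, think_time=0.0):
    """Single-class exact MVA for a closed queueing network.

    demands: service demand D_k at each center (same time unit throughout).
    Returns (throughput, per-center residence times) at n_customers.
    """
    queue = [0.0] * len(demands)       # Q_k with zero customers
    throughput = 0.0
    residence = [0.0] * len(demands)
    for n in range(1, n_customers + 1):
        # R_{n,k} = D_k * (1 + Q_{n-1,k})
        residence = [d * (1 + q) for d, q in zip(demands, queue)]
        throughput = n / (think_time + sum(residence))
        # Little's law gives the queue lengths for the next step.
        queue = [throughput * r for r in residence]
    return throughput, residence


if __name__ == "__main__":
    # Hypothetical demands in cycles: processor/L1, L2, memory.
    for cores in (1, 2, 4, 8, 16):
        x, _ = exact_mva([20.0, 4.0, 1.5], cores)
        print(cores, "cores -> throughput", round(x, 4), "requests/cycle")
```

Starting from the trivial one-customer case, each step reuses the queue lengths of the previous population, which is the "use this case to solve for additional cores" step on the slide.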

Exact MVA Results
- Throughput was calculated using data from simulation runs
  - Miss rates, number of memory requests
- Results are erratic
  - Not consistent with simulation results
- The source of the problem is most likely the processor service time

Approximate MVA Design
- An iterative method can be used to converge on a service time
  - Uses total R as an input parameter
  - The iterative method works well with approximate MVA
- Goal is to match the total average residence time of a memory request
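One way such a fit could look is sketched below. The slides do not say which approximation or which iteration was used; this sketch assumes the Schweitzer approximation (Q_k with n-1 customers taken as (n-1)/n times Q_k with n customers) and a simple bisection on the processor demand until the total residence time matches a measured target. All names and numbers are hypothetical.

```python
def schweitzer_amva(demands, n_customers, tol=1e-10, max_iter=10_000):
    """Approximate MVA (Schweitzer): fixed-point iteration on queue lengths."""
    queue = [n_customers / len(demands)] * len(demands)
    scale = (n_customers - 1) / n_customers
    for _ in range(max_iter):
        residence = [d * (1 + scale * q) for d, q in zip(demands, queue)]
        throughput = n_customers / sum(residence)
        new_queue = [throughput * r for r in residence]
        if max(abs(a - b) for a, b in zip(new_queue, queue)) < tol:
            return throughput, residence
        queue = new_queue
    raise RuntimeError("AMVA fixed point did not converge")


def fit_processor_demand(target_residence, other_demands, n_customers,
                         lo=1e-6, hi=1e6, steps=100):
    """Bisect on the processor demand so that total residence time matches
    the target. Returns None when the target lies outside the range the
    model can produce for the given demands."""
    def total(d_p):
        _, residence = schweitzer_amva([d_p] + list(other_demands), n_customers)
        return sum(residence)

    if not total(lo) <= target_residence <= total(hi):
        return None
    for _ in range(steps):
        mid = (lo + hi) / 2
        if total(mid) < target_residence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    # Hypothetical L2 and memory demands (cycles) and measured total R.
    print(fit_processor_demand(60.0, [4.0, 1.5], n_customers=4))
    print(fit_processor_demand(1.0, [4.0, 1.5], n_customers=4))  # unreachable
```

The `None` branch is the interesting case here: a measured residence time smaller than the model can ever produce means no choice of service time will make the iteration land on the target.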

Approximate MVA Results
- Convergence using the AMVA equations does not always occur
  - The total measured residence time cannot be reached with this model and parameter set
- Failing to converge even as the input values are varied implies flaws in the model structure
- There is a complex relationship between the memory system and the rate at which a core issues requests, and it must be modeled

Complementary Results
- The initial goal was to produce PB results that identify the parameters to focus on in the MVA model
- Results from both approaches could cross-verify correctness

Conclusions
- Simics has a STEEP learning curve
  - Less than 5 weeks is not enough time for valid (or any) results
- Refining a PB design leads to long lead times on valid results
- CMPs complicate the relationship between the cores and the memory subsystem
- Design methodologies that focus simulation runs are necessary
- More results and conclusions to follow

Questions?