Exploring Core Designs for Chip Multiprocessors

Allison Holloway, Matthew Allen

Outline
  Motivation
  Hypotheses
  Methodology
  Results
  Conclusions

Motivation
  What should the core of a CMP look like?
  Workloads: commercial, scientific
  OOO wide-issue superscalar?
  Tradeoffs: performance, power, area, complexity

Hypotheses
  1. Commercial workloads will not benefit much from OOO / wide-issue
  2. Scientific workloads will benefit significantly from OOO / wide-issue
  3. OOO & wide-issue will be less beneficial for larger-scale systems
  4. Augmenting an in-order processor with non-blocking caches will close the OOO gap

Methodology
  Simulator: Multifacet, Ruby, Opal (OOO)
  In-order processor model
    Looked at Simics functional model: not comparable
    Restrict Opal to in-order issue
    Register renaming not removed
  Limitations
    Can't recompile code for scheduling
    Does not model UltraSPARC issue rules

Methodology
  Workloads
    Commercial: Apache, SPECjbb, OLTP, Zeus
    Scientific: Barnes-Hut, Ocean
  Issues
    No 4-processor simulation
    No cache warmup files

Methodology
  Baseline configuration used
  ROB, instruction window, and number of functional units halved for the 2-wide processor
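The halving rule on this slide can be written down as a small configuration sketch. The transcript does not preserve the actual baseline sizes, so the numbers below are placeholders, and the `CoreConfig` type and its field names are hypothetical rather than anything taken from Opal.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CoreConfig:
    """Sizing parameters for one simulated core (field names are illustrative)."""
    issue_width: int
    rob_entries: int
    window_entries: int
    functional_units: int


def halve_for_two_wide(base: CoreConfig) -> CoreConfig:
    """Derive the 2-wide core by halving the ROB, window, and functional units."""
    if base.issue_width != 4:
        raise ValueError("expected a 4-wide baseline")
    return replace(
        base,
        issue_width=2,
        rob_entries=base.rob_entries // 2,
        window_entries=base.window_entries // 2,
        functional_units=base.functional_units // 2,
    )


# Placeholder sizes: the real baseline values are not given in the transcript.
baseline = CoreConfig(issue_width=4, rob_entries=64, window_entries=32, functional_units=8)
two_wide = halve_for_two_wide(baseline)
print(two_wide)
```

Scaling the structures together with the issue width keeps the 2-wide core from being over-provisioned relative to the 4-wide one, which is presumably why the comparison was set up this way.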

Results
  Moving from in-order to OOO provides more performance benefit than widening issue from 2 to 4
  Tolerating cache misses is the key
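A first-order stall model gives one way to see why miss tolerance can matter more than issue width. This is a textbook-style approximation, not the simulation methodology used in the slides: it assumes an ideal CPI of 1/issue_width, and it folds all miss overlap into a single `overlap` factor (1.0 for a blocking in-order core). Every input value is made up for illustration.

```python
def cpi(issue_width: int, misses_per_instr: float,
        miss_penalty: float, overlap: float = 1.0) -> float:
    """Cycles per instruction = ideal issue-limited CPI + memory stall CPI.

    overlap is the average number of misses serviced concurrently;
    a blocking in-order core exposes every miss, so overlap is 1.0.
    """
    ideal = 1.0 / issue_width
    stall = misses_per_instr * miss_penalty / overlap
    return ideal + stall


# Hypothetical workload parameters, chosen only to make the arithmetic visible.
MISSES_PER_INSTR = 0.01
MISS_PENALTY = 200.0

in_order_2 = cpi(2, MISSES_PER_INSTR, MISS_PENALTY)
in_order_4 = cpi(4, MISSES_PER_INSTR, MISS_PENALTY)
ooo_2 = cpi(2, MISSES_PER_INSTR, MISS_PENALTY, overlap=2.0)

print(f"widening 2 -> 4 (in-order): {in_order_2 / in_order_4:.2f}x")
print(f"overlapping two misses (2-wide): {in_order_2 / ooo_2:.2f}x")
```

With these inputs the stall term dominates, so doubling the issue width yields about 1.11x while overlapping two misses yields about 1.67x. The model says nothing about how much overlap a real OOO core achieves on these workloads.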

Results
  Hypothesis 1: Commercial workloads will not benefit much from OOO / wide-issue
    ~30% speedup
  Hypothesis 2: Scientific workloads will benefit significantly from OOO / wide-issue
    ~60% speedup
  Commercial workloads DO benefit from OOO, but not as much as scientific workloads.
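To put the two speedup figures in terms of run time, the conversion below assumes speedup is defined as baseline time divided by improved time. The transcript does not state the metric or which pair of configurations each figure compares, so this is only the arithmetic, not a reconstruction of the measurement.

```python
def speedup(baseline_time: float, improved_time: float) -> float:
    """Speedup as a ratio: values above 1.0 mean the improved design is faster."""
    return baseline_time / improved_time


def time_fraction(speedup_percent: float) -> float:
    """Fraction of the baseline run time that remains after a given % speedup."""
    return 1.0 / (1.0 + speedup_percent / 100.0)


# The approximate figures quoted on the slide.
for label, pct in [("commercial", 30.0), ("scientific", 60.0)]:
    print(f"{label}: ~{pct:.0f}% speedup -> {time_fraction(pct):.3f} of baseline time")
```

Under that definition, a ~30% speedup leaves roughly 77% of the baseline run time and a ~60% speedup leaves 62.5%.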

Results
  Hypothesis 3: OOO & wide-issue will be less beneficial for larger-scale systems
    True, BUT workloads don't scale above 8 processors (except Apache)

(Non) Results
  Hypothesis 4: Augmenting an in-order processor with non-blocking caches will close the OOO gap
    Simulations still running!

Future Work
  Analyze performance trade-offs: vs. power? vs. area?
  4-processor runs (if possible)
  Vary # of MSHRs

Conclusions
  Out-of-order provides substantial benefit over in-order, even for commercial workloads
  Other methods for tolerating/reducing cache misses may be effective
  Diminishing returns for larger systems, but workloads don't scale well
  Need to consider power and area constraints