15-745 Spring 2006: WaveScalar (S. Swanson et al., Computer Science and Engineering, University of Washington). Presented by Brett Meyer.

Slide 2: ILP in Modern Architecture
- Lots of available ILP in software
  - Execute in parallel for greater performance
- Superscalar processors can't tap it
  - Serialized by the PC
- Superscalar doesn't scale
- Data-flow approaches can cheaply leverage existing parallelism

Slide 3: Introduction
- WaveCache and the WaveScalar ISA
- Evaluation and results
- Does WaveCache make sense?
- Compiler challenges

Slide 4: WaveScalar: Basics
- ALU-in-cache data-flow architecture
  - No centralized, broadcast-based resources
- Compiler produces data-flow binaries
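To make the data-flow execution model concrete, here is a minimal sketch in C (a toy model, not the WaveScalar hardware or its binary format; all names are illustrative): an instruction fires as soon as all of its operand tokens have arrived, with no program counter serializing execution.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of a dataflow instruction: it fires when both input
 * operand tokens have arrived -- there is no PC ordering execution. */
typedef struct {
    const char *op;      /* e.g. "add" */
    int  operand[2];     /* operand values */
    bool present[2];     /* has each operand token arrived? */
} DfInstruction;

/* Deliver one operand token; fire the instruction once both are present. */
static void deliver(DfInstruction *inst, int slot, int value) {
    inst->operand[slot] = value;
    inst->present[slot] = true;
    if (inst->present[0] && inst->present[1]) {
        printf("%s fires: %d\n", inst->op,
               inst->operand[0] + inst->operand[1]);
        inst->present[0] = inst->present[1] = false;  /* ready for next tokens */
    }
}

int main(void) {
    DfInstruction add = { "add", {0, 0}, {false, false} };
    deliver(&add, 0, 3);   /* nothing happens yet */
    deliver(&add, 1, 4);   /* both operands present -> fires, prints 7 */
    return 0;
}
```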

Slide 5: WaveScalar: Waves
- Instructions → architecture
- Programs broken into waves
  - Block with single entry
- Use wave number to tag data
  - Disambiguates data from multiple iterations
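A hedged sketch of why wave-number tags matter: if every data token carries the wave number of the iteration that produced it, operands from different loop iterations can be in flight at the same static instruction without being confused. The C below is purely illustrative, not the actual tag format.

```c
#include <stdio.h>

/* A data token pairs a value with the wave number that produced it. */
typedef struct {
    unsigned wave;   /* which dynamic wave (e.g. loop iteration) */
    int      value;
} Token;

/* An instruction only combines tokens whose wave numbers match. */
static void try_fire(Token a, Token b) {
    if (a.wave == b.wave)
        printf("wave %u: %d + %d = %d\n",
               a.wave, a.value, b.value, a.value + b.value);
    else
        printf("wave mismatch (%u vs %u): tokens wait for their partners\n",
               a.wave, b.wave);
}

int main(void) {
    try_fire((Token){0, 1}, (Token){0, 2});  /* same iteration: fires      */
    try_fire((Token){0, 1}, (Token){1, 5});  /* different iterations: wait */
    return 0;
}
```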

Slide 6: WaveScalar: Memory
- Relaxed program order
  - Follow control-flow
  - Obey dependencies
- Distributed store buffers
- Hardware coherence
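The relaxed-but-correct ordering can be pictured as a store buffer that accepts memory operations in whatever order they arrive but releases them to memory in program order. The sketch below assumes each operation carries a single dense sequence number within its wave; WaveScalar's actual wave-ordered memory annotations are richer than this, so treat it only as an intuition aid.

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified store-buffer model: ops arrive out of order, issue in order. */
#define MAX_OPS 8

typedef struct {
    int  seq;        /* position of this op within its wave (assumed dense) */
    bool is_store;
    int  addr, data;
    bool arrived;
} MemOp;

static MemOp buffer[MAX_OPS];
static int next_to_issue = 0;

static void arrive(MemOp op) {
    buffer[op.seq] = op;
    buffer[op.seq].arrived = true;
    /* Release every op whose predecessors have already been issued. */
    while (next_to_issue < MAX_OPS && buffer[next_to_issue].arrived) {
        MemOp *m = &buffer[next_to_issue];
        printf("issue #%d: %s addr=%d\n",
               m->seq, m->is_store ? "store" : "load", m->addr);
        next_to_issue++;
    }
}

int main(void) {
    arrive((MemOp){ .seq = 1, .is_store = true,  .addr = 100, .data = 7 }); /* waits */
    arrive((MemOp){ .seq = 0, .is_store = false, .addr = 100 });            /* issues 0, then 1 */
    return 0;
}
```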

Slide 7: Evaluation
- WaveCache
  - 4 MB of on-chip instructions + data, 2K ALUs
- WaveCache vs. superscalar
  - 16-wide OOO, 1K registers, 1K window
- WaveCache vs. TRIPS
  - 4 16-wide in-order cores, 2 MB on-chip cache
- Key assumption: perfect memory
- Fair comparisons? Is it reasonable to assume perfect memory?

Slide 8: Results
- WaveCache outperforms the superscalar
- Similar performance to TRIPS

Slide 9: Memory is the problem, not ILP
- Data-flow exposes greater ILP
- Memory not fast enough for low-ILP CPUs
  - Processor-memory performance gap
- What does perfect memory hide?
  - Does superscalar perform better?
- Did not model hardware coherence
- WaveCache needs MORE bandwidth than a superscalar

Slide 10: Is WaveScalar Scalable?
- Sub-linear performance improvement
  - More clusters further away from memory
- SPEC, MediaBench fit easily in memory
- What happens to performance when the working set doesn't fit in WaveCache?

Slide 11: Compiler Challenges
- Wave identification
  - Can waves be optimized for performance?
- Handling path explosion
  - 1 BR / 5 inst → 1050 loaded for 100 executed?

Slide 12: Compiler Challenges
- Semi-static instruction placement
  - Fetch partial/complete waves
  - Loads/stores close to memory
  - Clustering neighboring instructions
  - Reduce coherence traffic