The University of Texas at Austin

Computer Architecture: Principles & Tradeoffs
(A serious course in fundamental concepts for understanding computer architecture)
Yale Patt, The University of Texas at Austin
EE460N/EE382N.1, August 2014

Introduction, Focus, Overview

Outline
  A science of tradeoffs
  The transformation hierarchy
  The algorithm, the compiler, the microarchitecture
  The microarchitecture view
  The physical view
  Speculation
  Intro to Nonsense: Is hardware parallel or sequential?
  Design points
  Design Principles
  Role of the Architect
  Numbers
  Thinking outside the box

Microarchitecture view
  Trade-offs, the overriding consideration: What is the cost? What is the benefit?
  Global view
    Global vs. Local transformations
  Microarchitecture view
    The three ingredients to performance
  Physical view
    Wire delay (recently relevant) – Why? (frequency)
    Bandwidth (recently relevant) – Why? (multiple cores)
    Power, energy (recently relevant) – Why? (cores, freq)
    Soft errors (recently relevant) – Why? (freq)
    Partitioning (since the beginning of time)

The transformation hierarchy
  Problem
  Algorithm
  Program
  ISA (Instruction Set Architecture)
  Microarchitecture
  Circuits
  Electrons

The Triangle (originally from George Michael)
  Only the programmer knows the ALGORITHM
    Pragmas
    Pointer chasing
    Partition code, data
  Only the COMPILER knows the future (sort of ??)
    Predication
    Prefetch/Poststore
    Block-structured ISA
  Only the HARDWARE knows the past
    Branch directions
    Cache misses
    Functional unit latency

A few more words on Data Supply
  Memory is particularly troubling
    Off-chip latency (hundreds of cycles, and getting worse)
  What can we do about it?
    Larger caches
    Better replacement policies
  Is MLP (memory-level parallelism) the answer?
    Wait for two accesses at the same time
    Do useful work in parallel while waiting (Runahead)
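A rough way to see why waiting for two accesses at the same time helps: the sketch below is a back-of-the-envelope model, not anything from the slides. The function name `stall_cycles`, the 300-cycle latency, and the assumption that misses are fully independent and overlap perfectly in groups are all illustrative choices.

```python
def stall_cycles(num_misses: int, miss_latency: int, overlap: int) -> int:
    """Cycles spent waiting on off-chip misses in a toy model.

    Assumption (hypothetical model): up to `overlap` independent misses
    are outstanding at once, and each such group costs one full latency.
    """
    if overlap < 1:
        raise ValueError("overlap must be at least 1")
    groups = -(-num_misses // overlap)  # ceiling division
    return groups * miss_latency


# Eight misses at an assumed 300-cycle off-chip latency.
serialized = stall_cycles(num_misses=8, miss_latency=300, overlap=1)
overlapped = stall_cycles(num_misses=8, miss_latency=300, overlap=2)
print(serialized, overlapped)
```

In this model, overlapping two misses at a time halves the waiting time. Real programs fall short of that whenever one miss depends on the result of another, as in pointer chasing.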

Microarchitecture view
  Trade-offs, the overriding consideration: What is the cost? What is the benefit?
  Global view
    Global vs. Local transformations
  Microarchitecture view
    The three ingredients to performance
  Physical view (more important in the multicore era)
    Wire delay (recently relevant) – Why? (frequency)
    Bandwidth (recently relevant) – Why? (multiple cores)
    Power, energy (recently relevant) – Why? (cores, freq)
    Soft errors (recently relevant) – Why? (freq)
    Partitioning (since the beginning of time)

Speculation
  Why good? – improves performance
  How? – we guess
    Starting with the design of ALUs, many years ago!
    Branch prediction – enables parallelism
    Way prediction
    Data prefetching – enables parallelism
    Value prediction – enables parallelism
    Address prediction – enables parallelism
    Memory disambiguation – enables parallelism
  Why bad? – consumes energy
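To make "we guess" concrete for branch prediction, here is a minimal sketch of a two-bit saturating-counter predictor, one textbook scheme among many. The slides do not name a specific predictor. The class name `TwoBitPredictor`, the table size, the initial counter value, and the loop-branch trace are all assumptions for illustration.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by branch address.

    Counter values 0-1 predict not-taken, 2-3 predict taken.
    """

    def __init__(self, table_size: int = 16):
        self.table_size = table_size
        self.counters = [1] * table_size  # start weakly not-taken (assumed)

    def _index(self, pc: int) -> int:
        return pc % self.table_size

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor: TwoBitPredictor, pc: int, outcomes: list) -> float:
    """Fraction of outcomes guessed correctly, training after each branch."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(outcomes)


# Hypothetical trace: a loop branch taken 9 times, then not taken, 10 times over.
loop_branch = ([True] * 9 + [False]) * 10
print(accuracy(TwoBitPredictor(), 0x40, loop_branch))
```

Every correct guess lets the machine keep fetching past the branch. Every wrong guess is work thrown away, which is the energy cost the slide points to.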

Hardware – Sequential or Parallel?
  Hardware is inherently parallel
    It has been since time began
  Then why the sudden interest?
    Useful if we pay attention to it (e.g., factorial)
  The key idea is Synchronization
    It can be explicit
    It can be implicit
  Pipelining
    Parallelism at its most basic level
    Everyone in the world understands that (e.g., factories)
  Speculation
  Single thread vs. multiple threads
  Single core vs. multiple cores
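The factory analogy for pipelining can be put into numbers. The sketch below assumes an idealized pipeline in which every stage takes one cycle and nothing ever stalls. The function names and the 5-stage, 100-instruction figures are illustrative, not from the slides.

```python
def unpipelined_cycles(num_instructions: int, stages: int) -> int:
    """Each instruction passes through all stages before the next one starts."""
    return num_instructions * stages


def pipelined_cycles(num_instructions: int, stages: int) -> int:
    """Ideal pipeline: fill it once, then finish one instruction per cycle."""
    if num_instructions == 0:
        return 0
    return stages + (num_instructions - 1)


n, k = 100, 5  # assumed workload and pipeline depth
print(unpipelined_cycles(n, k), pipelined_cycles(n, k))
```

As the instruction count grows, the ratio between the two approaches the number of stages, just as an assembly line with k stations approaches k times the output of a single worker doing every step.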

Design Principles
  Critical path design
  Bread and Butter design
  Balanced design

Numbers (because comparch is obsessed with numbers)
  The Baseline – Make sure it is the best
    Superlinear speedup
    Recent example, one core vs. 4 cores with ability to fork
  The Simulator you use – Is it bug-free?
  Understanding vs “See, it works!”
    16/64
  You get to choose your experiments
    SMT, throughput: run the idle process
    Combining cores: what should each core look like
  You get to choose the data you report
    Wrong path detection: WHEN was the wrong path detected
  Never gloss over anomalous data
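One way to read the warning about the baseline and superlinear speedup: if the single-core baseline is not the best it could be, a 4-core result can appear to exceed a 4x speedup. The timings below are invented purely for illustration and do not come from the slides. `speedup` is a hypothetical helper.

```python
def speedup(baseline_time: float, new_time: float) -> float:
    """Speedup of the new configuration relative to the baseline."""
    return baseline_time / new_time


# Made-up execution times in seconds (assumptions, not measured data).
weak_baseline = 100.0   # one core, poorly tuned
tuned_baseline = 60.0   # one core, best known configuration
four_cores = 20.0       # four cores

print(speedup(weak_baseline, four_cores))   # looks superlinear on 4 cores
print(speedup(tuned_baseline, four_cores))  # same result against the best baseline
```

The four-core result is identical in both lines. Only the choice of baseline changes the reported number.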

Finally, people are always telling you: Think outside the box

I prefer: Expand the box

Something you are all familiar with: Look-ahead Carry Generators
  They speed up ADDITION
  But why do they work?

Addition

    12        182378
  +  9      + 645259
  ----      --------
    21        827637
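As a sketch of why look-ahead carry generators work, the code below uses the standard generate/propagate formulation: each carry is written directly in terms of the input bits and the carry-in, so no carry has to wait for the one before it. The function name `carry_lookahead_add` and its parameters are illustrative. The inner loop runs sequentially in software only; in hardware each carry expression would be evaluated in parallel.

```python
def carry_lookahead_add(a: int, b: int, width: int = 4, carry_in: int = 0):
    """Add two `width`-bit numbers using generate/propagate carry equations.

    Returns (sum modulo 2**width, carry_out).
    """
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    g = [x & y for x, y in zip(a_bits, b_bits)]  # bit position generates a carry
    p = [x ^ y for x, y in zip(a_bits, b_bits)]  # bit position propagates a carry

    carries = [carry_in]
    for i in range(width):
        # c[i+1] = g[i] | p[i]g[i-1] | p[i]p[i-1]g[i-2] | ... | p[i]...p[0]c[0]
        # Depends only on g, p and carry_in, never on an earlier computed carry.
        c = g[i]
        prefix = p[i]
        for j in range(i - 1, -1, -1):
            c |= prefix & g[j]
            prefix &= p[j]
        c |= prefix & carry_in
        carries.append(c)

    sum_bits = [p[i] ^ carries[i] for i in range(width)]
    total = sum(bit << i for i, bit in enumerate(sum_bits))
    return total, carries[width]


print(carry_lookahead_add(12, 9, width=8))
print(carry_lookahead_add(182378, 645259, width=20))
```

The two calls reuse the operands from the addition examples above. The widths of 8 and 20 bits are chosen only so that the results fit without a carry out.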