Introduction, Focus, Overview

Slides:

Advertisements

Similar presentations

Avenues for Research The Microarchitecture of Future Microprocessors.

Advertisements

Microprocessor Performance, Phase II Yale Patt The University of Texas at Austin STAMATIS Symposium TU-Delft September 28, 2007.

Microarchitecture is Dead AND We must have a multicore interface that does not require programmers to understand what is going on underneath.

Yale Patt The University of Texas at Austin World University Presidents’ Symposium University of Belgrade April 4, 2009 Future Microprocessors: What must.

Computer Architecture Computer Architecture Processing of control transfer instructions, part I Ola Flygt Växjö University

8 Processing of control transfer instructions TECH Computer Science 8.1 Introduction 8.2 Basic approaches to branch handling 8.3 Delayed branching 8.4.

Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.

1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Dec 5, 2005 Topic: Intro to Multiprocessors and Thread-Level Parallelism.

Instruction Level Parallelism (ILP) Colin Stevens.

Chapter Hardwired vs Microprogrammed Control Multithreading

Mechanisms: Run-time and Compile-time. Outline Agents of Evolution Run-time Branch prediction Trace Cache SMT, SSMT The memory problem (L2 misses) Compile-time.

Korea Univ B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors 컴퓨터 · 전파통신공학과 최병준 1 Computer Engineering and Systems Group.

My major mantra: We Must Break the Layers. Algorithm Program ISA (Instruction Set Arch)‏ Microarchitecture Circuits Problem Electrons.

Yale Patt The University of Texas at Austin University of California, Irvine March 2, 2012 High Performance in the Multi-core Era: The role of the Transformation.

SJSU SPRING 2011 PARALLEL COMPUTING Parallel Computing CS 147: Computer Architecture Instructor: Professor Sin-Min Lee Spring 2011 By: Alice Cotti.

Parallelism: A Serious Goal or a Silly Mantra (some half-thought-out ideas)

SYNAR Systems Networking and Architecture Group CMPT 886: Computer Architecture Primer Dr. Alexandra Fedorova School of Computing Science SFU.

Dean Tullsen UCSD.  The parallelism crisis has the feel of a relatively new problem ◦ Results from a huge technology shift ◦ Has suddenly become pervasive.

© Wen-mei Hwu and S. J. Patel, 2005 ECE 511, University of Illinois Lecture 4: Microarchitecture: Overview and General Trends.

Hybrid Multi-Core Architecture for Boosting Single-Threaded Performance Presented by: Peyman Nov 2007.

Advanced Topics: Prefetching ECE 454 Computer Systems Programming Topics: UG Machine Architecture Memory Hierarchy of Multi-Core Architecture Software.

On the Importance of Optimizing the Configuration of Stream Prefetches Ilya Ganusov Martin Burtscher Computer Systems Laboratory Cornell University.

SYNAR Systems Networking and Architecture Group CMPT 886: Computer Architecture Primer Dr. Alexandra Fedorova School of Computing Science SFU.

Niagara: A 32-Way Multithreaded Sparc Processor Kongetira, Aingaran, Olukotun Presentation by: Mohamed Abuobaida Mohamed For COE502 : Parallel Processing.

Speculative execution Landon Cox April 13, Making disk accesses tolerable Basic idea Remove disk accesses from critical path Transform disk latencies.

What’s going on here? Can you think of a generic way to describe both of these?

Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.

CS 352H: Computer Systems Architecture

COMP 740: Computer Architecture and Implementation

Computer Architecture Chapter (14): Processor Structure and Function

15-740/ Computer Architecture Lecture 4: Pipelining

15-740/ Computer Architecture Lecture 3: Performance

ECE 4100/6100 Advanced Computer Architecture Lecture 1 Performance

Computer Architecture Principles Dr. Mike Frank

Design-Space Exploration

Multiscalar Processors

Simultaneous Multithreading

CS161 – Design and Architecture of Computer Systems

The University of Adelaide, School of Computer Science

15-740/ Computer Architecture Lecture 7: Pipelining

Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy

Morgan Kaufmann Publishers

Concurrent and Distributed Programming

Architecture & Organization 1

5.2 Eleven Advanced Optimizations of Cache Performance

Chapter 14 Instruction Level Parallelism and Superscalar Processors

Cache Memory Presentation I

Hyperthreading Technology

EE 193: Parallel Computing

Spare Register Aware Prefetching for Graph Algorithms on GPUs

Instruction Scheduling for Instruction-Level Parallelism

The University of Texas at Austin

Evolution in Memory Management Techniques

Architecture & Organization 1

Levels of Parallelism within a Single Processor

Computer Architecture Lecture 4 17th May, 2006

CMPT 886: Computer Architecture Primer

Address-Value Delta (AVD) Prediction

Adaptive Single-Chip Multiprocessing

José A. Joao* Onur Mutlu‡ Yale N. Patt*

Computer Architecture: A Science of Tradeoffs

Speculative execution and storage

Latency Tolerance: what to do when it just won’t go away

15-740/ Computer Architecture Lecture 14: Prefetching

Mattan Erez The University of Texas at Austin

Levels of Parallelism within a Single Processor

CSC3050 – Computer Architecture

Introduction, Focus, Overview

CAPS project-team Compilation et Architectures pour Processeurs Superscalaires et Spécialisés.

IA-64 Vincent D. Capaccio.

Presentation transcript:

Introduction, Focus, Overview

Outline A science of tradeoffs The transformation hierarchy The algorithm, the compiler, the microarchitecture The microarchitecture view The physical view Speculation Intro to Nonsense: Is hardware parallel or sequential Design points Design Principles Role of the Architect Numbers Thinking outside the box

Microarchitecture view Trade-offs, the overriding consideration: What is the cost? What is the benefit? Global view Global vs. Local transformations Microarchitecture view The three ingredients to performance Physical view Wire delay (recently relevant) – Why? (frequency) Bandwidth (recently relevant) – Why? (multiple cores) Power, energy (recently relevant) – Why? (cores, freq) Soft errors (recently relevant) – Why? (freq) Partitioning (since the beginning of time)

Microarchitecture view Trade-offs, the overriding consideration: What is the cost? What is the benefit? Global view Global vs. Local transformations Microarchitecture view The three ingredients to performance Physical view Wire delay (recently relevant) – Why? (frequency) Bandwidth (recently relevant) – Why? (multiple cores) Power, energy (recently relevant) – Why? (cores, freq) Soft errors (recently relevant) – Why? (freq) Partitioning (since the beginning of time)

ISA (Instruction Set Arch) Problem Algorithm Program ISA (Instruction Set Arch) Microarchitecture Circuits Electrons

The Triangle (originally from George Michael) Only the programmer knows the ALGORITHM Pragmas Pointer chasing Partition code, data Only the COMPILER knows the future (sort of ??) Predication Prefetch/Poststore Block-structured ISA Only the HARDWARE knows the past Branch directions Cache misses Functional unit latency

Microarchitecture view Trade-offs, the overriding consideration: What is the cost? What is the benefit? Global view Global vs. Local transformations Microarchitecture view The three ingredients to performance Physical view Wire delay (recently relevant) – Why? (frequency) Bandwidth (recently relevant) – Why? (multiple cores) Power, energy (recently relevant) – Why? (cores, freq) Soft errors (recently relevant) – Why? (freq) Partitioning (since the beginning of time)

A few more words on Data Supply Memory is particularly troubling Off-chip latency (hundreds of cycles, and getting worse) What can we do about it? Larger caches Better replacement policies Is MLP (Memory level parallelism) the answer Wait for two accesses at the same time Do parallel useful work while waiting (Runahead)

Microarchitecture view Trade-offs, the overriding consideration: What is the cost? What is the benefit? Global view Global vs. Local transformations Microarchitecture view The three ingredients to performance Physical view (more important in the multicore era) Wire delay (recently relevant) – Why? (frequency) Bandwidth (recently relevant) – Why? (multiple cores) Power, energy (recently relevant) – Why? (cores, freq) Soft errors (recently relevant) – Why? (freq) Partitioning (since the beginning of time)

Speculation Why good? – improves performance How? – we guess Starting with the design of ALUs, many years ago! Branch prediction – enables parallelism Way prediction Data prefetching – enables parallelism Value prediction – enables parallelism Address prediction – enables parallelism Memory disambiguation – enables parallelism Why bad? – consumes energyl

Hardware – Sequential or Parallel? Hardware is inherently parallel It has been since time began Then why the sudden interest Useful if we pay attention to it (e.g., factorial) The key idea is Synchronization It can be explicit It can be implicit Pipelining Parallelism at its most basic level Everyone in the world understands that (e.g., factories) Speculation Single thread vs. multiple threads Single core vs. multiple cores

Design Principles Critical path design Bread and Butter design Balanced design

Numbers (because comparch is obsessed with numbers) The Baseline – Make sure it is the best Superlinear speedup Recent example, one core vs. 4 cores with ability to fork The Simulator you use – Is it bug-free? Understanding vs “See, it works!” 16/64 You get to choose your experiments SMT, throughput: run the idle process Combining cores: what should each core look like You get to choose the data you report Wrong path detection: WHEN was the wrong path detected Never gloss over anomalous data

Finally, people are always telling you: Think outside the box

I prefer: Expand the box

Something you are all familiar with: Look-ahead Carry Generators They speed up ADDITION But why do they work?

Addition 12 9 21 182378 645259 827637