ECE 4100/6100 Advanced Computer Architecture
Sudhakar Yalamanchili
sudha@ece.gatech.edu
ECE 4100/6100, Fall 2006
© Bhagirath Narahari, Krishna V. Palem, Sudhakar Yalamanchili, and Weng Fai Wong
Semiconductor Technology Trends
Source: Intel, http://www.intel.com/research/silicon/mooreslaw.htm
Course Rationale and Coverage
- Instruction-level parallelism: application, compiler, processor core (Chapters 3 & 4, Appendix A)
- Storage hierarchy: primary cache (I-cache), secondary cache, main memory, disks (Chapters 5 & 7)
- Networking and parallelism: interconnection network, network interface, multiple disks (Chapters 6 & 8)
Course Outline (Landscape: Execution Core, Memory, Un-Core)

Module  Topic
1       Overview and ILP Taxonomy
2       Overview: Optimizing Compilers
3       The Core: Hardware Scheduling
4       Branch Prediction
5       Speculative Execution
6       Exceptions & Performance Evaluation
7       Instruction Scheduling
8       Memory Hierarchy: Register Allocation
9       Caches
10      Memory
11      Coherency and Consistency
12      Storage Systems
13      Interconnection Networks
14      Multicore and Other Case Studies
Course Administration
Text: Computer Architecture: A Quantitative Approach, by Hennessy and Patterson (third edition, 2003)
Evaluation:
- Two midterm exams: 15% each (September 22nd and November 3rd)
- Final exam: 30% (12/14/2006, 3rd period)
- Assignments (individual): 40%
- Satisfactory completion of all laboratory projects is mandatory!
Class URL: http://users.ece.gatech.edu/~sudha/academic/class/ece4100-6100/Fall2006/
Lecture notes, readings, and course material are posted there.
Taxonomy of Computer Architecture: Terminology and Background
The slides for Lecture 1 were contributed by Prof. B. Narahari, George Washington University.
Instruction Level Parallel Processors
- Early ILP exploited one of two orthogonal concepts: pipelining, or multiple (non-pipelined) units
- Progression to multiple pipelined units
- Instruction issue became the bottleneck, leading to superscalar ILP processors and Very Long Instruction Word (VLIW) designs
ILP Processors
Whereas pipelined processors work like an assembly line, VLIW and superscalar processors operate basically in parallel, making use of a number of concurrently working execution units (EUs).
Source: "The Microarchitecture of the Intel® Pentium® 4 Processor on 90nm Technology," Vol. 8, Issue 1 (February 2004)
ILP Processors
Parallelism takes two forms: pipelining (vertical parallelism) and superscalar or EPIC (VLIW) issue (horizontal parallelism).
Outline
- Introduction to instruction-level parallelism (ILP)
- History of ILP
- ILP Architectures
- Compiler Techniques for ILP (topic of future lectures)
- Hardware Support of ILP (topic of future lectures)
Introduction to ILP
What is ILP?
- Processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel
- ILP is transparent to the user: multiple operations execute in parallel even though the system is handed a single program written with a sequential processor in mind
- Same execution hardware as a normal RISC machine, though there may be more than one unit of any given type
Example Execution
(Figure: the same code fragment under sequential execution and under ILP execution.)
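The contrast in the figure can be sketched in code. This is a minimal, hypothetical example (the operation names and dependence graph are invented, not from the slides): four operations run one per cycle sequentially, while an ILP machine issues independent operations in the same cycle.

```python
# Hypothetical sketch: the same four operations executed sequentially
# (one per cycle) vs. under ILP (independent operations share a cycle).
# Each operation lists the operations whose results it needs.
ops = {
    "load_a": [],                 # independent
    "load_b": [],                 # independent
    "add":    ["load_a", "load_b"],
    "store":  ["add"],
}

# Sequential execution: one operation per cycle.
sequential_cycles = len(ops)

# ILP execution (ASAP schedule): an operation issues one cycle after its
# latest-finishing predecessor; independent operations issue together.
cycle = {}
for name, deps in ops.items():    # ops listed in dependence order
    cycle[name] = 1 + max((cycle[d] for d in deps), default=0)
ilp_cycles = max(cycle.values())

print(sequential_cycles)  # 4
print(ilp_cycles)         # 3: load_a and load_b issue in cycle 1
```

The ILP schedule shortens execution to the length of the critical path (load, add, store) rather than the total operation count.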
Early History of ILP
1940s and 1950s
- Parallelism first exploited in the form of horizontal microcode
- Wilkes and Stringer, 1953: "In some cases it may be possible for two or more micro-operations to take place at the same time"
1960s - Transistorized computers
- More gates available than necessary for a general-purpose CPU
- ILP provided at the machine-language level
Early History of ILP
1963 - Control Data Corporation delivered the CDC 6600
- 10 functional units
- Any unit could begin execution in a given cycle even if other units were still processing data-independent earlier operations
1967 - IBM delivered the 360/91
- Fewer functional units than the CDC 6600
- Far more aggressive in its attempts to rearrange the instruction stream to keep functional units busy
Reference
B. Ramakrishna Rau and Joseph A. Fisher, "Instruction-Level Parallel Processing: History, Overview and Perspective," October 1992.
http://www.hpl.hp.com/techreports/92/HPL-92-132.pdf
Additional Reference
Joseph A. Fisher and B. Ramakrishna Rau, "Instruction-Level Parallel Processing," January 1992.
Recent History of ILP
1970s - Specialized signal-processing computers
- Horizontally microcoded FFTs and other algorithms
1980s - Speed gap between writeable and read-only memory narrows
- Advantages of a read-only control store began to disappear
- General-purpose microprocessors moved toward the RISC concept
- Specialized processors provided writeable control memory, giving users access to ILP; these were called Very Long Instruction Word (VLIW) machines
Recent History of ILP
1990s - More silicon than necessary for implementation of a RISC microprocessor
- Virtually all designs take advantage of the available real estate by providing some form of ILP, primarily superscalar capability; some have used VLIWs as well
- Emergence of ILP saturation, leading to thread-level parallelism
- The memory wall!
2000s - ?
- Hitting the frequency wall
- The advent of multi-core designs for thread-level parallelism
Questions Facing ILP System Designers
- What gives rise to instruction-level parallelism in conventional, sequential programs, and how much of it is there?
- How is the potential parallelism identified and enhanced?
- What must be done in order to exploit the parallelism that has been identified?
- How should the work of identifying, enhancing, and exploiting the parallelism be divided between the hardware and the compiler?
- What are the alternatives in selecting the architecture of an ILP processor?
ILP Architectures
Between the compiler and the run-time hardware, the following functions must be performed:
- Dependencies between operations must be determined
- Operations that are independent of any operation that has not yet completed must be determined
- Independent operations must be scheduled to execute at some particular time, on some specific functional unit, and must be assigned a register into which the result may be deposited

Example (program order, with a data dependency through R4):
LD   R4, 32(R3)
...
DADD R5, R4, R2
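The first of these functions, determining dependencies, can be sketched for the LD/DADD pair above. This is a toy check, not any real processor's logic: instructions are modeled as (destination, sources) register tuples, and a read-after-write (RAW) dependence exists when a later instruction reads a register an earlier one writes.

```python
# Toy sketch: detecting the read-after-write (RAW) dependence between
# the LD and DADD instructions shown above.
def raw_dependent(earlier, later):
    """True if `later` reads a register that `earlier` writes."""
    dest, _ = earlier
    _, srcs = later
    return dest in srcs

ld   = ("R4", ["R3"])         # LD   R4, 32(R3)
dadd = ("R5", ["R4", "R2"])   # DADD R5, R4, R2

print(raw_dependent(ld, dadd))  # True: DADD must wait for the load
```

A real machine must also handle anti- (WAR) and output (WAW) dependences, which register renaming can remove; only true RAW dependences constrain the schedule fundamentally.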
ILP Architecture Classifications
- Sequential architectures: the program is not expected to convey any explicit information regarding parallelism
- Dependence architectures: the program explicitly indicates dependencies between operations
- Independence architectures: the program provides information as to which operations are independent of one another
The Processor Datapath
Performance goal: exploit instruction-level parallelism to execute multiple instructions per cycle, using architectural and compiler techniques to systematically increase the number of instructions executed per clock cycle.
- 1980s: multi-cycle datapaths (CPI > 1), then pipelining (CPI = 1.0)
- 1990s: instruction-level parallelism via superscalar, VLIW, and EPIC designs (CPI < 1)
- 2000s and beyond: multi-core, with aggregate IPC in the hundreds to thousands; power becomes the limit!
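The CPI figures in the timeline above follow directly from the definitions CPI = cycles / instructions and IPC = 1 / CPI. A quick worked example with invented cycle counts (the numbers are illustrative, not measurements):

```python
# Illustrative numbers only: relating cycles, CPI, and IPC for the
# three datapath eras listed above.
instructions = 1_000_000

for label, cycles in [("multi-cycle", 4_000_000),   # CPI > 1
                      ("pipelined",   1_000_000),   # CPI = 1.0
                      ("superscalar",   500_000)]:  # CPI < 1
    cpi = cycles / instructions
    ipc = instructions / cycles
    print(f"{label}: CPI = {cpi:.2f}, IPC = {ipc:.2f}")
```

A 2-wide superscalar that sustains its issue width halves CPI relative to a simple pipeline, which is exactly the CPI < 1 regime the slide describes.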
Sequential Architecture
- Program contains no explicit information regarding dependencies that exist between instructions
- Dependencies between instructions must be determined by the hardware
- It is only necessary to determine dependencies with sequentially preceding instructions that have been issued but not yet completed
- Compiler may re-order instructions to facilitate the hardware's task of extracting parallelism
Sequential Architecture Example
The superscalar processor is a representative ILP implementation of a sequential architecture. For every instruction it issues, the hardware must check whether the operands interfere with the operands of any other instruction that:
- is already in execution
- has been issued but is waiting for the completion of interfering instructions that would have been executed earlier in a sequential program
- is being issued concurrently but would have been executed earlier in the sequential execution of the program
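The interference check above can be sketched as follows. This is a simplified model, not any real issue logic: a candidate instruction may issue only if its registers do not conflict (RAW, WAR, or WAW) with any in-flight instruction, i.e. one issued but not yet completed.

```python
# Simplified sketch of the superscalar interference check: a new
# instruction may issue only if its registers do not conflict with any
# in-flight (issued but incomplete) instruction.
# Instructions are (dest, sources) register tuples.
def can_issue(instr, in_flight):
    dest, srcs = instr
    for fdest, fsrcs in in_flight:
        if fdest in srcs:    # RAW: reads a result not yet produced
            return False
        if dest in fsrcs:    # WAR: would overwrite a value still needed
            return False
        if dest == fdest:    # WAW: two outstanding writes to one register
            return False
    return True

in_flight = [("R4", ["R3"])]   # LD R4, 32(R3) still executing
print(can_issue(("R5", ["R4", "R2"]), in_flight))  # False (RAW on R4)
print(can_issue(("R7", ["R6", "R2"]), in_flight))  # True
```

The cost of this check grows with both the issue width and the number of in-flight instructions, which is why the slide calls it an expensive bottleneck.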
Sequential Architecture Example
- Superscalar processors attempt to issue multiple instructions per cycle
- However, essential dependencies are specified by sequential ordering, so operations must be processed in sequential order
- This proves to be a performance bottleneck that is very expensive to overcome
- The alternative to issuing multiple instructions per cycle is deeper pipelining, issuing instructions faster
Dependence Architecture
- The compiler or programmer communicates to the hardware the dependencies between instructions
- This removes the need to scan the program in sequential order (the bottleneck for superscalar processors)
- Hardware determines at run-time when to schedule each instruction
Dependence Architecture Example
- Dataflow processors are representative of dependence architectures
- Execute each instruction at the earliest possible time, subject to availability of input operands and functional units
- Dependencies are communicated by providing with each instruction an encoding of all successor instructions
- As soon as all input operands of an instruction are available, the hardware fetches the instruction
- The instruction is executed as soon as a functional unit is available
- Few pure dataflow processors currently exist, but dataflow principles are employed in several embedded signal-processor architectures
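The dataflow firing rule above can be sketched as a graph traversal. This is a toy model with an invented four-node graph: each instruction fires as soon as all its inputs have arrived, and its result is forwarded to the successor instructions encoded with it.

```python
# Toy sketch of the dataflow firing rule: an instruction "fires" once
# all its input operands have arrived; its result is then forwarded to
# the successors encoded with the instruction.
from collections import deque

# node -> [number of inputs still missing, successor nodes]
graph = {
    "load_a": [0, ["add"]],
    "load_b": [0, ["add"]],
    "add":    [2, ["store"]],
    "store":  [1, []],
}

ready = deque(n for n, (missing, _) in graph.items() if missing == 0)
fired = []
while ready:
    node = ready.popleft()
    fired.append(node)              # execute: all operands present
    for succ in graph[node][1]:     # forward result to successors
        graph[succ][0] -= 1
        if graph[succ][0] == 0:     # last operand arrived: now ready
            ready.append(succ)

print(fired)  # ['load_a', 'load_b', 'add', 'store']
```

Note that no program counter appears anywhere: execution order emerges entirely from operand availability, which is the defining property of a dataflow machine.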
Independence Architecture
- By knowing which operations are independent, the hardware needs no further checking to determine which instructions can be issued in the same cycle
- The set of independent operations is far greater than the set of dependent operations, so only a subset of the independent operations is specified
- The compiler may additionally specify on which functional unit and in which cycle an operation is executed
- The hardware then needs to make no run-time decisions
Independence Architecture Example
- VLIW processors are examples of independence architectures
- The compiler specifies exactly which functional unit each operation executes on and when each operation is issued
- Operations are independent of other operations issued at the same time, as well as those that are in execution
- The compiler emulates at compile time what a dataflow processor does at run-time
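The compile-time binding above can be sketched as slot packing. This toy packer assumes a hypothetical 3-wide machine (two ALUs and one memory unit, names invented) and an operation list already in dependence order; it only resolves structural slot conflicts, filling unused slots with explicit no-ops, as a VLIW compiler's final emission step would.

```python
# Toy sketch of VLIW compile-time scheduling: pack (unit, op) pairs into
# long instruction words with one slot per functional unit. Assumes a
# hypothetical 3-wide machine and a dependence-ordered op list; unused
# slots become explicit no-ops.
SLOTS = ["ALU0", "ALU1", "MEM"]

def pack_words(ops):
    words, current = [], {}
    for unit, op in ops:
        if unit in current:        # slot already taken: start a new word
            words.append(current)
            current = {}
        current[unit] = op
    if current:
        words.append(current)
    return [[w.get(s, "nop") for s in SLOTS] for w in words]

ops = [("ALU0", "add r1,r2,r3"),
       ("MEM",  "ld r4,0(r5)"),
       ("ALU0", "sub r6,r1,r4")]   # depends on both earlier ops
for word in pack_words(ops):
    print(word)
```

The first word issues the add and the load together; the sub, which needs both results, lands in the next word. All of this is decided before the program runs, so the hardware simply executes each word's slots in lockstep.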
ILP Architecture Comparison
(Figure: table comparing the sequential, dependence, and independence architecture classes.)
Summary
Where the compiler/hardware boundary falls for each architecture class, across the stages Front-End & Optimizer, Determine Dependences, Determine Independences, Bind Resources, and Execute:
- Sequential (superscalar): the compiler handles the front end and optimization; the hardware determines dependences, determines independences, binds resources, and executes
- Dependence architecture (dataflow): the compiler also determines dependences; the hardware determines independences, binds resources, and executes
- Independence architecture: the compiler also determines independences; the hardware binds resources and executes
- Independence architecture (VLIW): the compiler also binds resources; the hardware only executes