Lecture 2: Basic Notions and Fundamentals


Lecture 2: Basic Notions and Fundamentals © Wen-mei Hwu and S. J. Patel, 2005 ECE 511, University of Illinois

Outline
- Basic computer organization
  - von Neumann model and execution cycle
  - Pipelines and caches
- Architectural drivers
  - Technology
  - Applications and compatibility
  - Compilers
- Measures
  - Methodology
  - Key measures

von Neumann's Contribution
The instruction cycle:
- Fetch
- Decode
- Evaluate addresses
- Fetch operands
- Execute
- Store results
[Slide diagram: Memory holding the program, connected to Control, Datapath, and Input/Output units.]
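The instruction cycle above can be sketched as a toy interpreter. This is a minimal illustration only; the instruction names (LOAD/ADD/STORE/HALT) and machine state are invented for this sketch, not part of any real ISA.

```python
# Toy von Neumann machine: a fetch-decode-execute loop over a flat memory.
# All names here are hypothetical, chosen only to illustrate the cycle.

def run(program, memory):
    """Repeatedly fetch, decode, and execute until HALT."""
    pc, acc = 0, 0                     # program counter, accumulator
    while True:
        op, operand = program[pc]      # fetch
        pc += 1
        if op == "LOAD":               # decode; evaluate address
            acc = memory[operand]      # fetch operand
        elif op == "ADD":
            acc += memory[operand]     # execute
        elif op == "STORE":
            memory[operand] = acc      # store result
        elif op == "HALT":
            return memory

mem = {0: 3, 1: 4, 2: 0}
result = run([("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", None)], mem)
print(result[2])  # → 7
```

Note that instructions and data share one memory abstraction, which is exactly the von Neumann property the slide highlights.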

Pipelining
Stages: Fetch, Decode, Execute, Mem, Write
- Instruction latency does not decrease
- Throughput increases
- Dependencies degrade performance
- General trend: pipelines grew deeper and wider until about 2004
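The latency-vs-throughput point can be made with simple arithmetic. A sketch, assuming an idealized five-stage pipeline with a uniform 1 ns stage delay and no hazards:

```python
# Idealized pipeline timing: latency stays the same, throughput improves.
STAGES = 5
STAGE_DELAY_NS = 1.0

def total_time_ns(n_instructions, pipelined):
    if pipelined:
        # First instruction fills the pipe; afterwards one completes per cycle.
        cycles = STAGES + (n_instructions - 1)
    else:
        cycles = STAGES * n_instructions
    return cycles * STAGE_DELAY_NS

# One instruction still takes 5 ns either way (latency unchanged),
# but 1000 instructions finish nearly 5x sooner when pipelined:
print(total_time_ns(1000, pipelined=False))  # → 5000.0
print(total_time_ns(1000, pipelined=True))   # → 1004.0
```

Dependencies would insert stalls and erode the speedup, which is why the slide lists them as a performance limiter.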

Multiple Issue
[Slide diagram: a multiple-issue pipeline with a shared Fetch stage feeding a Decode/Swap stage, then parallel execution pipes (an adder pipe with Mem and write stages, two ALU/write pipes, and a branch unit), coordinated by stall logic.]

Pipeline of the Pentium 4 [Courtesy of Intel Corp]
Stages 1-20 (some names span multiple stages): TC nxt IP, TC Fetch, Drv, Alloc, Rename, Que, Sch, Sch, Sch, Disp, Disp, RF, RF, Ex, Flgs, BrCk, Drv
- Enables a very high clock rate
- Enables scalability from one technology to the next
- But does it always lead to the highest performance?

Speeding up memory
Fundamental: principles of locality of memory references
- Temporal locality: a reference to X implies another reference to X later
- Spatial locality: a reference to X implies a reference to Y, where X and Y are close
Fundamental: memory tends to be either small/fast/expensive or large/slow/cheap
Result: memory hierarchies with caching
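The payoff of a hierarchy is usually quantified as average memory access time (AMAT). A minimal sketch; the latencies and miss rate below are illustrative assumptions, not measurements:

```python
# AMAT for a single cache level in front of main memory.
def amat(hit_time, miss_rate, miss_penalty):
    """Average cycles per access: every access pays the hit time,
    and a fraction miss_rate additionally pays the miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Assumed: 1-cycle cache hit, 5% miss rate, 100-cycle trip to main memory.
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))  # → 6.0
```

With locality keeping the miss rate low, a 100-cycle memory looks like a 6-cycle memory on average, which is the whole point of caching.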

Basic Cache Organization
The address is split into tag, index, and offset fields. In a direct-mapped cache, the index selects a single cache line; the stored tag is compared against the address tag (gated by a valid bit) to produce hit/miss, and the offset selects the data within the line.
[Slide diagram: address fields (tag, index, offset), the tag/valid array, the cache line array, the comparator producing hit/miss, and the selected Data output.]
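The tag/index/offset mechanics can be sketched directly. A minimal model, assuming for illustration 64-byte lines and 256 sets (so 6 offset bits and 8 index bits); the data array is omitted since only hit/miss behavior is shown:

```python
# Direct-mapped cache lookup: split the address, index one set, compare tags.
OFFSET_BITS = 6          # 64-byte line (assumed)
INDEX_BITS = 8           # 256 sets (assumed)

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

class DirectMappedCache:
    def __init__(self):
        # One (valid, tag) pair per set; data storage omitted.
        self.sets = [(False, None)] * (1 << INDEX_BITS)

    def access(self, addr):
        tag, index, _ = split_address(addr)
        valid, stored_tag = self.sets[index]
        hit = valid and stored_tag == tag
        if not hit:
            self.sets[index] = (True, tag)   # fill the line on a miss
        return hit

cache = DirectMappedCache()
print(cache.access(0x12345678))  # → False (cold miss)
print(cache.access(0x12345678))  # → True  (temporal locality)
print(cache.access(0x1234567C))  # → True  (spatial locality: same line)
```

Two addresses with the same index but different tags would conflict and evict each other, the characteristic weakness of the direct-mapped design.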

CMOS Inverter
[Slide figure: cross-section of an inverter in an n-well CMOS process, showing the p-MOS and n-MOS transistors, polysilicon gates, gate oxide, field oxide, metal, Vdd/Vss rails, p+/n+ diffusions, the n well, and the p substrate, alongside the schematic notation with Input and Output.]

CMOS Trends [Collated from the 2000 update of the International Technology Roadmap for Semiconductors]

Year              1995    1998    2001    2004    2007    2010    2013
Feature size      350nm   180nm   130nm   90nm    65nm    45nm    33nm
Transistor count  10M     50M     110M    350M    1300M   3500M   11000M

For high-performance CPUs, the feature size is (typically) the minimum width of a gate.

A Microarchitect's Model of CMOS Power and Delay
- Delay is proportional to the number of gates on the critical path; the typical unit of measure is the FO4 (fanout-of-four inverter) delay
- Power dissipation has two components: dynamic (switching) and static (leakage)
[Slide schematic: CMOS inverter with p-MOS and n-MOS transistors between Vdd and Vss, Input and Output labeled.]
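The dynamic component follows the standard first-order switching-power model, P_dynamic = alpha * C * Vdd^2 * f. A sketch; the activity factor, capacitance, voltage, and frequency below are illustrative assumptions:

```python
# First-order dynamic (switching) power model for CMOS.
def dynamic_power_watts(alpha, cap_farads, vdd_volts, freq_hz):
    """alpha: activity factor (fraction of capacitance switched per cycle),
    cap_farads: total switched capacitance, vdd_volts: supply voltage,
    freq_hz: clock frequency."""
    return alpha * cap_farads * vdd_volts**2 * freq_hz

# Assumed: 20% activity, 1 nF effective capacitance, 1.2 V supply, 3 GHz clock.
p = dynamic_power_watts(0.2, 1e-9, 1.2, 3e9)
print(round(p, 3))  # → 0.864
```

The quadratic dependence on Vdd is why voltage scaling was historically the main lever for power, and the linear dependence on f is why frequency growth eventually hit the power wall discussed in the trend slides.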

CMOS Trends [Collated/extrapolated from the 2000 update of the International Technology Roadmap for Semiconductors]

Year                                 1995    1998    2001    2004    2007    2010    2013
Feature size                         350nm   180nm   130nm   90nm    65nm    45nm    33nm
Transistor count                     10M     50M     110M    350M    1300M   3500M   11000M
Projected FO4 delays per pipe stage          27      13      11      8       7       6

This will likely be revised due to the transistor variability problem from the 65nm generation on.

CMOS Trends
- Transistors become more plentiful, but variable
- Gates become faster
- Wires become relatively slower
- Memory becomes relatively slower
- Power becomes a critical issue
- Noise, error rates, and design complexity grow
[Speaker notes: CMOS issues: why do wires become slower? Derivative issues: memory (board-level signaling, access time, etc.); power (peak power density, energy, dynamic and static), and the notion of a limiting factor (peak power prevents integration); error rates (small dimensions, closely spaced lines).]

Trends in Hardware
- High variability: increasing speed and power variability of transistors; limited frequency increase; reliability/verification challenges
- Large interconnect delay: increasing interconnect delay and shrinking clock domains; limited size of individual computing engines
[Slide charts: normalized leakage (Isb) vs. normalized frequency at 130nm, showing roughly 30% frequency spread and 5X leakage spread; RC delay (ps) of 1mm copper interconnect vs. clock period across the 350nm through 65nm generations.]
Source: Shekhar Borkar, Intel

Dynamic-Static Interface
Moving functionality into compilers:
- Most architectures today, including x86, rely on compilers to achieve performance goals.
- A major issue is the number of bits required to deliver information from the compiler to the runtime hardware.
- Highly optimizing compilers also reduced the incentive for machine-language programming, which makes portable programs a reality.
- The more implementation details the instruction set architecture exposes, the more difficult it is to adopt new implementation techniques.

Applications
- Applications drive architecture from "above"
- Designing next-generation computers involves understanding the behavior of important applications and exploiting their characteristics
- What are the applications of importance?
- For this course, we will often use benchmarks to characterize different architectural options

In the Image of the Simple Hardware, Humans Created Complex Software
Software creation is based on a simple execution model:
- Instructions are considered to execute sequentially
- Data objects are mapped into a flat, monolithic store reachable by all
This was reality when laid out by von Neumann in the 1940s; it is an abstraction now. This execution abstraction, the "traditional software model," has been used in the development of large, complex software.

Future Apps Reflect a Concurrent World
Exciting applications in the future mass computing market have traditionally been considered "supercomputing applications":
- Physiological simulation – cellular pathways (GE Research)
- Molecular dynamics simulation (NAMD at UIUC)
- Video and audio coding and manipulation – MPEG-4 (NCTU)
- Medical imaging – CT (UIUC)
- Consumer game and virtual reality products
These "super-applications" represent and model the physical world. Various granularities of parallelism exist, but:
- the programming model must support the required dimensions
- data delivery needs careful management
[Speaker notes: Do not go over all of the different applications. Let them read them instead.]

Direction of Computer Architecture
- Current general-purpose architectures cover traditional applications
- New parallel/privatized architectures cover some super-applications
- Attempts to grow current architectures "out," or domain-specific architectures "in," have lacked success
- By properly exploiting the parallelism of super-applications, the coverage of domain-specific architectures can be extended
[Speaker notes: Why we can't transparently extend the current model out to the future application space: the memory wall. How does the meaning of the memory wall change as we transition to architectures that target the super-application space? Lessons learned from Itanium.]

Compatibility
The case of workstations and servers:
- Relatively open to new architectures
- Performance is a major concern
- Linux/UNIX is a portable operating system
- The current economic model works against new architectures
The case of personal computers:
- Very tough on new architectures
- Windows and the Apple OS are not portable
- Most applications are distributed in binary code
- Price is of more concern than performance

Experimental Methodology
- Select real programs by characterizing the workload: PERFECT Club, SPEC, MediaBench
  - What do these programs do?
  - What input was given to these programs?
  - How are they related to your own workload?
  - What do the experimental results mean?
- Requires high-quality software support; there is tremendous variation in capability
- Trace-driven simulation vs. re-compilation
- Nothing can replace real-machine measurements

Measures
Iron Law: execution time = insts * CPI * (1 / freq), so
performance = 1 / execution time = freq / (insts * CPI) = (IPC * freq) / insts
(This is the basis of SPECmarks.)
- CPI: cycles per instruction (how is this calculated?)
- IPC: instructions per cycle
Other useful measures: average memory access time
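The Iron Law above can be checked with a quick worked example. The instruction count, CPI, and clock frequency below are illustrative assumptions:

```python
# Iron Law of processor performance:
#   execution time = insts * CPI / freq
#   performance    = 1 / execution time = (IPC * freq) / insts
def execution_time_s(insts, cpi, freq_hz):
    return insts * cpi / freq_hz

# Assumed workload: 1e9 instructions at CPI = 1.25 on a 2 GHz clock.
t = execution_time_s(1e9, 1.25, 2e9)
print(t)      # → 0.625
print(1 / t)  # → 1.6

# Equivalence check: CPI = 1.25 is the same machine as IPC = 0.8.
ipc = 0.8
assert abs((ipc * 2e9) / 1e9 - 1 / t) < 1e-12
```

The decomposition matters because the three factors are pulled in different directions by different design choices: the compiler and ISA set the instruction count, the microarchitecture sets CPI, and the technology and pipeline depth set the frequency.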