Lecture 2: Basic Notions and Fundamentals


1 Lecture 2: Basic Notions and Fundamentals
© Wen-mei Hwu and S. J. Patel, 2005 ECE 511, University of Illinois

2 Outline
- Basic computer organization
  - von Neumann model and execution cycle
  - Pipelines and caches
- Architectural drivers
  - Technology
  - Applications and compatibility
  - Compilers
- Measures
  - Methodology
  - Key measures

3 von Neumann’s Contribution
[Diagram: Memory (holding the program) connected to Control, Datapath, and Input/Output]
Instruction cycle: Fetch, Decode, Evaluate addresses, Fetch operands, Execute, Store results

4 Pipelining
Stages: Fetch, Decode, Execute, Mem, Write
- Instruction latency does not decrease
- Throughput increases
- Dependencies degrade performance
- General trend: deeper/wider until 2004
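The latency/throughput distinction on this slide can be made concrete with a little arithmetic. This is a minimal sketch, assuming an ideal 5-stage pipeline with a made-up 1 ns stage delay and no stalls (the slide's "dependencies degrade performance" caveat is ignored here):

```python
# Sketch: pipelining leaves single-instruction latency unchanged but
# raises throughput. Stage count and delay are illustrative assumptions.
STAGES = 5            # Fetch, Decode, Execute, Mem, Write
STAGE_DELAY_NS = 1.0  # assumed per-stage delay

def unpipelined_time(n_insts):
    # each instruction occupies the whole datapath before the next starts
    return n_insts * STAGES * STAGE_DELAY_NS

def pipelined_time(n_insts):
    # fill the pipe once, then (ideally) one instruction completes per cycle
    return (STAGES + n_insts - 1) * STAGE_DELAY_NS

n = 1000
speedup = unpipelined_time(n) / pipelined_time(n)
print(f"ideal speedup for {n} instructions: {speedup:.2f}x")
```

For one instruction both versions take the same 5 ns (latency is unchanged); for a long run the speedup approaches the stage count.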

5 Multiple Issue
[Diagram: Fetch and Decode/Swap stages feeding parallel back-end pipes (adder with Mem and write stages, two ALU/write pipes, and a branch unit), governed by stall logic]

6 Pipeline of the Pentium 4 [Courtesy of Intel Corp]
[Diagram: the 20-stage pipeline: TC nxt IP, TC Fetch, Drv, Alloc, Rename, Que, Sch, Sch, Sch, Disp, Disp, RF, RF, Ex, Flgs, BrCk, Drv]
- Enables a very high clock rate
- Enables scalability from one technology to the next
- But does it always lead to the highest performance?

7 Speeding up memory
- Fundamental: principles of locality of memory references
  - Temporal: a reference to X implies another reference to X later
  - Spatial: a reference to X implies a reference to Y, where X and Y are close
- Fundamental: memory tends to be small/fast/expensive or large/slow/cheap
- Result: memory hierarchies with caching
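The payoff of a memory hierarchy is usually summarized by average memory access time (AMAT). A minimal sketch, with assumed latencies (1 ns cache, 100 ns memory) and assumed hit rates rather than measured ones:

```python
# AMAT shows why a small/fast cache in front of large/slow memory pays
# off. The latencies and hit rates below are illustrative assumptions.
def amat(hit_time, miss_rate, miss_penalty):
    # time every access pays, plus the penalty weighted by how often we miss
    return hit_time + miss_rate * miss_penalty

cache_ns, memory_ns = 1.0, 100.0
for hit_rate in (0.90, 0.95, 0.99):
    t = amat(cache_ns, 1.0 - hit_rate, memory_ns)
    print(f"hit rate {hit_rate:.0%}: AMAT = {t:.1f} ns")
```

Because the miss penalty is so large, even a few points of hit rate change AMAT dramatically, which is why locality matters.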

8 Basic Cache Organization
[Diagram: direct-mapped cache. The address is split into tag, index, and offset fields; the index selects a cache line, whose valid bit and stored tag are checked against the address tag to signal hit/miss, and the offset selects the data within the line]
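The tag/index/offset split in the diagram is just bit slicing. A minimal sketch, assuming an example geometry (256 lines of 64 bytes) that is not from the lecture:

```python
# Decomposing an address into tag/index/offset for a direct-mapped
# cache. The geometry is an assumed example: 256 lines of 64 bytes.
LINE_BYTES = 64                          # offset field: 6 bits
NUM_LINES = 256                          # index field: 8 bits
OFFSET_BITS = LINE_BYTES.bit_length() - 1
INDEX_BITS = NUM_LINES.bit_length() - 1

def split_address(addr):
    offset = addr & (LINE_BYTES - 1)                 # byte within the line
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)  # which line to check
    tag = addr >> (OFFSET_BITS + INDEX_BITS)         # compared for hit/miss
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), index, offset)
```

On a lookup, the hardware reads line `index`, and signals a hit only if that line is valid and its stored tag equals `tag`.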

9 CMOS Inverter
[Figures: schematic notation for the inverter (p-MOS transistor tied to Vdd and n-MOS transistor tied to Vss, gates joined at the Input, drains joined at the Output), and a cross-section of the inverter in an n-well process showing the polysilicon gates, gate oxide, field oxide, metal, p+ and n+ diffusions, the n well, and the p substrate]

10 CMOS Trends [Collated from the 2000 update of the International Technology Roadmap for Semiconductors]

Year              1995    1998    2001    2004    2007    2010    2013
Feature size      350nm   180nm   130nm   90nm    65nm    45nm    33nm
Transistor count  10M     50M     110M    350M    1300M   3500M   11000M

For high-performance CPUs, the feature size is typically the minimum width of a gate.

11 A Microarchitect’s Model of CMOS Power and Delay
[Diagram: CMOS inverter schematic, with the p-MOS transistor tied to Vdd, the n-MOS transistor tied to Vss, and shared Input and Output nodes]
- Delay is proportional to the number of gates on the path; the typical measure is FO4 (fanout-of-4 inverter delay)
- Power dissipation: dynamic (switching) and static (leakage)
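The two power terms on this slide have standard first-order formulas: dynamic power scales as activity x capacitance x Vdd^2 x frequency, and static power as Vdd x leakage current. A minimal sketch with made-up example values (none of these numbers come from the lecture):

```python
# Rough first-order model of the slide's two power terms, with
# illustrative (not real) values for a hypothetical chip.
def dynamic_power(activity, cap_farads, vdd, freq_hz):
    # energy ~ C*V^2 per switched node, scaled by activity and clock rate
    return activity * cap_farads * vdd**2 * freq_hz

def static_power(vdd, leakage_amps):
    # leakage current flows regardless of switching
    return vdd * leakage_amps

p_dyn = dynamic_power(activity=0.1, cap_farads=1e-8, vdd=1.2, freq_hz=3e9)
p_stat = static_power(vdd=1.2, leakage_amps=5.0)
print(f"dynamic ~{p_dyn:.2f} W, static ~{p_stat:.2f} W")
```

The quadratic dependence on Vdd is why supply-voltage scaling was the main historical lever on dynamic power, while leakage is what makes static power "critical" at small geometries.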

12 CMOS Trends [Collated/extrapolated from the 2000 update of the International Technology Roadmap for Semiconductors]

Year                           1995    1998    2001    2004    2007    2010    2013
Feature size                   350nm   180nm   130nm   90nm    65nm    45nm    33nm
Transistor count               10M     50M     110M    350M    1300M   3500M   11000M
Projected FO4 delays
per pipe stage                         27      13      11      8       7       6

This will likely be revised due to the transistor variability problem from the 65nm generation on.

13 CMOS Trends
- Transistors become more plentiful but variable
- Gates become faster
- Wires become relatively slower
- Memory becomes relatively slower
- Power becomes a critical issue
- Noise, error rates, design complexity

[Speaker notes: CMOS issues: why do wires become slower? Derivative issues: memory (board-level signaling, access time, etc.); power (peak power density, energy, dynamic and static; the notion of a limiting factor, since peak power prevents integration); error rates (small dimensions, closely spaced lines)]

14 Trends in hardware
- High variability
  - Increasing speed and power variability of transistors
  - Limited frequency increase
  - Reliability / verification challenges
- Large interconnect delay
  - Increasing interconnect delay and shrinking clock domains
  - Limited size of individual computing engines

[Charts: at 130nm, normalized frequency spans roughly 30% while normalized leakage (Isb) spans roughly 5X; the RC delay of 1mm of copper interconnect, plotted against clock period across the 350nm-65nm generations]
Source: Shekhar Borkar, Intel
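The "1mm interconnect" curve in the chart is a distributed-RC delay, commonly approximated as t = 0.38*R*C for a distributed line. A minimal sketch with assumed per-mm resistance and capacitance values (illustrative, not ITRS or Intel data):

```python
# Why wires become relatively slower: a wire's distributed RC delay is
# roughly 0.38*R*C, and it does not shrink the way gate delay does.
# The per-mm R and C values are assumed examples.
def wire_delay_ps(r_ohm_per_mm, c_ff_per_mm, length_mm):
    r = r_ohm_per_mm * length_mm         # total resistance (ohms)
    c = c_ff_per_mm * length_mm * 1e-15  # total capacitance (farads)
    return 0.38 * r * c * 1e12           # distributed-RC delay, in ps

delay = wire_delay_ps(r_ohm_per_mm=300.0, c_ff_per_mm=200.0, length_mm=1.0)
print(f"1mm wire: ~{delay:.1f} ps")
```

Note the quadratic dependence on length (both R and C grow with it), which is one reason long cross-chip wires shrink the reachable clock domain.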

15 Dynamic-Static Interface
- Moving functionality into compilers: most architectures today, including x86, rely on compilers to achieve performance goals
- A major issue is the number of bits required to deliver information from the compiler to the runtime hardware
- Highly optimizing compilers also reduce the incentive for machine-language programming, which makes portable programs a reality
- The more implementation details one exposes in the instruction set architecture, the more difficult it is to adopt new implementation techniques

16 Applications
- Applications drive architecture from "above"
- Designing next-generation computers involves understanding the behavior of important applications and exploiting their characteristics
- What are the applications of importance?
- We will often use benchmarks to characterize different architectural options

17 In the image of the simple hardware, humans created complex software
- Software creation is based on a simple execution model:
  - Instructions are considered to execute sequentially
  - Data objects are mapped into a flat, monolithic store reachable by all
- This was reality when laid out by von Neumann in the 1940s; it is an abstraction now
- This execution abstraction has been used in the development of large, complex software: the "traditional software model"

18 Future apps reflect a concurrent world
- Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications":
  - Physiological simulation: cellular pathways (GE Research)
  - Molecular dynamics simulation (NAMD at UIUC)
  - Video and audio coding and manipulation: MPEG-4 (NCTU)
  - Medical imaging: CT (UIUC)
  - Consumer game and virtual reality products
- These "super-applications" represent and model the physical world
- Various granularities of parallelism exist, but:
  - the programming model must support the required dimensions
  - data delivery needs careful management

[Speaker note: Do not go over all of the different applications; let them read them instead.]

19 Direction of computer architecture
- Current general-purpose architectures cover traditional applications
- New parallel-privatized architectures cover some super-applications
- Attempts to grow current architectures "out" or domain-specific architectures "in" have lacked success
- By properly exploiting the parallelism of super-applications, the coverage of domain-specific architectures can be extended

[Speaker notes: This leads to the memory wall: why we can't transparently extend the current model to the future application space. How does the meaning of the memory wall change as we transition to architectures that target the super-application space? Lessons learned from Itanium.]

20 Compatibility
The case of workstations and servers:
- Relatively open to new architectures; performance is a major concern
- Linux/UNIX is a portable operating system
- The current economic model works against new architectures
The case of personal computers:
- Very tough on new architectures
- Windows and the Apple OS are not portable
- Most applications are distributed in binary code
- Price is of more concern than performance

21 Experimental Methodology
- Select real programs by characterizing the workload: PERFECT Club, SPEC, MediaBench
  - What do these programs do?
  - What input was given to these programs?
  - How are they related to your own workload?
  - What do the experimental results mean?
- Requires high-quality software support; there is tremendous variation in capability
- Trace-driven simulation vs. re-compilation
- Nothing can replace real-machine measurements

22 Measures
Iron Law: performance = 1 / execution time
  = 1 / (CPI * insts * (1/freq))   (basis of SPEC marks)
  = (IPC * freq) / insts
- CPI: cycles per instruction (how is this calculated?)
- IPC: instructions per cycle
- Other useful measures: average memory access time
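The Iron Law above can be written out directly. A minimal sketch, comparing the same program on two hypothetical machines whose instruction count, CPI, and frequency are assumed for illustration:

```python
# The Iron Law: execution time = insts * CPI / frequency.
# The workload and machine numbers below are assumed examples.
def execution_time(insts, cpi, freq_hz):
    return insts * cpi / freq_hz

def performance(insts, cpi, freq_hz):
    return 1.0 / execution_time(insts, cpi, freq_hz)

# a higher clock does not automatically win if CPI rises
# (cf. the deep-pipeline question on the Pentium 4 slide)
a = execution_time(insts=1e9, cpi=1.0, freq_hz=2e9)   # 0.5 s
b = execution_time(insts=1e9, cpi=2.5, freq_hz=4e9)   # 0.625 s
print(f"machine A: {a:.3f} s, machine B: {b:.3f} s")
```

Machine B clocks twice as fast yet runs the program more slowly, because its CPI more than doubled; this is the trade-off the Iron Law makes explicit.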

