1 CSCE 930 Advanced Computer Architecture Lecture 1 Evaluate Computer Architectures Dr. Jun Wang.


2 Computer Architecture Trends

3 Figure 1.1 (H&P): Growth in microprocessor performance, 35% per year

4 Technology Trends
–Smaller feature sizes: higher speed and density
–Density has increased by a factor of 77

5 Technology Trends
Larger chips
–Trend is toward more RAM, less logic per chip
–Historically 2x per generation; leveling off?
–McKinley has large on-chip caches => larger wafers to reduce fabrication costs

6 Moore’s Law Number of transistors doubles every 18 months (amended to 24 months) Combination of both greater density and larger chips
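
As a rough illustration of the doubling rule, the sketch below projects a transistor count forward under both the 18-month and the 24-month period. The function name `projected_transistors` and the starting count are made up for the example; they do not come from the slides.

```python
def projected_transistors(initial_count, years, doubling_period_months=24):
    """Project a transistor count assuming one doubling per period."""
    doublings = (years * 12) / doubling_period_months
    return initial_count * 2 ** doublings


if __name__ == "__main__":
    start = 1_000_000  # hypothetical starting transistor count
    for period in (18, 24):
        count = projected_transistors(start, years=6, doubling_period_months=period)
        print(f"{period}-month doubling, 6 years: {count:,.0f} transistors")
```

Over six years the 18-month rule gives four doublings and the 24-month rule gives three, so the two versions of the law already differ by a factor of two.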

7 Tech. Trends, contd.
More, faster, cheaper transistors have fed an application demand for higher performance
–1970s -- serial, 1-bit integer microprocessors
–1980s -- pipelined 32-bit RISC; ISA simplicity allows processor on chip
–1990s -- large, superscalar processors, even for CISC
–2000s -- multiprocessors on a chip

8 Pipelining and Branch Prediction
Two basic ways of increasing performance
Pipelining:
[Figure: pipeline stages IF, ID, EX, ME, WB separated by latches and driven by a clock]
Branch Prediction
–Speculate on branch outcome to avoid waiting

9 Tech. Trend: memory sizes Memories have grown very dense –Feeding application demand for large, complex software

10 Tech. Trend: memory speeds Main memory speeds have not kept up with processor speeds

11 Memory Hierarchies
Gap between processor and memory performance has led to widespread use of memory hierarchies
–1960s no caches, no virtual memory
–1970s shared I & D-cache, 32-bit virtual memory
–1980s Split I- and D-caches
–1990s Two level caches, 64-bit virtual memory
–2000s Multi-level caches, both on and off-chip

12 Memory Hierarchies
[Figure: memory hierarchy spanning the processor and the memory system: Registers, L1 Cache, L2 Cache, L3 Cache, Main Memory, ordered from small/fast to large/slow]

13 I/O a key system component
I/O has evolved into a major distinguishing feature of computer systems
–1960s: disk, tape, punch cards, tty; batch processing
–1970s: character oriented displays
–1980s: video displays, audio, increasing disk sizes, beginning networking
–1990s: 3D graphics; networking a fundamental element; high quality audio
–2000s: real-time video, immersion…

14 I/O Systems
A hierarchy that divides bandwidth
[Figure: I/O system hierarchy with Proc, DRAM, local bus, controllers, high-speed I/O bus, frame buffer, monitor, LAN interface, hard drives, CD ROM, expansion, slow-speed I/O bus, floppy interface, and floppy]
Data rates
–Memory: 100 MHz, 8 bytes wide → 800 MB/s (peak)
–PCI: 33 MHz, 4 bytes wide → 132 MB/s (peak)
–SCSI: “Ultra2” (40 MHz), “Wide” (2 bytes) → 80 MB/s (peak)
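
The peak data rates on this slide follow from multiplying the clock rate by the transfer width. A minimal sketch, assuming one transfer per clock cycle and 1 MB = 10^6 bytes (the helper name `peak_bandwidth_mb_per_s` is invented for this example):

```python
def peak_bandwidth_mb_per_s(clock_mhz, width_bytes):
    """Peak bandwidth in MB/s, assuming one transfer of width_bytes per cycle."""
    return clock_mhz * width_bytes


if __name__ == "__main__":
    # (clock in MHz, width in bytes) as listed on the slide
    links = {
        "Memory": (100, 8),
        "PCI": (33, 4),
        "SCSI Ultra2 Wide": (40, 2),
    }
    for name, (mhz, width) in links.items():
        print(f"{name}: {peak_bandwidth_mb_per_s(mhz, width)} MB/s (peak)")
```

These are peak figures only; sustained rates are lower once arbitration and protocol overhead are included.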

15 Multiprocessors
Multiprocessors have been available for decades…
–1960s small MPs
–1970s small MPs
 » Dream of automatic parallelization
–1980s small MPs; emergence of servers
 » Dream of automatic parallelization
–1990s expanding MPs
 » Very large MPPs failed
 » Dream of automatic parallelization fading
–2000s widespread MPs; on-chip multithreading
 » Many applications have independent threads
 » Programmers write applications to be parallel in the first place
[Figure: processors (P) with caches (C) and memories (M) connected by an interconnection network]

16 Evaluating Computer Architectures

17 Computation Science
Computation is synthetic
–Many of the phenomena in the computing field are created by humans rather than occurring naturally in the physical world
–Very different from the natural sciences
 » When one discovers a fact about nature, it is a contribution, no matter how small
 » Creating something new alone does not establish a contribution
–Anyone can create something new in a synthetic field
–Rather, one must show that the creation is better

18 What Does “Better” Mean?
“Better” can mean many things
–Solves a problem in less time (faster)
–Solves a larger class of problems (more powerful)
–Uses resources more efficiently (cheaper)
–Is less prone to errors (more reliable)
–Is easier to manage/program (lower human cost)

19 Amdahl's Law
Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$

Amdahl's Law gives the speedup obtained from some enhancement E. Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. The law then defines the speedup that can be gained by using that special feature.

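20 Amdahl’s Law

$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$$

$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$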
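
A small sketch of the overall-speedup formula may help make the diminishing returns concrete. The function name `amdahl_speedup` and the sample values (40% of the task enhanced by a factor of 10) are assumptions chosen for illustration.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when fraction_enhanced of the task runs speedup_enhanced times faster."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


if __name__ == "__main__":
    # Hypothetical case: 40% of execution time is accelerated 10x.
    print(f"F=0.4, S=10:  {amdahl_speedup(0.4, 10):.4f}")
    # Even an enormous S cannot push the result past 1 / (1 - F).
    print(f"F=0.4, S=1e9: {amdahl_speedup(0.4, 1e9):.4f}")
```

With F = 0.4 the overall speedup is bounded by 1 / 0.6, about 1.67, however large S becomes.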

21 The “better” property is not simply an observation
–Rather, the research will postulate that a new idea
 » An architecture, algorithm, protocol, data structure, methodology, language, optimization or model, etc.
–Will lead to a “better” result
–Making the connection between the idea and the improvement is as important as quantifying how large the improvement is
The contribution is the idea, and is generally a component of a larger computational system.

22 How to Evaluate Architecture Ideas
Measuring/observing/analyzing real systems
–Accurate results
–Need a working system
 » Too expensive to evaluate architecture/system ideas

23 Analytic models
–Fast & easy analysis of relations
–$T_{\text{program}} = \text{NumOfInst} \times \left( T_{\text{cpu}} + T_{m} \times (1 - \text{CacheHit}) \right)$
–Allows extrapolation to ridiculous parameters, e.g. thousands of processors
–Sometimes infeasible to obtain accuracy (e.g. modeling caches)
–To obtain reasonable accuracy, the models may become very complex (e.g. modeling of network contention)
–Queuing theory is a commonly used technique
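
The program-time model on this slide is simple enough to evaluate directly. The sketch below does so for two cache hit rates; the function name `program_time` and all parameter values (1 ns per instruction, 100 ns memory penalty) are hypothetical numbers picked for the example.

```python
def program_time(num_instructions, t_cpu, t_mem, cache_hit_rate):
    """Evaluate T_program = NumOfInst * (T_cpu + T_m * (1 - CacheHit)).

    t_cpu and t_mem share one time unit; the result is in that same unit.
    """
    return num_instructions * (t_cpu + t_mem * (1.0 - cache_hit_rate))


if __name__ == "__main__":
    # Hypothetical machine: 1 ns per instruction, 100 ns memory penalty.
    for hit_rate in (0.95, 0.99):
        t_ns = program_time(1_000_000_000, t_cpu=1.0, t_mem=100.0,
                            cache_hit_rate=hit_rate)
        print(f"hit rate {hit_rate:.2f}: {t_ns / 1e9:.2f} s")
```

Under these assumed numbers, raising the hit rate from 95% to 99% cuts the modeled run time to about a third, which is the kind of relation such a model exposes quickly.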

24 Simulation
–The most popular method in computer architecture and system research
–Mimic the architecture/system using software
–Very flexible: nearly unlimited evaluation
–Prototyping of non-existing machines possible
–Evaluation of design options (design space exploration) cheap & flexible
–Requires some sort of validation
–Can be VERY slow

25 Tradeoff between accuracy and computational intensity
–Low level of abstraction: slow (e.g. simulating at the level of gates)
–High level of abstraction: fast (e.g. only simulating processor, cache and memory components)
The tradeoff may be intensified when modeling parallel architectures, as multiple processors need to be simulated

26 Three Simulation Techniques
Profile-based static modeling
–Simplest and least costly
–Use hardware counters on the chip or instrumented execution (such as Beowulf Linux cluster Pgprof, SGI perfex and Alpha ATOM)
Trace-driven
–A more sophisticated technique
–How it works (e.g. modeling memory system performance):
 » Collect traces generated by ATOM
 » Trace format: instruction address executed, data address accessed
 » Build the memory hierarchy model
 » Feed the trace into the simulation model and analyze the results
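
The trace-driven steps can be sketched in a few lines. This is only a toy under stated assumptions: the trace is a hand-written list of (instruction address, data address) pairs rather than real ATOM output, the memory hierarchy model is reduced to two direct-mapped caches, and the names `DirectMappedCache` and `simulate` are invented for the example.

```python
class DirectMappedCache:
    """Toy direct-mapped cache model that only counts hits and misses."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # one stored tag per cache line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.line_size
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag  # fill the line on a miss
        self.misses += 1
        return False


def simulate(trace, icache, dcache):
    """Feed (inst_addr, data_addr) records into split I- and D-cache models."""
    for inst_addr, data_addr in trace:
        icache.access(inst_addr)
        if data_addr is not None:  # None marks an instruction with no data access
            dcache.access(data_addr)


if __name__ == "__main__":
    # Hypothetical trace; a real one would be collected by an instrumentation tool.
    trace = [(0x1000, 0x8000), (0x1004, 0x8004), (0x1008, None),
             (0x100C, 0x8040), (0x1000, 0x8000)]
    icache = DirectMappedCache(num_lines=4, line_size=16)
    dcache = DirectMappedCache(num_lines=4, line_size=16)
    simulate(trace, icache, dcache)
    print("I-cache hits/misses:", icache.hits, icache.misses)
    print("D-cache hits/misses:", dcache.hits, dcache.misses)
```

Because the trace is fixed before simulation starts, the cache model cannot influence which addresses appear in it; this is the limitation that execution-driven simulation addresses.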

27 1. Compile: pgcc -Mprof=func prg.cc
2. Run the code to produce a profile data file called pgprof.out
3. View the execution profile: pgprof pgprof.out

28 Using Perfex
Usage: perfex [-e num] [-y] program [program args]
  -e num: count only event type num
  -y: generate a “cost report”
Example: perfex -e 41 -13 -y a.out
Output columns: EVENT #, Event, Events Counted
  41 Floating point OP retired
  L2 cache lines loaded
Statistics: MFLOPS; Main memory → L2 bandwidth (MB/s)

29 Execution-driven
–The most accurate and most costly
–Trace-driven simulation cannot capture the interaction between the memory system and the processor
–Detailed simulation of the memory system and the processor pipeline is performed simultaneously by actually executing the program on top of a simulation framework such as Simics, SimOS, or SimpleScalar

30 Measuring by Means of Benchmarks
–Micro-benchmarks (e.g. instruction latencies, file system throughput)
–Application benchmarks: general system behavior (e.g. Spec2000 or SPLASH2)
–Only limited evaluation possible (e.g. limited system support for measurement)
–The machine must be available
–Benchmark suites: a collection of kernels, real programs, and benchmark programs, lessening the weakness of any one benchmark by the presence of others

31 Summarize Results
Weighted Arithmetic Mean Execution Time: $\sum_{i=1}^{n} W_i \, T_i$
–Sum the products of weighting factors and execution times, reflecting the individual frequency of each workload
–$W_i = \dfrac{1}{T_i \times \sum_{j=1}^{n} (1 / T_j)}$
Geometric Mean Execution Time: $\left( \prod_{i=1}^{n} \dfrac{T_i}{N_i} \right)^{1/n}$
–Normalize each execution time $T_i$ to the reference machine's time $N_i$ and take the geometric mean of the normalized execution times
–Used by SPEC
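
Both summaries are easy to compute once the execution times are known. The sketch below is illustrative only: the function names and the timing values are hypothetical, and `reference_times` stands for the execution times on the reference machine.

```python
import math


def equal_time_weights(times):
    """Weights W_i = 1 / (T_i * sum_j(1 / T_j)); they always sum to 1."""
    inverse_sum = sum(1.0 / t for t in times)
    return [1.0 / (t * inverse_sum) for t in times]


def weighted_arithmetic_mean(times, weights):
    """Sum of W_i * T_i over all programs."""
    return sum(w * t for w, t in zip(weights, times))


def geometric_mean_normalized(times, reference_times):
    """n-th root of the product of T_i / N_i (times normalized to a reference)."""
    ratios = [t / n for t, n in zip(times, reference_times)]
    return math.prod(ratios) ** (1.0 / len(ratios))


if __name__ == "__main__":
    times = [1.0, 1000.0]  # hypothetical execution times of two programs
    weights = equal_time_weights(times)
    print("weights:", weights)
    print("weighted arithmetic mean:", weighted_arithmetic_mean(times, weights))
    print("geometric mean vs. reference:",
          geometric_mean_normalized([10.0, 100.0], [1.0, 1000.0]))
```

With these weights every program contributes the same amount to the weighted sum, so a long-running program does not dominate the result.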

32 A Report Example (P&H figure 1.17)
[Table: execution times on machines A, B, and C, normalized in turn to A, to B, and to C; rows: the two programs, arithmetic mean, geometric mean]