Topics: Speedup, Amdahl’s Law, Execution Time, Readings. January 10, 2012. CSCE 713 Computer Architecture.

Presentation transcript:

Topics: Speedup, Amdahl’s Law, Execution Time, Readings. January 10, 2012. CSCE 713 Computer Architecture

– 2 – CSCE 713 Spring 2012 Overview
Readings for today: The Landscape of Parallel Computing Research: A View from Berkeley (EECS); Parallel Benchmarks Inspired by Berkeley Dwarfs; Ka10_7dwarfsOfSymbolicComputationNew
Topics overview: syllabus and other course pragmatics; website (not shown); dates; power wall and ILP wall, leading to multicore; Seven Dwarfs; Amdahl’s Law, Gustafson’s Law

– 3 – CSCE 713 Spring 2012 Copyright © 2012, Elsevier Inc. All rights reserved. Single Processor Performance Introduction RISC Move to multi-processor

– 4 – CSCE 713 Spring 2012 Power Wall Note that both dynamic power and dynamic energy have voltage squared as the dominant term, so lowering the voltage improves both. Supply voltages fell from 5V to 1V over time, but scaling can’t continue further without errors.
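Behind that note are the standard CMOS rules of thumb from Hennessy and Patterson (a sketch of textbook formulas, not text transcribed from the slide):

```latex
E_{\mathrm{dynamic}} \propto C_{\mathrm{load}} \times V^{2}
\qquad
P_{\mathrm{dynamic}} \propto \tfrac{1}{2} \times C_{\mathrm{load}} \times V^{2} \times f
```

Dropping the supply voltage from 5V to 1V cuts the V² factor by 25x, which is why voltage scaling carried so much of the power budget until it hit reliability limits.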

– 5 – CSCE 713 Spring 2012 Static Power CMOS chips lose power to current leakage even when the transistor is off. In 2006 the goal for leakage was 25% of total power.
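Leakage is usually modeled with the corresponding textbook rule (again a standard formula, not taken from the slide):

```latex
P_{\mathrm{static}} \propto I_{\mathrm{static}} \times V
```

Lowering the voltage helps here too, but the leakage current itself grows as feature sizes shrink, which is why static power became a first-order concern.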

– 6 – CSCE 713 Spring 2012 Single CPU Single Thread Programming Model

– 7 – CSCE 713 Spring 2012 Berkeley Conventional Wisdom
1. Old CW: Power is free, but transistors are expensive. · New CW is the “Power wall”: Power is expensive, but transistors are “free”. That is, we can put more transistors on a chip than we have the power to turn on.
2. Old CW: If you worry about power, the only concern is dynamic power. · New CW: For desktops and servers, static power due to leakage can be 40% of total power.
3. Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors occurring only at the pins. · New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates. [Borkar 2005] [Mukherjee et al 2005]

– 8 – CSCE 713 Spring 2012
4. Old CW: By building upon prior successes, we can continue to raise the level of abstraction and hence the size of hardware designs. · New CW: Wire delay, noise, cross coupling (capacitive and inductive), manufacturing variability, reliability (see above), clock jitter, design validation, and so on conspire to stretch the development time and cost of large designs at 65 nm or smaller feature sizes.

– 9 – CSCE 713 Spring 2012
5. Old CW: Researchers demonstrate new architecture ideas by building chips. · New CW: The cost of masks at 65 nm feature size, the cost of Electronic Computer Aided Design software to design such chips, and the cost of design for GHz clock rates mean researchers can no longer build believable prototypes. Thus, an alternative approach to evaluating architectures must be developed.

– 10 – CSCE 713 Spring 2012
6. Old CW: Performance improvements yield both lower latency and higher bandwidth. · New CW: Across many technologies, bandwidth improves by at least the square of the improvement in latency. [Patterson 2004]
7. Old CW: Multiply is slow, but load and store is fast. · New CW is the “Memory wall” [Wulf and McKee 1995]: Load and store is slow, but multiply is fast. Modern microprocessors can take 200 clocks to access Dynamic Random Access Memory (DRAM), but even floating-point multiplies may take only four clock cycles.

– 11 – CSCE 713 Spring 2012 CSAPP – Bryant and O’Hallaron (Computer Systems: A Programmer’s Perspective).

– 12 – CSCE 713 Spring 2012 Topics Covered
The need for gains in performance
The need for parallelism
Amdahl’s and Gustafson’s laws
Various problems: the 7 Dwarfs and others
Various approaches, and bridges between them:
Multithreaded multicore – POSIX pthreads, Intel’s TBB
Distributed – MPI
Shared memory – OpenMP
GPUs
Grid computing
Cloud computing

– 13 – CSCE 713 Spring 2012 Top 10 Challenges in Parallel Computing By Michael Wrinn (Intel), in priority order:
1. Finding concurrency in a program – how to help programmers “think parallel”?
2. Scheduling tasks at the right granularity onto the processors of a parallel machine.
3. The data locality problem: associating data with tasks and doing it in a way that our target audience will be able to use correctly.
4. Scalability support in hardware: bandwidth and latencies to memory plus interconnects between processing elements.
5. Scalability support in software: libraries, scalable algorithms, and adaptive runtimes to map high-level software onto platform details.

– 14 – CSCE 713 Spring 2012
6. Synchronization constructs (and protocols) that enable programmers to write programs free from deadlock and race conditions.
7. Tools, APIs, and methodologies to support the debugging process.
8. Error recovery and support for fault tolerance.
9. Support for good software engineering practices: composability, incremental parallelism, and code reuse.
10. Support for portable performance. What are the right models (or abstractions) so programmers can write code once and expect it to execute well on the important parallel platforms?

– 15 – CSCE 713 Spring 2012 Berkeley Conventional Wisdom
1. Old CW: Power is free, but transistors are expensive. · New CW is the “Power wall”: Power is expensive, but transistors are “free”. That is, we can put more transistors on a chip than we have the power to turn on.
2. Old CW: If you worry about power, the only concern is dynamic power. · New CW: For desktops and servers, static power due to leakage can be 40% of total power.

– 16 – CSCE 713 Spring 2012
3. Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors occurring only at the pins. · New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates. [Borkar 2005] [Mukherjee et al 2005]
4. Old CW: By building upon prior successes, we can continue to raise the level of abstraction and hence the size of hardware designs. · New CW: Wire delay, noise, cross coupling (capacitive and inductive), manufacturing variability, reliability (see above), clock jitter, design validation, and so on conspire to stretch the development time and cost of large designs at 65 nm or smaller feature sizes.

– 17 – CSCE 713 Spring 2012
5. Old CW: Researchers demonstrate new architecture ideas by building chips. · New CW: The cost of masks at 65 nm feature size, the cost of Electronic Computer Aided Design software to design such chips, and the cost of design for GHz clock rates mean researchers can no longer build believable prototypes. Thus, an alternative approach to evaluating architectures must be developed.

– 18 – CSCE 713 Spring 2012
6. Old CW: Performance improvements yield both lower latency and higher bandwidth. · New CW: Across many technologies, bandwidth improves by at least the square of the improvement in latency. [Patterson 2004]
7. Old CW: Multiply is slow, but load and store is fast. · New CW is the “Memory wall” [Wulf and McKee 1995]: Load and store is slow, but multiply is fast. Modern microprocessors can take 200 clocks to access Dynamic Random Access Memory (DRAM), but even floating-point multiplies may take only four clock cycles.

– 19 – CSCE 713 Spring 2012
8. Old CW: We can reveal more instruction-level parallelism (ILP) via compilers and architecture innovation. Examples from the past include branch prediction, out-of-order execution, speculation, and Very Long Instruction Word systems. · New CW is the “ILP wall”: There are diminishing returns on finding more ILP.

– 20 – CSCE 713 Spring 2012
9. Old CW: Uniprocessor performance doubles every 18 months. · New CW is Power Wall + Memory Wall + ILP Wall = Brick Wall. Figure 2 plots processor performance for almost 30 years. In 2006, performance is a factor of three below the traditional doubling every 18 months that we enjoyed between 1986 and 2002. The doubling of uniprocessor performance may now take 5 years.

– 21 – CSCE 713 Spring 2012
10. Old CW: Don’t bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer. · New CW: It will be a very long wait for a faster sequential computer.

– 22 – CSCE 713 Spring 2012
11. Old CW: Increasing clock frequency is the primary method of improving processor performance. · New CW: Increasing parallelism is the primary method of improving processor performance.
12. Old CW: Less than linear scaling for a multiprocessor application is failure. · New CW: Given the switch to parallel computing, any speedup via parallelism is a success.

– 23 – CSCE 713 Spring 2012.

– 24 – CSCE 713 Spring 2012 Amdahl’s Law Suppose you have an enhancement or improvement in a design component. The improvement in the performance of the system is limited by the percentage of the time the enhancement can be used.
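In the quantitative form used by Hennessy and Patterson (the slide states the law only informally), with f the fraction of execution time the enhancement applies to and S the speedup on that fraction:

```latex
\mathrm{Speedup}_{\mathrm{overall}}
  = \frac{T_{\mathrm{old}}}{T_{\mathrm{new}}}
  = \frac{1}{(1 - f) + f / S}
```

Even as S grows without bound, the overall speedup is capped at 1/(1 - f): if the enhancement covers 90% of execution time, no improvement can ever deliver more than a 10x overall speedup.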

– 25 – CSCE 713 Spring 2012 Exec Time of Parallel Computation

– 26 – CSCE 713 Spring 2012 Gustafson’s Law: Scale the problem
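The slide title carries no formula; the usual statement of Gustafson’s scaled speedup (a standard-form sketch), with p processors and s the serial fraction of time measured on the parallel system, is:

```latex
\mathrm{Speedup}_{\mathrm{scaled}}(p) = s + p\,(1 - s) = p - s\,(p - 1)
```

The point of “scale the problem” is that s typically shrinks as the problem grows, so the scaled speedup approaches linear in p.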

– 27 – CSCE 713 Spring 2012 Matrix Multiplication – Scaling the Problem Note that we would really scale a model of a “real problem,” but matrix multiplication might be one of the steps required; a sketch follows below.
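A minimal sketch of the idea (illustrative code, not from the slides; OpenMP is one of the tools listed earlier in the lecture): the matrix dimension n is a command-line parameter, so the problem can be grown with the core count in the spirit of Gustafson’s law.

```c
/* matmul.c - naive O(n^3) matrix multiply, C = A * B.
 * Compile: gcc -O2 -fopenmp matmul.c -o matmul
 * Run:     ./matmul 1024    (scale n up as cores are added) */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static void matmul(const double *A, const double *B, double *C, int n)
{
    /* Each row of C is written by exactly one thread, so the outer
     * loop parallelizes with no synchronization. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 512;
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }

    double t0 = omp_get_wtime();
    matmul(A, B, C, n);
    double t1 = omp_get_wtime();
    printf("n=%d threads=%d time=%.3fs\n", n, omp_get_max_threads(), t1 - t0);
    free(A); free(B); free(C);
    return 0;
}
```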

– 28 – CSCE 713 Spring 2012

– 29 – CSCE 713 Spring 2012 Phillip Colella’s “Seven Dwarfs” High-end simulation in the physical sciences = 7 numerical methods:
1. Structured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo
Well-defined targets from the algorithmic, software, and architecture standpoints. If we add four more for embedded computing, the list covers all 41 EEMBC benchmarks:
8. Search/Sort
9. Filter
10. Combinational Logic
11. Finite State Machine
Note: data sizes (8 bit to 32 bit) and types (integer, character) differ, but the algorithms are the same. Slide from “Defining Software Requirements for Scientific Computing,” Phillip Colella.

– 30 – CSCE 713 Spring 2012 Seven Dwarfs – Dense Linear Algebra Data are dense matrices or vectors. Generally, such applications use unit-stride memory accesses to read data from rows, and strided accesses to read data from columns (see the sketch below). Communication pattern: black is no communication.
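A small illustration of the access-pattern point (assumed C row-major layout; not code from the slides): traversing rows touches adjacent memory, while traversing columns jumps N elements at a time.

```c
/* stride.c - unit-stride vs. strided traversal of a row-major array. */
#include <stdio.h>
#define N 1024

static double a[N][N];

int main(void)
{
    double sum = 0.0;

    /* Unit-stride: a[i][j] and a[i][j+1] are adjacent in memory,
     * so each cache line is fully used. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Strided: a[i][j] and a[i+1][j] are N doubles (8*N bytes) apart,
     * so each access may touch a new cache line. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("sum = %f\n", sum);
    return 0;
}
```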

– 31 – CSCE 713 Spring 2012 Seven Dwarfs – Sparse Linear Algebra
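The slide body is a figure; as an illustrative sketch of this dwarf’s characteristic kernel (assumed example data, not from the lecture), here is matrix-vector multiply in compressed sparse row (CSR) form. The indexed gather through col_idx is what makes the memory access pattern irregular:

```c
/* spmv.c - y = A*x with A stored in CSR format.
 * Example matrix: [[10, 0, 0], [0, 20, 30], [0, 0, 40]] */
#include <stdio.h>

static const double val[]     = {10.0, 20.0, 30.0, 40.0}; /* nonzeros, row by row */
static const int    col_idx[] = {0, 1, 2, 2};              /* column of each nonzero */
static const int    row_ptr[] = {0, 1, 3, 4};              /* row i spans [row_ptr[i], row_ptr[i+1]) */

int main(void)
{
    const double x[3] = {1.0, 1.0, 1.0};
    double y[3];
    for (int i = 0; i < 3; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
            sum += val[k] * x[col_idx[k]];   /* gather: irregular access to x */
        y[i] = sum;
    }
    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);  /* expect [10, 50, 40] */
    return 0;
}
```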

– 32 – CSCE 713 Spring 2012 Seven Dwarfs – Spectral Methods (e.g., FFT)

– 33 – CSCE 713 Spring 2012 Seven Dwarfs – N-Body Methods Depends on interactions between many discrete points. Variations include particle-particle methods, where every point depends on all others, leading to an O(N²) calculation, and hierarchical particle methods, which combine forces or potentials from multiple points to reduce the computational complexity to O(N log N) or O(N).
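A hedged sketch of the particle-particle variant the slide describes (the simplified 2-D force model and all names are illustrative, not from the lecture); the nested loop is the O(N²) structure:

```c
/* nbody.c - direct particle-particle force evaluation, O(N^2).
 * Compile: gcc nbody.c -o nbody -lm */
#include <math.h>
#include <stdio.h>

#define N   3
#define EPS 1e-9   /* softening term so r^2 is never zero */

int main(void)
{
    double x[N] = {0.0, 1.0, 2.0}, y[N] = {0.0, 0.0, 1.0};
    double m[N] = {1.0, 2.0, 3.0};
    double fx[N] = {0}, fy[N] = {0};

    /* Every point depends on all others: N*(N-1) pairwise interactions. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            if (i == j) continue;
            double dx = x[j] - x[i], dy = y[j] - y[i];
            double r2 = dx*dx + dy*dy + EPS;
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            fx[i] += m[i] * m[j] * dx * inv_r3;   /* gravitational constant omitted */
            fy[i] += m[i] * m[j] * dy * inv_r3;
        }

    for (int i = 0; i < N; i++)
        printf("f[%d] = (%g, %g)\n", i, fx[i], fy[i]);
    return 0;
}
```

Hierarchical methods (Barnes-Hut, fast multipole) replace the inner loop with combined contributions from groups of distant points, which is how the complexity drops to O(N log N) or O(N).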

– 34 – CSCE 713 Spring 2012 Seven Dwarfs – Structured Grids
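The slide body is a figure; a minimal sketch of the dwarf’s canonical kernel (illustrative, not from the lecture) is a 5-point Jacobi sweep, where each interior point is recomputed from its four neighbors on a regular grid:

```c
/* jacobi.c - 5-point Jacobi sweeps over a structured grid. */
#include <stdio.h>
#define N 8

int main(void)
{
    static double u[N][N], unew[N][N];   /* static: zero-initialized */
    u[N/2][N/2] = 100.0;                 /* one hot spot to diffuse */

    for (int sweep = 0; sweep < 10; sweep++) {
        for (int i = 1; i < N-1; i++)
            for (int j = 1; j < N-1; j++)
                unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                     u[i][j-1] + u[i][j+1]);
        for (int i = 1; i < N-1; i++)    /* copy back; boundary stays 0 */
            for (int j = 1; j < N-1; j++)
                u[i][j] = unew[i][j];
    }
    printf("u[%d][%d] = %g\n", N/2, N/2, u[N/2][N/2]);
    return 0;
}
```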

– 35 – CSCE 713 Spring 2012 Seven Dwarfs – Unstructured Grids An irregular grid where data locations are selected, usually by underlying characteristics of the application.

– 36 – CSCE 713 Spring 2012 Seven Dwarfs – Monte Carlo Calculations depend on statistical results of repeated random trials; the method is considered embarrassingly parallel, and communication is typically not dominant. (Embarrassingly Parallel / NSF TeraGrid)
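The classic illustration of this dwarf (a sketch, not from the slides) estimates pi by sampling the unit square; every trial is independent, which is exactly what makes the method embarrassingly parallel:

```c
/* mcpi.c - Monte Carlo estimate of pi from independent random trials. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long trials = 1000000;
    long hits = 0;
    srand(42);                              /* fixed seed for reproducibility */
    for (long t = 0; t < trials; t++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x*x + y*y <= 1.0)
            hits++;                         /* inside the quarter circle */
    }
    printf("pi ~= %f\n", 4.0 * (double)hits / trials);
    return 0;
}
```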

– 37 – CSCE 713 Spring 2012 Principle of Locality Rule of thumb: a program spends 90% of its execution time in only 10% of the code, so that 10% is where optimization effort pays off. Locality of memory references: temporal locality and spatial locality.

– 38 – CSCE 713 Spring 2012 Taking Advantage of Parallelism
Logic parallelism – carry-lookahead adder
Word parallelism – SIMD
Instruction pipelining – overlap fetch and execute
Multithreading – executing independent instructions at the same time (see the pthreads sketch below)
Speculative execution
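Since the course lists POSIX pthreads among its tools, here is a minimal sketch of the multithreading bullet (illustrative code, not from the slides): each thread sums an independent slice of an array, so no locking is needed.

```c
/* psum.c - array sum split across POSIX threads.
 * Compile: gcc psum.c -o psum -lpthread */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000          /* divisible by NTHREADS */

static double data[N];
static double partial[NTHREADS];

static void *worker(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;       /* one slot per thread: no shared writes */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N; i++) data[i] = 1.0;

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);

    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("sum = %f\n", total);   /* expect 1000000.000000 */
    return 0;
}
```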

– 39 – CSCE 713 Spring 2012 Linux – System Info
saluda> lscpu
Architecture:          i686
CPU op-mode(s):        32-bit, 64-bit
CPU(s):                4
Thread(s) per core:    1
Core(s) per socket:    4
CPU socket(s):         1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 15
Stepping:              11
CPU MHz:
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
saluda>

– 40 – CSCE 713 Spring 2012 Control Panel → System and Security → System

– 41 – CSCE 713 Spring 2012 Task Manager.