CS 378: Programming for Performance

Administration
Instructor: Keshav Pingali
–4.126A ACES
–Office hours: W 1:30-2:30PM
TA: Vishwas Srinivasan
–Office hours: MW 12-1 (from Jan 25th)

Prerequisites
Knowledge of basic computer architecture
–(e.g.) PC, ALU, cache, memory
Software maturity
–assignments will be in C/C++ on Linux computers
–ability to write medium-sized programs (~1000 lines)
Self-motivation
–willingness to experiment with systems
Patience
–this is the first instance of this course, so things may be a little rough

Coursework
4 or 5 programming projects
–These will be more or less evenly spaced through the semester
–Some assignments will also have short questions
Final exam or final project
–I will decide which one later in the semester

What this course is not about
This is not a tools course
–We will use a small number of tools and micro-benchmarks to understand performance, but this is not a course on how to use tools
This is not a clever-hacks course
–We are interested in general scientific principles for performance programming, not in squeezing out every last cycle for somebody's favorite program

What this course IS about
Hardware designers invent lots of hardware features that can boost program performance
However, software can exploit these features only if it is written carefully to do so
Our agenda:
–understand key architectural features at a high level
–develop general principles and techniques that can guide us when we write high-performance programs
More ambitious agenda (not ours):
–transform high-level abstract programs into efficient programs automatically

Why study performance?
Fundamental ongoing change in the computer industry
Until recently: Moore's law(s)
–Number of transistors on a chip doubles every 1.5 years
–Processor frequency doubles every 1.5 years
–Speed goes up by a factor of roughly 10 every 5 years (doubling every 1.5 years gives 2^(5/1.5) ≈ 10 in 5 years)
–Many programs ran faster if you just waited a while
From now on: Moore's law
–Number of transistors on a chip doubles every 1.5 years
–Transistors used to put multiple processing units (cores) on a chip
–Processor frequency will stay more or less the same
Unless your program can exploit multiple cores, waiting for faster chips will not help you anymore

Need for multicore processors*
Commercial end-customers are demanding
–More capable systems with more capable processors
–New systems that must stay within existing power/thermal infrastructure
High-level argument
–Silicon designers can choose a variety of approaches to increase processor performance, but these are maxing out
–Meanwhile, processor frequency and power consumption are scaling in lockstep
–One solution: multicore processors
*Material adapted from a presentation by Paul Teich of AMD

Conventional approaches to improving processor performance
Add functional units
–Superscalar is known territory
–Diminishing returns for adding more functional blocks
–Alternatives like VLIW have been considered and rejected by the market
Wider data paths
–Increasing bandwidth between functional units in a core makes a difference
–Such as a comprehensive 64-bit design, but then where to?

Conventional approaches (contd.)
Deeper pipeline
–A deeper pipeline buys frequency at the expense of increased branch mis-prediction penalty and cache miss penalty
–Deeper pipelines => higher clock frequency => more power
–Industry converging on middle ground: 9 to 11 stages; successful RISC CPUs are in the same range
More cache
–More cache buys performance only until the working set of the program fits in cache

Power problem
Moore's Law isn't dead: more transistors for everyone!
–But… it doesn't really mention scaling transistor power
Chemistry and physics at the nano-scale
–Stretching materials science
–Transistor leakage current is increasing
As manufacturing economies and frequency increase, power consumption is increasing disproportionately
There are no process quick-fixes

Static Current vs. Frequency
[Chart: static current vs. frequency; regions labeled "Embedded Parts" (fast, low power), "Fast, High Power", and "Very High Leakage and Power"]
Static current grows non-linearly as processors approach maximum frequency

Power vs. Frequency
AMD's process:
–Frequency step: 200MHz
–Two steps back in frequency cuts power consumption by ~40% from maximum frequency
Result:
–dual-core running 400MHz slower than single-core running flat out operates in the same thermal envelope
–Substantially lower power consumption with lower frequency

AMD Multi-Core Processor
Dual-core AMD Opteron™ processor is 199mm² in 90nm technology
Single-core AMD Opteron processor is 193mm² in 130nm technology

Multi-Core Software
More aggregate performance for:
–Multi-threaded apps (our focus)
–Transactions: many instances of the same app
–Multi-tasking
Problem
–Most apps are not multithreaded
–Writing multithreaded code increases software costs dramatically: a factor of 3 for the Unreal game engine (Tim Sweeney, Epic Games)

Software problem (I): Parallel programming
"We are at the cusp of a transition to multicore, multithreaded architectures, and we still have not demonstrated the ease of programming the move will require… I have talked with a few people at Microsoft Research who say this is also at or near the top of their list [of critical CS research problems]."
Justin Rattner, Senior Fellow, Intel

Our focus
Multi-threaded programming
–also known as shared-memory programming
–the application program is decomposed into a number of "threads", each of which runs on one core and performs some of the work of the application: "many hands make light work"
–threads communicate by reading and writing memory locations (that's why it is called shared-memory programming)
–we will use a popular system called OpenMP (a small sketch appears below)
Key issues:
–how do we assign work to different threads?
–how do we ensure that work is more or less equitably distributed among the threads?
–how do we make sure threads do not step on each other?
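To make the shared-memory model concrete, here is a minimal OpenMP sketch (an illustration, not course code; the loop and its bound are arbitrary). The iterations of the loop are divided among threads, and the reduction clause combines each thread's partial sum safely. Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int N = 1000000;
    double sum = 0.0;

    /* Work assignment: OpenMP splits the iterations across the available cores.
       The reduction gives each thread a private partial sum and combines them
       at the end, so threads do not step on each other. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += 1.0 / (i + 1);
    }

    printf("sum = %f (up to %d threads available)\n", sum, omp_get_max_threads());
    return 0;
}
```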

Distributed-memory programming
Large-scale parallel machines like the Lonestar machine at the Texas Advanced Computing Center (TACC) use a different model of programming called
–message-passing, or
–distributed-memory programming
Distributed-memory programming
–units of parallel execution are called processes
–processes communicate by sending and receiving messages since they have no memory locations in common
–most commonly used communication library: MPI (see the sketch below)
We will study distributed-memory programming as well, and you will get to run programs on Lonestar
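As a minimal illustration of message passing (again a sketch, not course code; the sum being computed is arbitrary), each MPI process below sums a disjoint block of numbers and the partial results are combined on process 0 with MPI_Reduce, which performs the necessary sends and receives. It could be built and run with something like mpicc and mpirun -np 4.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in total? */

    /* Each process sums a disjoint block of 1..N; no memory is shared. */
    const long N = 1000000;
    long lo = rank * (N / size) + 1;
    long hi = (rank == size - 1) ? N : lo + (N / size) - 1;
    double local = 0.0;
    for (long i = lo; i <= hi; i++)
        local += (double)i;

    /* Combine the partial sums on process 0 by message passing. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of 1..%ld = %.0f\n", N, total);

    MPI_Finalize();
    return 0;
}
```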

Software problem (II): memory hierarchy
Complication for parallel software
–unless software also exploits caches, overall performance is usually poor
–writing software that can exploit caches also complicates software development

Memory Hierarchy of SGI Octane
R10K processor:
–4-way superscalar, 2 fpo/cycle, 195MHz
–Peak performance: 390 Mflops
Experience: sustained performance is less than 10% of peak
–Processor often stalls waiting for the memory system to load data
[Diagram: memory hierarchy showing size and access time (cycles) for Regs, L1 cache (32KB I, 32KB D), L2 cache (1MB), and Memory (128MB); access-time values not recoverable]
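A simple way to see this effect for yourself is a micro-benchmark of the kind the course mentions later. The sketch below (hypothetical code; the helper name ns_per_access, the 64-byte-line stride, and the size range are all assumptions) walks buffers of increasing size and reports the time per access, which jumps each time the buffer stops fitting in a cache level.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Walk a buffer of n ints repeatedly and report nanoseconds per access.
   As n crosses the L1, L2, ... capacities, the cost per access rises. */
static double ns_per_access(int *buf, size_t n, long iters) {
    volatile int sink = 0;                    /* keeps the loads from being optimized away */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long r = 0; r < iters; r++)
        for (size_t i = 0; i < n; i += 16)    /* stride of one 64-byte line (assumed) */
            sink += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (iters * (n / 16.0));
}

int main(void) {
    for (size_t kb = 16; kb <= 16 * 1024; kb *= 2) {
        size_t n = kb * 1024 / sizeof(int);
        int *buf = calloc(n, sizeof(int));
        printf("%6zu KB: %.2f ns/access\n", kb, ns_per_access(buf, n, 100));
        free(buf);
    }
    return 0;
}
```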

Software problem (II)
Caches are useful only if programs have locality of reference
–temporal locality: program references to a given memory address are clustered together in time
–spatial locality: program references that are clustered in the address space are clustered in time
Problem:
–Programs obtained by expressing most algorithms in the straightforward way do not have much locality of reference
–How do we code applications so that they can exploit caches? (a small example follows below)
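As a first illustration of coding for locality (a sketch; the matrix size is arbitrary), the two loop nests below compute the same sum. The first traverses the matrix in row-major order, matching the C storage layout and giving good spatial locality; the second jumps a full row between consecutive accesses and typically runs several times slower on machines with caches.

```c
#include <stdio.h>

#define N 2048

int main(void) {
    /* N x N matrix stored in row-major order, as C arrays are. */
    static double a[N][N];
    double sum = 0.0;

    /* Good spatial locality: consecutive iterations touch adjacent addresses. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor spatial locality: consecutive iterations are N*8 bytes apart,
       so most accesses can miss in the cache. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("sum = %f\n", sum);
    return 0;
}
```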

Software problem (II): memory hierarchy
"…The CPU chip industry has now reached the point that instructions can be executed more quickly than the chips can be fed with code and data. Future chip design is memory design. Future software design is also memory design. … Controlling memory access patterns will drive hardware and software designs for the foreseeable future."
Richard Sites, DEC

Algorithmic questions
Do programs have parallelism? If so, what patterns of parallelism are there in common applications?
Do programs have locality? If so, what patterns of locality are there in common applications?
We will study sequential and parallel algorithms and data structures to answer these questions

Course content
Analysis of applications that need high end-to-end performance
Understanding performance: performance models, Moore's law, Amdahl's law
Measurement and the design of computer experiments
Micro-benchmarks for abstracting performance-critical aspects of computer systems
Memory hierarchy:
–caches, virtual memory
–optimizing programs for memory hierarchies
……..

Course content (contd.)
…..
Vectors and vectorization
GPUs and GPU programming
Multi-core processors and shared-memory programming, OpenMP
Distributed-memory machines and message-passing programming, MPI
Optimistic parallelization
Self-optimizing software
–ATLAS, FFTW
Depending on time, we may or may not do all of these.