CS 612: Software Design for High-performance Architectures.


Administration Instructor: Keshav Pingali –457 Rhodes Hall TA: Milind Kulkarni –490 Rhodes Hall

Course content Understand high-end programming paradigms, compilers and runtime systems –Applications requirements –Shared-memory programming –Optimistic and pessimistic parallelization –Transactional memory –Memory hierarchy optimization –Self-optimizing systems Focus on software problem for multicore processors

Problem Silicon designers can choose a variety of methods to increase processor performance Commercial end-customers are demanding –More capable systems with more capable processors –That new systems stay within their existing power/thermal infrastructure Processor frequency and power consumption seem to be scaling in lockstep How can the industry-standard PC and Server industries stay on our historic performance curve without burning a hole in our motherboards?

What is a processor? A single chip package that fits in a socket ≥1 core (not much point in <1 core…) –Cores can have functional units, cache, etc. associated with them, just as today –Cores can be fast or slow, just as today Shared resources –More cache –Other integration: memory controllers, high-speed serial links, etc. One system interface no matter how many cores –Number of signal pins doesn’t scale with number of cores

ILP Problem Functional units –Superscalar is known territory –Diminishing returns for adding more functional blocks –Alternatives like VLIW have been considered and rejected by the market –Single-threaded architectural performance is pegged Data paths –Increasing bandwidth between functional units in a core makes a difference Such as comprehensive 64-bit design, but then where to?

ILP Problem (contd.) Pipeline –Deeper pipeline buys frequency at expense of increased cache miss penalty and lower instructions per clock –Shallow pipeline gives better instructions per clock at the expense of frequency scaling –Max frequency per core requires deeper pipelines –Industry converging on middle ground…9 to 11 stages Successful RISC CPUs are in the same range Cache –Cache size buys performance at expense of die size –Deep pipeline cache miss penalties are reduced by larger caches

Power problem Moore’s Law isn’t dead, more transistors for everyone! –But…it doesn’t really mention scaling transistor power Chemistry and physics at nano-scale –Stretching materials science –Transistor leakage current is increasing As manufacturing economies and frequency increase, power consumption is increasing disproportionately There are no process or architectural quick-fixes

Static Current vs. Frequency [Chart: static current vs. frequency, with regions for embedded parts, fast/low-power parts, fast/high-power parts, and very high leakage and power; static current grows non-linearly as processors approach max frequency]

Power vs. Frequency In AMD’s process, with 200MHz frequency steps, stepping back two steps from the maximum frequency cuts power consumption by ~40% Substantially lower power with lower frequency Result is a dual-core part running at n-2 frequency steps in the same thermal envelope as a single-core part running at top speed

AMD Multi-Core Processor Dual-core AMD Opteron™ processor is 199mm² in 90nm Single-core AMD Opteron processor is 193mm² in 130nm

Multi-Core Processor Architecture

Multi-Core Software More aggregate performance for: –Multi-threaded apps –Transactions: many instances of same app –Multi-tasking Problem –Most apps are not multithreaded –Writing multithreaded code increases software costs dramatically (factor of 3 for some game engines)

First problem: Parallelization “We are at the cusp of a transition to multicore, multithreaded architectures, and we still have not demonstrated the ease of programming the move will require… I have talked with a few people at Microsoft Research who say this is also at or near the top of their list [of critical CS research problems].” Justin Rattner, Senior Fellow, Intel

Second problem: memory hierarchy “…The CPU chip industry has now reached the point that instructions can be executed more quickly than the chips can be fed with code and data. Future chip design is memory design. Future software design is also memory design. … Controlling memory access patterns will drive hardware and software designs for the foreseeable future.” Richard Sites, DEC

Memory Hierarchy of SGI Octane R10K processor: –4-way superscalar, 2 fpo/cycle, 195MHz Peak performance: 390 Mflops Experience: sustained performance is less than 10% of peak –Processor often stalls waiting for memory system to load data [Figure: memory hierarchy with size and access time (cycles) per level – Regs; L1 cache: 32KB (I), 32KB (D); L2 cache: 1MB; Memory: 128MB]

Memory-wall solutions Latency avoidance: –multi-level memory hierarchies (caches) Latency tolerance: –Pre-fetching –multi-threading Techniques are not mutually exclusive: –Most microprocessors have caches and pre-fetching –Modest multi-threading is coming into vogue –Our focus: memory hierarchies
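As a hedged illustration of the pre-fetching idea above (not from the original slides): the sketch below issues software prefetches a fixed distance ahead of the loop so that data arrives before it is needed. It assumes the GCC/Clang __builtin_prefetch builtin; the function name and the prefetch distance of 16 elements are illustrative, not tuned values.

    /* Hedged sketch: tolerate memory latency by prefetching ahead of use.
       Assumes GCC/Clang __builtin_prefetch; PF_DIST is illustrative. */
    #include <stddef.h>

    #define PF_DIST 16   /* elements to prefetch ahead (hypothetical choice) */

    double sum_with_prefetch(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                __builtin_prefetch(&a[i + PF_DIST], 0, 1);  /* read access, low temporal locality */
            s += a[i];
        }
        return s;
    }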

Hiding latency in numerical codes Most numerical kernels: O(n³) work, O(n²) data –all factorization codes Cholesky factorization: A = LLᵀ (A is spd) LU factorization: A = LU LU factorization with pivoting: PA = LU QR factorization: A = QR (Q is orthogonal) –BLAS-3: matrix multiplication ⇒ use latency avoidance techniques Matrix-vector product: O(n²) work, O(n²) data –use latency tolerance techniques such as pre-fetching –particularly important for iterative solution of large sparse systems

Software problem Caches are useful only if programs have locality of reference –temporal locality: program references to a given memory address are clustered together in time –spatial locality: program references clustered in address space are clustered in time Problem: –Programs obtained by expressing most algorithms in the straightforward way do not have much locality of reference –Worrying about locality when coding algorithms complicates the software process enormously.
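A minimal, hedged illustration of the locality point (not part of the original slides): both functions below compute the same sum over a row-major array, but the row-order walk touches consecutive addresses (good spatial locality) while the column-order walk strides a full row between accesses and typically misses far more often. The array size and function names are illustrative.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 2048

    /* good spatial locality: innermost loop walks consecutive addresses */
    static double sum_by_rows(double (*a)[N]) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* poor spatial locality: innermost loop strides by N doubles */
    static double sum_by_cols(double (*a)[N]) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        double (*a)[N] = malloc(sizeof(double[N][N]));
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;
        printf("%.0f %.0f\n", sum_by_rows(a), sum_by_cols(a));
        free(a);
        return 0;
    }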

Example: matrix multiplication Great algorithmic data reuse: each array element is touched O(N) times! All six loop permutations are computationally equivalent (even modulo round-off error). However, execution times of the six versions can be very different if the machine has a cache.
DO I = 1, N    // assume arrays stored in row-major order
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

IJK version (large cache)
DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
Large cache scenario: –Matrices are small enough to fit into cache –Only cold misses, no capacity misses –Miss ratio: Data size = 3N² Each miss brings in b floating-point numbers Miss ratio = (3N²/b) / 4N³ = 0.75/(bN) ≈ 0.019 (b = 4, N = 10)

IJK version (small cache)
DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
Small cache scenario: –Matrices are large compared to cache/row-major storage –Cold and capacity misses –Miss ratio: C: N²/b misses (good temporal locality) A: N³/b misses (good spatial locality) B: N³ misses (poor temporal and spatial locality) Miss ratio ≈ 0.25(b+1)/b ≈ 0.31 (for b = 4)

MMM Experiments Simulated L1 Cache Miss Ratio for Intel Pentium III –MMM with N = 1…1300 –16KB 32B/Block 4-way 8-byte elements
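A hedged sketch of how such a miss-ratio simulation might be set up (the actual tooling behind the experiment is not described in the slides): a tiny 4-way set-associative LRU cache model with the 16KB / 32B-line parameters above, fed the address stream of the ijk loop nest for a single matrix size. The names and the choice N = 200 are illustrative.

    /* Hedged sketch: simulate the L1 miss ratio of the ijk MMM loop nest for a
       16KB, 32-byte-line, 4-way LRU cache (parameters from the slide above). */
    #include <stdio.h>

    #define CACHE_BYTES (16 * 1024)
    #define LINE_BYTES  32
    #define WAYS        4
    #define SETS        (CACHE_BYTES / LINE_BYTES / WAYS)   /* 128 sets */

    static unsigned long tags[SETS][WAYS];   /* 0 means "empty" */
    static unsigned long misses, accesses;

    static void access_addr(unsigned long addr) {
        unsigned long line = addr / LINE_BYTES;
        unsigned long set  = line % SETS;
        unsigned long tag  = line / SETS + 1;   /* +1 so a real tag is never 0 */
        accesses++;
        for (int w = 0; w < WAYS; w++) {
            if (tags[set][w] == tag) {           /* hit: move tag to MRU slot */
                for (int i = w; i > 0; i--) tags[set][i] = tags[set][i - 1];
                tags[set][0] = tag;
                return;
            }
        }
        misses++;                                /* miss: evict the LRU way */
        for (int i = WAYS - 1; i > 0; i--) tags[set][i] = tags[set][i - 1];
        tags[set][0] = tag;
    }

    int main(void) {
        int N = 200;   /* one point of the N = 1...1300 sweep */
        unsigned long A = 0, B = 8UL * N * N, C = 16UL * N * N;   /* base addresses */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++) {
                    /* one access per operand read; the store to C(i,j) would hit the same line */
                    access_addr(C + 8UL * (i * N + j));
                    access_addr(A + 8UL * (i * N + k));
                    access_addr(B + 8UL * (k * N + j));
                }
        printf("simulated miss ratio = %.4f\n", (double)misses / accesses);
        return 0;
    }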

Quantifying performance differences
DO I = 1, N    // assume arrays stored in row-major order
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
Octane –L2 cache hit: 10 cycles, cache miss 70 cycles Time to execute IKJ version: 2N³ + 70·0.13·4N³ + 10·0.87·4N³ = 73.2N³ Time to execute JKI version: 2N³ + 70·0.5·4N³ + 10·0.5·4N³ = 162N³ Speed-up = 2.2 Key transformation: loop permutation
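A hedged C sketch of the same kind of experiment (not from the slides): time two permutations of the identical triple loop and compare. The matrix size, the timing method, and the measured ratio are all illustrative and will differ by machine.

    /* Hedged sketch: compare two permutations of the same MMM loop nest. */
    #include <stdio.h>
    #include <time.h>

    #define N 1024

    static double A[N][N], B[N][N], C[N][N];

    /* ijk: C(i,j) stays in a register; A is walked with unit stride */
    static void mmm_ijk(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    /* jki: the innermost loop strides through A and C a full row apart */
    static void mmm_jki(void) {
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                for (int i = 0; i < N; i++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    static double time_it(void (*f)(void)) {
        clock_t start = clock();
        f();
        return (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    int main(void) {
        double t1 = time_it(mmm_ijk);
        double t2 = time_it(mmm_jki);
        printf("ijk: %.2f s, jki: %.2f s, speed-up: %.2f\n", t1, t2, t2 / t1);
        return 0;
    }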

Even better… Break MMM into a bunch of smaller MMMs so that the large cache model is true for each small MMM ⇒ large cache model is valid for entire computation ⇒ miss ratio will be 0.75/bt for the entire computation, where t is the tile size

Loop tiling Break big MMM into a sequence of smaller MMMs where each smaller MMM multiplies sub-matrices of size t×t. Parameter t (tile size) must be chosen carefully –as large as possible –working set of small matrix multiplication must fit in cache
DO It = 1, N, t
  DO Jt = 1, N, t
    DO Kt = 1, N, t
      DO I = It, It+t-1
        DO J = Jt, Jt+t-1
          DO K = Kt, Kt+t-1
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
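As a hedged illustration of the "working set must fit in cache" constraint above: the three t×t double sub-matrices occupy 3t²·8 bytes, so the largest admissible t is roughly sqrt(cache size / 24). The function name and the 1MB example are illustrative.

    #include <math.h>
    #include <stdio.h>

    /* largest t such that three t x t double sub-matrices fit in the cache */
    int max_tile(long cache_bytes) {
        return (int)floor(sqrt(cache_bytes / (3.0 * sizeof(double))));
    }

    int main(void) {
        /* e.g. a 1MB L2 gives t <= 209, consistent with the t = 200 used
           for the Octane on the next slide */
        printf("%d\n", max_tile(1L << 20));
        return 0;
    }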

Speed-up from tiling Miss ratio for block computation = miss ratio for large cache model = 0.75/bt ≈ 0.001 (b = 4, t = 200) for Octane Time to execute tiled version = 2N³ + 70·0.001·4N³ + 10·0.999·4N³ ≈ 42.3N³ Speed-up over JKI version ≈ 4

Observations Locality-optimized code is more complex than the high-level algorithm. Loop orders and tile size must be chosen carefully –cache size is the key parameter –associativity matters Actual code is even more complex: must optimize for processor resources –registers: register tiling –pipeline: loop unrolling –Optimized MMM code can be ~1000 lines of C code
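A hedged sketch of what register tiling looks like (not from the slides): a 2×2 block of C is kept in scalar variables, i.e. registers, across the k loop, and the loop body is effectively unrolled 2×2. It assumes row-major arrays and N even, and is far from what a full ~1000-line optimized MMM would contain.

    /* Hedged sketch: 2x2 register tiling of the innermost MMM loops. */
    void mmm_regtile(int N, const double *A, const double *B, double *C) {
        for (int i = 0; i < N; i += 2)
            for (int j = 0; j < N; j += 2) {
                /* keep a 2x2 block of C in scalars (registers) across the k loop */
                double c00 = C[i*N + j],       c01 = C[i*N + j + 1];
                double c10 = C[(i+1)*N + j],   c11 = C[(i+1)*N + j + 1];
                for (int k = 0; k < N; k++) {
                    double a0 = A[i*N + k], a1 = A[(i+1)*N + k];
                    double b0 = B[k*N + j], b1 = B[k*N + j + 1];
                    c00 += a0 * b0;  c01 += a0 * b1;
                    c10 += a1 * b0;  c11 += a1 * b1;
                }
                C[i*N + j]         = c00;  C[i*N + j + 1]     = c01;
                C[(i+1)*N + j]     = c10;  C[(i+1)*N + j + 1] = c11;
            }
    }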

One solution to both problems: restructuring compilers (1985-) Programmer writes high-level, architecture-independent code Restructuring compiler: optimizes program for –Number of cores –Number of registers –Cache organization –Instruction set: mul-add? vector extensions? …

Two key issues 1. Program restructuring: given program P, determine a set of equivalent programs P1, P2, P3, … 2. Program selection: determine which program performs best on the target architecture

Automatic parallelization Pessimistic parallelization: –Compiler determines partial order on program operations by determining dependences –At run-time, execute operations in parallel, respecting dependences –Works reasonably well for array programs but not for irregular data structures like trees and graphs Optimistic parallelization: –Execute operations speculatively in parallel, assuming that dependences do not exist –Check at runtime if dependences are violated –If so, roll-back execution to “safe” point and re-execute sequentially –Works only if optimism is warranted –Lots of interest in “transactional memory” which is one model of optimistic parallelization
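A hedged sketch of the pessimistic style for the simplest case (not from the slides): the loop below has no cross-iteration dependences, since each iteration writes only its own a[i], so a compiler or programmer can prove independence ahead of time and run the iterations concurrently. OpenMP's parallel-for is used here only as convenient notation; by contrast, a loop such as a[i] = a[i-1] + b[i] carries a dependence and could not be parallelized this way.

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

        /* iterations are provably independent: each writes only a[i] */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        printf("%f\n", a[N - 1]);
        return 0;
    }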

Automatic locality enhancement Some methodology exists for array programs but little is known for irregular programs Many compilers can perform tiling and permutation automatically (gcc) Choosing parameter values: tile sizes etc. –Compiler can use architectural models –Self-optimizing systems: system determines best values using some kind of heuristic search (ATLAS,FFTW)
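A hedged sketch of the empirical-search idea behind systems like ATLAS (not their actual implementation): run the same tiled kernel with several candidate tile sizes, time each, and keep the fastest. The candidate list, matrix size, and timing method are illustrative.

    /* Hedged sketch: empirical search over tile sizes for a tiled MMM. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* tiled MMM kernel, same idea as the loop-tiling slide (handles N % t != 0) */
    static void mmm_tiled(int N, int t, const double *A, const double *B, double *C) {
        for (int It = 0; It < N; It += t)
            for (int Jt = 0; Jt < N; Jt += t)
                for (int Kt = 0; Kt < N; Kt += t)
                    for (int i = It; i < MIN(It + t, N); i++)
                        for (int j = Jt; j < MIN(Jt + t, N); j++)
                            for (int k = Kt; k < MIN(Kt + t, N); k++)
                                C[i*N + j] += A[i*N + k] * B[k*N + j];
    }

    int main(void) {
        int N = 512;
        double *A = calloc((size_t)N * N, sizeof *A);
        double *B = calloc((size_t)N * N, sizeof *B);
        double *C = calloc((size_t)N * N, sizeof *C);
        int candidates[] = {16, 32, 64, 128, 256};
        int best = candidates[0];
        double best_secs = 1e30;
        for (int i = 0; i < 5; i++) {
            clock_t start = clock();
            mmm_tiled(N, candidates[i], A, B, C);
            double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
            printf("t = %3d: %.3f s\n", candidates[i], secs);
            if (secs < best_secs) { best_secs = secs; best = candidates[i]; }
        }
        printf("best tile size found: %d\n", best);
        free(A); free(B); free(C);
        return 0;
    }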

Course outline Applications requirements –Scientific and engineering applications –Commercial work-loads Shared-memory programming –Memory consistency models –OpenMP Optimistic and pessimistic parallelization –Dependence analysis techniques for array and irregular programs –Transactional memory models and implementations Automatic locality enhancement Self-optimizing systems

Course work Small number of programming assignments Paper presentations and class participation –We will have papers online by next Monday –Sign up for presentation by next Thursday Substantial course project: –independent reading –implementation work –presentation