CSE 260 – Introduction to Parallel Computation
Topic 7: A few words about performance programming
October 23, 2001
Announcements

Office hours tomorrow: 1:30 – 2:50.

Highly recommended talk tomorrow (Wednesday) at 3:00 in SDSC's auditorium: "Benchmarking and Performance Characterization in High Performance Computing", John McCalpin, IBM. For more info, see …

Disclaimer: I've only read the abstract, but it sounds relevant to this class and interesting.
Approach to Tuning Code

Engineer's method:
    DO UNTIL (exhausted)
        tweak something
        IF (better) THEN accept change

Scientific method:
    DO UNTIL (enlightened)
        make hypothesis
        experiment
        revise hypothesis
The IBM Power3's power … and limits (the processor in Blue Horizon)

Power:
– Eight pipelined functional units: 2 floating point, 2 load/store, 2 single-cycle integer, 1 multi-cycle integer, 1 branch
– Powerful operations: fused multiply-add (FMA), load (or store) with update, branch on count
– Can launch 4 operations per cycle

Limits:
– Can't launch 2 stores per cycle
– FMA pipe is 3–4 cycles long
– Memory hierarchy speed
Can its power be harnessed?

for (j=0; j<n; j+=4) {
    p00 += a[j+0]*a[j+2];   m00 -= a[j+0]*a[j+2];
    p01 += a[j+1]*a[j+3];   m01 -= a[j+1]*a[j+3];
    p10 += a[j+0]*a[j+3];   m10 -= a[j+0]*a[j+3];
    p11 += a[j+1]*a[j+2];   m11 -= a[j+1]*a[j+2];
}

8 FMAs, 4 loads per iteration. Runs at 4.6 cycles/iteration, i.e. 1544 MFLOP/s on a 444 MHz processor (16 flops x 444 MHz / 4.6 cycles).

Compiler-generated inner loop:

CL.6:   FMA   fp31=fp31,fp2,fp0,fcr
        LFL   fp1=(*)double(gr3,16)
        FNMS  fp30=fp30,fp2,fp0,fcr
        LFDU  fp3,gr3=(*)double(gr3,32)
        FMA   fp24=fp24,fp0,fp1,fcr
        FNMS  fp25=fp25,fp0,fp1,fcr
        LFL   fp0=(*)double(gr3,24)
        FMA   fp27=fp27,fp2,fp3,fcr
        FNMS  fp26=fp26,fp2,fp3,fcr
        LFL   fp2=(*)double(gr3,8)
        FMA   fp29=fp29,fp1,fp3,fcr
        FNMS  fp28=fp28,fp1,fp3,fcr
        BCT   ctr=CL.6,
Can its power be harnessed? (part II)

8 FMAs, 4 loads: 1544 MFLOP/s (1.15 cycles/load) – previous slide.

8 FMAs, 8 loads:

for (j=0; j<n; j+=8) {
    p00 += a[j+0]*a[j+2];   m00 -= a[j+0]*a[j+2];
    p01 += a[j+1]*a[j+3];   m01 -= a[j+1]*a[j+3];
    p10 += a[j+4]*a[j+6];   m10 -= a[j+4]*a[j+6];
    p11 += a[j+5]*a[j+7];   m11 -= a[j+5]*a[j+7];
}

– … MFLOP/s (0.6 cycles/load; at 444 MHz, 0.6 cycles/load over 8 loads works out to roughly 1480 MFLOP/s)
– "Interactive" node: 740 MFLOP/s (1.2 cycles/load)
– Interactive nodes have a 1 cycle/MemOp barrier! (the AGEN unit is …)
A more realistic computation: Dot Product (DDOT): Z += X[i] * Y[i]

[Dataflow diagram: loads of X[i] and Y[i] feed multiplies, whose results feed a chain of adds accumulating into Z, which is finally stored.]

2N float ops, 2N+2 load/stores, 4-way concurrency.
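In code, the dataflow above is just the loop below – a minimal sketch (the function name and signature are illustrative, not from the slides). Each iteration issues two loads for one FMA, so the load/store units, not the floating-point units, set the pace.

/* Dot product (DDOT): z += x[i] * y[i].
 * Per iteration: 2 loads, 1 fused multiply-add (2 flops).
 * 2N flops against 2N loads ==> limited by memory operations. */
double ddot(int n, const double *x, const double *y)
{
    double z = 0.0;
    for (int i = 0; i < n; i++)
        z += x[i] * y[i];
    return z;
}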
Optimized matrix x vector product: y = A x

[Dataflow diagram: three rows y[i], y[i+1], y[i+2] are computed together; each of x[j] … x[j+3] is loaded once and feeds a multiply-add into each of the three running sums.]

"Steady state": 6 float ops, 4 load/stores, 10-way concurrency.
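A sketch of the unrolling the diagram depicts (hypothetical names; row-major C storage with leading dimension lda): three rows of A are processed together, so each x[j] is loaded once and reused three times. In the inner loop's steady state, each j step does 4 loads (three elements of A plus x[j]) and 3 FMAs (6 float ops), matching the counts above.

/* y = A*x with 3 rows unrolled: each x[j] is loaded once, used 3 times.
 * Inner-loop steady state: 4 loads, 3 FMAs (6 flops). */
void matvec3(int n, const double *a, int lda,
             const double *x, double *y)
{
    int i;
    for (i = 0; i + 2 < n; i += 3) {
        double y0 = 0.0, y1 = 0.0, y2 = 0.0;
        for (int j = 0; j < n; j++) {
            double xj = x[j];              /* 1 load, shared by 3 rows */
            y0 += a[(i+0)*lda + j] * xj;   /* 3 loads of A, 3 FMAs */
            y1 += a[(i+1)*lda + j] * xj;
            y2 += a[(i+2)*lda + j] * xj;
        }
        y[i] = y0;  y[i+1] = y1;  y[i+2] = y2;
    }
    for ( ; i < n; i++) {                  /* cleanup rows */
        double yi = 0.0;
        for (int j = 0; j < n; j++)
            yi += a[i*lda + j] * x[j];
        y[i] = yi;
    }
}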
FFT "butterfly"

[Dataflow diagram: two complex inputs and a twiddle factor are loaded; a complex multiply and a complex add/subtract pair produce two complex outputs, which are stored.]

10 float ops, 10 load/stores, 4-way concurrency.

Note: on a typical processor, this leaves half the ALU power unused.
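A minimal sketch of one radix-2 butterfly, with complex data kept as separate real/imaginary arrays (the names and layout are my assumption, not from the slides). Counting operations: the complex multiply is 6 flops and the complex add and subtract are 4 more, i.e. 10 float ops, against 6 loads and 4 stores, so the FLOP:MemOp ratio is only about 1.

/* One radix-2 FFT butterfly; re/im parts in separate arrays.
 * 10 flops vs. 10 memory operations: memory-bound.
 * (wr, wi) is the twiddle factor. */
void butterfly(double *re, double *im, int i, int k,
               double wr, double wi)
{
    double tr = wr * re[k] - wi * im[k];  /* complex multiply: 6 flops */
    double ti = wr * im[k] + wi * re[k];
    double ar = re[i], ai = im[i];
    re[i] = ar + tr;  im[i] = ai + ti;    /* complex add:      2 flops */
    re[k] = ar - tr;  im[k] = ai - ti;    /* complex subtract: 2 flops */
}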
FLOP to MemOp ratio

Most programs have at most one FMA (fused floating-point multiply-add) per MemOp:
– DAXPY (Z[i] = a*X[i] + Y[i]): 2 loads, 1 store, 1 FMA
– DDOT (Z += X[i]*Y[i]): 2 loads, 1 FMA
– Matrix-vector product: (k+1) loads, k FMAs
– FFT butterfly: 8 MemOps, 10 floats (5 or 6 FMAs)

A few have more (but they are in libraries), as the sketch below shows:
– Matrix multiply (well-tuned): 2 FMAs per load
– Radix-8 FFT

Your program is probably limited by loads and stores!
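How does a well-tuned matrix multiply reach 2 FMAs per load? A sketch of a 4x4 register-blocked inner kernel (my reconstruction of the standard technique, not actual library code): each k step loads 4 elements of A and 4 of B (8 loads) and performs 16 FMAs, giving 2 FMAs per load.

/* 4x4 register-blocked kernel for C += A*B (row-major, n x n).
 * Per k step: 8 loads feed 16 FMAs ==> 2 FMAs per load. */
void mm_kernel_4x4(int n, const double *a, const double *b,
                   double *c, int i, int j)
{
    double acc[4][4] = {{0.0}};            /* 16 accumulators, kept in registers */
    for (int k = 0; k < n; k++) {
        double a0 = a[(i+0)*n+k], a1 = a[(i+1)*n+k],   /* 4 loads of A */
               a2 = a[(i+2)*n+k], a3 = a[(i+3)*n+k];
        double b0 = b[k*n+j+0], b1 = b[k*n+j+1],       /* 4 loads of B */
               b2 = b[k*n+j+2], b3 = b[k*n+j+3];
        acc[0][0] += a0*b0; acc[0][1] += a0*b1; acc[0][2] += a0*b2; acc[0][3] += a0*b3;
        acc[1][0] += a1*b0; acc[1][1] += a1*b1; acc[1][2] += a1*b2; acc[1][3] += a1*b3;
        acc[2][0] += a2*b0; acc[2][1] += a2*b1; acc[2][2] += a2*b2; acc[2][3] += a2*b3;
        acc[3][0] += a3*b0; acc[3][1] += a3*b1; acc[3][2] += a3*b2; acc[3][3] += a3*b3;
    }
    for (int r = 0; r < 4; r++)
        for (int s = 0; s < 4; s++)
            c[(i+r)*n + (j+s)] += acc[r][s];
}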
Need independent instructions

Remember Little's Law! 4 instructions/cycle x a 3-cycle pipe means you need 12-way independence. Many recent and future processors need even more.

Out-of-order execution helps, but it is limited by the instruction window size and by branch prediction.

Compiler unrolling of the inner loop also helps: the compiler has the inner loop execute, say, 4 points, then interleaves the operations. This requires lots of registers.
How unrolling gets more concurrency

[Diagram: the operations of several unrolled iterations interleaved, giving 12 independent operations per cycle.]
Improving the MemOp to FLOP Ratio

for (i=1; i<N-1; i++)
    for (j=1; j<N-1; j++)
        b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j]
                        + a[i][j-1] + a[i][j+1]);

Per point: 3 loads, 1 store, 4 float ops.

Unrolling three rows at a time:

for (i=1; i+2<N-1; i+=3) {
    for (j=1; j<N-1; j++) {
        b[i+0][j] = 0.25 * (a[i-1][j] + a[i+1][j] + a[i+0][j-1] + a[i+0][j+1]);
        b[i+1][j] = 0.25 * (a[i+0][j] + a[i+2][j] + a[i+1][j-1] + a[i+1][j+1]);
        b[i+2][j] = 0.25 * (a[i+1][j] + a[i+3][j] + a[i+2][j-1] + a[i+2][j+1]);
    }
}
for ( ; i < N-1; i++) { ... }   /* do last rows */

Per j step: 5 loads, 3 stores, 12 float ops.
Overcoming pipeline latency

for (i=0; i<size; i++) {
    sum = a[i] + sum;
}

3.86 cycles/addition: the next add can't start until the previous one has finished (3 to 4 cycles later).

for (i=0; i<size; i+=8) {
    s0 += a[i+0];  s4 += a[i+4];
    s1 += a[i+1];  s5 += a[i+5];
    s2 += a[i+2];  s6 += a[i+6];
    s3 += a[i+3];  s7 += a[i+7];
}
sum = s0+s1+s2+s3+s4+s5+s6+s7;

0.5 cycles/addition. May change the answer due to different rounding.
The SP Memory Hierarchy

L1 data cache:
– 64 KBytes = 512 lines, each 128 Bytes long
– 128-way set associative (!!)
– Prefetching for up to 4 streams
– 6 or 7 cycle penalty for an L1 cache miss

Data TLB:
– 1 MB = 256 pages, each 4 KBytes
– 2-way set associative
– 25 to 100s of cycles penalty for a TLB miss

L2 cache:
– 4 MBytes = 32,768 lines, each 128 Bytes long
– 1-way (4-way on Nighthawk 2) associative, randomized (!!)
– Can only hold data from 1024 different pages
– 35 cycle penalty for an L2 miss
So what??

Sequential accesses (or small strides) are good. Random access within a limited range is OK:
– Within 64 KBytes: in L1; 1 cycle per MemOp
– Within 1 MByte: up to a 7-cycle L1 penalty per 16 words (prefetching hides some of the cache-miss penalty)
– Within 4 MBytes: may get a 25-cycle TLB penalty
– Larger ranges: huge (80 – 200 cycle) penalties
Stride-one memory access

[Plot: time to sum a list of floats, on the second pass through the data, vs. data size. Transitions are visible at the L1 cache, L2 cache, and TLB boundaries; uncached data costs 4.6 cycles per load.]
Strided Memory Access

[Plot: a program adds byte integers located at a given stride (performed on an "interactive node"). Performance drops as L1 misses start (more than 1 element per cache line), drops again as TLB misses start (more than 1 element per page), and bottoms out at 1 element per page, which is as bad as it gets.]
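A sketch of the kind of microbenchmark behind this plot (my reconstruction, not the original code): walk a byte array at a fixed stride and sum what you touch. With stride 1, every cache line and page is fully used; as the stride grows past one element per line, and eventually toward one element per page, the L1 and TLB miss penalties from the previous slides dominate.

/* Strided-access microbenchmark sketch.
 * As stride grows, first L1 misses (stride > one element per 128-Byte
 * line), then TLB misses (stride near one element per 4-KByte page)
 * dominate the cost of each access. Time this for a range of strides. */
#include <stddef.h>

long strided_sum(const char *a, size_t len, size_t stride)
{
    long sum = 0;
    for (size_t i = 0; i < len; i += stride)
        sum += a[i];        /* one load per stride step */
    return sum;
}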
Sun E10000's Sparc II processor

– … MHz on ultra, 336 MHz on gaos
– 9-stage instruction pipe (3-stage delay due to forwarding)
– Max of 4 instructions initiated per cycle
– At most two floats/cycle (no FMA instruction)
– Max of one load or store per cycle
– 16 KB data cache (+ 16 KB I-cache), 4 MB L2 cache
– 64-Byte line size in L2
– Max bandwidth to L2 cache: 16 Bytes/cycle (4 cycles/line)
– Max bandwidth to memory: 4 Bytes/cycle (16 cycles/line), shared among multiple processors
– Can do prefetching
Reading Assembly Code

Example code:

do i = 3, n
    V(i) = M(i,j)*V(i) + V(i-2)
    IF (V(i) .GT. big) V(i) = big
enddo

Compiled via "f77 -fast -S" on Sun's Fortran compiler (this is 7 years old).

Sun idiosyncrasies:
– The target register is given last in 3-address code
– A delayed branch jumps after the following statement
– Fixed-point registers have names like %o3, %l2, or %g1
SPARC Assembly Code

.L900120:
    ldd    [%l2+%o1],%f4   ! Load V(i) into register %f4
    fmuld  %f2,%f4,%f2     ! Float multiply double
    ldd    [%o5+%o1],%f6   ! Load V(i-2) into %f6
    faddd  %f2,%f6,%f2     ! Float add double
    std    %f2,[%l2+%o1]   ! Store %f2 into V(i)
    ldd    [%l2+%o1],%f4   ! Reload it (!?!)
    fcmped %f4,%f8         ! Compare to "big"
    nop
    fbg,a  .L77015         ! Float branch on greater
    std    %f8,[%l2+%o1]   ! Conditionally store "big" (in the delay slot)
.L77015:
    add    %g1,1,%g1       ! Increment index variable
    cmp    %g1,%o7         ! Are we done?
    add    %o0,8,%o0       ! Increment offset into M
    add    %o1,8,%o1       ! Increment offset into V
    ble,a  .L900120        ! "Delayed" branch back to the loop top
    ldd    [%l1+%o0],%f2   ! Load M(i,j) into %f2 (in the delay slot)