L10: Floating Point Issues and Project CS6963

Administrative Issues
Project proposals
– Due 5PM, Friday, March 13 (hard deadline)
Homework (Lab 2)
– Due 5PM, Wednesday, March 4
– Where are we?

Outline
Floating point
– Mostly single precision
– Accuracy
– What’s fast and what’s not
– Reading: Programmer’s Guide, Appendix B
Project
– Ideas on how to approach MPM/GIMP
– Construct list of questions

Single Precision vs. Double Precision
Platforms of compute capability 1.2 and below only support single-precision floating point
New systems (GTX 200 series, Tesla) include double precision, but it is much slower than single precision
– A single double-precision arithmetic unit is shared by all SPs in an SM
– Similarly, a single double-precision fused multiply-add unit
Suggested strategy:
– Maximize single precision; use double precision only where needed
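Not from the slides, but a minimal sketch of how a project could confirm at runtime whether the target card has hardware double precision, using the CUDA runtime API (querying device 0 is an assumption):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        // Compute capability 1.3 and above provide hardware double precision.
        bool hasDouble = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);
        printf("Device %s: compute capability %d.%d, double precision %s\n",
               prop.name, prop.major, prop.minor,
               hasDouble ? "supported (but slower than single)" : "not supported");
        return 0;
    }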

Summary: Accuracy vs. Performance
A few operators are IEEE 754-compliant
– Addition and multiplication
… but some give up precision, presumably in favor of speed or hardware simplicity
– Particularly, division
Many built-in intrinsics perform common complex operations very fast
Some functions have multiple implementations, to trade off speed and accuracy
– e.g., the intrinsic __sinf() (fast but imprecise) versus the standard sinf() (much slower, more accurate)
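A minimal kernel sketch contrasting the two choices; the kernel itself is illustrative, but __sinf() and sinf() are the standard CUDA intrinsic and math-library names:

    __global__ void sine_compare(const float* x, float* fast, float* accurate, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            fast[i]     = __sinf(x[i]);  // special-function hardware: fast, lower accuracy
            accurate[i] = sinf(x[i]);    // software sequence: slower, accurate to a few ulp
        }
    }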

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign
Deviations from IEEE-754
Addition and multiplication are IEEE 754 compliant
– Maximum 0.5 ulp (units in the last place) error
– However, they are often combined into a multiply-add (FMAD), whose intermediate result is truncated
Division is non-compliant (2 ulp)
Not all rounding modes are supported
Denormalized numbers are not supported
No mechanism to detect floating-point exceptions
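Where the truncated FMAD intermediate matters, the CUDA intrinsics __fmul_rn() and __fadd_rn() produce individually rounded operations that the compiler will not contract into an FMAD. A hedged sketch (the dot-product kernel is illustrative, not course code):

    __global__ void dot3(const float3* a, const float3* b, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Written naively, the compiler may fuse these into FMADs with a
            // truncated intermediate product.
            float naive = a[i].x * b[i].x + a[i].y * b[i].y + a[i].z * b[i].z;

            // __fmul_rn/__fadd_rn are never contracted, so each step is
            // individually rounded to nearest (IEEE 754 single precision).
            float strict = __fadd_rn(__fadd_rn(__fmul_rn(a[i].x, b[i].x),
                                               __fmul_rn(a[i].y, b[i].y)),
                                     __fmul_rn(a[i].z, b[i].z));
            out[i] = naive - strict;  // typically tiny, but not always zero
        }
    }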

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign
Arithmetic Instruction Throughput
int and float add, shift, min, max, and float mul, mad: 4 cycles per warp
– int multiply (*) is 32-bit by default and requires multiple cycles per warp
– Use the __mul24() / __umul24() intrinsics for a 4-cycle 24-bit int multiply
Integer divide and modulo are expensive
– The compiler will convert literal power-of-2 divides to shifts
– Be explicit in cases where the compiler can’t tell that the divisor is a power of 2!
– Useful trick: foo % n == foo & (n-1) if n is a power of 2 (see the sketch below)
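A hedged sketch of both points; the indexing helper and its constants are hypothetical, but __umul24() is the documented 24-bit multiply intrinsic:

    __global__ void index_tricks(int* out, int stride, int n) {
        // 24-bit multiply: enough range for typical thread indexing and
        // cheaper than a full 32-bit multiply on this hardware generation.
        int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
        if (i < n) {
            const int TILE = 32;                 // power of 2
            int within_tile = i & (TILE - 1);    // same as i % TILE, avoids modulo
            int tile        = i >> 5;            // same as i / TILE, avoids divide
            out[i] = __umul24(tile, stride) + within_tile;
        }
    }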

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign
Arithmetic Instruction Throughput (cont.)
Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp
– These are the versions prefixed with “__”
– Examples: __sinf(), __expf(), __logf()
Other functions are combinations of the above
– y / x == rcp(x) * y == 20 cycles per warp
– sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign
Runtime Math Library
There are two types of runtime math operations
– __func(): direct mapping to hardware ISA
  Fast but low accuracy (see the Programmer’s Guide for details)
  Examples: __sinf(x), __expf(x), __powf(x,y)
– func(): compiles to multiple instructions
  Slower but higher accuracy (5 ulp, units in the last place, or less)
  Examples: sin(x), exp(x), pow(x,y)
The -use_fast_math compiler option forces every func() to compile to __func()
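A minimal sketch of the tradeoff; the kernel is illustrative, while expf() and __expf() are the standard library and intrinsic names:

    __global__ void gauss(const float* x, float* slow, float* fast, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Accurate library version: several instructions, a few ulp of error.
            slow[i] = expf(-x[i] * x[i]);
            // Intrinsic version: maps to the special-function unit, larger error.
            fast[i] = __expf(-x[i] * x[i]);
            // Compiling with nvcc -use_fast_math would make both behave like __expf().
        }
    }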

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign
Make Your Program Float-Safe!
Future hardware will have double-precision support
– G80 is single-precision only
– Double precision will have an additional performance cost
– Careless use of double or undeclared types may run more slowly on G80+
Important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed
– Add the ‘f’ specifier on float literals:
  foo = bar * 0.123;  // double assumed
  foo = bar * 0.123f; // float explicit
– Use the float version of standard library functions:
  foo = sin(bar);  // double assumed
  foo = sinf(bar); // single precision explicit

Reminder: Content of Proposal, MPM/GIMP as Example
I. Team members: Name and a sentence on expertise for each member
   (MPM/GIMP: obvious)
II. Problem description
– What is the computation and why is it important?
– Abstraction of computation: equations, graphic, or pseudo-code; no more than 1 page
   (MPM/GIMP: straightforward adaptation from the MPM presentation and/or code)
III. Suitability for GPU acceleration
– Amdahl’s Law: describe the inherent parallelism. Argue that it is close to 100% of the computation. Use measurements from CPU execution of the computation if possible.
   (MPM/GIMP: can measure the sequential code)

Reminder: Content of Proposal, MPM/GIMP as Example (cont.)
III. Suitability for GPU acceleration, cont.
– Synchronization and Communication: Discuss what data structures may need to be protected by synchronization, or communicated through the host.
   (MPM/GIMP: some challenges; see the remainder of this lecture)
– Copy Overhead: Discuss the data footprint and the anticipated cost of copying to/from host memory.
   (MPM/GIMP: measure grid and patches to discover the data footprint; consider ways to combine computations to reduce copying overhead)
IV. Intellectual Challenges
– Generally, what makes this computation worthy of a project?
   (MPM/GIMP: importance of the computation, and challenges in partitioning the computation, dealing with scope, and managing copying overhead)
– Point to any difficulties you anticipate at present in achieving high speedup
   (MPM/GIMP: see previous)

Projects – How to Approach
Example: MPM/GIMP
Some questions:
1. Amdahl’s Law: target the bulk of the computation; can profile to obtain key computations…
2. Strategy for gradually adding GPU execution to CPU code while maintaining correctness
3. How to partition data & computation to avoid synchronization?
4. What types of floating point operations and accuracy requirements?
5. How to manage copy overhead?

1. Amdahl’s Law
Is it a significant fraction of the overall computation?
– Simple test: Time the execution of the computation to be moved to the GPU within the sequential program. What percentage of the program’s total execution time is it?
Where is the sequential code spending most of its time?
– Use profiling (gprof, pixie, VTune, …)
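A hedged, worked illustration of how that measurement bounds the achievable speedup; the 90% figure is hypothetical, not measured from MPM/GIMP:

    #include <cstdio>

    // Amdahl's Law: if a fraction p of the runtime is parallelizable and that
    // part is accelerated by a factor s, overall speedup = 1 / ((1 - p) + p / s).
    double amdahl(double p, double s) { return 1.0 / ((1.0 - p) + p / s); }

    int main() {
        double p = 0.90;  // hypothetical: 90% of time spent in GPU-targeted kernels
        printf("10x kernel speedup   -> %.2fx overall\n", amdahl(p, 10.0));
        printf("100x kernel speedup  -> %.2fx overall\n", amdahl(p, 100.0));
        printf("upper bound (s->inf) -> %.2fx overall\n", 1.0 / (1.0 - p));
        return 0;
    }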

2. Strategy for Gradual GPU…
Looking at MPM/GIMP:
– Several core functions are used repeatedly (integrate, interpolate, gradient, divergence)
– Can we parallelize these individually as a first step? (see the sketch below)
– Consider computations and data structures
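One hedged way to keep such a gradual port honest: run the GPU version of each function alongside the existing CPU version and compare results. The harness and tolerance below are hypothetical, not part of MPM/GIMP:

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Hypothetical validation harness: cpuResult comes from the existing
    // CPU operations::integrate(); gpuResult from the newly ported CUDA version.
    bool resultsMatch(const std::vector<double>& cpuResult,
                      const std::vector<double>& gpuResult,
                      double relTol = 1e-5) {
        if (cpuResult.size() != gpuResult.size()) return false;
        for (size_t i = 0; i < cpuResult.size(); ++i) {
            double denom = std::max(std::fabs(cpuResult[i]), 1.0);
            if (std::fabs(cpuResult[i] - gpuResult[i]) / denom > relTol) {
                printf("mismatch at %zu: cpu=%g gpu=%g\n", i, cpuResult[i], gpuResult[i]);
                return false;
            }
        }
        return true;
    }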

2. cont.

    void operations::integrate(const patch& pch, const vector& pu, vector& gu)
    {
        // Zero the grid values.
        for (unsigned g = 0; g < gu.size(); g++) gu[g] = 0.;
        // Scatter each particle's contribution into the grid nodes it touches.
        for (unsigned p = 0; p < pu.size(); p++) {
            const partContribs& pcon = pch.pCon[p];
            for (int k = 0; k < pcon.Npor; k += 1) {
                const partContribs::portion& por = pcon[k];
                gu[por.idx] += pu[p] * por.weight;
            }
        }
    }

gu (representing the grid) is updated, but only by nearby particles. Most data structures are read-only!

3. Synchronization
Recall from the MPM presentation: blue dots correspond to particles (pu); the grid structure corresponds to nodes (gu).
How to parallelize without incurring synchronization overhead?
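One option is a scatter-style port of integrate in which each thread owns a particle and protects the shared grid entries with atomicAdd(). The kernel and the flattened data layout below are assumptions for illustration, not MPM/GIMP code:

    // Hypothetical flattened layout: portionIdx/portionWeight hold each particle's
    // grid-node contributions, with portionStart[p]..portionStart[p+1] delimiting them.
    __global__ void integrate_scatter(const float* pu, float* gu,
                                      const int* portionStart,
                                      const int* portionIdx,
                                      const float* portionWeight,
                                      int numParticles) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p < numParticles) {
            for (int k = portionStart[p]; k < portionStart[p + 1]; ++k) {
                // Several particles may touch the same grid node, so the update
                // must be atomic (float atomicAdd needs compute capability >= 2.0;
                // older hardware would need integer atomics or a gather rewrite).
                atomicAdd(&gu[portionIdx[k]], pu[p] * portionWeight[k]);
            }
        }
    }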

2. and 3. Other common structure in the code

    template
    void operations::interpolate(const patch& pch, vector& pu, const vector& gu)
    {
        // Zero the particle values.
        for (unsigned p = 0; p < pu.size(); p++) pu[p] = 0.;
        // Gather contributions from the grid nodes near each particle.
        for (unsigned p = 0; p < pu.size(); p++) {
            const partContribs& pcon = pch.pCon[p];
            Vector2& puR = pu[p];
            for (int k = 0; k < pcon.Npor; k += 1) {
                const partContribs::portion& por = pcon[k];
                const Vector2& guR = gu[por.idx];
                puR.x += guR.x * por.weight;
                puR.y += guR.y * por.weight;
            }
        }
    }

puR (representing the particles) is updated, but only by nearby grid points.
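Because each particle only reads grid values, interpolate can be ported as a pure gather with one thread per particle and no synchronization at all. A hedged sketch, reusing the same hypothetical flattened layout as the integrate example above:

    __global__ void interpolate_gather(float2* pu, const float2* gu,
                                       const int* portionStart,
                                       const int* portionIdx,
                                       const float* portionWeight,
                                       int numParticles) {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p < numParticles) {
            float2 acc = make_float2(0.f, 0.f);
            for (int k = portionStart[p]; k < portionStart[p + 1]; ++k) {
                float w  = portionWeight[k];
                float2 g = gu[portionIdx[k]];
                // Pure gather: this thread is the only writer of pu[p],
                // so no atomics or barriers are required.
                acc.x += g.x * w;
                acc.y += g.y * w;
            }
            pu[p] = acc;
        }
    }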

4. Floating Point
MPM/GIMP is a double-precision code!
Phil:
– Double precision needed for convergence on fine meshes
– Single precision OK for coarse meshes
Conclusion:
– Converting to single precision (float) is OK for this assignment, but hybrid single/double is more desirable in the future

5. Copy Overhead?
Some example code in MPM/GIMP:

    sh.integrate (pch, pch.pm,  pch.gm);
    sh.integrate (pch, pch.pfe, pch.gfe);
    sh.divergence(pch, pch.pVS, pch.gfi);
    for (int i = 0; i < pch.Nnode(); ++i) pch.gm[i] += machTol;
    for (int i = 0; i < pch.Nnode(); ++i) pch.ga[i] = (pch.gfe[i] + pch.gfi[i]) / pch.gm[i];
    …

Exploit reuse of gm, gfe, gfi. Defer copy back to host.
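A hedged sketch of that idea: keep gm, gfe, gfi resident in device memory across the integrate/divergence kernels and the grid update, and copy back only the result the host actually needs. The kernel and wrapper names below are placeholders, not MPM/GIMP code:

    #include <cuda_runtime.h>

    // Hypothetical kernel: mirrors the two host loops (mass regularization and
    // acceleration = force / mass) so the grid update stays on the device.
    __global__ void update_grid(float* gm, const float* gfe, const float* gfi,
                                float* ga, int nNodes, float machTol) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nNodes) {
            gm[i] += machTol;
            ga[i] = (gfe[i] + gfi[i]) / gm[i];
        }
    }

    // Hypothetical host-side sequence: d_gm, d_gfe, d_gfi are filled by the
    // (not shown) integrate/divergence kernels and reused here without any
    // intermediate copies; only the final result crosses back over PCIe.
    void gridUpdateOnDevice(float* d_gm, const float* d_gfe, const float* d_gfi,
                            float* d_ga, float* h_ga, int nNodes, float machTol) {
        int threads = 256;
        int blocks = (nNodes + threads - 1) / threads;
        update_grid<<<blocks, threads>>>(d_gm, d_gfe, d_gfi, d_ga, nNodes, machTol);
        cudaMemcpy(h_ga, d_ga, nNodes * sizeof(float), cudaMemcpyDeviceToHost);
    }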

Other MPM/GIMP Questions
Lab machine set up? Python? Gnuplot?
Hybrid data structure to deal with updates to the grid in some cases and to particles in other cases

Next Class
Discussion of tools