L9: Next Assignment, Project and Floating Point Issues CS6963.

Slides:

Advertisements

Similar presentations

L9: Floating Point Issues CS6963. Outline Finish control flow and predicated execution discussion Floating point – Mostly single precision until recent.

Advertisements

A Complete GPU Compute Architecture by NVIDIA Tamal Saha, Abhishek Rawat, Minh Le {ts4rq, ar8eb,

Fabián E. Bustamante, Spring 2007 Floating point Today IEEE Floating Point Standard Rounding Floating Point Operations Mathematical properties Next time.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 11: Floating-Point Considerations.

1cs542g-term CS542G - Breadth in Scientific Computing.

University of Washington Today Topics: Floating Point Background: Fractional binary numbers IEEE floating point standard: Definition Example and properties.

L15: Review for Midterm. Administrative Project proposals due today at 5PM (hard deadline) – handin cs6963 prop March 31, MIDTERM in class L15: Review.

L13: Review for Midterm. Administrative Project proposals due Friday at 5PM (hard deadline) No makeup class Friday! March 23, Guest Lecture Austin Robison,

L10: Floating Point Issues and Project CS6963. Administrative Issues Project proposals – Due 5PM, Friday, March 13 (hard deadline) Homework (Lab 2) –

1cs542g-term CS542G - Breadth in Scientific Computing.

1  2004 Morgan Kaufmann Publishers Chapter Three.

GPU Acceleration of the Generalized Interpolation Material Point Method Wei-Fan Chiang, Michael DeLisi, Todd Hummel, Tyler Prete, Kevin Tew, Mary Hall,

L8: Memory Hierarchy Optimization, Bandwidth CS6963.

1 Lecture 11: Digital Design Today’s topics:  Evaluating a system  Intro to boolean functions.

L6: Memory Hierarchy Optimization IV, Bandwidth Optimization CS6963.

L10: Dense Linear Algebra on GPUs CS6963. Administrative Issues Next assignment, triangular solve – Due 5PM, Friday, March 5 – handin cs6963 lab 3 ” Project.

1 Lecture 10: FP, Performance Metrics Today’s topics:  IEEE 754 representations  FP arithmetic  Evaluating a system Reminder: assignment 4 due in a.

Simple Data Type Representation and conversion of numbers

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 1 CSCE 4643 GPU Programming Lecture 8 - Floating Point Considerations.

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

CEN 316 Computer Organization and Design Computer Arithmetic Floating Point Dr. Mansour AL Zuair.

CUDA Performance Considerations Patrick Cozzi University of Pennsylvania CIS Spring 2011.

Introduction to CUDA 1 of 2 Patrick Cozzi University of Pennsylvania CIS Fall 2012.

CS6963 L15: Design Review and CUBLAS Paper Discussion.

L9: Project Discussion and Floating Point Issues CS6235.

07/19/2005 Arithmetic / Logic Unit – ALU Design Presentation F CSE : Introduction to Computer Architecture Slides by Gojko Babić.

CH09 Computer Arithmetic  CPU combines of ALU and Control Unit, this chapter discusses ALU The Arithmetic and Logic Unit (ALU) Number Systems Integer.

HW/SW PARTITIONING OF FLOATING POINT SOFTWARE APPLICATIONS TO FIXED - POINTED COPROCESSOR CIRCUITS - Nalini Kumar Gaurav Chitroda Komal Kasat.

Introduction to CUDA Programming Profiler, Assembly, and Floating-Point Andreas Moshovos Winter 2009 Some material from: Wen-Mei Hwu and David Kirk NVIDIA.

8-1 Embedded Systems Fixed-Point Math and Other Optimizations.

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 1 ECE408 Applied Parallel Programming Lecture 15 - Floating.

CSC 221 Computer Organization and Assembly Language

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 1 ECE408 Applied Parallel Programming Lecture 16 - Floating.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 ECE498AL Lecture 4: CUDA Threads – Part 2.

Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS Spring 2011 * Where n is 2 or 3.

Floating Point Numbers Representation, Operations, and Accuracy CS223 Digital Design.

© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 – July 2, Taiwan 2008 CUDA Course Programming Massively Parallel Processors: the CUDA experience.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Floating-Point Considerations.

Binary Arithmetic.

University of Michigan Electrical Engineering and Computer Science Adaptive Input-aware Compilation for Graphics Engines Mehrzad Samadi 1, Amir Hormati.

© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, VSCSE Summer School 2009 Many-core Processors for Science and Engineering.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Floating-Point Considerations.

CS6963 L13: Application Case Study III: Molecular Visualization and Material Point Method.

1 Floating Point. 2 Topics Fractional Binary Numbers IEEE 754 Standard Rounding Mode FP Operations Floating Point in C Suggested Reading: 2.4.

Lecture 10 CUDA Instructions

© David Kirk/NVIDIA and Wen-mei W

Floating Point Numbers

CS 105 “Tour of the Black Holes of Computing!”

Topics IEEE Floating Point Standard Rounding Floating Point Operations

CS/COE0447 Computer Organization & Assembly Language

CS 367 Floating Point Topics (Ch 2.4) IEEE Floating Point Standard

Lecture 5: GPU Compute Architecture

ECE 498AL Spring 2010 Lecture 11: Floating-Point Considerations

Introduction to CUDA Programming

Arithmetic Logical Unit

CS 105 “Tour of the Black Holes of Computing!”

Introduction to Java, and DrJava part 1

CSCE 4643 GPU Programming Lecture 9 - Floating Point Considerations

CS 105 “Tour of the Black Holes of Computing!”

Floating Point Arithmetic August 31, 2009

CS 105 “Tour of the Black Holes of Computing!”

Introduction to Java, and DrJava

ECE498AL Spring 2010 Lecture 4: CUDA Threads – Part 2

© David Kirk/NVIDIA and Wen-mei W. Hwu,

Morgan Kaufmann Publishers Arithmetic for Computers

CS213 Floating Point Topics IEEE Floating Point Standard Rounding

Introduction to Java, and DrJava part 1

CS 105 “Tour of the Black Holes of Computing!”

CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming

Presentation transcript:

L9: Next Assignment, Project and Floating Point Issues CS6963

Administrative Issues CLASS CANCELLED ON WEDNESDAY! – I’ll be at SIAM Parallel Processing Symposium Next assignment, triangular solve – Due 5PM, Friday, March 5 – handin cs6963 lab 3 ” Project proposals (discussed today) – Due 5PM, Wednesday, March 17 (hard deadline) CS L9: Projects and Floating Point

Outline Triangular solve assignment Project – Ideas on how to approach – Construct list of questions Floating point – Mostly single precision – Accuracy – What’s fast and what’s not – Reading: Ch 6 in Kirk and Hwu, FloatingPoint.pdf FloatingPoint.pdf NVIDA CUDA Programmer’s Guide, Appendix B CS L9: Projects and Floating Point

Triangular Solve (STRSM) for (j = 0; j < n; j++) for (k = 0; k < n; k++) if (B[j*n+k] != 0.0f) { for (i = k+1; i < n; i++) B[j*n+i] -= A[k * n + i] * B[j * n + k]; } Equivalent to: cublasStrsm('l' /* left operator */, 'l' /* lower triangular */, 'N' /* not transposed */, ‘u' /* unit triangular */, N, N, alpha, d_A, N, d_B, N); See: 4 L9: Projects and Floating Point CS6963

Assignment Details: – Integrated with simpleCUBLAS test in SDK – Reference sequential version provided 1. Rewrite in CUDA 2. Compare performance with CUBLAS 2.0 library 5 L9: Projects and Floating Point CS6963

Performance Issues? + Abundant data reuse - Difficult edge cases - Different amounts of work for different values - Complex mapping or load imbalance 6 L9: Projects and Floating Point CS6963

Reminder: Outcomes from Last Year’s Course Paper and poster at Symposium on Application Accelerators for High-Performance Computing (May 4, 2010 submission deadline) – Poster: Assembling Large Mosaics of Electron Microscope Images using GPU - Kannan Venkataraju, Mark Kim, Dan Gerszewski, James R. Anderson, and Mary HallAssembling Large Mosaics of Electron Microscope Images using GPU – Paper: GPU Acceleration of the Generalized Interpolation Material Point Method GPU Acceleration of the Generalized Interpolation Material Point Method Wei-Fan Chiang, Michael DeLisi, Todd Hummel, Tyler Prete, Kevin Tew, Mary Hall, Phil Wallstedt, and James Guilkey Poster at NVIDIA Research Summit Poster #47 - Fu, Zhisong, University of Utah (United States) Solving Eikonal Equations on Triangulated Surface Mesh with CUDA Solving Eikonal Equations on Triangulated Surface Mesh with CUDA Posters at Industrial Advisory Board meeting Integrated into Masters theses and PhD dissertations Jobs and internships 7 L9: Projects and Floating Point CS6963

Projects 2-3 person teams Select project, or I will guide you – From your research – From previous classes – Suggested ideas from faculty, Nvidia (ask me) Example (published): – (see prev slide) Steps 1.Proposal (due Wednesday, March 17) 2.Design Review (in class, April 5 and 7) 3.Poster Presentation (last week of classes) 4.Final Report (due before finals) 8 L9: Projects and Floating Point CS6963

1. Project Proposal (due 3/17) Proposal Logistics: – Significant implementation, worth 55% of grade – Each person turns in the proposal (should be same as other team members) Proposal: – 3-4 page document (11pt, single-spaced) – Submit with handin program: “handin cs6963 prop ” CS L9: Projects and Floating Point

Content of Proposal I.Team members: Name and a sentence on expertise for each member II. Problem description -What is the computation and why is it important? -Abstraction of computation: equations, graphic or pseudo-code, no more than 1 page III. Suitability for GPU acceleration -Amdahl’s Law: describe the inherent parallelism. Argue that it is close to 100% of computation. Use measurements from CPU execution of computation if possible. -Synchronization and Communication: Discuss what data structures may need to be protected by synchronization, or communication through host. -Copy Overhead: Discuss the data footprint and anticipated cost of copying to/from host memory. IV.Intellectual Challenges -Generally, what makes this computation worthy of a project? -Point to any difficulties you anticipate at present in achieving high speedup CS L9: Projects and Floating Point

Content of Proposal, cont. I.Team members: Name and a sentence on expertise for each member Obvious II. Problem description -What is the computation and why is it important? -Abstraction of computation: equations, graphic or pseudo-code, no more than 1 page Straightforward adaptation from sequential algorithm and/or code III. Suitability for GPU acceleration -Amdahl’s Law: describe the inherent parallelism. Argue that it is close to 100% of computation. Use measurements from CPU execution of computation if possible Can measure sequential code CS L9: Projects and Floating Point

Content of Proposal, cont. III.Suitability for GPU acceleration, cont. -Synchronization and Communication: Discuss what data structures may need to be protected by synchronization, or communication through host. Avoid global synchronization -Copy Overhead: Discuss the data footprint and anticipated cost of copying to/from host memory. Measure input and output data size to discover data footprint. Consider ways to combine computations to reduce copying overhead. IV.Intellectual Challenges -Generally, what makes this computation worthy of a project? Importance of computation, and challenges in partitioning computation, dealing with scope, managing copying overhead -Point to any difficulties you anticipate at present in achieving high speedup CS L9: Projects and Floating Point

Projects – How to Approach Some questions: 1.Amdahl’s Law: target bulk of computation and can profile to obtain key computations… 2.Strategy for gradually adding GPU execution to CPU code while maintaining correctness 3.How to partition data & computation to avoid synchronization? 4.What types of floating point operations and accuracy requirements? 5.How to manage copy overhead? CS L9: Projects and Floating Point

1. Amdahl’s Law Significant fraction of overall computation? – Simple test: Time execution of computation to be executed on GPU in sequential program. What is its percentage of program’s total execution time? Where is sequential code spending most of its time? – Use profiling (gprof, pixie, VTUNE, …) CS L9: Projects and Floating Point

2. Strategy for Gradual GPU… Looking at MPM/GIMP from last year – Several core functions used repeatedly (integrate, interpolate, gradient, divergence) – Can we parallelize these individually as a first step? – Consider computations and data structures CS L9: Projects and Floating Point

3. Synchronization in MPM Blue dots corresponding to particles (pu). Grid structure corresponds to nodes (gu). How to parallelize without incurring synchronization overhead? CS L9: Projects and Floating Point

4. Floating Point Most scientific apps are double precision codes! In general – Double precision needed for convergence on fine meshes – Single precision ok for coarse meshes Conclusion: – Converting to single precision (float) ok for this assignment, but hybrid single/double more desirable in the future CS L9: Projects and Floating Point

5. Copy overhead? Some example code in MPM/GIMP sh.integrate (pch,pch.pm,pch.gm); sh.integrate (pch,pch.pfe,pch.gfe); sh.divergence(pch,pch.pVS,pch.gfi); for(int i=0;i<pch.Nnode();++i)pch.gm[i]+=machTol; for(int i=0;i<pch.Nnode();++i)pch.ga[i]=(pch.gfe[i]+pch.gfi[i])/pch.gm[i]; … Exploit reuse of gm, gfe, gfi Defer copy back to host. Exploit reuse of gm, gfe, gfi Defer copy back to host. CS L9: Projects and Floating Point

Other Project Questions Want to use Tesla System? 32 Tesla S1070 boxes – Each with 4 GPUs – 16GB memory – 120 SMs, or 960 cores! Communication across GPUs? – MPI between hosts CS L9: Projects and Floating Point

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign Brief Discussion of Floating Point To understand the fundamentals of floating-point representation (IEEE- 754) GeForce 8800 CUDA Floating-point speed, accuracy and precision – Deviations from IEEE-754 – Accuracy of device runtime functions – -fastmath compiler option – Future performance considerations

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign GPU Floating Point Features G80SSEIBM AltivecCell SPE PrecisionIEEE 754 Rounding modes for FADD and FMUL Round to nearest and round to zero All 4 IEEE, round to nearest, zero, inf, -inf Round to nearest only Round to zero/truncate only Denormal handlingFlush to zero Supported, 1000’s of cycles Flush to zero NaN supportYes No Overflow and Infinity support Yes, only clamps to max norm Yes No, infinity FlagsNoYes Some Square rootSoftware onlyHardwareSoftware only DivisionSoftware onlyHardwareSoftware only Reciprocal estimate accuracy 24 bit12 bit Reciprocal sqrt estimate accuracy 23 bit12 bit log2(x) and 2^x estimates accuracy 23 bitNo12 bitNo

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign What is IEEE floating-point format? A floating point binary number consists of three parts: – sign (S), exponent (E), and mantissa (M). – Each (S, E, M) pattern uniquely identifies a floating point number. For each bit pattern, its IEEE floating-point value is derived as: – value = (-1) S * M * {2 E }, where 1.0 ≤ M < 10.0 B The interpretation of S is simple: S=0 results in a positive number and S=1 a negative number.

Single Precision vs. Double Precision Platforms of compute capability 1.2 and below only support single precision floating point New systems (GTX, 200 series, Tesla) include double precision, but much slower than single precision – A single dp arithmetic unit shared by all SPs in an SM – Similarly, a single fused multiply-add unit Suggested strategy: – Maximize single precision, use double precision only where needed CS L9: Projects and Floating Point

Summary: Accuracy vs. Performance A few operators are IEEE 754-compliant – Addition and Multiplication … but some give up precision, presumably in favor of speed or hardware simplicity – Particularly, division Many built in intrinsics perform common complex operations very fast Some intrinsics have multiple implementations, to trade off speed and accuracy – e.g., intrinsic __sin() (fast but imprecise) versus sin() (much slower) CS L9: Projects and Floating Point

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign Deviations from IEEE-754 Addition and Multiplication are IEEE 754 compliant – Maximum 0.5 ulp (units in the least place) error However, often combined into multiply-add (FMAD) – Intermediate result is truncated Division is non-compliant (2 ulp) Not all rounding modes are supported Denormalized numbers are not supported No mechanism to detect floating-point exceptions 25 L9: Projects and Floating Point

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign Arithmetic Instruction Throughput int and float add, shift, min, max and float mul, mad: 4 cycles per warp – int multiply (*) is by default 32-bit requires multiple cycles / warp – Use __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply Integer divide and modulo are expensive – Compiler will convert literal power-of-2 divides to shifts – Be explicit in cases where compiler can’t tell that divisor is a power of 2! – Useful trick: foo % n == foo & (n-1) if n is a power of 2 26 L9: Projects and Floating Point

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign Arithmetic Instruction Throughput Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp – These are the versions prefixed with “__” – Examples:__rcp(), __sin(), __exp() Other functions are combinations of the above – y / x == rcp(x) * y == 20 cycles per warp – sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp 27 L9: Projects and Floating Point

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign Runtime Math Library There are two types of runtime math operations – __func(): direct mapping to hardware ISA Fast but low accuracy (see prog. guide for details) Examples: __sin(x), __exp(x), __pow(x,y) – func() : compile to multiple instructions Slower but higher accuracy (5 ulp, units in the least place, or less) Examples: sin(x), exp(x), pow(x,y) The -use_fast_math compiler option forces every func() to compile to __func() 28 L9: Projects and Floating Point

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 29 Make your program float-safe! Future hardware will have double precision support – G80 is single-precision only – Double precision will have additional performance cost – Careless use of double or undeclared types may run more slowly on G80+ Important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed – Add ‘f’ specifier on float literals: foo = bar * 0.123; // double assumed foo = bar * 0.123f; // float explicit – Use float version of standard library functions foo = sin(bar); // double assumed foo = sinf(bar); // single precision explicit 29 L9: Projects and Floating Point

Next Class Reminder: class is cancelled on Wednesday, Feb. 24 Next class is Monday, March 1 – Discuss CUBLAS 2 implementation of matrix multiply and sample projects Remainder of the semester: – Focus on applications – Advanced topics (CUDA->OpenGL, overlapping computation/communication, Open CL, Other GPU architectures) CS L9: Projects and Floating Point