L9: Project Discussion and Floating Point Issues CS6235

Outline
– Discussion of semester projects
– Floating point
  – Mostly single precision until recent architectures
  – Accuracy
  – What's fast and what's not
– Reading: Ch. 6/7 in Kirk and Hwu; FloatingPoint.pdf; NVIDIA CUDA Programmer's Guide, Appendix C

Project Proposal (due 3/8)
Team of 2-3 people; please let me know if you need a partner.
Proposal logistics:
– Significant implementation, worth 50% of grade
– Each person turns in the proposal (should be the same as the other team members')
Proposal:
– 3-4 page document (11pt, single-spaced)
– Submit with handin program: "handin CS6235 prop "

Project Parts (Total = 50%)
– Proposal (5%): short written document, next few slides
– Design Review (10%): oral, in-class presentation 2 weeks before end
– Presentation and Poster (15%): poster session last week of class, dry run the week before
– Final Report (20%): due during finals – no final exam for this class

Project Schedule
– Thursday, March 8: proposals due
– Monday, April 2: design reviews
– Wednesday, April 18: poster dry run
– Monday, April 23: in-class poster presentation
– Wednesday, April 25: guest speaker

Content of Proposal
I. Team members: name and a sentence on expertise for each member
II. Problem description
– What is the computation and why is it important?
– Abstraction of computation: equations, graphic, or pseudo-code; no more than 1 page
III. Suitability for GPU acceleration
– Amdahl's Law: describe the inherent parallelism. Argue that it is close to 100% of the computation. Use measurements from CPU execution of the computation if possible.
– Synchronization and communication: discuss what data structures may need to be protected by synchronization, or may require communication through the host.
– Copy overhead: discuss the data footprint and the anticipated cost of copying to/from host memory.
IV. Intellectual challenges
– Generally, what makes this computation worthy of a project?
– Point to any difficulties you anticipate at present in achieving high speedup.

Projects – How to Approach
Some questions:
1. Amdahl's Law: target the bulk of the computation; profile to identify the key computations.
2. What is your strategy for gradually adding GPU execution to the CPU code while maintaining correctness?
3. How will you partition data and computation to avoid synchronization?
4. What types of floating point operations and accuracy requirements do you have?
5. How will you manage copy overhead? Can you overlap computation and copying? (See the sketch below.)
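
For question 5, a minimal sketch of overlapping copies with kernel execution using CUDA streams. The kernel scale and the chunking scheme are hypothetical stand-ins for whatever computation a project actually does; async copies also require page-locked host memory (cudaMallocHost):

    // Hypothetical kernel; stands in for the project's real computation.
    __global__ void scale(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    // Process n elements in nchunks pieces, alternating between two
    // streams so chunk k+1's copy can overlap chunk k's kernel.
    void run_chunked(float *h_data, float *d_data, int n, int nchunks) {
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);
        int chunk = n / nchunks;          // assumes n divisible by nchunks
        for (int k = 0; k < nchunks; k++) {
            cudaStream_t st = s[k % 2];
            int off = k * chunk;
            cudaMemcpyAsync(d_data + off, h_data + off,
                            chunk * sizeof(float),
                            cudaMemcpyHostToDevice, st);
            scale<<<(chunk + 255) / 256, 256, 0, st>>>(d_data + off, chunk);
            cudaMemcpyAsync(h_data + off, d_data + off,
                            chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, st);
        }
        cudaDeviceSynchronize();
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
    }

Whether this wins depends on the ratio of copy time to compute time; profiling (question 1) tells you if the overlap is worth the extra complexity.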

Floating Point
Incompatibility:
– Most scientific apps are double precision codes!
– Graphics applications do not need double precision (the criteria are speed and whether the picture looks OK, not whether it accurately models some scientific phenomenon).
– As a result, prior to the GTX 200 series and Tesla platforms, double precision floating point was not supported at all, and there were some inaccuracies in single-precision operations.
In general:
– Double precision is needed for convergence on fine meshes, or for large sets of values.
– Single precision is OK for coarse meshes.
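
A small host-side illustration (plain C; the values are chosen for illustration, not taken from the slides) of why large reductions can demand double precision: summing ten million small increments in float loses accuracy once the running sum dwarfs each addend, while double stays essentially exact:

    #include <stdio.h>

    int main(void) {
        float  fsum = 0.0f;
        double dsum = 0.0;
        // Add 1e-7 ten million times; the exact answer is 1.0.
        for (int i = 0; i < 10000000; i++) {
            fsum += 1e-7f;   // rounds badly once fsum >> 1e-7
            dsum += 1e-7;
        }
        printf("float:  %f\n", fsum);   // noticeably off from 1.0
        printf("double: %f\n", dsum);   // ~1.0
        return 0;
    }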

Some key features
– Hardware intrinsics implemented in special functional units: faster but less precise than software implementations.
– Double precision is slower than single precision, but new architectural enhancements have increased its performance.
– Measures of accuracy:
  – IEEE compliance
  – "Units in the last place" (ulp): the gap between the two floating-point numbers nearest to x, even if x is one of them.
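
To make "unit in the last place" concrete, a short host-side sketch using C's nextafterf to measure the gap between adjacent single-precision values; note that the gap grows with magnitude:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        // One ulp at x is the distance to the next representable float.
        printf("ulp(1.0)    = %g\n",
               nextafterf(1.0f, INFINITY) - 1.0f);       // 2^-23 ~ 1.19e-7
        printf("ulp(1024.0) = %g\n",
               nextafterf(1024.0f, INFINITY) - 1024.0f); // 2^-13 ~ 1.22e-4
        return 0;
    }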

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign
What is IEEE floating-point format?
A floating point binary number consists of three parts:
– sign (S), exponent (E), and mantissa (M)
– Each (S, E, M) pattern uniquely identifies a floating point number.
For each bit pattern, its IEEE floating-point value is derived as:
– value = (-1)^S * M * 2^E, where 1.0 ≤ M < 10.0 in binary (i.e., 1.0 ≤ M < 2.0)
The interpretation of S is simple: S = 0 results in a positive number and S = 1 a negative number.
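
A short host-side sketch (plain C, not from the slides) that extracts S, E, and M from a 32-bit float, making the layout visible: 1 sign bit, 8 exponent bits biased by 127, and 23 stored mantissa bits with an implied leading 1:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        float f = -6.5f;                  // -1.101 (binary) * 2^2
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);   // reinterpret the bits safely

        uint32_t S = bits >> 31;          // 1 sign bit
        uint32_t E = (bits >> 23) & 0xFF; // 8 exponent bits, bias 127
        uint32_t M = bits & 0x7FFFFF;     // 23 mantissa bits, implied 1.

        printf("S=%u  E=%u (unbiased %d)  M=0x%06X\n",
               S, E, (int)E - 127, M);
        // Prints: S=1  E=129 (unbiased 2)  M=0x500000  (.101 in binary)
        return 0;
    }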

Single Precision vs. Double Precision
– Platforms of compute capability 1.2 and below only support single precision floating point.
– Some systems (GTX 200 series, Tesla) include double precision, but it is much slower than single precision:
  – a single DP arithmetic unit shared by all SPs in an SM
  – similarly, a single fused multiply-add unit
– Greatly improved in Fermi:
  – up to 16 double precision operations performed per warp (see subsequent slides)

Fermi Architecture
– 512 cores
– 32 cores per SM
– 16 SMs
– 6 64-bit memory partitions

Closer look at Fermi core and SM
– 48 KB L1 cache in lieu of 16 KB shared memory (the split is configurable)
– 32-bit integer multiplies in a single operation
– Fused multiply-add
– IEEE-compliant with the latest standard (IEEE 754-2008)

Double Precision Arithmetic
– Up to 16 DP fused multiply-adds can be performed per clock per SM

Hardware Comparison

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign
GPU Floating Point Features (G80 vs. SSE vs. IBM Altivec vs. Cell SPE)
– Precision: IEEE 754 on all four.
– Rounding modes for FADD and FMUL: G80: round to nearest and round to zero; SSE: all 4 IEEE modes (nearest, zero, +inf, -inf); Altivec: round to nearest only; Cell SPE: round to zero/truncate only.
– Denormal handling: G80: flush to zero; SSE and Altivec: supported (1000's of cycles); Cell SPE: flush to zero.
– NaN support: yes on G80, SSE, and Altivec; no on Cell SPE.
– Overflow and infinity support: G80: yes, but only clamps to max norm; SSE and Altivec: yes; Cell SPE: no, infinity.
– Flags: G80: no; SSE and Altivec: yes; Cell SPE: some.
– Square root: software only, except hardware on SSE.
– Division: software only, except hardware on SSE.
– Reciprocal estimate accuracy: G80: 24 bits; others: 12 bits.
– Reciprocal sqrt estimate accuracy: G80: 23 bits; others: 12 bits.
– log2(x) and 2^x estimate accuracy: G80: 23 bits; SSE: none; Altivec: 12 bits; Cell SPE: none.

Summary: Accuracy vs. Performance
– A few operators are IEEE 754-compliant: addition and multiplication.
– ...but some give up precision, presumably in favor of speed or hardware simplicity – particularly division.
– Many built-in intrinsics perform common complex operations very fast.
– Some operations have multiple implementations, to trade off speed and accuracy: e.g., the intrinsic __sinf() (fast but imprecise) versus sinf() (much slower but more accurate).

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign
Deviations from IEEE-754
– Addition and multiplication are IEEE 754 compliant: maximum 0.5 ulp (units in the last place) error.
– However, they are often combined into a multiply-add (FMAD), whose intermediate result is truncated.
– Division is non-compliant (2 ulp).
– Not all rounding modes were supported on G80, but they are supported now.
– Denormalized numbers were not supported on G80, but are supported on later architectures.
– No mechanism to detect floating-point exceptions (this seems to be still true).
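
The effect of fusing multiply and add is easiest to see next to a true IEEE fused multiply-add, which rounds only once. A small host-side sketch (plain C; the input value is chosen for illustration) using C99's fmaf to expose the rounding error that a separate multiply discards:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        float a = 1.0f + 1.0f / 4096.0f;   // 1 + 2^-12, exactly representable
        float c = a * a;                   // product rounded to float first
        // fmaf computes a*a - c with one rounding, recovering the error
        // that the separate multiply rounded away.
        float err = fmaf(a, a, -c);
        printf("a*a rounded:    %.9g\n", c);
        printf("rounding error: %g\n", err);  // 2^-24 ~ 5.96e-8, not zero
        return 0;
    }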

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign
Arithmetic Instruction Throughput (G80)
– int and float add, shift, min, max, and float mul, mad: 4 cycles per warp
  – int multiply (*) is by default 32-bit and requires multiple cycles per warp
  – use the __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply
– Integer divide and modulo are expensive
  – the compiler will convert literal power-of-2 divides to shifts
  – be explicit in cases where the compiler can't tell that the divisor is a power of 2!
  – useful trick: foo % n == foo & (n-1) if n is a power of 2 (see the sketch below)
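
A device-side sketch of both tricks (the kernel and names are illustrative, not from the course code): the 24-bit multiply for the common index calculation, and a mask in place of modulo when the divisor is a known power of two:

    // Assumes n is a power of 2.
    __global__ void touch(float *data, int n) {
        // __umul24 was the fast 24-bit integer multiply on G80-class hardware.
        int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
        int bucket = i & (n - 1);   // identical to i % n when n is 2^k
        data[i] += (float)bucket;
    }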

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign
Arithmetic Instruction Throughput (G80)
– Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp
  – these are the intrinsic versions prefixed with "__", e.g., __sinf(), __expf()
– Other functions are combinations of the above:
  – y / x is computed as rcp(x) * y: 20 cycles per warp
  – sqrt(x) is computed as rcp(rsqrt(x)): 32 cycles per warp

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign
Runtime Math Library
There are two types of runtime math operations:
– __func(): direct mapping to hardware ISA; fast but low accuracy (see the programming guide for details). Examples: __sinf(x), __expf(x), __powf(x,y).
– func(): compiles to multiple instructions; slower but higher accuracy (5 ulp, units in the last place, or less). Examples: sinf(x), expf(x), powf(x,y).
The -use_fast_math compiler option forces every func() to compile to __func().
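
A minimal kernel contrasting the two flavors (the kernel itself is illustrative; __sinf/sinf are the actual CUDA spellings of the single-precision pair):

    __global__ void compare_sin(const float *x, float *slow, float *fast,
                                int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            slow[i] = sinf(x[i]);    // multiple instructions, high accuracy
            fast[i] = __sinf(x[i]);  // one SFU op: fast, lower accuracy
        }
    }
    // Compiling with nvcc -use_fast_math would make the sinf() call
    // itself compile down to __sinf().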

Next Class
– Discuss CUBLAS 2/3 implementation of matrix multiply and sample projects
Remainder of the semester:
– Focus on applications and parallel programming patterns