Automatically Adapting Programs for Mixed-Precision Floating-Point Computation Mike Lam and Jeff Hollingsworth University of Maryland, College Park Bronis.

Presentation transcript:

Automatically Adapting Programs for Mixed-Precision Floating-Point Computation Mike Lam and Jeff Hollingsworth University of Maryland, College Park Bronis de Supinski and Matt LeGendre Lawrence Livermore National Lab

Background
Floating point represents real numbers as (± significand × 2^exponent)
  o Sign bit
  o Exponent
  o Significand (mantissa or fraction)
Finite precision
  o Single-precision: 24 bits (~7 decimal digits)
  o Double-precision: 53 bits (~16 decimal digits)
  o Introduces rounding error
IEEE Single: Exponent (8 bits), Significand (23 bits stored)
IEEE Double: Exponent (11 bits), Significand (52 bits stored)
The 24- and 53-bit precisions include the implicit leading bit, which is not stored.
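The effect of the two significand widths can be seen with a small sketch. The helper name `to_single` is hypothetical; it emulates single precision by round-tripping a Python float (a double) through the 4-byte IEEE format using the standard `struct` module.

```python
import struct

def to_single(x: float) -> float:
    """Round a double to the nearest IEEE single and return it as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# 2**24 + 1 needs 25 significand bits, so it cannot be stored in a single.
big_single = to_single(float(2**24 + 1))

# 2**53 + 1 needs 54 significand bits, so it cannot be stored in a double.
big_double = float(2**53 + 1)

# 0.1 is rounded differently at the two precisions.
tenth_error = abs(to_single(0.1) - 0.1)
```

Here `big_single` and `big_double` both lose their trailing `+ 1`, and `tenth_error` is the extra rounding error introduced by storing 0.1 as a single.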

Motivation
Double precision is ubiquitous
  o Necessary for some computations
  o Lack of easy-to-use techniques for reasoning about precision
Single precision is preferable
  o Faster computation
      o Tesla K20X: 2.95 TFlops (singles) vs TFlops (doubles)
      o Intel Xeon Phi: 2.15 GFlops (singles) vs GFlops (doubles)
      o Standard CPUs: 2x operations w/ SSE vector operations
  o Reduced memory pressure
      o Up to 50% footprint reduction
      o Data movement is a bottleneck for some domains
Desire: Balance speed (singles) with accuracy (doubles)

Mixed Precision
Mixed-precision linear solver algorithm
   1: LU ← PA
   2: solve Ly = Pb
   3: solve Ux_0 = y
   4: for k = 1, 2, ... do
   5:     r_k ← b − Ax_{k−1}
   6:     solve Ly = Pr_k
   7:     solve Uz_k = y
   8:     x_k ← x_{k−1} + z_k
   9:     check for convergence
  10: end for
Red text on the original slide indicates steps performed in double precision (all other steps are single precision).
Use double precision where necessary
Use single precision where possible
Nearly 2x speedups [Baboulin2008]
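The solver above can be sketched in plain Python. This is an illustration only, not the implementation from [Baboulin2008]: the function names are hypothetical, single precision is emulated by rounding every result through `struct`, and the sketch assumes (as in the usual iterative-refinement scheme) that the residual and the solution update are the double-precision steps. It runs a fixed number of iterations instead of a convergence check.

```python
import struct

def f32(x):
    """Round a double to single precision (emulated)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def lu_factor_single(A):
    """LU factorization with partial pivoting; every result rounded to single."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = f32(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - f32(LU[i][k] * LU[k][j]))
    return LU, perm

def lu_solve_single(LU, perm, b):
    """Solve Ly = Pb, then Ux = y, in emulated single precision."""
    n = len(LU)
    y = [f32(b[p]) for p in perm]
    for i in range(n):
        for j in range(i):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
        y[i] = f32(y[i] / LU[i][i])
    return y

def refine(A, b, iterations=5):
    """Mixed-precision iterative refinement with a fixed iteration count."""
    LU, perm = lu_factor_single(A)          # single
    x = lu_solve_single(LU, perm, b)        # single
    for _ in range(iterations):
        # Residual in double precision (ordinary Python floats).
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        z = lu_solve_single(LU, perm, r)    # single
        x = [xi + zi for xi, zi in zip(x, z)]  # update in double
    return x
```

For a small well-conditioned system, `refine` recovers a solution close to double-precision accuracy even though the factorization and triangular solves only ever see single-precision values.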

Our Goal
Use automated analysis techniques to prototype mixed-precision variants and provide insight about a program's precision-level requirements.

Framework
CRAFT: Configurable Runtime Analysis for Floating-point Tuning
Static binary instrumentation
  o Parse binary on disk
  o Replace or augment floating-point instructions with new code
  o Rewrite modified binary
Dynamic analysis
  o Run modified program on representative data set
  o Produce results and recommendations

Previous Work
Cancellation detection [WHIST11]
  o Reports loss of precision due to subtraction
  o Provides insight regarding numerical behavior
Range tracking
  o Reports per-instruction min/max values
  o Provides insight regarding low dynamic ranges
Mixed-precision variants
  o Replaces double-precision instructions and operands
  o Provides insight regarding precision-level sensitivity
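One common way to quantify loss of precision due to subtraction is to compare the binary exponents of the operands with the exponent of the result. The sketch below uses that measure; the slides do not say how the tool computes it, so the function name `cancelled_bits` and the exponent-difference definition are assumptions made for illustration.

```python
import math

def cancelled_bits(a: float, b: float) -> int:
    """Estimate how many leading binary digits cancel when computing a - b."""
    result = a - b
    if a == 0.0 or b == 0.0:
        return 0          # nothing to cancel against
    if result == 0.0:
        return 53         # every significand bit of a double cancelled
    # math.frexp returns (mantissa, exponent) with 0.5 <= |mantissa| < 1.
    exp_in = max(math.frexp(a)[1], math.frexp(b)[1])
    exp_out = math.frexp(result)[1]
    return max(0, exp_in - exp_out)
```

Subtracting nearly equal values such as `1.0000001 - 1.0` reports many cancelled bits, while `3.0 - 1.0` reports none.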

Implementation
In-place replacement
  o Narrowed focus: doubles → singles
  o In-place downcast conversion
  o Flag in the high bits to indicate replacement
[Figure: Double → (downcast conversion) → Replaced Double, which holds the flag 7FF4DEAD (non-signalling NaN) together with the Single value]
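The in-place encoding can be sketched with the standard `struct` module. The flag value comes from the slide; the function names are hypothetical, and the layout (flag in the high 32 bits, single-precision bits in the low 32 bits) is an assumption consistent with "flag in the high bits".

```python
import math
import struct

FLAG = 0x7FF4DEAD  # flag value shown on the slide

def replace_double(value: float) -> int:
    """Build the 64-bit pattern: flag in the high word, single bits in the low word."""
    single_bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return (FLAG << 32) | single_bits

def is_replaced(bits: int) -> bool:
    """True if the 64-bit pattern carries the replacement flag."""
    return (bits >> 32) == FLAG

def extract_single(bits: int) -> float:
    """Recover the single-precision value stored in the low word."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def as_double(bits: int) -> float:
    """Reinterpret the 64-bit pattern as an ordinary double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]
```

Because the high word has an all-ones exponent field, a replaced slot reads as a NaN if code that was not instrumented interprets it as a double, which makes accidental use of a replaced value easy to spot.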

Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar
  movsd  0x601e38(%rax, %rbx, 8) → %xmm0
  mulsd  -0x78(%rsp) * %xmm0 → %xmm0
  addsd  -0x4f02(%rip) + %xmm0 → %xmm0
  movsd  %xmm0 → 0x601e38(%rax, %rbx, 8)

Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar
  movsd  0x601e38(%rax, %rbx, 8) → %xmm0
  mulss  -0x78(%rsp) * %xmm0 → %xmm0
  addss  -0x4f02(%rip) + %xmm0 → %xmm0
  movsd  %xmm0 → 0x601e38(%rax, %rbx, 8)

Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar
  movsd  0x601e38(%rax, %rbx, 8) → %xmm0
  check/replace -0x78(%rsp) and %xmm0
  mulss  -0x78(%rsp) * %xmm0 → %xmm0
  check/replace -0x4f02(%rip) and %xmm0
  addss  -0x4f02(%rip) + %xmm0 → %xmm0
  movsd  %xmm0 → 0x601e38(%rax, %rbx, 8)

Replacement Code
  push %rax
  push %rbx
  mov %rbx, 0xffffffff
  and %rax, %rbx         # extract high word
  mov %rbx, 0x7ff4dead
  test %rax, %rbx        # check for flag
  je next                # skip if replaced
  cvtsd2ss %rax, %rax    # down-cast value
  or %rax, %rbx          # set flag
next:
  pop %rbx
  pop %rax
  # e.g. addsd => addss
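The check/replace step and the substituted single-precision instruction can be modeled at a high level as follows. This is a behavioral sketch, not a translation of the assembly: the function names are hypothetical, operands are modeled as 64-bit integers, and the low-word placement of the single value is an assumption.

```python
import struct

FLAG = 0x7FF4DEAD  # flag value shown on the slides

def double_bits(x: float) -> int:
    """Raw 64-bit pattern of an ordinary double."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def check_replace(bits: int) -> int:
    """Down-cast an operand in place unless it already carries the flag."""
    if (bits >> 32) == FLAG:
        return bits                       # skip if already replaced
    value = struct.unpack("<d", struct.pack("<Q", bits))[0]
    single = struct.unpack("<I", struct.pack("<f", value))[0]
    return (FLAG << 32) | single          # set flag

def single_value(bits: int) -> float:
    """Read the single stored in the low word of a replaced operand."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def addss(a_bits: int, b_bits: int) -> int:
    """Model of a replaced addsd: check/replace both operands, add as singles."""
    a = single_value(check_replace(a_bits))
    b = single_value(check_replace(b_bits))
    total = struct.unpack("<f", struct.pack("<f", a + b))[0]  # round to single
    return check_replace(double_bits(total))
```

Applying `check_replace` twice gives the same result as applying it once, so an operand that flows through several replaced instructions is only converted the first time.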

Dyninst
Binary analysis framework
  o Parses executable files (InstructionAPI & ParseAPI)
  o Inserts instrumentation (DyninstAPI)
  o Supports full binary modification (PatchAPI)
  o Rewrites binary executable files (SymtabAPI)
dyninst.org

Block Editing
[Figure labels: original instruction in block; block splits; initialization; check/replace; double → single conversion]

Overhead
Benchmark (name.CLASS)   Average Overhead
bt.A                     50.6X
cg.A                     6.1X
ep.A                     13.8X
ft.A                     10.1X
lu.A                     28.5X
mg.A                     14.0X
sp.A                     19.5X

Binary Editing
[Figure labels: Original Binary (mutatee), Double Precision; CRAFT (mutator); Configuration (parser & GUI), Mixed Config; Modified Binary, Mixed Precision]

Configuration

Automated Search
Manual mixed-precision replacement
  o Hard to use without intuition regarding potential replacements
Automatic mixed-precision analysis
  o Try lots of configurations (empirical auto-tuning)
  o Test with user-defined verification routine and data set
  o Exploit program control structure: replace larger structures (modules, functions) first
  o If coarse-grained replacements fail, try finer-grained subcomponent replacements
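The coarse-to-fine strategy can be sketched as a breadth-first walk over the program structure. Everything here is illustrative: the function name `search`, the `children` mapping, and the `passes` callback (standing in for the user-defined verification routine) are assumptions, and a real search would also need to verify the combined configuration at the end.

```python
from collections import deque

def search(root, children, passes):
    """Return the largest structures whose single-precision replacement passes.

    root     -- name of the top-level structure (e.g. the whole program)
    children -- dict mapping a structure to its sub-structures
    passes   -- callback: True if replacing that structure passes verification
    """
    replaced = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if passes(node):
            replaced.append(node)                  # keep the coarse replacement
        else:
            queue.extend(children.get(node, []))   # retry at finer granularity
    return replaced

# Toy program: one module, two functions, three instructions.
toy_children = {"app": ["f", "g"], "f": ["f.i1", "f.i2"], "g": ["g.i1"]}
toy_sensitive = {"f.i2"}  # instruction that must stay in double precision

def toy_leaves(node):
    """All instructions contained in a structure."""
    kids = toy_children.get(node)
    if not kids:
        return {node}
    return set().union(*(toy_leaves(k) for k in kids))

def toy_passes(node):
    """Stand-in verification: fails if a sensitive instruction is replaced."""
    return not (toy_leaves(node) & toy_sensitive)

toy_result = search("app", toy_children, toy_passes)
```

In the toy example the whole program and function `f` fail verification, so the search ends up replacing function `g` as a unit and only instruction `f.i1` inside `f`.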

System Overview

Example Results

Example Results

NAS Results
Columns: Benchmark (name.CLASS); Candidate Instructions; Configurations Tested; Instructions Replaced (% Static, % Dynamic)
bt.W   6,647   3,…
bt.A   6,682   3,…
cg.W
cg.A
ep.W
ep.A
ft.W
ft.A
lu.W   5,957   3,…
lu.A   5,929   2,…
mg.W   1,…
mg.A   1,…
sp.W   4,772   5,…
sp.A   4,821   5,…

NAS Results
Columns: Benchmark (name.CLASS); Candidate Instructions; Configurations Tested; Instructions Replaced (% Static, % Dynamic)
bt.W   6,228   3,…
bt.A   6,262
cg.W
cg.A
ep.W
ep.A
ft.W
ft.A
lu.W   6,038   4,…
lu.A   6,014
mg.W   1,…
mg.A   1,…
sp.W   4,458   5,…
sp.A   4,507

AMGmk Results
Algebraic MultiGrid microkernel
  o Multigrid method is iterative and highly adaptive
  o Good candidate for replacement
Automatic search
  o Complete conversion (100% replacement)
Manually-rewritten version
  o Speedup: 175 sec to 95 sec (1.8X)
  o Conventional x86_64 hardware

SuperLU Results
Package for LU decomposition and linear solves
  o Reports final error residual (useful for thresholding)
  o Both single- and double-precision versions
Verified manual conversion via automatic search
  o Used error from provided single-precision version as threshold
  o Final config matched single-precision profile (99.9% replacement)
Table columns: Threshold; Instructions Replaced (% Static, % Dynamic); Final Error

Future Work
Memory-based analysis
Case studies
Search optimization

Conclusion
Automated binary modification can build prototype mixed-precision program variants.
Automated search can provide insight to focus mixed-precision implementation efforts.

Thank you!
sf.net/p/crafthpc