
1 Automated Floating-Point Precision Analysis
Michael O. Lam, Ph.D. Defense, 6 Jan 2014
Jeff Hollingsworth, Advisor

2 Context: Floating-point arithmetic is ubiquitous

3 Context: Floating-point arithmetic represents real numbers as ±1.frac × 2^exp
– Sign bit
– Exponent
– Significand ("mantissa" or "fraction")

4 Context: Representing 2.0
– Single precision (8-bit exponent, 23-bit significand): 0x40000000
– Double precision (11-bit exponent, 52-bit significand): 0x4000000000000000

5 Context: Representing 2.625
– Single precision: 0x40280000
– Double precision: 0x4005000000000000

6 Context: Representing 0.1
– Single precision: 0x3DCCCCCD
– Double precision: 0x3FB999999999999A

7 Context: Representing 1.234
– Single precision: 0x3F9DF3B6
– Double precision: 0x3FF3BE76C8B43958
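These encodings can be checked with a short C snippet (my illustration, not from the slides):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float  f = 1.234f;
        double d = 1.234;
        uint32_t fbits;
        uint64_t dbits;
        memcpy(&fbits, &f, sizeof fbits);  /* reinterpret bits without aliasing issues */
        memcpy(&dbits, &d, sizeof dbits);
        printf("single: 0x%08lX\n",  (unsigned long)fbits);       /* 0x3F9DF3B6 */
        printf("double: 0x%016llX\n", (unsigned long long)dbits); /* 0x3FF3BE76C8B43958 */
        return 0;
    }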

8 Context: Floating-point is ubiquitous but problematic
– Rounding error
  – Accumulates over many operations
  – Not always intuitive (e.g., addition is non-associative; see the example below)
– Naïve approach: use higher precision everywhere
– Lower precision is preferable
  – Tesla K20X is 2.3X faster in single precision
  – Xeon Phi is 2.0X faster in single precision
  – Single precision uses 50% of the memory bandwidth
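A minimal C demonstration of non-associativity (my example, assuming IEEE 754 doubles):

    #include <stdio.h>

    int main(void) {
        double a = 1e16, b = -1e16, c = 1.0;
        printf("(a + b) + c = %g\n", (a + b) + c);  /* prints 1 */
        printf("a + (b + c) = %g\n", a + (b + c));  /* prints 0: b + c rounds back to -1e16 */
        return 0;
    }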

9 Problem: Current analysis solutions are lacking
– Numerical analysis methods are difficult
– Static analysis is too conservative
– Trial-and-error is time-consuming
We need better analysis solutions that:
– Produce easy-to-understand results
– Incorporate runtime effects
– Are automated or semi-automated

10 Thesis: Automated runtime analysis techniques can inform application developers regarding floating-point behavior, and can provide insights to guide developers towards reducing precision with minimal impact on accuracy.

11 Contributions
1. Floating-point software analysis framework
2. Cancellation detection
3. Mixed-precision configuration
4. Reduced-precision analysis
Initial emphasis was on capability over performance.

12 Example: Sum2PI_X

    /* SUM2PI_X – approximate pi*x in a computationally-
     * heavy way to demonstrate various CRAFT analyses */

    #include <stdio.h>
    #include <math.h>

    typedef double real;   /* assumed typedef (not on the slide); the
                              analysis varies this between float and double */

    /* constants */
    #define PI  3.14159265359
    #define EPS 1e-7

    /* loop iterations; OUTER is X */
    #define OUTER 2000
    #define INNER 30

    int sum2pi_x() {
        int i, j, k;
        real x, y, acc, sum;
        real final = PI * OUTER;        /* correct answer */
        sum = 0.0;
        for (i=0; i<OUTER; i++) {
            acc = 0.0;
            for (j=1; j<INNER; j++) {
                /* calculate 2^j */
                x = 1.0;
                for (k=0; k<j; k++)
                    x *= 2.0;           /* 870K execs */
                /* approximately calculate pi */
                y = (real)PI / x;       /* 58K execs */
                acc += y;               /* 58K execs */
            }
            sum += acc;                 /* 2K execs */
        }
        real err = fabs(final-sum)/fabs(final);
        if (err < EPS) printf("SUCCESSFUL!\n");
        else           printf("FAILED!!!\n");
    }

13 Contribution 1 of 4: Software Framework

14 Framework: CRAFT (Configurable Runtime Analysis for Floating-point Tuning)

15 Framework: Dyninst, a binary analysis library
– Parses executable files (InstructionAPI & ParseAPI)
– Inserts instrumentation (DyninstAPI)
– Supports full binary modification (PatchAPI)
– Rewrites binary executable files (SymtabAPI)
Benefits of binary-level analysis:
– Programming-language-agnostic
– Supports closed third-party libraries
– Captures the effects of compiler transformations

16 Framework: CRAFT components
– Dyninst-based binary mutator (C/C++)
– Swing-based GUI viewers (Java)
– Automated search scripts (Ruby)
Proof-of-concept analyses:
– Instruction counting
– Not-a-Number (NaN) detection
– Range tracking (from Brown et al. 2007)

17 Sum2PI_X: No NaNs detected

18 Contribution 2 of 4: Cancellation Detection

19 Cancellation: loss of significant digits due to subtraction of nearly-equal values
Cancellation detection:
– Instrument every addition and subtraction
– Report cancellation events
Examples (digits of precision in parentheses):

      2.491264 (7)              1.613647 (7)
    - 2.491252 (7)            - 1.613647 (7)
    = 0.000012 (2)            = 0.000000 (0)
    (5 digits cancelled)      (all digits cancelled)
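The detection rule can be sketched in a few lines of C (cancelled_bits is my hypothetical helper, not CRAFT's API): the number of cancelled bits is the drop from the larger operand exponent to the result exponent.

    #include <math.h>
    #include <stdio.h>

    /* Hypothetical helper: bits cancelled by an addition or subtraction. */
    static int cancelled_bits(double a, double b, double result) {
        int ea, eb, er;
        frexp(a, &ea);
        frexp(b, &eb);
        if (result == 0.0)
            return 53;                 /* total cancellation: all significand bits */
        frexp(result, &er);
        int max_e = (ea > eb) ? ea : eb;
        return (max_e > er) ? (max_e - er) : 0;
    }

    int main(void) {
        printf("%d bits cancelled\n",
               cancelled_bits(2.491264, 2.491252, 2.491264 - 2.491252));
        printf("%d bits cancelled\n",
               cancelled_bits(1.613647, 1.613647, 0.0));
        return 0;
    }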

20 Cancellation: GUI

21 Cancellation: GUI

22 Cancellation: Sum2PI_X

    Version   Significand Size (bits)   Canceled Bits
    Single    23                        18
    Mixed     23/52                     23
    Double    52                        29

23 Cancellation: Results
– Gaussian elimination
  – Detect the effects of a small pivot value
  – Highlight algorithmic differences
– Domain-specific insights
  – Dense point fields
  – Color saturations
– Error checking
  – Larger cancellations are better

24 Cancellation: Conclusions
– Automated analysis can detect cancellation
– Cancellation detection serves a wide variety of purposes
– Later work expanded the ability to identify problematic cancellation [Benz et al. 2012]

25 Contribution 3 of 4: Mixed Precision

26 Tradeoff: Single (32 bits) vs. Double (64 bits)
Single precision is faster:
– 2X+ computational speedup on recent hardware
– 50% reduction in memory storage and bandwidth
Double precision is more accurate:
– ~16 significant decimal digits vs. ~7

27 Mixed Precision: most operations use single precision; crucial operations use double precision
Mixed-precision linear solver [Buttari 2008]:

     1: LU ← PA
     2: solve Ly = Pb
     3: solve Ux0 = y
     4: for k = 1, 2, ... do
     5:   rk ← b − A·x(k−1)
     6:   solve Ly = P·rk
     7:   solve U·zk = y
     8:   xk ← x(k−1) + zk
     9:   check for convergence
    10: end for

(Red text in the original slide marks the double-precision steps; all other steps are single-precision.)
– Difficult to prototype
– 50% speedup on average (12X in special cases)
A scalar sketch of the refinement loop follows.
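The shape of the algorithm shows up even in a tiny self-contained C sketch (a scalar analogue I wrote for illustration; the real solver factors a matrix):

    #include <stdio.h>

    int main(void) {
        double a = 3.0, b = 1.0;                   /* solve a * x = b */
        float inv_a = 1.0f / (float)a;             /* "factorization" in single */
        double x = (double)((float)b * inv_a);     /* initial solve in single */
        for (int k = 1; k <= 3; k++) {
            double r = b - a * x;                  /* residual in double */
            double z = (double)((float)r * inv_a); /* correction solve in single */
            x += z;                                /* update in double */
            printf("iter %d: x = %.17g (residual %.3g)\n", k, x, r);
        }
        return 0;
    }

Each pass recovers more correct digits while the expensive solve stays in single precision.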

28 Mixed Precision: workflow
CRAFT takes the original (double-precision) binary plus a mixed-precision configuration and produces a modified (mixed-precision) binary.

29 Mixed Precision: simulate single precision by storing a 32-bit version inside the 64-bit double-precision field

30 Mixed Precision: original code

    gvec[i,j] = gvec[i,j] * lvec[3] + gvar

    1: movsd 0x601e38(%rax,%rbx,8) → %xmm0
    2: mulsd -0x78(%rsp) * %xmm0 → %xmm0
    3: addsd -0x4f02(%rip) + %xmm0 → %xmm0
    4: movsd %xmm0 → 0x601e38(%rax,%rbx,8)

31 Mixed Precision: instrumented code

    gvec[i,j] = gvec[i,j] * lvec[3] + gvar

    1: movsd 0x601e38(%rax,%rbx,8) → %xmm0
       check/replace -0x78(%rsp) and %xmm0
    2: mulss -0x78(%rsp) * %xmm0 → %xmm0
       check/replace -0x4f02(%rip) and %xmm0
    3: addss -0x4f02(%rip) + %xmm0 → %xmm0
    4: movsd %xmm0 → 0x601e38(%rax,%rbx,8)

32 Mixed Precision

33 Mixed Precision: replacement check (inserted before each replaced instruction, e.g. addsd => addss)

    push %rax
    push %rbx
    mov %rbx, 0xffffffff00000000
    and %rax, %rbx                # extract high word
    mov %rbx, 0x7ff4dead00000000
    test %rax, %rbx               # check for flag
    je next                       # skip if replaced
    cvtsd2ss %rax, %rax           # down-cast value
    or %rax, %rbx                 # set flag
    next:
    pop %rbx
    pop %rax
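The same encoding expressed in C (a sketch I wrote; the flag constant is taken from the slide, and the helper names replace/is_replaced/extract are hypothetical):

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* A replaced value keeps the 32-bit float in the low word of the 64-bit
     * slot and a NaN bit pattern in the high word, so a replaced slot read
     * as a plain double is visibly invalid. */
    #define REPLACED_FLAG 0x7ff4dead00000000ULL

    static uint64_t replace(double d) {
        float f = (float)d;                /* down-cast value */
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        return REPLACED_FLAG | bits;       /* set flag in high word */
    }

    static int is_replaced(uint64_t slot) {
        return (slot & 0xffffffff00000000ULL) == REPLACED_FLAG;
    }

    static float extract(uint64_t slot) {
        uint32_t bits = (uint32_t)slot;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }

    int main(void) {
        uint64_t slot = replace(3.14159265359);
        if (is_replaced(slot))
            printf("stored as float: %f\n", extract(slot));
        return 0;
    }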

34 Mixed Precision
Question: Which parts should be replaced?
Answer: Automatic search
– Empirical, iterative feedback loop
– User-defined verification routine
– Heuristic search optimization

35 Automated Search

36 Automated Search

37 Automated Search: keys to the search algorithm
– Depth-first search: look for replaceable larger structures first (modules, functions, blocks, etc.)
– Prioritization: inspect highly-executed routines first
A sketch of this descent appears below.
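A minimal C sketch of the depth-first strategy (my hypothetical illustration, not CRAFT's Ruby driver): try to replace a whole structure, and only descend into its children when wholesale replacement fails verification.

    #include <stdio.h>

    typedef struct node {
        const char *name;
        struct node **children;
        int nchildren;
    } node_t;

    /* Stub for the user-defined routine: in CRAFT this runs the mutated
     * binary and checks its output against a correctness criterion. */
    static int passes_verification(const node_t *n) {
        return n->nchildren == 0;       /* pretend only leaves verify */
    }

    static void search(const node_t *n) {
        if (passes_verification(n)) {
            printf("replace %s entirely\n", n->name);
            return;                     /* larger structure replaced; done */
        }
        /* A real search would also order children by execution count
         * (prioritization) before descending. */
        for (int i = 0; i < n->nchildren; i++)
            search(n->children[i]);     /* refine one level deeper */
    }

    int main(void) {
        node_t leaf1 = { "block1", NULL, 0 };
        node_t leaf2 = { "block2", NULL, 0 };
        node_t *kids[] = { &leaf1, &leaf2 };
        node_t func  = { "func", kids, 2 };
        search(&func);
        return 0;
    }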

38 Mixed Precision: Sum2PI_X (failed all-single-precision replacement)

39 Mixed Precision: Sum2PI_X
The search gives the accumulator its own type (sum_type) and tests 32-bit and 64-bit candidates per type:

    int sum2pi_x() {
        int i, j, k;
        real x, y, acc;
        sum_type sum;
        real final = PI * OUTER;
        sum = 0.0;
        for (i=0; i<OUTER; i++) {
            acc = 0.0;
            for (j=1; j<INNER; j++) {
                x = 1.0;
                for (k=0; k<j; k++)
                    x *= 2.0;
                y = (real)PI / x;
                acc += y;
            }
            sum += acc;
        }
        real err = fabs(final-sum)/fabs(final);
        if (err < EPS) printf("SUCCESSFUL!\n");
        else           printf("FAILED!!!\n");
    }

Candidates: real = 32 bits with sum_type = 32 bits fails (✗); real = 32 bits with sum_type = 64 bits is the next configuration to test (?).

40 Mixed Precision: Sum2PI_X (same code as slide 39)
With real = 32 bits and sum_type = 64 bits, verification succeeds (✔): everything except the final accumulation can run in single precision.

41 Mixed Precision: Results
SuperLU: lower error threshold = fewer replacements

    Threshold   % Executions Replaced   Final Error
    1.0e-03     99.9                    1.59e-04
    1.0e-04     87.3                    4.42e-05
    7.5e-05     52.5                    4.40e-05
    5.0e-05     45.2                    3.00e-05
    2.5e-05     26.6                    1.69e-05
    1.0e-05      1.6                    7.15e-07
    1.0e-06      1.6                    4.7e-07

42 Mixed Precision: Results
SuperLU: lower error threshold = fewer replacements

    Threshold   Instructions Replaced       Final Error
                (% Static / % Dynamic)
    1.0e-03     99.1 / 99.9                 1.59e-04
    1.0e-04     94.1 / 87.3                 4.42e-05
    7.5e-05     91.3 / 52.5                 4.40e-05
    5.0e-05     87.9 / 45.2                 3.00e-05
    2.5e-05     80.3 / 26.6                 1.69e-05
    1.0e-05     75.4 /  1.6                 7.15e-07
    1.0e-06     72.6 /  1.6                 4.7e-07

43 Mixed Precision: Results
AMGmk: highly-adaptive multigrid microkernel with built-in error tolerance
– Search found a complete replacement
– Manual conversion: speedup from 175s to 95s (1.8X) on conventional x86_64 hardware

44 Mixed Precision: Results

    Benchmark      Candidate      Configurations   Instructions Replaced
    (name.CLASS)   Instructions   Tested           (% Static / % Dynamic)
    bt.W           6,647          3,854            76.2 / 85.7
    bt.A           6,682          3,832            75.9 / 81.6
    cg.W             940            270            93.7 /  6.4
    cg.A             934            229            94.7 /  5.3
    ep.W             397            112            93.7 / 30.7
    ep.A             397            113            93.1 / 23.9
    ft.W             422             72            84.4 /  0.3
    ft.A             422             73            93.6 /  0.2
    lu.W           5,957          3,769            73.7 / 65.5
    lu.A           5,929          2,814            80.4 / 69.4
    mg.W           1,351            458            84.4 / 28.0
    mg.A           1,351            456            84.1 / 24.4
    sp.W           4,772          5,729            36.9 / 45.8
    sp.A           4,821          5,044            51.9 / 43.0

45 Mixed Precision: Results

    Benchmark      Candidate      Configurations   Instructions Replaced
    (name.CLASS)   Instructions   Tested           (% Static / % Dynamic)
    bt.W           6,228          3,934            73.7 / 83.2
    bt.A           6,262          4,000            73.6 / 78.6
    cg.W             962            251            90.7 /  7.4
    cg.A             956            255            90.4 /  5.6
    ep.W             423            117            92.0 / 47.2
    ep.A             423            114            91.7 / 45.5
    ft.W             426             75            93.0 /  0.3
    ft.A             426             74            93.0 /  0.2
    lu.W           6,038          4,117            69.2 / 57.4
    lu.A           6,014          3,057            77.1 / 57.4
    mg.W           1,393            443            89.4 / 39.2
    mg.A           1,393            437            88.8 / 36.6
    sp.W           4,458          5,124            38.1 / 40.5
    sp.A           4,507          4,920            47.9 / 30.5

46 Mixed Precision: Results

    Benchmark      Candidate      Configurations   % Dynamic
    (name.CLASS)   Instructions   Tested           Replaced
    bt.W           6,228          3,934            83.2
    bt.A           6,262          4,000            78.6
    cg.W             962            251             7.4
    cg.A             956            255             5.6
    ep.W             423            117            47.2
    ep.A             423            114            45.5
    ft.W             426             75             0.3
    ft.A             426             74             0.2
    lu.W           6,038          4,117            57.4
    lu.A           6,014          3,057            57.4
    mg.W           1,393            443            39.2
    mg.A           1,393            437            36.6
    sp.W           4,458          5,124            40.5
    sp.A           4,507          4,920            30.5

47 Mixed Precision: Results
Memory-based analysis:
– Replacement candidates: output operands
– Generally higher replacement rates
– Analysis found several valid variable-level replacements

    Benchmark      Candidate   Configurations   Operands Replaced
    (name.CLASS)   Operands    Tested           (% Static / % Dynamic)
    bt.A           2,342         300            98.3 / 97.0
    cg.A             287          68            96.2 / 71.3
    ep.A             236          59            96.2 / 37.9
    ft.A             466         108            94.2 / 46.2
    lu.A           1,742         104            98.5 / 99.9
    mg.A             597         153            95.6 / 83.4
    sp.A           1,525       1,094            94.5 / 88.9

48 Mixed Precision: Results
Memory-based analysis:
– Replacement candidates: output operands
– Generally higher replacement rates
– Analysis found several valid variable-level replacements

    Benchmark      Candidate   Configurations   % Executions
    (name.CLASS)   Operands    Tested           Replaced
    bt.A           2,342         300            97.0
    cg.A             287          68            71.3
    ep.A             236          59            37.9
    ft.A             466         108            46.2
    lu.A           1,742         104            99.9
    mg.A             597         153            83.4
    sp.A           1,525       1,094            88.9

49 Mixed Precision: Conclusions
– Automated tools can prototype mixed-precision configurations
– Automated search can provide precision-level replacement insights
– Precision analysis could provide another "knob" for application tuning
– Even if computation requires double precision, storage/communication may not

50 Contribution 4 of 4: Reduced Precision

51 Reduced Precision: simulate reduced precision with truncation
– Truncate the result after every operation
– Allows any precision from zero bits up to full double (64-bit)
– Less overhead than the mixed-precision scheme (fewer added operations)
Search routine identifies component-level precision requirements.
A sketch of the truncation primitive appears below.
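A minimal C sketch of the truncation primitive (my illustration; CRAFT applies this in-place in the binary):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Zero all but the top `bits` bits of a double's 52-bit significand,
     * simulating a result computed at reduced precision (truncation,
     * not rounding). */
    static double truncate_result(double x, int bits) {
        uint64_t u;
        memcpy(&u, &x, sizeof u);
        u &= ~((UINT64_C(1) << (52 - bits)) - 1);  /* clear low significand bits */
        memcpy(&x, &u, sizeof x);
        return x;
    }

    int main(void) {
        double y = 1.0 / 3.0;
        printf("full:    %.17g\n", y);
        printf("23 bits: %.17g\n", truncate_result(y, 23));  /* ~single precision */
        printf("0 bits:  %.17g\n", truncate_result(y, 0));   /* exponent only */
        return 0;
    }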

52 Reduced Precision: GUI
Bit-level precision requirements, shown on a scale from 0 bits through single up to double.

53 Reduced Precision: Sum2PI_X
– 0 bits (single – exponent only)
– 22 bits (single)
– 27 bits (double – overly conservative)
– 32 bits (double)

54 Reduced Precision: faster search convergence compared to mixed-precision analysis

    Benchmark   Instructions   Original Wall Time (s)   Speedup
    cg.A          956            1,305                  59.2%
    ep.A          423              978                  42.5%
    ft.A          426              825                  50.2%
    lu.A        6,014          514,332                  86.7%
    mg.A        1,393            2,898                  66.0%
    sp.A        4,507          422,371                  44.1%

55 Reduced Precision: general precision requirement profiles, ranging from low sensitivity to high sensitivity

56 Reduced Precision: Results
Profiles for NAS (top): bt.A (78.6%), mg.A (36.6%), ft.A (0.2%); and LAMMPS (bottom): chute, lj, rhodo

57 Reduced Precision: Results (NAS mg.W, incremental search)
– >5.0% – 4:66
– >1.0% – 5:93
– >0.5% – 9:45
– >0.1% – 15:45
– >0.05% – 23:60
– Full – 28:71

58 Reduced Precision: Conclusions
– Automated analysis can identify general precision-level requirements
– Reduced-precision analysis provides results more quickly than mixed-precision analysis
– Incremental searches reduce the time to solution without sacrificing fidelity

59 Contributions
– General floating-point analysis framework: 32.3K LOC total in ~200 files; LGPL on SourceForge: sf.net/p/crafthpc
– Cancellation detection: WHIST'11 paper, PARCO 39/3 article
– Mixed-precision configuration: SC'12 poster, ICS'13 paper
– Reduced-precision analysis: ICS'14 submission in preparation

60 Future Work
Short term:
– Optimization and platform ports
– Analysis extension and composition
– Further case studies
Long term:
– Compiler-based implementation
– IDE and development-cycle integration
– Program modeling and verification

61 Conclusion: Automated runtime analysis techniques can inform application developers regarding floating-point behavior, and can provide insights to guide developers towards reducing precision with minimal impact on accuracy.

62 Acknowledgements
Collaborators:
– Jeff Hollingsworth (advisor) and Pete Stewart (UMD)
– Bronis de Supinski, Matt Legendre, et al. (LLNL)
Colleagues:
– Ananta Tiwari, Tugrul Ince, Geoff Stoker, Nick Rutar, Ray Chen, et al.
– CS Department @ UMD
– Intel XED2
Family & Friends:
– Lindsay Lam (spouse)
– Neil & Alice Lam, Barry & Susan Walters
– Wallace PCA and Elkton EPC
(cartoon by Nick Rutar)

