
1 Automated Floating-Point Precision Analysis
Michael O. Lam, Ph.D. Defense, 6 Jan 2014
Jeff Hollingsworth, Advisor

2 Context: Floating-point arithmetic is ubiquitous

3 Context: Floating-point arithmetic represents real numbers as ±1.frac × 2^exp
– Sign bit
– Exponent
– Significand ("mantissa" or "fraction")

4 Context: Representing 2.0
– Single precision (8-bit exponent, 23-bit significand): 0x40000000
– Double precision (11-bit exponent, 52-bit significand): 0x4000000000000000

5 Context: Representing 2.625
– Single precision: 0x40280000
– Double precision: 0x4005000000000000

6 Context: Representing 0.1
– Single precision: 0x3DCCCCCD
– Double precision: 0x3FB999999999999A

7 Context: Representing 1.234
– Single precision: 0x3F9DF3B6
– Double precision: 0x3FF3BE76C8B43958
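These encodings can be checked with a short C snippet (my illustration, not from the slides):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float  f = 1.234f;
        double d = 1.234;
        uint32_t fbits;
        uint64_t dbits;
        memcpy(&fbits, &f, sizeof fbits);  /* reinterpret bits without aliasing issues */
        memcpy(&dbits, &d, sizeof dbits);
        printf("single: 0x%08lX\n",  (unsigned long)fbits);       /* 0x3F9DF3B6 */
        printf("double: 0x%016llX\n", (unsigned long long)dbits); /* 0x3FF3BE76C8B43958 */
        return 0;
    }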

8 Context: Floating-point is ubiquitous but problematic
– Rounding error
  – Accumulates over many operations
  – Not always intuitive (e.g., addition is non-associative; see the example below)
– Naïve approach: use higher precision everywhere
– Lower precision is preferable
  – Tesla K20X is 2.3X faster in single precision
  – Xeon Phi is 2.0X faster in single precision
  – Single precision uses 50% of the memory bandwidth
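A minimal C demonstration of non-associativity (my example, assuming IEEE 754 doubles):

    #include <stdio.h>

    int main(void) {
        double a = 1e16, b = -1e16, c = 1.0;
        printf("(a + b) + c = %g\n", (a + b) + c);  /* prints 1 */
        printf("a + (b + c) = %g\n", a + (b + c));  /* prints 0: b + c rounds back to -1e16 */
        return 0;
    }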

9 Problem: Current analysis solutions are lacking
– Numerical analysis methods are difficult
– Static analysis is too conservative
– Trial-and-error is time-consuming
We need better analysis solutions that:
– Produce easy-to-understand results
– Incorporate runtime effects
– Are automated or semi-automated

10 Thesis: Automated runtime analysis techniques can inform application developers regarding floating-point behavior, and can provide insights to guide developers towards reducing precision with minimal impact on accuracy.

11 Contributions
1. Floating-point software analysis framework
2. Cancellation detection
3. Mixed-precision configuration
4. Reduced-precision analysis
Initial emphasis was on capability over performance.

12 Example: Sum2PI_X

    /* SUM2PI_X – approximate pi*x in a computationally-
     * heavy way to demonstrate various CRAFT analyses */

    #include <stdio.h>
    #include <math.h>

    typedef double real;   /* assumed typedef (not on the slide); the
                              analysis varies this between float and double */

    /* constants */
    #define PI  3.14159265359
    #define EPS 1e-7

    /* loop iterations; OUTER is X */
    #define OUTER 2000
    #define INNER 30

    int sum2pi_x() {
        int i, j, k;
        real x, y, acc, sum;
        real final = PI * OUTER;        /* correct answer */
        sum = 0.0;
        for (i=0; i<OUTER; i++) {
            acc = 0.0;
            for (j=1; j<INNER; j++) {
                /* calculate 2^j */
                x = 1.0;
                for (k=0; k<j; k++)
                    x *= 2.0;           /* 870K execs */
                /* approximately calculate pi */
                y = (real)PI / x;       /* 58K execs */
                acc += y;               /* 58K execs */
            }
            sum += acc;                 /* 2K execs */
        }
        real err = fabs(final-sum)/fabs(final);
        if (err < EPS) printf("SUCCESSFUL!\n");
        else           printf("FAILED!!!\n");
    }

13 Contribution 1 of 4: Software Framework

14 Framework: CRAFT (Configurable Runtime Analysis for Floating-point Tuning)

15 Framework: Dyninst, a binary analysis library
– Parses executable files (InstructionAPI & ParseAPI)
– Inserts instrumentation (DyninstAPI)
– Supports full binary modification (PatchAPI)
– Rewrites binary executable files (SymtabAPI)
Benefits of binary-level analysis:
– Programming-language-agnostic
– Supports closed third-party libraries
– Captures the effects of compiler transformations

16 Framework: CRAFT components
– Dyninst-based binary mutator (C/C++)
– Swing-based GUI viewers (Java)
– Automated search scripts (Ruby)
Proof-of-concept analyses:
– Instruction counting
– Not-a-Number (NaN) detection
– Range tracking (from Brown et al. 2007)

17 Sum2PI_X: No NaNs detected

18 Contribution 2 of 4: Cancellation Detection

19 Cancellation: loss of significant digits due to subtraction of nearly-equal values
Cancellation detection:
– Instrument every addition and subtraction
– Report cancellation events
Examples (digits of precision in parentheses):

      2.491264 (7)              1.613647 (7)
    - 2.491252 (7)            - 1.613647 (7)
    = 0.000012 (2)            = 0.000000 (0)
    (5 digits cancelled)      (all digits cancelled)
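The detection rule can be sketched in a few lines of C (cancelled_bits is my hypothetical helper, not CRAFT's API): the number of cancelled bits is the drop from the larger operand exponent to the result exponent.

    #include <math.h>
    #include <stdio.h>

    /* Hypothetical helper: bits cancelled by an addition or subtraction. */
    static int cancelled_bits(double a, double b, double result) {
        int ea, eb, er;
        frexp(a, &ea);
        frexp(b, &eb);
        if (result == 0.0)
            return 53;                 /* total cancellation: all significand bits */
        frexp(result, &er);
        int max_e = (ea > eb) ? ea : eb;
        return (max_e > er) ? (max_e - er) : 0;
    }

    int main(void) {
        printf("%d bits cancelled\n",
               cancelled_bits(2.491264, 2.491252, 2.491264 - 2.491252));
        printf("%d bits cancelled\n",
               cancelled_bits(1.613647, 1.613647, 0.0));
        return 0;
    }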

20 Cancellation: GUI

21 Cancellation: GUI

22 Cancellation: Sum2PI_X

    Version   Significand Size (bits)   Canceled Bits
    Single    23                        18
    Mixed     23/52                     23
    Double    52                        29

23 Cancellation: Results
– Gaussian elimination
  – Detect the effects of a small pivot value
  – Highlight algorithmic differences
– Domain-specific insights
  – Dense point fields
  – Color saturations
– Error checking
  – Larger cancellations are better

24 Cancellation: Conclusions
– Automated analysis can detect cancellation
– Cancellation detection serves a wide variety of purposes
– Later work expanded the ability to identify problematic cancellation [Benz et al. 2012]

25 Contribution 3 of 4: Mixed Precision

26 Tradeoff: Single (32 bits) vs. Double (64 bits)
Single precision is faster:
– 2X+ computational speedup on recent hardware
– 50% reduction in memory storage and bandwidth
Double precision is more accurate:
– ~16 significant decimal digits vs. ~7

27 Mixed Precision: most operations use single precision; crucial operations use double precision
Mixed-precision linear solver [Buttari 2008]:

     1: LU ← PA
     2: solve Ly = Pb
     3: solve Ux0 = y
     4: for k = 1, 2, ... do
     5:   rk ← b − A·x(k−1)
     6:   solve Ly = P·rk
     7:   solve U·zk = y
     8:   xk ← x(k−1) + zk
     9:   check for convergence
    10: end for

(Red text in the original slide marks the double-precision steps; all other steps are single-precision.)
– Difficult to prototype
– 50% speedup on average (12X in special cases)
A scalar sketch of the refinement loop follows.
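The shape of the algorithm shows up even in a tiny self-contained C sketch (a scalar analogue I wrote for illustration; the real solver factors a matrix):

    #include <stdio.h>

    int main(void) {
        double a = 3.0, b = 1.0;                   /* solve a * x = b */
        float inv_a = 1.0f / (float)a;             /* "factorization" in single */
        double x = (double)((float)b * inv_a);     /* initial solve in single */
        for (int k = 1; k <= 3; k++) {
            double r = b - a * x;                  /* residual in double */
            double z = (double)((float)r * inv_a); /* correction solve in single */
            x += z;                                /* update in double */
            printf("iter %d: x = %.17g (residual %.3g)\n", k, x, r);
        }
        return 0;
    }

Each pass recovers more correct digits while the expensive solve stays in single precision.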

28 Mixed Precision: workflow
CRAFT takes the original (double-precision) binary plus a mixed-precision configuration and produces a modified (mixed-precision) binary.

29 Mixed Precision: simulate single precision by storing a 32-bit version inside the 64-bit double-precision field

30 Mixed Precision: original code

    gvec[i,j] = gvec[i,j] * lvec[3] + gvar

    1: movsd 0x601e38(%rax,%rbx,8) → %xmm0
    2: mulsd -0x78(%rsp) * %xmm0 → %xmm0
    3: addsd -0x4f02(%rip) + %xmm0 → %xmm0
    4: movsd %xmm0 → 0x601e38(%rax,%rbx,8)

31 Mixed Precision: instrumented code

    gvec[i,j] = gvec[i,j] * lvec[3] + gvar

    1: movsd 0x601e38(%rax,%rbx,8) → %xmm0
       check/replace -0x78(%rsp) and %xmm0
    2: mulss -0x78(%rsp) * %xmm0 → %xmm0
       check/replace -0x4f02(%rip) and %xmm0
    3: addss -0x4f02(%rip) + %xmm0 → %xmm0
    4: movsd %xmm0 → 0x601e38(%rax,%rbx,8)

32 Mixed Precision

33 Mixed Precision: replacement check (inserted before each replaced instruction, e.g. addsd => addss)

    push %rax
    push %rbx
    mov %rbx, 0xffffffff00000000
    and %rax, %rbx                # extract high word
    mov %rbx, 0x7ff4dead00000000
    test %rax, %rbx               # check for flag
    je next                       # skip if replaced
    cvtsd2ss %rax, %rax           # down-cast value
    or %rax, %rbx                 # set flag
    next:
    pop %rbx
    pop %rax
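The same encoding expressed in C (a sketch I wrote; the flag constant is taken from the slide, and the helper names replace/is_replaced/extract are hypothetical):

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* A replaced value keeps the 32-bit float in the low word of the 64-bit
     * slot and a NaN bit pattern in the high word, so a replaced slot read
     * as a plain double is visibly invalid. */
    #define REPLACED_FLAG 0x7ff4dead00000000ULL

    static uint64_t replace(double d) {
        float f = (float)d;                /* down-cast value */
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        return REPLACED_FLAG | bits;       /* set flag in high word */
    }

    static int is_replaced(uint64_t slot) {
        return (slot & 0xffffffff00000000ULL) == REPLACED_FLAG;
    }

    static float extract(uint64_t slot) {
        uint32_t bits = (uint32_t)slot;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }

    int main(void) {
        uint64_t slot = replace(3.14159265359);
        if (is_replaced(slot))
            printf("stored as float: %f\n", extract(slot));
        return 0;
    }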

34 Mixed Precision
Question: Which parts should be replaced?
Answer: Automatic search
– Empirical, iterative feedback loop
– User-defined verification routine
– Heuristic search optimization

35 Automated Search

36 Automated Search

37 Automated Search: keys to the search algorithm
– Depth-first search: look for replaceable larger structures first (modules, functions, blocks, etc.)
– Prioritization: inspect highly-executed routines first
A sketch of this descent appears below.
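A minimal C sketch of the depth-first strategy (my hypothetical illustration, not CRAFT's Ruby driver): try to replace a whole structure, and only descend into its children when wholesale replacement fails verification.

    #include <stdio.h>

    typedef struct node {
        const char *name;
        struct node **children;
        int nchildren;
    } node_t;

    /* Stub for the user-defined routine: in CRAFT this runs the mutated
     * binary and checks its output against a correctness criterion. */
    static int passes_verification(const node_t *n) {
        return n->nchildren == 0;       /* pretend only leaves verify */
    }

    static void search(const node_t *n) {
        if (passes_verification(n)) {
            printf("replace %s entirely\n", n->name);
            return;                     /* larger structure replaced; done */
        }
        /* A real search would also order children by execution count
         * (prioritization) before descending. */
        for (int i = 0; i < n->nchildren; i++)
            search(n->children[i]);     /* refine one level deeper */
    }

    int main(void) {
        node_t leaf1 = { "block1", NULL, 0 };
        node_t leaf2 = { "block2", NULL, 0 };
        node_t *kids[] = { &leaf1, &leaf2 };
        node_t func  = { "func", kids, 2 };
        search(&func);
        return 0;
    }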

38 Mixed Precision: Sum2PI_X (failed all-single-precision replacement)

39 Mixed Precision: Sum2PI_X
The search gives the accumulator its own type (sum_type) and tests 32-bit and 64-bit candidates per type:

    int sum2pi_x() {
        int i, j, k;
        real x, y, acc;
        sum_type sum;
        real final = PI * OUTER;
        sum = 0.0;
        for (i=0; i<OUTER; i++) {
            acc = 0.0;
            for (j=1; j<INNER; j++) {
                x = 1.0;
                for (k=0; k<j; k++)
                    x *= 2.0;
                y = (real)PI / x;
                acc += y;
            }
            sum += acc;
        }
        real err = fabs(final-sum)/fabs(final);
        if (err < EPS) printf("SUCCESSFUL!\n");
        else           printf("FAILED!!!\n");
    }

Candidates: real = 32 bits with sum_type = 32 bits fails (✗); real = 32 bits with sum_type = 64 bits is the next configuration to test (?).

40 Mixed Precision: Sum2PI_X (same code as slide 39)
With real = 32 bits and sum_type = 64 bits, verification succeeds (✔): everything except the final accumulation can run in single precision.

41 Mixed Precision: Results
SuperLU: lower error threshold = fewer replacements

    Threshold   % Executions Replaced   Final Error
    1.0e-03     99.9                    1.59e-04
    1.0e-04     87.3                    4.42e-05
    7.5e-05     52.5                    4.40e-05
    5.0e-05     45.2                    3.00e-05
    2.5e-05     26.6                    1.69e-05
    1.0e-05      1.6                    7.15e-07
    1.0e-06      1.6                    4.7e-07

42 Mixed Precision: Results
SuperLU: lower error threshold = fewer replacements

    Threshold   Instructions Replaced       Final Error
                (% Static / % Dynamic)
    1.0e-03     99.1 / 99.9                 1.59e-04
    1.0e-04     94.1 / 87.3                 4.42e-05
    7.5e-05     91.3 / 52.5                 4.40e-05
    5.0e-05     87.9 / 45.2                 3.00e-05
    2.5e-05     80.3 / 26.6                 1.69e-05
    1.0e-05     75.4 /  1.6                 7.15e-07
    1.0e-06     72.6 /  1.6                 4.7e-07

43 Mixed Precision: Results
AMGmk: highly-adaptive multigrid microkernel with built-in error tolerance
– Search found a complete replacement
– Manual conversion: speedup from 175s to 95s (1.8X) on conventional x86_64 hardware

44 Mixed Precision: Results

    Benchmark      Candidate      Configurations   Instructions Replaced
    (name.CLASS)   Instructions   Tested           (% Static / % Dynamic)
    bt.W           6,647          3,854            76.2 / 85.7
    bt.A           6,682          3,832            75.9 / 81.6
    cg.W             940            270            93.7 /  6.4
    cg.A             934            229            94.7 /  5.3
    ep.W             397            112            93.7 / 30.7
    ep.A             397            113            93.1 / 23.9
    ft.W             422             72            84.4 /  0.3
    ft.A             422             73            93.6 /  0.2
    lu.W           5,957          3,769            73.7 / 65.5
    lu.A           5,929          2,814            80.4 / 69.4
    mg.W           1,351            458            84.4 / 28.0
    mg.A           1,351            456            84.1 / 24.4
    sp.W           4,772          5,729            36.9 / 45.8
    sp.A           4,821          5,044            51.9 / 43.0

45 Mixed Precision: Results

    Benchmark      Candidate      Configurations   Instructions Replaced
    (name.CLASS)   Instructions   Tested           (% Static / % Dynamic)
    bt.W           6,228          3,934            73.7 / 83.2
    bt.A           6,262          4,000            73.6 / 78.6
    cg.W             962            251            90.7 /  7.4
    cg.A             956            255            90.4 /  5.6
    ep.W             423            117            92.0 / 47.2
    ep.A             423            114            91.7 / 45.5
    ft.W             426             75            93.0 /  0.3
    ft.A             426             74            93.0 /  0.2
    lu.W           6,038          4,117            69.2 / 57.4
    lu.A           6,014          3,057            77.1 / 57.4
    mg.W           1,393            443            89.4 / 39.2
    mg.A           1,393            437            88.8 / 36.6
    sp.W           4,458          5,124            38.1 / 40.5
    sp.A           4,507          4,920            47.9 / 30.5

46 Mixed Precision: Results

    Benchmark      Candidate      Configurations   % Dynamic
    (name.CLASS)   Instructions   Tested           Replaced
    bt.W           6,228          3,934            83.2
    bt.A           6,262          4,000            78.6
    cg.W             962            251             7.4
    cg.A             956            255             5.6
    ep.W             423            117            47.2
    ep.A             423            114            45.5
    ft.W             426             75             0.3
    ft.A             426             74             0.2
    lu.W           6,038          4,117            57.4
    lu.A           6,014          3,057            57.4
    mg.W           1,393            443            39.2
    mg.A           1,393            437            36.6
    sp.W           4,458          5,124            40.5
    sp.A           4,507          4,920            30.5

47 Mixed Precision: Results
Memory-based analysis:
– Replacement candidates: output operands
– Generally higher replacement rates
– Analysis found several valid variable-level replacements

    Benchmark      Candidate   Configurations   Operands Replaced
    (name.CLASS)   Operands    Tested           (% Static / % Dynamic)
    bt.A           2,342         300            98.3 / 97.0
    cg.A             287          68            96.2 / 71.3
    ep.A             236          59            96.2 / 37.9
    ft.A             466         108            94.2 / 46.2
    lu.A           1,742         104            98.5 / 99.9
    mg.A             597         153            95.6 / 83.4
    sp.A           1,525       1,094            94.5 / 88.9

48 Mixed Precision: Results
Memory-based analysis:
– Replacement candidates: output operands
– Generally higher replacement rates
– Analysis found several valid variable-level replacements

    Benchmark      Candidate   Configurations   % Executions
    (name.CLASS)   Operands    Tested           Replaced
    bt.A           2,342         300            97.0
    cg.A             287          68            71.3
    ep.A             236          59            37.9
    ft.A             466         108            46.2
    lu.A           1,742         104            99.9
    mg.A             597         153            83.4
    sp.A           1,525       1,094            88.9

49 Mixed Precision: Conclusions
– Automated tools can prototype mixed-precision configurations
– Automated search can provide precision-level replacement insights
– Precision analysis could provide another "knob" for application tuning
– Even if computation requires double precision, storage/communication may not

50 Contribution 4 of 4: Reduced Precision

51 Reduced Precision: simulate reduced precision with truncation
– Truncate the result after every operation
– Allows any precision from zero bits up to full double (64-bit)
– Less overhead than the mixed-precision scheme (fewer added operations)
Search routine identifies component-level precision requirements.
A sketch of the truncation primitive appears below.
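A minimal C sketch of the truncation primitive (my illustration; CRAFT applies this in-place in the binary):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Zero all but the top `bits` bits of a double's 52-bit significand,
     * simulating a result computed at reduced precision (truncation,
     * not rounding). */
    static double truncate_result(double x, int bits) {
        uint64_t u;
        memcpy(&u, &x, sizeof u);
        u &= ~((UINT64_C(1) << (52 - bits)) - 1);  /* clear low significand bits */
        memcpy(&x, &u, sizeof x);
        return x;
    }

    int main(void) {
        double y = 1.0 / 3.0;
        printf("full:    %.17g\n", y);
        printf("23 bits: %.17g\n", truncate_result(y, 23));  /* ~single precision */
        printf("0 bits:  %.17g\n", truncate_result(y, 0));   /* exponent only */
        return 0;
    }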

52 Reduced Precision: GUI
Bit-level precision requirements, shown on a scale from 0 bits through single up to double.

53 Reduced Precision: Sum2PI_X
– 0 bits (single – exponent only)
– 22 bits (single)
– 27 bits (double – overly conservative)
– 32 bits (double)

54 Reduced Precision: faster search convergence compared to mixed-precision analysis

    Benchmark   Instructions   Original Wall Time (s)   Speedup
    cg.A          956            1,305                  59.2%
    ep.A          423              978                  42.5%
    ft.A          426              825                  50.2%
    lu.A        6,014          514,332                  86.7%
    mg.A        1,393            2,898                  66.0%
    sp.A        4,507          422,371                  44.1%

55 Reduced Precision: general precision requirement profiles, ranging from low sensitivity to high sensitivity

56 Reduced Precision: Results
Profiles for NAS (top): bt.A (78.6%), mg.A (36.6%), ft.A (0.2%); and LAMMPS (bottom): chute, lj, rhodo

57 Reduced Precision: Results (NAS mg.W, incremental search)
– >5.0% – 4:66
– >1.0% – 5:93
– >0.5% – 9:45
– >0.1% – 15:45
– >0.05% – 23:60
– Full – 28:71

58 Reduced Precision: Conclusions
– Automated analysis can identify general precision-level requirements
– Reduced-precision analysis provides results more quickly than mixed-precision analysis
– Incremental searches reduce the time to solution without sacrificing fidelity

59 Contributions
– General floating-point analysis framework: 32.3K LOC total in ~200 files; LGPL on SourceForge: sf.net/p/crafthpc
– Cancellation detection: WHIST'11 paper, PARCO 39/3 article
– Mixed-precision configuration: SC'12 poster, ICS'13 paper
– Reduced-precision analysis: ICS'14 submission in preparation

60 Future Work
Short term:
– Optimization and platform ports
– Analysis extension and composition
– Further case studies
Long term:
– Compiler-based implementation
– IDE and development-cycle integration
– Program modeling and verification

61 Conclusion: Automated runtime analysis techniques can inform application developers regarding floating-point behavior, and can provide insights to guide developers towards reducing precision with minimal impact on accuracy.

62 Acknowledgements
Collaborators:
– Jeff Hollingsworth (advisor) and Pete Stewart (UMD)
– Bronis de Supinski, Matt Legendre, et al. (LLNL)
Colleagues:
– Ananta Tiwari, Tugrul Ince, Geoff Stoker, Nick Rutar, Ray Chen, et al.
– CS Department @ UMD
– Intel XED2
Family & Friends:
– Lindsay Lam (spouse)
– Neil & Alice Lam, Barry & Susan Walters
– Wallace PCA and Elkton EPC
(cartoon by Nick Rutar)

