Reproducible BLAS Wen Rui Liau Computer Science Division UC Berkeley


1 Reproducible BLAS Wen Rui Liau Computer Science Division UC Berkeley
bebop.cs.berkeley.edu/reproblas/

2 Remember Bob’s example?
(Yesterday)

3 Floating Point Operations (FLOPS)
Computation on floating point numbers (numbers with a fractional part)
Often a trade-off between precision and range
Specified by the IEEE 754 standard
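
The precision side of that trade-off can be seen directly. The sketch below is an illustration added for this transcript, not part of the slides; the helper name `to_float32` is hypothetical. It assumes Python's `float` is IEEE 754 binary64, which holds on all mainstream platforms.

```python
import struct
import sys

def to_float32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# binary64 carries 53 significant bits, so 2**53 + 1 is not representable.
print(sys.float_info.mant_dig)              # number of significand bits
print(float(2**53 + 1) == float(2**53))     # the +1 is rounded away

# binary32 carries 24 significant bits, so 2**24 + 1 is not representable.
print(to_float32(2.0**24 + 1) == 2.0**24)

# Range: the largest finite binary64 value.
print(sys.float_info.max)
```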

4 Motivation
Since roundoff makes floating point addition nonassociative, different orders of summation often give different answers.
On a parallel machine, the order of summation can vary from run to run, or even from subroutine call to subroutine call, depending on scheduling of available resources, so answers can change.
Reproducibility is important for debugging, validation, contractual obligations, …
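
A minimal sketch of the nonassociativity described above, assuming IEEE 754 double precision with round-to-nearest-even (Python's default). The function name `naive_sum` is illustrative only.

```python
def naive_sum(xs):
    """Left-to-right summation, the order a simple loop would use."""
    total = 0.0
    for x in xs:
        total += x
    return total

# Regrouping the same three numbers changes the rounded result.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)   # False

# Reordering the inputs changes the sum: the 1.0 is absorbed by 1e16
# in the first ordering but survives in the second.
print(naive_sum([1e16, 1.0, -1e16]))   # 0.0
print(naive_sum([1e16, -1e16, 1.0]))   # 1.0
```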

5 Goals (1/2)
Reproducible BLAS, and eventually higher level libraries
Reproducibility means bitwise identical answers on any computer, no matter what hardware resources are available or how they are scheduled, for any ordering of inputs that would give identical results in exact arithmetic
Reproducible exception handling too (return the same +∞, -∞, or NaN)
Assume only a limited subset of the IEEE 754 standard
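
"Bitwise identical" is a stricter test than `==`. The following sketch shows one way to check it in Python; the helper names `bits` and `bitwise_equal` are hypothetical and not part of ReproBLAS.

```python
import struct

def bits(x: float) -> str:
    """Return the 64-bit IEEE 754 pattern of x as 16 hex digits."""
    return struct.pack(">d", x).hex()

def bitwise_equal(x: float, y: float) -> bool:
    """True only if x and y have exactly the same bit pattern."""
    return bits(x) == bits(y)

nan = float("nan")
print(0.0 == -0.0, bitwise_equal(0.0, -0.0))   # equal by ==, different bits
print(nan == nan, bitwise_equal(nan, nan))     # unequal by ==, same bits
print(bits(1.0))
```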

6 Goals (2/2)
Performance/Accuracy requirements
Accuracy at least as good as conventional summation (tunable)
Do only one read-only pass over data
Do only one reduction operation
Use as little memory as possible, to enable tiling optimizations

7 Relevance to other intern projects
ReproBLAS project is low-level in nature
Potentially useful for the application-based projects
Force, Molecular dynamics: ensures consistent summation for results
Dot Product: results may be useful if the project requires access to lower-level functions

8 Project Timeline (1/2)
Summer ’17 1st Half: Learn basics of floating point arithmetic, reproducible summation, and performance programming and tuning on the Blue Waters system
Summer ’17 2nd Half: Implement reproducible BLAS on a single-node system

9 Project Timeline (2/2)
Fall ’17: Continue implementing and tuning performance of reproducible BLAS
Spring ’18: Integrate reproducible BLAS into selected LAPACK and ScaLAPACK routines, test overall reproducibility, and do performance evaluations. Prepare findings for submission to Blue Waters

10 Pre-Rounding
Talk about rounding modes
Special rounding: round to nearest, breaking ties (values exactly in between) by rounding away from 0. Not required by IEEE 754 for binary formats.
Four rounding modes: the usual one is round to nearest, with ties rounded to even. The others are always round up, always round down, and round towards 0.
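
The rounding modes above can be tried out with Python's `decimal` module. This is only an illustration in decimal arithmetic, not binary floating point; the mode labels and the helper `round_to_int` are names chosen for this sketch.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN, ROUND_HALF_UP)

# Labels follow the slide; values are the decimal module's constants.
MODES = {
    "nearest, ties to even": ROUND_HALF_EVEN,
    "nearest, ties away from 0": ROUND_HALF_UP,
    "always up": ROUND_CEILING,
    "always down": ROUND_FLOOR,
    "towards 0": ROUND_DOWN,
}

def round_to_int(value: str, mode: str) -> int:
    """Round a decimal string to an integer under the named mode."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=MODES[mode]))

for v in ("2.5", "-2.5"):
    for name in MODES:
        print(v, name, round_to_int(v, name))
```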

11 Pre-Rounding

12 Pre-Rounding Drawback: costs 2 or 3 reduction/broadcast steps

13 Pre-Rounding Costs 2 or 3 reduction/broadcast steps
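
The slides' figures for pre-rounding did not survive in this transcript, so the following is a simplified sketch of the general idea rather than the ReproBLAS algorithm. It assumes finite inputs of moderate magnitude (no overflow or underflow), and `prerounded_sum` is a hypothetical name. A first pass finds the largest magnitude, then every input is rounded to a common grid coarse enough that all additions are exact, so the order no longer matters. On a parallel machine the two passes would correspond to two reductions.

```python
import math
import random

def prerounded_sum(xs):
    """Order-independent sum via pre-rounding (simplified sketch)."""
    xs = list(xs)
    n = len(xs)
    if n == 0:
        return 0.0
    # Pass 1 (first reduction): largest magnitude among the inputs.
    biggest = max(abs(x) for x in xs)
    _, e_max = math.frexp(biggest)      # biggest < 2**e_max
    e_n = (n - 1).bit_length()          # n <= 2**e_n
    # Grid spacing q: every partial sum is an integer multiple of q
    # with magnitude at most 2**53 * q, hence exactly representable.
    q = math.ldexp(1.0, e_max + e_n - 53)
    # Pass 2 (second reduction): round each input to the grid and add.
    total = 0.0
    for x in xs:
        total += round(x / q) * q       # this addition incurs no rounding
    return total

random.seed(0)
data = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-5, 5)
        for _ in range(1000)]
first = prerounded_sum(data)
random.shuffle(data)
print(first == prerounded_sum(data))    # same result after shuffling
```

The price is the extra pass over the data to find the maximum, which is the drawback the slide points out.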

14 Indexed Summation

15 Indexed Summation
Input split into several bins
Boundaries predetermined
Top K = 2 bins are accumulated

16 Indexed Summation
Only keep top K bins; don’t compute the rest, or discard it
Only need to store accumulators for top 2 bins seen so far
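
A minimal sketch of the binning idea, under stated assumptions: inputs are finite doubles, bins are fixed 40-bit windows of bit positions, and Python integers stand in for the accumulators so that every addition is exact. The class `BinnedSum`, the function `binned_sum`, and the parameters `k` and `width` are hypothetical and do not reflect the ReproBLAS data layout.

```python
import math

class BinnedSum:
    """Keep exact accumulators for the top k bins seen so far."""

    def __init__(self, k: int = 2, width: int = 40):
        self.k, self.width = k, width
        self.mask = (1 << width) - 1
        self.top = None          # index of the highest bin seen so far
        self.acc = [0] * k       # acc[i] belongs to bin (top - i)

    def add(self, x: float) -> None:
        if x == 0.0:
            return
        m, e = math.frexp(x)                 # x = m * 2**e, 0.5 <= |m| < 1
        b = (e - 1) // self.width            # bin holding the leading bit
        if self.top is None:
            self.top = b
        elif b > self.top:                   # new top bin: drop the lowest
            self.acc = ([0] * (b - self.top) + self.acc)[: self.k]
            self.top = b
        sign = 1 if x > 0 else -1
        mant = int(math.ldexp(abs(m), 53))   # |x| = mant * 2**(e - 53)
        for i in range(self.k):
            # Extract the bits of |x| that fall inside bin (top - i).
            s = (self.top - i) * self.width - (e - 53)
            chunk = mant >> s if s >= 0 else mant << -s
            self.acc[i] += sign * (chunk & self.mask)

    def value(self) -> float:
        if self.top is None:
            return 0.0
        total = 0
        for a in self.acc:                   # combine bins, highest first
            total = (total << self.width) + a
        lowest = (self.top - self.k + 1) * self.width
        return math.ldexp(float(total), lowest)

def binned_sum(xs, k: int = 2, width: int = 40) -> float:
    s = BinnedSum(k, width)
    for x in xs:
        s.add(x)
    return s.value()

print(binned_sum([1e16, 1.0, -1e16]))
print(binned_sum([1.0, -1e16, 1e16]))
```

Because the bin boundaries are fixed in advance, the slice of each input that lands in a given bin does not depend on what was seen earlier, which is why a single pass in any order yields the same accumulators. Bits below the lowest kept bin are truncated in this sketch, so raising `k` retains more low-order bits at the cost of more accumulators.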

17 Summation Performance
Compare to gcc -O3 applied to:
res = 0;
for (j = 0; j < N; j++) { res += X[j]; }
Reproducible sum faster for large N!
Other performance data: 3.3 to 4.2x slower vs MKL dot product
N = 2^6, 2^7, ..., 2^12

