Reproducible BLAS Wen Rui Liau Computer Science Division UC Berkeley bebop.cs.berkeley.edu/reproblas/
Remember Bob’s example? (Yesterday)
Floating Point Operations (FLOPs)
- Arithmetic on floating point (decimal point) numbers
- Often a trade-off between precision and range
- Specified by the IEEE 754 standard
Motivation
- Since roundoff makes floating point addition nonassociative, different orders of summation often give different answers
- On a parallel machine, the order of summation can vary from run to run, or even subroutine-call to subroutine-call, depending on the scheduling of available resources, so answers can change
- Reproducibility is important for debugging, validation, contractual obligations, …
Goals (1/2)
- Reproducible BLAS, and eventually higher-level libraries
- Reproducibility means bitwise identical answers on any computer, no matter what hardware resources are available or how they are scheduled, for any ordering of inputs that would give identical results in exact arithmetic
- Reproducible exception handling too (return the same +∞, -∞, or NaN)
- Assume only a limited subset of the IEEE 754 standard
Goals (2/2)
Performance/accuracy requirements:
- Accuracy at least as good as conventional summation (tunable)
- Do only one read-only pass over the data
- Do only one reduction operation
- Use as little memory as possible, to enable tiling optimizations
Relevance to other intern projects
- The ReproBLAS project is low-level in nature, so it is potentially useful for the application-based projects
- Force computation, molecular dynamics: ensures consistent summation of results
- Dot product: results may be useful if a project requires access to lower-level functions
Project Timeline (1/2)
- Summer ’17, 1st half: learn the basics of floating point arithmetic, reproducible summation, and performance programming and tuning on the Blue Waters system
- Summer ’17, 2nd half: implement reproducible BLAS on a single-node system
Project Timeline (2/2)
- Fall ’17: continue implementing and tuning the performance of reproducible BLAS
- Spring ’18: integrate reproducible BLAS into selected LAPACK and ScaLAPACK routines, test overall reproducibility, and do performance evaluations; prepare findings for submission to Blue Waters
Pre-Rounding
Talk about rounding modes. IEEE 754 defines four rounding modes for binary; the usual one is round to nearest, breaking ties by rounding to even. The others are round up (toward +∞), round down (toward -∞), and round toward 0. Pre-rounding uses a special mode: round to nearest, breaking ties by rounding away from 0, which is not in IEEE 754 for binary.
Pre-Rounding Drawback: costs 2 or 3 reduction/broadcast steps
Indexed Summation
- Input is split into several bins with predetermined boundaries
- Only the top K = 2 bins are accumulated
Indexed Summation
- Only keep the top K bins; the rest are discarded or never computed
- Only need to store accumulators for the top 2 bins seen so far
Summation Performance
Compare to gcc -O3 applied to:
res = 0;
for (j = 0; j < N; j++) { res += X[j]; }
Reproducible sum faster for large N!
Other performance data: 3.3x to 4.2x slower vs. MKL dot product, N = 2.^[6:12]