CS717 Algorithm-Based Fault Tolerance Matrix Multiplication Greg Bronevetsky
Problem at Hand
  Have matrices A and B
  Want to compute their product: AB
  Ask a matrix-matrix-multiply (MMM) implementation to compute the product
  Answer: C
  Question: Is C the correct answer?
  How could we know for sure?
Algorithm-Based Fault Tolerance
  Encode input matrices via an error-correcting code
  Run the regular MMM algorithm on the encoded matrices
    –Encoding is invariant under MMM
  Naturally outputs encoded matrices
  Encoding guarantees:
    –If up to t errors in output, will detect error
    –If up to c<t errors in output, can decode correct output matrix
Outline
  Linear Error Correcting Codes
  Algorithm-Based Fault Tolerance
  ABFT = Linear Encoding of Matrices
Error Correcting Codes
  Map $f: \Sigma^k \to \Sigma^n$
    –k-long data words $\to$ n-long codewords
    –We use $\Sigma = \{0, 1\}$
  Code of length n is a “sparse” subset of $\Sigma^n$
    –Very few possible words are valid codewords
  Rate of code: $k/n$
    –Amount of information communicated by each codeword
Minimum Distance
  Minimum distance: $d_{\min} = \min_{x, y \in C,\ x \neq y} d(x, y)$
    –$d()$ = Hamming distance
  Hamming distance: number of spots where two words differ
  Measures difficulty of decoding/correcting corrupted codewords
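As a small illustration of these definitions, the sketch below computes the Hamming distance and, by brute force, the minimum distance of a code. The four-codeword code `CODE` is a made-up example and does not come from the slides.

```python
from itertools import combinations

def hamming_distance(x, y):
    """Number of positions where two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(x, y))

def minimum_distance(code):
    """Smallest Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(x, y) for x, y in combinations(code, 2))

# Hypothetical (n=5, k=2) binary code, chosen only for illustration.
CODE = ["00000", "01011", "10101", "11110"]

print(hamming_distance("01011", "10101"))  # 4
print(minimum_distance(CODE))              # 3
```

The brute-force search compares every pair of codewords, so it is only practical for very small codes.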
Detection and Correction
  Code may detect errors in up to $d_{\min} - 1$ spots
    –Fewer than $d_{\min}$ errors cannot morph one codeword into another
  May correct errors in up to $(d_{\min} - 1)/2$ spots
    –Can still find the “closest” codeword
  More details later…
  Each codeword defines a circle around itself of radius $d_{\min}/2$
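A minimal sketch of what "detect" and "correct" mean in practice, assuming a made-up code with $d_{\min} = 3$ (so it should detect up to 2 flipped bits and correct 1). The code `CODE` and the helper names are hypothetical.

```python
def hamming_distance(x, y):
    """Number of positions where two equal-length words differ."""
    return sum(a != b for a, b in zip(x, y))

# Hypothetical (n=5, k=2) code with d_min = 3.
CODE = ["00000", "01011", "10101", "11110"]

def detect(word):
    """True if the received word is not a valid codeword."""
    return word not in CODE

def correct(word):
    """Return the closest codeword (brute-force nearest-neighbour decoding)."""
    return min(CODE, key=lambda c: hamming_distance(c, word))

received = "01111"        # codeword 01011 with one flipped bit
print(detect(received))   # True
print(correct(received))  # 01011
```

With two flipped bits the word is still flagged by `detect`, but `correct` may return the wrong codeword, which matches the gap between the detection and correction limits.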
Linear Codes
  Codewords form a linear subspace inside $\Sigma^n$
  Codewords lie in the row space of a generator matrix G
  Example G: an (n=7, k=3) code
Property 1
  Linear combination of any codewords is also a codeword:
    –For any $x, y \in C$, $(x + y) \in C$
  Codeword * constant is a codeword:
    –For any $z \in C$ and constant $k$, $k \cdot z \in C$
  Proof: basic properties of linear spaces
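The following sketch encodes messages with a generator matrix over {0, 1} and checks Property 1 by brute force. The matrix `G` is a hypothetical (n=7, k=3) example in the form [I | P]; it is an assumption for illustration and is not claimed to be the matrix shown on the slide.

```python
from itertools import product

# Hypothetical generator matrix of a binary (n=7, k=3) code, form [I | P].
G = [
    [1, 0, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 1],
]

def encode(message, generator):
    """Codeword = message * G, with arithmetic mod 2."""
    n = len(generator[0])
    return [sum(m * row[j] for m, row in zip(message, generator)) % 2
            for j in range(n)]

# All 2^k codewords: one per k-bit message.
codewords = [encode(m, G) for m in product([0, 1], repeat=len(G))]

def add(x, y):
    """Component-wise sum mod 2."""
    return [(a + b) % 2 for a, b in zip(x, y)]

# Property 1: the sum of any two codewords is again a codeword.
closed = all(add(x, y) in codewords for x in codewords for y in codewords)

print(encode([1, 0, 1], G))  # [1, 0, 1, 1, 0, 1, 0]
print(len(codewords))        # 8
print(closed)                # True
```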
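Property 2
  Minimum distance of linear code = $\min_{x \in C,\ x \neq 0} \mathrm{weight}(x)$
  Where $\mathrm{weight}(x)$ = number of nonzero entries of $x$
  Proof: $d(x, y) = \mathrm{weight}(x - y)$, and $x - y \in C$ by Property 1, so the minimum over pairs of distinct codewords equals the minimum over nonzero codewords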
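Property 2 can be checked numerically on a small code. The sketch below assumes a hypothetical (n=7, k=3) generator matrix `G` (not the slide's own example) and compares the minimum nonzero weight against the minimum pairwise distance.

```python
from itertools import combinations, product

# Hypothetical generator matrix of a binary (n=7, k=3) code.
G = [
    [1, 0, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 1],
]

def encode(message, generator):
    """Codeword = message * G, with arithmetic mod 2."""
    n = len(generator[0])
    return [sum(m * row[j] for m, row in zip(message, generator)) % 2
            for j in range(n)]

def weight(x):
    """Number of nonzero entries."""
    return sum(1 for v in x if v != 0)

def distance(x, y):
    """Hamming distance."""
    return sum(a != b for a, b in zip(x, y))

codewords = [encode(m, G) for m in product([0, 1], repeat=len(G))]

min_weight = min(weight(c) for c in codewords if any(c))
min_dist = min(distance(x, y) for x, y in combinations(codewords, 2))

print(min_weight, min_dist)  # 4 4
```

The weight-based search scans the codewords once, while the distance-based one scans every pair, which is why Property 2 is useful in practice.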
Parity Check Matrix
  H: dual matrix to G
    –Contains a basis of the space orthogonal to G’s row space
    –(n-k)-dimensional space, so H is (n-k)×n
  Space defined as: $\{y \in \Sigma^n : Gy = 0\}$; equivalently, $x$ is a codeword iff $Hx = 0$
  Note: H also defines a linear code
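Property 3
  $d_{\min}$ = min # of columns of H that can sum to 0
  Proof: $Hx$ is the sum of the columns of H at the positions where $x$ is 1, and $x$ is a codeword iff $Hx = 0$; so a codeword of weight $w$ is exactly a set of $w$ columns summing to 0, and by Property 2 the smallest such $w$ is $d_{\min}$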
Property 4
  Minimum distance of linear code $\le n-k+1$
  Proof
    –Total n dimensions (since codewords are n-vectors)
    –G’s row space rank = k
    –Thus, H’s column space rank = n-k
    –Thus, any n-k+1 columns will be linearly dependent
      Some of them add up to 0
    –By Property 3, this is $\ge d_{\min}$
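Properties 3 and 4 can be checked by brute force on a small parity check matrix. The matrix `H` below is a hypothetical (n-k)×n = 4×7 example of the form [Pᵀ | I]; it is an assumption for illustration, not the slides' own matrix.

```python
from itertools import combinations

# Hypothetical parity check matrix of a binary (n=7, k=3) code.
H = [
    [1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]
N, K = 7, 3

def columns_sum_to_zero(cols):
    """True if the chosen columns of H add up to the zero vector mod 2."""
    return all(sum(row[j] for j in cols) % 2 == 0 for row in H)

def min_dependent_columns():
    """Smallest number of columns of H that sum to 0 (Property 3)."""
    for size in range(1, N + 1):
        if any(columns_sum_to_zero(c) for c in combinations(range(N), size)):
            return size
    return None

d_min = min_dependent_columns()
print(d_min)               # 4
print(d_min <= N - K + 1)  # True (Property 4)
```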
Outline
  Linear Error Correcting Codes
  Algorithm-Based Fault Tolerance
  ABFT = Linear Encoding of Matrices
Encoding a Matrix
  Algorithm-Based Fault Tolerance introduced by Huang and Abraham in 1984
  Encode each row of matrix via extra column
    –Column entries = sums of matrix rows
Encoding a Matrix
  Encode each column of matrix via extra row
    –Row entries = sums of matrix columns
  Full Encoding: append both the extra column and the extra row (the corner entry is the sum of all entries)
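The three encodings can be sketched with plain lists of lists. The function names are illustrative choices, and the example uses integers so that checksums compare exactly; floating-point data would need a tolerance.

```python
def row_encode(a):
    """Append a checksum column: each new entry is the sum of its row."""
    return [list(row) + [sum(row)] for row in a]

def column_encode(a):
    """Append a checksum row: each new entry is the sum of its column."""
    return [list(row) for row in a] + [[sum(col) for col in zip(*a)]]

def full_encode(a):
    """Append both; the corner entry ends up as the sum of all entries."""
    return column_encode(row_encode(a))

A = [[1, 2], [3, 4]]
print(row_encode(A))     # [[1, 2, 3], [3, 4, 7]]
print(column_encode(A))  # [[1, 2], [3, 4], [4, 6]]
print(full_encode(A))    # [[1, 2, 3], [3, 4, 7], [4, 6, 10]]
```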
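Detecting Errors
  Suppose matrix A is corrupted to matrix $\hat{A}$
    –entry $\hat{a}_{i,j}$ is wrong
  Can detect error’s exact position (A is m×n, with checksum column n+1 and checksum row m+1):
    –Row $i$ fails its checksum: $\sum_{k=1}^{n} \hat{a}_{i,k} \neq \hat{a}_{i,n+1}$
    –Column $j$ fails its checksum: $\sum_{k=1}^{m} \hat{a}_{k,j} \neq \hat{a}_{m+1,j}$
    –The error sits at the intersection $(i, j)$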
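Correcting Errors
  Can correct error using row or col checksum (A is m×n, with checksum column n+1 and checksum row m+1):
    –From row $i$: $a_{i,j} = \hat{a}_{i,n+1} - \sum_{k \neq j} \hat{a}_{i,k}$
    –From column $j$: $a_{i,j} = \hat{a}_{m+1,j} - \sum_{k \neq i} \hat{a}_{k,j}$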
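A minimal sketch of single-error location and correction on a fully-encoded matrix. It assumes exactly one corrupted data entry (not a corrupted checksum) and integer entries; the function names are hypothetical.

```python
def locate_error(f):
    """Return (i, j) of the single bad data entry, or None if all checksums hold."""
    m, n = len(f) - 1, len(f[0]) - 1  # size of the data part
    bad_rows = [i for i in range(m) if sum(f[i][:n]) != f[i][n]]
    bad_cols = [j for j in range(n)
                if sum(f[k][j] for k in range(m)) != f[m][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("pattern is not a single data-entry error")

def correct_error(f):
    """Rebuild the bad entry from its row checksum; returns a new matrix."""
    fixed = [list(row) for row in f]
    pos = locate_error(f)
    if pos is not None:
        i, j = pos
        n = len(f[0]) - 1
        fixed[i][j] = f[i][n] - sum(f[i][k] for k in range(n) if k != j)
    return fixed

good = [[1, 2, 3], [3, 4, 7], [4, 6, 10]]  # full encoding of [[1, 2], [3, 4]]
corrupted = [[1, 2, 3], [9, 4, 7], [4, 6, 10]]

print(locate_error(corrupted))   # (1, 0)
print(correct_error(corrupted))  # [[1, 2, 3], [3, 4, 7], [4, 6, 10]]
```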
Big Trick: Preservation of Encoding
  Column-encoded mtx * Row-encoded mtx = Fully-encoded mtx
  Can check MMM computation by checking encoding of output
  If product matrix has an erroneous entry
    –Can detect
    –Can correct
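The preservation claim can be checked on a small example: multiply a column-encoded A by a row-encoded B and compare against the full encoding of AB. The helper names and the 2×2 inputs are illustrative assumptions.

```python
def matmul(x, y):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]

def row_encode(a):
    """Append a checksum column of row sums."""
    return [list(row) + [sum(row)] for row in a]

def column_encode(a):
    """Append a checksum row of column sums."""
    return [list(row) for row in a] + [[sum(col) for col in zip(*a)]]

def full_encode(a):
    return column_encode(row_encode(a))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

C_full = matmul(column_encode(A), row_encode(B))
print(C_full)                                # [[19, 22, 41], [43, 50, 93], [62, 72, 134]]
print(C_full == full_encode(matmul(A, B)))   # True
```

The multiplication routine itself is unchanged; only its inputs carry the extra checksum row and column.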
Applications
  Matrix Multiplication
    –Given encoded A and B,
    –Check whether MMM result C (?=AB) has valid encoding
  Matrix Factorization
    –Given a factorization A=WZ
    –Verify correctness by verifying encodings of factors
      Factors row- OR column-encoded
      Can only detect, not correct errors
Weighted ABFT
  Oftentimes need to check row- or column-encoded matrices
    –Ex: factorization, data integrity check
  Can only detect errors in such matrices
  Can we also correct?
  Yes, by generalizing to weighted checking rows/columns
Weighting
  Suppose we have d n-vectors $w_1 \ldots w_d$
  Can column-encode matrix A: append the d rows $w_1^T A, \ldots, w_d^T A$ below A
  Let’s try out:
Weighted Error Detection
Weighted Error Correction
  Weighted encoding detects and corrects single errors
    –Even for non-full encoding
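A sketch of single-error correction with two weighted checksums on one column. The weights are an assumption made for this example, w_1 = (1, 1, …, 1) and w_2 = (1, 2, …, n); the slides' own choice of weights is not preserved. The sketch also assumes the error hits a data entry rather than a checksum, and uses integers so comparisons are exact.

```python
def weighted_encode(col):
    """Append s1 = sum(a_i) and s2 = sum(i * a_i), with i counted from 1."""
    s1 = sum(col)
    s2 = sum((i + 1) * v for i, v in enumerate(col))
    return list(col) + [s1, s2]

def weighted_correct(enc):
    """Locate and fix a single corrupted data entry; returns a new list."""
    data, s1, s2 = enc[:-2], enc[-2], enc[-1]
    d1 = sum(data) - s1                                     # error magnitude
    d2 = sum((i + 1) * v for i, v in enumerate(data)) - s2  # position * magnitude
    if d1 == 0 and d2 == 0:
        return list(enc)                                    # checksums hold
    if d1 == 0 or d2 % d1 != 0 or not 1 <= d2 // d1 <= len(data):
        raise ValueError("pattern is not a single data-entry error")
    position = d2 // d1 - 1                                 # back to 0-based index
    fixed = list(enc)
    fixed[position] -= d1
    return fixed

encoded = weighted_encode([4, 7, 1])
print(encoded)                      # [4, 7, 1, 12, 21]

corrupted = [4, 10, 1, 12, 21]      # entry at index 1 changed from 7 to 10
print(weighted_correct(corrupted))  # [4, 7, 1, 12, 21]
```

The ratio of the two checksum discrepancies gives the error position, which is what a single unweighted checksum cannot provide.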
Outline
  Linear Error Correcting Codes
  Algorithm-Based Fault Tolerance
  ABFT = Linear Encoding of Matrices
“Surprise”
  But this is all just a linear code!
  Generator matrix for above scheme: $G = [\, I_n \mid w_1 \; w_2 \; \cdots \; w_d \,]$
Generating Encodings
  Given $m = [m_1, \ldots, m_n]$ as message word (or matrix row/column)
  $mG = [\, m_1, \ldots, m_n, \; m \cdot w_1, \ldots, m \cdot w_d \,]$
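To make the link between checksums and generator matrices concrete, the sketch below builds a generator matrix as an identity block followed by one column per weight vector, then encodes a message by a row-vector-times-matrix product. The weight vectors in `W` are assumed for illustration and are not taken from the slides.

```python
def checksum_generator(weights, n):
    """Identity block I_n followed by one column per weight vector."""
    return [[1 if i == j else 0 for j in range(n)] + [w[i] for w in weights]
            for i in range(n)]

def encode(m, g):
    """Row vector times matrix, over ordinary integers."""
    return [sum(mi * row[j] for mi, row in zip(m, g))
            for j in range(len(g[0]))]

W = [[1, 1, 1], [1, 2, 3]]  # assumed weights: plain sum and position-weighted sum
G = checksum_generator(W, 3)

print(G)                     # [[1, 0, 0, 1, 1], [0, 1, 0, 1, 2], [0, 0, 1, 1, 3]]
print(encode([4, 7, 1], G))  # [4, 7, 1, 12, 21]
```

The message appears unchanged in the first n positions, and the trailing entries are exactly the weighted checksums.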
Surprise??
  Not too surprising really
  Why else would MMM preserve encoding?
  Another possibility:
    –Efficient: can be implemented via bit shifts
  Room open for using any linear code!
Error Detection/Correction in General
  To show for linear codes:
    –Can detect fewer than $d_{\min}$ errors
    –Can correct $(d_{\min} - 1)/2$ errors
  Let $x$ be original codeword
  Let $\hat{x} = x + e$ be the corrupted codeword
    –e: error vector
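Error Detection in General
  For corrupted codeword $\hat{x} = x + e$: $s = H\hat{x} = H(x + e) = Hx + He = He$
    –s called the “syndrome vector”
    –Independent of original codeword
  Note: $\mathrm{weight}(e) < d_{\min}$ since $< d_{\min}$ errors
  Thus: $He \neq 0$ (Property 3)
  Detection: if $s \neq 0$, then ERROR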
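The sketch below computes syndromes over {0, 1} and shows that the syndrome depends only on the error vector, not on the codeword. The parity check matrix `H` and the two codewords are a hypothetical (n=7, k=3) example, assumed for illustration.

```python
# Hypothetical parity check matrix of a binary (n=7, k=3) code.
H = [
    [1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]

def syndrome(word):
    """s = H * word, with arithmetic mod 2."""
    return [sum(h * x for h, x in zip(row, word)) % 2 for row in H]

def add(x, e):
    """Component-wise sum mod 2 (applies error vector e to word x)."""
    return [(a + b) % 2 for a, b in zip(x, e)]

c1 = [1, 0, 0, 1, 1, 0, 1]  # two codewords of this code
c2 = [1, 0, 1, 1, 0, 1, 0]
e = [0, 1, 0, 0, 0, 0, 0]   # single flipped bit

print(syndrome(c1))          # [0, 0, 0, 0]
print(syndrome(add(c1, e)))  # [1, 0, 1, 1]
print(syndrome(add(c2, e)))  # [1, 0, 1, 1]
```

For a single flipped bit the syndrome equals the corresponding column of H, here the second column.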
Error Correction in General
  Clearly e is the correction vector
    –$\hat{x} - e$ corrects the error in $\hat{x}$
  Sufficient to prove: for $\mathrm{weight}(e) \le (d_{\min} - 1)/2$,
  H is an isomorphism: correction vectors $\leftrightarrow$ syndrome vectors
    –i.e. for each correction vector (want to know) there is a unique syndrome vector (known)
  Thus, possible to correct any such error
    –may not be efficient
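H is Onto
  $\mathrm{weight}(e) \le (d_{\min} - 1)/2 < d_{\min}$
  $\mathrm{rank}(H) = n - k \ge (d_{\min} - 1)/2$
  Thus, $\mathrm{rank}(H) \ge \mathrm{weight}(e)$ and $He \neq 0$
    –Not enough 1’s in e to sum H’s columns to 0
  H maps onto its range
  Thus, every correction vector has a nonzero syndrome vector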
H is 1-1
  Let $e_1$ and $e_2$ be correction vectors, $e_1 \neq e_2$
  Suppose that:
    –$\mathrm{weight}(e_1), \mathrm{weight}(e_2) \le (d_{\min} - 1)/2$
    –$He_1 = He_2 = s$
  $He_1 - He_2 = H(e_1 - e_2) = s - s = 0$
  And so, $(e_1 - e_2)$ is a nonzero codeword
  Thus, $\mathrm{weight}(e_1 - e_2) \ge d_{\min}$
  But $\mathrm{weight}(e_1), \mathrm{weight}(e_2) \le (d_{\min} - 1)/2$ and so $\mathrm{weight}(e_1 - e_2) \le d_{\min} - 1$
  Contradiction! So $e_1 = e_2$
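The one-to-one argument is what makes table-based decoding possible. The sketch below builds a syndrome-to-error table for all error vectors of weight at most (d_min - 1)/2 and uses it to correct a word. It assumes a hypothetical (n=7, k=3) code with d_min = 4, so only single-bit errors are tabulated; the matrix and names are illustrative.

```python
from itertools import combinations

# Hypothetical parity check matrix of a binary (n=7, k=3) code with d_min = 4.
H = [
    [1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]
N = 7
T = 1  # (d_min - 1) // 2 for this code

def syndrome(word):
    """s = H * word, with arithmetic mod 2."""
    return [sum(h * x for h, x in zip(row, word)) % 2 for row in H]

def build_table():
    """Map each syndrome to its unique error vector of weight <= T."""
    table = {}
    for w in range(T + 1):
        for positions in combinations(range(N), w):
            e = [1 if i in positions else 0 for i in range(N)]
            s = tuple(syndrome(e))
            if s in table:
                raise ValueError("two correction vectors share a syndrome")
            table[s] = e
    return table

TABLE = build_table()

def correct(word):
    """Look up the error vector for the word's syndrome and remove it."""
    e = TABLE.get(tuple(syndrome(word)))
    if e is None:
        raise ValueError("more errors than this table can correct")
    return [(x + b) % 2 for x, b in zip(word, e)]

print(len(TABLE))                        # 8
print(correct([1, 1, 0, 1, 1, 0, 1]))    # [1, 0, 0, 1, 1, 0, 1]
```

The table has one entry per tabulated error vector (the zero vector plus seven single-bit errors), which is the 1-1 property observed on this example. Its size grows quickly with n and the number of correctable errors, in line with the remark that this approach may not be efficient.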
Other Encoding Schemes
  Linear codes preserved by matrix multiplication
  Presumably, fancier codes might be preserved by fancier computations
  Limit:
    –S. Winograd showed in 1962 that any code s.t. $f(x \circ y) = f(x) \circ f(y)$ has rate $(k/n) \to 0$ or minimum weight $\to 0$ as $k \to \infty$
  How general can we get?
  Do good solutions exist for small k?
    –k=64 bits should be good enough
Summary
  For Matrix Multiplication can encode input via linear codes
  Solutions exist for more complex codes
    –Ex: Fourier Transforms
  On parallel systems must ensure:
    –No processor touches >1 element per row/column
    –Else, if one processor fails, encoding overwhelmed with errors
    –To ensure this must modify algorithm
      Separate check placement theory