1 Hardware Assisted Solution of Huge Systems of Linear Equations. Adi Shamir, Computer Science and Applied Math Dept., The Weizmann Institute of Science. Joint work with Eran Tromer. Hebrew University, 6/3/06

2 Cryptanalysis is Evil A simple mathematical proof: The definition of throughput cost implies that cryptanalysis = time × money. Since time = money, cryptanalysis = money². Since money is the root of evil, money² = evil. Consequently, cryptanalysis = evil.

3 Cryptanalysis is Useful: Raises interesting mathematical problems Requires highly optimized algorithms Leads to new computer architectures Motivates the development of new technologies Stretches the limits on what’s doable

4 Motivation for this Talk: Integer Factorization and Cryptography RSA uses n as the public key and p,q as the secret key in encrypting and signing messages. The size of the public key n determines the efficiency and the security of the scheme. Multiplying the large primes p,q into the product n = p × q is easy; factoring n back into p,q is hard.

5 Factorization Records: A 256 bit key used by French banks in the 1980s. A 512 bit key factored in 1999. A 663 bit (200 digit) key factored in 2005. The big challenge: factoring 1024 bit keys.

6 Improved Factorization Results Can be derived from: New algorithms for standard computers (which are better than the Number Field Sieve) Non-conventional computation models (unlimited word length, quantum computers) Faster standard devices (Moore’s law) New implementations of known algorithms on special purpose hardware (wafer scale integration, optoelectronics, …)

7 Bicycle chain sieve [D. H. Lehmer, 1928]

8 The Basic Idea of the Number Field Sieve (NFS) Factoring Algorithm: To factor n: Find “random” r₁, r₂ such that r₁² ≡ r₂² (mod n). Hope that gcd(r₁−r₂, n) is a nontrivial factor of n. How? Let f₁(a) = a² mod n. These values are squares mod n, but not over ℤ. Find a nonempty set S ⊆ ℤ such that the product of the values f₁(a) for a ∈ S is a square over the integers. This gives us the two “independent” representations r₁² ≡ r₂² (mod n).
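
As a tiny illustration of why a congruence of squares yields a factor, here is a hedged Python sketch with a made-up toy modulus (the talk's n would be a 1024-bit RSA modulus):

```python
from math import gcd

# Toy illustration (values chosen by hand): n = 7 * 13, and
# 10^2 = 100 = 9 = 3^2 (mod 91), so r1 = 10, r2 = 3 is a congruence of squares.
n = 91
r1, r2 = 10, 3

assert (r1 * r1 - r2 * r2) % n == 0
factor = gcd(r1 - r2, n)    # gcd(7, 91) = 7, a nontrivial factor of n
print(factor, n // factor)  # -> 7 13
```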

9 Some Technical Details: How to find S such that ∏_{a∈S} f₁(a) is a square? Consider all the primes smaller than some bound B, and find many integers a for which f₁(a) is B-smooth. For each such a, represent the factorization of f₁(a) as a vector of b exponents: f₁(a) = 2^e₁ · 3^e₂ · 5^e₃ · 7^e₄ ⋯ ↦ (e₁, e₂, ..., e_b). Once b+1 such vectors are found, find a dependency modulo 2 among them. That is, find S such that ∏_{a∈S} f₁(a) = 2^e₁ · 3^e₂ · 5^e₃ · 7^e₄ ⋯ where the eᵢ are all even. Finding the smooth values is the sieving step; finding the dependency is the matrix step.

10 A Simple Example: We want to find a subset S such that ∏_{a∈S} f₁(a) is a square. Look at the factorizations of the smooth values f₁(a), which factor completely into a product of small primes: f₁(0)=102, f₁(1)=33, f₁(2)=1495, f₁(3)=84, f₁(4)=616, f₁(5)=145, f₁(6)=42. [The slide highlights a subset of these values whose product is a square, because all the exponents in its factorization are even.]
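
To make this step concrete, here is a minimal Python sketch on my own toy data (not the slide's values): write each smooth value's prime-exponent vector mod 2 and find a dependency by Gaussian elimination over GF(2); the subset it returns has a square product.

```python
from math import isqrt

def factor_exponents(x, primes):
    """Trial-divide x over the factor base; return its exponent vector, or None if not smooth."""
    exps = [0] * len(primes)
    for i, p in enumerate(primes):
        while x % p == 0:
            x //= p
            exps[i] += 1
    return exps if x == 1 else None

def find_square_subset(values, primes):
    """Return a subset of `values` whose product is a perfect square, by finding
    a dependency mod 2 among the exponent vectors (linear algebra over GF(2))."""
    basis = {}  # pivot bit -> (reduced exponent-vector bitmask, subset bitmask)
    for idx, v in enumerate(values):
        exps = factor_exponents(v, primes)
        if exps is None:
            continue                      # not smooth: ignore, as the sieving step would
        row = sum((e & 1) << i for i, e in enumerate(exps))
        subset = 1 << idx
        while row:
            pivot = row & -row            # lowest set bit
            if pivot not in basis:
                basis[pivot] = (row, subset)
                break
            brow, bsub = basis[pivot]
            row ^= brow                   # eliminate the pivot (Gaussian elimination mod 2)
            subset ^= bsub
        else:
            # row reduced to zero: these values' exponents sum to an all-even vector
            return [values[i] for i in range(len(values)) if subset >> i & 1]
    return None

primes = [2, 3, 5, 7, 11, 13]             # toy factor base
values = [10, 24, 35, 52, 54, 21, 60]     # made-up "smooth" values standing in for the f1(a)
S = find_square_subset(values, primes)
product = 1
for v in S:
    product *= v
print(S, product)                         # e.g. [24, 54], whose product 1296 = 36^2
assert isqrt(product) ** 2 == product
```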

11 The Matrix Step in the Factorization of 1024 Bit Numbers can thus be Reduced to the following problem: Find a nonzero x satisfying Ax=0 over GF(2), where: A is huge (#rows/columns larger than 2³⁰), A is sparse (#1's per row less than 100), A is “random” (with no usable structure). The algorithm can occasionally fail.

12 Solving linear equations and finding kernel vectors are computationally equivalent problems: Let A’ be a non-singular matrix. Let A be the singular matrix defined by adding b as a new column and adding zeroes as a new row in A’. Consider any nonzero kernel vector x satisfying Ax=0. Since A’ is non-singular, the last entry of x cannot be zero, so we can use the other entries x’ of x to describe b as a linear combination of the columns of A’, thus solving A’x’=b.
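
A tiny sketch of this reduction over GF(2), with a made-up 3×3 example (not from the slides):

```python
import numpy as np

# Toy illustration of the reduction: append b as an extra column and a zero row to a
# non-singular A', then read a solution of A'x' = b off any nonzero kernel vector of A.
A_prime = np.array([[1, 0, 1],
                    [0, 1, 1],
                    [1, 1, 1]], dtype=np.uint8)   # non-singular over GF(2)
b = np.array([1, 0, 1], dtype=np.uint8)

A = np.zeros((4, 4), dtype=np.uint8)
A[:3, :3] = A_prime
A[:3, 3] = b                                       # A = [[A' b], [0 0]]

# Find a nonzero kernel vector of A by brute force (fine for a 4x4 toy example).
for bits in range(1, 16):
    x = np.array([(bits >> i) & 1 for i in range(4)], dtype=np.uint8)
    if not (A @ x % 2).any():
        break

# Since A' is non-singular, the last entry of x must be 1, so x' = x[:3] solves A'x' = b.
x_prime = x[:3]
assert x[3] == 1
assert ((A_prime @ x_prime) % 2 == b).all()
print(x_prime)                                     # -> [1 0 0]
```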

13 Solving linear equations and finding kernel vectors are computationally equivalent problems:

14 Standard Equation Solving Techniques Are Impractical: Elimination and factoring techniques quickly fill up the sparse initial matrix Approximation techniques do not work mod 2 Many specialized techniques do not work for random matrices Random access to matrix elements is too expensive Parallel processing is impractical unless restricted to 2D grids with local communication between neighbors

15 Practical limits on what we can do: We can use linear space to store the original sparse matrix (with 100×2³⁰ “1” entries), along with several dense vectors, but not the quadratic number of entries in a dense matrix of this size (with 2⁶⁰ entries). We can use a quadratic time algorithm (2⁶⁰ steps) but not a cubic time algorithm (2⁹⁰ steps) to solve the linear algebra problem. We have to use special purpose hardware and highly optimized algorithms.

16 Wiedemann’s Algorithm to find a vector u in the kernel of a singular A: Theorem: Let p(x) be the characteristic polynomial of the matrix A. Then p(A)=0. Lemma: If A is singular, 0 is a root of p(x), and thus p(x)=xq(x). Corollary: Aq(A)=p(A)=0, and thus for any vector v, u=q(A)v is in the kernel of A.

17 Wiedemann’s Algorithm to find a vector u in the kernel of a singular A: We have thus reduced the problem of finding a kernel vector of a sparse A to the problem of finding its characteristic polynomial. The basic idea: Consider the infinite sequence of powers of A: A, A², A³, ..., A^d, ... (mod 2). If A has dimension d×d, we know that its characteristic polynomial has degree at most d, and thus the matrices in any window of length d satisfy the same linear recurrence relation mod 2.

18 Wiedemann’s Algorithm to find a vector u in the kernel of a singular A: The matrices A^i are too large and dense, so we cannot compute and store them. To overcome this problem, pick a random vector v and compute the first bit of each vector Av, A²v, A³v, ..., A^d v, ... (mod 2). These bits satisfy the same linear recurrence relation of length d (mod 2).

19 Wiedemann’s Algorithm to find a nonzero vector in the kernel of a singular A: Given a binary sequence of length d, we can find its shortest linear recurrence by using the Berlekamp-Massey algorithm, which runs in quadratic time using linear space. To compute the sequence of vectors Av, A²v, A³v, ..., A^d v, ... (mod 2), we have to repeatedly compute the product of the sparse matrix A with the previous dense vector. Each matrix-vector product requires linear time and space. Since we have to compute d such products but retain only the top bit from each one of them, the whole computation requires quadratic time and linear space.
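
The slides' description can be turned into a small software sketch. This is my own simplified scalar version on a made-up toy matrix, not the blocked computation that runs on the mesh: it computes the projected bit sequence, finds its shortest recurrence with Berlekamp-Massey over GF(2), and extracts a kernel vector from the recurrence polynomial; it retries on unlucky random choices, matching the note that the algorithm can occasionally fail.

```python
import numpy as np

rng = np.random.default_rng(1)

def berlekamp_massey_gf2(s):
    """Shortest linear recurrence of the bit sequence s over GF(2).
    Returns c = [1, c1, ..., cL] with s[j] = c1*s[j-1] + ... + cL*s[j-L] (mod 2) for j >= L."""
    c, b, L, m = [1], [1], 0, 1
    for n in range(len(s)):
        d = s[n]
        for i in range(1, L + 1):
            d ^= c[i] & s[n - i]
        if d == 0:
            m += 1
            continue
        t = c[:]
        c = c + [0] * max(0, len(b) + m - len(c))
        for i, bi in enumerate(b):
            c[i + m] ^= bi
        if 2 * L <= n:
            L, b, m = n + 1 - L, t, 1
        else:
            m += 1
    return (c + [0] * max(0, L + 1 - len(c)))[:L + 1]

def wiedemann_kernel_vector(A, max_tries=50):
    """Try to find a nonzero x with Ax = 0 (mod 2) for a singular A, following the
    scalar version sketched in the slides; retries on unlucky random choices."""
    d = A.shape[0]
    for _ in range(max_tries):
        u = rng.integers(0, 2, d)                    # random projection ("first bit" in the slides)
        v = rng.integers(0, 2, d)
        seq, w = [], v.copy()
        for _ in range(2 * d):                       # 2d terms so the recurrence is determined
            seq.append(int(u @ w % 2))
            w = A @ w % 2
        c = berlekamp_massey_gf2(seq)
        L = len(c) - 1
        f = c[::-1]                                  # f(x) = x^L + c1*x^(L-1) + ... + cL, low order first
        s = 0
        while s <= L and f[s] == 0:                  # factor out x^s, leaving g with g(0) = 1
            s += 1
        if s == 0 or s > L:
            continue                                 # no factor of x: retry with new u, v
        g = f[s:]
        w = np.zeros(d, dtype=np.int64)              # w = g(A) v
        power = v.copy()
        for coef in g:
            if coef:
                w ^= power
            power = A @ power % 2
        if not w.any():
            continue
        for _ in range(s):                           # A^s w should be 0; walk towards the kernel
            nxt = A @ w % 2
            if not nxt.any():
                return w                             # nonzero and A w = 0
            w = nxt
    return None

# Demo on a made-up singular matrix over GF(2): copy a row to force a dependency.
d = 12
A = rng.integers(0, 2, (d, d))
A[-1] = A[0]
x = wiedemann_kernel_vector(A)
print(x)
assert x is not None and x.any() and not (A @ x % 2).any()
```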

20 How to Perform the Sparse Matrix / Dense Vector Products Daniel Bernstein’s observations (2001): On a single-processor computer, storage dominates cost but is poorly utilized. Sharing the input among multiple processors can lead to collisions and propagation delays. Solution: use a mesh-based architecture with a small processor attached to each memory cell, and use a mesh sorting algorithm to perform the matrix/vector multiplications.

21 Matrix-by-vector multiplication [Figure: the sparse matrix A multiplied by a dense vector of unknown bits; each output bit is the sum (mod 2) of the matrix entries in a row selected by the 1 bits of the vector.]

22 Is the mesh sorting idea optimal? The fastest known mesh sorting algorithm for an m×m mesh requires about 3m steps, but it is too complicated to implement on simple processors. Bernstein used the Schimmler mesh sorting algorithm three times to complete each matrix-vector product. For an m×m mesh, this requires about 24m steps. We proposed to replace the mesh sorting by mesh routing, which is both simpler and faster. We use only one full routing, which requires 2m steps.

23 A routing-based circuit for the matrix step [Lenstra, Shamir, Tomlinson, Tromer 2002] Model: two-dimensional mesh, nodes connected to ≤ 4 neighbours. Preprocessing: load the non-zero entries of A into the mesh, one entry per node. The entries of each column are stored in a square block of the mesh, along with a “target cell” for the corresponding vector bit.

24 Operation of the routing-based circuit To perform a multiplication: Initially the target cells contain the vector bits. These are locally broadcast within each block (i.e., within the matrix column). A cell containing a row index i that receives a “1” emits an ⟨i⟩ value (which corresponds to a 1 at row i). Each ⟨i⟩ value is routed to the target cell of the i-th block (which is collecting ⟨i⟩’s for row i). Each target cell counts (mod 2) the number of ⟨i⟩ values it received. That’s it! Ready for the next iteration.
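
Ignoring the physical routing, the logical effect of one such multiplication can be modeled in a few lines of Python (a hedged software model of the data flow, not of the circuit; the toy matrix is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def mesh_multiply(entries, v, d):
    """Logical model of one multiplication on the routing mesh: each nonzero entry (i, j)
    of A sits in the block of column j; when that block's vector bit is 1 it emits an <i>
    value, and the target cell for row i accumulates the parity of the <i> values it gets."""
    parity = [0] * d                      # one "target cell" per row
    for i, j in entries:
        if v[j]:                          # vector bit broadcast within column block j
            parity[i] ^= 1                # an <i> value arrives at row i's target cell
    return np.array(parity, dtype=np.uint8)

# Made-up toy instance: a small sparse 0/1 matrix and a dense vector.
d = 8
A = (rng.random((d, d)) < 0.2).astype(np.uint8)
entries = [(i, j) for i in range(d) for j in range(d) if A[i, j]]
v = rng.integers(0, 2, d, dtype=np.uint8)

assert (mesh_multiply(entries, v, d) == (A @ v) % 2).all()
print(mesh_multiply(entries, v, d))
```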

25 How to perform the routing? If the original sparse matrix A has size d×d, we have to fold the d vector entries into an m×m mesh where m = √d. Routing dominates cost, so the choice of algorithm (time, circuit area) is critical. There is extensive literature about mesh routing. Examples: bounded-queue-size algorithms, hot-potato routing, off-line algorithms. None of these are ideal.

26 Clockwise transposition routing on the mesh [Bernstein] One packet per cell. Only pairwise compare-exchange operations between neighbours. Compared pairs are swapped according to the preference of the packet that has the farthest to go along this dimension. Very simple schedule, can be realized implicitly by a pipeline. Pairwise annihilation. Worst-case: m². Average-case: ? Experimentally: 2m steps suffice for random inputs – optimal. The point: m² values handled in time O(m).

27 Comparison to Bernstein’s design Time: a single routing operation (2m steps) vs. 3 sorting operations (8m steps each) – roughly 1/12 of the time. Circuit area: only the ⟨i⟩ values move; the matrix entries don’t. Simple routing logic and small routed values. Matrix entries compactly stored in DRAM (~1/100 the area of “active” storage) – roughly 1/3 of the area.

28 Further improvements Reduce the number of cells in the mesh (for small μ, decreasing #cells by a factor of μ decreases throughput cost by ~√μ). Use Coppersmith’s block Wiedemann. Execute the separate multiplication chains of block Wiedemann simultaneously on one mesh (for small K, reduces cost by ~K). Compared to Bernstein’s original design, this reduces the throughput cost by a constant factor of 45,000 (the individual improvements contribute factors of roughly 1/7, 1/15 and 1/6).

29 Does it always work? We simulated the algorithm thousands of times with large random m×m meshes, and it always routed the data correctly within 2m steps. We failed to prove this (or any other) upper bound on the running time of the algorithm. We found a particular input for which the routing algorithm failed by entering a loop.

30 Hardware Fault tolerance Any wafer scale design will contain defective cells. Cells found to be defective during the initial testing can be handled by modifying the routing algorithm.

31 Algorithmic fault tolerance Transient errors must be detected and corrected as soon as possible since they can lead to totally unusable results. This is particularly important in our design, since the routing is not guaranteed to stop after 2m steps, and packets may be dropped

32 How to detect an error in the computation of A×V (mod 2)? [Figure: the original matrix-vector product A×V.]

33 How to detect an error in the computation of A×V (mod 2)? [Figure: an extra row, the sum (mod 2) of some matrix rows, is appended to A; its product with V gives a check bit.]

34 Problems with this solution: If the new test vector at the bottom is the sum mod 2 of a few rows, it will remain sparse but have an extremely small chance of detecting a transient error in one of the output bits. If the new test vector at the bottom is the sum mod 2 of a random subset of rows, it will become dense, but will still miss errors with constant probability.
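
Both claims can be checked numerically with a small sketch (toy sizes and experiment design are mine, not from the talk): append a test row made of a subset of rows of A, and after computing A·v compare the test row's check bit with the parity of the output bits it covers. A single flipped output bit is caught only if the flipped position lies in the chosen subset.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 400
A = (rng.random((d, d)) < 0.01).astype(np.int64)     # made-up sparse 0/1 matrix

def detection_rate(row_subset, trials=1000):
    """Use the (mod 2) sum of the rows in `row_subset` as the extra test vector,
    flip one random output bit of A*v, and measure how often the check catches it."""
    in_subset = np.zeros(d, dtype=bool)
    in_subset[row_subset] = True
    caught = 0
    for _ in range(trials):
        v = rng.integers(0, 2, d)
        y = (A @ v) % 2                              # correct output bits
        check_bit = int(y[in_subset].sum() % 2)      # what the test row's product equals
        y_err = y.copy()
        y_err[rng.integers(d)] ^= 1                  # transient error: one flipped output bit
        caught += int(y_err[in_subset].sum() % 2 != check_bit)
    return caught / trials

few_rows = rng.choice(d, size=5, replace=False)      # sparse test vector: sum of 5 rows
random_half = np.flatnonzero(rng.random(d) < 0.5)    # dense test vector: random subset of rows
print(detection_rate(few_rows))                      # ~ 5/400: almost never detects the flip
print(detection_rate(random_half))                   # ~ 0.5: still misses with constant probability
```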

35 Problems with this solution: To reduce the probability of missing an error to a negligible value, we can add hundreds of dense test vectors derived from independent random subsets of the rows of the matrix. However, in this case the cost of checking dominates the cost of the actual computation.

36 Our new solution: We add only one test vector, which adds only 1% to the cost of the computation, and still get a negligible probability of missing transient errors. We achieve this by using the fact that in Wiedemann’s algorithm we compute all the products V₁ = AV, V₂ = A²V, V₃ = A³V, ..., Vᵢ = A^i V, ..., V_D = A^D V.

37 Our new solution: We choose a random row vector R, and precompute (on a reliable machine) the row vector W = R·A^k for some small k (e.g., k=200). We add to the matrix the two row vectors W and R as the only additional test vectors. Note that these vectors are no longer the sum of subsets of rows of the matrix A.

38 Our new solution: Consider now the following equation, which is true for all i: V_{i+k} = A^{i+k}·V = A^k·A^i·V = A^k·Vᵢ. Multiplying it on the left by R, we get that for all i: R·V_{i+k} = R·A^k·Vᵢ = W·Vᵢ. Since we added the two vectors R and W to the matrix, we get for each vector Vᵢ the products R·Vᵢ and W·Vᵢ, which are two bits.

39 Our new solution: We store the computed values of W·Vᵢ in a short shift register of length k=200, and compare each of them after a delay of k steps with the computed value of R·V_{i+k}. We periodically store the current vector Vᵢ (say, once a day) in off-line storage. If any one of the consistency tests fails, we stop the computation, test for faulty components, and restart from the last good Vᵢ.
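
Here is a small Python sketch of this consistency check (my own toy dimensions, a much smaller k than the talk's 200, and a fault injected only to demonstrate detection): precompute W = R·A^k, then during the iteration store the bits W·Vᵢ in a length-k delay line and compare each of them with R·V_{i+k}.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(2)

# Toy sizes; in the talk d is over 2^30 and k = 200.
d, k, steps = 64, 16, 60
A = rng.integers(0, 2, (d, d))
V = rng.integers(0, 2, d)
R = rng.integers(0, 2, d)                 # random row vector

W = R.copy()                              # precompute W = R * A^k (done on a reliable machine)
for _ in range(k):
    W = W @ A % 2

delay = deque()                           # shift register holding the last k values of W.V_i
fault_at = 30                             # step at which a transient error flips one bit of V_i

detected_at = None
for i in range(1, steps + 1):
    V = A @ V % 2                         # V_i = A * V_{i-1}, the product done by the mesh
    if i == fault_at:
        V[5] ^= 1                         # inject the transient error (demo only)
    delay.append(int(W @ V % 2))          # store W.V_i
    if len(delay) > k:
        expected = delay.popleft()        # W.V_{i-k}
        actual = int(R @ V % 2)           # equals R.A^k.V_{i-k} = W.V_{i-k} if nothing went wrong
        if actual != expected:
            detected_at = i
            break

# With a random-looking A, each of the k tests after the fault exposes it with probability ~1/2,
# so the error is caught with probability about 1 - 2^(-k).
print(detected_at, fault_at)
```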

40 Why does it catch transient errors with overwhelming probability? Assume that at step i the algorithm computed the erroneous Vᵢ′ = Vᵢ + E, and that all the computations before and after step i were correct (with respect to the wrong Vᵢ′). Due to the linearity of the computation, for any j>0: V′_{i+j} = A^j·V′ᵢ = A^j(Vᵢ + E) = A^j·Vᵢ + A^j·E = V_{i+j} + A^j·E, and thus the difference between the correct and erroneous Vᵢ develops as A^j·E from time i onwards.

41 Why does it catch transient errors with overwhelming probability? Each test checks whether W(A^j E) = 0 for some j < k. Since the matrix A generated by the number field sieve is random looking, its first k=200 powers are likely to be random and dense, and thus each test has an independent probability of 0.5 of failing (i.e., of exposing the error). [Timeline figure: an error introduced at step i can first be detected shortly afterwards; after k more steps there are no more detectable errors.]

42 -END OF PART ONE-