1 Special Purpose Hardware for Factoring: The Linear Algebra Part. Eran Tromer and Adi Shamir, Applied Math Dept, The Weizmann Institute of Science. SHARCS 2005

2 Cryptanalysis is Evil. A simple mathematical proof: The definition of throughput cost implies that cryptanalysis = time × money. Since time = money, cryptanalysis = money². Since money is the root of evil, money² = evil. Consequently, cryptanalysis = evil.

3 Hardware cryptanalysis is the last frontier: Government organizations used to dominate cryptography, but over the last 25 years they have faced increasing competition from academia and industry: public key cryptosystems, protocols and modes of operation, secret key cryptosystems, algorithmic cryptanalysis, hardware cryptanalysis.

4 Since this is the first workshop on cryptanalytic hardware, we have to consider the scope of the field: Totally impractical hardware designs? Quantum computers? Other nonstandard models of computation? Optimized implementations of cryptosystems? Small improvements in constants? Low level hardware details? Attack hardware or hardware attacks?

5 Some of my personal opinions: Near-term practicality is crucial. Constants are important. Ideal cases: marginal schemes that are awkward on a PC. General tools are better than specific hardware. Funding for practical research can be sensitive.

6 Which directions look promising? Optimizing some general tools of cryptanalysis: Improving time/memory tradeoff attacks (?) Implementing lattice reduction algorithms. Applying transforms such as Walsh or FFT. Solving large systems of sparse linear equations. Computing Gröbner bases with hardware assistance.

7 Which directions look promising? Cryptanalysis of block ciphers: For schemes with 40-bit keys, or for DES, exhaustive search machines were very useful. Those days are over: modern block ciphers such as AES are very unlikely to be broken by pure hardware optimizations of exhaustive search. Faster hardware implementations of differential and linear cryptanalysis do not address the real bottleneck, which is the huge amount of chosen/known plaintext required.

8 Which directions look promising? Promising research directions in block ciphers deal primarily with cryptanalytic countermeasures: searching for improved differentials and linear hulls in order to prove the strength of a scheme against such attacks; choosing S-boxes that satisfy certain criteria; finding minimal bit-slice implementations of S-boxes (e.g., to avoid cache timing attacks).

9 Which directions look promising? Stream ciphers tend to be weaker, and there are several ways to use hardware assistance in their cryptanalysis: applying fast correlation attacks; finding sparse multiples of the recursion polynomial; trying many possible clock control sequences; applying algebraic attacks.

10 Which directions look promising? Hash algorithm collisions are a hot new topic: Hardware can assist in finding good message modification patterns, but software is easier to tweak, and PCs seem to be sufficiently powerful. It is too early to tell whether special purpose hardware can help in solving the generated systems of Boolean conditions.

11 Which directions look promising? Public key algorithms seem to be the best targets for hardware-assisted attacks: The security of many number-theoretic schemes is marginal, and it is difficult to make them much stronger without making the keys too long or the scheme too slow. Improved attacks are discovered quite frequently, and their hardware implementations are less obvious. This is well demonstrated by the talks here.

12 Which directions look promising? Some of the lesser-known public key schemes may be vulnerable to hardware attacks: Algebraic schemes such as NTRU may be attacked by better implementations of lattice reduction algorithms. Multivariate public key schemes such as HFE may be attacked by better implementations of Gröbner basis algorithms.

13 We now turn to our main topic, which is an efficient hardware implementation of modern factoring algorithms. This talk will deal with the linear algebra part, and will present an improved version of the hardware first proposed in: "Analysis of Bernstein's Factorization Circuit", Arjen Lenstra, Adi Shamir, Jim Tomlinson, Eran Tromer, ASIACRYPT, December 2002.

14 Integer Factorization. RSA and many other cryptographic schemes rely on the hardness of factorization, using n as the public key and p, q as the secret key. The size of the public key determines the efficiency and the security of the scheme. We'll concentrate on 1024-bit keys, which are the vast majority of the keys in use today. [Diagram: large primes p, q; their product n = p × q. Multiplication is easy; factorization is hard.]

15 Previous estimates of the cost of factoring a 1024-bit RSA key within 1 year were around a trillion dollars: Traditional PC-based [Silverman 2000]: 100M PCs with 170 GB RAM each: ~$5×10^12. TWINKLE [Lenstra, Shamir 2000; Silverman 2000]: 3.5M TWINKLEs and 14M simple PCs: ~$10^11.

16 Improved Factorization Results can be derived from: Non-conventional computation models (unlimited word length, quantum computers, ...). New conventional devices (PCs, massively parallel computers, wafer-scale integration, optoelectronics, ...). Improved performance of standard devices (Moore's law, ...). New factoring algorithms (Pollard's Rho method, Quadratic Sieve, Number Field Sieve, ...). Optimized implementations of known algorithms on standard computation devices.

17 Bicycle chain sieve [D. H. Lehmer, 1928]

18 The Number Field Sieve (NFS) Integer Factorization Algorithm. The best algorithm known for factoring RSA keys; similar in structure to the quadratic sieve, but with a better subexponential running time. Successfully factored a 512-bit integer (RSA-155) in 1999, using hundreds of workstations running for many months. Latest record: the 576-bit RSA-576 (174 decimal digits), factored in 2003.

19 Simplified NFS – main parts. Sieving step (relation collection): find many B-smooth integers, which factor completely into a product of primes smaller than some bound B. (This is the harder part, described in the next talk.) Matrix step: find a linear relationship between the factorizations of those integers. (This is the easier part, described here.)

20 The basic idea (shared with many previous factoring algorithms). To factor n: find "random" r₁, r₂ such that r₁² ≡ r₂² (mod n), and hope that gcd(r₁ − r₂, n) is a nontrivial factor of n. How? Let f₁(a) = a² mod n. These values are squares mod n, but not over ℤ. Find a nonempty set S ⊂ ℤ such that the product of the values f₁(a) for a ∈ S is a square over the integers. This gives us the two "independent" representations r₁² ≡ r₂² (mod n).
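
As a concrete (if wildly inefficient) illustration of this idea, here is a brute-force Python sketch; the function name and the toy modulus are ours, and NFS of course finds such pairs by sieving rather than by search:

```python
from math import gcd, isqrt

def factor_by_congruent_squares(n):
    """Toy illustration: search for r1, r2 with r1^2 = r2^2 (mod n)
    but r1 != +-r2 (mod n); then gcd(r1 - r2, n) is a proper factor.
    Brute force, so only for tiny n."""
    for r1 in range(2, n):
        s = (r1 * r1) % n
        r2 = isqrt(s)   # hope that r1^2 mod n is a perfect square over Z
        if r2 * r2 == s and (r1 - r2) % n != 0 and (r1 + r2) % n != 0:
            f = gcd(r1 - r2, n)
            if 1 < f < n:
                return f, n // f
    return None

print(factor_by_congruent_squares(91))  # 91 = 7 * 13
```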

21 The basic idea (cont.) How to find S such that ∏_{a∈S} f₁(a) is a square? Look at the factorizations of the smooth values f₁(a), which factor completely into products of small primes:
f₁(0) = 102 = 2·3·17
f₁(1) = 33 = 3·11 ✔
f₁(2) = 1495 = 5·13·23
f₁(3) = 84 = 2²·3·7
f₁(4) = 616 = 2³·7·11 ✔
f₁(5) = 145 = 5·29
f₁(6) = 42 = 2·3·7 ✔
The product of the checked values, 33·616·42 = 2⁴·3²·7²·11², is a square, because all the exponents are even.

22 The basic idea (cont.) How to find S such that ∏_{a∈S} f₁(a) is a square? Consider all the primes smaller than some bound B, and find many integers a for which f₁(a) is B-smooth. For each such a, represent the factorization of f₁(a) as a vector of b exponents: f₁(a) = 2^e1·3^e2·5^e3·7^e4··· ↦ (e₁, e₂, ..., e_b). Once b+1 such vectors are found, find a dependency modulo 2 among them; that is, find S such that ∏_{a∈S} f₁(a) = 2^e1·3^e2·5^e3·7^e4··· where the eᵢ are all even. (Finding the vectors is the sieving step; finding the dependency is the matrix step.)
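
The recipe can be checked on the toy example above with a few lines of Python. This is an illustrative sketch (the helper names are ours): exponent vectors are packed into bitmasks, and a naive Gaussian elimination over GF(2) recovers the dependency S = {33, 616, 42}:

```python
from functools import reduce
from math import isqrt

PRIMES = [2, 3, 5, 7, 11, 13, 17, 23, 29]
VALUES = [102, 33, 1495, 84, 616, 145, 42]   # the f1(a) values from the example

def exponent_vector(x):
    """Bitmask of the exponents mod 2 over PRIMES (assumes x is smooth)."""
    vec = 0
    for i, p in enumerate(PRIMES):
        while x % p == 0:
            x //= p
            vec ^= 1 << i
    assert x == 1, "value not smooth over PRIMES"
    return vec

def find_dependency(vectors):
    """Naive Gaussian elimination over GF(2); returns a bitmask of input
    indices whose vectors sum to zero, or None if no dependency exists."""
    basis = {}                       # pivot bit -> (reduced vector, index mask)
    for j, v in enumerate(vectors):
        mask = 1 << j                # remember which inputs were combined
        while v:
            pivot = v.bit_length() - 1
            if pivot not in basis:
                basis[pivot] = (v, mask)
                break
            bv, bm = basis[pivot]
            v ^= bv
            mask ^= bm
        else:
            return mask              # v reduced to zero: dependency found
    return None

mask = find_dependency([exponent_vector(x) for x in VALUES])
subset = [x for j, x in enumerate(VALUES) if mask >> j & 1]
prod = reduce(lambda a, b: a * b, subset)
print(subset, prod, isqrt(prod) ** 2 == prod)   # [33, 616, 42] 853776 True
```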

23 The matrix step. We look for elements of the kernel of a sparse matrix over GF(2). Using Wiedemann's algorithm, this can be reduced to the following: Input: a D×D binary matrix A and a binary D-vector v. Output: the first few bits of each of the vectors Av, A^2 v, A^3 v, ..., A^D v (mod 2). D is huge (between 100 million and 10 billion), but the matrix is very sparse (e.g., ~100 ones in each row).
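
At a toy scale, the required input/output behavior can be sketched in Python as follows (this omits the Berlekamp–Massey post-processing that Wiedemann's algorithm applies to the output bits, and the parameters are illustrative):

```python
import random

random.seed(1)
D, WEIGHT = 200, 20   # toy scale; in reality D is 10^8..10^10 with ~100 ones per row

# Sparse matrix: for each row, the list of column indices that hold a 1.
rows = [random.sample(range(D), WEIGHT) for _ in range(D)]

def mat_vec(v):
    """One sparse matrix-by-vector product over GF(2): output bit i is the
    parity of the entries of v selected by the ones in row i."""
    return [sum(v[c] for c in cols) % 2 for cols in rows]

v = [random.randrange(2) for _ in range(D)]
first_bits = []
for _ in range(D):            # the sequence Av, A^2 v, ..., A^D v
    v = mat_vec(v)
    first_bits.append(v[0])   # Wiedemann's algorithm needs only a few bits of each
print("".join(map(str, first_bits[:64])))
```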

24 Observations [Bernstein 2001]. On a single-processor computer, storage dominates cost, yet is poorly utilized. Sharing the input among multiple processors leads to collisions and propagation delays. Solution: use a mesh-based device, with a small processor attached to each storage cell. Bernstein proposed an algorithm based on mesh sorting.

25 Matrix-by-vector multiplication. [Figure: a binary matrix multiplied by a binary vector; each output bit is a sum Σ of the selected matrix entries, taken mod 2.]

26 Is the mesh sorting idea optimal? The fastest known mesh sorting algorithm for an m×m mesh requires about 3m steps, but it is too complicated to implement on simple processors. Bernstein used the Schimmler mesh sorting algorithm three times to complete each matrix-by-vector product; for an m×m mesh, this requires about 24m steps. We proposed to replace the mesh sorting by mesh routing, which is both simpler and faster: we use only one full routing, which requires 2m steps.

27 A routing-based circuit for the matrix step [Lenstra, Shamir, Tomlinson, Tromer 2002]. Model: a two-dimensional mesh, with each node connected to ≤4 neighbours. Preprocessing: load the non-zero entries of A into the mesh, one entry per node. The entries of each column are stored in a square block of the mesh, along with a "target cell" for the corresponding vector bit.

28 Operation of the routing-based circuit. To perform a multiplication: Initially the target cells contain the vector bits. These are locally broadcast within each block (i.e., within the matrix column). A cell containing a row index i that receives a "1" emits an ⟨i⟩ value (which corresponds to a 1 at row i). Each ⟨i⟩ value is routed to the target cell of the i-th block (which is collecting ⟨i⟩'s for row i). Each target cell counts (mod 2) the number of ⟨i⟩ values it received. That's it! Ready for the next iteration.
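
Stripped of the mesh geometry, this dataflow is easy to verify in Python: broadcast, emit, route, and count mod 2 together compute exactly the matrix-by-vector product over GF(2). A minimal sketch with a small random matrix (all names illustrative):

```python
import random

random.seed(2)
D = 8
v = [random.randrange(2) for _ in range(D)]
# Non-zero entries of A as (row, column) pairs, one per mesh cell.
entries = {(i, j) for i in range(D) for j in range(D) if random.random() < 0.3}

# 1. Broadcast: every cell in column block j learns the vector bit v[j].
# 2. Emit: a cell holding entry (i, j) with v[j] == 1 emits the value <i>.
emitted = [i for (i, j) in entries if v[j] == 1]

# 3. Route and count: target cell i counts the arriving <i> values mod 2.
result = [0] * D
for i in emitted:
    result[i] ^= 1

# Sanity check against the ordinary matrix-by-vector product over GF(2).
assert result == [sum(v[j] for j in range(D) if (i, j) in entries) % 2
                  for i in range(D)]
print(result)
```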

29 How to perform the routing? If the original sparse matrix A has size D×D, we have to fold the D vector entries into an m×m mesh, where m = √D. Routing dominates cost, so the choice of algorithm (time, circuit area) is critical. There is extensive literature about mesh routing; examples: bounded-queue-size algorithms, hot-potato routing, off-line algorithms. None of these are ideal.

30 Clockwise transposition routing on the mesh. Very simple schedule; can be realized implicitly by a pipeline. One packet per cell; only pairwise compare-exchange operations. Compared pairs are swapped according to the preference of the packet that has the farthest to go along this dimension; packets headed for the same target annihilate pairwise. Worst case: m² steps. Average case: ? Experimentally: 2m steps suffice for random inputs, which is optimal. The point: m² values are handled in time O(m). [Bernstein]
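
The schedule is simple enough to simulate. The following Python sketch is our own rough approximation of transposition routing under the "farthest to go" rule (random permutation of targets, no pairwise annihilation, simplified phase order); it illustrates the flavor of the algorithm rather than its exact performance:

```python
import random

def route(m, seed=0, max_steps=None):
    """Route one packet per cell to a random permutation of targets on an
    m x m mesh, using alternating-dimension pairwise compare-exchange.
    Swaps follow the packet with the farthest to go along the current
    dimension. Returns (steps used, packets left undelivered)."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(m) for c in range(m)]
    targets = cells[:]
    rng.shuffle(targets)
    grid = dict(zip(cells, targets))          # position -> target of its packet
    cap = max_steps or 50 * m
    for step in range(1, cap + 1):
        for pos in [p for p, t in grid.items() if p == t]:
            del grid[pos]                     # absorb packets that arrived
        if not grid:
            return step - 1, 0
        axis, par = [(1, 0), (0, 0), (1, 1), (0, 1)][step % 4]
        for other in range(m):
            for lo in range(par, m - 1, 2):   # pair (lo, lo+1) along `axis`
                a = (other, lo) if axis else (lo, other)
                b = (other, lo + 1) if axis else (lo + 1, other)
                pa, pb = grid.get(a), grid.get(b)
                if pa is None and pb is None:
                    continue
                da = pa[axis] - a[axis] if pa else 0  # >0: wants to move toward b
                db = pb[axis] - b[axis] if pb else 0  # <0: wants to move toward a
                if pa and pb:
                    swap = da > 0 if abs(da) >= abs(db) else db < 0
                else:
                    swap = da > 0 if pa else db < 0   # slide into the empty cell
                if swap:
                    grid[a], grid[b] = pb, pa
                    for p in (a, b):
                        if grid[p] is None:
                            del grid[p]
    return cap, len(grid)

m = 16
steps, left = route(m)
print(f"{m}x{m} mesh: {left} undelivered after {steps} steps; "
      f"the slide reports ~2m = {2 * m} steps in practice")
```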

31 Comparison to Bernstein's design. Time (~1/12): a single routing operation (2m steps) vs. 3 sorting operations (8m steps each). Circuit area (~1/3): only the ⟨i⟩ values move; the matrix entries don't. Simple routing logic and small routed values. Matrix entries are compactly stored in DRAM (~1/100 the area of "active" storage).

32 Further improvements. Reduce the number of cells in the mesh (~1/7; for small μ, decreasing the number of cells by a factor of μ decreases the throughput cost by ~μ^(1/2)). Use Coppersmith's block Wiedemann algorithm (~1/15). Execute the separate multiplication chains of block Wiedemann simultaneously on one mesh (~1/6; for small K, this reduces cost by ~K). Compared to Bernstein's original design, this reduces the throughput cost by a constant factor of 45,000.

33 Hardware fault tolerance. Any wafer-scale design will contain defective cells. Cells found to be defective during the initial testing can be handled by modifying the routing algorithm.

34 Algorithmic fault tolerance. Transient errors must be detected and corrected as soon as possible, since they can lead to totally unusable results. This is particularly important in our design, since the routing is not guaranteed to stop after 2m steps, and packets may be dropped.

35 How to detect an error in the computation of A×V (mod 2)? [Figure: the original matrix-by-vector product.]

36 How to detect an error in the computation of A×V (mod 2)? [Figure: a test vector, the sum of some matrix rows, is appended to the matrix; its product with V gives a check bit.]

37 Problems with this solution: If the new test vector at the bottom is the sum mod 2 of a few rows, it will remain sparse, but it will have an extremely small chance of detecting a transient error in one of the output bits. If it is the sum mod 2 of a random subset of the rows, it will become dense, but it will still miss errors with constant probability.

38 Problems with this solution: To reduce the probability of missing an error to a negligible value, we can add hundreds of dense test vectors derived from independent random subsets of the rows of the matrix. However, in this case the cost of checking dominates the cost of the actual computation.

39 Our new solution: We add only one test vector, which adds only 1% to the cost of the computation, and still obtain a negligible probability of missing transient errors. We achieve this by using the fact that in Wiedemann's algorithm we compute all the products V_1 = Av, V_2 = A^2 v, V_3 = A^3 v, ..., V_i = A^i v, ..., V_D = A^D v.

40 Our new solution: We choose a random row vector R and precompute (on a reliable machine) the row vector W = R·A^k for some small k (e.g., k = 200). We add the two row vectors W and R to the matrix as the only additional test vectors. Note that these vectors are no longer sums of subsets of the rows of the matrix A.

41 Our new solution: Consider now the following equation, which is true for all i: V_{i+k} = A^{i+k} v = A^k A^i v = A^k V_i. Multiplying it on the left by R, we get, for all i: R·V_{i+k} = R·A^k·V_i = W·V_i. Since we added the two vectors R and W to the matrix, we get for each vector V_i the products R·V_i and W·V_i, which are two bits.

42 Our new solution: We store the computed values of W·V_i in a short shift register of length k = 200, and compare each of them, after a delay of k steps, with the computed value of R·V_{i+k}. We periodically (say, once a day) store the current vector V_i in off-line storage. If any one of the consistency tests fails, we stop the computation, test for faulty components, and restart from the last good V_i.
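
The whole mechanism fits in a short Python sketch (toy parameters of our choosing: small D and k, a dense random matrix, and a deliberately injected single-bit fault):

```python
import random
from collections import deque

random.seed(3)
D, k = 64, 8                      # toy sizes; the slides use k = 200 and huge D

def mat_vec(M, v):                # dense matrix-by-vector product over GF(2)
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]

def dot(u, w):                    # inner product of two bit vectors over GF(2)
    return sum(a * b for a, b in zip(u, w)) % 2

A = [[random.randrange(2) for _ in range(D)] for _ in range(D)]
v = [random.randrange(2) for _ in range(D)]
R = [random.randrange(2) for _ in range(D)]

# Precompute, "on a reliable machine", the row vector W = R * A^k.
W = R
for _ in range(k):
    W = [sum(W[i] * A[i][j] for i in range(D)) % 2 for j in range(D)]

FAULT_AT = 30                     # inject one transient bit flip at this step
history = deque(maxlen=k)         # shift register of the last k values of W*V_i
for i in range(1, 101):
    v = mat_vec(A, v)             # V_i = A * V_{i-1}
    if i == FAULT_AT:
        v[0] ^= 1                 # the transient error E
    if len(history) == k and dot(R, v) != history[0]:
        print(f"inconsistency detected at step {i}: R*V_i != W*V_(i-k)")
        break                     # stop, test components, restart from a good V_i
    history.append(dot(W, v))
else:
    print(f"fault escaped all tests (probability ~ 2**-{k} per fault)")
```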

43 Why does it catch transient errors with overwhelming probability? Assume that at step i the algorithm computed the erroneous V'_i = V_i + E, and that all the computations before and after step i were correct (with respect to the wrong V'_i). Due to the linearity of the computation, for any j > 0: V'_{i+j} = A^j V'_i = A^j (V_i + E) = A^j V_i + A^j E = V_{i+j} + A^j E, and thus the difference between the correct and the erroneous computation develops as A^j E from time i onwards.

44 Why does it catch transient errors with overwhelming probability? Each test checks whether W·(A^j E) = 0 for some j < k. Since the matrix A generated by the number field sieve is random-looking, its first k = 200 powers are likely to be random and dense, and thus each test independently catches the error with probability about 0.5. [Timeline figure: the first error detection occurs shortly after step i; after k steps there are no more detectable errors.]
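
A quick Monte Carlo sanity check of the 0.5-per-test claim, with dense random stand-ins for A and the check row W (illustrative only):

```python
import random

random.seed(4)
D, k = 64, 200
A = [[random.randrange(2) for _ in range(D)] for _ in range(D)]
W = [random.randrange(2) for _ in range(D)]   # stand-in for the dense check row

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]

e = [0] * D
e[7] = 1                                      # a single-bit transient error E
caught = 0
for j in range(1, k + 1):                     # the k tests of W * (A^j E)
    e = mat_vec(A, e)
    caught += sum(w * x for w, x in zip(W, e)) % 2
print(f"{caught} of {k} tests catch the error "
      f"(about half, so an error survives all k tests only with tiny probability)")
```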

45 -END OF PART ONE-