1 Hardware-Based Implementations of Factoring Algorithms
Factoring Large Numbers with the TWIRL Device (Adi Shamir, Eran Tromer)
Analysis of Bernstein's Factorization Circuit (Arjen Lenstra, Adi Shamir, Jim Tomlinson, Eran Tromer)

2 Bicycle chain sieve [D. H. Lehmer, 1928]

3 The Number Field Sieve Integer Factorization Algorithm
Best algorithm known for factoring large integers.
Subexponential time, subexponential space.
Successfully factored a 512-bit RSA key (hundreds of workstations running for many months).
Record: 530-bit integer (RSA-160, 2003).
Factoring 1024-bit: previous estimates were trillions of $ × year.
Our result: a hardware implementation which can factor 1024-bit composites at a cost of about 10M $ × year.

4 NFS – main parts
Relation collection (sieving) step: find many integers satisfying a certain (rare) property.
Matrix step: find an element from the kernel of a huge but sparse matrix.

5 Previous works: 1024-bit sieving
Cost of completing all sieving in 1 year:
Traditional PC-based [Silverman 2000]: 100M PCs with 170GB RAM each: $5×10^…
TWINKLE [Lenstra, Shamir 2000; Silverman 2000]*: 3.5M TWINKLEs and 14M PCs: ~$…
Mesh-based sieving [Geiselmann, Steinwandt 2002]*: millions of devices, $… to $… (if at all?). Multi-wafer design – feasible?
New device: $10M

6 Previous works: 1024-bit matrix step
Cost of completing the matrix step in 1 year:
Serial [Silverman 2000]: 19 years and 10,000 interconnected Crays.
Mesh sorting [Bernstein 2001; LSTT 2002]: 273 interconnected wafers – feasible?! $4M and 2 weeks.
New device: $0.5M

7 Review: the Quadratic Sieve
To factor n:
Find "random" r₁, r₂ such that r₁² ≡ r₂² (mod n).
Hope that gcd(r₁−r₂, n) is a nontrivial factor of n.
How? Let f₁(a) = (a+⌊n^{1/2}⌋)² − n and f₂(a) = (a+⌊n^{1/2}⌋)².
Find a nonempty set S ⊂ ℤ such that ∏_{a∈S} f₁(a) = r₁² and ∏_{a∈S} f₂(a) = r₂² over ℤ, for some r₁, r₂ ∈ ℤ. Then r₁² ≡ r₂² (mod n).
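To make the congruence-of-squares trick concrete, here is a minimal Python sketch (the numbers are a made-up toy example, not from the slides):

```python
from math import gcd

def try_split(n, r1, r2):
    """Given r1^2 = r2^2 (mod n), attempt to extract a nontrivial factor of n."""
    assert (r1 * r1 - r2 * r2) % n == 0
    d = gcd(r1 - r2, n)
    return d if 1 < d < n else None

# Toy example: 9^2 = 81 = 4 = 2^2 (mod 77), and gcd(9 - 2, 77) = 7 splits 77.
print(try_split(77, 9, 2))  # 7
```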

8 The Quadratic Sieve (cont.)
How to find S such that ∏_{a∈S} f₁(a) is a square?
Look at the factorization of f₁(a):
f₁(0) = 102 = 2·3·17
f₁(1) = 33 = 3·11 ✓
f₁(2) = 1495 = 5·13·23
f₁(3) = 84 = 2²·3·7
f₁(4) = 616 = 2³·7·11 ✓
f₁(5) = 145 = 5·29
f₁(6) = 42 = 2·3·7 ✓
f₁(1)·f₁(4)·f₁(6) = 2⁴·3²·7²·11². This is a square, because all exponents are even.
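The arithmetic of the selected set can be checked mechanically; a small verification script (the helper names are ours):

```python
from collections import Counter
from math import isqrt

def factorize(m):
    """Trial division; adequate for these tiny demo values."""
    f, d = Counter(), 2
    while d * d <= m:
        while m % d == 0:
            f[d] += 1
            m //= d
        d += 1
    if m > 1:
        f[m] += 1
    return f

exps = factorize(33) + factorize(616) + factorize(42)
print(sorted(exps.items()))                    # [(2, 4), (3, 2), (7, 2), (11, 2)]
print(all(e % 2 == 0 for e in exps.values()))  # True: every exponent is even
r = isqrt(33 * 616 * 42)
print(r, r * r == 33 * 616 * 42)               # 924 True
```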

9 The Quadratic Sieve (cont.)
How to find S such that ∏_{a∈S} f₁(a) is a square?
Consider only the π(B) primes smaller than a bound B.
Relation collection step: search for integers a for which f₁(a) is B-smooth. For each such a, represent the factorization of f₁(a) as a vector of b exponents: f₁(a) = 2^{e₁}·3^{e₂}·5^{e₃}·7^{e₄}··· ↦ (e₁, e₂, ..., e_b).
Matrix step: once b+1 such vectors are found, find a dependency modulo 2 among them. That is, find S such that ∏_{a∈S} f₁(a) = 2^{e₁}·3^{e₂}·5^{e₃}·7^{e₄}··· where all eᵢ are even.
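A toy implementation of both steps, assuming trial division for smoothness and bitmask Gaussian elimination over GF(2) for the dependency (we shift by isqrt(n)+1 rather than ⌊√n⌋ so that f₁(a) stays positive; the sample modulus and all names are illustrative):

```python
from math import isqrt

def smooth_exponents(m, primes):
    """Exponent vector of m over `primes`, or None if m is not B-smooth."""
    exps = []
    for p in primes:
        e = 0
        while m % p == 0:
            m //= p
            e += 1
        exps.append(e)
    return exps if m == 1 else None

def dependency_mod2(vectors):
    """Nonempty subset of rows summing to zero mod 2 (incremental GF(2) elimination)."""
    pivots = {}  # leading-bit position -> (reduced bitmask, contributing row indices)
    for j, v in enumerate(vectors):
        mask, combo = sum((e % 2) << i for i, e in enumerate(v)), {j}
        while mask:
            b = mask.bit_length() - 1
            if b not in pivots:
                pivots[b] = (mask, combo)
                break
            pmask, pcombo = pivots[b]
            mask, combo = mask ^ pmask, combo ^ pcombo
        else:
            return combo  # reduced to zero: these rows form a dependency
    return None

n = 87463                          # hypothetical toy composite
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]
m0 = isqrt(n) + 1                  # shift keeps f1(a) = (a + m0)^2 - n positive
xs, vecs, S = [], [], None
for a in range(5000):
    v = smooth_exponents((a + m0) ** 2 - n, primes)
    if v is not None:
        xs.append(a)
        vecs.append(v)
        S = dependency_mod2(vecs)
        if S is not None:
            break
print("smooth a values:", xs, "dependency:", S)  # [0, 3, 11, 20], {0, 3}
```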

10 Observations [Bernstein 2001]
The matrix step involves multiplication of a single huge matrix (of size subexponential in n) by many vectors.
On a single-processor computer, storage dominates cost yet is poorly utilized; sharing the input brings collisions and propagation delays.
Solution: use a mesh-based device with a small processor attached to each storage cell, and devise an appropriate distributed algorithm. Bernstein proposed an algorithm based on mesh sorting.
Asymptotic improvement: at a given cost you can factor integers that are 1.17 times longer, when cost is defined as throughput cost (AT cost) = run time × construction cost.

11 Implications?
The expressions for asymptotic costs have the form e^{(α+o(1))·(log n)^{1/3}·(log log n)^{2/3}}.
Is it feasible to implement the circuits with current technology? For what problem sizes?
Constant-factor improvements to the algorithm? Take advantage of the quirks of available technology?
What about relation collection?

12 The Relation Collection Step
Task: find many integers a for which f₁(a) is B-smooth (and their factorization).
We look for a such that p | f₁(a) for many large p: each prime p "hits" at arithmetic progressions a = rᵢ, rᵢ+p, rᵢ+2p, …, where the rᵢ are the roots of f₁ modulo p (there are at most deg(f₁) such roots, ~1 on average).
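Each prime's progressions come from the roots of f₁ modulo p; a small sketch using brute-force root finding (fine for small p; the sample polynomial is made up):

```python
def progressions(coeffs, p):
    """Roots r of f modulo p, found by brute force; each root yields the
    progression r, r + p, r + 2p, ...
    `coeffs` lists f's coefficients, highest degree first."""
    def f_mod(a):
        acc = 0
        for c in coeffs:
            acc = (acc * a + c) % p
        return acc
    return [r for r in range(p) if f_mod(r) == 0]

# f1(a) = a^2 + 2a - 35 has integer roots 5 and -7, so modulo 11 it hits the
# progressions a = 4 (mod 11) and a = 5 (mod 11), since -7 = 4 (mod 11):
print(progressions([1, 2, -35], 11))  # [4, 5]
```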

13 The Sieving Problem
Input: a set of arithmetic progressions. Each progression has a prime interval p and value log p.
Output: indices where the sum of values exceeds a threshold.
(There is about one progression for every prime p smaller than 10⁸.)
[diagram: rows of progressions, with a mark at each index the progression hits]
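For reference, the straightforward software solution of the stated problem (toy progressions and threshold, chosen arbitrarily):

```python
import math

def sieve(progressions, length, threshold):
    """progressions: list of (start, p). Accumulate log p at start, start+p, ...
    and report the indices whose accumulated value exceeds the threshold."""
    acc = [0.0] * length
    for start, p in progressions:
        contribution = math.log(p)
        for i in range(start, length, p):
            acc[i] += contribution
    return [i for i, v in enumerate(acc) if v > threshold]

progs = [(1, 3), (4, 5), (2, 7), (4, 11)]  # toy (start, prime) progressions
print(sieve(progs, 40, 3.0))               # [4, 9, 16, 37]
```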

14 Three ways to sieve your numbers...
[diagram: the sieving matrix: one row per prime (3, 5, 7, 11, 13, 17, 19, 23), one column per index (the a values), with a mark wherever a progression hits]

15 Serial sieving, à la Eratosthenes (276–194 BC)
One contribution per clock cycle.
[diagram: the sieving matrix traversed progression by progression; memory spans the indices, time spans the contributions]

16 TWINKLE: time-space reversal
One index handled at each clock cycle; each progression is maintained by a counter.
[diagram: the sieving matrix traversed index by index; time spans the indices]

17 TWIRL: compressed time
s = 5 indices handled at each clock cycle (real device: s = 32768), by various circuits.
[diagram: the sieving matrix traversed s columns at a time]

18 Parallelization in TWIRL
TWINKLE-like pipeline: indices a = 0, 1, 2, … flow through the device, one index per clock cycle.

19 Parallelization in TWIRL
TWINKLE-like pipeline: a = 0, 1, 2, …
Simple parallelization with factor s: s independent pipelines, handling a = 0, s, 2s, … and the offsets in between.
TWIRL with parallelization factor s: a single compressed pipeline through which s consecutive indices (0 … s−1, then s … 2s−1, then 2s … 3s−1, …) advance at each clock cycle.
[diagram: the three pipeline layouts]

20 Heterogeneous design
A progression of interval pᵢ makes a contribution every pᵢ/s clock cycles.
There are a lot of large primes, but each contributes very seldom.
There are few small primes, but their contributions are frequent.
We place numerous "stations" along the pipeline. Each station handles progressions whose prime intervals are in a certain range. Station design varies with the magnitude of the prime.

21 Example: handling large primes
Primary consideration: efficient storage between contributions.
Each memory+processor unit handles many progressions. It computes and sends contributions across the bus, where they are added at just the right time. Timing is critical.
[diagram: memory+processor units attached to the bus]

22 Handling large primes (cont.)
[diagram: a single memory+processor unit]

23 Handling large primes (cont.)
The memory contains a list of events of the form (pᵢ, aᵢ), meaning "a progression with interval pᵢ will make a contribution to index aᵢ". The list is ordered by increasing aᵢ.
Goal: simulate a priority queue. At each clock cycle:
1. Read the next event (pᵢ, aᵢ).
2. Send a log pᵢ contribution to line aᵢ (mod s) of the pipeline.
3. Update aᵢ ← aᵢ + pᵢ.
4. Save the new event (pᵢ, aᵢ) to the memory location that will be read just before index aᵢ passes through the pipeline.
To handle collisions, slacks and logic are added.
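In software, the event flow this hardware emulates is just a priority queue keyed on the next contribution index; a minimal analogue (a heap instead of the slide's location-addressed cyclic memory, with made-up progressions):

```python
import heapq, math

def large_prime_station(progressions, sieve_len):
    """progressions: list of (p, first index). Replays events (p_i, a_i) in
    increasing-index order, adding log p_i at each hit and rescheduling the
    progression at a_i + p_i, mirroring steps 1-4 above."""
    acc = [0.0] * sieve_len
    events = [(a, p) for p, a in progressions]
    heapq.heapify(events)
    while events:
        a, p = heapq.heappop(events)        # 1. read next event (p_i, a_i)
        if a >= sieve_len:
            continue                        # progression has left the sieving region
        acc[a] += math.log(p)               # 2. contribute (on bus line a mod s in hardware)
        heapq.heappush(events, (a + p, p))  # 3-4. update a_i <- a_i + p_i and re-save
    return acc

acc = large_prime_station([(101, 17), (103, 60)], 500)
print([i for i, v in enumerate(acc) if v > 0])
# [17, 60, 118, 163, 219, 266, 320, 369, 421, 472]
```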

24 Handling large primes (cont.)
The memory used by past events can be reused. Think of the processor as rotating around the cyclic memory.
[diagram: processor rotating around a cyclic memory]

25 Handling large primes (cont.)
The memory used by past events can be reused. Think of the processor as rotating around the cyclic memory.
By appropriate choice of parameters, we guarantee that new events are always written just behind the read head.
There is a tiny (1:1000) window of activity which is "twirling" around the memory bank. It is handled by an SRAM-based cache; the bulk of storage is handled in compact DRAM.

26 Rational vs. algebraic sieves
We actually have two sieves: rational and algebraic. We are looking for the indices that accumulated enough value in both sieves.
The algebraic sieve has many more progressions, and thus dominates cost.
We cannot compensate by making s much larger, since the pipeline becomes very wide and the device exceeds the capacity of a wafer.
[diagram: rational and algebraic sieves side by side]

27 Optimization: cascaded sieves
The algebraic sieve considers only the indices that passed the rational sieve.
In the algebraic sieve we still scan the indices at a rate of thousands per clock cycle, but only a few of these have to be considered
⇒ much narrower bus
⇒ s increased to 32,768.
[diagram: rational sieve feeding the algebraic sieve]

28 Performance
Asymptotically: a speedup of … compared to traditional sieving.
For 512-bit composites: one silicon wafer full of TWIRL devices completes the sieving in under 10 minutes (… sec per sieve line of length 1.8×10¹⁰), 1,600 times faster than the best previous design.
Larger composites?

29 Estimating NFS parameters
Predicting cost requires estimating the NFS parameters (smoothness bounds, sieving area, frequency of candidates, etc.).
Methodology [Lenstra, Dodson, Hughes, Leyland]:
Find good NFS polynomials for the RSA-1024 and RSA-768 composites.
Analyze and optimize relation yield for these polynomials according to smoothness probability functions.
Hope that cycle yield, as a function of relation yield, behaves similarly to past experiments.

30 1024-bit NFS sieving parameters
Smoothness bounds: rational: 3.5×10⁹; algebraic: 2.6×10¹⁰.
Region: a ∈ {−5.5×10¹⁴, …, 5.5×10¹⁴}, b ∈ {1, …, 2.7×10⁸}.
Total: 3×10²³ (×6/π², the density of coprime pairs).

31 TWIRL for 1024-bit composites
A cluster of 9 TWIRLs can process a sieve line (10¹⁵ indices) in 34 seconds.
To complete the sieving in 1 year, use 194 clusters.
Initial investment (NRE): ~$20M.
After NRE, total cost of sieving for a given 1024-bit composite: ~10M $ × year (compared to ~1T $ × year).
[diagram: a cluster, consisting of one algebraic TWIRL (A) and eight rational TWIRLs (R)]

32 The matrix step
We look for elements from the kernel of a sparse matrix over GF(2). Using Wiedemann's algorithm, this is reduced to the following:
Input: a sparse D×D binary matrix A and a binary D-vector v.
Output: the first few bits of each of the vectors Av, A²v, A³v, …, A^D v (mod 2).
D is huge (e.g., ~10⁹).
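A toy sketch of the workload this defines: repeated sparse matrix-by-vector products over GF(2), keeping only the first few bits of each product (the representation choices are ours; this is the raw sequence from which Wiedemann's method recovers a minimal polynomial):

```python
def matvec_gf2(rows, v):
    """rows[i] lists the column indices of the 1 entries in row i of A;
    v is a bitmask. Returns the bitmask of A*v (mod 2)."""
    out = 0
    for i, cols in enumerate(rows):
        bit = 0
        for j in cols:
            bit ^= (v >> j) & 1
        out |= bit << i
    return out

def krylov_first_bits(rows, v, steps, k):
    """First k bits of Av, A^2 v, ..., A^steps v."""
    seq = []
    for _ in range(steps):
        v = matvec_gf2(rows, v)
        seq.append(v & ((1 << k) - 1))
    return seq

A = [[0, 2], [1], [0, 3], [2, 3]]          # toy 4x4 sparse matrix over GF(2)
print(krylov_first_bits(A, 0b0111, 4, 4))  # [14, 7, 14, 7]
```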

33 The matrix step (cont.)
Bernstein proposed a parallel algorithm for sparse matrix-by-vector multiplication with an asymptotic speedup. Alas, for the parameters of choice it is inferior to a straightforward PC-based implementation.
We give a different algorithm which reduces the cost by a constant factor of 45,000.

34 Matrix-by-vector multiplication
[diagram: A × v = result vector; each entry of the result is a sum Σ (mod 2) of the matrix entries selected by the 1 bits of v]

35 A routing-based circuit for the matrix step [Lenstra, Shamir, Tomlinson, Tromer 2002]
Model: two-dimensional mesh, nodes connected to ≤ 4 neighbours.
Preprocessing: load the non-zero entries of A into the mesh, one entry per node. The entries of each column are stored in a square block of the mesh, along with a "target cell" for the corresponding vector bit.

36 Operation of the routing-based circuit
To perform a multiplication:
Initially the target cells contain the vector bits. These are locally broadcast within each block (i.e., within the matrix column).
A cell containing a row index i that receives a "1" emits an i value (which corresponds to a 1 at row i).
Each i value is routed to the target cell of the i-th block (which is collecting i's for row i).
Each target cell counts the number of i values it received.
That's it! Ready for the next iteration.
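Stripped of the mesh, the emit/route/count flow computes a column-major sparse product; a sketch with the routing abstracted into a counter (data layout and names are ours):

```python
from collections import Counter

def routing_multiply(columns, v_bits):
    """columns[j] lists the row indices of the 1 entries in column j.
    Broadcast each vector bit within its column block, emit row indices,
    'route' them to targets, and take the counts mod 2."""
    counts = Counter()
    for j, bit in enumerate(v_bits):
        if bit:                    # block j's cells each emit their row index i
            for i in columns[j]:
                counts[i] += 1     # target cell of block i counts arrivals
    return [counts[i] % 2 for i in range(len(v_bits))]

cols = [[0, 2], [1, 3], [0, 1], [2]]    # toy 4x4 matrix, column-major
print(routing_multiply(cols, [1, 0, 1, 1]))  # [0, 1, 0, 0]
```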

37 How to perform the routing?
Routing dominates cost, so the choice of algorithm (time, circuit area) is critical.
There is extensive literature about mesh routing. Examples: bounded-queue-size algorithms, hot-potato routing, off-line algorithms.
None of these are ideal.

38 Clockwise transposition routing on the mesh
[Bernstein] One packet per cell. Only pairwise compare-exchange operations; compared pairs are swapped according to the preference of the packet that has the farthest to go along this dimension.
Very simple schedule, can be realized implicitly by a pipeline. Pairwise annihilation.
Worst case: m² steps. Average case: ? Experimentally: 2m steps suffice for random inputs – optimal.
The point: m² values handled in time O(m).
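A toy software rendering of the compare-exchange rule, under simplifying assumptions (a random permutation of destinations, a plain odd-even schedule alternating row and column phases, no pairwise annihilation); it illustrates the rule, and makes no claim to reproduce the real schedule's 2m behaviour:

```python
import random

def prefer_swap(d_left, d_right):
    """d_left/d_right: remaining displacements of the two packets along this
    dimension (left/upper packet first). A swap moves the left packet forward
    and the right packet backward; the packet with the farthest to go decides."""
    if abs(d_left) >= abs(d_right):
        return d_left > 0
    return d_right < 0

def simulate_routing(m, seed=0, max_steps=10000):
    rng = random.Random(seed)
    targets = [(r, c) for r in range(m) for c in range(m)]
    rng.shuffle(targets)              # one packet per cell, random destinations
    grid = [[targets[r * m + c] for c in range(m)] for r in range(m)]
    for step in range(1, max_steps + 1):
        offset = (step // 2) % 2
        if step % 2:                  # horizontal phase: compare along rows
            for r in range(m):
                for c in range(offset, m - 1, 2):
                    a, b = grid[r][c], grid[r][c + 1]
                    if prefer_swap(a[1] - c, b[1] - (c + 1)):
                        grid[r][c], grid[r][c + 1] = b, a
        else:                         # vertical phase: compare along columns
            for c in range(m):
                for r in range(offset, m - 1, 2):
                    a, b = grid[r][c], grid[r + 1][c]
                    if prefer_swap(a[0] - r, b[0] - (r + 1)):
                        grid[r][c], grid[r + 1][c] = b, a
        if all(grid[r][c] == (r, c) for r in range(m) for c in range(m)):
            return step
    return None                       # toy schedule stalled within max_steps

print(simulate_routing(8))  # step count if all packets arrived, else None
```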

39 Comparison to Bernstein's design
Time: a single routing operation (2m steps) vs. 3 sorting operations (8m steps each): ×1/12.
Circuit area: ×1/3. Only the i values move; the matrix entries don't. Simple routing logic and small routed values. Matrix entries compactly stored in DRAM (~1/100 the area of "active" storage).
Fault-tolerance. Flexibility.

40 Improvements
Reduce the number of cells in the mesh: for small μ, decreasing the number of cells by a factor of μ decreases throughput cost by ~μ^{1/2} (×1/7).
Use Coppersmith's block Wiedemann (×1/15).
Execute the separate multiplication chains of block Wiedemann simultaneously on one mesh: for small K, reduces cost by ~K (×1/6).
Compared to Bernstein's original design, this reduces the throughput cost by a constant factor of 45,000.

41 Implications for 1024-bit composites:
Sieving step: ~10M $ × year (including cofactor factorization).
Matrix step: <0.5M $ × year.
Other steps: unknown, but no obvious bottleneck.
This relies on a hypothetical design and many approximations, but should be taken into account by anyone planning to use 1024-bit RSA keys.
For larger composites (e.g., 2048 bit) the cost is impractical.

42 Conclusions
1024-bit RSA is less secure than previously assumed.
Tailoring algorithms to the concrete properties of available technology can have a dramatic effect on cost.
Never underestimate the power of custom-built highly-parallel hardware.

43.