Distributed Linear Algebra. Peter L. Montgomery, Microsoft Research, Redmond, USA. RSA 2000, January 17, 2000.

Role of matrices in factoring n. Sieving finds many relations x_j^2 ≡ ∏_i p_i^{e_ij} (mod n). Raise the jth relation to the power s_j = 0 or 1 and multiply. The left side is always a perfect square; the right side is a square if the exponents Σ_j e_ij s_j are even for all i. This gives the matrix equation Es ≡ 0 (mod 2), with E known. Knowing x^2 ≡ y^2 (mod n), test GCD(x - y, n). Matrix rows represent the primes p_i, entries are the exponents e_ij, and arithmetic is over GF(2).
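To make the GF(2) linear algebra concrete, here is a small illustrative sketch (not from the talk; the function name and toy data are mine). It finds a set of relations whose exponent vectors sum to zero mod 2, i.e. a solution of Es ≡ 0 (mod 2), by Gaussian elimination with each column of E packed into a Python integer.

    # Toy sketch: bit i of column j is e_ij mod 2, so XOR of two packed
    # columns adds exponent vectors over GF(2).
    def dependency(columns):
        """Return a set of column indices whose packed columns XOR to 0, else None."""
        basis = {}                      # pivot bit -> (reduced column, contributing indices)
        for j, col in enumerate(columns):
            combo = {j}
            while col:
                pivot = col.bit_length() - 1
                if pivot not in basis:
                    basis[pivot] = (col, combo)
                    break
                red_col, red_combo = basis[pivot]
                col ^= red_col
                combo ^= red_combo      # symmetric difference of index sets
            else:
                return combo            # column reduced to 0: these relations multiply to a square
        return None

    # Exponent vectors mod 2 over the primes {2, 3, 5}; the third is the XOR of the first two.
    print(dependency([0b011, 0b110, 0b101]))    # -> {0, 1, 2}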

Matrix growth on RSA Challenge.
RSA-140 (Jan-Feb 1999): weight …; omit primes < …; … Cray-C90 hours; 75% of 800 Mb for matrix storage.
RSA-155 (August 1999): weight …; omit primes < …; … Cray-C90 hours; 85% of 1960 Mb for matrix storage.

Regular Lanczos. A is a positive definite (real, symmetric) n × n matrix. Given b, we want to solve Ax = b for x. Set w_0 = b and, for i ≥ 0, w_{i+1} = A w_i - Σ_{0≤j≤i} c_ij w_j, where c_ij = w_j^T A^2 w_i / w_j^T A w_j. Stop when w_{i+1} = 0.

Claims. w_j^T A w_j ≠ 0 if w_j ≠ 0 (A is positive definite). w_j^T A w_i = 0 whenever i ≠ j (by choice of c_ij and symmetry of A). Eventually some w_{i+1} = 0, say for i = m (otherwise there would be too many A-orthogonal vectors). x = Σ_{0≤j≤m} (w_j^T b / w_j^T A w_j) w_j satisfies Ax = b (the error u = Ax - b lies in the space spanned by the w_j but is orthogonal to all w_j, so u^T u = 0 and u = 0).

Simplifying c_ij when i > j+1. (w_j^T A w_j) c_ij = w_j^T A^2 w_i = (A w_j)^T (A w_i) = (w_{j+1} + a linear combination of w_0 to w_j)^T (A w_i) = 0, by A-orthogonality. The recurrence simplifies to w_{i+1} = A w_i - c_ii w_i - c_{i,i-1} w_{i-1} when i ≥ 1. Little history to save as i advances.
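A rough floating-point sketch of this simplified recurrence (my own illustration, assuming A is a dense symmetric positive definite NumPy array; the talk's actual setting, exact arithmetic over GF(2), is handled by Block Lanczos below):

    import numpy as np

    def lanczos_solve(A, b, tol=1e-12):
        """Solve A x = b for symmetric positive definite A via the Lanczos recurrence."""
        n = len(b)
        x = np.zeros(n)
        w_prev, Aw_prev, d_prev = None, None, None
        w = b.astype(float).copy()
        for _ in range(n):                        # at most n A-orthogonal vectors exist
            if np.linalg.norm(w) < tol:           # w_{i+1} = 0: done
                break
            Aw = A @ w
            d = w @ Aw                            # w_i^T A w_i, positive while w_i != 0
            x += (w @ b) / d * w                  # accumulate (w_i^T b / w_i^T A w_i) w_i
            w_next = Aw - (Aw @ Aw) / d * w       # c_ii = w_i^T A^2 w_i / w_i^T A w_i
            if w_prev is not None:
                w_next -= (Aw_prev @ Aw) / d_prev * w_prev   # c_{i,i-1} term
            w_prev, Aw_prev, d_prev, w = w, Aw, d, w_next
        return x

    # e.g. A = M.T @ M for a random square M, then x = lanczos_solve(A, A @ np.ones(50))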

Major operations needed. Pre-multiply w_i by A. Inner products such as w_j^T A w_j and w_j^T A^2 w_i = (A w_j)^T (A w_i). Add a scalar multiple of one vector to another.

Adapting to Bx = 0 over GF(2). B is n_1 × n_2 with n_1 ≤ n_2, and not symmetric. Solve Ax = 0 where A = B^T B; A is n_2 × n_2, and B^T has a small nullspace in practice. The right side is zero, so Lanczos gives x = 0; instead solve Ax = Ay where y is random. Over GF(2), u^T u and u^T A u can vanish even when u ≠ 0. Solved by Block Lanczos (Eurocrypt 1995).
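For example, over GF(2) the nonzero vector u = (1, 1)^T has u^T u = 1·1 + 1·1 = 0, so the orthogonality argument that drives regular Lanczos breaks down.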

Block Lanczos summary. Let N be the machine word length (typically 32 or 64) or a small multiple thereof. Vectors are n_1 × N or n_2 × N over GF(2). Exclusive-OR and other hardware bitwise instructions operate on N-bit data. Recurrences are similar to regular Lanczos. Approximately n_1/(N - 0.76) iterations. Up to N independent solutions of Bx = 0.

Block Lanczos major operations. Pre-multiply an n_2 × N vector by B. Pre-multiply an n_1 × N vector by B^T. N × N inner product of two n_2 × N vectors. Post-multiply an n_2 × N vector by an N × N matrix. Add two n_2 × N vectors. How do we parallelize these?
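As an illustration of the last three (single-node) operations, here is a hedged sketch under an assumed bit-packed layout that the slide does not spell out: an n × N vector is a list of n Python integers, each holding one N-bit row.

    N = 64    # machine word length

    def add(v, w):
        """Entrywise sum of two n x N vectors over GF(2): XOR the packed rows."""
        return [a ^ b for a, b in zip(v, w)]

    def inner_product(v, w):
        """N x N product v^T w over GF(2), returned as N packed rows."""
        out = [0] * N
        for a, b in zip(v, w):          # row i of v and row i of w
            for k in range(N):
                if (a >> k) & 1:        # v[i][k] = 1
                    out[k] ^= b         # row k of v^T w accumulates row i of w
        return out

    def postmultiply(v, m):
        """n x N vector v times an N x N matrix m (given as N packed rows)."""
        out = []
        for a in v:
            row = 0
            for k in range(N):
                if (a >> k) & 1:
                    row ^= m[k]
            out.append(row)
        return out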

Assumed processor topology. Assume a g_1 × g_2 toroidal grid of processors. A torus is a rectangle with its top connected to its bottom, and its left to its right (a doughnut). We need fast communication to/from the immediate neighbors north, south, east, and west. Processor names are p_rc, where r is taken modulo g_1 and c modulo g_2. Set gridrow(p_rc) = r and gridcol(p_rc) = c.
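A tiny sketch of the wraparound arithmetic (the function name is mine, not the talk's):

    def neighbours(r, c, g1, g2):
        """Four immediate neighbours of p_rc on a g1 x g2 torus (coordinates wrap)."""
        return {"north": ((r - 1) % g1, c),
                "south": ((r + 1) % g1, c),
                "west":  (r, (c - 1) % g2),
                "east":  (r, (c + 1) % g2)}

    # On a 3 x 3 torus, the east neighbour of p_02 wraps around to p_00:
    print(neighbours(0, 2, 3, 3)["east"])    # (0, 0)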

A torus of processors. Example: a 3 × 3 torus with processors P1-P9, where the grid rows are (P1 P2 P3), (P4 P5 P6), (P7 P8 P9) and each row and each column wraps around.

Matrix row and column guardians. For 0 ≤ i < n_1, a processor rowguard(i) is responsible for entry i in all n_1 × N vectors. For 0 ≤ j < n_2, a processor colguard(j) is responsible for entry j in all n_2 × N vectors. Processor-assignment algorithms aim for load balancing.
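One possible assignment (my assumption; the slide does not give the algorithm) deals rows and columns out round-robin across the grid:

    def rowguard(i, g1, g2):
        """Guardian of matrix row i, as (gridrow, gridcol): dealt round-robin."""
        return (i % g1, (i // g1) % g2)

    def colguard(j, g1, g2):
        """Guardian of matrix column j, dealt round-robin the other way."""
        return ((j // g2) % g1, j % g2)

    # A production assignment would also balance the actual matrix weight per row and column.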

Three major operations. Vector addition is pointwise: when adding two n_2 × N vectors, processor colguard(j) does the j-th entries, and the data is local. Likewise for multiplying an n_2 × N vector by an N × N matrix. For inner products, processors form partial N × N inner products and a central processor sums them. These operations need little communication. Workloads are O(#columns assigned).

Allocating B among processors. Let B = (b_ij) for 0 ≤ i < n_1 and 0 ≤ j < n_2. Processor p_rc is responsible for all b_ij where gridrow(rowguard(i)) = r and gridcol(colguard(j)) = c. When pre-multiplying by B, the input data from colguard(j) arrives along grid column c, and the output data for rowguard(i) departs along grid row r.

Multiplying u = Bv, where u is n_1 × N and v is n_2 × N. Distribute each v[j] to all p_rc with gridcol(colguard(j)) = c; that is, broadcast each v[j] along one column of the grid. Each p_rc processes all of its b_ij, building partial u[i] outputs. Partial u[i] values are summed as they advance along a grid row to rowguard(i). Individual workloads depend upon B.
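For reference, a serial sketch of the multiply being distributed here (grid partitioning and messaging omitted; the sparse row-list storage is my assumption). B is stored per row as the list of column indices of its nonzero entries, and v and u are bit-packed N-bit rows as before.

    def multiply_B(rows, v, n1):
        """u = B v over GF(2): u[i] is the XOR of v[j] over the nonzero b_ij in row i."""
        u = [0] * n1
        for i, cols in enumerate(rows):
            acc = 0
            for j in cols:
                acc ^= v[j]
            u[i] = acc
        return u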

Actions by p_rc during multiply. Send/receive all v[j] with gridcol(colguard(j)) = c. Zero all u[i] with rowguard(i) = p_{r,c+1}. At time t, where 1 ≤ t ≤ g_2, adjust all u[i] with rowguard(i) = p_{r,c+t} (t nodes east). If t ≠ g_2, ship these u[i] west to p_{r,c-1} and receive other u[i] from p_{r,c+1} on the east. We want balanced workloads at each t.

Multiplication by B^T. Reverse the roles of matrix rows and columns, and the roles of grid rows and columns. B^T and B can share storage, since the same processor handles (B)_ij during the multiply by B as handles (B^T)_ji during the multiply by B^T.
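Continuing the serial sketch above, the same row-list storage serves the transpose multiply, mirroring the storage sharing described on this slide:

    def multiply_BT(rows, u, n2):
        """w = B^T u over GF(2): scatter each u[i] into the columns j of row i."""
        w = [0] * n2
        for i, cols in enumerate(rows):
            for j in cols:
                w[j] ^= u[i]
        return w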

Major memory requirements. Matrix data is split amongst the processors. With cache-friendly blocks, an entry needs only two 16-bit offsets. Each processor needs one vector of length max(n_1/g_1, n_2/g_2) and a few of length n_2/(g_1 g_2), with N bits per entry. The central processor needs one vector of length n_2, plus rowguard and colguard.

Major communications during multiply by B. Broadcast each v[j] along its entire grid column: this ships n_2 N bits to each of g_1 - 1 destinations. Forward partial u[i] along the grid row, one node at a time: a total of (g_2 - 1) n_1 N bits. When n_2 ≈ n_1, the communication for B and B^T together is 2(g_1 + g_2 - 2) n_1 N bits per iteration, i.e. 2(g_1 + g_2 - 2) n_1^2 bits after n_1/N iterations.
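Plugging illustrative numbers into these formulas (the matrix size and grid shape below are my assumptions for the sake of example, not figures from the talk):

    n1, N = 6_700_000, 64        # assumed matrix rows and machine word length
    g1 = g2 = 4                  # assumed 4 x 4 processor grid

    per_iteration = 2 * (g1 + g2 - 2) * n1 * N     # bits moved per iteration
    whole_solve   = 2 * (g1 + g2 - 2) * n1 ** 2    # bits after ~n1/N iterations

    print(per_iteration / 8e6, "MB per iteration")   # about 643 MB
    print(whole_solve / 8e12, "TB in total")         # about 67 TB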

Choosing grid size. Large enough that the matrix fits in memory: matrix storage is about 4w/(g_1 g_2) bytes per processor, where w is the total matrix weight. Try to balance I/O and computation times: the multiply cost is O(n_1 w/(g_1 g_2)) per processor, and the communications cost is O((g_1 + g_2 - 2) n_1^2). Prefer a square grid, to reduce g_1 + g_2.

Choice of N and matrix. Prefer a smaller but heavier matrix if it fits, to lessen communications. Higher N yields more dependencies, letting you omit the heaviest rows from the matrix. Larger N means fewer but longer messages. The size of the vector elements affects the cache. When N is large, inner products and post-multiplies by N × N matrices are slower.

Cambridge cluster configuration. Microsoft Research, Cambridge, UK. 16 dual-CPU 300 MHz Pentium IIs. Each node: 384 MB RAM, 4 GB local disk. Networks: dedicated fast Ethernet (100 Mb/sec) and Myrinet M2M-OCT-SW8 (1.28 Gb/sec).

Message Passing Interface (MPI). An industry standard. MPI implementations exist for the majority of parallel systems and interconnects, both public domain (e.g. mpich) and commercial (e.g. MPI PRO). MPI supports many communications primitives, including virtual topologies (e.g. a torus).
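As one way to express the torus of the earlier slides with MPI's virtual topologies, here is a sketch using the mpi4py Python binding (my choice of binding and grid size; the cluster code itself would presumably have called MPI from a compiled language). Run it under mpiexec with at least g1*g2 processes.

    from mpi4py import MPI

    g1, g2 = 4, 4                                    # assumed grid dimensions
    comm = MPI.COMM_WORLD
    cart = comm.Create_cart(dims=[g1, g2],
                            periods=[True, True],    # wrap both dimensions: a torus
                            reorder=True)

    r, c = cart.Get_coords(cart.Get_rank())          # gridrow and gridcol of this process
    west, east = cart.Shift(direction=1, disp=1)     # neighbours along the grid row

    # During the multiply by B, partial u[i] blocks are shipped west while
    # receiving from the east, e.g.
    # cart.Sendrecv(sendbuf, dest=west, recvbuf=recvbuf, source=east)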

Performance data from MSR Cambridge cluster