Reducing number of operations: The joy of algebraic transformations CS498DHP Program Optimization

Number of operations and execution time

Fewer operations do not necessarily mean shorter execution times:
–because of scheduling in a parallel environment,
–because of locality,
–because of communication in a parallel program.

Nevertheless, although it has to be applied carefully, reducing the number of operations is one of the important optimizations. In this presentation, we discuss transformations that reduce the number of operations, or that reduce the length of the schedule in an idealized parallel environment where communication costs are zero.

Scheduling

Consider the expression tree shown in the figure (lost in transcription), a deep chain equivalent to a+h+b*(c+g+d*e*f). It can be shortened by applying
–associativity and commutativity: a+h+b*(c+g+d*e*f), or
–associativity, commutativity and distributivity: a+h+b*c+b*g+b*d*e*f.

The last expression has the shortest tree of the three. This means that, with enough resources, it is the fastest to evaluate even though it has the most operations.
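For instance, one possible schedule for the distributed form, assuming unit-time operations and unbounded resources, is:

step 1: a+h, b*c, b*g, b*d, e*f
step 2: (a+h)+(b*c), (b*d)*(e*f)
step 3: ((a+h)+(b*c))+(b*g)
step 4: add the two remaining partial results

Four steps in all, versus five for a+h+b*(c+g+d*e*f), whose critical path is d*e, then *f, then +(c+g), then *b, then +(a+h).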

Locality

Consider the two versions:

do i=1,n
  c(i) = a(i)+b(i)+a(i)/b(i)
end do
…
do i=1,n
  x(i) = (a(i)+b(i))*t(i)+a(i)/b(i)
end do

versus

do i=1,n
  d(i) = a(i)/b(i)
  c(i) = a(i)+b(i)+d(i)
end do
…
do i=1,n
  x(i) = (a(i)+b(i))*t(i)+d(i)
end do

The second version executes fewer operations, since a(i)/b(i) is computed only once, but, if n is large enough, it also incurs more cache misses: the values of d written in the first loop have been evicted by the time the second loop reads them. (We assume that t is computed between the two loops, so that the loops cannot be fused.)

Communication in parallel programs

Consider:

cobegin
  …
  do i=1,n
    a(i) = ..
  end do
  send a(1:n)
  …
//
  …
  receive a(1:n)
  …
coend

versus

cobegin
  …
  do i=1,n
    a(i) = ..
  end do
  …
//
  …
  do i=1,n
    a(i) = ..
  end do
  …
coend

The second version executes more operations, since both tasks compute a(1:n), but it executes faster when the send operation is expensive.

Approaches to reducing the cost of computation

–Eliminate (syntactically) redundant computations.
–Apply algebraic transformations to reduce the number of operations.
–Decompose sequential computations for parallel execution.
–Apply algebraic transformations to reduce the height of expression trees, and thus reduce execution time in a parallel environment.

Elimination of redundant computations

Many of these transformations were discussed in the context of compiler transformations:
–common subexpression elimination,
–loop-invariant removal,
–elimination of redundant counters,
–loop unrolling (not discussed, but it should have been; it eliminates bookkeeping operations).
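For instance, a minimal sketch of the first of these, common subexpression elimination, in C (the variable names are invented for the example):

int b = 2, c = 3, e = 4, g = 1, d, f, t;

/* before: b + c is computed twice */
d = (b + c) * e;
f = (b + c) - g;

/* after common subexpression elimination: computed once */
t = b + c;
d = t * e;
f = t - g;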

However, compilers will not eliminate all redundant computations. Here is an example where user intervention is needed. The following sequence

do i=1,n
  s = a(i)+s
end do
…
do i=1,n-1
  t = a(i)+t
end do
…t…

may be replaced by

do i=1,n-1
  t = a(i)+t
end do
s=t+a(n)
…
…t…

(assuming s and t start with the same value), so that the first n-1 additions are shared. This transformation is not usually done by compilers.

Another example, from C, is the loop

for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    a[i][j] = 0;
  }
}

which, if a is n × n, can be transformed into the loop below, which has fewer bookkeeping operations:

b = &a[0][0];
for (i = 0; i < n*n; i++) {
  *b = 0;
  b++;
}

Applying algebraic transformations to reduce the number of operations

For example, the expression a*(b*c)+(b*a)*d+a*e can be transformed into (a*b)*(c+d)+a*e by associativity, commutativity and distributivity, and then into a*(b*(c+d)+e) by associativity and distributivity, going from five multiplications and two additions down to two of each. Notice that associativity has to be applied with care. For example, suppose we are operating on floating-point values, that x is very much larger than y, and that z = -x. Then (y+x)+z may give 0 as a result, while y+(x+z) gives y as the answer.
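A minimal C program illustrating this caution, taking x = 1e20 so that x is much larger than y = 1:

#include <stdio.h>

int main(void) {
    float x = 1.0e20f, y = 1.0f, z = -1.0e20f;   /* z = -x */
    printf("(y+x)+z = %g\n", (y + x) + z);  /* prints 0: y is absorbed when added to the huge x */
    printf("y+(x+z) = %g\n", y + (x + z));  /* prints 1: x and z cancel exactly first */
    return 0;
}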

The application of algebraic rules can be very sophisticated. Consider the computation of x^n. A naïve implementation would require n-1 multiplications. However, if we represent n in binary as n = b0 + 2(b1 + 2(b2 + …)) and notice that x^n = x^b0 · (x^(b1+2(b2+…)))^2, the number of multiplications can be reduced to O(log n).

double power(double x, unsigned int n) {    /* assumes n > 0 */
    if (n == 1) return x;
    if (n % 2 == 1) return x * power(x, n - 1);
    double h = power(x, n / 2);             /* compute x^(n/2) once */
    return h * h;                           /* ... and square it */
}
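For example, power(x, 13) performs only 5 multiplications, via the call chain n = 13 → 12 → 6 → 3 → 2 → 1, instead of the 12 of the naïve method.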

Horner’s rule

A polynomial A(x) = a0 + a1·x + a2·x² + a3·x³ + … may be written as A(x) = a0 + x(a1 + x(a2 + x(a3 + …))). As a result, a polynomial of degree n may be evaluated at a point x', that is, A(x') computed, in Θ(n) time using Horner's rule: repeated multiplications and additions, rather than the naïve method of raising x to powers, multiplying by the coefficient, and accumulating.
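A sketch of Horner's rule in C, assuming the coefficients a_0, …, a_n of a degree-n polynomial are stored in a[0..n]:

double horner(const double a[], int n, double x) {
    double result = a[n];
    for (int i = n - 1; i >= 0; i--)
        result = a[i] + x * result;   /* one multiplication and one addition per coefficient */
    return result;
}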

Conventional matrix multiplication

Asymptotic complexity: 2n^3 operations.
Each recursion step (blocked version): 8 multiplications, 4 additions.
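For reference, a straightforward C version of the conventional algorithm for n × n row-major matrices:

void matmul(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)        /* n multiplications and n additions */
                s += A[i*n + k] * B[k*n + j];  /* per output entry: 2n^3 in total */
            C[i*n + j] = s;
        }
}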

Strassen’s algorithm

Asymptotic complexity: O(n^(log2 7)) = O(n^2.8…) operations.
Each recursion step: 7 multiplications, 18 additions/subtractions.
The asymptotic complexity is the solution of T(n) = 7T(n/2) + 18(n/2)^2.
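For completeness, the seven products, with the operands partitioned into 2 × 2 blocks A11..A22 and B11..B22:

P1 = (A11 + A22)(B11 + B22)
P2 = (A21 + A22) B11
P3 = A11 (B12 - B22)
P4 = A22 (B21 - B11)
P5 = (A11 + A12) B22
P6 = (A21 - A11)(B11 + B12)
P7 = (A12 - A22)(B21 + B22)

C11 = P1 + P4 - P5 + P7
C12 = P3 + P5
C21 = P2 + P4
C22 = P1 - P2 + P3 + P6

The ten additions/subtractions inside the products plus the eight in the combinations account for the 18 per step.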

Winograd’s variant

Asymptotic complexity: O(n^(log2 7)) operations (the exponent is unchanged, since there are still 7 multiplications per step).
Each recursion step: 7 multiplications, 15 additions/subtractions.

Parallel matrix multiplication

Parallel matrix multiplication can be accomplished without redundant operations. First observe that the time to compute the sum of n elements, given enough resources, is ⌈log2 n⌉ addition steps, using a balanced reduction tree.
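For example, with n = 8 and four adders:

step 1: a1+a2, a3+a4, a5+a6, a7+a8
step 2: (a1+a2)+(a3+a4), (a5+a6)+(a7+a8)
step 3: the final sum

Three steps in all, and no addition is performed more than once.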

[Figure: balanced addition tree; time: ⌈log2 n⌉ steps. Lost in transcription.]

With sufficient replication and computational resources, matrix multiplication can take just one multiplication step plus ⌈log2 n⌉ addition steps: all n^3 elementwise products are formed in parallel, and then each of the n^2 entries of the result is obtained by a logarithmic-depth reduction of its n products.

Copying can also be done in logarithmic steps: doubling the number of copies at each step (1, 2, 4, …) produces the n copies needed above in ⌈log2 n⌉ steps.

Parallelism and redundancy

Algebraic rules can be applied to reduce tree height. In some cases, the height of the tree is reduced at the expense of an increase in the number of operations.

Parallel Prefix
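The figure for this slide is lost in transcription. Parallel prefix (computing all partial sums x1, x1+x2, x1+x2+x3, …) is the classic example of trading extra operations for reduced height. A sketch of the data-parallel (Hillis-Steele) scheme in C; the inner loop is conceptually one parallel step, written here sequentially:

#include <stdlib.h>
#include <string.h>

void prefix_sums(double *x, int n) {
    double *old = malloc(n * sizeof *old);
    for (int d = 1; d < n; d *= 2) {        /* ceil(log2 n) steps */
        memcpy(old, x, n * sizeof *old);
        for (int i = d; i < n; i++)         /* all i at once on a parallel machine */
            x[i] = old[i] + old[i - d];
    }
    free(old);
}

This takes O(log n) steps but performs O(n log n) additions, versus the n-1 of the obvious sequential loop.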

Redundancy in parallel sorting. Sorting networks.

Comparator (2-sorter)

A comparator takes two inputs, x and y, and produces two outputs, min(x, y) and max(x, y).

Comparison network

A comparison network consists of d stages, each performing up to n/2 comparisons.

Sorting networks

A sorting network is a comparison network whose outputs are sorted for every input.

Insertion sort network

An insertion sort network on n inputs has depth 2n - 3.

Network                       Comparator stages   Comparators
Odd-even transposition sort   O(n)                O(n²)
Bubblesort                    O(n)                O(n²)
Bitonic sort                  O(log(n)²)          O(n·log(n)²)
Odd-even mergesort            O(log(n)²)          O(n·log(n)²)
Shellsort                     O(log(n)²)          O(n·log(n)²)
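As a sketch of the first entry in the table, odd-even transposition sort in C: it uses n stages, and within a stage the compare-exchanges touch disjoint pairs, so they could all run in parallel (compare_exchange is the comparator described above):

void compare_exchange(int *a, int i, int j) {          /* 2-sorter: min first, max second */
    if (a[i] > a[j]) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

void odd_even_transposition_sort(int *a, int n) {
    for (int stage = 0; stage < n; stage++)            /* O(n) stages */
        for (int i = stage % 2; i + 1 < n; i += 2)     /* disjoint pairs: parallelizable */
            compare_exchange(a, i, i + 1);
}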