March 14, 2002. CMPUT680 - Winter 2006, Topic C: Loop Fusion. Kit Barton, www.cs.ualberta.ca/~cbarton

Presentation transcript:

CMPUT680 - Winter 2006, Topic C: Loop Fusion. Kit Barton

Outline
- Definition of loop fusion
- Basic concepts
- Prerequisites of loop fusion
- A loop fusion algorithm
- Example

Loop Fusion
- Combine two or more loops into a single loop
- Fusion must not violate any dependencies between the loop bodies
- Several conditions must be met for fusion to occur
- Often these conditions are not initially satisfied
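To make the definition concrete, here is a minimal sketch in Python (chosen for brevity; the slides themselves use C-like pseudocode). The function and array names are illustrative assumptions, not taken from the slides. Both loops have the same iteration count and the second reads only the element the first has just written, so fusing them preserves the result.

```python
def separate_loops(c):
    """Two adjacent loops with identical iteration counts."""
    n = len(c)
    a = [0] * n
    b = [0] * n
    for i in range(n):      # loop 1 produces a[i]
        a[i] = c[i] + 10
    for j in range(n):      # loop 2 consumes a[j] at the same index
        b[j] = a[j] * 2
    return a, b


def fused_loop(c):
    """The same computation after fusion: one loop, one increment and branch."""
    n = len(c)
    a = [0] * n
    b = [0] * n
    for i in range(n):
        a[i] = c[i] + 10
        b[i] = a[i] * 2     # a[i] was written in this iteration, so it can be reused
    return a, b
```

Calling both functions on the same input and comparing the returned pairs is a quick way to check that the fused form is equivalent.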

Advantages of Loop Fusion
- Saves increment and branch instructions
- Creates opportunities for data reuse
- Provides more instructions to the instruction scheduler to balance the use of functional units

Disadvantages of Loop Fusion
- Increases code size, affecting instruction cache performance
- Increases register pressure within a loop
- Could cause the formation of loops with more complex control flow

Background
- Extensive work has been done on loop fusion
- Most has focused on weighted loop fusion (Gao et al., Kennedy and McKinley, Megiddo and Sarkar)
- Extensive work has also been done on performing loop fusion to increase parallelism

Weighted Loop Fusion
- Associates a non-negative weight with each pair of loop nests
- Each weight measures the expected gain if the two loops are fused
- Gains include potential for array contraction, data reuse and improved local register allocation

Optimal Loop Fusion
- Fuse loops to optimize data reuse, taking into consideration resource constraints and register usage
- This problem is NP-hard

Maximal Loop Fusion
- Our approach is to perform maximal loop fusion
- Fuse as many loops as possible, without considering resource constraints
- Fuse loops as soon as possible, not considering the consequences

Dominators and Post Dominators (Allen & Kennedy, p. 150)
- A node x in a directed graph G with a single entry node dominates node y in G if every path from the entry node of G to y must pass through x
- A node x in a directed graph G with a single exit node post-dominates node y in G if every path from y to the exit node of G must pass through x
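The two definitions can be computed with the standard iterative set-intersection method. The sketch below is an illustration, not the slides' implementation: the CFG is assumed to be a dict from node name to successor list, every node is assumed reachable from the entry, and the function names are hypothetical. Post-dominators are obtained by running the same computation on the reversed graph, rooted at the exit node.

```python
def dominators(succ, entry):
    """Return {node: set of nodes that dominate it}.

    succ maps each node to a list of its successors; every successor must
    also appear as a key.
    """
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            pred[t].add(n)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def post_dominators(succ, exit_node):
    """Post-dominators are the dominators of the reversed CFG."""
    reverse = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            reverse[t].append(n)
    return dominators(reverse, exit_node)


# A small diamond-shaped CFG used only as an example.
example_cfg = {
    "entry": ["A"],
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["exit"],
    "exit": [],
}
```

In the example graph, A dominates D because both paths to D go through A, while neither B nor C does.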

Requirements for Loop Fusion
i. Loops must have identical iteration counts (be conforming)
ii. Loops must be control-flow equivalent
iii. Loops must be adjacent
iv. There cannot be any negative distance dependencies between the loops

Non-conforming Loops
If the iteration counts are different, one loop must be manipulated to make the iteration counts the same:
1. Loop peeling
2. Introduce a guard into one of the loops

Loop Peeling
- Find the difference (n) between the iteration counts of the two loops
- Duplicate the body of the loop with the higher iteration count n times
- Update the iteration count of the peeled loop

Loop Peeling Example
Before:
    while (i < 10) {
        a[i] = a[i - 1] * 2;
        i++;
    }
    while (j < 12) {
        b[j] = b[j - 1] - 2;
        j++;
    }
After peeling two iterations from the second loop:
    while (i < 10) {
        a[i] = a[i - 1] * 2;
        i++;
    }
    while (j < 10) {
        b[j] = b[j - 1] - 2;
        j++;
    }
    b[j] = b[j - 1] - 2;
    j++;
    b[j] = b[j - 1] - 2;
    j++;
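The peeling example can be checked by running both forms on the same data. This is a sketch in Python rather than the slides' C-like notation. The slide does not show initial values, so starting i and j at 1 is an assumption, as are the function names. The third function goes one step beyond the slide and fuses the two now-conforming loops.

```python
def run_original(a, b):
    """The two loops as written: 9 and 11 iterations."""
    a, b = a[:], b[:]
    i = j = 1                       # assumed start values
    while i < 10:
        a[i] = a[i - 1] * 2
        i += 1
    while j < 12:
        b[j] = b[j - 1] - 2
        j += 1
    return a, b


def run_peeled(a, b):
    """Second loop cut back to 9 iterations; the last two are peeled off."""
    a, b = a[:], b[:]
    i = j = 1
    while i < 10:
        a[i] = a[i - 1] * 2
        i += 1
    while j < 10:
        b[j] = b[j - 1] - 2
        j += 1
    b[j] = b[j - 1] - 2             # peeled iteration, j == 10
    j += 1
    b[j] = b[j - 1] - 2             # peeled iteration, j == 11
    j += 1
    return a, b


def run_peeled_and_fused(a, b):
    """Both loops now run 9 times, so they can share one loop."""
    a, b = a[:], b[:]
    i = 1
    while i < 10:
        a[i] = a[i - 1] * 2
        b[i] = b[i - 1] - 2
        i += 1
    for j in (10, 11):              # peeled iterations stay after the loop
        b[j] = b[j - 1] - 2
    return a, b
```

All three functions return the same arrays for any 12-element inputs, which is the property peeling has to preserve.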

Guarding Iterations
- Increase the iteration count of the loop with fewer iterations
- Insert a guard branch around statements that would not normally be executed

Guarding Iterations Example
Before:
    while (i < 10) {
        a[i] = a[i - 1] * 2;
        i++;
    }
    while (j < 12) {
        b[j] = b[j - 1] - 2;
        j++;
    }
After guarding the first loop (the increment stays outside the guard so that the loop still terminates):
    while (i < 12) {
        if (i < 10) {
            a[i] = a[i - 1] * 2;
        }
        i++;
    }
    while (j < 12) {
        b[j] = b[j - 1] - 2;
        j++;
    }
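The guarded form can be checked the same way. This Python sketch assumes i and j start at 1 (the slide does not say) and uses hypothetical function names. Note where the increment sits: if it were inside the guard, the loop would never advance past i == 10.

```python
def run_original(a, b):
    """The two loops as written: 9 and 11 iterations."""
    a, b = a[:], b[:]
    i = j = 1                       # assumed start values
    while i < 10:
        a[i] = a[i - 1] * 2
        i += 1
    while j < 12:
        b[j] = b[j - 1] - 2
        j += 1
    return a, b


def run_guarded(a, b):
    """First loop extended to 11 iterations with a guard on its body."""
    a, b = a[:], b[:]
    i = j = 1
    while i < 12:
        if i < 10:                  # guard: the original loop stopped at i == 10
            a[i] = a[i - 1] * 2
        i += 1                      # increment is outside the guard
    while j < 12:
        b[j] = b[j - 1] - 2
        j += 1
    return a, b


def run_guarded_and_fused(a, b):
    """With equal iteration counts the two bodies can share one loop."""
    a, b = a[:], b[:]
    i = 1
    while i < 12:
        if i < 10:
            a[i] = a[i - 1] * 2
        b[i] = b[i - 1] - 2
        i += 1
    return a, b
```

Unlike peeling, no code is left outside the loop, at the cost of a branch in every iteration.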

Loop Peeling
- Advantage: does not generate control flow within a loop body
- Disadvantage: generates additional code outside of loops, which could possibly intervene with other loops

Guarding Iterations
- Advantages: does not introduce intervening code; can be "undone" later
- Disadvantage: generates control flow within a loop

Control Flow Equivalence
- Two loops are control-flow equivalent if, whenever one executes, the other also executes
[Figure: two control-flow graphs, one containing Loop 1, BB and Loop 2, the other containing Loop 1, Loop 3, BB and Loop 2]

Determining Control Flow Equivalence
Use the concepts of dominators and post dominators. Two loops L1 and L2 are control-flow equivalent if the following two conditions are true:
- L1 dominates L2; and
- L2 post dominates L1.
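Once dominator and post-dominator sets are available, the test is two set lookups. The sketch below assumes the sets are already computed and stored as dicts. The small CFG (entry, L1, then either L3 or BB, then L2, exit) is a made-up example, not a reconstruction of the slide's figure.

```python
def control_flow_equivalent(l1, l2, dom, pdom):
    """dom[n]: nodes that dominate n; pdom[n]: nodes that post-dominate n."""
    return l1 in dom[l2] and l2 in pdom[l1]


# Hypothetical CFG: entry -> L1 -> (L3 | BB) -> L2 -> exit
dom = {
    "L1": {"entry", "L1"},
    "L3": {"entry", "L1", "L3"},
    "BB": {"entry", "L1", "BB"},
    "L2": {"entry", "L1", "L2"},
}
pdom = {
    "L1": {"L1", "L2", "exit"},
    "L3": {"L3", "L2", "exit"},
    "BB": {"BB", "L2", "exit"},
    "L2": {"L2", "exit"},
}
```

Here L1 and L2 are control-flow equivalent, but L1 and L3 are not: L1 dominates L3, yet L3 does not post-dominate L1 because the path through BB bypasses it.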

Intervening Code
- Two loops are adjacent if there are no statements between the two loops
- This can be determined using the CFG: if the immediate successor of the first loop is the second loop, the two loops are adjacent
- If two loops are not adjacent, there is intervening code between them

Dealing with Non-Adjacent Loops
- If two loops are not adjacent, we attempt to make them adjacent by moving the intervening code
- Intervening code can be moved, as long as no data dependencies are violated:
  - Above the first loop
  - Below the second loop
  - Both

Intervening Code Example
- Assume the CFG has 20 nodes
- Nodes 0-5 are above Loop 1; a second range of nodes is below Loop 2
- What algorithm should be used to determine which nodes are between Loop 1 and Loop 2?
[Figure: CFG containing Loop 1 and Loop 2]

Gathering Intervening Code
Given two loops L1 and L2, a basic block B is intervening code between L1 and L2 if and only if:
- B is strictly dominated by L1
- B is not dominated by L2
Once the dominance relations are known, the set subtraction can be efficiently computed using bit vectors
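The set subtraction can be sketched with Python integers used as bit vectors, where bit i stands for basic block i. The node numbering in the example is an assumption made for illustration: a straight-line CFG of 20 blocks with Loop 1 at node 6 and Loop 2 at node 14. The helper names are hypothetical as well.

```python
def to_mask(nodes):
    """Pack an iterable of block numbers into an integer bit vector."""
    mask = 0
    for n in nodes:
        mask |= 1 << n
    return mask


def to_nodes(mask):
    """Unpack a bit vector into a sorted list of block numbers."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def intervening_code(dominated_by_l1, dominated_by_l2, l1):
    """Blocks strictly dominated by L1 that are not dominated by L2."""
    strictly_dominated = dominated_by_l1 & ~(1 << l1)   # drop L1 itself
    return strictly_dominated & ~dominated_by_l2        # set subtraction


# Assumed layout: blocks 0-5, Loop 1 = 6, blocks 7-13, Loop 2 = 14, blocks 15-19.
dominated_by_loop1 = to_mask(range(6, 20))
dominated_by_loop2 = to_mask(range(14, 20))
between = to_nodes(intervening_code(dominated_by_loop1, dominated_by_loop2, 6))
```

Because a loop dominates itself, Loop 2's own node is removed by the subtraction and only the blocks between the two loops remain.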

Intervening Code Example
[Figure: node sets for Loop 1 and Loop 2 and their difference]

Analyze Intervening Code
- Build a DDG of the intervening code
- Put all nodes with no predecessors into a queue
- For each node in the queue:
  - If there are no dependencies between the node and the loop:
    - Mark the node as moveable
    - Add all of the node's immediate successors to the queue
- All marked nodes can be moved around the loop
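A possible rendering of this worklist in Python follows. Several things are assumptions of the sketch rather than statements from the slides: the DDG is a dict from statement to the statements that depend on it, dependence on the loop is approximated with read and write sets, and a node is marked only once all of its DDG predecessors are marked. The statement labels and variable sets are modeled on the deck's non-adjacent loops example.

```python
from collections import deque


def moveable_nodes(ddg, reads, writes, loop_reads, loop_writes):
    """Return the intervening statements that can be moved around a loop."""
    preds = {n: [] for n in ddg}
    for n, succs in ddg.items():
        for s in succs:
            preds[s].append(n)

    def depends_on_loop(n):
        # Flow, anti or output dependence between statement n and the loop.
        return bool(writes[n] & (loop_reads | loop_writes)
                    or reads[n] & loop_writes)

    queue = deque(n for n in ddg if not preds[n])
    moveable = []
    while queue:
        n = queue.popleft()
        if n in moveable or depends_on_loop(n):
            continue
        if any(p not in moveable for p in preds[n]):
            continue                 # extra safety check added in this sketch
        moveable.append(n)
        queue.extend(ddg[n])
    return moveable


# Intervening code: b := a*2; c := b+6; g := 0; h := g+10; if (c < 100) ...
ddg = {"b": ["c"], "c": ["if"], "g": ["h"], "h": [], "if": []}
reads = {"b": {"a"}, "c": {"b"}, "g": set(), "h": {"g"}, "if": {"c"}}
writes = {"b": {"b"}, "c": {"c"}, "g": {"g"}, "h": {"h"}, "if": {"d", "e"}}

around_loop1 = moveable_nodes(ddg, reads, writes, {"a", "i", "N"}, {"a", "i"})
around_loop2 = moveable_nodes(ddg, reads, writes, {"g", "j", "N"}, {"f", "j"})
```

With these inputs the statements on g and h are moveable with respect to Loop 1, and the statements on b, c and the if are moveable with respect to Loop 2.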

Non-Adjacent Loops Example
    while (i < N) {
        a += i;
        i++;
    }
    b := a * 2;
    c := b + 6;
    g := 0;
    h := g + 10;
    if (c < 100) d := c/2;
    else e := c * 2;
    while (j < N) {
        f := g + 6;
        j++;
    }
Intervening code (DDG nodes):
    b := a * 2;
    c := b + 6;
    g := 0;
    if (c < 100) d := c/2; else e := c * 2;
    h := g + 10;

Non-Adjacent Loops Example
Before:
    while (i < N) {
        a += i;
        i++;
    }
    b := a * 2;
    c := b + 6;
    g := 0;
    h := g + 10;
    if (c < 100) d := c/2;
    else e := c * 2;
    while (j < N) {
        f := g + 6;
        j++;
    }
After moving the intervening code:
    g := 0;
    h := g + 10;
    while (i < N) {
        a += i;
        i++;
    }
    while (j < N) {
        f := g + 6;
        j++;
    }
    b := a * 2;
    c := b + 6;
    if (c < 100) d := c/2;
    else e := c * 2;

Non-Adjacent Loops Example
DDG:
    b := a * 2;
    c := b + 6;
    g := 0;
    if (c < 100) d := c/2; else e := c * 2;
    h := g + 10;
Node Queue:
    b := a * 2;
    g := 0;
    c := b + 6;
    if (c < 100) d := c/2; else e := c * 2;
Moveable Nodes:
    b := a * 2;
    c := b + 6;
    if (c < 100) d := c/2; else e := c * 2;
Loop 2:
    while (j < N) {
        f := g + 6;
        j++;
    }

Non-Adjacent Loops Example
DDG:
    b := a * 2;
    c := b + 6;
    g := 0;
    if (c < 100) d := c/2; else e := c * 2;
    h := g + 10;
Node Queue:
    b := a * 2;
    g := 0;
    h := g + 10;
Moveable Nodes:
    g := 0;
    h := g + 10;
Loop 1:
    while (i < N) {
        a += i;
        i++;
    }

Dependencies Preventing Fusion
Can the following loops be fused?
    i = j = 1;
    while (i < 10) {
        a[i] = c[i] + 10;
        i++;
    }
    while (j < 10) {
        b[j] = a[j+1] * 2;
        j++;
    }

Dependencies Preventing Fusion
If we look at the array access patterns of a[], we see the following:
    a[i] = c[i] + 10;
    b[j] = a[j+1] * 2;

Dependencies Preventing Fusion
By aligning the array access patterns, we get the following:
    a[i] = c[i] + 10;
    b[j] = a[j+1] * 2;

Loop Alignment
Before:
    i = j = 1;
    while (i < 10) {
        a[i] = c[i] + 10;
        i++;
    }
    while (j < 10) {
        b[j] = a[j+1] * 2;
        j++;
    }
After aligning (the first iteration of the first loop is peeled):
    j = 1;
    i = 2;
    a[1] = c[1] + 10;
    while (i < 10) {
        a[i] = c[i] + 10;
        i++;
    }
    while (j < 10) {
        b[j] = a[j+1] * 2;
        j++;
    }
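The effect of alignment can be seen by comparing three variants in Python. The naive fusion reads a[i+1] one iteration before it is written and so changes the result. After alignment, the k-th iteration of each loop touches the same element of a. The fused form and the final leftover statement are extrapolations of this sketch, not something shown on the slide, and the function names are hypothetical. Arrays are assumed to have at least 11 elements.

```python
def run_original(c, a, b):
    """The two loops from the slide, executed one after the other."""
    a, b = a[:], b[:]
    for i in range(1, 10):
        a[i] = c[i] + 10
    for j in range(1, 10):
        b[j] = a[j + 1] * 2
    return a, b


def run_naive_fusion(c, a, b):
    """Fusing without alignment: violates the dependence on a[]."""
    a, b = a[:], b[:]
    for i in range(1, 10):
        a[i] = c[i] + 10
        b[i] = a[i + 1] * 2         # reads a[i+1] before the loop has written it
    return a, b


def run_aligned_and_fused(c, a, b):
    """Peel the first write so that producer and consumer line up."""
    a, b = a[:], b[:]
    a[1] = c[1] + 10                # peeled first iteration of loop 1
    for k in range(1, 9):           # loop 1 at i = k + 1, loop 2 at j = k
        a[k + 1] = c[k + 1] + 10
        b[k] = a[k + 1] * 2
    b[9] = a[10] * 2                # leftover last iteration of loop 2
    return a, b
```

The leftover statement is needed because alignment leaves the first loop with one iteration fewer than the second, so the pair is no longer conforming until one iteration is peeled.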

Loop Alignment
- Loop alignment can be used to remove dependencies between loop bodies
- Easy to do when all dependencies have the same distance
- Gets tricky when there are multiple dependencies with different distances

Putting It All Together
- We've seen ways to deal with each of the preconditions of loop fusion
- If the conditions are not met, we apply transformations to try to modify the code
- If the transformations are successful, loop fusion can occur
- But in what order should these transformations be applied?

Loop Fusion Algorithm
    For each N_i from outermost to innermost:
        Gather control equivalent loops in N_i into LoopSets
        For each set S_i in LoopSets:
            Remove non-eligible loops from S_i
            FusedLoops = true
            Direction = forward
            while FusedLoops == true:
                if |S_i| < 2: break
                Compute Dominance Relation
                FusedLoops = LoopFusionPass(S_i, Direction)
                Reverse Direction

Loop Fusion Algorithm
    LoopFusionPass(S, Direction)
        FusedLoops = false
        For each pair of loops L_j and L_k in S such that L_j dominates L_k in Direction:
            if (DependenceDistance(L_j, L_k) < 0) continue
            if (InterveningCode(L_j, L_k) == true and
                IsInterveningCodeMoveable(L_j, L_k) == false) continue
            d = | IterationCount(L_j) - IterationCount(L_k) |
            if (L_j and L_k are non-conforming and
                (d cannot be determined at compile time or d > MAXPEEL)) continue
            if (L_j and L_k are non-conforming) Peel iterations
            MoveInterveningCode(L_j, L_k)
            if InterveningCode(L_j, L_k) == false:
                FuseLoops(L_j, L_k)
                FusedLoops = true
        Return FusedLoops
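The chain of early exits in LoopFusionPass can be sketched as a small decision function. It models only the checks, not the transformations. The `LoopPair` record, its field names, the returned strings and the value of `MAX_PEEL` are all hypothetical: the slides name MAXPEEL without giving a value. The sketch also omits the final re-check that no intervening code remains after the move.

```python
from dataclasses import dataclass
from typing import Optional

MAX_PEEL = 4  # hypothetical limit; the slides do not give a value for MAXPEEL


@dataclass
class LoopPair:
    dependence_distance: int
    has_intervening_code: bool
    intervening_code_moveable: bool
    iteration_count_difference: Optional[int]  # None: unknown at compile time


def fusion_decision(pair: LoopPair, max_peel: int = MAX_PEEL) -> str:
    """Mirror the order of the checks in LoopFusionPass for one loop pair."""
    if pair.dependence_distance < 0:
        return "skip: negative dependence distance"
    if pair.has_intervening_code and not pair.intervening_code_moveable:
        return "skip: intervening code cannot be moved"
    d = pair.iteration_count_difference
    if d is None:
        return "skip: iteration count difference unknown at compile time"
    if d > max_peel:
        return "skip: too many iterations to peel"
    if d > 0:
        return f"peel {d}, move intervening code, fuse"
    return "move intervening code, fuse"
```

For instance, a pair whose counts differ by one and that has no intervening code yields "peel 1, move intervening code, fuse", while a pair with bounds n-1 and m is skipped because the difference is unknown.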

Example
    L1: do i1 = 1, n
          a(i1) = a(i1) * k1
        end do
    L2: do i2 = 1, n-1
          d(i2) = a(i2) - b(i2+1) * k2
        end do
    S1: ds = 0.0
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L4: do i4 = 1, n-2
          b(i4) = a(i4) + b(i4) / c(i4)
        end do
Loop Set: L1, L2, L3, L4

Peeling Loop 1
Starting from the original program, the first iteration of L1 is peeled into S7, giving:
    S7: a(1) = a(1) * k1
    L1: do i1 = 1, n-1
          a(i1+1) = a(i1+1) * k1
        end do
    L2: do i2 = 1, n-1
          d(i2) = a(i2) - b(i2+1) * k2
        end do
    S1: ds = 0.0
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L4: do i4 = 1, n-2
          b(i4) = a(i4) + b(i4) / c(i4)
        end do

Fuse L1 and L2
L1 and L2 now both run from 1 to n-1 and are fused into L5:
    S7: a(1) = a(1) * k1
    L5: do i5 = 1, n-1
          a(i5+1) = a(i5+1) * k1
          d(i5) = a(i5) - b(i5+1) * k2
        end do
    S1: ds = 0.0
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L4: do i4 = 1, n-2
          b(i4) = a(i4) + b(i4) / c(i4)
        end do

Compare L5 and L3
- We now compare loops L5 and L3
- They are not adjacent, but the intervening code can move
- The difference in iteration count (n-1 versus m) is not known, so fusion fails
The program is unchanged from the previous slide.

Compare L5 and L4
The program is unchanged from the previous slide.
Intervening code:
    S1: ds = 0.0
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m

Peel L5
The first iteration of L5 is peeled into S8 and S9 so that L5 runs from 1 to n-2, like L4:
    S7: a(1) = a(1) * k1
    S8: a(2) = a(2) * k1
    S9: d(1) = a(1) - b(2) * k2
    L5: do i5 = 1, n-2
          a(i5+2) = a(i5+2) * k1
          d(i5+1) = a(i5+1) - b(i5+2) * k2
        end do
    S1: ds = 0.0
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L4: do i4 = 1, n-2
          b(i4) = a(i4) + b(i4) / c(i4)
        end do

Move Intervening Code
S1 and S2-S5 move above L5; L3 stays between L5 and L4:
    S7: a(1) = a(1) * k1
    S8: a(2) = a(2) * k1
    S9: d(1) = a(1) - b(2) * k2
    S1: ds = 0.0
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L5: do i5 = 1, n-2
          a(i5+2) = a(i5+2) * k1
          d(i5+1) = a(i5+1) - b(i5+2) * k2
        end do
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
    L4: do i4 = 1, n-2
          b(i4) = a(i4) + b(i4) / c(i4)
        end do

Reverse Pass
The program is unchanged from the previous slide.
Loop Set: L5, L3, L4
Sorted in reverse dominance direction: L4, L3, L5

Compare L4 and L3
- No dependencies to prevent fusion
- The iteration count difference cannot be determined at compile time
- Fusion fails
The program is unchanged from the previous slide.

Compare L4 and L5
The program is unchanged from the previous slide.
Intervening code:
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do

Move Intervening Code
L3 moves below L4, which makes L5 and L4 adjacent:
    S7: a(1) = a(1) * k1
    S8: a(2) = a(2) * k1
    S9: d(1) = a(1) - b(2) * k2
    S1: ds = 0.0
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L5: do i5 = 1, n-2
          a(i5+2) = a(i5+2) * k1
          d(i5+1) = a(i5+1) - b(i5+2) * k2
        end do
    L4: do i4 = 1, n-2
          b(i4) = a(i4) + b(i4) / c(i4)
        end do
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do

Fuse L4 and L5
L5 and L4 are fused into L6:
    S7: a(1) = a(1) * k1
    S8: a(2) = a(2) * k1
    S9: d(1) = a(1) - b(2) * k2
    S1: ds = 0.0
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L6: do i6 = 1, n-2
          a(i6+2) = a(i6+2) * k1
          d(i6+1) = a(i6+1) - b(i6+2) * k2
          b(i6) = a(i6) + b(i6) / c(i6)
        end do
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
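One way to gain confidence in the whole sequence of transformations is to run the original program and the final fused program on the same data and compare the results. The sketch below transliterates both into Python under several assumptions that are not stated on the slides: arrays are 1-based with index 0 unused, n is at least 3, the input arrays hold at least n + 1 entries, d is sized to cover both n and m, and the entries of c used as divisors are non-zero. The function names are hypothetical.

```python
def original_program(a, b, c, n, m, k1, k2):
    """The example as first shown: L1, L2, S1, L3, S2-S5, L4."""
    a, b, c = a[:], b[:], c[:]
    d = [0.0] * (max(n, m) + 1)
    for i in range(1, n + 1):            # L1: i = 1 .. n
        a[i] = a[i] * k1
    for i in range(1, n):                # L2: i = 1 .. n-1
        d[i] = a[i] - b[i + 1] * k2
    ds = 0.0                             # S1
    for i in range(1, m + 1):            # L3: i = 1 .. m
        ds = ds + d[i]
    c[n - 2] = n if n < m else m         # S2-S5
    for i in range(1, n - 1):            # L4: i = 1 .. n-2
        b[i] = a[i] + b[i] / c[i]
    return a, b, c, d, ds


def fused_program(a, b, c, n, m, k1, k2):
    """The final form: peeled statements, fused loop L6, then L3."""
    a, b, c = a[:], b[:], c[:]
    d = [0.0] * (max(n, m) + 1)
    a[1] = a[1] * k1                     # S7
    a[2] = a[2] * k1                     # S8
    d[1] = a[1] - b[2] * k2              # S9
    ds = 0.0                             # S1
    c[n - 2] = n if n < m else m         # S2-S5
    for i in range(1, n - 1):            # L6: i = 1 .. n-2
        a[i + 2] = a[i + 2] * k1
        d[i + 1] = a[i + 1] - b[i + 2] * k2
        b[i] = a[i] + b[i] / c[i]
    for i in range(1, m + 1):            # L3: i = 1 .. m
        ds = ds + d[i]
    return a, b, c, d, ds
```

Each array element is computed from the same operands in the same order in both versions, so the results should match exactly even in floating point.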