Compiler Speculative Optimizations Wei Hsu 7/05/2006

Speculative Execution
- Speculative execution is the early execution of code whose result may not be needed (the work may be wasted).
- In a pipelined processor, speculative execution is often used to reduce the cost of branch mispredictions.
- Some processors automatically prefetch the next instruction and/or data cache lines into the on-chip caches. Prefetching has also been used for disk reads.
- More aggressive speculative execution is used in "run-ahead" or "execute-ahead" processors to warm up the caches.
- Value prediction/speculation is another example.

Compiler Controlled Speculation
- Speculation is one of the most important methods for finding and exploiting ILP.
- It allows the execution to exploit statistical ILP (e.g., a branch is taken 90% of the time, or the address of pointer *p differs from the address of pointer *q most of the time).
- It overcomes the two most common constraints on instruction scheduling (and other optimizations):
  - Control dependence
  - Memory dependence

Compiler Controlled Speculation (cont.)
- Allows the compiler to issue an operation early, before a dependence is resolved.
  - Removes the latency of the operation from the critical path.
  - Helps hide long-latency memory operations.
- Control speculation: the execution of an operation before the branch that guards it.
- Data speculation: the execution of a memory load before a preceding store that may alias with it.
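
To make the two constraints concrete, here is a minimal C sketch (ours, not from the slides; the function names are illustrative only):

    /* Control dependence: the load of *p is guarded by the branch and
       cannot be hoisted above it without control speculation. */
    int guarded_load(int cond, int *p) {
        int a = 0;
        if (cond)
            a = *p;
        return a;
    }

    /* Memory dependence: p and q may alias, so the load of *q cannot be
       moved above the store without data speculation. */
    int ordered_access(int *p, int *q, int a) {
        *p = a;
        return *q + 1;
    }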

Control Speculation: Example

Source:

    if (cond) { A = p[i]->b; }

Compiled code, block before the branch:

    lw  $r6, ...
    sub $r3, $r6, ...
    bne ...

Guarded block:

    lw  $r1, 0($r2)
    add $r3, $r1, $r4
    lw  $r5, 4($r3)
    sw  $r5, 4($sp)
    ...

In the guarded block there is no room to schedule the load! Why not move the load instruction into the previous block?

Control Speculation: Example (cont.)

The load is hoisted into the previous block:

    lw  $r1, 0($r2)      <- speculative load, moved up
    lw  $r6, ...
    sub $r3, $r6, ...
    bne ...

    add $r3, $r1, $r4
    lw  $r5, 4($r3)
    sw  $r5, 4($sp)
    ...

Two questions:
1) Is cond most likely to be true? Profile feedback may guide the optimization.
2) What if the address in p is bad and causes a memory fault? Can we have a special load instruction that ignores memory faults?

Control Speculation: Example (cont.)

    lw  $r1, 0($r2)      <- Fault!! Core dump
    lw  $r6, ...
    sub $r3, $r6, ...
    bne ...

    add $r3, $r1, $r4
    lw  $r5, 4($r3)
    sw  $r5, 4($sp)
    ...

What if the address in p is bad and causes a memory fault? Can we have a special load instruction that ignores memory faults?

Control Speculation: Example (cont.)

    lw.s $r1, 0($r2)     <- special instruction, so it never faults!
    lw   $r6, ...
    sub  $r3, $r6, ...
    bne  ...

    add $r3, $r1, $r4
    lw  $r5, 4($r3)
    sw  $r5, 4($sp)
    ...

Make this a special instruction so it never faults. For example, SPARC supports non-faulting load instructions that ignore segmentation faults.

Architecture Support in SPARC V9
- SPARC V9 provides non-faulting loads (similar to the silent loads used in Multiflow's Trace and Cydrome's Cydra-5 computers).
- Non-faulting loads execute like any other loads, except that segmentation-fault conditions do not cause program termination.
- To minimize page faults when a speculative load references a NULL pointer (address zero), it is desirable to map low addresses (especially address zero) to a page with a special attribute.
- Non-faulting loads are often used in data prefetching, but not for general code motion.
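
Where the hardware lacks non-faulting loads, their effect can be roughly approximated in software. Below is a POSIX C sketch (our illustration, not how SPARC or any production compiler implements it; longjmp-ing out of a SIGSEGV handler is a demo technique, acceptable here only to show the semantics):

    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>

    static sigjmp_buf fault_env;

    static void on_segv(int sig) {
        (void)sig;
        siglongjmp(fault_env, 1);        /* escape from the faulting load */
    }

    /* Approximates a non-faulting load: returns *p, or 0 if the access faults. */
    static long nf_load(const long *p) {
        struct sigaction sa, old;
        sa.sa_handler = on_segv;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGSEGV, &sa, &old);
        long v = 0;
        if (sigsetjmp(fault_env, 1) == 0)
            v = *p;                      /* may fault; the handler silences it */
        sigaction(SIGSEGV, &old, NULL);
        return v;
    }

    int main(void) {
        long x = 42;
        printf("%ld\n", nf_load(&x));    /* prints 42 */
        printf("%ld\n", nf_load(NULL));  /* prints 0 instead of dumping core */
        return 0;
    }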

Using non-faulting loads for prefetching

Source:

    while (j < k) {
        i = Index[j][1];
        x = array[i];
        y = x + ...
        j += m;
    }

Compiled loop (may incur cache misses on each iteration):

    while (j < k) {
        load $r1, index[j][1]
        load $r2, array($r1)
        add  $r3, $r2, $r4
        ...
    }

With prefetching, the index of the next iteration is loaded early:

    while (j < k) {
        load $r1, index[j][1]
        load $r5, index[j+m][1]     <- load $r5 may fault!!
        load $r2, array($r1)
        prefetch array($r5)
        add  $r3, $r2, $r4
        ...
    }

Using non-faulting loads for prefetching (cont.)

The same loop, with the early index load turned into a non-faulting load:

    while (j < k) {
        load  $r1, index[j][1]
        nf-ld $r5, index[j+m][1]    <- non-faulting: a bad address is now harmless
        load  $r2, array($r1)
        prefetch array($r5)
        add   $r3, $r2, $r4
        ...
    }
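
At the source level, GCC and Clang expose prefetching through __builtin_prefetch, which never faults. Where the ISA has no non-faulting load, the early index read itself still needs a software guard. A hedged sketch of the slide's loop (our simplification; Index is flattened to one dimension and the output computation is made up):

    /* The bounds check stands in for nf-ld: it keeps the speculative
       read of the next index safe. */
    void scan(const int *Index, const double *array, double *out,
              int j, int k, int m) {
        while (j < k) {
            int i = Index[j];
            if (j + m < k)                                   /* guard the early read */
                __builtin_prefetch(&array[Index[j + m]]);    /* prefetch never faults */
            out[j] = array[i] + 1.0;
            j += m;
        }
    }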

Non-faulting Loads Are Insufficient

    lw.s $r1, 0($r2)
    lw   $r6, ...
    sub  $r3, $r6, ...
    bne  ...

    add $r3, $r1, $r4
    lw  $r5, 4($r3)
    sw  $r5, 4($sp)
    ...

But what if the original (non-speculative) load of p would have caused a memory fault? If cond is true, the program should fault; we cannot just ignore it!

    lw.s $r1, 0($r2)
    lw   $r6, ...
    sub  $r3, $r6, ...
    bne  ...

    check.s $r1
    add $r3, $r1, $r4
    lw  $r5, 4($r3)
    sw  $r5, 4($sp)
    ...

Let's remember the fault status and check it when the loaded data is actually used: the speculative lw.s defers any fault, and check.s raises it only if control actually reaches the use.

Adding recovery code:

    lw.s $r1, 0($r2)
    lw   $r6, ...
    sub  $r3, $r6, ...
    bne  ...

    check.s $r1, recovery
    add $r3, $r1, $r4
    lw  $r5, 4($r3)
    sw  $r5, 4($sp)
    ...

Recovery code:

    recovery:
        lw $r1, 0($r2)       <- redo the load non-speculatively

When dependent instructions move up with the speculative load:

    lw.s $r1, 0($r2)
    lw   $r6, ...
    sub  $r3, $r6, ...
    add  $r3, $r1, $r4       <- data dependent, moved up with the load
    bne  ...

    check.s $r3, recovery
    lw  $r5, 4($r3)
    sw  $r5, 4($sp)
    ...

Recovery code:

    recovery:
        lw  $r1, 0($r2)
        add $r3, $r1, $r4

All instructions that are data dependent on the speculative load and moved with it must go into the recovery block.

Architecture Support in IA-64: Control Speculation

original:

    (p1) br.cond ...
         ld8 r1 = [r2]       // control dependence on the branch

transformed:

         ld8.s r1 = [r2]
         ...
    (p1) br.cond ...
         ...
         chk.s r1, recovery
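
A toy C model (ours; a simplification of IA-64 semantics) of deferred exceptions: ld8.s returns a value plus a NaT ("not a thing") bit instead of faulting, and chk.s tests the bit and performs recovery. Pointer validity is modeled by a trivial NULL check, and the recovery is inlined as a reload:

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { long val; bool nat; } SpecReg;

    static bool plausible(const long *p) { return p != NULL; }  /* stand-in check */

    static SpecReg ld_s(const long *p) {            /* speculative load */
        if (!plausible(p))
            return (SpecReg){0, true};              /* defer the fault: set NaT */
        return (SpecReg){*p, false};
    }

    static long chk_s(SpecReg r, const long *p) {   /* check + recovery */
        if (r.nat)
            return *p;     /* recovery: redo the real load (may genuinely fault) */
        return r.val;
    }

    int main(void) {
        long x = 7;
        SpecReg r = ld_s(&x);          /* hoisted above the guarding branch */
        /* ... other work; the branch resolves ... */
        printf("%ld\n", chk_s(r, &x)); /* use site: check fires only here */
        return 0;
    }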

Data Speculation: Example

Source:

    *p = a;
    b = *q + 1;

Compiled code:

    lw   $r3, 4($sp)
    sw   $r3, 0($r1)     <- *p = a
    lw   $r5, 0($r2)     <- load of *q
    addi $r6, $r5, 1
    sw   $r6, 8($sp)

In this block there is no room to schedule the load! How can we move the load instruction ahead of the store? $r1 and $r2 may be different most of the time, but could possibly be the same.

Data Speculation: Example (cont.)

The two loads are hoisted above the store:

    lw   $r3, 4($sp)
    lw   $r5, 0($r2)     <- hoisted above the store
    sw   $r3, 0($r1)
    addi $r6, $r5, 1
    sw   $r6, 8($sp)

Data Speculation: Example (cont.)

Checking the speculation in software:

    lw   $r3, 4($sp)
    lw   $r5, 0($r2)
    sw   $r3, 0($r1)
    if ($r1 == $r2) copy $r5, $r3    <- fix-up when the store aliased the load
    addi $r6, $r5, 1
    sw   $r6, 8($sp)

What if there are m loads moved above n stores? m x n comparisons must be generated! So some hardware/architecture support is needed.
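
The same transformation written out in C (a sketch of the idea, not generated code; the names are ours):

    /* Hoist the load of *q above the store to *p, then patch the value
       if the two addresses actually matched. */
    void update(int *p, int *q, int a, int *b) {
        int t = *q;        /* speculative load, moved above the store */
        *p = a;
        if (p == q)        /* one test per (load, store) pair: m x n in general */
            t = a;         /* fix-up: the store clobbered the loaded location */
        *b = t + 1;
    }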

Architecture Support in IA-64: Data Speculation

original:

    st4 [r3] = r7        // memory dependence on the store
    ld8 r1 = [r2]

transformed:

    ld8.a r1 = [r2]
    ...
    st4 [r3] = r7
    ...
    chk.a r1, recovery

ALAT (Advance Load Address Table)

    ld8.a r1 = [r2]
    ...
    st4 [r3] = r7
    ...
    chk.a r1, recovery

Assume (r2) = 0x00001ab0. The advance load ld8.a allocates an ALAT entry recording the target register and the load address:

    ALAT:  r1 | 0x1ab0

ALAT (Advance Load Address Table) (cont.)

Assume (r3) holds some other address, different from 0x00001ab0. The store finds no match in the ALAT, so the table is unchanged. chk.a then finds the r1 entry in the ALAT, and the check turns into a NOP.

ALAT (Advance Load Address Table) (cont.)

Assume (r3) = 0x00001ab0. There is a match in the ALAT, so the r1 entry is removed. chk.a then finds no entry for r1 in the ALAT; the check fails, and control branches to the recovery routine.
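
The protocol can be mimicked with a small C model (a toy sketch of the mechanism; the table size, indexing, and eviction policy are made up, and a real ALAT also tracks access sizes):

    #include <stdbool.h>
    #include <stdio.h>

    #define ALAT_SIZE 8

    typedef struct { bool valid; int reg; const void *addr; } AlatEntry;
    static AlatEntry alat[ALAT_SIZE];

    static long ld_a(int reg, const long *p) {      /* advance load: record entry */
        alat[reg % ALAT_SIZE] = (AlatEntry){ true, reg, p };
        return *p;
    }

    static void st(long *p, long v) {               /* store: evict matching entries */
        *p = v;
        for (int i = 0; i < ALAT_SIZE; i++)
            if (alat[i].valid && alat[i].addr == p)
                alat[i].valid = false;
    }

    static long chk_a(int reg, long specval, const long *p) {
        AlatEntry *e = &alat[reg % ALAT_SIZE];
        if (e->valid && e->reg == reg)
            return specval;                         /* entry survived: NOP */
        return *p;                                  /* evicted: recovery reload */
    }

    int main(void) {
        long x = 1, y = 2;
        long r1 = ld_a(1, &x);
        st(&y, 9);                           /* different address: entry survives */
        printf("%ld\n", chk_a(1, r1, &x));   /* 1: check is a NOP */
        r1 = ld_a(1, &x);
        st(&x, 5);                           /* same address: entry evicted */
        printf("%ld\n", chk_a(1, r1, &x));   /* 5: recovery reload */
        return 0;
    }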

More Cases for Data Speculation
- Many high-performance architectural features are not effectively exploited by compilers due to imprecise analysis. Examples:
  - Automatic vectorization / parallelization
  - Local memory allocation / assignment
  - Register allocation
  - ...

Examples
- Vectorization:

      for (k = 1; k < n; k++)
          a[k] = a[j] * b[k];

  What if a and b are pointers? What if j == k?
- Register allocation:

      ... = a->b;
      *p  = ...;
      ... = a->b;

  Can we allocate a->b to a register? Could *p modify a->b, or a itself?
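
One way speculation helps here is loop versioning on a cheap run-time test. A sketch (our illustration) for the vectorization example, assuming 0 <= j < n; the a/b aliasing question would need a similar overlap test:

    /* If j == 0, a[j] is never overwritten by the loop, so it can be
       hoisted and the loop vectorized; otherwise fall back to the safe
       scalar loop. */
    void scale(double *a, const double *b, int n, int j) {
        if (j == 0) {
            double aj = a[0];                 /* loop-invariant */
            for (int k = 1; k < n; k++)
                a[k] = aj * b[k];             /* trivially vectorizable */
        } else {
            for (int k = 1; k < n; k++)
                a[k] = a[j] * b[k];           /* a[j] changes once k reaches j */
        }
    }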

- Complete alias and dependence analysis is costly and difficult:
  - it needs inter-procedural analysis
  - it is hard to handle dynamically allocated memory objects
  - runtime disambiguation is expensive
- But... true memory dependences rarely happen!

Static and Dynamic Dependences
- Most ambiguous data dependences identified by the compiler do not occur at runtime.

- Speculation can compensate for imprecise alias information if speculation failure can be efficiently detected and recovered from.
- Can we effectively use hardware support to speculatively promote memory references to registers?
- Can we speculatively vectorize or parallelize loops?

Motivation Example

Original program:

    ... = *p;
    *q  = ...;
    ... = *p;

Traditional compiled code:

    ld r32 = [r31]
    ...
    *q = ...
    ld r32 = [r31]      // *p must be reloaded: *q may alias it
    ... = r32

Another Example

Original program:

    if (p->s1->s1->x1) {
        ...
        *ip = 0;
        p->s1->s1->x1++;
        ...
    }

Traditional compiled code:

    ld8  r14 = [r32]
    adds r14 = 8, r14
    ld8  r14 = [r14]
    ld4  r14 = [r14]
    cmp4 p6, p7 = 0, r14
    (p6) br ...
    st   [r16] = r0        // *ip = 0 may alias the pointer chain
    ld8  r14 = [r32]       // so the whole chain is reloaded
    adds r14 = 8, r14
    ld8  r15 = [r14]
    ld4  r14 = [r15]
    adds r14 = 1, r14
    st4  [r15] = r14

Our Approach at UM
- Use alias profiles or compiler heuristics to obtain approximate alias information.
- Use data speculation to verify such alias information at run time.
- Use the Advance Load Address Table (ALAT) in IA-64 as the necessary support for data speculation.

Background of the ALAT in IA-64 (figure)

Speculative Register Promotion
- Use ld.a for the first load.
- Check the subsequent loads:
  - Scheme 1: use ld.c for subsequent reads of the same reference.
  - Scheme 2: use chk.a for subsequent reads. This allows promotion of multi-level pointer variables. (E.g., if a->b->c is speculatively promoted to a register, but a is aliased and modified, then recovery code that reloads a, a->b, and a->b->c must be executed.)

Examples

a. Read after read:

    Source:
        ... = *p + 1;
        *q = ...;
        ... = *p + 3;

    Compiled:
        ld.a r1 = [p]
        add  r3 = r1, 1
        *q = ...
        ld.c r1 = [p]
        add  r4 = r1, 3

b. Read after write:

    Source:
        *p = ...;
        *q = ...;
        ... = *p + 3;

    Compiled:
        st   [p] = r1
        ld.a r1 = [p]
        *q = ...
        ld.c r1 = [p]
        add  r4 = r1, 3

c. Multiple redundant loads:

    Source:
        ... = *p;
        *q = ...;
        ... = *p;
        *q = ...;
        ... = *p;

    Compiled:
        ld.a r1 = [p]
        *q = ...
        ld.c.nc  r1 = [p]
        *q = ...
        ld.c.clr r1 = [p]
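
At the source level, the effect of scheme 1 can be pictured like this (a C sketch of the idea; the explicit address test stands in for the hardware check performed by ld.c):

    int unpromoted(int *p, int *q) {
        int s = *p + 1;
        *q = 0;               /* may alias *p: the compiler must reload */
        return s + *p + 3;
    }

    int promoted(int *p, int *q) {
        int t = *p;           /* "ld.a": *p promoted to a local */
        int s = t + 1;
        *q = 0;
        if (q == p)           /* stands in for the ALAT check of ld.c */
            t = *p;           /* reload only on an actual alias */
        return s + t + 3;
    }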

Compiler Support for Speculative Register Promotion
- Enhanced SSA form with the notion of data speculation.
- SSA form for indirect memory references:
  - χ operator: MayMod
  - μ operator: MayUse
- Speculative SSA form:
  - χs operator: the variable in the χs is unlikely to be updated by the corresponding definition statement
  - μs operator: the variable in the μs is unlikely to be referenced by the indirect reference

Speculative SSA Form According to Alias Profiling

    *p = ...
        b2 ← χ(b1)
        a2 ← χs(a1)
        v2 ← χ(v1)

        μ(b1)  μs(a1)  μ(v1)
    ... = *p

The two examples assume that the points-to set of p generated by the compiler is {a, b}, while the points-to set of p obtained from alias profiling is {b}. v is the virtual variable for *p; aj stands for version j of variable a.

Overview of Speculative Register Promotion*
- Phi insertion
- Rename
- Down_safety
- Will_be_available
- Finalize
- Code motion

* Based on SSAPRE [Kennedy et al., ACM TOPLAS '99]

Enhanced Rename

(a) Traditional renaming:

    ... = a1
    *p1 = ...
        v2 ← χ(v1)
        a2 ← χ(a1)
        b2 ← χ(b1)
    ... = a2

(b) Speculative renaming:

    ... = a1
    *p1 = ...
        v2 ← χ(v1)
        a2 ← χs(a1)
        b2 ← χ(b1)
    ... = a1        <- the old version survives the speculative χ

The target set of *p generated by the compiler is {a, b}, and v is the virtual variable for *p. The target set of *p generated by alias profiling is {b}.

Example of Speculative Code Motion

(a) Before code motion:

    ... = a1
    *p1 = ...
        v4 ← χs(v3)
        a2 ← χs(a1)
        b4 ← χ(b3)
    ... = a1

(b) Final output:

    t1 = a1 (ld.a)
    ...
    *p1 = ...
        v4 ← χ(v3)
        a2 ← χs(a1)
        b4 ← χ(b3)
    t4 = a1 (ld.c)
    ...

Implementation
- Open Research Compiler v1.1
- Benchmarks: SPEC2000 C programs
- Platform:
  - HP i2000, 733 MHz Itanium processor, 1 GB SDRAM
  - Red Hat Linux 7.1
- Pfmon v1.1

Example from Equake

Call site: smvp(,,,, disp[dispt], disp[disptplus]);

    void smvp(int nodes, double ***A, int *Acol, int *Aindex,
              double **v, double **w)
    {
        ...
        for (i = 0; i < nodes; i++) {
            ...
            while (Anext < Alast) {
                col = Acol[Anext];
                sum0 += A[Anext][0][0] * ...
                sum1 += A[Anext][1][1] * ...
                sum2 += A[Anext][2][2] * ...
                w[col][0] += A[Anext][0][0]*v[i][0] + ...
                w[col][1] += A[Anext][1][1]*v[i][1] + ...
                w[col][2] += A[Anext][2][2]*v[i][2] + ...
                Anext++;
            }
        }
    }

A[][][] and v[][] are not promoted to registers due to possible aliasing with w[][].

Example from Equake (cont.)

(Same code as on the previous slide.) Promoting A[][][] and v[][] to registers using the ALAT improves this procedure by 10%.

Example from Equake (cont.)

(Same code as above.) Using heuristic rules, our compiler can promote both ***A and **v to registers. But using the alias profile, our compiler fails to promote **v, because at the call site v and w are passed with the same array name.
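
To picture the transformation, here is a simplified, self-contained smvp-like kernel (a hypothetical sketch, not Equake's actual code): the repeated A and v reads are kept in locals, which is what the compiler can do speculatively once ALAT checks guard against the rare case that a store through w aliases them.

    /* Promoted version: A[i][*] and v[i] are loaded once per iteration
       instead of being reloaded after every store through w. */
    void smvp_like(int n, double **A, const double *v, double *w,
                   const int *col)
    {
        for (int i = 0; i < n; i++) {
            double a0 = A[i][0], a1 = A[i][1];  /* "promoted" loads */
            double vi = v[i];                   /* "promoted" load  */
            w[col[i]]     += a0 * vi;           /* stores that might alias A or v */
            w[col[i] + 1] += a1 * vi;
        }
    }

Without speculation, a compiler must assume w[col[i]] can overwrite A[i][*] or v[i] and keep the loads after each store inside the loop body.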

Performance Improvement of Speculative Register Promotion

Effectiveness of Speculative Register Promotion

Performance Improvement of Speculative Register Promotion based on Heuristic Rules

Performance Improvement of Speculative Register Promotion on Itanium-2

Advantages of Using Heuristic Rules
- Full coverage.
- Input-insensitive.
- Efficient.
- Scalable.

A Case for Using Profiles

    DO 140 L = L3, L4, 2
        Q(IJ(L))     = Q(IJ(L))     + W1(L)*QMLT(L)
        Q(IJ(L)+1)   = Q(IJ(L)+1)   + W2(L)*QMLT(L)
        ...
        Q(IJ(L+1))   = Q(IJ(L+1))   + W1(L+1)*QMLT(L+1)
        Q(IJ(L+1)+1) = Q(IJ(L+1)+1) + W2(L+1)*QMLT(L+1)
        ...
    140 CONTINUE

Heuristic rules assume Q(IJ(L)) is different from Q(IJ(L+1)), but the two are often identical, since IJ() is typically sorted, e.g., 1,1,2,2,2,5,5,6,6,6,6,9,9,9.