Download presentation
Presentation is loading. Please wait.
Published bySurya Susanto Modified over 6 years ago
1
What is the WHT anyway, and why are there so many ways to compute it?
Jeremy Johnson 1, 2, 6, 24, 112, 568, 3032, 16768,…
2
Walsh-Hadamard Transform
y = WHTN x, N = 2n
3
WHT Algorithms Factor WHTN into a product of sparse structured matrices Compute: y = (M1 M2 … Mt)x yt = Mtx … y2 = M2 y3 y = M1 y2
4
Factoring the WHT Matrix
AC Ä BD = (A Ä B)(C Ä D) A Ä B = (A Ä I)(I Ä B) A Ä (B Ä C) = (A Ä B) Ä C Im Ä In = Imn WHT2 Ä WHT2 = (WHT2 Ä I2)(I2 Ä WHT2)
5
Recursive and Iterative Factorization
WHT8 = (WHT2 Ä I4)(I2 Ä WHT4) = (WHT2 Ä I4)(I2 Ä ((WHT2 Ä I2) (I2 Ä WHT2))) = (WHT2 Ä I4)(I2 Ä (WHT2 Ä I2)) (I2 Ä (I2 Ä WHT2)) = (WHT2 Ä I4)(I2 Ä (WHT2 Ä I2)) ((I2 Ä I2) Ä WHT2) = (WHT2 Ä I4)(I2 Ä WHT2 Ä I2) ((I2 Ä I2) Ä WHT2) = (WHT2 Ä I4)(I2 Ä WHT2 Ä I2) (I4 Ä WHT2)
6
Recursive Algorithm WHT8 = (WHT2 I4)(I2 (WHT2 I2)(I2 WHT2))
7
Iterative Algorithm WHT8 = (WHT2 I4)(I2 WHT2 I2)(I4 WHT2) ú û
ù ê ë é - = 1 WHT8 = (WHT2 I4)(I2 WHT2 I2)(I4 WHT2)
8
WHT Algorithms Recursive Iterative General
9
WHT Implementation Õ WHT ( I Ä WHT Ä I ) R=N; S=1; Definition/formula
N=N1* N2Nt Ni=2ni x=WHTN*x x =(x(b),x(b+s),…x(b+(M-1)s)) Implementation(nested loop) R=N; S=1; for i=t,…,1 R=R/Ni for j=0,…,R-1 for k=0,…,S-1 S=S* Ni; M b,s t Õ WHT = ( I Ä WHT Ä I ) n i +1 + ··· + t 2 n 2 n 1 + ··· + i - 2 n i 2 i = 1
10
Partition Trees 9 3 4 2 1 Left Recursive Right Recursive 4 1 3 2 4 1 3
Balanced 4 2 1 Iterative 4 1
11
Ordered Partitions There is a 1-1 mapping from ordered partitions of n onto (n-1)-bit binary numbers. There are 2n-1 ordered partitions of n. 162 = 1|1 1| |1 1 = 9
12
Enumerating Partition Trees
00 01 01 3 3 3 2 1 2 1 1 1 10 10 11 3 3 1 2 3 1 2 1 1 1
13
Search Space Optimization of the WHT becomes a search, over the space of partition trees, for the fastest algorithm. The number of trees:
14
Size of Search Space Let T(z) be the generating function for Tn
Tn = (n/n3/2), where =4+8 Restricting to binary trees Tn = (5n/n3/2)
15
WHT Package Püschel & Johnson (ICASSP ’00)
Allows easy implementation of any of the possible WHT algorithms Partition tree representation W(n)=small[n] | split[W(n1),…W(nt)] Tools Measure runtime of any algorithm Measure hardware events Search for good implementation Dynamic programming Evolutionary algorithm
16
Histogram (n = 16, 10,000 samples)
Wide range in performance despite equal number of arithmetic operations (n2n flops) Pentium III consumes more run time (more pipeline stages) Ultra SPARC II spans a larger range
17
Operation Count Theorem. Let WN be a WHT algorithm of size N. Then the number of floating point operations (flops) used by WN is Nlg(N). Proof. By induction.
18
Instruction Count Model
A(n) = number of calls to WHT procedure = number of instructions outside loops Al(n) = Number of calls to base case of size l l = number of instructions in base case of size l Li = number of iterations of outer (i=1), middle (i=2), and outer (i=3) loop i = number of instructions in outer (i=1), middle (i=2), and outer (i=3) loop body
19
Small[1] .file "s_1.c" .version "01.01" gcc2_compiled.: .text .align 4
.globl apply_small1 .type apply_small1: movl 8(%esp),%edx //load stride S to EDX movl 12(%esp),%eax //load x array's base address to EAX fldl (%eax) // st(0)=R7=x[0] fldl (%eax,%edx,8) //st(0)=R6=x[S] fld %st(1) //st(0)=R5=x[0] fadd %st(1),%st // R5=x[0]+x[S] fxch %st(2) //st(0)=R5=x[0],s(2)=R7=x[0]+x[S] fsubp %st,%st(1) //st(0)=R6=x[S]-x[0] ????? fxch %st(1) //st(0)=R6=x[0]+x[S],st(1)=R7=x[S]-x[0] fstpl (%eax) //store x[0]=x[0]+x[S] fstpl (%eax,%edx,8) //store x[0]=x[0]-x[S] ret
20
Recurrences
21
Recurrences
22
Histogram using Instruction Model (P3)
l = 12, l = 34, and l = 106 = 27 1 = 18, 2 = 18, and 1 = 20
23
Algorithm Comparison
24
Dynamic Programming ), Cost( min n n1+… + nt= n … T T
where Tn is the optimial tree of size n. This depends on the assumption that Cost only depends on the size of a tree and not where it is located. (true for IC, but false for runtime). For IC, the optimal tree is iterative with appropriate leaves. For runtime, DP is a good heuristic (used with binary trees).
25
Optimal Formulas UltraSPARC Pentium [1], [2], [3], [4], [5], [6]
[7] [[4],[4]] [[5],[4]] [[5],[5]] [[5],[6]] [[2],[[5],[5]]] [[2],[[5],[6]]] [[2],[[2],[[5],[5]]]] [[2],[[2],[[5],[6]]]] [[2],[[2],[[2],[[5],[5]]]]] [[2],[[2],[[2],[[5],[6]]]]] [[2],[[2],[[2],[[2],[[5],[5]]]]]] [[2],[[2],[[2],[[2],[[5],[6]]]]]] [[2],[[2],[[2],[[2],[[2],[[5],[5]]]]]]] UltraSPARC [1], [2], [3], [4], [5], [6] [[3],[4]] [[4],[4]] [[4],[5]] [[5],[5]] [[5],[6]] [[4],[[4],[4]]] [[[4],[5]],[4]] [[4],[[5],[5]]] [[[5],[5]],[5]] [[[5],[5]],[6]] [[4],[[[4],[5]],[4]]] [[4],[[4],[[5],[5]]]] [[4],[[[5],[5]],[5]]] [[5],[[[5],[5]],[5]]]
26
Different Strides Dynamic programming assumption is not true. Execution time depends on stride.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.