Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008.

1 Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language Eric Battenberg Mark Murphy CS 267, Spring 2008

2 Introduction
World is going Multicore
– Sequential performance has hit the “Three Walls”: Power Wall + Memory Wall + ILP Wall = Brick Wall [Patterson]
Software development is about to get really hard
– For performance, must utilize parallel processors
– Multiple, non-portable granularities: SIMD vs Multi-Core vs Multi-Socket vs Local-Stores vs GPU...

3 The StreamIt Language

4 The StreamIt Compiler Strategy
1. Fuse Stateless filters
– Eliminates communication and buffer copies
2. Data-Parallelize
– Allow for concurrent execution of future work
3. Pipeline
– Optimal partitioning of the work
[Figure panels: Coarsen Granularity, Data Parallelize, Software Pipeline]
Graphic from: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs. Michael Gordon. ASPLOS 2006, San Jose, CA, October 2006.
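The first two steps of the strategy above can be sketched in a few lines of Python. This is only an illustration of the structure, not the StreamIt compiler's implementation: `fuse` and `data_parallel` are hypothetical helper names, the filters are assumed to be stateless functions of one item, and Python threads are used only to show the split/join shape (they do not give real multicore speedup for CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def fuse(*filters):
    """Fuse a chain of stateless filters into one function.

    The fused filter passes each item through every stage directly,
    so no intermediate buffers sit between the stages.
    """
    def fused(item):
        for stage in filters:
            item = stage(item)
        return item
    return fused


def data_parallel(stateless_filter, items, workers=2):
    """Run one stateless filter over a stream using a round-robin split/join."""
    items = list(items)
    # Split: lane i receives items i, i + workers, i + 2 * workers, ...
    lanes = [items[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(lambda lane: [stateless_filter(v) for v in lane], lanes)
        )
    # Join: interleave the lane outputs back into the original order.
    out = [None] * len(items)
    for i, lane in enumerate(results):
        out[i::workers] = lane
    return out


if __name__ == "__main__":
    scale = lambda v: v * 2      # hypothetical stateless filter
    offset = lambda v: v + 1     # hypothetical stateless filter
    fused = fuse(scale, offset)
    print(data_parallel(fused, range(7), workers=3))
```

Because the fused filter keeps no state between items, each lane can be processed independently, which is what makes the data-parallel step legal.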

5 Audio Feature Extraction
Frequency-warped spectrum
– Human ear has better spectral resolution at lower frequencies
– To get necessary spectral resolution need FFT of length ~1024 (23 ms)
– With frequency warping only need length 32 (~1 ms)
Allows finer time resolution
– Improves analysis of rhythmic patterns and fast transients.
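The window durations quoted above follow from dividing the FFT length by the sample rate. The slide does not state the sample rate; the sketch below assumes 44.1 kHz, which is consistent with the 23 ms figure. `window_ms` is a hypothetical helper name.

```python
def window_ms(fft_length, sample_rate_hz=44100):
    """Duration in milliseconds of one analysis window of fft_length samples.

    The 44.1 kHz default is an assumption; the slides do not give the rate.
    """
    return 1000.0 * fft_length / sample_rate_hz


if __name__ == "__main__":
    for length in (1024, 32):
        print(f"length {length}: {window_ms(length):.2f} ms")
```

Under that assumption a length-32 window is 32 times shorter than a length-1024 window, which is where the finer time resolution comes from.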

7 Stream Graph
Produced by StreamIt compiler.
Stream graph for a 4-point warped FFT. A real application could use up to 128 points.
Fine-grained version of the algorithm.
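For readers unfamiliar with warped FFTs, here is a minimal sketch of one common construction: replace the unit delays of the FFT's input buffer with a chain of first-order all-pass filters, then take an ordinary DFT of the chain's taps. The slides do not say that this is the construction used here, so treat it as an assumption. The function names and the warping coefficient `lam` are hypothetical, and a plain O(N²) DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath


def allpass_chain_taps(samples, num_taps, lam):
    """Feed samples through num_taps - 1 cascaded first-order all-pass filters.

    Each stage implements y[n] = -lam * u[n] + u[n-1] + lam * y[n-1].
    With lam = 0 every stage is a plain one-sample delay.
    Returns the tap values after the last input sample.
    """
    prev_in = [0.0] * (num_taps - 1)    # u[n-1] for each stage
    prev_out = [0.0] * (num_taps - 1)   # y[n-1] for each stage
    taps = [0.0] * num_taps
    for x in samples:
        taps[0] = x
        for k in range(1, num_taps):
            u = taps[k - 1]
            y = -lam * u + prev_in[k - 1] + lam * prev_out[k - 1]
            prev_in[k - 1] = u
            prev_out[k - 1] = y
            taps[k] = y
    return taps


def dft(values):
    """Direct O(N^2) discrete Fourier transform."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * m / n) for m, v in enumerate(values))
        for k in range(n)
    ]


def warped_spectrum(samples, num_taps, lam):
    """DFT of the all-pass chain taps (a 'warped' spectrum of num_taps points)."""
    return dft(allpass_chain_taps(samples, num_taps, lam))


if __name__ == "__main__":
    # 4-point example, matching the size of the stream graph on this slide.
    print(warped_spectrum([1.0, 2.0, 3.0, 4.0], num_taps=4, lam=0.5))
```

With `lam = 0` the taps are just the most recent samples in reverse order, so the result reduces to an ordinary DFT of the reversed buffer. Each all-pass stage is a small stateful filter, which hints at why a fine-grained stream graph for this algorithm has many nodes.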

8 Interesting Architectures: Intel Clovertown
r41 in RAD cluster
Dual-Socket x Quad-Core, 2.66 GHz, 4-wide pipelines

9 Interesting Architectures: AMD Opteron
r28 in RAD cluster
Dual-Socket x Dual-Core, 2.2 GHz, 3-wide pipelines

10 Interesting Architectures: UltraSparc T2 (Niagara2)
r45 in RAD cluster
1 socket x 8 cores x 8 threads, 1.4 GHz
– Two-wide in-order pipelines, 1 FPU per core

11 StreamIt Scaling Performance
StreamIt compiler thinks it's doing a good job. It isn't.

12 Portability Problems
StreamIt code base in a primitive state
– Unable to produce working code for Niagara
– Frequently buggy even on x86 Multicores
Bugs ranged from “simple” to “subtle”
– Simple: Check return of fopen() for NULL
– Subtle: Die nondeterministically with cryptic error message
Lack of root access
– Made meeting compiler dependencies difficult

13 Dual-Core Performance
As a sanity check, ran on Core 2 Duo T9300 (Penryn)
– Eric’s laptop, root access (w00t)
Coarse-grained implementation designed for easy parallelization
– Shows speedup with -O1 flag but is actually faster with -O0
Fine-grained version shows definite slowdown.
– But performs much faster than coarse-grained.
– Compilation failed at 64 and 128 (out of memory)

14 Dual-Core Performance
Best runs of each implementation compared
– Coarse: 2 Cores (-O0)
– Fine: 1 Core (-O1)
– Control: Hand-optimized Matlab implementation, 2 threads
Single processor StreamIt performance very acceptable.
– Considering the productivity of the StreamIt language and the performance of the automatically-optimized
Why can’t the multi-core code work this well yet on x86 architectures?
“No, seriously, I want it now.”

15 Conclusions
High-Level, Domain-Specific programming environments
– Productivity win: we didn't have to write a streaming communication library
Compile-time transformations, lowering to C++
– Efficiency win: remove run-time overhead of High-Level language
StreamIt optimizations incomplete
– Agnostic of cache-coherence/communication topologies

