Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow Language. Eric Battenberg, Mark Murphy. CS 267, Spring 2008.


Introduction
The world is going multicore
– Sequential performance has hit the "Three Walls": Power Wall + Memory Wall + ILP Wall = Brick Wall [Patterson]
Software development is about to get really hard
– For performance, must utilize parallel processors
– Multiple, non-portable granularities: SIMD vs. multi-core vs. multi-socket vs. local stores vs. GPU...

The StreamIt Language
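The StreamIt example on this slide was a graphic that did not survive the transcript. As a rough illustration of the language's filter/pipeline abstraction (StreamIt itself is a standalone DSL in which each filter declares fixed pop/push/peek rates), here is a hypothetical sketch using Python generators; every name below is invented for illustration:

```python
# Hypothetical model of StreamIt-style filters as Python generators.
# In real StreamIt, pop/push rates are declared so the compiler can
# schedule and partition the graph statically.

def source(samples):
    # Source filter: pushes one sample per firing.
    for s in samples:
        yield s

def scale(upstream, gain):
    # Stateless filter: pop 1, push 1.
    for s in upstream:
        yield gain * s

def moving_sum(upstream, n):
    # Peeking filter: models a sliding window (peek n, pop 1, push 1).
    window = []
    for s in upstream:
        window.append(s)
        if len(window) == n:
            yield sum(window)
            window.pop(0)

# Compose a pipeline: source -> scale -> moving_sum
out = list(moving_sum(scale(source(range(8)), 2.0), 3))
```

In StreamIt the same composition would be written declaratively as a `pipeline` of `filter`s, which is what gives the compiler a whole-program stream graph to partition.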

The StreamIt Compiler Strategy
1. Fuse stateless filters
– Eliminates communication and buffer copies
2. Data-parallelize
– Allows for concurrent execution of future work
3. Pipeline
– Optimal partitioning of the work
Coarsen granularity, then data-parallelize, then software-pipeline.
Graphic from: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs. Michael Gordon. ASPLOS 2006, San Jose, CA, October 2006.
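Step 2 above (data-parallelization, often called filter "fission") amounts to replicating a stateless filter, feeding the replicas by a round-robin split, and merging their outputs with a round-robin join. This toy Python model (not the compiler's actual code) runs the replicas sequentially just to show that stream order is preserved:

```python
def fission(filter_fn, items, width):
    # Data-parallel fission of a stateless filter:
    # round-robin split across `width` replicas, then round-robin join.
    # The replicas could run on separate cores; stream order is restored
    # by interleaving the per-lane outputs in the same round-robin order.
    lanes = [items[i::width] for i in range(width)]
    results = [[filter_fn(x) for x in lane] for lane in lanes]
    joined = []
    for i in range(max(len(lane) for lane in results)):
        for lane in results:
            if i < len(lane):
                joined.append(lane[i])
    return joined

# Fission is only legal because the filter is stateless: each output
# depends only on its own input item, so order of execution is free.
squares = fission(lambda x: x * x, list(range(10)), 4)
```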

Audio Feature Extraction
Frequency-warped spectrum
– The human ear has better spectral resolution at lower frequencies
– To get the necessary spectral resolution, a plain FFT needs length ~1024 (23 ms)
– With frequency warping, length 32 (~1 ms) suffices
Allows finer time resolution
– Improves analysis of rhythmic patterns and fast transients
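The frame durations quoted above follow directly from the frame length divided by the sample rate; assuming the common 44.1 kHz audio rate (the slide does not state it):

```python
# Frame durations implied by the slide's FFT lengths,
# assuming a 44.1 kHz sample rate (not stated on the slide).
fs = 44100.0
plain_fft_len = 1024
warped_fft_len = 32

plain_ms = 1000.0 * plain_fft_len / fs    # ~23.2 ms, the "23ms" above
warped_ms = 1000.0 * warped_fft_len / fs  # ~0.73 ms, the "~1ms" above
```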

Stream Graph
Produced by the StreamIt compiler.
Stream graph for a 4-point warped FFT; a real application could use up to 128 points.
Fine-grained version of the algorithm.
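For reference, the computation that the graph's fine-grained nodes decompose is a radix-2 FFT. A textbook sketch of the unwarped case (the warped variant on the slide inserts allpass delay chains ahead of the transform, which are omitted here):

```python
import cmath

def fft(x):
    # Textbook recursive radix-2 FFT (unwarped case only).
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    # Twiddle factors applied to the odd half.
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + tw[k] for k in range(n // 2)] +
            [even[k] - tw[k] for k in range(n // 2)])

# A unit impulse has a flat spectrum: four (complex) ones for a 4-point FFT.
spec = fft([1, 0, 0, 0])
```

Each butterfly (the add/subtract pairs above) becomes a tiny filter in the fine-grained stream graph, which is why even a 4-point transform produces a nontrivial graph.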

Interesting Architectures: Intel Clovertown
r41 in RAD cluster
– 2 sockets × 4 cores, 2.66 GHz
– 4-wide pipelines

Interesting Architectures: AMD Opteron
r28 in RAD cluster
– 2 sockets × 2 cores, 2.2 GHz
– 3-wide pipelines

Interesting Architectures: UltraSPARC T2 (Niagara2)
r45 in RAD cluster
– 1 socket × 8 cores × 8 threads, 1.4 GHz
– Two-wide in-order pipelines, 1 FPU per core
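Tallying the hardware thread counts from the three machine specs shows how different these design points are, and why Niagara2 was the most interesting scaling target:

```python
# Aggregate hardware parallelism of the three machines (specs from the slides;
# neither the Clovertown nor this Opteron generation has SMT).
clovertown_threads = 2 * 4      # 2 sockets x 4 cores
opteron_threads = 2 * 2         # 2 sockets x 2 cores
niagara2_threads = 1 * 8 * 8    # 1 socket x 8 cores x 8 threads/core
```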

StreamIt Scaling Performance
The StreamIt compiler thinks it's doing a good job. It isn't.
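The chart behind this claim was a graphic and did not survive the transcript. The quantities such a chart plots, speedup and parallel efficiency, are simple to state; the timings below are hypothetical, not the slide's measurements:

```python
def parallel_efficiency(t1, tp, p):
    # Speedup = T(1)/T(p); efficiency = speedup / p.
    speedup = t1 / tp
    return speedup, speedup / p

# Hypothetical example: if 8 cores run only ~1.5x faster than 1 core,
# parallel efficiency is under 20% -- the compiler "isn't doing a good job".
s, e = parallel_efficiency(10.0, 6.67, 8)
```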

Portability Problems
StreamIt code base in a primitive state
– Unable to produce working code for Niagara
– Frequently buggy even on x86 multicores
Bugs ranged from "simple" to "subtle"
– Simple: check the return of fopen() for NULL
– Subtle: dies nondeterministically with a cryptic error message
Lack of root access
– Made meeting compiler dependencies difficult

Dual-Core Performance
As a sanity check, ran on a Core 2 Duo T9300 (Penryn)
– Eric's laptop, with root access (w00t)
Coarse-grained implementation designed for easy parallelization
– Shows speedup with the -O1 flag but is actually faster with -O0
Fine-grained version shows a definite slowdown when parallelized
– But performs much faster than the coarse-grained version
– Compilation failed at 64 and 128 points (out of memory)

Dual-Core Performance
Best runs of each implementation compared
– Coarse: 2 cores (-O0)
– Fine: 1 core (-O1)
– Control: hand-optimized Matlab implementation, 2 threads
Single-processor StreamIt performance is very acceptable
– Considering the productivity of the StreamIt language and the performance of the automatically optimized code
Why can't the multi-core code work this well yet on x86 architectures?
"No, seriously, I want it now."

Conclusions
High-level, domain-specific programming environments
– Productivity win: we didn't have to write a streaming communication library
Compile-time transformations, lowering to C++
– Efficiency win: removes the run-time overhead of a high-level language
StreamIt optimizations incomplete
– Agnostic of cache-coherence/communication topologies