Flexible Filters for High Performance Embedded Computing
Rebecca Collins and Luca Carloni, Department of Computer Science, Columbia University

Motivation

Stream Programming
Stream programming model:
– filter: a piece of sequential code
– channel: how filters communicate
– token: an indivisible unit of data for a filter
Examples: signal processing, image processing, embedded applications
[Figure: a pipeline of filters A → B → C]
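The filter/channel/token model can be sketched in a few lines of Python, with each filter written as a generator that consumes tokens from its input channel and emits tokens downstream. This is an illustrative sketch only: the paper's filters are Gedae components, and the filter bodies here are made up.

```python
# Minimal sketch of the stream model: a filter consumes tokens from an
# input channel and emits tokens on an output channel, one at a time.
def square(tokens):          # filter A: square each token
    for t in tokens:
        yield t * t

def add_one(tokens):         # filter B: add a constant to each token
    for t in tokens:
        yield t + 1

# Channels are the connections between filters; composing the
# generators builds the pipeline A -> B.
stream = add_one(square(iter([1, 2, 3])))
print(list(stream))          # [2, 5, 10]
```

Because generators are pulled lazily, tokens flow through the pipeline one at a time, mirroring how a stream runtime moves data between filters.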

Example: Constant False-Alarm Rate (CFAR) Detection
From the HPEC Challenge. The cell under test is compared to its neighboring window cells, excluding the adjacent guard cells, with additional factors: number of gates, threshold (μ), number of guard cells, and the rows and other dimensional data of the input.
[Figure: guard cells surrounding the cell under test]

CFAR Benchmark
Data stream pipeline: uInt to float → square → left window / right window → add → align data → find targets
[Animation: tokens advance through the pipeline while the left and right windows slide around the cell under test t]
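As a concrete (and much simplified) reference, a 1-D cell-averaging CFAR step can be sketched as follows. The function name, parameters, and the plain averaging rule are illustrative assumptions; the HPEC kernel operates on squared magnitudes over multi-dimensional data.

```python
def cfar_1d(cells, n_guard, n_window, mu):
    """Flag cell i as a target when it exceeds mu times the average of
    the window cells to its left and right, skipping n_guard guard
    cells on each side of the cell under test.
    (Illustrative sketch, not the HPEC benchmark's exact interface.)"""
    targets = []
    for i in range(len(cells)):
        left = cells[max(0, i - n_guard - n_window):max(0, i - n_guard)]
        right = cells[i + n_guard + 1:i + n_guard + 1 + n_window]
        window = left + right
        if window and cells[i] > mu * sum(window) / len(window):
            targets.append(i)
    return targets

# A single strong cell among uniform noise is detected as a target.
print(cfar_1d([1, 1, 1, 1, 10, 1, 1, 1, 1], n_guard=1, n_window=2, mu=2.0))  # [4]
```

Note that the per-cell work grows when a target is found (downstream processing per target), which is exactly the data-dependent workload the slides examine next.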

Mapping
Throughput: the rate of processing data tokens.
[Figure: filters A, B, C of the stream graph mapped onto cores 1, 2, 3 of a multi-core chip; in an alternative mapping, A and B share core 1]

Unbalanced Flow
Bottlenecks reduce throughput: backpressure from a slow filter stalls its upstream neighbors. Bottlenecks arise from inherent algorithmic imbalances and from data-dependent computational spikes.
[Figure: filter A on core 1 must wait because the downstream filters are backed up]

Data-Dependent Execution Time: CFAR
Using Set 1 from the HPEC CFAR Kernel Benchmark; execution time measured over blocks of about 100 cells, with an extra workload of 32 microseconds per target.

Data-Dependent Execution Time
Other examples: Bloom filters (finance, spam detection), compression (image processing).

Our Solution: Flexible Filters
Unused cycles on B's core are filled by working ahead on filter C with the data already present on that core: stream computation is pushed upstream of a bottleneck while preserving the program's semantics.
[Figure: cores 1-3 run A, B, C; a flexible copy of C on core 2 is fed by flex-split and drained by flex-merge]

Related Work
Static compiler optimizations:
– StreamIt [Gordon et al., 2006]
Dynamic runtime load balancing:
– work stealing / filter migration: [Kakulavarapu et al., 2001], Cilk [Frigo et al., 1998], Flux [Shah et al., 2003], Borealis [Xing et al., 2005]
– queue-based load balancing: Diamond [Huston et al., 2005] (distributed search, queue-based load balancing, filter re-ordering)
Combined static + dynamic:
– FlexStream [Hormati et al., 2009] (multiple competing programs)

Outline
– Introduction
– Design Flow of a Stream Program with Flexibility
– Performance
– Implementation of Flexible Filters
– Experiments: CFAR Case Study

Design Flow of a Stream Program with Flexible Filters
[Flow chart: design the stream algorithm → map filters to cores and memory (with compiler/design tools) → profile → add flexibility to bottlenecks]

Design Considerations
[Design-flow chart repeated: design stream algorithm, map filters and memory, profile, add flexibility to bottlenecks]

Adding Flexibility
[Design-flow chart repeated, highlighting the step that adds flexibility to bottlenecks]


Mapping Stream Programs to Multi-Core Platforms
Options shown: pipeline mapping (one filter per core), SPMD mapping (a copy of the whole pipeline A → B → C on each core), and filters sharing a core (A and B together on core 1).
[Figure: the mappings of filters A, B, C across cores 1-3]

Throughput: SPMD Mapping
Suppose the per-token execution times are E_A = 2, E_B = 2, E_C = 3. In the SPMD mapping each core runs the whole pipeline A → B → C on its own tokens, so 3 tokens are processed every 7 timesteps: ideal throughput = 3/7 ≈ 0.429.
[Schedule over timesteps t0-t6: core 1 runs A1, B1, C1; core 2 runs A2, B2, C2; core 3 runs A3, B3, C3]

Throughput: Pipeline Mapping
With A, B, and C each on its own core, throughput is limited by the slowest stage: filter C takes 3 timesteps per token, so throughput = 1/3 ≈ 0.333 < 0.429.
[Schedule over timesteps t0-t9: A1, A2, B1, C1, B2, A3, C2, B3, ...; cores 1 and 2 sit idle waiting on core 3]

Data Blocks
data block: a group of data tokens
[Figure: pipeline A → B → C across cores 1-3, with a copy of C placed alongside B on core 2]

Throughput: Pipeline Augmented with Flexibility
Core 2 fills its idle cycles by working ahead on filter C, so 2 tokens complete every 5 timesteps: throughput = 2/5 = 0.4 (between 0.333 and 0.429).
[Schedule over timesteps t0-t9: core 2 interleaves B1, B2, B3 with flexible executions of C; core 1 runs A1-A4]
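The three throughput figures follow from simple arithmetic on the execution times; the small check below reproduces them, using the values given in the slides.

```python
# Per-token execution times from the slides: E_A = 2, E_B = 2, E_C = 3.
E_A, E_B, E_C = 2, 2, 3

# SPMD: each of the 3 cores runs the whole pipeline on its own tokens,
# so 3 tokens complete every E_A + E_B + E_C = 7 timesteps.
spmd = 3 / (E_A + E_B + E_C)        # 3/7, about 0.429

# Pipeline: throughput is limited by the slowest stage, filter C.
pipeline = 1 / max(E_A, E_B, E_C)   # 1/3, about 0.333

# Pipeline + flexibility: core 2 works ahead on C during idle cycles,
# completing 2 tokens every 5 timesteps (schedule from the slides).
flexible = 2 / 5                    # 0.4

assert pipeline < flexible < spmd
```

Flexibility recovers part of the gap between the rigid pipeline and the SPMD ideal without replicating the whole program on every core.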


Flex-Split
Flex-split:
  pop data block b from in
  n0 = available space on out0
  n1 = |b| - n0
  send n0 tokens to out0 and n1 tokens to out1
  send n0 0's, then n1 1's, to select
Ordering is maintained based on the run-time state of the queues.
[Figure: flex-split routes the output of B between the copies of C on cores 2 and 3; the select channel records the routing]

Flex-Merge
Flex-merge:
  pop i from select
  if i is 0, pop a token from in0
  if i is 1, pop a token from in1
  push the token to out
What is the overhead of flexibility?
[Figure: flex-merge recombines the outputs of the two copies of C in the original stream order]
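Putting the two pseudocode fragments together, a sequential sketch of flex-split and flex-merge might look like this; the select channel records where each token went so that flex-merge can restore the original order. The data structures and names are illustrative, not the Gedae implementation.

```python
from collections import deque

def flex_split(block, out0_space):
    """Send as many tokens as fit (n0) to the local copy of the filter
    (out0) and the remaining n1 = |b| - n0 tokens downstream (out1);
    record the routing decisions, in order, on the select channel."""
    n0 = min(out0_space, len(block))
    out0, out1 = block[:n0], block[n0:]
    select = [0] * n0 + [1] * (len(block) - n0)
    return out0, out1, select

def flex_merge(select, in0, in1):
    """Pop tokens from in0/in1 as dictated by select, restoring the
    original stream order on the output channel."""
    q0, q1 = deque(in0), deque(in1)
    return [q0.popleft() if i == 0 else q1.popleft() for i in select]

out0, out1, select = flex_split([10, 11, 12, 13, 14], out0_space=3)
# Both copies of C process their share (identity here, for illustration),
# then flex-merge reassembles the results in the original order.
print(flex_merge(select, out0, out1))   # [10, 11, 12, 13, 14]
```

The overhead of flexibility is the bookkeeping visible here: the extra select tokens and the split/merge routing around the replicated filter.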

Multi-Channel Flex-Split and Flex-Merge
[Figures: a flexible filter with n input and n output channels. In the centralized design, a single flex-split and one select channel serve all n channels; in the distributed design, each channel has its own flex-split/flex-merge pair]


Cell BE Processor
Distributed-memory, heterogeneous architecture:
– 8 SIMD (SPU) cores
– 1 PowerPC (PPU) core
Element Interconnect Bus:
– 4 rings
– 205 GB/s
Gedae provides the programming language and communication layer.

Gedae
Commercial data-flow language with a programming GUI and performance analysis tools.

CFAR Benchmark
Pipeline: uInt to float → square → left window / right window → add → align data → find targets

Data Dependency
Changing the threshold changes the percentage of targets: 1.3 % or 7.3 %. The additional workload per target is varied over 16 µs, 32 µs, and 64 µs.
[Chart: results plotted against % targets and additional workload per target]

More Benchmarks
Benchmark / field / parameters (speedup figures appear in the slide's table):
– Dedup (information theory): Rabin block / max chunk size, e.g. 4096/…
– JPEG (image processing): image width x height, e.g. 128x…
– Value-at-Risk (finance): stocks/walks/timesteps, e.g. 16/1024/…

Conclusions
Flexible filters:
– adapt to data-dependent bottlenecks
– perform distributed load balancing
– provide speedup without modification to the original filters
– can be implemented on top of general stream languages