Download presentation
Presentation is loading. Please wait.
Published byStanley Gaines Modified over 8 years ago
1
Flexible Filters for High Performance Embedded Computing Rebecca Collins and Luca Carloni Department of Computer Science Columbia University
2
Motivation
3
Stream Programming Stream Programming model –filter: a piece of sequential programming –channels: how filters communicate –token: an indivisible unit of data for a filter Examples: Signal processing, image processing, embedded applications CB A
4
Example: Constant False-Alarm Rate (CFAR) Detection guard cells cell under test compare to with additional factors number of gates threshold (μ) number of guard cells rows and other dimensional data http://www.ll.mit.edu/HPECchallenge/ HPEC Challenge
5
123456789101112131415 uInt to float square right window left window add align data find targets CFAR Benchmark Data Stream:
6
uInt to float square right window left window add align data find targets 1 CFAR Benchmark 123456789101112131415
7
uInt to float square right window left window add align data find targets 21 CFAR Benchmark 123456789101112131415
8
uInt to float square right window left window add align data find targets 1 231 CFAR Benchmark 123456789101112131415
9
cell under test uInt to float square right window left window add align data find targets 1 67891011 23 91011 12 13 8 CFAR Benchmark 123456789101112131415 t right window left window t
10
Mapping Throughput: rate of processing data tokens multi-core chip core 1 A core 2 B core 3 C CB A
11
Mapping Throughput: rate of processing data tokens multi-core chip core 1 A B core 3 C CB A
12
core 1core 2core 3 Unbalanced Flow Bottlenecks reduce throughput –Caused by backpressure inherent algorithmic imbalances data-dependent computational spikes CB A “wait”
13
Data Dependent Execution Time: CFAR Using Set 1 from the HPEC CFAR Kernel Benchmark Over a block of about 100 cells Extra workload of 32 microseconds per target
14
Data Dependent Execution Time Some other examples: Bloom Filters (Financial, Spam detection) Compression (Image processing)
15
Our Solution: Flexible Filters Unused cycles on B are filled by working ahead on filter C with the data already present on B Push stream flow upstream of a bottleneck Semantic Preservation B A C C core 1core 2core 3 flex-split flex-merge
16
Related Works Static Compiler Optimizations –StreamIt [Gordon et al, 2006] Dynamic Runtime Load Balancing –Work Stealing/Filter Migration [Kakulavarapu et al., 2001] Cilk [Frigo et al., 1998] Flux [Shah et al., 2003] Borealis [Xing et al., 2005] –Queue based load balancing Diamond [Huston et al., 2005] distributed search, queue based load balancing, filter re-ordering Combination Static+Dynamic –FlexStream [Hormati et al., 2009] multiple competing programs
17
Outline Introduction Design Flow of a Stream Program with Flexibility Performance Implementation of Flexible Filters Experiments –CFAR Case Study
18
Design Flow of a Stream Program with Flexible Filters compiler/ design tools Design stream algorithm Mapping Filters Memory Add Flexibility to Bottlenecks Profile
19
Design Considerations Design stream algorithm Mapping Filters Memory Add Flexibility to Bottlenecks Profile
20
Adding Flexibility Design stream algorithm Mapping Filters Memory Add Flexibility to Bottlenecks Profile
21
Outline Introduction Design Flow of a Stream Program with Flexibility Performance Implementation of Flexible Filters Experiments –CFAR Case Study
22
Mapping Stream Programs to Multi-Core Platforms BAC BAC BAC BAC pipeline mapping SPMD mapping BAC sharing a core core 1 core 2core 3 core 2 core 1
23
Throughput: SPMD Mapping BAC BAC BAC SPMD mapping t0t1t2t3t4t5t6 core 1 core 2 core 3 A1 A2 A3 B1 B2 B3 C1 C2 C3 3 tokens processed in 7 timesteps, ideal throughput = 3/7 = 0.429 Suppose E A = 2, E B = 2, E C = 3 core 2 core 1 core 3 2 3 1 E A =2 E B =2 E C =3
24
Throughput: Pipeline Mapping C B A core 1core 2 core 3 t0t1t2t3t4t5t6t7t8t9 core 1 core 2 core 3 latency 2 latency 3 1 A1 A2 B1 C1 B2 A3 C2 B3 A4 1 2 1 2 3 2 3 4 throughput = 1/3 = 0.333 < 0.429
25
Data Blocks CB A core 1core 2core 3 C data block: group of data tokens
26
Throughput: Pipeline Augmented with Flexibility C B A core 1core 2 core 3 C t0t1t2t3t4t5t6t7t8t9 core 1 core 2 core 3 A1 A2 B1 C1 B2 A3 C2 B3 A4 C2 throughput = 2/5 = 0.4 0.333) 1 1 2 1 2 3 2 3 4
27
Outline Introduction Design Flow of a Stream Program with Flexibility Performance Implementation of Flexible Filters Experiments –CFAR Case Study
28
Flex-Split Flex-split pop data block b from in n0 = available space on out0 n1 = |b| - n0 send n0 to out0, n1 to out1 send n0 0’s, then n1 1’s to select C B core 2core 3 C flex split flex merge C C out out1 in0 in1 select maintain ordering based on run-time state of queues out0 in
29
Flex-Merge Flex-merge pop i from select if i is 0, pop token from in0 if i is 1, pop token from in1 push token to out C B core 2core 3 C flex split flex merge C C out out1 in0 in1 select out0 in Overhead of Flexibility?
30
Multi-Channel Flex-Split and Flex-Merge flex splitfilter select … filter flex flex merge output channel 1 output channel 2 output channel n output channel 1 output channel 2 output channel n
31
Multi-Channel Flex-Split and Flex-Merge filter … input channel 1 input channel 2 input channel n
32
Multi-Channel Flex-Split and Flex-Merge filter … filter flex input channel 1 input channel 2 input channel n flex merge select flex split Centralized
33
Multi-Channel Flex-Split and Flex-Merge filter … filter flex input channel 1 input channel 2 input channel n flex merge select flex split filter … filter flex input channel 1 input channel 2 input channel n flex merge flex split Centralized β flex split select Distributed
34
Outline Introduction Design Flow of a Stream Program with Flexibility Performance Implementation of Flexible Filters Experiments –CFAR Case Study
35
Cell BE Processor Distributed Memory Heterogeneous –8 SIMD (SPU) cores –1 PowerPC (PPU) Element Interconnect Bus –4 rings –205 Gb/s Gedae Programming Language Communication Layer
36
Gedae Commercial data-flow language and programming GUI Performance analysis tools
37
CFAR Benchmark uInt to float square right window left window add align data find targets
38
Data Dependency By changing threshold, change % targets –1.3 % –7.3 % Additional workload per target –16 µs –32 µs –64 µs 1.37.3 16 μs0.821.45 32 μs1.061.39 64 μs1.271.47 % Targets Additional Workload
39
More Benchmarks BenchmarkFieldResults Dedup Information Theory Rabin block/ max chunk sizeSpeedup 4096/5122.00 JPEGImage Processing Image width x height 128x128 256x256 512x512 1.31 1.16 1.25 Value-at- Risk Finance stocks/walks/timesteps 16/1024/1024 64/1024/1024 128/1024/1024 0.98 1.56 1.55
40
Conclusions Flexible filters –adapt to data dependent bottlenecks –distributed load balancing –provide speedup without modification to original filters –can be implemented on top of general stream languages
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.