
1 S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures
Mr. Henry Hoffmann, Dr. Jeremy Kepner, Mr. Robert Bond
MIT Lincoln Laboratory
27 September 2001, HPEC Workshop, Lexington, MA
This work is sponsored by the United States Air Force under Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the Department of Defense.

2 Outline
Introduction
–Problem Statement
–S3P Program
Design
Demonstration
Results
Summary

3 PCA Need: System Level Optimization
[Diagram: a signal processing application built from PCA components (Filter: X_OUT = FIR(X_IN); Beamform: X_OUT = w*X_IN; Detect: X_OUT = |X_IN| > c) layered over Morphware software components and hardware.]
Applications built with components
–Components have a defined scope
–Capable of local optimization
System requires global optimization
–Not visible to components
–Too complex to add to the application
Need system-level optimization capabilities as part of PCA

4 Example: Optimum System Latency
[Plots: component latency vs. hardware units N for the local optimum, and system latency vs. the filter/beamform hardware split for the global optimum; filter latency = 1/N, beamform latency = 2/N, with system constraints Latency < 8 and Hardware < 32.]
–Simple two-component system
–Local optimum fails to satisfy global constraints
–Need system view to find the global optimum
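To make the system-level search concrete, here is a minimal sketch (not from the original slides) that exhaustively splits a hardware budget between the two components using the latency models above (filter = 1/N, beamform = 2/N) and the example constraints (hardware < 32 units, latency < 8):

```cpp
// Minimal sketch, not from the slides: a system-level search over hardware
// allocations using the component latency models shown above
// (filter latency = 1/N_f, beamform latency = 2/N_b) and the example
// constraints (total hardware < 32 units, system latency < 8).
#include <cstdio>
#include <limits>

int main() {
    const int hardwareBudget = 32;   // system constraint on total units
    const double latencyBound = 8.0; // system constraint on end-to-end latency

    double bestLatency = std::numeric_limits<double>::infinity();
    int bestFilter = 0, bestBeamform = 0;

    // Try every feasible split of the budget between the two components.
    for (int nf = 1; nf < hardwareBudget; ++nf) {
        for (int nb = 1; nf + nb < hardwareBudget; ++nb) {
            double latency = 1.0 / nf + 2.0 / nb; // filter + beamform
            if (latency < latencyBound && latency < bestLatency) {
                bestLatency = latency;
                bestFilter = nf;
                bestBeamform = nb;
            }
        }
    }

    std::printf("best split: filter=%d, beamform=%d, system latency=%.3f\n",
                bestFilter, bestBeamform, bestLatency);
    return 0;
}
```

A component optimizing only its own latency would simply take as much hardware as it can; only the joint loop over the two allocations sees the system-level constraints, which is the slide's point.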

5 System Optimization Challenge
[Diagram: the signal processing application (Filter: X_OUT = FIR(X_IN); Beamform: X_OUT = w*X_IN; Detect: X_OUT = |X_IN| > c) must be mapped onto a compute fabric (cluster, FPGA, SOC, …) to achieve an optimal resource allocation (latency, throughput, memory, bandwidth, …).]
–Optimizing to system constraints requires a two-way component/system knowledge exchange
–Need a framework to mediate the exchange and perform system-level optimization

6 S3P Lincoln Internal R&D Program
–Goal: applications that self-optimize to any hardware, combining Lincoln Laboratory parallel signal processing expertise (Kepner/Hoffmann) with the MIT LCS self-optimizing software (FFTW) approach (Leiserson/Frigo)
–S3P brings the self-optimizing (FFTW) approach to parallel signal processing systems
–The framework exploits a graph theory abstraction, is broadly applicable to system optimization problems, and defines clear component and system requirements
[Diagram: the S3P framework enumerates processor mappings (1..N) for each algorithm stage (1..M), then times and verifies candidates to select the best mappings.]

7 Outline
Introduction
Design
–Requirements
–Graph Theory
Demonstration
Results
Summary

8 System Requirements
[Diagram: each compute stage (Filter: X_OUT = FIR(X_IN); Beamform: X_OUT = w*X_IN; Detect: X_OUT = |X_IN| > c) can be mapped to different sets of hardware and timed.]
The application must be:
–Mappable to different sets of hardware
–Measurable in resource usage for each mapping
–Decomposable into Tasks (computation) and Conduits (communication)

9 System Graph
[Diagram: a layered graph with one column of candidate nodes per task (Beamform, Filter, Detect).]
–A node is a unique mapping of a task
–An edge is a conduit between a pair of task mappings
–The system graph can store the hardware resource usage of every possible Task and Conduit

10 Path = System Mapping
[Diagram: a highlighted path through the system graph (Beamform, Filter, Detect), one node per task.]
–Each path is a complete system mapping
–The "best" path is the optimal system mapping
–The graph construct is very general and widely used for optimization problems
–Many efficient techniques exist for choosing the "best" path (under constraints), such as dynamic programming

11 Example: Maximize Throughput
[Diagram: system graph in which each node stores the task time for a given mapping and each edge stores the conduit time for a given pair of mappings; example timings annotate the nodes and edges, and candidate paths are compared as more hardware is added.]
–Goal: maximize throughput and minimize hardware
–Choose the path with the smallest bottleneck that satisfies the hardware constraint
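As an illustration of the "smallest bottleneck" selection, here is a minimal self-contained sketch; the stage names match the slide, but every task time, the conduit-time model, and the CPU budget are invented placeholders rather than the measured values on the slide:

```cpp
// Minimal sketch: pick one mapping per stage so that the largest single
// task-or-conduit time along the path (the throughput bottleneck) is as
// small as possible, subject to a total hardware constraint.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct Mapping {
    int cpus;        // hardware units this mapping consumes
    double taskTime; // measured compute time for this mapping
};

struct Stage {
    std::string name;
    std::vector<Mapping> mappings;
};

int main() {
    // Hypothetical system graph: three stages, each timed on 1..3 CPUs.
    // All numbers are made-up placeholders, not measurements from the slide.
    std::vector<Stage> stages = {
        {"Beamform", {{1, 4.0}, {2, 2.0}, {3, 1.5}}},
        {"Filter",   {{1, 8.0}, {2, 4.0}, {3, 3.0}}},
        {"Detect",   {{1, 2.0}, {2, 1.0}, {3, 0.7}}},
    };
    // Placeholder conduit-time model; S3P times every edge individually.
    auto conduitTime = [](const Mapping& a, const Mapping& b) {
        return 0.2 * (a.cpus + b.cpus);
    };

    const int cpuBudget = 6; // hardware constraint
    double bestBottleneck = 1e30;
    std::vector<int> best, choice(stages.size(), 0);

    // Exhaustively try one mapping per stage (the system graph is tiny here).
    std::function<void(std::size_t)> search = [&](std::size_t s) {
        if (s == stages.size()) {
            int cpus = 0;
            double bottleneck = 0.0;
            for (std::size_t i = 0; i < stages.size(); ++i) {
                const Mapping& m = stages[i].mappings[choice[i]];
                cpus += m.cpus;
                bottleneck = std::max(bottleneck, m.taskTime);
                if (i > 0) {
                    const Mapping& prev = stages[i - 1].mappings[choice[i - 1]];
                    bottleneck = std::max(bottleneck, conduitTime(prev, m));
                }
            }
            if (cpus <= cpuBudget && bottleneck < bestBottleneck) {
                bestBottleneck = bottleneck;
                best = choice;
            }
            return;
        }
        for (std::size_t k = 0; k < stages[s].mappings.size(); ++k) {
            choice[s] = static_cast<int>(k);
            search(s + 1);
        }
    };
    search(0);

    for (std::size_t i = 0; i < stages.size(); ++i)
        std::printf("%s -> %d CPU(s)\n", stages[i].name.c_str(),
                    stages[i].mappings[best[i]].cpus);
    std::printf("bottleneck (1/throughput) = %.2f\n", bestBottleneck);
    return 0;
}
```

The exhaustive search is only for clarity at this size; the next slide shows the dynamic programming and Dijkstra-style approaches that scale to larger graphs.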

12 Path Finding Algorithms
–The graph construct is very general and widely used for optimization problems
–Many efficient techniques exist for choosing the "best" path (under constraints), such as Dijkstra's Algorithm and dynamic programming

Dynamic programming (pseudocode from the slide, with indentation and a dropped bracket restored):

    N = total hardware units
    M = number of tasks
    P[i] = number of mappings for task i
    t = M
    pathTable[M][N] = all infinite-weight paths
    for (j = 1..M) {
        for (k = 1..P[j]) {
            for (i = j+1..N-t+1) {
                if (i - size[k] >= j) {
                    if (j > 1) {
                        w = weight[pathTable[j-1][i-size[k]]]
                            + weight[k]
                            + weight[edge[last[pathTable[j-1][i-size[k]]], k]]
                        p = addVertex[pathTable[j-1][i-size[k]], k]
                    } else {
                        w = weight[k]
                        p = makePath[k]
                    }
                    if (weight[pathTable[j][i]] > w) {
                        pathTable[j][i] = p
                    }
                }
            }
        }
        t = t - 1
    }

Dijkstra's Algorithm (pseudocode from the slide):

    initialize graph G and source vertex s
    store all vertices of G in a minimum priority queue Q
    while (Q is not empty)
        u = pop[Q]
        for (each vertex v adjacent to u)
            w = u.totalPathWeight() + weight of edge (u,v) + v.weight()
            if (v.totalPathWeight() > w)
                v.totalPathWeight() = w
                v.predecessor() = u

13 S3P Inputs and Outputs
[Diagram: the application (required) together with optional hardware information, algorithm information, and system constraints feed the S3P framework, which outputs the "best" system mapping.]
–Information about the application, algorithm, system, and hardware can be added flexibly

14 Outline
Introduction
Design
Demonstration
–Application
–Middleware
–Hardware
–S3P
Results
Summary

15 S3P Demonstration Testbed
[Diagram: the multi-stage application (Input → Low Pass Filter → Beamform → Matched Filter), built on PVL middleware (Map, Task, Conduit), runs on workstation-cluster hardware under control of the S3P Engine.]

16 Multi-Stage Application
[Signal flow: Input → Low Pass Filter (FIR1 with weights W1, FIR2 with weights W2) → Beamform (multiply by weights W3) → Matched Filter (FFT, multiply by weights W4, IFFT).]
Features:
–"Generic" radar/sonar signal processing chain
–Utilizes key kernels (FIR, matrix multiply, FFT, and corner turn)
–Scalable to any problem size (fully parameterized algorithm)
–Self-validates (built-in target generator)

17 Parallel Vector Library (PVL)

Class         | Description                                                                            | Parallelism
Computation   | Performs signal/image processing functions on matrices/vectors (e.g. FFT, FIR, QR)    | Data & Task
Matrix/Vector | Used to perform matrix/vector algebra on data spanning multiple processors            | Data
Task          | Supports algorithm decomposition (i.e. the boxes in a signal flow diagram)            | Task & Pipeline
Conduit       | Supports data movement between tasks (i.e. the arrows on a signal flow diagram)       | Task & Pipeline
Map           | Specifies how Tasks, Matrices/Vectors, and Computations are distributed on processors | Data, Task & Pipeline
Grid          | Organizes processors into a 2D layout                                                 |

Simple mappable components support data, task, and pipeline parallelism. (The slide groups Computation, Matrix/Vector, Task, and Conduit under "Signal Processing & Control" and Map and Grid under "Mapping".)
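To make the roles of Map, Task, and Conduit concrete, here is a deliberately generic sketch of a two-stage pipeline; the class names and method signatures are invented for this illustration and are not PVL's actual API:

```cpp
// Generic illustration of the Task/Conduit/Map decomposition described in
// the table above. These types are invented for the sketch and are NOT the
// real PVL interfaces; they only show how a stage (Task) computes on its
// assigned processors and hands data to the next stage through a Conduit.
#include <cstdio>
#include <vector>

// A Map in the spirit of the table: which processors a piece of work uses.
struct Map {
    std::vector<int> processors;
};

// A Task owns a Map and performs one stage of the signal flow diagram.
struct Task {
    const char* name;
    Map map;
    std::vector<double> run(const std::vector<double>& in) const {
        std::printf("%s running on %zu processor(s)\n", name,
                    map.processors.size());
        return in; // placeholder computation
    }
};

// A Conduit moves data between two Tasks (an arrow in the flow diagram).
struct Conduit {
    std::vector<double> buffer;
    void insert(const std::vector<double>& data) { buffer = data; }
    std::vector<double> extract() const { return buffer; }
};

int main() {
    Task filter{"Low Pass Filter", {{0, 1}}};   // mapped to processors 0-1
    Task beamform{"Beamform", {{2, 3, 4}}};     // mapped to processors 2-4
    Conduit pipe;

    std::vector<double> input(1024, 1.0);                    // dummy frame
    pipe.insert(filter.run(input));                          // stage 1 -> conduit
    std::vector<double> out = beamform.run(pipe.extract());  // conduit -> stage 2
    std::printf("output frame of %zu samples\n", out.size());
    return 0;
}
```

The key point the table makes is that remapping a stage only means changing its Map; the Task and Conduit code stays the same, which is what lets S3P try many mappings automatically.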

18 Hardware Platform
Network of 8 Linux workstations
–Dual 800 MHz Pentium III processors
Communication
–Gigabit Ethernet, 8-port switch
–Isolated network
Software
–Linux kernel release 2.2.14
–GNU C++ compiler
–MPICH communication library over TCP/IP
Advantages
–Software tools
–Widely available
–Inexpensive (high Mflops/$)
–Excellent rapid prototyping platform
Disadvantages
–Non-real-time OS
–Non-real-time messaging
–Slower interconnect
–Difficult to model
–SMP behavior erratic

19 S3P Engine
[Diagram: the application program, hardware information, algorithm information, and system constraints feed the S3P Engine (Map Generator → Map Timer → Map Selector), which outputs the "best" system mapping.]
–The Map Generator constructs the system graph for all candidate mappings
–The Map Timer times each node and edge of the system graph
–The Map Selector searches the system graph for the optimal set of maps
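A minimal sketch of the generate/time/select flow described above, assuming a toy 1/CPUs timing model and a naive per-stage budget split; none of this is the actual S3P engine code:

```cpp
// Sketch of the three S3P engine phases named above (generate candidate
// maps, time them, select a set). All names and the timing model are
// placeholders for illustration only.
#include <cstdio>
#include <vector>

struct Candidate {
    int stage;
    int cpus;
    double measuredTime; // filled in by the timing phase
};

// Map Generator: enumerate candidate mappings for each stage.
std::vector<Candidate> generateMaps(int stages, int maxCpus) {
    std::vector<Candidate> maps;
    for (int s = 0; s < stages; ++s)
        for (int c = 1; c <= maxCpus; ++c)
            maps.push_back({s, c, 0.0});
    return maps;
}

// Map Timer: benchmark each candidate (a fake 1/cpus model stands in for
// running and timing the real task on that mapping).
void timeMaps(std::vector<Candidate>& maps) {
    for (Candidate& m : maps)
        m.measuredTime = 1.0 / m.cpus;
}

// Map Selector: keep, per stage, the fastest candidate that fits a naive
// even split of the CPU budget. (The real selector searches the whole
// system graph; this greedy pass only marks where that search plugs in.)
std::vector<Candidate> selectMaps(const std::vector<Candidate>& maps,
                                  int stages, int cpuBudget) {
    std::vector<Candidate> chosen;
    for (int s = 0; s < stages; ++s)
        chosen.push_back({s, 0, 1e30}); // sentinel: nothing chosen yet
    int perStage = cpuBudget / stages;
    for (const Candidate& m : maps)
        if (m.cpus <= perStage && m.measuredTime < chosen[m.stage].measuredTime)
            chosen[m.stage] = m;
    return chosen;
}

int main() {
    const int stages = 4, cpuBudget = 8;
    std::vector<Candidate> maps = generateMaps(stages, cpuBudget);
    timeMaps(maps);
    for (const Candidate& m : selectMaps(maps, stages, cpuBudget))
        std::printf("stage %d -> %d CPU(s), %.3f s\n", m.stage, m.cpus,
                    m.measuredTime);
    return 0;
}
```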

20 Outline
Introduction
Design
Demonstration
Results
–Simulated/Predicted/Measured Optimal Mappings
–Validation and Verification
Summary

21 Optimal Throughput
[Table: measured task times for Input, Low Pass Filter, Beamform, and Matched Filter on 1-4 CPUs, with conduit times between stages; the best paths give bottlenecks of 30 msec (1.6 MHz BW) and 15 msec (3.2 MHz BW).]
–Vary the number of processors used on each stage
–Time each computation stage and communication conduit
–Find the path with the minimum bottleneck

22 S3P Timings (4 CPU max)
[Chart: graphical depiction of task timings (Input, Low Pass Filter, Beamform, Matched Filter) for 1-4 CPUs; wider is better.]

23 S3P Timings (12 CPU max)
[Chart: task timings (Input, Low Pass Filter, Beamform, Matched Filter) for 2-12 CPUs; wider is better.]
–The large amount of data requires an algorithm to find the best path

24 Predicted and Achieved Latency (4-8 CPU max)
[Plots: latency (sec) vs. maximum number of processors for small (48x4K) and large (48x128K) problem sizes.]
–Find the path that produces minimum latency for a given number of processors
–Excellent agreement between S3P-predicted and achieved latencies

25 Predicted and Achieved Throughput (4-8 CPU max)
[Plots: throughput (pulses/sec) vs. maximum number of processors for small (48x4K) and large (48x128K) problem sizes.]
–Find the path that produces maximum throughput for a given number of processors
–Excellent agreement between S3P-predicted and achieved throughput

26 SMP Results (16 CPU max)
[Plot: throughput (pulses/sec) vs. maximum number of processors for the large (48x128K) problem size.]
–SMP overstresses Linux real-time capabilities
–Poor overall system performance
–Divergence between predicted and measured results

27 Simulated (128 CPU max)
[Plots: throughput (pulses/sec) and latency (sec) vs. maximum number of processors for the small (48x4K) problem size.]
–The simulator allows exploration of larger systems

28 Reducing the Search Space: Algorithm Comparison
[Plot: number of timings required vs. maximum number of processors for the compared algorithms.]
–Graph algorithms provide baseline performance
–Hill climbing performance varies as a function of initialization and neighborhood definition
–The preprocessor outperforms all other algorithms
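The slide only names the competing search strategies; as a rough illustration of one of them, here is a first-improvement hill climb over per-stage CPU counts with an invented cost model and neighborhood (move one CPU between stages), which also counts how many timings it needed:

```cpp
// Rough illustration of hill climbing over per-stage processor allocations,
// one of the search strategies named above. The cost model, initialization,
// and neighborhood are invented; the real S3P timer would be called where
// bottleneck() is called here.
#include <algorithm>
#include <cstdio>
#include <vector>

// Placeholder cost: stage i takes work[i]/cpus[i] time and the system
// bottleneck (inverse throughput) is the slowest stage.
double bottleneck(const std::vector<int>& cpus) {
    static const double work[] = {1.0, 4.0, 2.0, 3.0};
    double worst = 0.0;
    for (std::size_t i = 0; i < cpus.size(); ++i)
        worst = std::max(worst, work[i] / cpus[i]);
    return worst;
}

int main() {
    std::vector<int> cpus = {3, 3, 3, 3}; // initial allocation (12 CPUs total)
    double best = bottleneck(cpus);
    int timings = 1; // count how many mappings had to be timed

    // Neighborhood: move one CPU from stage i to stage j. Step to an
    // improving neighbor until none exists (a local optimum).
    bool improved = true;
    while (improved) {
        improved = false;
        for (std::size_t i = 0; i < cpus.size(); ++i) {
            for (std::size_t j = 0; j < cpus.size(); ++j) {
                if (i == j || cpus[i] <= 1) continue;
                std::vector<int> next = cpus;
                --next[i];
                ++next[j];
                double t = bottleneck(next);
                ++timings;
                if (t < best) { best = t; cpus = next; improved = true; }
            }
        }
    }

    std::printf("allocation:");
    for (int c : cpus) std::printf(" %d", c);
    std::printf("  bottleneck=%.3f  timings=%d\n", best, timings);
    return 0;
}
```

The timing count is the quantity on the slide's vertical axis; changing the initial allocation or the neighborhood changes both that count and the quality of the local optimum, which is the variability the slide notes.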

29 Future Work
Program area
–Determine how to incorporate global optimization into other middleware efforts (e.g. PCA, HPEC-SI, …)
Hardware area
–Scale and demonstrate on a larger/real-time system (HPCMO Mercury system at WPAFB); expect even better results than on the Linux cluster
–Apply to parallel hardware (RAW)
Algorithm area
–Explore ways of reducing the search space
–Provide solution "families" via sensitivity analysis

30 Outline
Introduction
Design
Demonstration
Results
Summary

31 Summary
System-level constraints (latency, throughput, hardware size, …) necessitate system-level optimization
Application requirements for system-level optimization are:
–Decomposable into components (input, filtering, output, …)
–Mappable to different configurations (# processors, # links, …)
–Measurable resource usage (time, memory, …)
S3P demonstrates that global optimization is feasible separately from the application

32 Acknowledgements
Matteo Frigo (MIT/LCS & Vanu, Inc.)
Charles Leiserson (MIT/LCS)
Adam Wierman (CMU)

