Mr. Henry Hoffmann, Dr. Jeremy Kepner, Mr. Robert Bond


1 S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures
Mr. Henry Hoffmann, Dr. Jeremy Kepner, Mr. Robert Bond
MIT Lincoln Laboratory
27 September 2001
HPEC Workshop, Lexington, MA
This work is sponsored by the United States Air Force under Contract F C. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.

2 Acknowledgements
Matteo Frigo (MIT/LCS & Vanu, Inc.)
Charles Leiserson (MIT/LCS)
Adam Wierman (CMU)

3 Outline Introduction Design Demonstration Results Summary
Problem Statement
S3P Program
The Introduction discusses the wide range of embedded parallel processing systems, the power and weight constraints on embedded systems, and identifies the subset of these systems that is of interest in this project. It also outlines the factors being considered in comparing optoelectronic interconnects to electronic interconnects, and identifies the technical contributions this project will make to the overall VLSI Photonics program.

4 Example: Optimum System Latency
Simple two-component system
Local optimum fails to satisfy global constraints
Need a system view to find the global optimum
Beamform latency = 2/N; Filter latency = 1/N (N = hardware units)
Constraints: Latency < 8, Hardware < 32
The primary technical challenge in writing parallel software is how to assign different parts of the algorithm to processors in the parallel computing hardware. This process is referred to as "mapping" and is analogous to the "place and route" problem encountered by hardware designers. Handling of faults brings an additional level of complexity to this problem because mapping will have to be re-done dynamically while the program is running. PCAs add their own level of complexity, since the underlying hardware can also be changed, and so hardware morphing capabilities must be considered as another set of variables (with constraints) in the optimization/design space. The example above shows that for a limited number of resources (i.e. any practical application), optimizing two (or more) components individually will not necessarily lead to an optimal system-level solution. The job of determining the optimal system-level solution today falls first to the system architect (at design time) and then to the application builders/implementers. The middleware provides efficient solutions at the component level, but the implementer must handle the system-level issues explicitly. This is often an extremely complex task. S3P addresses this middleware limitation by providing the capability to automatically generate, time, and combine in an optimal manner a set of candidate maps for each component, and thereby generate a system solution that is globally efficient. The extension of S3P to handle not only parallel systems but also PCAs would be the goal of the proposed S3P-PCA project.
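The slide's two-component example can be checked numerically. Below is a minimal sketch assuming the slide's component latencies (beamform 2/N, filter 1/N) and a shared budget of 32 hardware units; the function names and exhaustive search are illustrative, not S3P's method:

```python
def system_latency(n_beam, n_filter):
    # Component latencies from the slide: beamform = 2/N, filter = 1/N.
    return 2.0 / n_beam + 1.0 / n_filter

def best_allocation(max_hw=32):
    # Exhaustive search for the global optimum: minimize system latency
    # subject to the shared hardware budget n_beam + n_filter <= max_hw.
    best, best_lat = None, float("inf")
    for n_beam in range(1, max_hw):
        for n_filter in range(1, max_hw - n_beam + 1):
            lat = system_latency(n_beam, n_filter)
            if lat < best_lat:
                best, best_lat = (n_beam, n_filter), lat
    return best, best_lat
```

Note that the global optimum is an uneven split: with 32 units the search returns 19 beamform and 13 filter units, beating the naive even 16/16 split, which is exactly the slide's point that locally optimal component choices need not be globally optimal.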

5 System Optimization Challenge
Signal Processing Application: Beamform XOUT = w*XIN; Filter XOUT = FIR(XIN); Detect XOUT = |XIN| > c
Optimal Resource Allocation (latency, throughput, memory, bandwidth, ...)
Compute Fabric (cluster, FPGA, SOC, ...)
Optimizing to system constraints requires two-way component/system knowledge exchange
Need a framework to mediate the exchange and perform system-level optimization
Current S3P research has been restricted to parallel signal processing codes on parallel/distributed processors, implemented (for convenience to the research team) with Lincoln's Parallel Vector Library (PVL). S3P employs a model of the underlying architecture and understands that legal mappings of the application to the model are legal on the actual machine. S3P then uses the model, coupled with actual system timings, to determine, via graph analysis and dynamic programming techniques, the "best" solution (best = lowest latency and/or widest bottleneck). By replacing the simple parallel-machine model used in S3P with a suitable PCA model, S3P can automate exploration of the PCA design space for a given PCA system, thereby facilitating hardware morphing, resource allocation, and system-level application mapping.

6 S3P Lincoln Internal R&D Program
Parallel Signal Processing: Kepner/Hoffmann (Lincoln)
Goal: applications that self-optimize to any hardware
Combine LL system expertise and the LCS FFTW approach
Self-Optimizing Software: Leiserson/Frigo (MIT LCS)
S3P Framework: algorithm stages (1..M) x processor mappings (1..N); time & verify; select best mappings
S3P brings the self-optimizing (FFTW) approach to parallel signal processing systems
Framework exploits a graph theory abstraction
Broadly applicable to system optimization problems
Defines clear component and system requirements
S3P combines parallel signal processing technology developed at Lincoln with self-optimizing software (ideas from FFTW) developed at MIT LCS. Ultimately, S3P will combine optimum node-level codes with optimum system-level maps to provide a robust mapping/optimization tool for portable, parallel signal processing applications. S3P exploits the architecturally neutral remapping capabilities of PVL to gather performance data for candidate component mappings on target architectures (performance estimates can also be used). Then, within a graph-theoretic framework, it uses both dynamic programming and iterative search algorithms to determine system-level mapping solutions for various machine sizes and system constraints (latency and throughput). The machine model used has the potential to be extended to PCA architectures, so that S3P can be used to provide globally optimal, system-level solutions for PCA applications.

7 Outline Introduction Design Demonstration Results Summary Requirements
Graph Theory

8 System Requirements Decomposable into Tasks (comp) and Conduits (comm)
Beamform XOUT = w*XIN; Filter XOUT = FIR(XIN); Detect XOUT = |XIN| > c
Mappable to different sets of hardware
Measurable resource usage of each mapping
Each compute stage can be mapped to different sets of hardware and timed

9 System Graph
Node is a unique mapping of a task
Edge is a conduit between a pair of task mappings
Beamform → Filter → Detect
The System Graph can store the hardware resource usage of every possible Task and Conduit
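This node-and-edge structure can be written down directly. A minimal sketch with illustrative task names and two candidate mappings per task; nodes are (task, mapping) pairs and edges connect mappings of adjacent tasks:

```python
tasks = ["beamform", "filter", "detect"]
mappings = {t: ["1cpu", "2cpu"] for t in tasks}

# Node = a unique mapping of a task.
nodes = [(t, m) for t in tasks for m in mappings[t]]

# Edge = a conduit between a pair of mappings of adjacent tasks.
edges = [((a, ma), (b, mb))
         for a, b in zip(tasks, tasks[1:])
         for ma in mappings[a]
         for mb in mappings[b]]
```

With 3 tasks and 2 mappings each this yields 6 nodes and 8 conduit edges; per-node and per-edge resource usage can then be stored in dictionaries keyed by these tuples.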

10 Path = System Mapping
Beamform → Filter → Detect
Each path is a complete system mapping
The "best" path is the optimal system mapping
The graph construct is very general and widely used for optimization problems
There are many efficient techniques for choosing the "best" path (under constraints), such as Dynamic Programming

11 Example: Maximize Throughput
Node stores the task time for each mapping
Edge stores the conduit time for a given pair of mappings
(Figure: example node and edge times for Beamform, Filter, and Detect mappings using increasing amounts of hardware)
Goal: maximize throughput and minimize hardware
Choose the path with the smallest bottleneck that satisfies the hardware constraint
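The bottleneck search on this slide can be sketched as follows; the mapping labels, timing dictionaries, and brute-force enumeration are illustrative (S3P uses dynamic programming rather than full enumeration):

```python
import itertools

def min_bottleneck(stages, mappings, node_time, edge_time, hw, max_hw):
    # Throughput is limited by the slowest stage or conduit, so the best
    # path minimizes the largest time (the bottleneck) along the path,
    # subject to the total hardware budget.
    best = (float("inf"), None)
    for choice in itertools.product(*(mappings[s] for s in stages)):
        if sum(hw[(s, m)] for s, m in zip(stages, choice)) > max_hw:
            continue
        times = [node_time[(s, m)] for s, m in zip(stages, choice)]
        times += [edge_time[(stages[i], choice[i], stages[i + 1], choice[i + 1])]
                  for i in range(len(stages) - 1)]
        bottleneck = max(times)
        if bottleneck < best[0]:
            best = (bottleneck, choice)
    return best
```

For a two-stage pipeline with a budget of 3 CPUs, the search correctly trades hardware between stages: it gives the extra CPU to whichever stage would otherwise be the bottleneck.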

12 Path Finding Algorithms
The graph construct is very general
Widely used for optimization problems
Many efficient techniques for choosing the "best" path (under constraints), such as Dijkstra's Algorithm and Dynamic Programming

Dijkstra's Algorithm:
    initialize graph G and source vertex s
    store all vertices of G in a minimum priority queue Q
    while Q is not empty:
        u = pop[Q]
        for each vertex v adjacent to u:
            w = u.totalPathWeight() + weight of edge <u,v> + v.weight()
            if v.totalPathWeight() > w:
                v.totalPathWeight() = w
                v.predecessor() = u

Dynamic Programming:
    N = total hardware units
    M = number of tasks
    P_j = number of mappings for task j
    t = M
    pathTable[M][N] = all infinite-weight paths
    for j in 1..M:
        for k in 1..P_j:
            for i in j+1..N-t+1:
                if i - size[k] >= j:
                    if j > 1:
                        w = weight[pathTable[j-1][i-size[k]]]
                            + weight[k]
                            + weight[edge[last[pathTable[j-1][i-size[k]]], k]]
                        p = addVertex[pathTable[j-1][i-size[k]], k]
                    else:
                        w = weight[k]
                        p = makePath[k]
                    if weight[pathTable[j][i]] > w:
                        pathTable[j][i] = p
        t = t - 1
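The Dijkstra pseudocode above can be made runnable. This sketch accumulates both edge weights and node weights along the path, matching the node-time/conduit-time model of the system graph; the adjacency-list representation is an illustrative choice:

```python
import heapq

def dijkstra(nodes, adj, node_weight, source):
    # adj: {u: [(v, edge_weight), ...]}.  Path cost = sum of node weights
    # plus edge weights, as in the slide's pseudocode (w = u.totalPathWeight()
    # + edge weight + v.weight()).  Weights must be non-negative.
    dist = {n: float("inf") for n in nodes}
    dist[source] = node_weight[source]
    pq = [(dist[source], source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue  # stale priority-queue entry
        for v, ew in adj.get(u, []):
            w = d + ew + node_weight[v]
            if w < dist[v]:
                dist[v] = w
                heapq.heappush(pq, (w, v))
    return dist
```

Using a heap-based priority queue with lazy deletion keeps the implementation short while preserving the O((V + E) log V) behavior of the textbook algorithm.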

13 S3P Inputs and Outputs
Application + Algorithm Information + System Constraints + Hardware Information → S3P Framework → "Best" System Mapping (some inputs are required, others optional)
Can flexibly add information about the application algorithm and the system hardware

14 Outline Introduction Design Demonstration Results Summary Application
Middleware
Hardware
S3P

15 S3P Demonstration Testbed
Multi-Stage Application: Input → Low Pass Filter → Beamform → Matched Filter
Middleware (PVL): Map, Task, Conduit
S3P Engine
Hardware (Workstation Cluster)

16 Multi-Stage Application
Input: XIN
Low Pass Filter: XIN, W1, FIR1 → W2, FIR2 → XOUT
Beamform: XIN, W3, mult → XOUT
Matched Filter: XIN, W4, FFT, IFFT → XOUT
Features:
"Generic" radar/sonar signal processing chain
Utilizes key kernels (FIR, matrix multiply, FFT, and corner turn)
Scalable to any problem size (fully parameterized algorithm)
Self-validates (built-in target generator)
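One of the key kernels named above, the FIR filter, reduces to a short convolution loop. A pure-Python sketch for illustration only; a real implementation would call an optimized vendor or middleware kernel:

```python
def fir(x, taps):
    # Direct-form FIR filter: y[n] = sum_k taps[k] * x[n-k],
    # with zero padding at the left boundary.
    return [sum(taps[k] * x[n - k]
                for k in range(len(taps)) if n - k >= 0)
            for n in range(len(x))]
```

For example, a two-tap moving-sum filter [1, 1] applied to [1, 2, 3] yields [1, 3, 5].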

17 Parallel Vector Library (PVL)
Class | Parallelism | Description
Matrix/Vector | Data | Used to perform matrix/vector algebra on data spanning multiple processors
Computation | Data & Task | Performs signal/image processing functions on matrices/vectors (e.g. FFT, FIR, QR)
Task | Task & Pipeline | Supports algorithm decomposition (i.e. the boxes in a signal flow diagram)
Conduit | Task & Pipeline | Supports data movement between tasks (i.e. the arrows in a signal flow diagram)
Map | Data, Task & Pipeline | Specifies how Tasks, Matrices/Vectors, and Computations are distributed on processors
Grid | — | Organizes processors into a 2D layout
This chart illustrates the core objects in the PVL library. The key idea in PVL is that all objects can be mapped, which allows the software to be independent of the number of processors used. Simple mappable components support data, task, and pipeline parallelism.

18 Hardware Platform
Network of 8 Linux workstations: dual 800 MHz Pentium III processors
Communication: Gigabit Ethernet, 8-port switch, isolated network
Software: Linux kernel, GNU C++ compiler, MPICH communication library over TCP/IP
Advantages: good software tools; widely available; inexpensive (high Mflops/$); excellent rapid-prototyping platform
Disadvantages: non-real-time OS; non-real-time messaging; slower interconnect; difficult to model; SMP behavior erratic

19 S3P Engine
Application Program + Algorithm Information + System Constraints + Hardware Information → S3P Engine → "Best" System Mapping
The Map Generator constructs the system graph for all candidate mappings
The Map Timer times each node and edge of the system graph
The Map Selector searches the system graph for the optimal set of maps

20 Outline Introduction Design Demonstration Results Summary
Simulated/Predicted/Measured
Optimal Mappings
Validation and Verification

21 Optimal Throughput
Vary the number of processors used on each stage
Time each computation stage and communication conduit
Find the path with the minimum bottleneck
(Table: measured stage and conduit times for Input, Low Pass Filter, Beamform, and Matched Filter on 1-4 CPUs)
Best: 30 msec (1.6 MHz BW); with more hardware: 15 msec (3.2 MHz BW)

22 S3P Timings (4 CPU max)
Graphical depiction of timings (wider is better) for the Input, Low Pass Filter, Beamform, and Matched Filter tasks on 1-4 CPUs

23 S3P Timings (12 CPU max; wider is better)
The large amount of timing data requires an algorithm to find the best path (tasks: Input, Low Pass Filter, Beamform, Matched Filter)

24 Predicted and Achieved Latency (4-8 CPU max)
(Plots: latency (sec) vs. maximum number of processors for a large (48x128K) and a small (48x4K) problem size)
Find the path that produces minimum latency for a given number of processors
Excellent agreement between S3P predicted and achieved latencies

25 Predicted and Achieved Throughput (4-8 CPU max)
(Plots: throughput (pulses/sec) vs. maximum number of processors for a large (48x128K) and a small (48x4K) problem size)
Find the path that produces maximum throughput for a given number of processors
Excellent agreement between S3P predicted and achieved throughput

26 SMP Results (16 CPU max)
(Plot: throughput (pulses/sec) vs. maximum number of processors for the large (48x128K) problem size)
SMP overstresses Linux real-time capabilities: poor overall system performance, and divergence between predicted and measured results

27 Simulated (128 CPU max)
(Plots: throughput (pulses/sec) and latency (sec) vs. maximum number of processors for the small (48x4K) problem size)
The simulator allows exploration of larger systems

28 Reducing the Search Space: Algorithm Comparison
Graph algorithms provide baseline performance
Hill-climbing performance varies as a function of initialization and neighborhood definition
The preprocessor outperforms all other algorithms
(Plot: number of timings required vs. maximum number of processors)

29 Future Work
Program area: determine how to enable global optimization in other middleware efforts (e.g. PCA, HPEC-SI, ...)
Hardware area: scale and demonstrate on a larger/real-time system (HPCMO Mercury system at WPAFB; expect even better results than on the Linux cluster); apply to parallel hardware (MIT/LCS RAW)
Algorithm area: exploit ways of reducing the search space; provide solution "families" via sensitivity analysis

30 Outline Introduction Design Demonstration Results Summary

31 Summary
System-level constraints (latency, throughput, hardware size, ...) necessitate system-level optimization
Application requirements for system-level optimization:
Decomposable into components (input, filtering, output, ...)
Mappable to different configurations (# processors, # links, ...)
Measurable resource usage (time, memory, ...)
S3P demonstrates that global optimization is feasible separate from the application

