Automatic Parallelization of Simulation Code from Equation-Based Simulation Languages
Peter Aronsson, Industrial PhD student, PELAB, SaS, IDA, Linköping University, Sweden
Based on the Licentiate presentation and the CPC'03 presentation

Outline
– Introduction
– Task Graphs
– Related Work on Scheduling & Clustering
– Parallelization Tool
– Contributions
– Results
– Conclusion & Future Work

Introduction
– Modelica – an object-oriented, equation-based modeling language
– Modelica enables modeling and simulation of large and complex multi-domain systems
– Large need for parallel computation:
   – To decrease the execution time of simulations
   – To make large models possible to simulate at all
   – To meet hard real-time demands in hardware-in-the-loop simulations

Examples of large complex systems in Modelica

Modelica Example – DCMotor

Modelica example

model DCMotor
  import Modelica.Electrical.Analog.Basic.*;
  import Modelica.Electrical.Sources.StepVoltage;
  Resistor R1(R=10);
  Inductor I1(L=0.1);
  EMF emf(k=5.4);
  Ground ground;
  StepVoltage step(V=10);
  Modelica.Mechanics.Rotational.Inertia load(J=2.25);
equation
  connect(R1.n, I1.p);
  connect(I1.n, emf.p);
  connect(emf.n, ground.p);
  connect(emf.flange_b, load.flange_a);
  connect(step.p, R1.p);
  connect(step.n, ground.p);
end DCMotor;

Example – Flat set of equations

R1.v = -R1.n.v + R1.p.v
0 = R1.n.i + R1.p.i
R1.i = R1.p.i
R1.i*R1.R = R1.v
I1.v = -I1.n.v + I1.p.v
0 = I1.n.i + I1.p.i
I1.i = I1.p.i
I1.L*I1.der(i) = I1.v
emf.v = -emf.n.v + emf.p.v
0 = emf.n.i + emf.p.i
emf.i = emf.p.i
emf.w = emf.flange_b.der(phi)
emf.k*emf.w = emf.v
emf.flange_b.tau = -emf.i*emf.k
ground.p.v = 0
step.v = -step.n.v + step.p.v
0 = step.n.i + step.p.i
step.i = step.p.i
step.signalSource.outPort.signal[1] = (if time < step.signalSource.p_startTime[1] then 0 else step.signalSource.p_height[1]) + step.signalSource.p_offset[1]
step.v = step.signalSource.outPort.signal[1]
load.flange_a.phi = load.phi
load.flange_b.phi = load.phi
load.w = load.der(phi)
load.a = load.der(w)
load.a*load.J = load.flange_a.tau + load.flange_b.tau
R1.n.v = I1.p.v
I1.p.i + R1.n.i = 0
I1.n.v = emf.p.v
emf.p.i + I1.n.i = 0
emf.n.v = step.n.v
step.n.v = ground.p.v
emf.n.i + ground.p.i + step.n.i = 0
emf.flange_b.phi = load.flange_a.phi
emf.flange_b.tau + load.flange_a.tau = 0
step.p.v = R1.p.v
R1.p.i + step.p.i = 0
load.flange_b.tau = 0
step.signalSource.y = step.signalSource.outPort.signal

Plot of simulation result: load.flange_a.tau and load.w over time.

Task Graphs
– Directed Acyclic Graph (DAG): G = (V, E, τ, c)
   – V – set of nodes, representing computational tasks
   – E – set of edges, representing communication of data between tasks
   – τ(v) – execution cost for node v
   – c(i,j) – communication cost for edge (i,j)
– Referred to as the delay model (macro dataflow model)
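To make the delay model concrete, here is a minimal C sketch of such a task graph; the type and field names (TaskGraph, tau, edges) are my own illustration, not taken from the tool.

#include <stdio.h>

#define MAX_TASKS 64
#define MAX_EDGES 256

/* Delay model: each node has an execution cost tau(v),
 * each edge a communication cost c(i,j). */
typedef struct { int src, dst; double cost; } Edge;

typedef struct {
    int    n_tasks;
    double tau[MAX_TASKS];     /* execution cost per node  */
    int    n_edges;
    Edge   edges[MAX_EDGES];   /* precedence + comm. costs */
} TaskGraph;

static void add_edge(TaskGraph *g, int src, int dst, double cost) {
    g->edges[g->n_edges++] = (Edge){ src, dst, cost };
}

int main(void) {
    /* A small task graph: four tasks, three edges (made-up costs). */
    TaskGraph g = { .n_tasks = 4, .tau = { 2, 3, 3, 1 } };
    add_edge(&g, 0, 1, 10);   /* data sent from task 0 to task 1 */
    add_edge(&g, 0, 2, 10);
    add_edge(&g, 1, 3,  5);
    printf("tasks=%d edges=%d\n", g.n_tasks, g.n_edges);
    return 0;
}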

Small Task Graph Example

Task Scheduling Algorithms
– Multiprocessor scheduling problem:
   – For each task, assign a starting time and a processor (P1, ..., PN)
   – Goal: minimize execution time, given precedence constraints, execution costs, and communication costs
– Algorithms in the literature:
   – List scheduling approaches (ERT, FLB)
   – Critical-path scheduling approaches (TDS, MCP)
– Categories: fixed number of processors, fixed c and/or τ, ...
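As an illustration of the list-scheduling family mentioned above, the following toy C program takes tasks in a precedence-valid order and places each on the processor that lets it start earliest, counting communication delay from parents scheduled on other processors. This is a generic sketch of the idea, not the exact ERT or FLB algorithm, and all data in it is made up.

#include <stdio.h>

#define N 6      /* tasks */
#define P 2      /* processors */

double tau[N] = { 2, 3, 3, 2, 1, 2 };   /* execution costs        */
double c[N][N];                          /* communication costs    */
int    pred[N][N], npred[N];             /* parent lists per task  */

double finish[N];    /* finish time of each scheduled task */
int    proc[N];      /* processor assigned to each task    */
double ready[P];     /* time each processor becomes free   */

int main(void) {
    c[0][1] = c[0][2] = 10; c[1][3] = c[2][3] = 5;
    pred[1][npred[1]++] = 0; pred[2][npred[2]++] = 0;
    pred[3][npred[3]++] = 1; pred[3][npred[3]++] = 2;

    for (int v = 0; v < N; v++) {        /* indices are topologically ordered */
        int best_p = 0; double best_t = 1e30;
        for (int p = 0; p < P; p++) {
            double t = ready[p];         /* earliest free slot on p */
            for (int k = 0; k < npred[v]; k++) {
                int u = pred[v][k];
                /* remote parents add communication delay */
                double arrive = finish[u] + (proc[u] == p ? 0 : c[u][v]);
                if (arrive > t) t = arrive;
            }
            if (t < best_t) { best_t = t; best_p = p; }
        }
        proc[v] = best_p;
        finish[v] = best_t + tau[v];
        ready[best_p] = finish[v];
        printf("task %d -> P%d, start %.1f\n", v, best_p, best_t);
    }
    return 0;
}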

Granularity
– Granularity g = min(τ(v)) / max(c(i,j))
– Affects the scheduling result
   – E.g. TDS works best for high values of g, i.e. low communication cost
– Solutions:
   – Clustering algorithms – idea: build clusters of nodes, where nodes in the same cluster are executed on the same processor
   – Merging algorithms – merge tasks to increase computational cost
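The granularity measure itself is straightforward to compute; a small sketch, assuming execution and communication costs are stored in flat arrays (names hypothetical):

#include <stdio.h>

/* g = min(tau(v)) / max(c(i,j)) over the whole task graph */
double granularity(const double *tau, int n_tasks,
                   const double *comm, int n_edges) {
    double min_tau = tau[0], max_c = comm[0];
    for (int i = 1; i < n_tasks; i++)
        if (tau[i] < min_tau) min_tau = tau[i];
    for (int j = 1; j < n_edges; j++)
        if (comm[j] > max_c) max_c = comm[j];
    return min_tau / max_c;   /* g >> 1: coarse grained; g << 1: fine grained */
}

int main(void) {
    double tau[]  = { 2, 3, 3, 1 };
    double comm[] = { 10, 10, 5 };
    printf("g = %.3f\n", granularity(tau, 4, comm, 3));
    return 0;
}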

Task Clustering/Merging Algorithms
– Task clustering problem:
   – Build clusters of nodes such that the parallel time decreases
   – PT(n) = tlevel(n) + blevel(n)
   – By zeroing edges, i.e. putting several nodes into the same cluster => zero communication cost
   – Literature: Sarkar's internalization algorithm, Yang's DSC algorithm
– Task merging problem:
   – Transform the task graph by merging nodes
   – Literature: e.g. the grain-packing algorithm
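Since PT(n) = tlevel(n) + blevel(n) drives both clustering and merging, a sketch of how the two levels can be computed may help. This assumes node numbering is already topological (every edge goes from a lower to a higher index); all names are illustrative.

#include <stdio.h>

#define N 4
double tau[N] = { 2, 3, 3, 1 };
double c[N][N];                 /* 0 where there is no edge */
int    adj[N][N];               /* adjacency matrix         */

double tlevel[N], blevel[N];

int main(void) {
    adj[0][1] = adj[0][2] = adj[1][3] = adj[2][3] = 1;
    c[0][1] = c[0][2] = 10; c[1][3] = c[2][3] = 5;

    /* tlevel: longest path arriving at v (excluding tau(v)) */
    for (int v = 0; v < N; v++)
        for (int u = 0; u < v; u++)
            if (adj[u][v]) {
                double t = tlevel[u] + tau[u] + c[u][v];
                if (t > tlevel[v]) tlevel[v] = t;
            }
    /* blevel: longest path leaving v (including tau(v)) */
    for (int v = N - 1; v >= 0; v--) {
        blevel[v] = tau[v];
        for (int w = v + 1; w < N; w++)
            if (adj[v][w]) {
                double b = tau[v] + c[v][w] + blevel[w];
                if (b > blevel[v]) blevel[v] = b;
            }
    }
    double pt = 0;  /* parallel time = max over n of tlevel(n)+blevel(n) */
    for (int v = 0; v < N; v++)
        if (tlevel[v] + blevel[v] > pt) pt = tlevel[v] + blevel[v];
    printf("PT = %.1f\n", pt);
    return 0;
}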

Clustering vs. Merging
(Diagram: a clustered task graph is transformed by merging, so that each cluster of nodes, e.g. {2,5} and {6}, becomes a single task in the merged task graph.)

DSC algorithm
1. Initially, put each node in a separate cluster.
2. Traverse the task graph, merging clusters as long as the parallel time does not increase.
– Low complexity: O((n + e) log n)
– Previously used by Andersson in ObjectMath (PELAB)
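The following toy program illustrates the edge-zeroing idea behind DSC in a greatly simplified form: traversing nodes in topological order, each node either stays in its own cluster or joins the parent cluster that minimizes its tlevel, so merging never increases the node's tlevel. The real DSC algorithm additionally serializes tasks inside a cluster and uses blevel-based priorities, so this is only a sketch of the core idea, not Yang's full algorithm.

#include <stdio.h>

#define N 4
double tau[N] = { 2, 3, 3, 1 };
double c[N][N];
int adj[N][N];
int cluster[N];
double tlevel[N];

int main(void) {
    adj[0][1] = adj[0][2] = adj[1][3] = adj[2][3] = 1;
    c[0][1] = c[0][2] = 10; c[1][3] = c[2][3] = 5;
    for (int v = 0; v < N; v++) cluster[v] = v;  /* 1. one cluster per node */

    for (int v = 1; v < N; v++) {                /* 2. topological traversal */
        double best = 1e30; int best_cl = cluster[v];
        /* candidates: stay alone (choice -1), or join a parent's cluster */
        for (int choice = -1; choice < v; choice++) {
            if (choice >= 0 && !adj[choice][v]) continue;
            int cl = (choice < 0) ? cluster[v] : cluster[choice];
            double t = 0;
            for (int u = 0; u < v; u++)
                if (adj[u][v]) {
                    /* intra-cluster edges are zeroed */
                    double comm = (cluster[u] == cl) ? 0 : c[u][v];
                    double a = tlevel[u] + tau[u] + comm;
                    if (a > t) t = a;
                }
            if (t < best) { best = t; best_cl = cl; }
        }
        cluster[v] = best_cl;   /* staying alone is a candidate, so tlevel never grows */
        tlevel[v] = best;
        printf("node %d -> cluster %d, tlevel %.1f\n", v, best_cl, tlevel[v]);
    }
    return 0;
}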

Modelica Compilation
Compilation pipeline: Modelica model (.mo) → [Modelica semantics] → flat Modelica (.mof) → equation system (DAE) → [optimization, rhs calculations] → C code, linked with a numerical solver.

Structure of the simulation code (pseudocode):

for (t = 0; t < stopTime; t += stepSize) {
  x_dot[t+1] = f(x_dot[t], x[t], t);   /* evaluate right-hand side  */
  x[t+1]     = ODESolver(x_dot[t+1]);  /* advance the state one step */
}

Optimizations on equations
– Simplification of equations, e.g. a=b, b=c: eliminate b
– BLT transformation, i.e. topological sorting into strongly connected components (BLT = Block Lower Triangular form)
– Index reduction; the index is how many times an equation needs to be differentiated in order to solve the equation system
– Mixed Mode / Inline Integration: methods of optimizing equations by reducing the size of equation systems
(Diagram: block lower triangular incidence matrix over variables a–e.)
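BLT sorting amounts to finding the strongly connected components of the equation dependency graph and emitting them as blocks, where each SCC is one simultaneous subsystem of equations. A compact Tarjan-style sketch on a made-up dependency graph (note that Tarjan's algorithm emits the blocks in reverse topological order):

#include <stdio.h>

#define N 5
int adj[N][N];
int idx[N], low[N], on_stack[N], comp[N];
int stack[N], sp = 0, counter = 1, n_comp = 0;

void strongconnect(int v) {
    idx[v] = low[v] = counter++;
    stack[sp++] = v; on_stack[v] = 1;
    for (int w = 0; w < N; w++) {
        if (!adj[v][w]) continue;
        if (!idx[w]) { strongconnect(w); if (low[w] < low[v]) low[v] = low[w]; }
        else if (on_stack[w] && idx[w] < low[v]) low[v] = idx[w];
    }
    if (low[v] == idx[v]) {               /* v is the root of an SCC   */
        printf("block %d:", n_comp);      /* one BLT block = one SCC   */
        int w;
        do { w = stack[--sp]; on_stack[w] = 0; comp[w] = n_comp;
             printf(" eq%d", w); } while (w != v);
        printf("\n"); n_comp++;
    }
}

int main(void) {
    /* eq1 and eq2 depend on each other: a 2x2 algebraic block */
    adj[0][1] = 1; adj[1][2] = 1; adj[2][1] = 1; adj[2][3] = 1; adj[3][4] = 1;
    for (int v = 0; v < N; v++) if (!idx[v]) strongconnect(v);
    return 0;
}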

Generated C Code Content
– Assignment statements
– Arithmetic expressions (+, -, *, /), if-expressions
– Function calls:
   – Standard math functions: sin, cos, log
   – Modelica functions: user defined, side-effect free
   – External Modelica functions: in an external library, written in Fortran or C
   – Calls to functions solving subsystems of equations: linear or non-linear
– Example application: a robot simulation with a large amount of generated C code

Parallelization Tool Overview
(Toolchain diagram:)
Model.mo → Modelica Compiler → C code → C compiler + solver lib → sequential executable
Model.mo → Modelica Compiler → C code → Parallelizer → parallel C code → C compiler + solver lib + MPI lib → parallel executable

Parallelization Tool Internal Structure
(Pipeline diagram:)
Sequential C code → Parser → Task Graph Builder → Scheduler → Code Generator → parallel C code,
with a shared Symbol Table and debug & statistics output.

Task Graph Building
– First graph: nodes correspond to individual arithmetic operations, assignments, function calls, and variable definitions in the C code
– Second graph: clusters of tasks from the first task graph
(Example diagram: an expression graph over +, -, *, / and a call to foo is clustered into a few larger tasks such as {+,-,*} and {foo, /, -}.)

Investigated Scheduling Algorithms
– Parallelization tool:
   – TDS (Task Duplication Scheduling algorithm)
   – Pre-clustering method
   – Full Task Duplication method
– Experimental framework (Mathematica):
   – ERT
   – DSC
   – TDS
   – Full Task Duplication method
   – Task merging approaches (graph rewrite systems)

Method 1: Pre-Clustering Algorithm
– buildCluster(n: node, l: list of nodes, size: Integer)
– Adds n to a new cluster
– Repeatedly adds nodes until size(cluster) = size, in priority order:
   – Children of n
   – Children with in-degree one
   – Siblings of n
   – Parents of n
   – Arbitrary nodes
(A sketch of this growth strategy is given below.)
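A possible reading of buildCluster as C code, reconstructed from the bullet list above rather than from the tool's actual source; the priority order and data layout are assumptions.

#include <stdio.h>

#define N 8
int adj[N][N], indeg[N];
int in_cluster[N];

/* add v to the cluster unless it is full or v is already in it */
static int try_add(int v, int *members, int *count, int size) {
    if (*count >= size || in_cluster[v]) return 0;
    in_cluster[v] = 1; members[(*count)++] = v; return 1;
}

void build_cluster(int n, int size, int *members, int *count) {
    *count = 0;
    try_add(n, members, count, size);
    for (int v = 0; v < N; v++)                 /* 1. children of n */
        if (adj[n][v]) try_add(v, members, count, size);
    for (int i = 0; i < *count; i++)            /* 2. in-degree-1 children */
        for (int v = 0; v < N; v++)
            if (adj[members[i]][v] && indeg[v] == 1)
                try_add(v, members, count, size);
    for (int p = 0; p < N; p++)                 /* 3. siblings of n */
        if (adj[p][n])
            for (int v = 0; v < N; v++)
                if (adj[p][v]) try_add(v, members, count, size);
    for (int p = 0; p < N; p++)                 /* 4. parents of n */
        if (adj[p][n]) try_add(p, members, count, size);
    for (int v = 0; v < N; v++)                 /* 5. arbitrary nodes */
        try_add(v, members, count, size);
}

int main(void) {
    adj[0][2] = adj[1][2] = adj[2][3] = adj[2][4] = 1;
    for (int v = 0; v < N; v++)
        for (int u = 0; u < N; u++) indeg[v] += adj[u][v];
    int members[N], count;
    build_cluster(2, 4, members, &count);
    printf("cluster:");
    for (int i = 0; i < count; i++) printf(" %d", members[i]);
    printf("\n");
    return 0;
}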

Managing Cycles
– When adding a node to a cluster, the resulting graph might have cycles
– The graph obtained by clustering a and b is cyclic, since {a,b} can be reached from c
– The resulting graph is not a DAG, so standard scheduling algorithms cannot be used
(Diagram: nodes a–e, where merging a and b creates a cycle through c.)

Pre-Clustering Results
– Did not produce speedup:
   – Introduced far too many dependencies in the resulting task graph
   – Sequentialized the schedule
– Conclusion: for fine-grained task graphs, such an algorithm needs task duplication to succeed

Method 2: Full Task Duplication
– For each node n with successors(n) = {}: put all pred(n) in one cluster, repeating for all nodes in the cluster
– Rationale: if the depth of the graph is limited, task duplication is kept at a reasonable level and cluster sizes stay reasonably small
– Works well when communication cost >> execution cost
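A minimal sketch of the FTD idea: every sink node gets its own cluster containing all of its transitive predecessors, so shared predecessors are duplicated across clusters and no inter-cluster communication remains. The graph and names are illustrative.

#include <stdio.h>

#define N 5
int adj[N][N];

/* mark v and, recursively, all of its predecessors */
void collect_preds(int v, int *in_cluster) {
    if (in_cluster[v]) return;
    in_cluster[v] = 1;
    for (int u = 0; u < N; u++)
        if (adj[u][v]) collect_preds(u, in_cluster);
}

int main(void) {
    /* two sinks (3 and 4) sharing predecessor 0 */
    adj[0][1] = adj[0][2] = adj[1][3] = adj[2][4] = 1;
    for (int v = 0; v < N; v++) {
        int is_sink = 1;
        for (int w = 0; w < N; w++) if (adj[v][w]) is_sink = 0;
        if (!is_sink) continue;
        int in_cluster[N] = { 0 };
        collect_preds(v, in_cluster);   /* node 0 is duplicated into both clusters */
        printf("cluster for sink %d:", v);
        for (int u = 0; u < N; u++) if (in_cluster[u]) printf(" %d", u);
        printf("\n");
    }
    return 0;
}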

Full Task Duplication (2)
Merging clusters:
1. Merge clusters with a load-balancing strategy, without increasing the maximum cluster size
2. Merge the clusters with the greatest number of common nodes
Repeat (2) until the required number of processors is met (see the sketch below).
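Step 2 can be sketched as repeatedly merging the pair of clusters sharing the most (duplicated) nodes until only as many clusters as processors remain; the bit-set representation below is my own choice, not the tool's.

#include <stdio.h>

#define MAXC 8
unsigned sets[MAXC] = { 0x0B, 0x15, 0x26 };  /* clusters as node bit sets */
int n_clusters = 3;

int popcount(unsigned x) { int c = 0; while (x) { c += x & 1; x >>= 1; } return c; }

int main(void) {
    int n_proc = 2;
    while (n_clusters > n_proc) {
        int bi = 0, bj = 1, best = -1;
        for (int i = 0; i < n_clusters; i++)        /* find max-overlap pair */
            for (int j = i + 1; j < n_clusters; j++) {
                int common = popcount(sets[i] & sets[j]);
                if (common > best) { best = common; bi = i; bj = j; }
            }
        sets[bi] |= sets[bj];                       /* merge j into i        */
        sets[bj] = sets[--n_clusters];              /* compact the array     */
    }
    for (int i = 0; i < n_clusters; i++)
        printf("cluster %d: 0x%02X\n", i, sets[i]);
    return 0;
}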

Full Task Duplication Results
– Computed measurements: execution cost of the largest cluster + communication cost
– Measured speedup: executed on a PC Linux cluster with an SCI network interface, using ScaMPI

Robot Example – Computed Speedup
(Plot: computed speedup with and without Mixed Mode / Inline Integration (MM/II).)

Thermofluid Pipe Executed on a PC Cluster
(Plot: speedup for the Pressurewavedemo model in the Thermofluid package, 50 discretization points.)

Thermofluid Pipe Executed on a PC Cluster
(Plot: speedup for the Pressurewavedemo model in the Thermofluid package, 100 discretization points.)

Task Merging Using GRS
– Idea: a set of simple rules to transform a task graph, increasing its granularity (and decreasing parallel time)
– Use top level (and bottom level) as the metric: Parallel Time = max tlevel + max blevel

Rule 1
– Merge a single child with its only parent.
– Motivation: the merge does not decrease the amount of parallelism in the task graph, and granularity can possibly increase.
(Diagram: parent p and child c merge into a single node p'. Rule 1 as code is sketched below.)
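Rule 1 expressed as a toy C program: fuse a node with its parent when the parent has exactly one child and the child exactly one parent; execution costs simply add up. The adjacency-matrix representation is my own choice for the sketch.

#include <stdio.h>

#define N 4
int adj[N][N];
double tau[N] = { 2, 3, 1, 1 };
int merged_into[N];   /* -1: alive, otherwise fused into that node */

int main(void) {
    adj[0][1] = adj[1][2] = adj[1][3] = 1;
    for (int v = 0; v < N; v++) merged_into[v] = -1;

    for (int c = 0; c < N; c++) {
        int parent = -1, nparents = 0;
        for (int p = 0; p < N; p++)
            if (adj[p][c]) { parent = p; nparents++; }
        if (nparents != 1) continue;        /* c must have a single parent */
        int nchildren = 0;
        for (int w = 0; w < N; w++) nchildren += adj[parent][w];
        if (nchildren != 1) continue;       /* p must have only child c    */
        /* fuse: c's outgoing edges become p's, costs add up */
        tau[parent] += tau[c];
        for (int w = 0; w < N; w++) { adj[parent][w] |= adj[c][w]; adj[c][w] = 0; }
        adj[parent][c] = 0;
        merged_into[c] = parent;
        printf("merged %d into %d, tau=%.1f\n", c, parent, tau[parent]);
    }
    return 0;
}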

Rule 2
– Merge all parents of a node together with the node itself.
– Motivation: if the top level does not increase by the merge, the resulting task will increase in size, potentially increasing granularity.
(Diagram: parents p1, ..., pn and child c merge into a single node c'.)

Rule 3
– Duplicate a parent and merge it into each child node.
– Motivation: as long as each child's tlevel does not increase, duplicating p into the children reduces the number of nodes and increases granularity.
(Diagram: parent p is duplicated into children c1, ..., cn, giving c1', ..., cn'.)

Rule 4
– Merge siblings into a single node as long as a parameterized maximum execution cost is not exceeded.
– Motivation: this rule is useful when several small predecessor nodes exist together with a larger predecessor node that prevents a complete merge.
– Does not guarantee a decrease of PT.
(Diagram: small parents p1, ..., pk merge into p', while the larger parents pk+1, ..., pn remain separate.)

Results – Example
– Task graph from Modelica simulation code
– Small example from the mechanical domain
– About 100 nodes built on the expression level, originating from 84 equations and variables

Result of Task Merging on the Example
(Diagram: merged task graph for bandwidth B = 1, latency L = 1.)

Result of Task Merging on the Example
(Diagrams: merged task graphs for B = 1, L = 10 and for B = 1, L = 100.)

Conclusions
– The pre-clustering approach did not work well for the fine-grained task graphs produced by our parallelization tool
– The FTD method works reasonably well for some examples
– However, in general there is a need for better scheduling/clustering algorithms for fine-grained task graphs

Conclusions (2)
– The simple delay model may not be enough; more advanced models require more complex scheduling and clustering algorithms
– Simulation code from equation-based models is hard to extract parallelism from
– New optimization methods on DAEs or ODEs are needed to increase parallelism

Conclusions – Task Merging Using GRS
– A task merging algorithm using GRS has been proposed:
   – Four rules with simple patterns => fast pattern matching
   – Can easily be integrated into existing scheduling tools
– Successfully merges tasks considering:
   – Bandwidth & latency
   – Task duplication
   – Merging criterion: decrease parallel time (PT) by decreasing tlevel
– Tested on examples from simulation code

Future Work
– Design and implement better scheduling and clustering algorithms:
   – Support for more advanced task graph models
   – Work better for high granularity values
– Try larger examples
– Test on different architectures:
   – Shared-memory machines
   – Dual-processor machines

Future Work (2)
– Heterogeneous multiprocessor systems, e.g. mixed DSP, RISC, and CISC processors
– Enhancing the Modelica language with data parallelism, e.g. parallel loops and vector operations
– Parallelizing e.g. combined PDE and ODE problems in Modelica
– Using e.g. ScaLAPACK for solving subsystems of linear equations – how to integrate it into the scheduling algorithms?