Slide 1: Parallelism and Distributed Applications
Daniel S. Katz
Director, Cyberinfrastructure and User Services, Center for Computation & Technology
Associate Research Professor, Electrical and Computer Engineering Department
Louisiana State University

Slide 2: Context
Scientific/Engineering applications
– Complex, multi-physics, multiple time scales, multiple spatial scales
Physics components
– Elements such as I/O, solvers, etc.
Computer Science components
– Parallelism across components
– Parallelism within components, particularly physics components
Goal: efficient application execution on both parallel and distributed platforms
Goal: simple, reusable programming

Slide 3: Types of Systems
A lot of levels/layers to be aware of:
– Individual computers: many layers of memory hierarchy; multi-core -> many-core CPUs
– Clusters: used to be reasonably tightly coupled computers (1 CPU per node) or SMPs (multiple CPUs per node)
– Grid elements: individual computers, clusters, networks, instruments, data stores, visualization systems, etc.

Slide 4: Types of Applications
Applications can be broken up into pieces (components)
– Size (granularity) and relationship of pieces is key
Fairly large pieces, no dependencies
– Parameter sweeps, Monte Carlo analysis, etc.
Fairly large pieces, some dependencies
– Multi-stage applications - PHOEBUS
– Workflow applications - Montage
– Data grid apps?
Large pieces, tight dependencies (coupling, components?)
– Distributed viz, coupled apps - Climate
Small pieces, no dependencies
Small pieces, some dependencies
– Dataflow?
Small pieces, tight dependencies
– MPI apps
Hybrids?

Slide 5: Parallelism within programs
Initial parallelism: bitwise/vector (SIMD)
– "Highly computational tasks often contain substantial amounts of concurrency. At LLL the majority of these programs use very large, two-dimensional arrays in a cyclic set of instructions. In many cases, all new array values could be computed simultaneously, rather than stepping through one position at a time. To date, vectorization has been the most effective scheme for exploiting this concurrency. However, pipelining and independent multiprocessing forms of concurrency are also available in these programs, but neither the hardware nor the software exist to make it workable." (James R. McGraw, Data Flow Computing: The VAL Language, MIT Computational Structures Group Memo 188, 1980)
– Westinghouse's Solomon introduced vector processing, early 1960s
– Continued in ILLIAC IV, ~1970s
– Goodyear MPP, 128x128 array of 1-bit processors, ~1980

Slide 6: Unhappy with your programming model?

Slide 7: Parallelism across programs
Co-operating Sequential Processes (CSP) - E. W. Dijkstra, The Structure of the "THE"-Multiprogramming System, 1968
– "We have given full recognition of the fact that in a single sequential process … only the time succession of the various states has a logical meaning, but not the actual speed with which the sequential process is performed. Therefore we have arranged the whole system as a society of sequential processes, progressing with undefined speed ratios. To each user program … corresponds a sequential process …"
– "This enabled us to design the whole system in terms of these abstract "sequential processes". Their harmonious co-operation is regulated by means of explicit mutual synchronization statements. … The fundamental consequence of this approach … is that the harmonious co-operation of a set of such sequential processes can be established by discrete reasoning; as a further consequence the whole harmonious society of co-operating sequential processes is independent of the actual number of processors available to carry out these processes, provided the processors available can switch from process to process."

Slide 8: Parallelism within programs (2)
MIMD
– Taxonomy from Flynn, 1972
– Dataflow parallelism
"The data flow concept incorporates these forms of concurrency in one basic graph-oriented system. Every computation is represented by a data flow graph. The nodes … represent operations, the directed arcs represent data paths." (McGraw, ibid)
"The ultimate goal of data flow software must be to help identify concurrency in algorithms and map as much as possible into the graphs." (McGraw, ibid)
– Transputer - 1984, programmed in occam
Uses CSP formalism, communication through named channels
– MPPs - mid 1980s
Explicit message passing (CSP)
– Other models: actors, Petri nets, …
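To ground the explicit message-passing (CSP-style) model mentioned above, here is a minimal example (added for illustration, not from the original slides): two MPI processes exchange a value with matched send and receive calls. Compile with an MPI wrapper such as mpicxx and run with at least two processes.

    // Minimal MPI example of explicit message passing (CSP style):
    // rank 0 sends a value to rank 1, which receives and prints it.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size < 2) {
            if (rank == 0) std::fprintf(stderr, "Run with at least 2 processes\n");
            MPI_Finalize();
            return 1;
        }

        if (rank == 0) {
            double value = 3.14159;   // data produced by process 0
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            double value = 0.0;
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received %f from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }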

Slide 9: PHOEBUS
[Figure: mesh partitioned on an MPP machine, coupled to a block system of equations]
This matrix problem is filled and solved by PHOEBUS
– The K submatrix is a sparse finite element matrix
– The Z submatrices are integral equation matrices
– The C submatrices are coupling matrices between the FE and IE equations
1996! - 3 executables, 2+ programming models, executables run sequentially
Credit: Katz, Cwik, Zuffada, Jamnejad

Slide 10: Cholesky Factorization
SuperMatrix work - Chan and van de Geijn, Univ. of Texas, in progress
Based on the FLAME library
Aimed at NUMA systems, OpenMP programming model
Initial realization: poor performance of LAPACK (w/ multithreaded BLAS) could be fixed by choosing a different variant
Credit: Ernie Chan

Slide 11: Cholesky Factorization
Blocked Cholesky can be represented as a DAG of tasks (Chol, Trsm, Syrk, Gemm), with one set of tasks per iteration
[Figure: task DAG across iterations 1, 2, 3, …]
Credit: Ernie Chan

Slide 12: Cholesky SuperMatrix
Execute DAG tasks in parallel, possibly "out-of-order"
– Similar in concept to Tomasulo's algorithm and instruction-level parallelism, applied to blocks of computation
Superscalar -> SuperMatrix
Credit: Ernie Chan
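The following is a minimal sketch, not the SuperMatrix code itself, of how a blocked right-looking Cholesky factorization decomposes into the Chol/Trsm/Syrk/Gemm tasks whose dependences form the DAG of slide 11. It assumes OpenMP 4.0 or later task dependences, which let a runtime execute independent tasks out of order; the block kernels are placeholders where a real code would call LAPACK/BLAS (potrf, trsm, syrk, gemm).

    // Illustrative sketch (not the SuperMatrix implementation): a right-looking
    // blocked Cholesky expressed as tasks with data dependences. The runtime
    // may run independent tasks "out of order", as in the DAG of slide 11.
    #include <omp.h>

    struct Block { /* one square tile of the matrix; storage omitted in this sketch */ };

    // Placeholder kernels; a real code would call LAPACK/BLAS here.
    void chol(Block&) {}
    void trsm(const Block&, Block&) {}
    void syrk(const Block&, Block&) {}
    void gemm(const Block&, const Block&, Block&) {}

    // A points to nb*nb tiles stored row-major: tile (i,j) is A[i*nb + j].
    void blocked_cholesky(Block* A, int nb) {
        #pragma omp parallel
        #pragma omp single
        for (int k = 0; k < nb; ++k) {
            #pragma omp task depend(inout: A[k*nb + k])
            chol(A[k*nb + k]);                      // factor diagonal block

            for (int i = k + 1; i < nb; ++i) {
                #pragma omp task depend(in: A[k*nb + k]) depend(inout: A[i*nb + k])
                trsm(A[k*nb + k], A[i*nb + k]);     // update column panel
            }
            for (int i = k + 1; i < nb; ++i) {
                #pragma omp task depend(in: A[i*nb + k]) depend(inout: A[i*nb + i])
                syrk(A[i*nb + k], A[i*nb + i]);     // symmetric rank-k update
                for (int j = k + 1; j < i; ++j) {
                    #pragma omp task depend(in: A[i*nb + k], A[j*nb + k]) depend(inout: A[i*nb + j])
                    gemm(A[i*nb + k], A[j*nb + k], A[i*nb + j]);  // trailing update
                }
            }
        }
        // tasks complete at the implicit barrier ending the single/parallel region
    }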

Slide 13: Uintah Framework
de St. Germain, McCorquodale, Parker, Johnson at the SCI Institute, Univ. of Utah
Based on a task graph model
– Each algorithm defines a description of its computation
Required inputs and outputs
Callbacks to perform a task on a single region of space
– Communication performed at graph edges
– Graph created by Uintah

Slide 14: Uintah Tensor Product Task Graph
Each task is replicated over regions in space
Expresses data parallelism and task parallelism
Resulting detailed graph is the tensor product of the master graph and the spatial regions
Efficient:
– Detailed tasks not replicated on all processors
Scalable:
– Control structure known globally
– Communication structure known locally
Dependencies specified implicitly w/ simple algebra
– Spatial dependencies
Computes:
– Variable (name, type)
– Patch subset
Requires:
– Variable (name, type)
– Patch subset
– Halo specification
– Other dependencies: AMR, others
Master graph (explicitly defined) -> detailed graph (implicitly defined)
Credit: Steve Parker
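To make the computes/requires description above concrete, here is a small hypothetical sketch; the type and function names are illustrative inventions, not the actual Uintah API. A task declares the variables it computes and requires (with a halo width), and a framework could derive the communication edges of the detailed graph from these declarations.

    // Hypothetical sketch of a Uintah-style task declaration (illustrative only;
    // the real Uintah API differs). A task names what it computes and what it
    // requires, with halo widths; the framework derives communication from this.
    #include <functional>
    #include <string>
    #include <vector>

    struct Patch { int id; };              // a single region of space

    struct Dependence {
        std::string variable;              // variable name (type omitted here)
        int halo;                          // ghost-cell width (0 = none)
    };

    struct TaskDescription {
        std::string name;
        std::vector<Dependence> computes;  // outputs produced on each patch
        std::vector<Dependence> needs;     // the "requires" list (requires is a C++20 keyword)
        std::function<void(const Patch&)> callback;  // work on a single patch
    };

    // Example: an advection task reads "density" and "velocity" with a one-cell
    // halo and writes "density_new" on each patch it is assigned.
    TaskDescription make_advect_task() {
        TaskDescription t;
        t.name     = "advect";
        t.needs    = { {"density", 1}, {"velocity", 1} };
        t.computes = { {"density_new", 0} };
        t.callback = [](const Patch& p) {
            // ... numerical kernel for patch p would go here ...
            (void)p;
        };
        return t;
    }

    // The framework would replicate this task over all patches (tensor product
    // of the master graph with the spatial regions) and schedule the result.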

Slide 15: Uintah - How It Works
Credit: Steve Parker

Slide 16: Uintah - More Details
Task graphs can be complex
– Can include loops, nesting, recursion
Optimal scheduling is NP-hard
– "Optimal enough" scheduling isn't too hard
Creating a schedule can be expensive
– But it may not need to be done too often
Overall, good scaling and performance have been obtained with this approach
Credit: Steve Parker

Slide 17: Applications and Grids
How to map applications to grids?
Some applications are Grid-unaware - they just want to run fast
– May run on Grid-aware (Grid-enabled?) programming environments, e.g. MPICH-G2, MPIg
Other apps are Grid-aware themselves
– This is where SAGA fits in, as an API to permit the apps to interact with the middleware
[Figure: layered stack - Grid-unaware applications sit on Grid-enabled tools/environments, Grid-aware applications sit on a simple API (SAGA); both layers sit above the middleware and the Grid resources, services, platforms]
Credit: Thilo Kielmann

Slide 18: Common Grid Applications
Data processing
– Data exists on the grid, possibly replicated
– Data is staged to a single set of resources
– Application starts on that set of resources
Parameter sweeps
– Lots of copies of a sequential/parallel job launched on independent resources, with different inputs
– Controlling process starts jobs and gathers outputs
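A minimal sketch of the controlling-process pattern for parameter sweeps described above; the submit_job command is a hypothetical stand-in, and a real grid deployment would submit through Condor, Globus GRAM, SAGA, or similar rather than std::system.

    // Minimal sketch of a parameter-sweep controller: launch one independent job
    // per parameter value, then gather the outputs. "submit_job" is hypothetical.
    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        std::vector<double> parameters = {0.1, 0.2, 0.5, 1.0};   // sweep values

        // Start one independent job per parameter (no dependencies between them).
        for (std::size_t i = 0; i < parameters.size(); ++i) {
            std::string cmd = "submit_job ./simulate --param=" +
                              std::to_string(parameters[i]) +
                              " --out=run_" + std::to_string(i) + ".dat";
            if (std::system(cmd.c_str()) != 0)
                std::cerr << "submission failed for run " << i << "\n";
        }

        // ... wait for completion (polling or callbacks), then gather outputs ...
        for (std::size_t i = 0; i < parameters.size(); ++i) {
            std::ifstream in("run_" + std::to_string(i) + ".dat");
            if (in) std::cout << "collected output of run " << i << "\n";
        }
        return 0;
    }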

Slide 19: More Common Grid Applications
Workflow applications
– Multiple units of work, either sequential or parallel, either small or large
– Data often transferred between tasks by files
– Task sequence described as a graph, possibly a DAG
– Abstract graph doesn't include resource information
– Concrete graph does
– Some process/service converts the graph from abstract to concrete
Often all at once, ahead of job start - static mapping
Perhaps more gradually (JIT?) - dynamic mapping
Pegasus from ISI is an example of this, currently static
– (Note: parameter sweeps are very simple workflows)
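The abstract-versus-concrete distinction can be illustrated with a small sketch; the data structures and the round-robin site chooser below are assumptions for illustration, not how Pegasus actually plans. The abstract workflow carries only tasks and dependencies; the mapping step binds each task to a resource, yielding the concrete workflow a DAG executor could run.

    // Illustrative sketch (not Pegasus): statically mapping an abstract workflow
    // (tasks + dependencies, no resources) to a concrete one (tasks bound to sites).
    #include <iostream>
    #include <string>
    #include <vector>

    struct AbstractTask {
        std::string name;                  // e.g. "mProject_1"
        std::vector<int> depends_on;       // indices of predecessor tasks
    };

    struct ConcreteTask {
        AbstractTask task;
        std::string site;                  // resource chosen by the planner
    };

    // Trivial static planner: assign tasks to sites round-robin. A real planner
    // would consult resource availability and data-location information.
    std::vector<ConcreteTask> plan(const std::vector<AbstractTask>& abstract,
                                   const std::vector<std::string>& sites) {
        std::vector<ConcreteTask> concrete;
        for (std::size_t i = 0; i < abstract.size(); ++i)
            concrete.push_back({abstract[i], sites[i % sites.size()]});
        return concrete;
    }

    int main() {
        std::vector<AbstractTask> wf = {
            {"stage_in", {}}, {"compute_a", {0}}, {"compute_b", {0}}, {"merge", {1, 2}}};
        std::vector<std::string> sites = {"clusterA", "clusterB"};

        for (const auto& t : plan(wf, sites))
            std::cout << t.task.name << " -> " << t.site << "\n";
        return 0;
    }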

Slide 20: Montage - a Workflow App
An astronomical image mosaic service for the National Virtual Observatory, http://montage.ipac.caltech.edu/
Delivers custom, science-grade image mosaics
– Image mosaic: combine many images so that they appear to be a single image from a single telescope or spacecraft
– User specifies projection, coordinates, spatial sampling, mosaic size, image rotation
– Preserves astrometry (to 0.1 pixels) & flux (to 0.1%)
Modular, portable "toolbox" design
– Loosely-coupled engines
– Each engine is an executable compiled from ANSI C
[Example images: 100 µm sky, an aggregation of COBE and IRAS maps (Schlegel, Finkbeiner and Davis, 1998), covering 360 x 180 degrees in CAR projection; supernova remnant S147, from IPHAS: The INT/WFC Photometric H-alpha Survey of the Northern Galactic Plane; David Hockney, Pearblossom Highway, 1986]

Slide 21: Montage Workflow
[Figure: workflow for three input images - each image is reprojected (mProject 1-3); overlapping pairs are differenced (mDiff) and a plane is fit to each difference image (mFitplane, planes of the form ax + by + c = 0); mConcatFit gathers the fits and mBgModel solves for a background-correction plane per image; mBackground 1-3 apply the corrections; mAdd coadds the corrected images into the final mosaic (overlapping tiles)]

Slide 22: Montage on the Grid
Using Pegasus (Planning for Execution on Grids), http://pegasus.isi.edu/
– Pegasus maps an abstract workflow to an executable form, using Grid information systems (information about available resources and data location) and MyProxy (user's grid credentials)
– Condor DAGMan executes the resulting workflow on the Grid
[Figure: example DAG for 10 input files, with data stage-in nodes, Montage compute nodes (mProject, mDiff, mFitPlane, mConcatFit, mBgModel, mBackground, mAdd), data stage-out nodes, and registration nodes]

Slide 23: Montage Performance
MPI version on a single cluster is the baseline
Grid version on a single cluster has similar performance for large problems
Grid version on multiple clusters has performance dominated by data transfer between stages

Slide 24: Workflow Application Issues
Apps need to map processing to clusters
Depending on the mapping, various data movement is needed, so the mapping either leads to networking requirements or is dependent on the available networking
Prediction (and mapping) needs some intelligence
One way to do this is through Pegasus, which currently does static mapping of an abstract workflow to a concrete workflow, but will do more dynamic mapping at some future point
Networking resources and availability could be inputs to Pegasus, or Pegasus could be used to request network resources at various times during a run

Slide 25: Making Use of Grids
In general, groups of users (communities) want to run applications
Code/user/infrastructure is aware of the environment and does:
– Discover resources available now (or perhaps later)
– Start my application
– Have access to data and storage
– Monitor and possibly steer the application
Other things that could be done:
– Migrate the app to faster resources that are now available
– Recover from hardware failure by continuing with fewer processors or by restarting from a checkpoint on different resources
– Use networks as needed (reserve them for these times)
Credit: Thilo Kielmann and Gabrielle Allen

Slide 26: Less Common Grid Applications
True distributed MPI application over multiple resources/clusters
Other applications that use multiple coupled clusters
Uncommon because these jobs run poorly without sufficient network bandwidth, and there has been no good way for users to reserve bandwidth when needed

Slide 27: SPICE
Used for analyzing RNA translocation through protein pores
Using "standard" molecular dynamics would need millions of CPU hours
Instead, use Steered Molecular Dynamics and Jarzynski's Equation (SMD-JE)
– Uses static visualization to understand structural features
– Uses interactive simulations to determine "near-optimal" parameters
Uses haptic interaction - requires low-latency bi-directional communication between user and simulation
– Uses "near-optimal" parameters and many large parallel simulations to determine "optimal" parameters
~75 simulations on 128/256 processors
– Uses "optimal" parameters to calculate the full free energy profile along the axis of the pore
~100 simulations on 2500 processors
Credit: Shantenu Jha, et al.
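As general background on the SMD-JE approach (not the SPICE code): Jarzynski's equality estimates the free energy difference from an exponential average of the nonequilibrium work over many steered runs, dF = -kT ln <exp(-W/kT)>. A minimal sketch of that estimator over a set of work values, with hypothetical numbers:

    // Minimal sketch of the Jarzynski estimator used in SMD-JE style analyses
    // (general background, not the SPICE implementation):
    //   dF = -kT * ln( (1/N) * sum_i exp(-W_i / kT) )
    // where W_i is the nonequilibrium work measured in steered simulation i.
    #include <cmath>
    #include <iostream>
    #include <vector>

    double jarzynski_free_energy(const std::vector<double>& work, double kT) {
        double sum = 0.0;
        for (double w : work)
            sum += std::exp(-w / kT);              // exponential work average
        return -kT * std::log(sum / work.size());
    }

    int main() {
        // Hypothetical work values (kcal/mol) from a handful of pulling runs.
        std::vector<double> work = {12.1, 10.8, 13.5, 11.2, 12.9};
        double kT = 0.596;                         // ~kT at 300 K, in kcal/mol
        std::cout << "Estimated dF = " << jarzynski_free_energy(work, kT)
                  << " kcal/mol\n";
        return 0;
    }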

Slide 28: NEKTAR
Simulates arterial blood flow
Uses a hybrid approach
– 3D detailed CFD computed at bifurcations
– Waveform coupling between bifurcations modeled w/ a reduced set of 1D equations
– 55 largest arteries in the human body w/ 27 bifurcations would require about 7 TB of memory
– Parallelized across and within clusters
[Figure: run spanning SDSC, TACC, NCSA, and PSC]
Credit: Shantenu Jha, et al.

Slide 29: Cactus
Freely available, modular, portable and manageable environment for collaboratively developing parallel, high-performance, multi-dimensional simulations (component-based)
Developed for numerical relativity, but now a general framework for parallel computing (CFD, astro, climate, chem. eng., quantum gravity, etc.)
Finite difference, AMR, FE/FV, multipatch
Active user and developer communities; main development now at LSU and AEI
Science-driven design issues
Open source, documentation, etc.
Just over 10 years old
Credit: Gabrielle Allen

Slide 30: Cactus Structure
[Figure: core "Flesh" (ANSI C) surrounded by plug-in "Thorns" (modules, Fortran/C/C++)]
– Flesh provides: parameters, grid variables, error handling, scheduling, extensible APIs, make system
– Example thorns: driver, input/output, interpolation, SOR solver, coordinates, boundary conditions, black holes, equations of state, remote steering, wave evolvers, multigrid
Credit: Gabrielle Allen

Slide 31: Cactus and Grids
HTTPD thorn: allows a web browser to connect to a running simulation, examine its state, and change parameters
Worm thorn: makes a Cactus app self-migrating
Spawner thorn: any routine can be done on another resource
TaskFarm: allows distribution of apps on the Grid
Run a single app using distributed MPI
Credit: Gabrielle Allen, Erik Schnetter

Slide 32: EnLIGHTened
Network research, driven by concrete application projects, all of which critically require progress in network technologies and in tools that utilize them
EnLIGHTened testbed: 10 Gbps optical networks running over NLR; four all-photonic Calient switches are interconnected via the Louisiana Optical Network Initiative (LONI), the EnLIGHTened wave, and the UltraLight wave, all using GMPLS control plane technologies
Global alliance of partners
Will develop, test, and disseminate advanced software and underlying technologies to:
– Provide generic applications with the ability to be aware of their network and Grid environment and capabilities, and to make dynamic, adaptive and optimized use (monitor & abstract, request & control) of the networks connecting various high-end resources
– Provide vertical integration from the application to the optical control plane, including extending GMPLS
Will examine how to distribute the network intelligence among the network control plane, management plane, and the Grid middleware

Slide 33: EnLIGHTened Team
Yufeng Xin, Steve Thorpe, Bonnie Hurst, Joel Dunn, Gigi Karmous-Edwards, Mark Johnson, John Moore, Carla Hunt, Lina Battestilli, Andrew Mabe, Ed Seidel, Gabrielle Allen, Seung Jong Park, Jon MacLaren, Andrei Hutanu, Lonnie Leger, Dan Katz, Savera Tanwir, Harry Perros, Mladen Vouk, Javad Boroumand, Russ Gyurek, Wayne Clark, Kevin McGrattan, Peter Tompsu, Olivier Jerphagnon, John Bowers, Rick Schlichting, John Strand, Matti Hiltunen, Steven Hunter, Dan Reed, Alan Blatecky, Chris Heermann, Yang Xia, Xun Su

Slide 34: EnLIGHTened Testbed
[Figure: national testbed map showing the Cisco/UltraLight wave, the EnLIGHTened wave (Cisco/NLR), the LONI wave, and the San Diego CAVE wave, with sites including CHI, HOU, DAL, TUL, KAN, PIT, WDC, OGD, BOI, CLE, POR, DEN, SVL, SEA; connections to Asia, Canada, and Europe; and VCL @ NCSU]
Members: MCNC GCNS, LSU CCT, NCSU, RENCI
Official partners: AT&T Research, SURA, NRL, Cisco Systems, Calient Networks, IBM
NSF project partners: OptIPuter, UltraLight, DRAGON, Cheetah
International partners: Phosphorus (EC), G-lambda (Japan), GLIF

Slide 35: HARC: Highly Available Robust Co-allocator
Extensible, open-source co-allocation system
Can already reserve:
– Time on supercomputers (advance reservation), and
– Dedicated paths on GMPLS-based networks with simple topologies
Uses Paxos Commit to atomically reserve multiple resources, while providing a highly available service
Used to coordinate bookings across the EnLIGHTened and G-lambda testbeds in the largest demonstration of its kind to date (more later)
Used for setting up the network for Thomas Sterling's HPC class, which goes out live in HD (more later)
Credit: Jon MacLaren

Slide 36: [Figure: coordinated co-allocation across Japan and the US - applications (MPI, visualization) request network bandwidth and computers, reserved from xx:xx to yy:yy, through network resource managers (KDDI NRM and NTT NRM in Japan, EnLIGHTened NRM in the US) and compute resource managers (CRMs) in front of the clusters]

Slide 37: Data grid applications
Remote visualization
– Data is somewhere, and needs to flow quickly and smoothly to a visualization app
– Data could be simulation results, or measured data

Slide 38: iGrid 2005 demo
Distributed visualization/collaboration
– Visualization at LSU
– Interaction among San Diego, LSU, Brno
– Data on remote LONI machines

Slide 39: Video for visualization
But also for videoconferencing between the three sites
1080i (1920x1080, 60 fps interlaced): 1.5 Gbps per unidirectional stream, 4.5 Gbps at each site (two incoming streams, one outgoing)
Jumbo frames (9000 bytes), Layer 2, lossless (more or less), dedicated network
Hardware capture: DVS Centaurus (HD-SDI) + DVI-to-HD-SDI converter from Doremi
Credit: Andrei Hutanu

Slide 40: Hardware setup – one site
Credit: Andrei Hutanu

Slide 41: Video distribution
Done in software (multicast not up to speed, optical multicast complicated to set up); can do 1:4 distribution with high-end Opteron workstations
HD class 1-to-n
– Only one stream is distributed - the one showing the presenter (Thomas Sterling); the others go just to LSU

Slide 42: Data analysis
Future scenario motivated by increases in network speed
Possibilities for simulations to store results locally are limited
– Downsampling the output, not storing all data
Use remote (distributed, possibly virtual) storage
– Can store all data
– This will enable new types of data analysis
Credit: Andrei Hutanu

Slide 43: Components
Storage
– High-speed distributed file systems or virtual RAM disks
– Potential use cases: global checkpointing facility; data analysis using the data from this storage (distribution could be determined by the analysis routines)
Data access
– Various data selection routines gather data from the distributed storage elements (storage supports app-specific operations)
Credit: Andrei Hutanu

Slide 44: More Components
Data transport
– Components of the storage are connected by various networks; may need to use different transport protocols
Analysis (visualization or numerical analysis)
– Initially single-machine, but can also be distributed
Data source
– Computed in advance and preloaded on the distributed storage initially
– Or live streaming from the distributed simulation
Credit: Andrei Hutanu

Slide 45: Conclusions
Applications exist where infrastructure exists that enables them
Very few applications (and application authors) can afford to get ahead of the infrastructure
We can run the same (grid-unaware) applications on more resources
– Perhaps add features such as fault tolerance
– Use SAGA to help here?

Slide 46: SAGA
Intent: SAGA is to grid apps what MPI is to parallel apps
Questions/metrics:
– Does SAGA enable rapid development of new apps?
– Does it allow complex apps with less code?
– Is it used in libraries?
Roots: RealityGrid (ReG Steering Library), GridLab (GAT), and others came together at GGF
Strawman API:
– Uses SIDL (from Babel, CCA)
– Language-independent spec
– OO base design - can adapt to procedural languages
Status:
– Started between GGF 11 & GGF 12 (July/Aug 2004)
– Draft API submitted to OGF early Oct. 2006
– Currently responding to comments…
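To give a flavor of "MPI for grid apps" at the source level, here is an illustrative C++ sketch in the spirit of the SAGA job API; the class and method names approximate the strawman design and are not the normative OGF specification. A real program would link against a SAGA implementation, whose adaptors hide the underlying middleware from the application.

    // Illustrative sketch only: pseudo-C++ in the spirit of the SAGA job API.
    // Names approximate the strawman design; they are not the OGF specification.
    #include <iostream>
    #include <stdexcept>
    #include <string>
    #include <vector>

    namespace saga_sketch {

    struct job_description {
        std::string executable;
        std::vector<std::string> arguments;
        int number_of_processes = 1;
    };

    class job {
    public:
        void run()  { std::cout << "job started\n"; }   // would hand off to middleware
        void wait() { std::cout << "job finished\n"; }  // would block until completion
    };

    class job_service {
    public:
        explicit job_service(const std::string& resource_url) : url_(resource_url) {
            if (url_.empty()) throw std::runtime_error("no resource given");
        }
        job create_job(const job_description&) { return job{}; }  // middleware-backed in reality
    private:
        std::string url_;
    };

    }  // namespace saga_sketch

    int main() {
        // The application only talks to the API; the middleware behind the
        // job_service (e.g. an adaptor for a specific grid) is hidden from it.
        saga_sketch::job_description jd;
        jd.executable = "/bin/my_simulation";
        jd.arguments  = {"--input", "data.in"};
        jd.number_of_processes = 64;

        saga_sketch::job_service js("gram://some.cluster.example.org");
        saga_sketch::job j = js.create_job(jd);
        j.run();
        j.wait();
        return 0;
    }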

Slide 47: More Conclusions
Infrastructure is getting better
Middleware developers are working on some of the right problems
– If we want to keep doing the same things better
– And add some new things (grid-aware apps)
Web 3.1 is coming soon…
– We're not driving the distributed computing world
– Have to keep trying new things

