Workshop on Pattern Analysis: Data Flow Pattern Analysis of Scientific Applications. Michael Frumkin, Parallel Systems & Applications, Intel Corporation.


- 1 - Data Flow Pattern Analysis of Scientific Applications
Michael Frumkin, Parallel Systems & Applications, Intel Corporation
May 6, 2005

- 2 - Outline
- Why Data Flow Pattern Analysis?
- CFD Applications
- The NAS Parallel Benchmarks
- The NAS Grid Benchmarks
- Trace File Analysis
- Conclusions

- 3 - Why Data Flow Pattern Analysis?
- Scientific applications
  - model a few natural processes
  - new effects are added infrequently
  - their influence on the existing data flows is insignificant
- Knowledge of the data flow in a program helps with
  - program understanding
  - program optimization, parallelization, multithreading
  - building an application performance model

- 4 - Design of Scientific Applications
- Time is represented as an outer loop
  - iterations over the time step
- Space is represented by structured/unstructured grids
  - important for understanding data locality
  - data access patterns
  - spatial parallelism
- Physics is represented by an operator applied at each grid point
  - data flow
  - operator-level parallelism/dependence
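This three-part structure can be sketched in C; the names and the toy 1-D diffusion operator below are illustrative stand-ins for the physics, not any particular benchmark:

```c
#define NX 16   /* grid points */
#define NT 10   /* time steps  */

/* Pointwise "physics" operator: a toy explicit diffusion update. */
static double apply_operator(const double *u, int i) {
    return u[i] + 0.25 * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
}

/* Time as the outer loop, space as a structured grid, and an
   operator applied at each interior grid point; boundary values
   are held fixed. */
void time_step(double *u) {
    double next[NX];
    for (int t = 0; t < NT; t++) {          /* outer time loop      */
        for (int i = 1; i < NX - 1; i++)    /* sweep over the grid  */
            next[i] = apply_operator(u, i); /* physics at each point */
        for (int i = 1; i < NX - 1; i++)
            u[i] = next[i];
    }
}
```

The separation matters for analysis: the time loop carries the only true sequential dependence, while the spatial sweep exposes the parallelism.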

- 5 - CFD Data Flow Patterns
- Solve the Navier-Stokes equations: K(u_{i+1}) = L u_i
  - u is a five-dimensional vector
  - K is a non-linear operator
- Solver
- RHS computation

- 6 - ADI Pattern
- ADI method: approximate factorization K ~ Kx * Ky * Kz
- Directional sweeps: x-solve, y-solve, z-solve
- Multipartition
- Multilevel parallelism
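A minimal sketch of the three directional sweeps follows; the smoothing weights are illustrative (the real solvers invert tridiagonal or block-tridiagonal systems along each line):

```c
#define N 8
typedef double Grid[N][N][N];

/* Each sweep couples points along one axis only, so the other two
   dimensions provide independent lines that can run in parallel. */
void x_solve(Grid u) {
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            for (int i = 1; i < N - 1; i++)
                u[k][j][i] = 0.25 * u[k][j][i - 1] + 0.5 * u[k][j][i]
                           + 0.25 * u[k][j][i + 1];
}

void y_solve(Grid u) {
    for (int k = 0; k < N; k++)
        for (int j = 1; j < N - 1; j++)
            for (int i = 0; i < N; i++)
                u[k][j][i] = 0.25 * u[k][j - 1][i] + 0.5 * u[k][j][i]
                           + 0.25 * u[k][j + 1][i];
}

void z_solve(Grid u) {
    for (int k = 1; k < N - 1; k++)
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                u[k][j][i] = 0.25 * u[k - 1][j][i] + 0.5 * u[k][j][i]
                           + 0.25 * u[k + 1][j][i];
}

/* One ADI-style step: K ~ Kx * Ky * Kz applied as three sweeps. */
void adi_step(Grid u) { x_solve(u); y_solve(u); z_solve(u); }
```

The data-flow consequence is visible in the loop nests: between sweeps, the axis along which points are coupled changes, which is what forces either a transpose or a multipartition-style distribution.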

- 7 - BT Communication (communication pattern diagram)

- 8 - Explicit Operators
- Stencil operators (explicit methods)
- At each point of a 3-dimensional mesh apply a seven-point or 27-point stencil
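For example, a seven-point stencil (the center point plus its six axis neighbors) might look like the sketch below; the coefficients are illustrative, not those of any particular benchmark:

```c
#define M 8

/* Apply a seven-point stencil at each interior point of a 3-D mesh.
   Reads come only from the six nearest axis neighbors, so the data
   flow is purely local and every output point is independent. */
void seven_point(const double in[M][M][M], double out[M][M][M]) {
    for (int k = 1; k < M - 1; k++)
        for (int j = 1; j < M - 1; j++)
            for (int i = 1; i < M - 1; i++)
                out[k][j][i] = in[k][j][i]
                    - (in[k - 1][j][i] + in[k + 1][j][i]
                     + in[k][j - 1][i] + in[k][j + 1][i]
                     + in[k][j][i - 1] + in[k][j][i + 1]) / 6.0;
}
```

Because the stencil writes only to `out`, all points can be updated in parallel; a 27-point variant widens the read set to the full 3x3x3 neighborhood but keeps the same independence.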

- 9 - Workshop on Pattern Analysis )  Two-dimensional pipeline  Hyperplane algorithm Dependence Matrices (() Lower-Upper Triangular

- 10 - LU Communication (communication pattern diagram)

- 11 - Multigrid V-Cycle (diagram: repeated projection down the cycle, then interpolation and smoothing back up)
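The two grid-transfer operators in the V-cycle can be sketched in 1-D with full-weighting restriction and linear interpolation; this is a sketch of the pattern, not the MG benchmark's 3-D operators:

```c
#define NFINE   9   /* fine grid points (2^k + 1) */
#define NCOARSE 5   /* coarse grid points         */

/* Projection: full-weighting restriction from fine to coarse grid. */
void project(const double *fine, double *coarse) {
    coarse[0] = fine[0];
    coarse[NCOARSE - 1] = fine[NFINE - 1];
    for (int i = 1; i < NCOARSE - 1; i++)
        coarse[i] = 0.25 * fine[2 * i - 1] + 0.5 * fine[2 * i]
                  + 0.25 * fine[2 * i + 1];
}

/* Interpolation: linear prolongation from coarse back to fine grid. */
void interpolate(const double *coarse, double *fine) {
    for (int i = 0; i < NCOARSE; i++)       /* copy coincident points */
        fine[2 * i] = coarse[i];
    for (int i = 1; i < NFINE; i += 2)      /* fill in-between points */
        fine[i] = 0.5 * (fine[i - 1] + fine[i + 1]);
}
```

From a data-flow viewpoint, both transfers are local stencils, but they move data between grids of different resolution; that inter-level flow is what distinguishes the multigrid pattern from a single-grid relaxation.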

- 12 - MG Communication (communication pattern diagram)

- 13 - Data Flow Analysis: BT x_solve (serial) Call Graph
  do k = 1, ksize
    do i = 1, isize
      do j = 1, jsize

- 14 - Nest Data Flow Graph
- Nodes: do_45, do_134, do_330
- Each arc represents an affinity relation

- 15 - NAS Parallel Benchmarks
- Application benchmarks
  - CFD: BT, SP, LU
  - Data intensive: DC, DT, BTIO
  - Computational chemistry: UA
- Kernel benchmarks: FT, CG, MG, IS
- Verification
- Performance model
- FORTRAN, C, HPF, Java*
- Serial, MPI, OpenMP, Java* threads
* Other names and brands may be claimed as the property of others.

- 16 - NPB Performance on Altix (performance chart) **
** Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.

- 17 - Basic Data Flow Patterns
- Shuffles: sorting, FFT, routing
- Gather/scatter: conjugate gradient, MD and FE codes, sparse matrices
- Transpose: FFT, sorting
- Tree: parallel prefix, reduction, sorting
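The tree pattern, for instance, underlies parallel prefix. A sequential sketch of the recursive-doubling (Hillis-Steele) scheme follows; each of its log2(n) sweeps has a fully independent inner loop that could run in parallel:

```c
#include <string.h>

/* In-place inclusive prefix sum by recursive doubling. After the
   sweep for distance d, a[i] holds the sum of the last min(2*d, i+1)
   inputs ending at position i. */
void prefix_sum(double *a, int n) {
    double tmp[256];                    /* assumes n <= 256 */
    for (int d = 1; d < n; d *= 2) {    /* log2(n) tree levels */
        memcpy(tmp, a, n * sizeof(double));
        for (int i = d; i < n; i++)     /* independent; parallelizable */
            a[i] = tmp[i] + tmp[i - d];
    }
}
```

The data-flow graph of this scheme is exactly the tree pattern named above: distance-d arcs at level log2(d), with no arcs within a level.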

- 18 - HPC Challenge Benchmarks
- HPL
- DGEMM
- STREAM
- PTRANS
- FFTE
- RandomAccess
- Effective Bandwidth (b_eff)
icl.cs.utk.edu/hpcc

- 19 - Programming With Directed Graphs
- Arc
  - Arc* newArc(Node *tail, Node *head)
  - AttachArc(DGraph *dg)
  - deleteArc(Arc *ar)
- Node
  - newNode(char *name)
  - Node* AttachNode(DGraph *dg)
  - deleteNode(Node *nd)
- DGraph
  - DGraph* newDGraph(char *name)
  - writeGraph(DGraph *dg, char* fname)
  - DGraph* readGraph(char* fname)
- Implemented in DT of NPB and in NGB
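A minimal sketch of how such an API might be implemented and used: the names follow the slide, but the signatures, struct layouts, and fixed capacities here are illustrative assumptions, not the dgraph code shipped with NPB.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_ELEMS 64    /* illustrative fixed capacity */

typedef struct { char name[32]; int id; } Node;
typedef struct { Node *tail, *head; } Arc;
typedef struct {
    char  name[32];
    int   numNodes, numArcs;
    Node *node[MAX_ELEMS];
    Arc  *arc[MAX_ELEMS];
} DGraph;

DGraph *newDGraph(const char *name) {
    DGraph *dg = calloc(1, sizeof(DGraph));
    strncpy(dg->name, name, sizeof(dg->name) - 1);
    return dg;
}

Node *newNode(const char *name) {
    Node *nd = calloc(1, sizeof(Node));
    strncpy(nd->name, name, sizeof(nd->name) - 1);
    return nd;
}

/* Register a node with the graph and assign it an id. */
Node *AttachNode(DGraph *dg, Node *nd) {
    nd->id = dg->numNodes;
    dg->node[dg->numNodes++] = nd;
    return nd;
}

Arc *newArc(Node *tail, Node *head) {
    Arc *ar = calloc(1, sizeof(Arc));
    ar->tail = tail;
    ar->head = head;
    return ar;
}

void AttachArc(DGraph *dg, Arc *ar) {
    dg->arc[dg->numArcs++] = ar;
}
```

With this API, a two-task chain is built by attaching both task nodes and then an arc from the first to the second; a task-graph benchmark driver can then walk the arcs to schedule tasks in dependence order.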

- 20 - Directed Graphs Around
- Parse trees
- File systems
- Application task graphs
- Device schematics
- Visualization and layout tools
  - VCG tool
  - Edge tool
  - Tom Sawyer Software
  - Commercial tools

- 21 - Cart3D
- Performs CFD analysis on complex geometries
- Uses six executables
  - Intersect: intersects geometry
  - Cubes: produces Cartesian meshes
  - Reorder: reorders meshes
  - Mgprep: coarsens the mesh
  - flowCart: convergence acceleration
  - Clic: analyzes the flow
- Executables communicate via files
- Returns relevant forces: lift, drag, side force
- Task graphs are rapidly growing

- 22 - The NAS Grid Benchmarks
- Reflect the task-level programming paradigm
- Contain four patterns
  - Embarrassingly Distributed (ED): Launch -> independent SP tasks -> Report
  - Helical Chain (HC): Launch -> repeated BT -> SP -> LU chain over #steps -> Report
  - Visualization Pipeline (VP): Launch -> pipelined BT -> MG -> FT stages -> Report
  - Mixed Bag (MB): Launch -> FT, LU, and MG tasks of mixed sizes (2, 4, 8) -> Report

- 23 - Data Dependent Patterns
- Intermittent patterns
  - useful for application performance tuning
- Visualization is important
  - allows employing the human eye's ability to detect patterns
- Automatic pattern mining
  - OLAP approach
- MPI communication patterns: automatic trace analysis using OLAP

- 24 - Conclusions
Data flow in applications helps with:
- Application parallelization
- Application understanding
- Application mapping
- Application performance