Enabling Knowledge Discovery in a Virtual Universe: Harnessing the Power of Parallel Grid Resources for Astrophysical Data Analysis
Jeffrey P. Gardner, Andrew Connolly, Cameron McBride
Pittsburgh Supercomputing Center, University of Pittsburgh, Carnegie Mellon University

How to turn simulation output into scientific knowledge (circa 1995):
Step 1: Run simulation (using 300 processors).
Step 2: Analyze simulation on a workstation.
Step 3: Extract meaningful scientific knowledge. (Happy scientist!)

How to turn simulation output into scientific knowledge (circa 2000):
Step 1: Run simulation (using 1000 processors).
Step 2: Analyze simulation on a server (in serial).
Step 3: Extract meaningful scientific knowledge. (Happy scientist!)

How to turn simulation output into scientific knowledge (circa 2006):
Step 1: Run simulation (using … processors).
Step 2: Analyze simulation on ??? (Unhappy scientist.)

Mining the Universe can be (Computationally) Expensive
The size of simulations is no longer limited by computational power; it is limited by the parallelizability of data analysis tools. This situation will only get worse in the future.

How to turn simulation output into scientific knowledge (circa 2012):
Step 1: Run simulation (using 100,000 processors?).
Step 2: Analyze simulation on ???
By 2012, we will have machines with many hundreds of thousands of cores!

The Challenge of Data Analysis in a Multiprocessor Universe
Parallel programs are difficult to write! Steep learning curve to learn parallel programming.
Parallel programs are expensive to write! Lengthy development time.
The parallel world is dominated by simulations: code is often reused for many years by many people, so you can afford to invest lots of time writing it. Example: GASOLINE (a cosmology N-body code) required 10 FTE-years of development.

The Challenge of Data Analysis in a Multiprocessor Universe
Data analysis does not work this way: scientific inquiries change rapidly, and there is less code reuse. Simulation groups do not even write their analysis code in parallel! The data mining paradigm mandates rapid software development!

How to turn observational data into scientific knowledge:
Step 1: Collect data.
Step 2: Analyze data on a workstation.
Step 3: Extract meaningful scientific knowledge. (Happy astronomer!)

The Era of Massive Sky Surveys
Paradigm shift in astronomy: sky surveys. Available data is growing at a much faster rate than computational power.

Good News for “Data Parallel” Operations
Data parallel (or “embarrassingly parallel”): for example, 1,000,000 QSO spectra, where each spectrum takes ~1 hour to reduce and each spectrum is computationally independent of the others. There are many workflow management tools that will distribute such computations across many machines.
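To make the “embarrassingly parallel” case concrete, here is a minimal sketch (assuming a hypothetical reduce_spectrum() routine and spectrum count) of how such a workload can be scattered across MPI ranks with no communication between tasks; real workflow managers add scheduling, fault tolerance, and data staging on top of this idea.

```c
/* Minimal sketch of an embarrassingly parallel workload distributed with MPI.
 * reduce_spectrum() and N_SPECTRA are hypothetical stand-ins; the point is
 * that each spectrum is processed with no communication between ranks.      */
#include <mpi.h>

#define N_SPECTRA 1000000            /* hypothetical total number of QSO spectra */

static void reduce_spectrum(int id)  /* placeholder for ~1 hour of real work */
{
    (void)id;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Static round-robin assignment: rank r handles spectra r, r+nprocs, ... */
    for (int i = rank; i < N_SPECTRA; i += nprocs)
        reduce_spectrum(i);

    MPI_Finalize();
    return 0;
}
```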

Tightly-Coupled Parallelism (what this talk is about)
Data and computational domains overlap; computational elements must communicate with one another. Examples: group finding, N-point correlation functions, new object classification, density estimation.
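As a concrete instance of a tightly coupled analysis, the sketch below does a brute-force pair count in one separation bin (the innermost kernel of an N-point correlation measurement). The data points and bin edges are invented for illustration; the comment notes why, once the points are distributed, the work cannot be split into independent pieces.

```c
/* Toy two-point (pair) count in a separation bin.  Here all points are
 * local; in a distributed run, points near a domain boundary form pairs
 * with points that live on neighboring processors, which is exactly why
 * this computation cannot be partitioned into independent tasks.            */
#include <math.h>
#include <stdio.h>

#define NPTS 5

static const double pts[NPTS][3] = {
    {0.0, 0.0, 0.0}, {0.1, 0.0, 0.0}, {0.5, 0.5, 0.5},
    {0.55, 0.5, 0.5}, {0.9, 0.9, 0.9}
};

int main(void)
{
    const double rmin = 0.02, rmax = 0.2;    /* separation bin [rmin, rmax) */
    long npairs = 0;

    for (int i = 0; i < NPTS; i++)
        for (int j = i + 1; j < NPTS; j++) {
            double r2 = 0.0;
            for (int d = 0; d < 3; d++) {
                double dx = pts[i][d] - pts[j][d];
                r2 += dx * dx;
            }
            if (sqrt(r2) >= rmin && sqrt(r2) < rmax)
                npairs++;
        }

    printf("pairs with %g <= r < %g: %ld\n", rmin, rmax, npairs);
    return 0;
}
```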

The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe
Build a library that is:
Sophisticated enough to take care of all of the nasty parallel bits for you.
Flexible enough to be used for your own particular astrophysics data analysis application.
Scalable: scales well to thousands of processors.

The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe
Astrophysics uses dynamic, irregular data structures: astronomy deals with point-like data in an N-dimensional parameter space, and the most efficient methods on these kinds of data use space-partitioning trees. The most common data structure is a kd-tree.
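As an illustration of the kind of structure involved, here is a minimal, self-contained C sketch of a space-partitioning tree node over point data, together with a pruned box query. The field names, bounding-box layout, and count_in_box() routine are illustrative assumptions, not the actual N tropy data structures.

```c
/* Illustrative space-partitioning (kd-style) tree node for point data,
 * plus a box query that prunes whole subtrees whose bounding boxes miss
 * the query -- the access pattern that makes trees efficient here.          */
#include <stddef.h>
#include <stdio.h>

#define NDIM 3

typedef struct { double pos[NDIM]; } Particle;

typedef struct KDNode {
    double bnd_min[NDIM], bnd_max[NDIM];  /* bounding box of this cell */
    size_t first, count;                  /* particles owned by this cell */
    struct KDNode *left, *right;          /* both NULL for a leaf */
} KDNode;

static size_t count_in_box(const KDNode *n, const Particle *p,
                           const double qmin[NDIM], const double qmax[NDIM])
{
    for (int d = 0; d < NDIM; d++)
        if (n->bnd_max[d] < qmin[d] || n->bnd_min[d] > qmax[d])
            return 0;                               /* prune this whole cell */
    if (!n->left) {                                 /* leaf: test its particles */
        size_t c = 0;
        for (size_t i = n->first; i < n->first + n->count; i++) {
            int in = 1;
            for (int d = 0; d < NDIM; d++)
                if (p[i].pos[d] < qmin[d] || p[i].pos[d] > qmax[d]) in = 0;
            c += in;
        }
        return c;
    }
    return count_in_box(n->left, p, qmin, qmax)
         + count_in_box(n->right, p, qmin, qmax);
}

int main(void)
{
    Particle p[4] = {{{.1,.1,.1}}, {{.2,.3,.4}}, {{.8,.8,.8}}, {{.9,.7,.6}}};
    KDNode lo   = {{0,0,0},  {.5,1,1}, 0, 2, NULL, NULL};
    KDNode hi   = {{.5,0,0}, {1,1,1},  2, 2, NULL, NULL};
    KDNode root = {{0,0,0},  {1,1,1},  0, 4, &lo, &hi};
    double qmin[NDIM] = {0,0,0}, qmax[NDIM] = {.5,.5,.5};
    printf("%zu particles in box\n", count_in_box(&root, p, qmin, qmax));
    return 0;
}
```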

Challenges for scalable parallel application development
Things that make parallel programs difficult to write: thread orchestration, data management.
Things that inhibit scalability: granularity (synchronization), load balancing, data locality.

Overview of existing paradigms: GSA
There are existing globally shared address space (GSA) compilers and libraries: Co-Array Fortran, UPC, ZPL, Global Arrays.
The Good: these are quite simple to use, and they can manage data locality well.
The Bad: existing GSA approaches tend not to scale very well because of fine granularity.
The Ugly: none of these support irregular data structures.
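The granularity problem can be seen in a conceptual sketch. The gsa_get() below is a hypothetical primitive standing in for “fetch one element of a globally shared array” (it is not the API of UPC, ZPL, Co-Array Fortran, or Global Arrays); in a real GSA runtime, each access to a non-local element can become its own small message.

```c
/* Conceptual sketch of why fine-grained globally-shared-address access
 * limits scaling.  gsa_get() is a hypothetical primitive, defined locally
 * here only so the sketch compiles.                                         */
#include <stddef.h>
#include <stdio.h>

/* A real GSA runtime would resolve the index to an owning processor and
 * possibly fetch the element remotely at this point.                        */
static double gsa_get(const double *global_array, size_t i)
{
    return global_array[i];
}

/* Element-by-element access to a distributed array: simple to write, but
 * potentially one tiny message per iteration -- granularity far too fine
 * once thousands of processors are involved.                                */
static double global_sum(const double *global_array, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += gsa_get(global_array, i);
    return s;
}

int main(void)
{
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("sum = %g\n", global_sum(a, 8));
    return 0;
}
```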

Overview of existing paradigms: GSA
There are other GSA approaches that do lend themselves to irregular data structures, e.g. Linda (tuple space).
The Good: almost universally flexible.
The Bad: these tend to scale even worse than the previous GSA approaches; the granularity is too fine.

Challenges for scalable parallel application development
Things that make parallel programs difficult to write: thread orchestration, data management.
Things that inhibit scalability: granularity, load balancing, data locality.
[Slide annotation: GSA]

Overview of existing paradigms: RMI (“Remote Method Invocation”)
[Diagram: a master thread with a computational agenda calls rmi_broadcast(…, (*myFunction)); the RMI layer on each of Proc. 0 through Proc. 3 invokes myFunction() locally.]
myFunction() is coarsely grained.
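A minimal sketch of this coarse-grained dispatch pattern, layered on MPI, is shown below; the function table and rmi_broadcast() helper are hypothetical illustrations of the idea in the diagram, not the implementation behind the slide.

```c
/* Sketch of coarse-grained "remote method invocation" on top of MPI:
 * the master (rank 0) selects a registered method, every processor learns
 * its id via a broadcast, and each invokes the method locally.              */
#include <mpi.h>
#include <stdio.h>

typedef void (*rmi_func)(void);

static void myFunction(void)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("myFunction() running coarse-grained work on proc %d\n", rank);
}

/* Registered methods; the master broadcasts an index into this table. */
static rmi_func func_table[] = { myFunction };

static void rmi_broadcast(int func_id)
{
    /* Collective call: rank 0's choice of method wins on every processor. */
    MPI_Bcast(&func_id, 1, MPI_INT, 0, MPI_COMM_WORLD);
    func_table[func_id]();          /* each proc invokes the method locally */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    rmi_broadcast(0);               /* master's computational agenda: run method 0 */
    MPI_Finalize();
    return 0;
}
```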

Challenges for scalable parallel application development
Things that make parallel programs difficult to write: thread orchestration, data management.
Things that inhibit scalability: granularity, load balancing, data locality.
[Slide annotation: RMI]

N tropy: A Library for Rapid Development of kd-tree Applications No existing paradigm gives us everything we need. Can we combine existing paradigms beneath a simple, yet flexible API?

N tropy: A Library for Rapid Development of kd-tree Applications Use RMI for orchestration Use GSA for data management

A Simple N tropy Example: N-body Gravity Calculation
Cosmological “N-body” simulation: 100,000,000 particles, 1 TB of RAM, a volume 100 million light-years across.
[Diagram: the simulation volume decomposed across a 3×3 grid of processors, Proc 0 through Proc 8.]

A Simple N tropy Example: N-body Gravity Calculation
[Diagram: the master thread holds the computational agenda, a list of particles (P1 … Pn) on which to calculate the gravitational force, and calls ntropy_Dynamic(…, (*myGravityFunc)); the N tropy master layer hands work to the N tropy thread service layer on each of Proc. 0 through Proc. 3, which invokes myGravityFunc().]

A Simple N tropy Example: N-body Gravity Calculation
Cosmological “N-body” simulation: 100,000,000 particles, 1 TB of RAM, a volume 100 million light-years across. Resolving the gravitational force on any single particle requires the entire dataset.
[Diagram: the simulation volume decomposed across a 3×3 grid of processors, Proc 0 through Proc 8.]

A Simple N tropy Example: N-body Gravity Calculation
[Diagram: the N tropy thread service layers on Proc. 0 through Proc. 3, each running myGravityFunc(), exchange off-processor data through the N tropy GSA layer.]
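To suggest what the user-supplied work function might do for each particle it is handed, here is a self-contained Barnes-Hut-style tree walk in C. Everything is local and deliberately tiny, and all names are illustrative rather than the N tropy API; the fetch_node() call marks the point where, in the distributed setting, the GSA layer would transparently supply (and cache) cells owned by other processors.

```c
/* Self-contained sketch of the per-particle work in a gravity callback:
 * a tree walk with an opening-angle test.  fetch_node() is a stand-in for
 * the GSA-layer access that may go off-processor in a real run.             */
#include <math.h>
#include <stdio.h>

typedef struct {
    double com[3];      /* center of mass of the cell */
    double mass;        /* total mass in the cell */
    double size;        /* cell edge length */
    int    child[2];    /* indices of children, -1 for a leaf */
} Node;

static Node tree[16];   /* toy tree; a real one holds millions of cells */

/* Stand-in for the GSA-layer access: here it is just an array lookup. */
static const Node *fetch_node(int id) { return &tree[id]; }

static void accumulate_force(const Node *n, const double pos[3], double acc[3])
{
    double dx[3], r2 = 1e-10;                        /* small softening */
    for (int d = 0; d < 3; d++) { dx[d] = n->com[d] - pos[d]; r2 += dx[d]*dx[d]; }
    double inv_r3 = n->mass / (r2 * sqrt(r2));       /* G = 1 */
    for (int d = 0; d < 3; d++) acc[d] += inv_r3 * dx[d];
}

/* Open cells that are too close or too large; otherwise treat the whole
 * cell as a single point mass.                                              */
static void gravity_walk(int id, const double pos[3], double acc[3], double theta)
{
    const Node *n = fetch_node(id);
    double r2 = 0.0;
    for (int d = 0; d < 3; d++) { double dx = n->com[d] - pos[d]; r2 += dx*dx; }

    if (n->child[0] < 0 || n->size * n->size < theta * theta * r2) {
        accumulate_force(n, pos, acc);               /* leaf or well separated */
    } else {
        gravity_walk(n->child[0], pos, acc, theta);
        gravity_walk(n->child[1], pos, acc, theta);
    }
}

int main(void)
{
    /* Toy two-cell tree: root (index 0) with two leaf children. */
    tree[0] = (Node){{0.5,  0.5, 0.5}, 2.0, 1.0, { 1,  2}};
    tree[1] = (Node){{0.25, 0.5, 0.5}, 1.0, 0.5, {-1, -1}};
    tree[2] = (Node){{0.75, 0.5, 0.5}, 1.0, 0.5, {-1, -1}};

    double pos[3] = {0.1, 0.5, 0.5}, acc[3] = {0, 0, 0};
    gravity_walk(0, pos, acc, 0.7);
    printf("acc = (%g, %g, %g)\n", acc[0], acc[1], acc[2]);
    return 0;
}
```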

N tropy Performance Features
GSA allows performance features to be provided “under the hood”: interprocessor data caching (fewer than 1 in 100,000 off-PE requests actually result in communication).
RMI allows further performance features: dynamic load balancing (the workload can be dynamically reallocated as the computation progresses).
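The idea behind the interprocessor data cache can be sketched as follows: before any off-PE request goes out, a local cache is checked, so only misses generate communication. The direct-mapped cache and remote_fetch() stub below are hypothetical simplifications, not the N tropy implementation; in the toy main(), a small working set is re-read many times, so only the first access to each cell triggers a (simulated) remote fetch.

```c
/* Illustrative interprocessor data cache: only cache misses would reach the
 * network.  Cache layout and remote_fetch() are made-up simplifications.    */
#include <string.h>
#include <stdio.h>

#define CACHE_LINES 4096
#define CELL_BYTES  64

typedef struct {
    long key;                       /* global cell id, -1 if the line is empty */
    unsigned char data[CELL_BYTES];
} CacheLine;

static CacheLine cache[CACHE_LINES];
static long hits, misses;

/* Stand-in for real communication (e.g. a one-sided get from the owner). */
static void remote_fetch(long cell_id, unsigned char *buf)
{
    memset(buf, (int)(cell_id & 0xff), CELL_BYTES);
}

static const unsigned char *get_cell(long cell_id)
{
    CacheLine *line = &cache[cell_id % CACHE_LINES];
    if (line->key != cell_id) {                 /* miss: go to the network */
        remote_fetch(cell_id, line->data);
        line->key = cell_id;
        misses++;
    } else {
        hits++;                                 /* hit: no communication at all */
    }
    return line->data;
}

int main(void)
{
    for (int i = 0; i < CACHE_LINES; i++) cache[i].key = -1;

    /* A tree walk re-reads the same small set of remote cells many times. */
    for (int pass = 0; pass < 1000; pass++)
        for (long id = 0; id < 100; id++)
            get_cell(id);

    printf("hits = %ld, misses = %ld\n", hits, misses);
    return 0;
}
```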

N tropy Performance
[Scaling plot: spatial 3-point correlation function, 3–4 Mpc, on 10 million particles. Three curves compare: no interprocessor data cache and no load balancing; interprocessor data cache without load balancing; and interprocessor data cache with load balancing.]

Why does the data cache make such a huge difference?
[Diagram: the tree walk performed by myGravityFunc() across processors.]

N tropy “Meaningful” Benchmarks
The purpose of this library is to minimize development time! Development time for:
1. Parallel N-point correlation function calculator: 2 years -> 3 months.
2. Parallel friends-of-friends group finder: 8 months -> 3 weeks.

Conclusions
Most approaches for parallel application development rely on a single paradigm, which inhibits scalability and generality. Almost all current HPC programs are written in MPI (“paradigm-less”): MPI is a “lowest common denominator” upon which any paradigm can be imposed.

Conclusions
Many “real-world” problems, especially those involving irregular data structures, demand a combination of paradigms. N tropy provides: Remote Method Invocation (RMI) and Globally Shared Addressing (GSA).

Conclusions
Tools that selectively deploy several parallel paradigms (rather than just one) may be what is needed to parallelize applications that use irregular/adaptive/dynamic data structures.
More information: go to Wikipedia and search for “Ntropy”.