Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey P. Gardner Andrew Connolly Cameron McBride Pittsburgh Supercomputing Center University of Pittsburgh Carnegie Mellon University

How to turn astrophysics simulation output into scientific knowledge Step 1: Run simulation Step 2: Analyze simulation on workstation Step 3: Extract meaningful scientific knowledge (happy scientist) Using 300 processors: (circa 1995)

How to turn astrophysics simulation output into scientific knowledge Step 1: Run simulation Step 2: Analyze simulation on server (in serial) Step 3: Extract meaningful scientific knowledge (happy scientist) Using 1000 processors: (circa 2000)

How to turn astrophysics simulation output into scientific knowledge Step 1: Run simulation Step 2: Analyze simulation on ??? (unhappy scientist) Using processors: (circa 2006) X

Exploring the Universe can be (Computationally) Expensive The size of simulations is no longer limited by computational power It is limited by the parallelizability of data analysis tools This situation will only get worse in the future.

How to turn astrophysics simulation output into scientific knowledge Step 1: Run simulation Step 2: Analyze simulation on ??? Using ~1,000,000 cores? (circa 2012) X By 2012, we will have machines with many hundreds of thousands of cores!

The Challenge of Data Analysis in a Multiprocessor Universe Parallel programs are difficult to write! Steep learning curve to learn parallel programming Parallel programs are expensive to write! Lengthy development time Parallel world is dominated by simulations: Code is often reused for many years by many people Therefore, you can afford to invest lots of time writing the code. Example: GASOLINE (a cosmology N-body code) Required 10 FTE-years of development

The Challenge of Data Analysis in a Multiprocessor Universe Data analysis does not work this way: Rapidly changing scientific inquiries Less code reuse Simulation groups do not even write their analysis code in parallel! The data mining paradigm mandates rapid software development!

How to turn observational data into scientific knowledge Step 1: Collect data Step 2: Analyze data on workstation Step 3: Extract meaningful scientific knowledge (happy astronomer) Observe at Telescope (circa 1990)

Use Sky Survey Data (circa 2005) How to turn observational data into scientific knowledge Step 1: Collect data Step 2: Analyze data on ??? Sloan Digital Sky Survey (500,000 galaxies) X (unhappy astronomer) 3-point correlation function: ~200,000 node-hours of computation
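
To make the cost concrete: a 3-point correlation function, computed naively, tests every triple of galaxies against the separation bins, which scales as O(N³). The sketch below is a minimal, illustrative C fragment; the struct layout, the single separation bin, and the function names are assumptions, not code from the talk. It shows why 500,000 galaxies translate into roughly 2×10^16 triples and hence the quoted node-hours; real analysis codes prune most of this work with space-partitioning trees, which is exactly where the rest of this talk goes.

```c
#include <math.h>
#include <stddef.h>

typedef struct { double x, y, z; } Galaxy;

static double dist(const Galaxy *a, const Galaxy *b) {
    double dx = a->x - b->x, dy = a->y - b->y, dz = a->z - b->z;
    return sqrt(dx*dx + dy*dy + dz*dz);
}

/* Count triples whose three pairwise separations all fall in [rmin, rmax).
 * Naive O(N^3) loop: for N = 500,000 galaxies this is ~2e16 triples,
 * which is why the computation runs to hundreds of thousands of node-hours
 * unless a tree is used to prune the search. */
long long count_triples(const Galaxy *g, size_t n, double rmin, double rmax) {
    long long count = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++) {
            double dij = dist(&g[i], &g[j]);
            if (dij < rmin || dij >= rmax) continue;
            for (size_t k = j + 1; k < n; k++)
                if (dist(&g[i], &g[k]) >= rmin && dist(&g[i], &g[k]) < rmax &&
                    dist(&g[j], &g[k]) >= rmin && dist(&g[j], &g[k]) < rmax)
                    count++;
        }
    return count;
}
```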

Use Sky Survey Data (circa 2012) How to turn observational data into scientific knowledge Large Synoptic Survey Telescope (2,000,000 galaxies) 3-point correlation function: ~several petaflop weeks of computation

Tightly-Coupled Parallelism (what this talk is about) Data and computational domains overlap Computational elements must communicate with one another Examples: Group finding N-Point correlation functions New object classification Density estimation

The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe Build a library that is: Sophisticated enough to take care of all of the nasty parallel bits for you. Flexible enough to be used for your own particular astrophysics data analysis application. Scalable: scales well to thousands of processors.

The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe Astrophysics uses dynamic, irregular data structures: Astronomy deals with point-like data in an N-dimensional parameter space The most efficient methods on this kind of data use space-partitioning trees. The most common data structure is a kd-tree. Build a targeted library for distributed-memory kd-trees that is scalable to thousands of processing elements
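
For context, here is a minimal sketch of a kd-tree build over point-like data, assuming a simple 3-D point struct; the node layout and field names are illustrative, not N tropy's actual data structures. Each node splits its points along one dimension, so range queries, neighbor searches, and tree walks only touch a small fraction of the data.

```c
#include <stdlib.h>

#define NDIM 3

typedef struct { double coord[NDIM]; } Point;

typedef struct KDNode {
    int    split_dim;              /* dimension this node splits on, -1 for leaves */
    double split_val;              /* splitting coordinate                         */
    size_t lo, hi;                 /* range of points covered: [lo, hi)            */
    struct KDNode *left, *right;   /* NULL for leaf nodes                          */
} KDNode;

/* Comparator dimension for qsort; a production code would avoid the global
 * by sorting with an explicit context or using nth_element-style selection. */
static int g_dim;
static int cmp_point(const void *a, const void *b) {
    double d = ((const Point *)a)->coord[g_dim] - ((const Point *)b)->coord[g_dim];
    return (d > 0) - (d < 0);
}

/* Build a kd-tree over points[lo, hi), cycling through the dimensions. */
KDNode *kd_build(Point *points, size_t lo, size_t hi, int depth, size_t leaf_size) {
    KDNode *node = malloc(sizeof *node);
    node->lo = lo; node->hi = hi;
    node->split_dim = -1; node->split_val = 0.0;
    node->left = node->right = NULL;
    if (hi - lo <= leaf_size) return node;          /* leaf: small bucket of points */

    g_dim = depth % NDIM;
    qsort(points + lo, hi - lo, sizeof *points, cmp_point);
    size_t mid = lo + (hi - lo) / 2;                /* split at the median           */
    node->split_dim = g_dim;
    node->split_val = points[mid].coord[g_dim];
    node->left  = kd_build(points, lo, mid, depth + 1, leaf_size);
    node->right = kd_build(points, mid, hi, depth + 1, leaf_size);
    return node;
}
```

Leaf buckets rather than single-point leaves keep the tree shallow and the per-node work coarse, which matters later when tree cells have to be fetched across processors.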

Challenges for scalable parallel application development: Things that make parallel programs difficult to write Work orchestration Data management Things that inhibit scalability: Granularity (synchronization, consistency) Load balancing Data locality Structured data Memory consistency

Overview of existing paradigms: DSM There are many existing distributed shared-memory (DSM) tools. Compilers: UPC Co-Array Fortran Titanium ZPL Linda Libraries: Global Arrays TreadMarks IVY JIAJIA Strings Mirage Munin Quarks CVM

Overview of existing paradigms: DSM The Good: These are quite simple to use. The Good: Can manage data locality pretty well. The Bad: Existing DSM approaches tend not to scale very well because of fine granularity. The Ugly: Almost none support structured data (like trees).
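
The granularity problem can be made concrete with a toy sketch. The dsm_get/dsm_get_block calls below are hypothetical stand-ins simulated with a local array, not the API of any of the tools listed above: fetching remote elements one at a time pays a network round trip per element, while a coarse-grained block fetch amortizes that latency over many elements, which is essentially why fine-grained DSM codes stop scaling.

```c
#include <stddef.h>

/* Toy stand-in for a DSM runtime (an assumption, not a real API): the
 * "global" array lives locally here, and we count simulated round trips. */
static double g_global[1000000];
static long   g_round_trips = 0;

static double dsm_get(size_t i) {                 /* fetch one remote element */
    g_round_trips++;
    return g_global[i];
}
static void dsm_get_block(size_t start, size_t n, double *buf) {  /* bulk fetch */
    g_round_trips++;
    for (size_t i = 0; i < n; i++) buf[i] = g_global[start + i];
}

/* Fine-grained access: one round trip per element -> latency-bound. */
double sum_fine(size_t start, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += dsm_get(start + i);
    return s;
}

/* Coarse-grained access: one round trip per block -> bandwidth-bound. */
double sum_coarse(size_t start, size_t n, double *scratch /* n doubles */) {
    dsm_get_block(start, n, scratch);
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += scratch[i];
    return s;
}
```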

Overview of existing paradigms: DSM There are some DSM approaches that do lend themselves to structured data: e.g. Linda (tuple-space) The Good: Almost universally flexible The Bad: These tend to scale even worse than simple unstructured DSM approaches. Granularity is too fine

Challenges for scalable parallel application development: Things that make parallel programs difficult to write Work orchestration Data management Things that inhibit scalability: Granularity Load balancing Data locality DSM

Overview of existing paradigms: RMI ("Remote Method Invocation") The master thread works through a computational agenda by calling rmi_broadcast(…, (*myFunction)); the RMI layer on each processor (Proc. 0 through Proc. 3) then invokes myFunction() locally. myFunction() is coarsely grained.

RMI Performance Features Coarse granularity Thread virtualization Queue many instances of myFunction() on each physical thread. The RMI infrastructure can migrate these instances to achieve load balancing.
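
A hedged sketch of thread virtualization, with names invented for illustration rather than taken from any real RMI runtime: each physical thread holds a queue of many small work instances, and instances can be migrated from a loaded queue to an idle one, which is how load balancing falls out of the model.

```c
#include <stddef.h>

/* One virtualized instance of a coarse-grained remote call such as
 * myFunction(); many of these are queued on each physical thread. */
typedef struct {
    void (*fn)(void *arg);   /* the coarsely grained function to invoke */
    void  *arg;              /* its argument block                      */
} WorkInstance;

typedef struct {
    WorkInstance *items;
    size_t        head, tail, capacity;
} WorkQueue;

/* Drain a physical thread's queue of virtual instances. */
static void run_local(WorkQueue *q) {
    while (q->head < q->tail) {
        WorkInstance w = q->items[q->head++];
        w.fn(w.arg);
    }
}

/* Crude load balancing: move instances from a loaded queue to an idle one. */
static void migrate(WorkQueue *from, WorkQueue *to, size_t count) {
    while (count-- > 0 && from->tail > from->head && to->tail < to->capacity)
        to->items[to->tail++] = from->items[--from->tail];
}
```

In a real runtime the migration happens between processors rather than between local queues, but the bookkeeping is the same idea.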

Overview of existing paradigms: RMI RMI can be language based: Java CHARM++ Or library based: RPC ARMI

Challenges for scalable parallel application development: Things that make parallel programs difficult to write Work orchestration Data management Things that inhibit scalability: Granularity Load balancing Data locality RMI

N tropy: A Library for Rapid Development of kd-tree Applications No existing paradigm gives us everything we need. Can we combine existing paradigms beneath a simple, yet flexible API?

N tropy: A Library for Rapid Development of kd-tree Applications Use RMI for orchestration Use DSM for data management Implementation of both is targeted towards astrophysics

A Simple N tropy Example: N-body Gravity Calculation Cosmological “N-Body” simulation: 100,000,000 particles, 1 TB of RAM, a volume 100 million light years across. [Diagram: the simulation volume is spatially decomposed across Proc 0 through Proc 8.]

A Simple N tropy Example: N-body Gravity Calculation The master thread works through the computational agenda, the particles (P1 … Pn) on which to calculate the gravitational force, by calling ntropy_Dynamic(…, (*myGravityFunc)); the N tropy master's RMI layer dispatches to N tropy threads on Proc. 0 through Proc. 3, each of which runs myGravityFunc().
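
From the analysis writer's point of view, the pattern on this slide might look roughly like the sketch below. The Particle type, the direct-summation force loop, and the serial stand-in for ntropy_Dynamic() are all illustrative assumptions; the real library dispatches the work function across processors and walks the distributed kd-tree instead of looping over every particle directly.

```c
#include <math.h>

typedef struct { double pos[3], mass, acc[3]; } Particle;

typedef struct {
    Particle *particles;   /* in the real library this data is distributed */
    long      n;
} Dataset;

/* Coarse-grained work function supplied by the analysis writer: compute
 * the gravitational acceleration on particle i from the whole dataset.
 * (Direct summation here; the real code would walk the kd-tree.) */
static void myGravityFunc(Dataset *d, long i) {
    Particle *pi = &d->particles[i];
    pi->acc[0] = pi->acc[1] = pi->acc[2] = 0.0;
    for (long j = 0; j < d->n; j++) {
        if (j == i) continue;
        double dx[3], r2 = 1e-10;               /* small softening */
        for (int k = 0; k < 3; k++) {
            dx[k] = d->particles[j].pos[k] - pi->pos[k];
            r2 += dx[k] * dx[k];
        }
        double f = d->particles[j].mass / (r2 * sqrt(r2));
        for (int k = 0; k < 3; k++) pi->acc[k] += f * dx[k];
    }
}

/* Serial stand-in for the library dispatch: the real ntropy_Dynamic()
 * queues these calls on N tropy threads on every processor and
 * load-balances them; here we simply loop over the agenda. */
static void ntropy_Dynamic_stub(Dataset *d, void (*fn)(Dataset *, long)) {
    for (long i = 0; i < d->n; i++) fn(d, i);
}

void compute_forces(Dataset *d) { ntropy_Dynamic_stub(d, myGravityFunc); }
```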

A Simple N tropy Example: N-body Gravity Calculation Cosmological “N-Body” simulation: 100,000,000 particles, 1 TB of RAM, a volume 100 million light years across. Resolving the gravitational force on any single particle requires the entire dataset. [Diagram: the simulation volume is spatially decomposed across Proc 0 through Proc 8.]

A Simple N tropy Example: N-body Gravity Calculation [Diagram: myGravityFunc() runs on N tropy threads on Proc. 0 through Proc. 3; work is distributed through each thread's RMI layer, while data is served through the shared N tropy DSM layer.]

N tropy Performance Features DSM allows performance features to be provided “under the hood”: Interprocessor data caching for both reads and writes Fewer than 1 in 100,000 off-PE requests actually result in communication. Updates through the DSM interface must be commutative A relaxed memory model allows multiple writers with no overhead Consistency is enforced through global synchronization
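
A hedged illustration of the two mechanisms just listed, with hypothetical names and a simulated remote fetch rather than N tropy's actual implementation: a local read-through cache satisfies almost all off-PE requests without communication, and updates are restricted to commutative accumulation so that multiple writers need no locks until a global synchronization merges their contributions.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_LINES 4096
#define MAX_CELLS   100000

/* Hypothetical cached remote tree cell; field names are placeholders. */
typedef struct { long global_id; double data[8]; int valid; } CacheLine;

static CacheLine cache[CACHE_LINES];
static long remote_fetches = 0;     /* how often we really had to communicate */

/* Stand-in for an actual off-processor fetch (e.g. a message to the owner PE). */
static void fetch_remote_cell(long global_id, double out[8]) {
    (void)global_id;
    remote_fetches++;
    memset(out, 0, 8 * sizeof(double));   /* placeholder payload */
}

/* Read-through cache: nearly all off-PE reads hit here, which is how fewer
 * than 1 in 100,000 requests turn into real communication. */
const double *dsm_read_cell(long global_id) {
    CacheLine *line = &cache[(size_t)global_id % CACHE_LINES];
    if (!line->valid || line->global_id != global_id) {
        fetch_remote_cell(global_id, line->data);
        line->global_id = global_id;
        line->valid = 1;
    }
    return line->data;
}

/* Commutative update: each writer accumulates locally; because "+=" is
 * order-independent, the partial sums from all PEs can be merged at the
 * next global synchronization without locks or a strict memory model.
 * (Assumes global_id < MAX_CELLS for this toy version.) */
static double local_accum[MAX_CELLS];
void dsm_accumulate(long global_id, double value) {
    local_accum[global_id] += value;
}
```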

N tropy Performance Features RMI allows further performance features Thread virtualization Divide the workload into many more pieces than physical threads Dynamic load balancing is achieved by migrating work elements as computation progresses.

N tropy Performance [Performance plot: spatial 3-point correlation function, 10 million particles, 3–4 Mpc scales; the curves compare no interprocessor data cache and no load balancing, interprocessor data cache without load balancing, and interprocessor data cache with load balancing.]

Why does the data cache make such a huge difference? [Diagram: a single processor running myGravityFunc().]

N tropy “Meaningful” Benchmarks The purpose of this library is to minimize development time! Development time for: 1. Parallel N-point correlation function calculator 2 years -> 3 months 2. Parallel Friends-of-Friends group finder 8 months -> 3 weeks

Conclusions Most approaches for parallel application development rely on providing a single paradigm in the most general possible manner Many scientific problems tend not to map well onto single paradigms Providing an ultra-general single paradigm inhibits scalability

Conclusions Scientists often borrow from several paradigms and implement them in a restricted and targeted manner. Almost all current HPC programs are written in MPI (“paradigm-less”): MPI is a “lowest common denominator” upon which any paradigm can be imposed.

Conclusions N tropy provides: Remote Method Invocation (RMI) Distributed Shared-Memory (DSM) Implementation of these paradigms is “lean and mean” Targeted specifically for the problem domain This approach successfully enables astrophysics data analysis Substantially reduces application development time Scales to thousands of processors More Information: Go to Wikipedia and search “Ntropy”