N Tropy: A Framework for Knowledge Discovery in a Virtual Universe
Harnessing the Power of Parallel Grid Resources for Astrophysical Data Analysis
Jeffrey P. Gardner
Pittsburgh Supercomputing Center, University of Pittsburgh, Carnegie Mellon University

Mining the Universe can be (Computationally) Expensive
Computational Astrophysics:
The size of simulations is frequently limited by the inability to process the simulation output (there are no parallel group finders).
Current MPPs have thousands of processors…this is already too large for serial processing.
The next generation of MPPs will have hundreds of thousands of processors!

Mining the Universe can be (Computationally) Expensive
Observational Astronomy:
Paradigm shift in astronomy: sky surveys. Observers now generate ~1 TB of data per night.
With Virtual Observatories, one can pool data from multiple catalogs.
Computational requirements are growing at a faster rate than computational power.
There will be some problems that would be impossible without parallel machines.
There will be many problems for which throughput can be substantially enhanced by parallel machines.

The Challenge of Parallel Data Analysis
Parallel programs are hard to write! Steep learning curve for parallel programming; lengthy development time.
The parallel world is dominated by simulations: code is often reused for many years by many people, so you can afford to spend a lot of time writing it.
Data analysis does not work this way: rapidly changing scientific inquiries and less code reuse.
The data-mining paradigm mandates rapid software development!

Tightly-Coupled Parallelism (what this talk is about)
Data and computational domains overlap, so computational elements must communicate with one another.
Examples: group finding, N-point correlation functions, new object classification, density estimation.
Solution(?): N Tropy

The Goal
GOAL: Minimize development time for parallel applications.
GOAL: Enable scientists with no parallel programming background (or no time to learn it) to still implement their algorithms in parallel.
GOAL: Provide seamless scalability from single-processor machines to MPPs…potentially even several MPPs in a computational Grid.
GOAL: Do not restrict the inquiry space.

Methodology
Limited data structures: astronomy deals with point-like data in an N-dimensional parameter space, and the most efficient methods on this kind of data use trees.
Limited methods: analysis methods perform a limited number of fundamental operations on these data structures.
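To make the "trees over point-like data" idea concrete, here is a minimal sketch of a tree node for points in an N-dimensional parameter space. It is illustrative only: the type and field names (NTP_NDIM, NTP_BOUND, NTP_NODE, iLower, iUpper, pLower, pUpper) are assumptions, not part of the NTropy API.

#define NTP_NDIM 3   /* hypothetical: dimensionality of the parameter space */

typedef struct ntpBound {
    double fMin[NTP_NDIM];   /* lower corner of the node's bounding box */
    double fMax[NTP_NDIM];   /* upper corner of the node's bounding box */
} NTP_BOUND;

typedef struct ntpNode {
    NTP_BOUND bnd;   /* bounds of all particles covered by this node */
    int iLower;      /* index of left child, or -1 if this node is a bucket */
    int iUpper;      /* index of right child, or -1 if this node is a bucket */
    int pLower;      /* index of the first particle in this node */
    int pUpper;      /* index of the last particle in this node */
} NTP_NODE;

Group finding, N-point correlations, and density estimation all reduce to traversals over nodes like these, pruning whole subtrees with the bounding-box information.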

N Tropy Design
PKDGRAV already provides a number of advanced services. PKDGRAV benefits to keep:
Flexible client-server scheduling architecture: threads respond to service requests issued by the master. To do a new task, simply add a new service.
Portability: interprocessor communication occurs via high-level requests to the "Machine-Dependent Layer" (MDL), which is rewritten to take advantage of each parallel architecture.
Advanced interprocessor data caching: fewer than 1 in 100,000 off-PE requests actually result in communication.
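As a rough sketch of the client-server pattern described above (the names AddService, SRV_DENSITY, the handler signature, and the input/output structs are hypothetical, not PKDGRAV's actual API), adding a new parallel operation amounts to registering one worker-side handler that the master can then invoke by ID on every thread:

/* Hypothetical registration call, sketched for illustration only. */
void AddService(int sid, void (*fcn)(void *, void *, int, void *, int *),
                int nInBytes, int nOutBytes);

#define SRV_DENSITY 42   /* hypothetical ID for a new density-estimation service */

struct densityIn  { double fSmoothingRadius; };
struct densityOut { int nParticlesDone; };

/* Worker-side handler: runs on every thread when the master requests SRV_DENSITY. */
void srvDensity(void *threadCtx, void *vIn, int nIn, void *vOut, int *pnOut)
{
    struct densityIn *in = vIn;
    struct densityOut *out = vOut;
    /* ...estimate densities for this thread's local particles using
     * in->fSmoothingRadius... */
    out->nParticlesDone = 0;
    *pnOut = sizeof(*out);
}

void registerServices(void)
{
    /* One registration makes the new operation available to the master. */
    AddService(SRV_DENSITY, srvDensity,
               sizeof(struct densityIn), sizeof(struct densityOut));
}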

N Tropy New Features
Dynamic load balancing (available now): workload and processor domain boundaries can be dynamically reallocated as the computation progresses.
Data pre-fetching (to be implemented): predict and request the off-PE data that will be needed for upcoming tree nodes.
mdlSlurp(): prefetch a large block of data into a special block of local memory.
Intelligent prediction: investigate active-learning algorithms to prefetch off-PE data.
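The prefetching idea can be illustrated with a small sketch. mdlSlurp() is named on the slide, but its signature here, and the helper ntpNS_QueuePrefetch, are assumptions; the point is only the shape of the pattern: queue the off-PE nodes you expect to visit, then pull them over in one bulk transfer instead of many small cache misses.

/* Hypothetical sketch of bulk prefetching for an upcoming batch of tree nodes. */
void prefetchUpcomingNodes(NTP ntp, NS *nodeList, int nNodes, void *stagingBuf)
{
    int i;
    for (i = 0; i < nNodes; ++i) {
        /* Queue each predicted off-PE node for the bulk transfer
         * (ntpNS_QueuePrefetch is an assumed name). */
        ntpNS_QueuePrefetch(ntp, &nodeList[i]);
    }
    /* One large transfer into local memory; later ACQUIRE_NODEPTR calls on
     * these nodes would then be satisfied locally (signature assumed). */
    mdlSlurp(ntp, stagingBuf, nNodes);
}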

Performance
[Scaling plot: 10 million particles, spatial 3-point correlation, 3 to 4 Mpc. SDSS DR1 takes less than 1 minute with perfect load balancing.]

PHASE 1 Performance
[Scaling plot: 10 million particles, projected 3-point correlation, 0 to 3 Mpc.]

NTropy Conceptual Schematic
[Schematic: a Computational Steering Layer (C, C++, Python, Fortran?) and a Web Service Layer (VO; WSDL? SOAP?; at least from Python) sit above the framework ("black box"). Framework components: domain decomposition/tree building, tree traversal, parallel I/O, tree services, collectives, and dynamic workload management. User-supplied pieces: serial collective staging and processing routines, serial I/O routines, tree traversal routines, and tree and particle data.]

N Tropy Design
Use only the Parallel Management Layer (pst) and Interprocessor Communication Layer (mdl) of PKDGRAV; rewrite everything else from scratch.
[PKDGRAV functional layout: a Computational Steering Layer; a Parallel Client/Server Layer (pst) that executes on the master processor and coordinates execution and data distribution among processors; a Serial Layer (Gravity Calculator, Hydro Calculator) that executes on all processors; and an Interprocessor Communication Layer (mdl) that passes data between processors.]

N Tropy Functional Layout
[Schematic: an application steering script drives the NTropy Computational Steering API (ntropy_* functions), which executes on the master thread only and sits on the PKDGRAV Parallel Client/Server Layer. Application "tasks" (serial code) and tree node/particle navigation methods (e.g. acquire node/particle, next node/particle, parent node, child node; ntp_* functions) are "user"-supplied methods executed on all threads through the thread services layer, above the MDL interprocessor communication layer. The client/server and MDL layers are retained from PKDGRAV; the other layers are completely rewritten.]

Using Ntropy (example: N-body)
Computational Steering Layer (nbody.c)

#include "ntropy.h"   /* Mandatory */
#include "nbody.h"    /* My application-specific stuff */

int main(int argc, char **argv)
{
    NTROPY ntropy;              /* Mandatory */
    struct nbodyStruct nbody;   /* My app-specific struct */
    FILE *fpLogFile;

    /* Start ntropy.  Only the master thread will ever return from this
     * function.  The slave threads will return to "main_ch" in ntropy.c. */
    ntropy = ntropy_Initialize(argc, argv);

    /* Process command line arguments */
    /* Add nbody-specific command line arguments */
    nbodyAddParams(ntropy, &nbody);
    /* Read in command line arguments */
    ntropy_ParseParams(ntropy, argc, argv);
    /* Now that the command-line parameters have been parsed, we can
     * examine them. */
    nbodyProcParams(ntropy, &nbody);
    /* Open log file and write header */
    fpLogFile = nbodyLogParams(ntropy, &nbody);

[Slide key: NTropy calls vs. app-specific calls.]

Using Ntropy (example: N-body)
Computational Steering Layer (nbody.c)

    /* Start threads by calling ntropy_ThreadStart.  This starts NTropy
     * computational services on all compute threads, pushes the data in the
     * Global and Param structs to all threads, and runs the function
     * tskLocalInit on all threads.  tskLocalInit is the constructor for your
     * thread environment, and is convenient if you want to set up any local
     * structs.  You can also provide tskLocalFinish, which is the destructor
     * for your thread environment. */
    nThreads = ntropy_ThreadStart(ntropy, &(nbody.global), sizeof(nbody.global),
                                  &(nbody.param), sizeof(nbody.param),
                                  (*tskLocalInit), (*tskLocalFinish));

And in task_fof.c:

void *tskLocalInit(NTP ntp)
{
    struct nbodyLocalStruct *local;   /* A struct that I invent that will store
                                       * all "thread-global" variables that I
                                       * will need for this computation */
    local = (struct nbodyLocalStruct *)malloc(sizeof(struct nbodyLocalStruct));
    return (void *)local;
}

void tskLocalFinish(NTP ntp)
{
    struct nbodyLocalStruct *local = ntp_GetLocal(ntp);
    free(local);
}

Using Ntropy (example: N-body)
Computational Steering Layer (nbody.c)

N-body structs:

/* This struct has everything that I read in from the command line */
struct nbodyParamStruct {
    double dOmega0;
    int nSteps;
    ...
};

/* This struct has everything that I don't read in from the command line,
 * but still want all threads to know */
struct nbodyGlobalStruct {
    double dRedshift;
    double dExpansionFactor;
    ...
};

/* This struct has everything that I store locally in each compute thread. */
struct nbodyLocalStruct {
    int nParticleInteractions;
    int nCellInteractions;
    ...
};

Using Ntropy (example: N-body)
Computational Steering Layer (nbody.c)

Main N-body loop:

    nbody.iStep = 0;
    nbody.iOut = 0;
    while (nbody.iStep == 0 || nbody.iOut < nbodySteps(&nbody)) {
        /* Build the tree */
        ntropy_BuildTree(ntropy, (*tskCellAccum), (*tskBucketAccum));

        /* Do the gravity walk using dynamic load balancing */
        ntropy_Dynamic(ntropy, (*tskGravityInit), (*tskGravityFinish),
                       (*tskGravity));
        /* Or, if you don't want dynamic load balancing for this function,
         * use ntropy_Static() */

        /* Write results if needed (nbodyWriteOutput checks whether it needs
         * to output.  If so, it calls ntropy_WriteParticles(), then
         * increments nbody.iOut.) */
        nbodyWriteOutput(ntropy, &nbody);
        ++(nbody.iStep);
    }

    /* Finish */
    ntropy_Finish(ntropy);

Using Ntropy (example: N-body)
Task Layer (nbody_task.c)

void tskGravity(NTP ntp, NS *pNodeSpecStart)
{
    /* Get all the structs that I will need */
    struct nbodyParamStruct *param = ntp_GetParam(ntp);
    struct nbodyGlobalStruct *global = ntp_GetGlobal(ntp);
    struct nbodyLocalStruct *local = ntp_GetLocal(ntp);

    NS nodeSpecBucket;   /* Handle for the current bucket node */
    NP nodePtrBucket;    /* Pointer to the data of the current bucket */
    NS nodeSpecDone;     /* The handle of the node that is "next" for
                          * pNodeSpecStart.  When we walk to this node, we
                          * will be done. */

    ntpNS_Next(ntp, pNodeSpecStart, &nodeSpecDone);
    ntpNS_Copy(ntp, pNodeSpecStart, &nodeSpecBucket);

    /* Find each bucket that is in pNodeSpecStart's domain and do a tree walk
     * for it */
    while (ntpNS_Compare(ntp, nodeSpecBucket, nodeSpecDone)) {
        ACQUIRE_NODEPTR(ntp, nodePtrBucket, nodeSpecBucket);
        while (ntpNP_Type(ntp, nodePtrBucket) != NTP_NODE_BUCKET) {
            ntpNS_ReplaceLower(ntp, &nodeSpecBucket, nodePtrBucket);
            ACQUIRE_NODEPTR(ntp, nodePtrBucket, nodeSpecBucket);
        }
        /* Now nodeSpecBucket and nodePtrBucket are pointing at the next
         * bucket to process. */
        MyWalkBucket(ntp, nodeSpecBucket, nodePtrBucket);
        /* ...advance nodeSpecBucket to the next bucket (elided on the slide)... */
    }
}

Using Ntropy (example: N-body)
Task Layer (nbody_task.c)

void MyWalkBucket(NTP ntp, NS nodeSpecBucket, NP nodePtrBucket)
{
    NS nodeSpec;   /* Handle for the node currently being looked at */
    NP nodePtr;    /* Pointer to the data of the node currently being looked at */
    int iWalkResult, iInteractResult;

    ntpNS_PstStart(ntp, &nodeSpec);   /* Get the handle of the root of the PST */

    /* Walk from the root until we run off the end of the PST */
    do {
        ACQUIRE_NODEPTR(ntp, nodePtr, nodeSpec);
        iInteractResult = MyCellCellInteract(ntp, nodeSpecBucket, nodePtrBucket,
                                             nodePtr, nodeSpec);
        if (iInteractResult == MY_CELL_OPEN)
            iWalkResult = ntpNS_ReplaceLower(ntp, &nodeSpec, nodePtr);
        else  /* iInteractResult == MY_CELL_NEXT */
            iWalkResult = ntpNS_ReplaceNext(ntp, &nodeSpec, nodePtr);
    } while (iWalkResult != NTP_NULL_PST);
}
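MyCellCellInteract is not shown on these slides. As a hypothetical sketch of what such a user-supplied decision routine might look like (the accessors ntpNP_Center and ntpNP_Size are assumptions, not documented NTropy calls), it compares the candidate cell's size with its distance from the bucket and tells the walk whether to open the cell or accept it and move on:

int MyCellCellInteract(NTP ntp, NS nodeSpecBucket, NP nodePtrBucket,
                       NP nodePtr, NS nodeSpec)
{
    double bucketPos[3], cellPos[3];
    double dx, r2 = 0.0;
    double size;
    int d;

    ntpNP_Center(ntp, nodePtrBucket, bucketPos);   /* assumed accessor */
    ntpNP_Center(ntp, nodePtr, cellPos);           /* assumed accessor */
    for (d = 0; d < 3; ++d) {
        dx = cellPos[d] - bucketPos[d];
        r2 += dx * dx;
    }

    /* Simple opening criterion: open the cell if it subtends too large an
     * angle (theta ~ 0.5) as seen from the bucket. */
    size = ntpNP_Size(ntp, nodePtr);               /* assumed accessor */
    if (size * size > 0.25 * r2)
        return MY_CELL_OPEN;

    /* Otherwise accept the cell: accumulate its contribution onto the bucket
     * here, then tell the walk to move to the next node. */
    return MY_CELL_NEXT;
}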

N Tropy Functional Layout
[Schematic repeated from the earlier N Tropy Functional Layout slide.]

Further NTropy Features
Custom cell and particle data.
Selectable cache types: read-only, combiner.
Explicit control over cache starting and stopping; cache statistics.
10 user-selectable timers and 4 "automatic" timers.
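A hypothetical sketch of how the cache control and timer features listed above might appear in a task; the names ntp_TimerStart, ntp_TimerStop, ntp_CacheStart, ntp_CacheFinish, NTP_CACHE_READONLY, and TIMER_GRAVITY are assumptions, not the documented API:

    ntp_TimerStart(ntp, TIMER_GRAVITY);        /* one of the user-selectable timers */
    ntp_CacheStart(ntp, NTP_CACHE_READONLY);   /* read-only cache for off-PE tree nodes */

    /* ...tree walk that acquires remote nodes through the cache... */

    ntp_CacheFinish(ntp);                      /* stop the cache; statistics collected here */
    ntp_TimerStop(ntp, TIMER_GRAVITY);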

Further NTropy Features
Custom command-line and parameter-file options.
Automatic reduction variables.
A range of collectives (AllGather, AllToAll, AllReduce).
I/O primitives (TIPSY "array" and "vector" files) as well as flexible user-defined I/O.
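As an illustration of how an "automatic reduction variable" might be used (ntp_ReduceAdd is an assumed name, not the documented API), each thread could contribute its local interaction counts at the end of a task and let the framework sum them across all threads:

void tskGravityFinish(NTP ntp)
{
    struct nbodyLocalStruct *local = ntp_GetLocal(ntp);

    /* Contribute this thread's tallies; the framework combines them with an
     * AllReduce-style sum across all threads (call name assumed). */
    ntp_ReduceAdd(ntp, "nParticleInteractions", local->nParticleInteractions);
    ntp_ReduceAdd(ntp, "nCellInteractions", local->nCellInteractions);
}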

Conclusions
Most data analysis in astronomy is done using trees as the fundamental data structure.
Most operations on these tree structures are functionally identical.
Based on our studies so far, it appears feasible to construct a general-purpose parallel framework that users can rapidly customize to their needs.