TMC BioGrid: A GCC Consortium
Ken Kennedy
Center for High Performance Software Research (HiPerSoft)
Rice University

HiPerSoft
NSF VGrADS ITR
—GrADS project phasing out
DOE Los Alamos Computer Science Institute (LACSI)
—LANL, Rice, UH, Tennessee, UNC, UNM
Collaborations with two NSF PACIs
Telescoping Languages Project (NSF, DOE, DoD, Texas)
—Domain languages based on Matlab and S
DOE SciDAC Languages Project (John Mellor-Crummey)
—Co-Array Fortran
Gulf Coast Center for Computational Cancer Research
—Rice and MD Anderson Cancer Center
Two NSF Major Research Infrastructure Grants
—Two teraflop clusters (Itanium and Opteron)
Houston BioGrid

Texas Medical Center BioGrid
A partnership under the Gulf Coast Consortia
—Participants: Rice, UH, Baylor College of Medicine, MD Anderson Cancer Center
Goals
—Foster research and development on the application of Grid computing technology to biomedicine
—Construct a useful Grid computational infrastructure
Current Infrastructure
—Machines: Itanium and Opteron clusters at UH, Itanium cluster at Rice, Pentium cluster at Baylor, MD Anderson pending
—Interconnection: 10-gigabit optical interconnect among Rice, UH, Baylor, and MD Anderson in progress; connection to National LambdaRail pending
—Software: Globus + VGrADS software stack (see next slide)

BioGrid Principal Investigators
Don Berry, MD Anderson
Bradley Broom, MD Anderson
Wah Chiu, Baylor
Richard Gibbs, Baylor
Lennart Johnsson, Houston
Ken Kennedy, Rice
Charles Koelbel, Rice
John Mellor-Crummey, Rice
Moshe Vardi, Rice

BioGrid Research
Software Research
—Virtual Grid Application Development Software (VGrADS) Project
–Producing software that will make it easy to develop Grid applications with optimized performance
–Automatic scheduling and launching on the Grid based on performance models
–Distribution of the software stack to construct testbeds
Applications
—EMAN: 3D image reconstruction application suite (Baylor, Rice, UH)
–Automatically translated (by VGrADS) to Grid execution with load-balanced scheduling
—Script-based integration and analysis of experimental cancer databases planned (MD Anderson, Rice)

The VGrADS Team
VGrADS is an NSF-funded Information Technology Research project.
Keith Cooper
Ken Kennedy
Charles Koelbel
Linda Torczon
Rich Wolski
Fran Berman
Andrew Chien
Henri Casanova
Carl Kesselman
Lennart Johnsson
Dan Reed
Jack Dongarra
Plus many graduate students, postdocs, and technical staff!

VGrADS Principal Investigators
Francine Berman, UCSD
Henri Casanova, UCSD
Andrew Chien, UCSD
Keith Cooper, Rice
Jack Dongarra, Tennessee
Lennart Johnsson, Houston
Ken Kennedy, Rice
Charles Koelbel, Rice
Carl Kesselman, USC ISI
Dan Reed, UIUC
Richard Tapia, Rice
Linda Torczon, Rice
Rich Wolski, UCSB

National Distributed Problem Solving
(Diagram: databases and supercomputers linked across a national Grid.)

VGrADS Vision
Build a national problem-solving system on the Grid
—Transparent to the user, who sees a problem-solving system
Why don't we have this today?
—Complex application development
–Dynamic resources require adaptivity
–Unreliable resources require fault tolerance
–Uncoordinated resources require management
—Weak programming tools and models
–Tied to physical resources
–If programming is hard, the Grid will not reach its potential
What do we propose as a solution?
—Virtual Grids (vgrids) raise the level of abstraction
—Tools exploit vgrids and provide a better user interface

GrADSoft Architecture
(Architecture diagram: the Program Preparation System takes the source application, software components, and libraries through a whole-program compiler and binder to produce a configurable object program; the Execution Environment runs it via a scheduler and resource negotiator (negotiation), the Grid runtime system, and a real-time performance monitor, with detected performance problems generating performance feedback to program preparation.)

The Virtual Grid Application Development Software (VGrADS) Project
Ken Kennedy
Center for High Performance Software
Rice University

The VGrADS Vision: National Distributed Problem Solving
Where We Want To Be
—Transparent Grid computing
–Submit job
–Find & schedule resources
–Execute efficiently
Where We Are
—Low-level hand programming
What Do We Need?
—A more abstract view of the Grid
–Each developer sees a specialized "virtual grid"
—Simplified programming models built on the abstract view
–Permit the application developer to focus on the problem
(Diagram: databases and supercomputers on the Grid.)

The Original GrADS Vision
(The same architecture diagram as on the GrADSoft Architecture slide above.)

Lessons from GrADS
Mapping and Scheduling for MPI Jobs is Hard
—Although we were able to do some interesting experiments
Performance Model Construction is Hard
—Hybrid static/dynamic schemes are best
—Difficult for application developers to do by hand
Heterogeneity is Hard
—We completely revised the launching mechanisms to support this
—Good scheduling is critical
Rescheduling/Migration is Hard
—Requires application collaboration (generalized checkpointing)
—Requires performance modeling to determine profitability
Scaling to Large Grids is Hard
—Scheduling becomes expensive

VGrADS Virtual Grid Hierarchy

Virtual Grids and Tools
Abstract Resource Request
—Permits true scalability by mapping from requirements to a set of resources
–Scalable search produces a manageable resource set
—Virtual Grid services permit effective scheduling
–Fault tolerance, performance stability
Look-Ahead Scheduling
—Applications map to directed graphs
–Vertices are computations, edges are data transfers
—Scheduling is done on the entire graph (see the sketch below)
–Using automatically constructed performance models for computations
–Depends on load prediction (Network Weather Service)
Abstract Programming Interfaces
—Application graphs constructed from scripts
–Written in standard scripting languages (Python, Perl, Matlab)
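A minimal sketch of the look-ahead scheduling idea appears below. It is not VGrADS code: the workflow shape, resource names, and predicted runtimes are hypothetical, and data-transfer costs on the edges are ignored. The sketch walks the DAG in topological order and maps each task to the resource that minimizes its predicted completion time.

    # Toy look-ahead scheduler over a workflow DAG (hypothetical data, not VGrADS code).
    workflow = {                      # vertex -> successors
        "preprocess":  ["classify_a", "classify_b"],
        "classify_a":  ["reconstruct"],
        "classify_b":  ["reconstruct"],
        "reconstruct": [],
    }
    resources = ["itanium_cluster", "opteron_cluster"]

    # Predicted minutes per (task, resource), as a performance model might supply.
    predicted = {
        ("preprocess",  "itanium_cluster"): 10, ("preprocess",  "opteron_cluster"):  6,
        ("classify_a",  "itanium_cluster"): 90, ("classify_a",  "opteron_cluster"): 25,
        ("classify_b",  "itanium_cluster"): 90, ("classify_b",  "opteron_cluster"): 25,
        ("reconstruct", "itanium_cluster"): 30, ("reconstruct", "opteron_cluster"): 40,
    }

    def topological_order(dag):
        """Order vertices so that every edge points forward in the list."""
        indegree = {v: 0 for v in dag}
        for succs in dag.values():
            for s in succs:
                indegree[s] += 1
        ready = [v for v, d in indegree.items() if d == 0]
        order = []
        while ready:
            v = ready.pop()
            order.append(v)
            for s in dag[v]:
                indegree[s] -= 1
                if indegree[s] == 0:
                    ready.append(s)
        return order

    def schedule(dag, resources, predicted):
        """Map each task to the resource minimizing its predicted completion time."""
        preds = {v: [u for u in dag if v in dag[u]] for v in dag}
        finish = {}                              # task -> predicted finish time
        free_at = {r: 0 for r in resources}      # time each resource becomes free
        mapping = {}
        for task in topological_order(dag):
            ready_time = max((finish[p] for p in preds[task]), default=0)
            best = min(resources,
                       key=lambda r: max(ready_time, free_at[r]) + predicted[(task, r)])
            start = max(ready_time, free_at[best])
            finish[task] = start + predicted[(task, best)]
            free_at[best] = finish[task]
            mapping[task] = best
        return mapping, max(finish.values())     # mapping and predicted makespan

    mapping, makespan = schedule(workflow, resources, predicted)
    print(mapping, makespan)

Scheduling the whole graph at once, rather than task by task as work arrives, is what lets the predicted finish times of predecessors inform where each successor should run.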

Virtual Grids
Goal: Provide an abstract view of Grid resources for application use
—We will need to experiment to get the right abstractions
Assumptions:
—Underlying scalable information service
—Shared, widely distributed, heterogeneous resources
—Scaling and robustness for high load factors on the Grid
—Separation of the application and resource management system
Basic Approach:
—Specify a vgrid as a hierarchy of ...
–Aggregation operators (ClusterOf, LooseBagOf, etc.) with ...
–Constraints (type of processor, installed software, etc.) and ...
–Application-based rankings (e.g. predicted execution time)
—Execution system returns a (candidate) vgrid, structured as requested
—Application can use it as it sees fit and make further requests (a sketch of such a request follows below)
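To make the hierarchical request concrete, here is a minimal Python sketch of what such a specification might look like. The classes and field names are hypothetical placeholders, not the project's actual description language or API; only the operator names (ClusterOf, LooseBagOf) come from the slide.

    # Hypothetical illustration of a hierarchical vgrid request (not the real interface).
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class ResourceSpec:
        constraints: Dict[str, object]           # e.g. {"software": "EMAN", "min_nodes": 16}
        rank: str = "predicted_exec_time"        # application-supplied ranking criterion

    @dataclass
    class Aggregate:
        operator: str                            # "ClusterOf", "LooseBagOf", ...
        spec: ResourceSpec
        count: int                               # how many units of this aggregate
        children: List["Aggregate"] = field(default_factory=list)

    # Request: a loose bag of two clusters, each providing at least 16 nodes
    # with EMAN installed, ranked by predicted execution time.
    request = Aggregate(
        operator="LooseBagOf",
        spec=ResourceSpec(constraints={}),
        count=2,
        children=[
            Aggregate(
                operator="ClusterOf",
                spec=ResourceSpec(constraints={"software": "EMAN", "min_nodes": 16}),
                count=16,
            )
        ],
    )
    print(request)

The execution system would answer with a candidate vgrid structured the same way, which the application can then use or refine with further requests.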

Programming Tools
Focus: Automating critical application-development steps
—Building workflow graphs
–From the Python scripts used by EMAN
—Scheduling workflow graphs
–Heuristics required (the problems are NP-complete at best)
–Good initial results when accurate predictions of resource performance are available (see EMAN demo)
—Constructing performance models
–Based on loop-level performance models of the application
–Requires benchmarking with (relatively) small data sets and extrapolating to larger cases (see the sketch below)
—Initiating application execution
–Optimize and launch the application on heterogeneous resources
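The "benchmark small, extrapolate large" step can be illustrated with a short sketch. The timings and the assumed cost terms (linear plus n log n) are invented for illustration; the slides describe models built from loop-level analysis, so this least-squares fit is only a stand-in for that machinery.

    # Illustrative extrapolation from small benchmark runs (invented numbers).
    import math

    # (problem size, measured seconds) from small benchmark runs
    samples = [(1_000, 0.8), (2_000, 1.9), (4_000, 4.1), (8_000, 8.7)]

    def fit_two_features(samples, features):
        """Least-squares fit of t ~ c1*f1(n) + c2*f2(n) via 2x2 normal equations."""
        s11 = s12 = s22 = y1 = y2 = 0.0
        for n, t in samples:
            f1, f2 = features(n)
            s11 += f1 * f1; s12 += f1 * f2; s22 += f2 * f2
            y1 += f1 * t;   y2 += f2 * t
        det = s11 * s22 - s12 * s12
        c1 = (y1 * s22 - y2 * s12) / det
        c2 = (y2 * s11 - y1 * s12) / det
        return c1, c2

    features = lambda n: (n, n * math.log(n))   # assumed cost terms
    c1, c2 = fit_two_features(samples, features)

    def predict_seconds(n):
        f1, f2 = features(n)
        return c1 * f1 + c2 * f2

    # Extrapolate to a production-size run that was never benchmarked directly.
    print(f"predicted time for n=1,000,000: {predict_seconds(1_000_000):.1f} s")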

VGrADS Demos at SC04
EMAN - Electron Microscopy Analysis [BCM, Rice, Houston]
—3D reconstruction of particles from electron micrographs
—Workflow scheduling and performance prediction to optimize mapping
(Figure: the EMAN refinement process.)

EMAN Workflow Scheduling Experiment
Testbed
—64 dual-processor Itanium IA-64 nodes (900 MHz) at the Rice University Terascale Cluster [RTC]
—60 dual-processor Itanium IA-64 nodes (1300 MHz) at the University of Houston [acrl]
—16 Opteron nodes (2009 MHz) at the University of Houston Opteron cluster [medusa]
Experiment
—Ran the EMAN refinement cycle and compared running times for "classesbymra", the most compute-intensive parallel step in the workflow
—Determined the 3D structure of the 'rdv' virus particle with a large input data set [2 GB]

Results: Efficient Scheduling
We compared the following workflow scheduling strategies:
1. Heuristic scheduling with accurate performance models generated semi-automatically (HAP)
2. Heuristic scheduling with crude performance models based on the CPU power of the resources (HCP)
3. Random scheduling with no performance models (RNP)
4. Weighted random scheduling with accurate performance models (RAP)
We compared the makespan of the "classesbymra" step under the different strategies; a toy illustration of the four strategies follows below.
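The toy simulation below is meant only to show why the four strategies produce different makespans. The node counts, per-instance runtimes, and crude estimates are invented, and the classesbymra instances are treated as independent tasks; these are not the experiment's numbers.

    # Toy comparison of HAP, HCP, RNP, and RAP on invented data.
    import random
    random.seed(0)

    instances = 200
    # (name, true minutes per instance, crude clock-based estimate)
    nodes = [("RTC", 34.0, 28.0)] * 50 + [("medusa", 10.0, 18.0)] * 13

    def makespan(assignment):
        """True finishing time of the most loaded node for a given mapping."""
        load = [0.0] * len(nodes)
        for i in assignment:
            load[i] += nodes[i][1]
        return max(load)

    def heuristic(col):
        """Greedy min-completion-time mapping using column `col` as the estimate."""
        est_load, picks = [0.0] * len(nodes), []
        for _ in range(instances):
            best = min(range(len(nodes)), key=lambda i: est_load[i] + nodes[i][col])
            est_load[best] += nodes[best][col]
            picks.append(best)
        return picks

    def random_mapping(weights=None):
        return random.choices(range(len(nodes)), weights=weights, k=instances)

    strategies = {
        "HAP": heuristic(1),                                   # accurate model
        "HCP": heuristic(2),                                   # crude, clock-only model
        "RNP": random_mapping(),                               # uniform random
        "RAP": random_mapping([1.0 / n[1] for n in nodes]),    # weighted by accurate speed
    }
    for name, mapping in strategies.items():
        print(name, round(makespan(mapping), 1), "minutes")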

Results: Efficient Scheduling
(Results table comparing the four strategies on a set of 50 RTC nodes and 13 medusa nodes; columns: scheduling method, # instances mapped to RTC (IA-64), # instances mapped to medusa (Opteron), # nodes picked at RTC, # nodes picked at medusa, execution time at RTC (minutes), execution time at medusa (minutes), overall makespan (minutes).)
HAP - Heuristic, Accurate PerfModel
HCP - Heuristic, Crude PerfModel
RNP - Random, No PerfModel
RAP - Random, Accurate PerfModel

Results: Load Balance
(Results table for a run on 43 RTC nodes, 14 medusa nodes, and 39 acrl nodes; columns: # instances mapped to RTC (IA-64), # instances mapped to medusa (Opteron), # instances mapped to acrl (IA-64), execution time at RTC (minutes), execution time at medusa (minutes), execution time at acrl (minutes), overall makespan (minutes).)
Good load balance due to accurate performance models

Results: Accuracy of Performance Models
Our performance models were quite accurate:
—rank[RTC_node] / rank[medusa_node] = 3.41
—actual_exec_time[RTC_node] / actual_exec_time[medusa_node] = 3.82
—rank[acrl_node] / rank[medusa_node] = 2.36
—actual_exec_time[acrl_node] / actual_exec_time[medusa_node] = 3.01
Accurate relative performance-model values result in an efficient load balance of the classesbymra instances (a quick check of these ratios appears below).
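A quick arithmetic check of the quoted ratios, assuming nothing beyond the numbers on this slide:

    # Relative error between model-predicted rank ratios and measured time ratios.
    ratios = {
        "RTC vs medusa":  (3.41, 3.82),   # (predicted rank ratio, measured time ratio)
        "acrl vs medusa": (2.36, 3.01),
    }
    for pair, (predicted, measured) in ratios.items():
        relative_error = abs(predicted - measured) / measured
        print(f"{pair}: relative error {relative_error:.0%}")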

Final Comments
The TMC BioGrid
—An effort to solve important problems in computational biomedicine
—Uses the Grid to pool resources
–10 Gbps interconnect
–Pooled computational resources at the participating institutions
The Challenge
—Making it easy to build Grid applications
Our Approach
—Build on the VGrADS tools effort
–Performance-model-based scheduling on abstract Grids
EMAN Challenge Problem
—End goal: 3000 Opterons for 100 hours