Data analysis in I2U2. I2U2 all-hands meeting. Michael Wilde, Argonne MCS / University of Chicago Computation Institute. 12 Dec 2005.


Scaling up Social Science: Parallel Citation Network Analysis. Work of James Evans, University of Chicago, Department of Sociology.

Scaling up the analysis
- Database queries of 25+ million citations
- Work started on small workstations
- Queries grew to month-long duration
- With the database distributed across the U of Chicago TeraPort cluster:
  - 50 (faster) CPUs gave a 100X speedup
  - Many more methods and hypotheses can be tested!
- Grid enables deeper analysis and wider access

Grids Provide Global Resources To Enable e-Science

Why Grids? eScience is the initial motivator…
- New approaches to inquiry based on:
  - Deep analysis of huge quantities of data
  - Interdisciplinary collaboration
  - Large-scale simulation and analysis
  - Smart instrumentation
  - Dynamically assembling the resources to tackle a new scale of problem
- Enabled by access to resources and services without regard for location and other barriers
…but eBusiness is catching up rapidly, and this will benefit both domains.

Technology that enables the Grid
- Directory to locate grid sites and services
- Uniform interface to computing sites
- Fast and secure data set mover
- Directory to track where datasets live
- Security to control access
- Toolkits to create application services
  - Globus, Condor, VDT, and many more

Virtual Data and Workflows
- The next challenge is managing and organizing the vast computing and storage capabilities provided by Grids
- Workflow expresses computations in a form that can be readily mapped to Grids
- Virtual data keeps accurate track of data derivation methods and provenance
- Grid tools virtualize the location and caching of data, and recovery from failures

Virtual Data Process
- Data derivation and analysis steps are described in a high-level workflow language (VDL)
- VDL is cataloged in a database for sharing by the community
- Workflows for the Grid are generated automatically from VDL
- Provenance of derived results goes back into the catalog for assessment or verification

Virtual Data Lifecycle
- Describe
  - Record the processing and analysis steps applied to the data
  - Document the devices and methods used to measure the data
- Discover
  - I have some subject images: what analyses are available? Which can be applied to this format?
  - I'm a new team member: what are the methods and protocols of my colleagues?
- Reuse
  - I want to apply an image registration program to thousands of objects. If the results already exist, I'll save weeks of computation.
- Validate
  - I've come across some interesting data, but I need to understand the nature of the preprocessing applied when it was constructed before I can trust it for my purposes.

Virtual Data Workflow Abstracts Grid Details

Workflow - the next programming model?

Virtual Data Example: Galaxy Cluster Search (Sloan Data). Galaxy cluster size distribution DAG. Jim Annis, Steve Kent, Vijay Sehkri (Fermilab); Michael Milligan, Yong Zhao (University of Chicago).

A virtual data glossary
- virtual data: defining data by the logical workflow needed to create it virtualizes it with respect to location, existence, failure, and representation
- VDS (Virtual Data System): the tools to define, store, manipulate and execute virtual data workflows
- VDT (Virtual Data Toolkit): a larger set of tools, based on NMI; VDT provides the Grid environment in which VDL workflows run
- VDL (Virtual Data Language): a language (text and XML) that defines the functions and function calls of a virtual data workflow
- VDC (Virtual Data Catalog): the database and schema that store VDL definitions

What must we “virtualize” to compute on the Grid?
- Location-independent computing: represent all workflow in abstract terms
- Declarations not tied to specific entities:
  - sites
  - file systems
  - schedulers
- Failures: automated retry when data servers or execution sites are unavailable

Expressing Workflow in VDL

  TR grep (in a1, out a2) {
    argument stdin = ${a1};
    argument stdout = ${a2};
  }
  TR sort (in a1, out a2) {
    argument stdin = ${a1};
    argument stdout = ${a2};
  }
  DV grep ...
  DV sort ...

[Slide diagram: file1 -> grep -> file2 -> sort -> file3]

Expressing Workflow in VDL (annotated slide; same VDL as above)

Slide callouts:
- TR: define a “function” wrapper for an application, with its “formal arguments”
- DV: define a “call” to invoke an application, providing “actual” argument values for the invocation
- Connect applications via output-to-input dependencies
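The DV statements above are truncated in this transcript. As a hedged sketch, complete invocations in textual VDL might look roughly as follows, using the @{in:...}/@{out:...} logical-file mapping notation; the exact bindings and file names are illustrative assumptions, not taken from the slide:

  DV grep ( a1=@{in:  "file1"},    # read file1 on stdin
            a2=@{out: "file2"} );  # write grep output to file2
  DV sort ( a1=@{in:  "file2"},    # consume grep's output
            a2=@{out: "file3"} );  # write the sorted result to file3

The shared logical file file2 is what lets the workflow generator infer the output-to-input dependency between the two calls.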

Essence of VDL
- Elevates the specification of computation to a logical, location-independent level
- Acts as an “interface definition language” at the shell/application level
- Can express composition of functions
- Codable in textual and XML form
- Often machine-generated to provide ease of use and higher-level features
- A preprocessor provides iteration and variables

Using VDL
- Generated directly for low-volume usage
- Generated by scripts for production use
- Generated by application tool builders as wrappers around scripts provided for community use
- Generated transparently in an application-specific portal (e.g. quarknet.fnal.gov/grid)
- Generated by drag-and-drop workflow design tools such as Triana

Basic VDL Toolkit
- Convert between text and XML representations
- Insert, update, and remove definitions in a virtual data catalog
- Attach metadata annotations to definitions
- Search for definitions
- Generate an abstract workflow for a data derivation request
- Multiple interface levels provided:
  - Java API, command line, web service

Representing Workflow
- Specifies a set of activities and control flow
- Sequences information transfer between activities
- VDS uses an XML-based notation called the “DAG in XML” (DAX) format
- The VDC represents a wide range of workflow possibilities
- A DAX document represents the steps to create a specific data product
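To make the DAX format concrete, here is a minimal sketch of what a two-job DAX document could look like for the grep/sort example above. The element and attribute names follow the DAX dialect as recalled from VDS-era documentation, and the job IDs, namespace, and file names are assumptions for illustration only:

  <adag name="grep-sort-example">
    <!-- one job element per application invocation -->
    <job id="ID000001" namespace="example" name="grep" version="1.0">
      <uses file="file1" link="input"/>
      <uses file="file2" link="output"/>
    </job>
    <job id="ID000002" namespace="example" name="sort" version="1.0">
      <uses file="file2" link="input"/>
      <uses file="file3" link="output"/>
    </job>
    <!-- output-to-input dependency: sort runs after grep -->
    <child ref="ID000002">
      <parent ref="ID000001"/>
    </child>
  </adag>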

Executing VDL Workflows
[Slide diagram: a VDL program is stored in the Virtual Data Catalog; the Virtual Data Workflow Generator produces an abstract workflow (workflow spec); the “create execution plan” step yields either a statically partitioned DAG from a local planner or a dynamically planned DAG; DAGMan and Condor-G then drive Grid workflow execution, with job planner and job cleanup steps along the way.]
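For readers unfamiliar with the DAGMan stage of that pipeline, the following is a minimal sketch of the kind of DAGMan input file a planner could emit for the two-step grep/sort example; the file and submit-description names are hypothetical, and a real VDS/Pegasus plan also adds data transfer, replica registration, and cleanup jobs that are omitted here:

  # workflow.dag: run grep, then sort, under Condor DAGMan
  JOB grep grep.sub        # Condor-G submit description for the grep job
  JOB sort sort.sub        # Condor-G submit description for the sort job
  PARENT grep CHILD sort   # output-to-input dependency

  # submitted for execution with:  condor_submit_dag workflow.dag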

OSG: The “target chip” for VDS Workflows. Supported by the National Science Foundation and the Department of Energy.

VDS Applications

  Application                      | Jobs/workflow                  | Levels | Status
  ATLAS HEP Event Simulation       | 500K                           | 1      | In use
  LIGO Inspiral/Pulsar             | ~700                           | 2-5    | Inspiral in use
  NVO/NASA Montage/Morphology      | 1000s                          | 7      | Both in use
  GADU Genomics: BLAST, …          | 40K                            | 1      | In use
  fMRI DBIC AIRSN Image Proc       | 100s                           | 12     | In devel
  QuarkNet Cosmic Ray Science      | <10                            | 3-6    | In use
  SDSS Coadd; Cluster Search       | 40K; 500K                      | 2; 8   | In devel / CS research
  FOAM Ocean/Atmos Model           | 2000 (core app runs CPU jobs)  | 3      | In use
  GTOMO Image Proc                 | 1000s                          | 1      | In devel
  SCEC Earthquake Sim              | 1000s                          |        | In use

A Case Study: Functional MRI
- Problem: “spatial normalization” of images to prepare data from fMRI studies for analysis
- Target community is approximately 60 users at the Dartmouth Brain Imaging Center
- Wish to share data and methods across the country with researchers at Berkeley
- Process data from arbitrary user and archival directories in the center’s AFS space; bring data back to the same directories
- The Grid needs to be transparent to the users: literally, “Grid as a Workstation”

A Case Study: Functional MRI (2)
- Based the workflow on a shell script that performs a 12-stage process on a local workstation
- Adopted a replica naming convention for moving users’ data to Grid sites
- Created a VDL pre-processor to iterate transformations over datasets
- Utilizing resources across two distinct grids: Grid3 and the Dartmouth Green Grid

Functional MRI Analysis. Workflow courtesy of James Dobson, Dartmouth Brain Imaging Center.

Spatial normalization of a functional run. [Slide figures: dataset-level workflow and expanded (10-volume) workflow.]

Conclusion: Motivation for the Grid
- Provide flexible, cost-effective supercomputing
  - Federate computing resources
  - Organize storage resources and make them universally available
  - Link them on networks fast enough to achieve federation
- Create usable supercomputing
  - Shield users from heterogeneity
  - Organize and locate widely distributed resources
  - Automate policy mechanisms for resource sharing
- Provide ubiquitous access while protecting valuable data and resources

Grid Opportunities
- Vastly expanded computing and storage
- Reduced effort as needs scale up
- Improved resource utilization, lower costs
- Facilities and models for collaboration
- Sharing of tools, data, procedures, and protocols
- Recording, discovery, review and reuse of complex tasks
- Make high-end computing more readily available

fMRI Dataset processing

  FOREACH BOLDSEQ
    DV reorient (   # Process Blood O2 Level Dependent Sequence
      input = "$BOLDSEQ.hdr"} ],
      output = "$CWD/FUNCTIONAL/r$BOLDSEQ.hdr"}],
      direction = "y",
    );
  END

  DV softmean (
    input = [ FOREACH END ],
    mean = ]
  );
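The argument lists above are truncated in this transcript. A hedged reconstruction of what the full pre-processor source plausibly looked like, again using the @{in:...}/@{out:...} mapping notation; the .img/.hdr file pairs and the softmean input pattern are assumptions, not recovered from the slide:

  FOREACH BOLDSEQ
    DV reorient (            # Process Blood O2 Level Dependent Sequence
      input  = [ @{in:  "$BOLDSEQ.img"},
                 @{in:  "$BOLDSEQ.hdr"} ],
      output = [ @{out: "$CWD/FUNCTIONAL/r$BOLDSEQ.img"},
                 @{out: "$CWD/FUNCTIONAL/r$BOLDSEQ.hdr"} ],
      direction = "y",
    );
  END

  DV softmean (
    input = [ FOREACH BOLDSEQ
                @{in: "$CWD/FUNCTIONAL/r$BOLDSEQ.img"}
              END ],
    mean  = [ @{out: "$CWD/FUNCTIONAL/mean.img"} ]
  );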

fMRI Virtual Data Queries

Which transformations can process a “subject image”?
- Q: xsearchvdc -q tr_meta dataType subject_image input
- A: fMRIDC.AIR::align_warp

List anonymized subject images for young subjects:
- Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
- A: _anonymized.img

Show files that were derived from a patient image:
- Q: xsearchvdc -q lfn_tree _anonymized.img
- A: _anonymized.img _anonymized.sliced.hdr atlas.hdr atlas.img … atlas_z.jpg _anonymized.sliced.img

Blasting for Protein Knowledge
BLASTing the complete nr file for sequence similarity and function characterization.

Knowledge base: PUMA is an interface that lets researchers find information about a specific protein after it has been analyzed against the complete set of sequenced genomes (the nr file, approximately 2 million sequences).

Analysis on the Grid: the analysis of the protein sequences occurs in the background in the grid environment. Millions of processes are started, since several tools are run to analyze each sequence, such as protein similarity search (BLAST), protein family domain searches (BLOCKS), and structural characterization of the protein.

FOAM: Fast Ocean/Atmosphere Model, a 250-member ensemble run on TeraGrid under VDS.
[Slide diagram: for each ensemble member 1..N, remote directory creation, a FOAM run, then atmosphere, ocean, and coupler postprocessing, with results transferred to archival storage.]
Work of Rob Jacob (FOAM) and Veronica Nefedova (workflow design and execution).

FOAM: TeraGrid/VDS Benefits
[Slide chart: climate supercomputer versus TeraGrid with NMI and VDS.]
Visualization courtesy Pat Behling and Yun Liu, UW Madison.

Small Montage Workflow: ~1200-node workflow, 7 levels. Mosaic of M42 created on the TeraGrid using Pegasus.

LIGO Inspiral Search Application
The inspiral workflow application is the work of Duncan Brown (Caltech), Scott Koranda (UW Milwaukee), and the LSC Inspiral group.

US-ATLAS Data Challenge 2: Event generation using Virtual Data.
[Slide chart: CPU-days delivered per day, mid-July through September 10.]

Provenance for DC2

How much compute time was delivered?

  | years | mon | year |
  |  .45  |  6  | 2004 |
  |  20   |  7  | 2004 |
  |  34   |  8  | 2004 |
  |  40   |  9  | 2004 |
  |  15   | 10  | 2004 |
  |  15   | 11  | 2004 |
  |  8.9  | 12  | 2004 |

Selected statistics for one of these jobs:
  start: :33:56
  duration:
  pid: 6123
  exitcode: 0
  args: JobTransforms /share/dc2.g4sim.filter.trf CPE_6785_ dc2_B4_filter_frag.txt
  utime:
  stime:
  minflt:
  majflt:

Which Linux kernel releases were used? How many jobs were run on a Linux kernel?