Globus Data Services for Science Raj Kettimuthu Argonne National Laboratory/Univ. of Chicago Ann Chervenak, Rob Schuler USC Information Sciences Institute.



Globus Services for Data Intensive Science
- Data movement:
  - GridFTP and Reliable File Transfer Service (RFT)
- Replica management:
  - Replica Location Service (RLS) and Data Replication Service (DRS)
  - New: policy-based data placement service
- Access to databases and other data sources:
  - OGSA Data Access and Integration (DAI) Service

Talk Outline
- Examples of production data-intensive science projects that use Globus services
- New features:
  - GridFTP and RFT
  - Replica management tools
  - Data placement services
  - Data access and integration services

The LIGO Project
- Laser Interferometer Gravitational Wave Observatory
- LIGO instruments in Washington State and Louisiana
- During science runs, the instruments produce up to 2 terabytes per day
- Data are published along with metadata at Caltech (archival site)
- Data are replicated at up to 10 other LIGO sites
  - LIGO scientists typically move data sets near to computational clusters at their sites
- The LIGO Data Grid

Globus Services in the LIGO Data Grid
- Lightweight Data Replicator (LDR): data management system developed by LIGO researchers
- Globus data services:
  - GridFTP: used for moving data around the Grid efficiently and securely
  - Replica Location Service: catalogs deployed at all LIGO sites keep track of the locations of over 150 million files
  - Data Replication Service was developed to generalize the functionality in the LDR
- Other Globus services:
  - Globus security
  - Starting to deploy the Globus Monitoring and Discovery Service

Earth System Grid objectives
To support the infrastructural needs of the national and international climate community, ESG is providing crucial technology to securely access, monitor, catalog, transport, and distribute data in today’s grid computing environment.
[Figure labels: HPC hardware running climate models, ESG Sites, ESG Portal. Source slide: Bernholdt_ESG_SC07]

ESG facts and figures (slide courtesy of Dave Bernholdt, ORNL)
- Main ESG Portal
  - 146 TB of data at four locations
    - 1,059 datasets
    - 958,072 files
    - Includes the past 6 years of joint DOE/NSF climate modeling experiments
  - 4,910 registered users
  - Downloads to date
    - 30 TB
    - 106,572 files
- IPCC AR4 ESG Portal
  - 35 TB of data at one location
    - 77,400 files
    - Generated by a modeling campaign coordinated by the Intergovernmental Panel on Climate Change
    - Model data from 13 countries
  - 1,245 registered analysis projects
  - Downloads to date
    - 245 TB
    - 914,400 files
    - 500 GB/day (average)
  - More than 300 scientific papers published to date based on analysis of IPCC AR4 data
[Figures: Worldwide ESG user base; IPCC Daily Downloads (through 7/2/07)]

ESG architecture and underlying technologies
- Climate data tools
  - Metadata catalog
  - NcML (metadata schema)
  - OPeNDAP-G (aggregation, subsetting)
- Data management
  - Data Mover Lite
  - Storage Resource Manager
- Globus Toolkit
  - Globus Security Infrastructure
  - GridFTP
  - Monitoring and Discovery Services
  - Replica Location Service
- Security
  - Access control
  - MyProxy
  - User registration
[Figure: First Generation ESG Architecture. Labels: Data User and Data Provider web browsers (search, browse, download, publish); DML; ESG Web Portal; Data Search; Browsing; Data Download; Data Publishing; Data Subsetting; Access Control; User Registration; Usage Metrics; Monitoring Services; Climate Metadata Catalogs; OPeNDAP-G; MyProxy; SRM; DISK Cache; RLS and SRM instances; NCAR Cache; NCAR MSS; ORNL HPSS; LANL Cache; NERSC. MSS, HPSS: tertiary data storage systems.]

GridFTP Data Transfers for the Advanced Photon Source “One Australian user left nearly 1TB of data on our systems that we had been struggling to transfer via standard FTP for several weeks. The typical data rate using standard FTP was ~200 KB/s. Using GridFTP we are now moving data at 6 MB/s—quite a significant boost in performance!” Brian Tieman Advanced Photon Source 30x speedup 9688 miles
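The speedup quoted above follows directly from the two data rates. The short calculation below is only an illustration: it assumes the data set is exactly 1 TB in decimal units (the quote says "nearly 1TB") and that each rate is sustained for the whole transfer.

```python
# Rates are taken from the quote; the exact data size is an assumption.
DATA_BYTES = 1e12       # "nearly 1TB", rounded to 1 TB (decimal units)
FTP_RATE = 200e3        # ~200 KB/s with standard FTP
GRIDFTP_RATE = 6e6      # 6 MB/s with GridFTP
SECONDS_PER_DAY = 86400


def transfer_days(size_bytes, rate_bytes_per_s):
    """Days needed to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY


speedup = GRIDFTP_RATE / FTP_RATE
ftp_days = transfer_days(DATA_BYTES, FTP_RATE)
gridftp_days = transfer_days(DATA_BYTES, GRIDFTP_RATE)

print(f"speedup: {speedup:.0f}x")
print(f"standard FTP: {ftp_days:.1f} days, GridFTP: {gridftp_days:.1f} days")
```

Under these assumptions the transfer drops from roughly eight weeks to about two days, which matches the "several weeks" of struggle described in the quote.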

What’s New in Globus GridFTP and RFT Raj Kettimuthu Argonne National Laboratory and The University of Chicago

What is GridFTP?
- High-performance, reliable data transfer protocol optimized for high-bandwidth wide-area networks
- Based on the FTP protocol; defines extensions for high-performance operation and security
- We supply a reference implementation:
  - Server
  - Client tools (globus-url-copy)
  - Development libraries
- Multiple independent implementations can interoperate
  - Fermilab and U. Virginia have home-grown servers that work with ours.

GridFTP
- Two-channel protocol, like FTP
- Control channel
  - Communication link (TCP) over which commands and responses flow
  - Low bandwidth; encrypted and integrity protected by default
- Data channel
  - Communication link(s) over which the actual data of interest flows
  - High bandwidth; authenticated by default; encryption and integrity protection optional

Why GridFTP?
- Performance
  - Parallel TCP streams, optimal TCP buffer
  - Non-TCP protocols such as UDT
  - Order of magnitude greater
- Cluster-to-cluster data movement
  - Another order of magnitude
- Support for reliable and restartable transfers
- Multiple security options
  - Anonymous, password, SSH, GSI

Cluster-to-Cluster transfers

Performance
- Memory transfer between Urbana, IL and San Diego, CA

Performance
- Disk transfer between Urbana, IL and San Diego, CA

Users
- The HEP community is basing its entire tiered data movement infrastructure for the LHC Computing Grid on GridFTP
- The Southern California Earthquake Center (SCEC), the European Space Agency, and the Disaster Recovery Center in Japan move large volumes of data using GridFTP
- An average of more than 2 million data transfers happen with GridFTP every day

LOSF and Pipelining
- Significant performance improvement for LOSF (lots of small files)
[Figure: two message sequences, labeled Traditional and Pipelining. Each shows File Request 1, File Request 2, File Request 3, DATA 1, DATA 2, DATA 3, ACK 1, ACK 2, ACK 3.]
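To see why pipelining helps with many small files, a rough latency model is enough. The sketch below is a simplification, not GridFTP's actual behavior: it assumes every file takes the same time on the wire, that a traditional client waits one round trip per file before the data starts, and that a pipelined client sends its requests back to back so only the first pays the round trip. The function names and the sample numbers are hypothetical.

```python
def traditional_time(n_files, rtt, transfer_time):
    """Each request is issued only after the previous file completes,
    so every file pays one round trip plus its transfer time."""
    return n_files * (rtt + transfer_time)


def pipelined_time(n_files, rtt, transfer_time):
    """Requests are queued ahead of the data, so the round trip is
    paid once and the transfers then run back to back."""
    return rtt + n_files * transfer_time


# Hypothetical workload: 1000 small files, 100 ms round trip,
# 10 ms on the wire per file.
n, rtt, per_file = 1000, 0.1, 0.01
print(f"traditional: {traditional_time(n, rtt, per_file):.1f} s")
print(f"pipelined:   {pipelined_time(n, rtt, per_file):.1f} s")
```

In this model the gain grows with the ratio of round-trip time to per-file transfer time, which is why the effect is largest for small files on wide-area links.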

GridFTP over UDT
- GridFTP uses XIO for network I/O operations
- XIO presents a POSIX-like interface to many different protocol implementations
[Figure: protocol stacks. Default GridFTP: GSI, TCP. GridFTP over UDT: GSI, UDT.]

GridFTP over UDT
[Figure: throughput in Mbit/s for Argonne to NZ and Argonne to LA. Series: Iperf (1 stream, 8 streams); GridFTP mem TCP (1 stream, 8 streams); GridFTP disk TCP (1 stream, 8 streams); GridFTP mem UDT; GridFTP disk UDT; UDT mem; UDT disk.]

SSH Security for GridFTP
[Figure labels: Client, ssh, Port 22, sshd (ROOT), GridFTP Server (USER), Stdin/out]

Multicast / Overlay Routing
- Enable GridFTP to transfer a single data set to many locations or act as an intermediate routing node

GridFTP with Lotman
[Figure labels: Client, GridFTP Server, Lotman]

Reliable File Transfer Service (RFT)
- GridFTP client
- WSRF-compliant fault-tolerant service
[Figure labels: RFT Client, SOAP Messages, Notifications (Optional), RFT Service, Persistent Store, GridFTP Server (two), CC, DC]

RFT - Connection Caching
- Control channel connections (and thus the data channels associated with them) are cached for later reuse by the same user
[Figure labels: RFT Service, GridFTP Server (two), CC, DC]

RFT - Connection Caching
- Reusing connections eliminates authentication overhead on the control and data channels
- Measured performance improvement for jobs submitted using Condor-G
- For 500 jobs, each requiring file stageIn, stageOut and cleanup (RFT tasks):
  - 30% improvement in overall performance
  - No timeouts due to overwhelming connection requests to GridFTP servers

What’s new in Data Access and Integration? Raj Kettimuthu on behalf of the OGSA-DAI team

What is OGSA-DAI?
- Middleware that allows data resources, such as relational or XML databases, to be accessed via web services

What is OGSA-DAI?
- OGSA-DAI executes workflows
- OGSA-DAI is not just for data access; it also performs data updates, transformations, and delivery.

OGSA-DAI Workflow

Remote resource access
- OGSA-DAI to data resource interaction
  - Via a data resource plug-in
- Remote resource access
  - Access a data resource managed by another OGSA-DAI server

Remote resource access
- Remote resource plug-in
  - Basically a client to a remote OGSA-DAI server
  - Runs queries via workflow submission
  - Configured with the URL of the remote server
- Transparent to the OGSA-DAI infrastructure
  - Just another data resource plug-in

OGSA-DAI 3.0 data sources
- OGSA-DAI data sources
  - Resource for asynchronous data delivery
- Data source service
  - Web service
  - Invoke GetFully via SOAP/HTTP
  - Use WS-Addressing to specify data source ID
[Figure labels: data from workflow, Expose via data source, DataSource, DataSourceService, getFully(), Client]

OGSA-DAI servlet
- Data source servlet
  - Invoke HTTP GET
  - Use URL query string to specify data source ID
[Figure labels: data from workflow, Expose via data source, DataSource, DataSourceRetrievalServlet, HTTP GET, Client]

OGSA-DAI servlet
- Useful for service orchestration and job submission
  - Taverna service-oriented workflow executor
  - Taverna could submit a workflow to OGSA-DAI
  - OGSA-DAI returns a URL
  - Taverna passes the URL as part of a job to a job submission service
    - e.g. GRAM or GridSAM
  - Data is pulled from the URL when the job is executed
- Advantages
  - Data is only moved when needed, i.e. when the job executes
  - Job execution components need no OGSA-DAI-specific components

A join activity
- Virtual Organisations for Trials and Epidemiological Studies (VOTES)
  - UK Medical Research Council project
  - Relational databases
  - Uses OGSA-DAI
- OGSA-DAI team developed join activities

A join activity
- This is equivalent to running: SELECT id, x, y FROM tableOne, tableTwo WHERE tableOne.id = tableTwo.myID;
- where tableOne and tableTwo are in two different databases
[Figure: two "Run SQL query" activities feed a "Tuple merge join" activity with joinColumn1: id and joinColumn2: myID. The queries are SELECT id, x FROM tableOne ORDER BY id and SELECT myID, y FROM tableTwo ORDER BY myID.]
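The join described above works because both inputs arrive sorted on their join columns, so a single pass over each is enough. The following is a minimal sketch of a merge join over lists of tuples; it is not the OGSA-DAI activity itself, and the function name, argument names, and sample rows are all hypothetical.

```python
def tuple_merge_join(left, right, left_col, right_col):
    """Join two lists of tuples that are already sorted on their join
    columns. The join column from `right` is dropped from the output."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_col], right[j][right_col]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][right_col] == lk:
                j_end += 1
            # Pair every left row with this key against the whole run.
            while i < len(left) and left[i][left_col] == lk:
                for r in right[j:j_end]:
                    rest = tuple(v for k, v in enumerate(r) if k != right_col)
                    out.append(left[i] + rest)
                i += 1
            j = j_end
    return out


# Stand-ins for the two sorted query results (id, x) and (myID, y).
table_one = [(1, "a"), (2, "b"), (4, "d")]
table_two = [(2, "p"), (2, "q"), (3, "r"), (4, "s")]
joined = tuple_merge_join(table_one, table_two, 0, 0)
print(joined)
```

Sorting is pushed down to each database through ORDER BY, so the join step never needs to hold more than one run of equal keys at a time.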

SQL views
- Imagine we have Patient and Doctor tables
- SQL CREATE VIEW command
- Define a DrPatient view to be
  - SELECT p.id, p.name, p.age, p.sex FROM Patient p, Doctor d WHERE p.DrID = d.ID;
- Client runs SELECT * FROM DrPatient;
- Shorthand for complex queries
- Data access control
  - e.g. staff with access only to the DrPatient view will be unable to access a patient’s ZIP
Patient table (some cells were lost in extraction):
  ID     | Name  | Age | Sex | ZIP     | Dr ID
  1      | Ken   | 42  | M   | IL      | [lost]
  [lost] | Josie | 25  | F   | BN1 7QP | 789
Doctor table:
  ID  | Name     | DN
  123 | Greene   | US-Chicago-G
  456 | Ross     | US-Chicago-R
  789 | Fairhead | UK-Holby-F

OGSA-DAI SQL views
- Layer above the database to implement views
- Define views for databases to which you don’t have write access
- Parses the query
- Maps the view to an SQL query over the actual database
- e.g. if DrPatient was defined as
  - SELECT p.id, p.name, p.age, p.sex FROM Patient p, Doctor d WHERE p.DrID = d.ID AND d.dn = $DN$;
  - Can replace $DN$ by the client’s DN from their certificate, provided using GT4 security components
  - Doctors can only view their own patients
- Factor in the client’s security credentials
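The $DN$ substitution can be illustrated with a small layer that stores the view definition outside the database and binds the caller's DN when the view is queried. This is only a sketch using Python's built-in sqlite3 module, not OGSA-DAI code: the `VIEWS` dictionary and `run_view` function are hypothetical, the Doctor rows come from the slide, and the Patient rows fill in values (Ken's ZIP and doctor, Josie's ID) that are not recoverable from the slide and are therefore made up.

```python
import sqlite3

# View definitions live in this layer, not in the database, so no
# write access to the database schema is required.
VIEWS = {
    "DrPatient": (
        "SELECT p.id, p.name, p.age, p.sex FROM Patient p, Doctor d "
        "WHERE p.DrID = d.ID AND d.dn = $DN$"
    )
}


def run_view(conn, view_name, client_dn):
    """Answer `SELECT * FROM <view>` by running the stored query with
    the client's DN bound as a parameter (never spliced into the SQL)."""
    sql = VIEWS[view_name].replace("$DN$", "?")
    return conn.execute(sql, (client_dn,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Doctor (ID INTEGER, name TEXT, dn TEXT);
    CREATE TABLE Patient (id INTEGER, name TEXT, age INTEGER,
                          sex TEXT, zip TEXT, DrID INTEGER);
""")
conn.executemany("INSERT INTO Doctor VALUES (?, ?, ?)", [
    (123, "Greene", "US-Chicago-G"),
    (456, "Ross", "US-Chicago-R"),
    (789, "Fairhead", "UK-Holby-F"),
])
# Ken's ZIP and doctor, and Josie's ID, are invented for this sketch.
conn.executemany("INSERT INTO Patient VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Ken", 42, "M", "00000", 123),
    (2, "Josie", 25, "F", "BN1 7QP", 789),
])

print(run_view(conn, "DrPatient", "US-Chicago-G"))
```

Binding the DN as a query parameter rather than pasting it into the SQL text is a design choice of this sketch; it keeps a malformed DN from altering the query.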

OGSA-DQP
- Distributed query processing
  - Multiple tables on multiple databases are exposed to clients as multiple tables in one “virtual database”
  - Client is unaware of the multiple databases
  - Databases can be exposed within one OGSA-DAI server or exposed by remote OGSA-DAI servers
- How it works
  - Query is parsed
  - Query plan is created
  - Query plan is executed: each database has sub-queries executed on it
  - Results are combined
- Good for joins and unions
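The last two steps of "How it works" (run a sub-query on each database, then combine the results) can be shown with a toy union over two in-memory SQLite databases. This sketch skips query parsing and planning entirely and is not how OGSA-DQP is implemented; the `samples` table, the helper names, and the rows are hypothetical.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory database with one small table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (id INTEGER, value TEXT)")
    conn.executemany("INSERT INTO samples VALUES (?, ?)", rows)
    return conn


def distributed_union(databases, sub_query):
    """Run the same sub-query on every database and combine the rows,
    dropping duplicates as SQL UNION would."""
    combined = []
    for conn in databases:
        combined.extend(conn.execute(sub_query).fetchall())
    return sorted(set(combined))


db_a = make_db([(1, "a"), (2, "b")])
db_b = make_db([(2, "b"), (3, "c")])
result = distributed_union([db_a, db_b], "SELECT id, value FROM samples")
print(result)
```

The caller sees a single result set and never addresses `db_a` or `db_b` directly, which is the "virtual database" idea in miniature.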

What’s new in data replication and placement services? Rob Schuler

Objectives for Data Replication
- Improve durability: safeguard against data loss due to disk failure
- Improve availability: safeguard against data inaccessibility due to network partition
- Improve performance: safeguard against performance bottlenecks due to resource overload

The Globus Replica Location Service
- Distributed registry
- Records the locations of data copies
- Allows replica discovery
- RLS maintains mappings between logical identifiers and target names
- Must perform and scale well:
  - support hundreds of millions of objects
  - hundreds of clients
- Mature and stable component of the Globus Toolkit
[Figure labels: Replica Location Indexes, Local Replica Catalogs]
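The mapping between logical identifiers and target names, split across the two tiers named in the figure, can be pictured with a pair of in-memory classes. This is a toy model for illustration and not the RLS client API: the class and method names are hypothetical, and the logical and target names are invented examples.

```python
from collections import defaultdict


class LocalReplicaCatalog:
    """Holds logical name -> target name mappings for one site."""

    def __init__(self, site):
        self.site = site
        self.mappings = defaultdict(set)

    def add(self, logical_name, target_name):
        self.mappings[logical_name].add(target_name)

    def lookup(self, logical_name):
        return sorted(self.mappings.get(logical_name, ()))


class ReplicaLocationIndex:
    """Records which catalogs hold mappings for each logical name."""

    def __init__(self):
        self.catalogs = defaultdict(list)

    def update_from(self, catalog):
        for logical_name in catalog.mappings:
            if catalog not in self.catalogs[logical_name]:
                self.catalogs[logical_name].append(catalog)

    def find_replicas(self, logical_name):
        targets = []
        for catalog in self.catalogs.get(logical_name, []):
            targets.extend(catalog.lookup(logical_name))
        return sorted(targets)


# Invented example names for two sites holding the same logical file.
site_a = LocalReplicaCatalog("siteA")
site_b = LocalReplicaCatalog("siteB")
site_a.add("lfn://run42/frame-001", "gsiftp://siteA.example.org/data/frame-001")
site_b.add("lfn://run42/frame-001", "gsiftp://siteB.example.org/store/frame-001")

index = ReplicaLocationIndex()
index.update_from(site_a)
index.update_from(site_b)
replicas = index.find_replicas("lfn://run42/frame-001")
print(replicas)
```

Keeping the full mappings in per-site catalogs and only a summary in the index is what lets the registry stay distributed: a lookup first asks the index where to look, then asks those catalogs for the actual target names.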

New Features in RLS
- Embedded SQLite database for easier RLS deployment
  - Open source relational database back ends (MySQL, PostgreSQL) depend on ODBC libraries
  - Compatibility problems have made DB deployment difficult
  - Embedded DB back end now allows easy installation of RLS
- Allows easier evaluation of RLS by potential users
  - SQLite offers good performance and scalability on queries
  - Does not support multiple simultaneous writers, so not suitable for some high-performance environments

New Features in RLS
- Pure Java client implementation
  - Long-awaited
  - Overcomes problems with the JNI-based client, particularly on 64-bit platforms
  - Improves reliability of portals that use the RLS Java client
  - Being used by several large applications (ESG, SCEC)
- WS-RLS interface: provides a WS-RF compatible web services interface to RLS
  - Easier integration of RLS services into GT4 Web service environments

Data Placement Services: Motivation
- Scientific applications often perform complex computational analyses that consume and produce large data sets
  - Computational and storage resources are distributed in the wide area
- The placement of data onto storage systems can have a significant impact on
  - performance of applications
  - reliability and availability of data sets
- We want to identify data placement policies that distribute data sets so that they can be
  - staged into or out of computations efficiently
  - replicated to improve performance and reliability

Data Placement and Workflow Management
- Studied the relationship between asynchronous data placement services and workflow management systems
  - Workflow system can provide hints regarding grouping of files, expected order of access, dependencies, etc.
- Contrasts with many existing workflow systems
  - These explicitly stage data onto computational nodes before execution
- Some explicit data staging may still be required
- Data placement has potential to
  - Significantly reduce the need for on-demand data staging
  - Improve workflow execution time
- Experimental evaluation demonstrates that good placement can significantly improve workflow execution performance
Reference: “Data Placement for Scientific Applications in Distributed Environments,” Ann Chervenak, Ewa Deelman, Miron Livny, Mei-Hui Su, Rob Schuler, Shishir Bharathi, Gaurang Mehta, Karan Vahi, in Proceedings of Grid 2007 Conference, Austin, TX, September 2007.

Approach: Combine Pegasus Workflow Management with Globus Data Replication Service
[Figure labels: Workflow Planner: Pegasus; Data Placement Service: Globus DRS; Compute Cluster; Storage Elements; Jobs; Data Transfer; Workflow Tasks; Staging Request; Setup Transfers]

Replication occurs when…
- Replica placement
  - I want replica X at sites A, B, and C
  - I want N replicas of each file
  - I want replicas near my compute clusters
- Replica repair
  - Due to replica failure: lost or corrupted
  - But it can be hard to tell the difference between permanent and temporary failure!

Examples of Placement Policies
- Random: make N copies placed randomly on different sites
- Topology-aware: one on my server, one on the same rack, one on another rack
- Publish/Subscribe: query-based replication requests to push or pull data to make new replicas
- Tree-based dissemination: push replicas toward the “leaf” nodes (or access points) of the tree
- Pervasive: exploit locality of reference by creating replicas at any site where they are accessed
- QoS aware: place replicas at sites in order to optimize Quality-of-Service (QoS) criteria

Topology-Aware Placement
1. Put Data
2. Replicate to 2nd Local Site
3. Replicate to Remote Site
The Topology Aware policy is a type of N-copy policy that (in this 3-copy example) ensures that replicas are distributed within and between sites.
[Figure labels: client, Site 1, Site 2, Site 3]
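A 3-copy topology-aware choice of the kind described earlier ("one on my server, one on the same rack, one on another rack") can be sketched as a small selection function. This is an illustration under simple assumptions, not the Globus placement service: the topology is a plain dictionary from server name to rack name, ties are broken alphabetically, and all names are hypothetical.

```python
def topology_aware_placement(local_server, servers):
    """Pick three servers for a 3-copy policy: the local server, a
    neighbour on the same rack, and a server on a different rack.

    `servers` maps server name -> rack name.
    """
    local_rack = servers[local_server]
    same_rack = sorted(
        s for s, rack in servers.items()
        if rack == local_rack and s != local_server
    )
    other_rack = sorted(s for s, rack in servers.items() if rack != local_rack)
    if not same_rack or not other_rack:
        raise ValueError("topology cannot satisfy the 3-copy policy")
    return [local_server, same_rack[0], other_rack[0]]


# Hypothetical topology: two servers on rack A, two on rack B.
topology = {"a1": "rackA", "a2": "rackA", "b1": "rackB", "b2": "rackB"}
placement = topology_aware_placement("a1", topology)
print(placement)
```

A real policy would also weigh free space and load; the point here is only that the copies are spread both within and across failure domains.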

Publish/Subscribe Placement
1.a. Publish Data “XYZ”
1.b. Publish Data “QRS”
1.c. Subscribe “XYZ” and “QRS”
2. Query replica name service and replicate data sets
The Publish/Subscribe policy is a query-based policy that identifies desired replicas based on a query and replicates them to the desired site.
[Figure labels: client (three), Site 1, Site 2, Site 3]

Reactive vs. Proactive Replication
- Reactive replication
  - When a replica failure occurs, replicate
  - Difficult to tell the difference between a permanent replica failure and a temporary loss, e.g., a temporary network partition
- Proactive replication
  - Continually replicate files beyond the minimum required
  - Avoid bursts of network traffic to repair failures; limit bandwidth for repairs
  - Need creation rate >= failure rate
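The condition "creation rate >= failure rate" can be checked with a deliberately simple deterministic model: at each time step a fixed number of replicas is created and a fixed number is lost. The function and the rates below are hypothetical and ignore the randomness of real failures; the sketch only shows which way the replica count drifts.

```python
def replica_counts(initial, creation_rate, failure_rate, steps):
    """Track the replica count over time with constant per-step rates.
    The count cannot drop below zero."""
    counts = [float(initial)]
    current = float(initial)
    for _ in range(steps):
        current = max(0.0, current + creation_rate - failure_rate)
        counts.append(current)
    return counts


# Creation slower than failure: the replicas eventually run out.
losing = replica_counts(initial=3, creation_rate=1.0, failure_rate=1.5, steps=6)
# Creation faster than failure: the replica count keeps growing.
keeping_up = replica_counts(initial=3, creation_rate=2.0, failure_rate=1.5, steps=6)
print(losing)
print(keeping_up)
```

When the two rates are equal the count stays flat, which is the boundary case of the inequality.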