Data Infrastructure in the TeraGrid Chris Jordan Campus Champions Presentation May 6, 2009.

Presentation transcript:


What is the Data Working Group?
Group of staff from all TeraGrid Resource Providers
Responsible for planning and implementation of:
  – Software supporting data movement and management
  – Overall configuration of tools
  – Coordination with Software and other Working Groups
Data Architecture effort
  – Smaller group of individuals
  – Looking at overall strategic plan for data in TG

Jordan’s Theory of Scientific Computation
[Diagram; labels: Campus, TeraGrid Resource, Data, HPC, Analysis/Viz, Magic, Results, Profit]

Data Collection and Preservation
Data is the “stuff” of science
Different forms: simulation output, sensor output, experimental results, etc.
Data has significant, unpredictable reuse value
Collections organize data from numerous sources
Metadata for identification, search for location
Preservation allows for long-term access and reuse

What tools do we provide?
“Data Movement” Tools – get data from here to there and back again
“Data Management” Tools – replicate, organize, and tag files, utilize databases
“Data Collections” Tools – evolution of data management to include formal collections access

What resources are available?
High-Performance Parallel File Systems
  – “Scratch” and “Work” areas
Archive systems
  – Use tapes for long-term storage
  – Very high capacity (petabytes or tens of petabytes)
Wide-Area and Global file systems
  – Extension of parallel file systems over wide-area networks
  – One file system, available on multiple sites/resources

Kits and Information Services
TeraGrid software organized into “capabilities”
Capabilities collected in “Kits”
Kits register their services and software
TeraGrid Central Information Service collects this information for all RPs
Users, applications query the Info Service

Data Movement Tools
GridFTP Servers and Clients
  – Supports parallel transfers (threading)
  – Supports “striping” (use of multiple servers)
  – globus-url-copy client allows selection of low-level options (network and storage block sizes, etc.)
  – Not the simplest syntax
Secure Copy (scp/ssh)
  – TeraGrid supports high-performance network extensions
  – Simple syntax, relatively easy to use
  – Not always as featureful as GridFTP
UberFTP
  – FTP-like command-line client for GridFTP, other protocols
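
As a rough illustration of the low-level options mentioned on this slide, the Python sketch below assembles a globus-url-copy command line without running it. The flags used (-p for parallel streams, -tcp-bs for the TCP buffer size, -bs for the block size, -stripe, -vb) are standard globus-url-copy options, but the helper function, the host names, the paths, and the default values are hypothetical and not taken from the presentation. Check `globus-url-copy -help` at your site before relying on any of them.

```python
from typing import List, Optional


def build_guc_command(
    source_url: str,
    dest_url: str,
    parallel_streams: int = 4,
    tcp_buffer_bytes: Optional[int] = None,
    block_size_bytes: Optional[int] = None,
    striped: bool = False,
) -> List[str]:
    """Assemble (but do not run) a globus-url-copy argument list."""
    if parallel_streams < 1:
        raise ValueError("parallel_streams must be at least 1")
    # -vb prints transfer performance; -p sets the number of parallel streams.
    cmd = ["globus-url-copy", "-vb", "-p", str(parallel_streams)]
    if tcp_buffer_bytes is not None:
        cmd += ["-tcp-bs", str(tcp_buffer_bytes)]  # network (TCP) buffer size
    if block_size_bytes is not None:
        cmd += ["-bs", str(block_size_bytes)]  # transfer block size
    if striped:
        cmd.append("-stripe")  # request a striped (multi-server) transfer
    cmd += [source_url, dest_url]
    return cmd


if __name__ == "__main__":
    # Hypothetical endpoints; substitute real GridFTP server names.
    print(" ".join(build_guc_command(
        "gsiftp://gridftp.site-a.example.org/scratch/user/run01.tar",
        "gsiftp://gridftp.site-b.example.org/work/user/run01.tar",
        parallel_streams=8,
        tcp_buffer_bytes=4 * 1024 * 1024,
    )))
```

Building the argument list separately from executing it makes it easy to log or review the exact command before a large transfer starts.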

Data Management Tools
Storage Resource Broker client
  – SDSC provides a TeraGrid-wide SRB service
  – Many data collections currently managed through SRB
  – SRB is now deprecated, being phased out
Reliable File Transfer service
  – Globus utility for managing transfers
  – Uses database for persistent state storage
  – Supports automated retry, transfers of file lists
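
The idea behind a reliable transfer service (persistent state plus automated retry over a list of files) can be sketched in a few lines of Python. This is not how Globus RFT is implemented and it does not use any RFT interface. The table layout, the `run_transfers` function, and the `copy_one` callback are all hypothetical, and SQLite stands in for whatever database a real service would use.

```python
import sqlite3
from typing import Callable, Iterable, Tuple


def run_transfers(
    db: sqlite3.Connection,
    pairs: Iterable[Tuple[str, str]],
    copy_one: Callable[[str, str], None],
    max_attempts: int = 3,
) -> int:
    """Copy each (source, destination) pair, retrying failures.

    Progress is recorded in the database, so a restarted run skips
    pairs that already finished. Returns the number of pairs that are
    still unfinished after this run.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS transfers ("
        "src TEXT, dst TEXT, attempts INTEGER DEFAULT 0, "
        "done INTEGER DEFAULT 0, PRIMARY KEY (src, dst))"
    )
    for src, dst in pairs:
        db.execute(
            "INSERT OR IGNORE INTO transfers (src, dst) VALUES (?, ?)",
            (src, dst),
        )
    db.commit()

    pending = db.execute(
        "SELECT src, dst, attempts FROM transfers WHERE done = 0"
    ).fetchall()
    for src, dst, attempts in pending:
        while attempts < max_attempts:
            attempts += 1
            try:
                copy_one(src, dst)
            except OSError:
                done = 0
            else:
                done = 1
            # Persist the outcome of every attempt before moving on.
            db.execute(
                "UPDATE transfers SET attempts = ?, done = ? "
                "WHERE src = ? AND dst = ?",
                (attempts, done, src, dst),
            )
            db.commit()
            if done:
                break

    (remaining,) = db.execute(
        "SELECT COUNT(*) FROM transfers WHERE done = 0"
    ).fetchone()
    return remaining
```

In practice `copy_one` would wrap an actual transfer tool. Here it is left as a parameter so the retry and bookkeeping logic can be exercised on its own.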

Data Collections Tools
Integrated Rule-Oriented Data System
  – In testing at TACC, SDSC
  – Supports storage of data and metadata
  – Supports management of data in archives and file systems, replication and checksum management
  – Can manage data based on programmable “rule engine”
Database clients
  – JDBC/ODBC – implementation-independent DB interface
  – MySQL and Postgres clients for the most common open source databases
  – Some sites support Oracle
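
To make "replication and checksum management" concrete, here is a minimal Python sketch that checks whether a replica still matches its original by comparing digests. It is only an illustration of the general idea and does not use or imitate the Integrated Rule-Oriented Data System's own interfaces. The function names and the choice of SHA-256 are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_bytes: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicas_match(original: Path, replica: Path) -> bool:
    """True when both files have identical content according to SHA-256."""
    return sha256_of(original) == sha256_of(replica)
```

Storing the digest alongside the file's other metadata lets a later check detect silent corruption without rereading the original.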

Parallel File Systems
GPFS and Lustre
Multiple servers, multiple disk arrays
  – Load is distributed across servers for high performance
  – Files can be distributed on a per-file or per-block basis (striping)
  – Lustre allows per-directory and per-file user configuration of striping
Basic technologies behind WAN file systems
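
Per-block striping can be pictured as dealing fixed-size blocks of a file out to servers in round-robin order. The Python sketch below models only that simplified picture. Real GPFS and Lustre layouts involve more detail, and the function name, parameters, and example sizes are hypothetical.

```python
from typing import Tuple


def locate_block(offset: int, stripe_size: int, stripe_count: int) -> Tuple[int, int]:
    """Map a byte offset in a file to (server index, offset on that server).

    Assumes simple round-robin striping: stripe 0 goes to server 0,
    stripe 1 to server 1, and so on, wrapping after `stripe_count`.
    """
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; sizes and counts must be > 0")
    stripe_number = offset // stripe_size
    server_index = stripe_number % stripe_count
    # Each full round of stripes adds one stripe's worth of data per server.
    local_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return server_index, local_offset


if __name__ == "__main__":
    one_mib = 1024 * 1024
    for offset in (0, one_mib, 4 * one_mib + 10):
        print(offset, locate_block(offset, stripe_size=one_mib, stripe_count=4))
```

Because consecutive stripes land on different servers, a large sequential read or write can keep several servers and disk arrays busy at once.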

Archive Systems
Many different technologies and configurations
HPSS and others use a custom command interface
  – Run special commands to store and retrieve files
  – Often referred to as “put and get” interfaces
SAM-QFS at SDSC and TACC uses a file system interface
  – Looks just like any other file system
  – Can use GridFTP, SCP, etc. to store and retrieve files
  – May have to wait to “stage” files from tape
All archives support different classes of service with different storage characteristics

Classes of Service
Disk-only
  – Never copy to tape, and/or never delete from disk
  – Used for small files
  – Often “bundle” many small files together for efficiency
1 Tape Copy
  – Copy to a tape, delete from disk
  – Most common type of service
2 Tape Copies
  – Replicate across two tapes in case of media failure
  – Usually has to be specially requested or configured
Geographical replication
  – Coming soon …
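
Bundling small files before archiving, as mentioned above, can be done with a plain tar file. The Python sketch below uses only the standard library. The function name, the 1 MiB size threshold, and the directory names in the usage note are assumptions for illustration, not site policy.

```python
import tarfile
from pathlib import Path
from typing import List


def bundle_small_files(
    directory: Path, bundle_path: Path, max_bytes: int = 1 << 20
) -> List[str]:
    """Pack files under `directory` no larger than `max_bytes` into one tar file.

    Returns the archive member names, relative to `directory`.
    """
    bundled: List[str] = []
    with tarfile.open(str(bundle_path), "w") as archive:
        for path in sorted(directory.rglob("*")):
            if not path.is_file():
                continue
            # Never add the bundle to itself if it lives inside `directory`.
            if path.resolve() == bundle_path.resolve():
                continue
            if path.stat().st_size <= max_bytes:
                name = path.relative_to(directory).as_posix()
                archive.add(str(path), arcname=name)
                bundled.append(name)
    return bundled
```

A call such as `bundle_small_files(Path("results"), Path("results_small.tar"))` would leave one larger file to store in the archive, with the larger originals handled separately.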

Wide-Area File Systems
Take advantage of parallel operation and wide network pipes
Have been shown to utilize up to 30 Gb/s cross-country
Good for large datasets with distributed usage, e.g. compute at NICS, visualize at TACC
Significant technical accomplishment, still working to extend availability everywhere
GPFS-WAN: SDSC, IU, NCSA (sometimes)
Lustre-WAN: IU, PSC, LONI, TACC (coming soon)
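
For a sense of scale, the short Python sketch below converts a dataset size and a network rate into an ideal transfer time. It assumes decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits per second) and ignores protocol overhead, so real transfers will be slower. The 30 Gb/s figure is the peak quoted on this slide, and the 1 Gb/s comparison is an arbitrary choice.

```python
def transfer_seconds(size_bytes: float, rate_gbit_per_s: float) -> float:
    """Ideal time in seconds to move `size_bytes` at `rate_gbit_per_s`."""
    if rate_gbit_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes * 8 / (rate_gbit_per_s * 1e9)


if __name__ == "__main__":
    one_terabyte = 1e12  # bytes, decimal definition
    for rate in (1.0, 30.0):
        minutes = transfer_seconds(one_terabyte, rate) / 60
        print(f"1 TB at {rate:g} Gb/s: about {minutes:.1f} minutes")
```

Under these assumptions, 1 TB takes roughly 133 minutes at 1 Gb/s and roughly 4.4 minutes at 30 Gb/s.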

Recommendations for New Users
Develop a Data Management plan
Understand your data workflow
Understand the data resources you will use
Automate the data workflow if possible
Almost all data may be useful in collaboration
Consider the long-term value of your data, and whether to donate it to a collection or organize it yourself

Input always welcome
Data is an extraordinarily diverse field
Lots of use cases, lots of needs
Many needs have to do with policy, some have to do with tools
Important to make sure we’re serving the user community
Contact or with comments and