Clustering the Reliable File Transfer Service
Jim Basney and Patrick Duda
NCSA, University of Illinois
TeraGrid '07, June 6, 2007
Presentation transcript:

Slide 1: Clustering the Reliable File Transfer Service
Jim Basney and Patrick Duda, NCSA, University of Illinois
TeraGrid '07, June 6, 2007
This material is based upon work supported by the National Science Foundation under Grant No

Slide 2: Goal
Provide a highly available Reliable File Transfer (RFT) service:
– Tolerate server failures (hardware/software faults and resource exhaustion)
– Continue to handle incoming requests
– Continue to make forward progress on file transfers in the queue

Slide 3: Globus Toolkit Reliable File Transfer Service
[architecture diagram; labels: Client, RFT, GridFTP]

Slide 4: RFT and GridFTP Clustering
[diagram; labels: RFT, GridFTP control, GridFTP data]

Slide 5: Clustering Approach
[diagram: a Load Balancer in front of multiple RFT instances, all sharing an HA DBMS]
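Because all request state lives in the shared highly available DBMS, the load balancer needs no session affinity: any live RFT instance can serve any client. A toy round-robin sketch of that routing idea (instance names and the health-tracking mechanism are illustrative assumptions, not part of the actual deployment):

```python
import itertools

class RftFrontEnd:
    """Toy round-robin balancer over a set of RFT instances.

    Shared state is in the HA DBMS, so the balancer only has to
    skip instances it believes are down; no state migration needed.
    """

    def __init__(self, instances):
        self.instances = list(instances)
        self._cycle = itertools.cycle(self.instances)
        self.down = set()  # instances currently marked unhealthy

    def pick(self):
        # Try each instance at most once per call.
        for _ in range(len(self.instances)):
            inst = next(self._cycle)
            if inst not in self.down:
                return inst
        raise RuntimeError("no live RFT instances")

fe = RftFrontEnd(["rft1:8443", "rft2:8443", "rft3:8443"])
fe.down.add("rft2:8443")            # simulate a failed instance
picks = [fe.pick() for _ in range(4)]
print(picks)                        # rft2 is never selected
```

In the real deployment this role is played by a dedicated load balancer; the point is only that a failed instance can be routed around without moving any service state.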

Slide 6: Web Service Container
[diagram; labels: Client, Web Service Container (RFT, RFT State Management, Delegation Service), DBMS]

Slide 7: RFT DB Tables
Request: ID, Termination Time, Started Flag, Max Attempts, Delegated EPR; added fields: Container ID, Start Time
Transfer: ID, Request ID, Source URL, Destination URL, Status, Attempts, Retry Time
Restart: Transfer ID, Restart Marker; added field: Last Update Time
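A sketch of this schema as SQLite DDL executed from Python, with the clustering additions commented; column types and exact names are my assumptions, not the actual RFT schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE request (
    id               TEXT PRIMARY KEY,
    termination_time REAL,
    started_flag     INTEGER,
    max_attempts     INTEGER,
    delegated_epr    BLOB,
    container_id     TEXT,   -- added: which RFT instance owns this request
    start_time       REAL    -- added: when that instance started on it
);
CREATE TABLE transfer (
    id              TEXT PRIMARY KEY,
    request_id      TEXT REFERENCES request(id),
    source_url      TEXT,
    destination_url TEXT,
    status          TEXT,
    attempts        INTEGER,
    retry_time      REAL
);
CREATE TABLE restart (
    transfer_id      TEXT REFERENCES transfer(id),
    restart_marker   BLOB,
    last_update_time REAL    -- added: lets peer instances detect stalls
);
""")
print([r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")])
```

The added columns are what turn a single-instance schema into a clustered one: ownership (Container ID) plus liveness signals (Start Time, Last Update Time).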

Slide 8: New Tables
Delegation Service: Resource ID, Caller DN, Local Name, Termination Time, Listener, Certificate, Container ID
Persistent Subscription: Consumer, Producer, Policy, Precondition, Selector, Topic, Security Descriptor, …

Slide 9: RFT Fail-Over
Fail-over is based on time-outs: each instance periodically queries the database for pending requests with no recent activity.
– Stalled requests could be caused by an RFT service crash, hardware failure, RFT service overload, etc.
– If stalled requests are found: obtain a DB write lock, query again, claim the stalled requests, and release the lock.
Configuration values:
– Query interval (default: 30 seconds)
– Recent interval (default: 60 seconds)
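A minimal sketch of this claim procedure, using SQLite in autocommit mode as a stand-in for the shared DBMS. Table and column names are illustrative, not the actual RFT schema, and a production DBMS would typically use `SELECT ... FOR UPDATE` where this sketch uses SQLite's `BEGIN IMMEDIATE` write lock:

```python
import sqlite3

QUERY_INTERVAL = 30    # seconds between fail-over scans (slide default)
RECENT_INTERVAL = 60   # no activity for this long means the request stalled

def claim_stalled_requests(conn, my_container_id, now):
    """One fail-over pass: find pending requests with no recent activity,
    then re-check under a write lock and claim them for this instance."""
    cutoff = now - RECENT_INTERVAL
    stalled = conn.execute(
        "SELECT id FROM request WHERE started_flag = 1 AND last_update_time < ?",
        (cutoff,)).fetchall()
    if not stalled:
        return []
    # Take the write lock, then query again: another instance may have
    # claimed some of these requests between our first query and the lock.
    conn.execute("BEGIN IMMEDIATE")
    claimed = [row[0] for row in conn.execute(
        "SELECT id FROM request WHERE started_flag = 1 AND last_update_time < ?",
        (cutoff,))]
    conn.execute(
        "UPDATE request SET container_id = ?, last_update_time = ? "
        "WHERE started_flag = 1 AND last_update_time < ?",
        (my_container_id, now, cutoff))
    conn.commit()  # releases the lock
    return claimed

# Demo: instance c1 stalls; instance c2 claims its abandoned request.
conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
conn.execute("CREATE TABLE request (id INTEGER, started_flag INTEGER,"
             " container_id TEXT, last_update_time REAL)")
conn.execute("INSERT INTO request VALUES (1, 1, 'c1', 0)")   # no activity since t=0
conn.execute("INSERT INTO request VALUES (2, 1, 'c1', 90)")  # recently active
print(claim_stalled_requests(conn, "c2", now=100))  # claims request 1 only
```

The second query under the lock is the essential step: without it, two instances that ran the first query concurrently could both claim the same stalled request.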

Slide 10: Evaluation Environment
Dedicated 12-node Linux cluster:
– Red Hat Enterprise Linux AS Release 3
– Switched Gigabit Ethernet
– 2 GB RAM
– Dual 2 GHz Intel Xeon CPUs, 512 KB cache
– Globus Toolkit
– MySQL Standard

Slide 11: Evaluation
Correctness / effectiveness:
– Submitted multiple RFT requests of different sizes to 12 RFT instances
– Verified fail-over and notification functionality
Performance:
– Evaluate the overhead of the shared DBMS
– Stress test: transfer many small files

Slide 12: [graph: fail-over demonstration; annotations: "web services container stopped", "fail-over", "60 second fail-over interval"]

Slide 13: [figure only; no transcript text]

Slide 14: [chart; data labels: 4%, 6%, 10%, 14%, 22%, 43%, 57%, 82%, 95%]

Slide 15: Related Work
HAND: Highly Available Dynamic Deployment Infrastructure for GT4
– Migrates services between containers to maintain availability during planned outages
– Does not address management of persistent service state or fail-over for unplanned outages
myGrid
– DBMS persistence of WS-ResourceProperties in Apache WSRF
– Points to a general-purpose approach for DBMS-based persistence of stateful WSRF services

Slide 16: Conclusion
– Clustering RFT provides load balancing and fail-over with acceptable performance for small clusters
– Clustering is a promising approach to apply to other grid services as well

Slide 17: Future Work
– Correctly handle replay of FTP deletes
– Implement credentialRefreshListener
– Evaluate use of different DBMS solutions
– Investigate GT4 DBMS persistence in general
– Investigate use of WS-Naming

Slide 18: Thanks! Questions? Comments?
This material is based upon work supported by the National Science Foundation under Grant No
Performance experiments were conducted on computers at the Technology Research, Education, and Commercialization Center (TRECC), a program of the University of Illinois at Urbana-Champaign, funded by the Office of Naval Research and administered by the National Center for Supercomputing Applications. We thank Tom Roney for his assistance with the TRECC cluster. We also thank Ravi Madduri from the Globus project for answering our questions about RFT.