Data Transfers in the Grid: Workload Analysis of Globus GridFTP Nicolas Kourtellis, Lydia Prieto, Gustavo Zarrate, Adriana Iamnitchi University of South.

Slides:



Advertisements
Similar presentations
Cross-site data transfer on TeraGrid using GridFTP TeraGrid06 Institute User Introduction to TeraGrid June 12 th by Krishna Muriki
Advertisements

Current Testbed : 100 GE 2 sites (NERSC, ANL) with 3 nodes each. Each node with 4 x 10 GE NICs Measure various overheads from protocols and file sizes.
ATLAS Tier-3 in Geneva Szymon Gadomski, Uni GE at CSCS, November 2009 S. Gadomski, ”ATLAS T3 in Geneva", CSCS meeting, Nov 091 the Geneva ATLAS Tier-3.
AMAZON S3 FOR SCIENCE GRIDS: A VIABLE SOLUTION? Mayur Palankar and Adriana Iamnitchi University of South Florida Matei Ripeanu University of British Columbia.
S4: A Simple Storage Service for Sciences Matei Ripeanu Adriana Iamnitchi University of British Columbia University of South Florida.
Mendel Rosenblum and John K. Ousterhout Presented by Travis Bale 1.
IS333, Ch. 26: TCP Victor Norman Calvin College 1.
Esma Yildirim Department of Computer Engineering Fatih University Istanbul, Turkey DATACLOUD 2013.
Augmenting Mobile 3G Using WiFi Sam Baek Ran Li Modified from University of Massachusetts Microsoft Research.
Fast Pattern-Based Throughput Prediction for TCP Bulk Transfers Tsung-i (Mark) Huang Jaspal Subhlok University of Houston GAN ’ 05 / May 10, 2005.
Measurement, Modeling, and Analysis of a Peer-2-Peer File-Sharing Workload Presented For Cs294-4 Fall 2003 By Jon Hess.
1 School of Computing Science Simon Fraser University, Canada Modeling and Caching of P2P Traffic Mohamed Hefeeda Osama Saleh ICNP’06 15 November 2006.
Fresh Analysis of Streaming Media Stored on the Web Rabin Karki M.S. Thesis Presentation Advisor: Mark Claypool Reader: Emmanuel Agu 10 Jan, 2011.
Copyright © 2005 Department of Computer Science CPSC 641 Winter WAN Traffic Measurements There have been several studies of wide area network traffic.
An Analysis of Internet Content Delivery Systems Stefan Saroiu, Krishna P. Gommadi, Richard J. Dunn, Steven D. Gribble, and Henry M. Levy Proceedings of.
Measurement, Modeling, and Analysis of a Peer-to-Peer File sharing Workload Krishna P. Gummadi, Richard J. Dunn, Stefan Saroiu, Steven D. Gribble, Henry.
Are P2P Data-Dissemination Techniques Viable in Today's Data- Intensive Scientific Collaborations? Samer Al-Kiswany – University of British Columbia joint.
Network Traffic Measurement and Modeling CSCI 780, Fall 2005.
Secondary Storage CSCI 444/544 Operating Systems Fall 2008.
1 Exploring Data Reliability Tradeoffs in Replicated Storage Systems NetSysLab The University of British Columbia Abdullah Gharaibeh Matei Ripeanu.
Junxian Huang 1 Feng Qian 2 Yihua Guo 1 Yuanyuan Zhou 1 Qiang Xu 1 Z. Morley Mao 1 Subhabrata Sen 2 Oliver Spatscheck 2 1 University of Michigan 2 AT&T.
Outline Network related issues and thinking for FAX Cost among sites, who has problems Analytics of FAX meta data, what are the problems  The main object.
Can Internet Video-on-Demand Be Profitable? SIGCOMM 2007 Cheng Huang (Microsoft Research), Jin Li (Microsoft Research), Keith W. Ross (Polytechnic University)
ALICE DATA ACCESS MODEL Outline ALICE data access model - PtP Network Workshop 2  ALICE data model  Some figures.
1 Exploring Data Reliability Tradeoffs in Replicated Storage Systems NetSysLab The University of British Columbia Abdullah Gharaibeh Advisor: Professor.
GridFTP Guy Warner, NeSC Training.
Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.
Advanced Network Architecture Research Group 2001/11/149 th International Conference on Network Protocols Scalable Socket Buffer Tuning for High-Performance.
Descriptive Data Analysis of File Transfer Data Sudarshan Srinivasan Victor Hazlewood Gregory D. Peterson.
Experiences in Design and Implementation of a High Performance Transport Protocol Yunhong Gu, Xinwei Hong, and Robert L. Grossman National Center for Data.
ALICE data access WLCG data WG revival 4 October 2013.
Profiling Grid Data Transfer Protocols and Servers George Kola, Tevfik Kosar and Miron Livny University of Wisconsin-Madison USA.
Page 1 CSISS Center for Spatial Information Science and Systems Design and Implementation of CWIC Metrics Weiguo Han, Liping Di, Yuanzheng Shao, Lingjun.
Brussels, 18 th March, RURAL WINGS IP Network traffic and reliability evaluation for the Rural Wings project Final Test Runs results - D7.6 - Patricia.
Software Performance Testing Based on Workload Characterization Elaine Weyuker Alberto Avritzer Joe Kondek Danielle Liu AT&T Labs.
UDT: UDP based Data Transfer Protocol, Results, and Implementation Experiences Yunhong Gu & Robert Grossman Laboratory for Advanced Computing / Univ. of.
Current Testbed : 100 GE 2 sites (NERSC, ANL) with 3 nodes each. Each node with 4 x 10 GE NICs Measure various overheads from protocols and file sizes.
Instrumentation of the SAM-Grid Gabriele Garzoglio CSC 426 Research Proposal.
DBI313. MetricOLTPDWLog Read/Write mixMostly reads, smaller # of rows at a time Scan intensive, large portions of data at a time, bulk loading Mostly.
Grid Lab About the need of 3 Tier storage 5/22/121CHEP 2012, The need of 3 Tier storage Dmitri Ozerov Patrick Fuhrmann CHEP 2012, NYC, May 22, 2012 Grid.
Advanced Network Architecture Research Group 2001/11/74 th Asia-Pacific Symposium on Information and Telecommunication Technologies Design and Implementation.
1 BWdetail: A bandwidth tester with detailed reporting Masters of Engineering Project Presentation Mark McGinley April 19, 2007 Advisor: Malathi Veeraraghavan.
1 On Dynamic Parallelism Adjustment Mechanism for Data Transfer Protocol GridFTP Takeshi Itou, Hiroyuki Ohsaki Graduate School of Information Sci. & Tech.
LEGS: A WSRF Service to Estimate Latency between Arbitrary Hosts on the Internet R.Vijayprasanth 1, R. Kavithaa 2,3 and Raj Kettimuthu 2,3 1 Coimbatore.
Spectrum of Support for Data Movement and Analysis in Big Data Science Network Management and Control E-Center & ESCPS Network Management and Control E-Center.
GridFTP GUI: An Easy and Efficient Way to Transfer Data in Grid
CEOS Working Group on Information Systems and Services - 1 Data Services Task Team Discussions on GRID and GRIDftp Stuart Doescher, USGS WGISS-15 May 2003.
GridFTP Richard Hopkins
Topic 3 Analysing network traffic
CMS Computing Model Simulation Stephen Gowdy/FNAL 30th April 2015CMS Computing Model Simulation1.
The Globus eXtensible Input/Output System (XIO): A protocol independent IO system for the Grid Bill Allcock, John Bresnahan, Raj Kettimuthu and Joe Link.
ALICE DATA ACCESS MODEL Outline 05/13/2014 ALICE Data Access Model 2  ALICE data access model  Infrastructure and SE monitoring.
File Grouping for Scientific Data Management: Lessons from Experimenting with Real Traces Shyamala Doraimani* and Adriana Iamnitchi University of South.
1 Workload Analysis of Globus’ GridFTP Nicolas Kourtellis Joint Work with:Lydia Prieto, Gustavo Zarrate, Adriana Iamnitchi, Dan Fraser University of South.
Globus Data Storage Interface (DSI) - Enabling Easy Access to Grid Datasets Raj Kettimuthu, ANL and U. Chicago DIALOGUE Workshop August 2, 2005.
GridFTP Guy Warner, NeSC Training Team.
1 GridFTP and SRB Guy Warner Training, Outreach and Education Team, Edinburgh e-Science.
On the scale and performance of cooperative Web proxy caching 2/3/06.
LIOProf: Exposing Lustre File System Behavior for I/O Middleware
1 Performance Impact of Resource Provisioning on Workflows Gurmeet Singh, Carl Kesselman and Ewa Deelman Information Science Institute University of Southern.
Ó 1998 Menascé & Almeida. All Rights Reserved.1 Part VIII Web Performance Modeling (Book, Chapter 10)
System Administration Usage Reports. Agenda Usage Reports What are they Who can access them How to get them How to analyze the data collected Create basic.
Fast Pattern-Based Throughput Prediction for TCP Bulk Transfers
FileSystems.
Transport Protocols Relates to Lab 5. An overview of the transport protocols of the TCP/IP protocol suite. Also, a short discussion of UDP.
Transport Protocols Relates to Lab 5. An overview of the transport protocols of the TCP/IP protocol suite. Also, a short discussion of UDP.
CPSC 641: WAN Measurement Carey Williamson
Presentation & Demo August 7, 2018 Bill Shelden.
Carey Williamson Department of Computer Science University of Calgary
Presentation transcript:

Data Transfers in the Grid: Workload Analysis of Globus GridFTP Nicolas Kourtellis, Lydia Prieto, Gustavo Zarrate, Adriana Iamnitchi University of South Florida Dan Fraser Argonne National Laboratory

2

3 Objective 1: Quantify volume of transfers What is the transfer size distribution? What is the volume of activity for the most active hosts? Objective 2: Understand how tuning capabilities are used What are the buffer sizes used during the transfers? What is the average bandwidth? What is the utilization of functionalities like streams and stripes? Objective 3: Quantify user base and predict usage trends How does the user base evolve over time? What are the geographical characteristics of the GridFTP data transfers?

4 Outline Metrics dataset Surprises and … … zoom in (TeraGrid) Lessons and discussions

5 GridFTP Metrics Dataset FieldRange of ValuesComment Source hostname/host IPString/IPnetAnonymized Start time of the transferTimestampAccuracy: ms End time of the transferTimestampAccuracy: ms TCP Buffer SizeInteger (Bytes)≥0 Total Number of BytesInteger (Bytes)≥0 Number of StreamsInteger≥1 Number of StripesInteger≥1 Store or RetrieveInteger (0, 1,2) STOR, RETR, LIST

6 Started with ~137.5 million records (Jul’05 - Mar’07) Cleaning: –transfer size ≤0: ~22.8 million records –buffer size <0: ~1000 records –directory listings: ~3.9 million records –invalid hostnames (e.g., ~4,600 records –ANL-TeraGrid testing: ~11.4 million records –duplicate reports: ~16.8 million records –self transfers (source=destination): ~5.75 million records Clean database: ~77.2 million records (~56.2%) Metrics Dataset

7 Surprise #1: Transfer Size Distribution Objective 1: Quantify volume of transfers

8 Zoom-in: TeraGrid Are these results representative for production grids? –GridFTP testing for deployment and learning Identify transfers from TeraGRid and analyze dataset.

9 Transfer Size Distribution (TG) Objective 1: Quantify volume of transfers

10 All TG

11 Why So Small Transfers? There are still many old versions (i.e., before v3.9.5) of GridFTP in use. These versions do not include trace reporting capabilities. Other data transfer protocols and implementations are used Users have turned off the reporting capability Some of the logs are inevitably lost due to the UDP-based reporting mechanism The low transfer volumes could suggest a shift towards data-aware job scheduling (?)

12 Server to Server Transfers –High reporting of Self Transfers (more than 1/3) Objective 1: Quantify volume of transfers

13 Top 6 Active Hosts (all) Top 6 hosts traffic adds up to ~28% of total volume Next 48 hosts (IPs) transferred 10s of TB Objective 1: Quantify volume of transfers

14 Number of Transfers & Volume (TG) Objective 1: Quantify volume of transfers

15 Average Transfer Size & Total Volume (TG) Objective 1: Quantify volume of transfers

16 Daily Workload (TG) Average volume transferred per day: ~ 0.6TB GridFTP doesn’t get weekends free! Objective 1: Quantify volume of transfers

17 Monthly Workload (TG) ~50,000 transfers per day ~1TB per day of total volume Lowest around 0.5TB per day Peaks due to particular days Objective 1: Quantify volume of transfers

18 Objective 1: Quantify volume of transfers What is the transfer size distribution? What is the volume of activity for the most active hosts? Objective 2: Understand how tuning capabilities are used What are the buffer sizes used during the transfers? What is the average bandwidth? What is the utilization of functionalities like streams and stripes? Objective 3: Quantify user base and predict usage trends How does the user base evolve over time? What are the geographical characteristics of the GridFTP data transfers?

19 Surprise #2: Usage of Streams and Stripes Unreliably reported (from Globus team). Reliable observations: At least ~20% of the transfers used 4 streams (suggested number by ANL’s website)At least ~20% of the transfers used 4 streams (suggested number by ANL’s website) At least ~10% of the transfers used a different value, larger than one.At least ~10% of the transfers used a different value, larger than one. Maximum number of streams reported: 1010 (!!)Maximum number of streams reported: 1010 (!!) Objective 2: Understand how tuning capabilities are used

20 Buffer Size Distribution 60% from the original table: OS-controlled (0 bytes) 60% from the original table: OS-controlled (0 bytes) Most commonly used: 16–128KB Most commonly used: 16–128KB Largest buffer size: 1-2GB (92 records) Largest buffer size: 1-2GB (92 records) Objective 2: Understand how tuning capabilities are used

21 Average Bandwidth Distribution Peak: Mbps, ~7.7 million recordsPeak: Mbps, ~7.7 million records Most common (58%): 4Mbps—1GbpsMost common (58%): 4Mbps—1Gbps

22 Average Bandwidth Distribution (TG) Compared to the total dataset: => The region of 4Mbps to 1Gbps includes more than 85% of the transfers (58% for the whole dataset)

23 Objective 1: Quantify volume of transfers What is the transfer size distribution? What is the volume of activity for the most active hosts? Objective 2: Understand how tuning capabilities are used What are the buffer sizes used during the transfers? What is the average bandwidth? What is the utilization of functionalities like streams and stripes? Objective 3: Quantify user base and predict usage trends How does the user base evolve over time? What are the geographical characteristics of the GridFTP data transfers?

24 Surprise #3: Geographic Distribution USA: 78.4% or ~50.8 million transfers and 82.9% or ~1.7 PB Canada+Taiwan+Japan+Spain:~14M transfers and 346TB 49 different countries and 446 different cities (178 cities from USA) Objective 3: Quantify user base and predict usage trends

25 User and Domain Evolution (all) Continuing increase of user and organization population Forecasts: 67 new IPs and 14 new domains per month Objective 3: Quantify user base and predict usage trends

26 Summary of Results Many transfers in the range of KBs to 10s MB (peak in 16MB-32MB). –relevant for setting up realistic simulations. –previous work assume different, larger file sizes. Bandwidth measured in previous work is confirmed by our workload analysis. Tuning parameters: –Users tend not to set the buffer size explicitly (60%), leaving it to the OS –The unexpectedly small transfers do not justify tuning GridFTP parameters (stripes and streams) The usage of Globus GridFTP is growing over time in terms of IPs (users), domains (organizations), and volume transferred. Missing some of the big players

?