

1 Workload Analysis of Globus' GridFTP Nicolas Kourtellis Joint work with: Lydia Prieto, Gustavo Zarrate, Adriana Iamnitchi, Dan Fraser University of South Florida & Argonne National Laboratory

2 Metrics Project Dataset Start with ~137.5 million records (Jul '05 - Mar '07). Records removed:
- ~22.8 million records with size ≤ 0
- ~1,000 records with buffer size < 0
- ~3.9 million records for directory listings
- ~4,600 records with invalid hostnames/IPs
- ~11.4 million records from identified ANL-TeraGrid testing
- ~16.8 million records identified as duplicate reports
- ~5.75 million records of self transfers (same hostname for source and destination)
=> In the end: ~77.2 million records, or ~56.2% of the original!
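The record-level filters above can be sketched as a single predicate over each log record. This is a minimal illustration in plain Python; the field names (num_bytes, buffer_size, is_dir_listing, src_host, dst_host) are hypothetical stand-ins, not the actual Metrics Project schema.

```python
# Hypothetical record filtering mirroring the cleaning steps above.
# Field names are illustrative, not the real Metrics Project schema.

def is_valid(record):
    """Return True if the record survives the cleaning filters."""
    if record["num_bytes"] <= 0:                  # ~22.8M records with size <= 0
        return False
    if record["buffer_size"] < 0:                 # ~1,000 records with negative buffer size
        return False
    if record["is_dir_listing"]:                  # ~3.9M directory listings
        return False
    if record["src_host"] == record["dst_host"]:  # self transfers
        return False
    return True

records = [
    {"num_bytes": 1024, "buffer_size": 65536, "is_dir_listing": False,
     "src_host": "a.example.org", "dst_host": "b.example.org"},   # kept
    {"num_bytes": 0, "buffer_size": 65536, "is_dir_listing": False,
     "src_host": "a.example.org", "dst_host": "b.example.org"},   # dropped: size <= 0
]
clean = [r for r in records if is_valid(r)]
print(len(clean))  # 1
```

In practice these filters would run as SQL predicates against the Metrics database rather than in application code; the point here is only the order-independent, per-record nature of the first cleaning passes (duplicate detection, by contrast, needs cross-record context).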

3 Metrics Project Dataset (Cont.) Server-to-server transfers => duplicate reports of the same transfer. Criteria to identify duplicates:
1) Window of 5 records
2) Complementary stor_or_retr code (0 or 1)
3) Same number of bytes, buffer size, and block size
4) Transfer time (end - start) within 1 sec difference
5) Start (or end) time within 60 sec difference
6) For more than one matching record, pick the smallest difference in transfer time.
=> ~16.8 million server-to-server transfers identified as duplicate reports; ~5.75 million records of self transfers (same hostname for source and destination).
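The six criteria above amount to a windowed matching pass over the log. The sketch below is one plausible implementation, not the authors' actual code; the record fields are hypothetical dicts standing in for the real report schema.

```python
# Sketch of the duplicate-report matching described above.
# Records are hypothetical dicts; real Metrics Project reports have more fields.

WINDOW = 5  # criterion 1: consider records at most 5 positions apart

def is_duplicate_pair(a, b):
    """Criteria 2-5: the same transfer reported by both endpoints."""
    return (a["stor_or_retr"] != b["stor_or_retr"]   # complementary 0/1 codes
            and a["num_bytes"] == b["num_bytes"]
            and a["buffer_size"] == b["buffer_size"]
            and a["block_size"] == b["block_size"]
            and abs((a["end"] - a["start"]) - (b["end"] - b["start"])) <= 1
            and abs(a["start"] - b["start"]) <= 60)

def find_duplicates(records):
    """Return indices of records flagged as duplicate reports."""
    dup = set()
    for i, a in enumerate(records):
        candidates = [j for j in range(i + 1, min(i + 1 + WINDOW, len(records)))
                      if j not in dup and is_duplicate_pair(a, records[j])]
        if candidates:
            # criterion 6: among matches, keep the closest transfer time
            best = min(candidates, key=lambda j: abs(
                (a["end"] - a["start"]) - (records[j]["end"] - records[j]["start"])))
            dup.add(best)
    return dup

recs = [
    {"stor_or_retr": 0, "num_bytes": 10**6, "buffer_size": 65536,
     "block_size": 262144, "start": 0.0, "end": 10.0},   # RETR side
    {"stor_or_retr": 1, "num_bytes": 10**6, "buffer_size": 65536,
     "block_size": 262144, "start": 1.0, "end": 11.0},   # STOR side, same transfer
]
print(find_duplicates(recs))  # {1}
```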

4 Results (1): Transfer Size Distribution Notes:
1) 1st peak: 16MB - 32MB, ~13 million records
2) 2nd peak: 512B - 1KB (low transfer size), ~7.4 million records
3) 3rd peak: 0 - 2B, ~5.2 million records
4) Maximum bucket: 8TB - 16TB, 45 records
5) GB-region buckets: ~255,000 records
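The histogram above groups transfer sizes into power-of-two buckets (512B - 1KB, 16MB - 32MB, 8TB - 16TB, ...). A minimal sketch of that bucketing, assuming each bucket spans [2^k, 2^(k+1)):

```python
import math

def size_bucket(num_bytes):
    """Map a transfer size to a power-of-two bucket label, e.g. 1500 -> '1KB-2KB'."""
    if num_bytes < 1:
        return "0B"
    exp = int(math.log2(num_bytes))  # bucket [2^exp, 2^(exp+1))

    def fmt(n):
        for unit in ("B", "KB", "MB", "GB", "TB"):
            if n < 1024:
                return f"{n}{unit}"
            n //= 1024
        return f"{n}PB"

    return f"{fmt(2 ** exp)}-{fmt(2 ** (exp + 1))}"

print(size_bucket(20 * 2**20))  # 16MB-32MB, the largest peak in the data
```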

5 Results (2): Buffer Size Distribution Notes:
1) 60% of records from the original table: 0B
2) Most commonly used: 16KB - 128KB
3) Maximum bucket: 1GB - 2GB, 92 records

6 Results (3): Average Bandwidth Distribution Notes:
1) Peak: Mbps, ~7.7 million records
2) Most common: 4Mbps - 1Gbps of average bandwidth (58%)
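Average bandwidth per transfer is presumably derived from the byte count and the start/end timestamps in each report. A hedged sketch of that derivation (this reading of "average bandwidth" is an assumption, not a documented GridFTP definition):

```python
def avg_bandwidth_mbps(num_bytes, start, end):
    """Average bandwidth in Mbps from one transfer record.

    Assumes 'average bandwidth' = total bits / wall-clock seconds,
    a plausible reading of the metric, not a confirmed definition.
    """
    seconds = end - start
    if seconds <= 0:
        # the logs are known to contain inconsistent time fields
        raise ValueError("inconsistent time fields")
    return num_bytes * 8 / seconds / 1e6

# 1 GB moved in 10 s -> 800 Mbps average
print(round(avg_bandwidth_mbps(10**9, 0.0, 10.0)))  # 800
```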

7 Results (4): Number of Streams Distribution Notes:
1) ~70% of the transfers used 1 stream!!
2) Only ~20% of the transfers used 4 streams (the number suggested by ANL's website)
3) The remaining ~10% of the user base used other numbers of streams
4) Maximum of the CDF: 1010 streams (!!)

8 Results (5): Average Bandwidth vs. Number of Streams Notes:
1) Only transfer sizes ≥ 1GB
2) Bandwidth increase by a factor of 2 or 3 for streams > 10
3) Bandwidth ceiling at Mbps after ~32 streams => Gbps infrastructure

9 Results (6): Number of Stripes Distribution Summary of results:
1 stripe: 99.5% of the transfers!
2-31 stripes: 0.5% of the transfers!

10 Results (7): User and Organization Evolution over Time Notes:
1) Continuing increase in the user and organization population
2) Forecasts: 43 new IPs and 14 new organizations (domains) per month
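The per-month forecasts above presumably come from a linear fit to the cumulative counts. A minimal least-squares sketch of that kind of trend estimate; the monthly counts below are made up for illustration, not taken from the dataset:

```python
# Least-squares line fit, the kind of linear trend presumably behind
# the "43 new IPs per month" forecast. The counts are hypothetical.

def fit_line(xs, ys):
    """Return slope and intercept of the least-squares line y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

months = [0, 1, 2, 3, 4]
unique_ips = [100, 145, 185, 230, 272]   # hypothetical cumulative IP counts
slope, intercept = fit_line(months, unique_ips)
print(round(slope, 1))  # 42.9 new IPs per month on this toy data
```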

11 Results (8): Geographical Characterization Notes:
1) USA: 78.4% (~50.8 million transfers) of transfers and 82.9% (~1.3 PB) of volume
2) Some activity from Canada, Taiwan, Japan, and Spain (~14 million transfers and 346TB in volume)

12 Results (9): Server-to-Server Transfers (a) Notes:
1) ~257,000 transfers per month
2) Growth rate of ~27,000 transfers per month
(Plot legend: # of transfers; volume; linear fitting)

13 Results (10): Server-to-Server Transfers (b) Notes:
1) Inter-domain transfers account for a small % of transfers but a correspondingly high % of volume
2) The opposite holds for intra-domain (inter-IP) transfers
3) High reporting of self transfers (more than 1/3)

14 Results (11): Year-Round Comparison of the Number of Transfers (per month) Comment: The ratio decreases over time, suggesting a stabilizing trend in the number of transfers.

15 Results (12): Year-Round Comparison of the Volume of Transfers (per month) Comment: The ratio decreases over time, suggesting a stabilizing trend in the volume of transfers. There is an unexplained (?) dramatic increase in Dec '06!

16 DISCUSSION Open Questions:
1) How can the functionalities of GridFTP be explored further for:
   a) Better performance (e.g., speedup of transfers via streams, stripes, etc.)?
   b) Better utilization of resources (e.g., bandwidth, storage, etc.)?
   => Tutorials, suggestions, and solutions/tools/applications for users.
2) How can the system evolution and usage analysis be useful for:
   a) Prediction and provisioning of the Globus Grid's resources?
   b) Designing new benchmarks for evaluating Grid resources, such as data transfer components, and for more realistic simulations?
   c) Why aren't the big players of GridFTP (CERN, etc.) reported? (Version or component?)
   d) How much does the version of the component affect the results?
   e) Bottom line: Are these data logs representative of the GridFTP population?
   f) If not, then which component would have representative logs, even with limited logging?
3) How can we improve the Usage Statistics (Metrics) Collection system?
   a) Make it efficient for the vast number of reports expected in the future (e.g., daily summaries?)
   b) Make it robust (to attacks, bogus data, etc.)
   c) Add reporting of bugs in the user's system (in the form of live feedback)?
   d) Add more details to the reports (like source AND destination?)
   e) Eliminate bugs in the reporting:
      i) Duplication
      ii) FTP response code (in conjunction with (c))
      iii) Zero and negative buffer size values (crosscheck the values used against the values allowed by the system)
      iv) Sometimes-inconsistent time fields
      v) Use of the IP field; indexes on the DB to speed up the analysis
      vi) Changes in the DB schema (monthly differences, etc.)