Descriptive Data Analysis of File Transfer Data
Sudarshan Srinivasan, Victor Hazlewood, Gregory D. Peterson


Objective
- Understand the GridFTP transfer log data we have at NICS.
- Analyze the data and identify areas of potential improvement.
- Perform predictive analysis to improve efficiency.
- Apply the knowledge to XSEDE service providers.

NICS GridFTP Infrastructure
[Figure: diagram of the NICS GridFTP infrastructure.]

GridFTP Logging
- GridFTP data transfer protocol version
- Two types of logging: "usage" logging and "log_transfer" logging (enabled in 5.2.2).
- Prior to endpoint IP address data was filled with
- Thanks to the Globus folks for fixing this bug!

Transfer Logs
- NICS uses a PostgreSQL database for storing transfer log data.
- Two new tables: n_gridftp_usage and n_gridftp_usage_detail.
- n_gridftp_usage: quick lookup of aggregate monthly GridFTP usage information.
- n_gridftp_usage_detail: detailed records of each data transfer.
- Log data includes: starttime, endtime, nbytes, user, filename, source and destination end points.
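A minimal sketch of how the detail table could feed a monthly summary. It uses an in-memory SQLite database as a stand-in for PostgreSQL. The column types, the ISO timestamp strings, and the sample rows are all assumptions made for illustration; only the table name and column names come from the slide, and the way n_gridftp_usage is actually populated is not described.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")

# Column names follow the slide; the types are assumed.
conn.execute(
    """
    CREATE TABLE n_gridftp_usage_detail (
        starttime   TEXT,
        endtime     TEXT,
        nbytes      INTEGER,
        "user"      TEXT,
        filename    TEXT,
        source      TEXT,
        destination TEXT
    )
    """
)

# Hypothetical rows, not real transfer records.
sample_rows = [
    ("2013-11-05 12:00:00", "2013-11-05 12:00:10", 1000, "alice", "/a", "siteA", "NICS"),
    ("2013-11-20 08:00:00", "2013-11-20 08:00:05", 2000, "bob", "/b", "NICS", "siteB"),
    ("2013-12-01 09:30:00", "2013-12-01 09:30:01", 500, "alice", "/c", "siteA", "NICS"),
]
conn.executemany(
    "INSERT INTO n_gridftp_usage_detail VALUES (?, ?, ?, ?, ?, ?, ?)",
    sample_rows,
)

# Aggregate per month: number of transfers and total bytes moved.
monthly_usage = conn.execute(
    """
    SELECT substr(starttime, 1, 7) AS month,
           COUNT(*)                AS transfers,
           SUM(nbytes)             AS total_bytes
    FROM n_gridftp_usage_detail
    GROUP BY month
    ORDER BY month
    """
).fetchall()

for month, transfers, total_bytes in monthly_usage:
    print(month, transfers, total_bytes)
```

Keeping a small pre-aggregated table next to the detailed one trades some storage for fast monthly lookups, which matches the "quick lookup" role described for n_gridftp_usage.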

Log Data Collection
- Log files from each GridFTP server are copied to a central NFS location.
- Each month we run a processing script on the log files that checks the log entries for errors.
- Following this, we run a script to load the log files into the database table.
- We chose transfer log data for the year 2013 for this analysis.
Example log entry:
DATE= HOST=datamover1.nics.utk.edu PROG=globus-gridftp-server NL_EVNT=FTP_INFO START= USER=username NBYTES= VOLUME=/ STREAMS=1 STRIPS=1 DEST=[ ] TYPE=RETR CODE=226
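The parsing step could look roughly like the sketch below. The field names are taken from the example entry, but every value in the sample line is hypothetical (the values in the slide are elided). The timestamp format and the reading of DATE as the end time and START as the start time are assumptions, and `parse_entry` is an illustrative helper, not the NICS script.

```python
import re
from datetime import datetime

# A value is either a bracketed group (as in DEST=[...]) or a run of
# non-space characters; keys are plain word characters.
FIELD_RE = re.compile(r"(\w+)=(\[[^\]]*\]|\S*)")

# Assumed timestamp layout: YYYYMMDDhhmmss.ffffff
TIME_FORMAT = "%Y%m%d%H%M%S.%f"


def parse_entry(line):
    """Turn one KEY=value log line into a typed record.

    Raises KeyError or ValueError on malformed entries, which a monthly
    error-checking pass could catch and report.
    """
    raw = dict(FIELD_RE.findall(line))
    return {
        "starttime": datetime.strptime(raw["START"], TIME_FORMAT),
        "endtime": datetime.strptime(raw["DATE"], TIME_FORMAT),  # assumed end time
        "nbytes": int(raw["NBYTES"]),
        "user": raw["USER"],
        "host": raw["HOST"],
        "dest": raw["DEST"].strip("[]"),
        "type": raw["TYPE"],
        "code": int(raw["CODE"]),
    }


# Hypothetical entry: timestamps, byte count, and address are made up.
sample_line = (
    "DATE=20131105120010.500000 HOST=datamover1.nics.utk.edu "
    "PROG=globus-gridftp-server NL_EVNT=FTP_INFO START=20131105120000.500000 "
    "USER=username NBYTES=1048576 VOLUME=/ STREAMS=1 STRIPS=1 "
    "DEST=[192.0.2.10] TYPE=RETR CODE=226"
)

record = parse_entry(sample_line)
duration_seconds = (record["endtime"] - record["starttime"]).total_seconds()
print(record["nbytes"], duration_seconds)
```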

Log Data Analysis
- Two variables were identified: number of transfers and total amount of data transferred.
- Data transfer rate was computed from starttime, endtime and nbytes.
- Monthly visual comparison of data coming into and going out of NICS from everywhere.
- Number of transfers and amount of data transferred between NICS and other XSEDE sites, in both directions.
- Bucketing of transfer data based on transfer size (ts).
- The R statistical computing language was used to plot all histograms and graphs.
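The rate and bucketing steps can be sketched as follows. The analysis itself was done in R; this Python version is only an illustration. It assumes timestamps are given as seconds, that 1 MB means 2**20 bytes and 1 GB means 2**30 bytes, and that sizes falling exactly on a boundary go to the middle bucket. None of these details are stated in the slides.

```python
from collections import Counter

MB = 2**20  # assumed binary megabyte
GB = 2**30  # assumed binary gigabyte


def transfer_rate_gbps(starttime, endtime, nbytes):
    """Average rate in gigabits per second; times are in seconds."""
    seconds = endtime - starttime
    if seconds <= 0:
        return None  # zero or negative duration: rate is undefined
    return nbytes * 8 / seconds / 1e9


def size_bucket(ts):
    """Assign a transfer size in bytes to one of the three buckets."""
    if ts < MB:
        return "ts < 1 MB"
    if ts <= 64 * GB:
        return "1 MB < ts < 64 GB"  # boundary values land here (assumption)
    return "ts > 64 GB"


def bucket_percentages(sizes):
    """Percentage of transfers falling in each bucket."""
    counts = Counter(size_bucket(ts) for ts in sizes)
    total = sum(counts.values())
    return {bucket: 100 * n / total for bucket, n in counts.items()}


# Hypothetical sizes in bytes, not real transfer data.
example_sizes = [512, 4096, 10 * MB, 100 * GB]
print(bucket_percentages(example_sizes))
print(transfer_rate_gbps(0.0, 10.0, 12_500_000_000))
```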

Basic Statistics for the year 2013
Type                                 Quantity
Total transfers                      67,160,380
Average transfers per month          5,596,698
File transfers, ts > 64 GB           813 (0.001%)
File transfers, 1 MB < ts < 64 GB    19,374,549 (28.85%)
File transfers, ts < 1 MB            47,785,018 (71.15%)
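As a quick consistency check, the figures in this table can be recomputed from one another. The sketch below uses only the numbers reported on the slide; the variable names are illustrative.

```python
# Counts as reported on the slide.
total_transfers = 67_160_380
bucket_counts = {
    "ts > 64 GB": 813,
    "1 MB < ts < 64 GB": 19_374_549,
    "ts < 1 MB": 47_785_018,
}

# The three buckets should account for every transfer.
buckets_sum = sum(bucket_counts.values())

# Monthly average over the 12 months of 2013.
average_per_month = round(total_transfers / 12)

# Share of each bucket, as a percentage of all transfers.
bucket_share = {
    name: 100 * count / total_transfers for name, count in bucket_counts.items()
}

print(buckets_sum == total_transfers)
print(average_per_month)
for name, share in bucket_share.items():
    print(f"{name}: {share:.3f}%")
```

The bucket counts add up to the reported total, and the recomputed average and percentages agree with the table after rounding.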

Number of transfers and amount transferred for the year 2013
[Figure: two panels by month, number of transfers (in millions) and total amount transferred (in TB). Legend: Mean.]

Percentage of transfers vs transfer size for the year 2013
[Figure: x-axis: transfer size (ts); y-axis: percentage of transfers.]

Transfer speed for top 500 transfers with transfer size > 1 GB
[Figure: x-axis: month; y-axis: transfer speed (Gbps).]

Monthly comparison between number of transfers coming into and going out of NICS for the year 2013
[Figure: x-axis: month; y-axis: total number of transfers (in millions).]

Monthly comparison between total amount of data coming into and going out of NICS for the year 2013
[Figure: x-axis: month; y-axis: total amount of data moved (in TB).]
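One plausible way to build the incoming and outgoing series is to classify each log entry by its FTP command type. In FTP, RETR means the server sends a file to the client and STOR means the server receives one. Treating RETR as "out of NICS" and STOR as "into NICS" is an assumption here; the slides do not say how direction was determined. The records and the helper name are hypothetical.

```python
from collections import Counter

# Assumed mapping from FTP command type to direction relative to the server.
DIRECTION = {"RETR": "out", "STOR": "in"}


def monthly_direction_totals(records):
    """Count transfers and sum bytes per (month, direction).

    Each record is a (month, ftp_type, nbytes) tuple; entries with an
    unrecognized type are skipped.
    """
    counts = Counter()
    volume = Counter()
    for month, ftp_type, nbytes in records:
        direction = DIRECTION.get(ftp_type)
        if direction is None:
            continue
        counts[(month, direction)] += 1
        volume[(month, direction)] += nbytes
    return counts, volume


# Hypothetical records, not real transfer data.
example_records = [
    ("2013-01", "RETR", 1000),
    ("2013-01", "STOR", 4000),
    ("2013-01", "STOR", 6000),
    ("2013-02", "RETR", 2500),
    ("2013-02", "LIST", 0),  # not a file transfer, ignored
]

counts, volume = monthly_direction_totals(example_records)
print(counts[("2013-01", "in")], volume[("2013-01", "in")])
```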

Transfer data buckets for November 2013
[Figure: four panels for November 2013, each plotting percentage of transfers against transfer size (ts). Panels: all transfers; ts < 1 MB; 1 MB < ts < 64 GB; ts > 64 GB (total transfers: 25).]

Intra XSEDE Sites and Abbreviations
Site Name                                                                          Abbreviation
Texas Advanced Computing Center                                                    TACC
Pittsburgh Supercomputing Center                                                   PSC
San Diego Supercomputer Center                                                     SDSC
National Institute for Computational Sciences / Georgia Institute of Technology    NICS/GaTech
Indiana University                                                                 IU
Open Science Grid                                                                  OSG
National Center for Atmospheric Research                                           NCAR

Intra XSEDE site data coming into NICS
[Figure: two panels by month, number of transfers (in thousands) and total amount transferred (in TB). Series: TACC, PSC, SDSC, NICS/GaTech, IU, OSG, NCAR.]

Intra XSEDE site data going out of NICS
[Figure: two panels by month, number of transfers (in thousands) and total amount transferred (in TB). Series: TACC, PSC, SDSC, NICS/GaTech, IU, OSG, NCAR.]

Intra XSEDE site data coming into and going out of NICS together
[Figure: two panels by month, number of transfers (in thousands) and total amount transferred (in TB). Series: TACC, PSC, SDSC, NICS/GaTech, IU, OSG, NCAR.]

Future Work
- Currently in progress:
  – Moving from the PostgreSQL database to loading the data completely in memory on a separate machine.
  – Using Apache Spark for fast large-scale data processing.
  – Combining SQL, streaming, and complex analytics.
  – Using advanced data mining and machine learning algorithms provided by Python libraries.
- Next step:
  – Combine job data, filesystem data, and archive data for analysis.
  – Visualize data flow within the XSEDE network on a geographical map.

Thank You!