Descriptive Data Analysis of File Transfer Data Sudarshan Srinivasan Victor Hazlewood Gregory D. Peterson.


1 Descriptive Data Analysis of File Transfer Data Sudarshan Srinivasan Victor Hazlewood Gregory D. Peterson

2 Objective
- Understand the GridFTP transfer log data collected at NICS.
- Analyze the data and identify areas of potential improvement.
- Perform predictive analysis to improve efficiency.
- Apply the knowledge gained to XSEDE service providers.

3 NICS GridFTP Infrastructure

4 GridFTP Logging
- GridFTP data transfer protocol, version 5.2.2.
- Two types of logging: "usage" logging and "log_transfer" logging (enabled in 5.2.2).
- Prior to 5.2.2, the endpoint IP address field was filled with 0.0.0.0.
- Thanks to the Globus folks for fixing this bug!

5 Transfer Logs
- NICS uses a PostgreSQL database for storing transfer log data.
- Two new tables: n_gridftp_usage and n_gridftp_usage_detail.
- n_gridftp_usage: quick lookup of aggregate monthly GridFTP usage information.
- n_gridftp_usage_detail: detailed records of each data transfer.
- Log data includes: starttime, endtime, nbytes, user, filename, and source and destination endpoints.

6 Log Data Collection
- Log files from each GridFTP server are copied to a central NFS location.
- Each month we run a processing script on the log files that checks each log entry for errors.
- Following this, we run a script to load the log files into a database table.
- We chose transfer log data for the year 2013 for this analysis.

Sample log entry:

    DATE=20130401132041.657463 HOST=datamover1.nics.utk.edu PROG=globus-gridftp-server NL_EVNT=FTP_INFO START=2013041132041.534646 USER=username NBYTES=1048576 VOLUME=/ STREAMS=1 STRIPS=1 DEST=[192.249.6.164] TYPE=RETR CODE=226
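A minimal Python sketch of how one such KEY=VALUE log entry could be split into fields. This is not the NICS processing script. The function names are hypothetical, and the sample line is adapted from the slide with START written as a full 14-digit timestamp, which is an assumption. Treating DATE as the end of the transfer and START as its beginning is also an assumption.

```python
from datetime import datetime


def parse_log_entry(line: str) -> dict:
    """Split a whitespace-separated KEY=VALUE log line into a dict of strings."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that carry no '='
            fields[key] = value
    return fields


def parse_timestamp(stamp: str) -> datetime:
    """Parse a timestamp of the form YYYYMMDDhhmmss.ffffff."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")


# Illustrative entry adapted from the slide (START digit count is assumed).
SAMPLE = (
    "DATE=20130401132041.657463 HOST=datamover1.nics.utk.edu "
    "PROG=globus-gridftp-server NL_EVNT=FTP_INFO "
    "START=20130401132041.534646 USER=username NBYTES=1048576 "
    "VOLUME=/ STREAMS=1 STRIPS=1 DEST=[192.249.6.164] TYPE=RETR CODE=226"
)

entry = parse_log_entry(SAMPLE)
nbytes = int(entry["NBYTES"])
dest_ip = entry["DEST"].strip("[]")  # drop the surrounding brackets
elapsed = (
    parse_timestamp(entry["DATE"]) - parse_timestamp(entry["START"])
).total_seconds()
print(nbytes, dest_ip, elapsed)
```

A monthly error check along the lines described above could, for example, reject entries where a required key is missing or a timestamp fails to parse.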

7 Log Data Analysis
- Two variables were identified: number of transfers and total amount of data transferred.
- Data transfer rate was computed from starttime, endtime, and nbytes.
- Monthly visual comparison of data coming into and going out of NICS from everywhere.
- Intra-XSEDE site number of transfers and data transferred coming into and going out of NICS.
- Bucketing of transfer data based on transfer size (ts).
- The R statistical computing language was used to plot all histograms and graphs.
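The transfer rate mentioned above follows from three logged fields: rate = nbytes / (endtime - starttime). A small Python sketch, assuming the rate is reported in gigabits per second (10^9 bits per second); the function name is hypothetical and the slides do not state which unit convention was used.

```python
from datetime import datetime


def transfer_rate_gbps(start: datetime, end: datetime, nbytes: int) -> float:
    """Return the average transfer rate in gigabits per second (assumed unit)."""
    seconds = (end - start).total_seconds()
    if seconds <= 0:
        # Zero or negative durations cannot yield a meaningful rate.
        raise ValueError("endtime must be later than starttime")
    return nbytes * 8 / seconds / 1e9


# Illustrative values only: 10^10 bytes moved in 40 seconds.
start = datetime(2013, 4, 1, 13, 20, 0)
end = datetime(2013, 4, 1, 13, 20, 40)
rate = transfer_rate_gbps(start, end, 10**10)
print(rate)
```

Very short transfers can produce inflated rates because the logged duration approaches the timestamp resolution, which is one reason to restrict a speed comparison to large transfers.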

8 Basic Statistics for the year 2013

Type                                Quantity
Total transfers                     67,160,380
Average transfers per month         5,596,698
File transfers, ts > 64 GB          813 (0.001%)
File transfers, 1 MB < ts < 64 GB   19,374,549 (28.85%)
File transfers, ts < 1 MB           47,785,018 (71.15%)
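The three size classes in the table can be reproduced with a simple bucketing pass. The Python sketch below assumes binary units (1 MB = 2^20 bytes, 1 GB = 2^30 bytes) and places sizes exactly on a boundary in the middle bucket; the slides specify neither choice, and the function names are hypothetical.

```python
from collections import Counter

MB = 2**20  # assumed binary megabyte
GB = 2**30  # assumed binary gigabyte


def size_bucket(nbytes: int) -> str:
    """Assign a transfer size (ts) in bytes to one of three buckets."""
    if nbytes < MB:
        return "ts < 1 MB"
    if nbytes <= 64 * GB:
        return "1 MB <= ts <= 64 GB"
    return "ts > 64 GB"


def bucket_percentages(sizes) -> dict:
    """Return the percentage of transfers that fall into each bucket."""
    counts = Counter(size_bucket(n) for n in sizes)
    total = sum(counts.values())
    return {bucket: 100.0 * c / total for bucket, c in counts.items()}


# Illustrative sizes only, not NICS data.
sizes = [512, 2 * MB, 3 * MB, 100 * GB]
percentages = bucket_percentages(sizes)
print(percentages)
```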

9 Number of transfers and amount transferred for the year 2013
[Figure: two monthly charts, each with a mean line. One panel: number of transfers (in millions) by month, total = 83.54 million. Other panel: total amount transferred (in TB) by month, total = 1235.7.]

10 Percentage of transfers vs. transfer size for the year 2013
[Figure: percentage of transfers by transfer size (ts). Total transfers: 67,160,380.]

11 Transfer speed for top 500 transfers with transfer size > 1 GB
[Figure: transfer speed (gbps) by month.]

12 Monthly comparison between number of transfers coming into and going out of NICS for the year 2013
[Figure: total number of transfers (in millions) by month.]

13 Monthly comparison between total amount of data coming into and going out of NICS for the year 2013
[Figure: total amount of data moved (in TB) by month.]
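A monthly incoming/outgoing comparison like the two above can be derived by grouping transfers on month and direction. The Python sketch below assumes direction is taken from the FTP command type in the log (RETR means the server sent a file, so data left the site; STOR means the server received one). How the authors actually classified direction is not stated, and the function names and record layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

TB = 10**12  # assumed decimal terabyte


def direction(transfer_type: str) -> str:
    """Map an FTP command type to a data direction relative to the server."""
    if transfer_type == "RETR":
        return "out"
    if transfer_type == "STOR":
        return "in"
    return "other"


def monthly_summary(records) -> dict:
    """Aggregate (start, nbytes, type) records by (YYYY-MM, direction)."""
    summary = defaultdict(lambda: {"count": 0, "bytes": 0})
    for start, nbytes, transfer_type in records:
        key = (start.strftime("%Y-%m"), direction(transfer_type))
        summary[key]["count"] += 1
        summary[key]["bytes"] += nbytes
    return dict(summary)


# Illustrative records only, not NICS data.
records = [
    (datetime(2013, 11, 2), 2 * TB, "RETR"),
    (datetime(2013, 11, 9), 1 * TB, "RETR"),
    (datetime(2013, 11, 9), 5 * TB, "STOR"),
    (datetime(2013, 12, 1), 1 * TB, "STOR"),
]
summary = monthly_summary(records)
for (month, way), stats in sorted(summary.items()):
    print(month, way, stats["count"], stats["bytes"] / TB)
```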

14 Transfer data buckets for November 2013
[Figure: four panels, each showing percentage of transfers by transfer size (ts).]
- All transfers for November 2013. Total transfers: 2,181,157
- All transfers for November 2013, ts < 1 MB. Total transfers: 749,747
- All transfers for November 2013, 1 MB < ts < 64 GB. Total transfers: 1,431,385
- All transfers for November 2013, ts > 64 GB. Total transfers: 25

15 Intra-XSEDE Sites and Abbreviations

Site Name                                                                          Abbreviation
Texas Advanced Computing Center                                                    TACC
Pittsburgh Supercomputing Center                                                   PSC
San Diego Supercomputer Center                                                     SDSC
National Institute for Computational Sciences / Georgia Institute of Technology   NICS/GaTech
Indiana University                                                                 IU
Open Science Grid                                                                  OSG
National Center for Atmospheric Research                                           NCAR

16 Intra-XSEDE site data coming into NICS
[Figure: number of transfers (in thousands) and total amount transferred (in TB) by month, for TACC, PSC, SDSC, NICS/GaTech, IU, OSG, and NCAR.]

17 Intra-XSEDE site data going out of NICS
[Figure: number of transfers (in thousands) and total amount transferred (in TB) by month, for TACC, PSC, SDSC, NICS/GaTech, IU, OSG, and NCAR.]

18 Intra-XSEDE site data coming into and going out of NICS together
[Figure: number of transfers (in thousands) and total amount transferred (in TB) by month, for TACC, PSC, SDSC, NICS/GaTech, IU, OSG, and NCAR.]

19 Future Work
- Currently in progress:
  - Moving from a PostgreSQL database to loading the data completely in memory on a separate machine.
  - Using Apache Spark for fast large-scale data processing.
  - Combining SQL, streaming, and complex analytics.
  - Using advanced data mining and machine learning algorithms provided by Python libraries.
- Next step:
  - Combine job data, filesystem data, and archive data for analysis.
  - Visualize data flow within the XSEDE network on a geographical map.

20 Thank You!

