BigPanDA@Titan Implementation Plan Notes
Sarp Oral
Task 3.4 implementation plan notes
The following list focuses on the file system and I/O pieces, specifically parallel file system access contention, file system performance under heavy load, and communication network (Titan-Spider 2) contention. These are covered in the numbered list in Task 3.4 as #3, “Monitor the impact of the I/O associated with the PanDA-enabled payloads on the wider Titan I/O performance”, and #4, “Quantify the marginal increase in network contention on Titan’s interconnection network”.
Specific tasks planned
1. Using the GUIDE interface (guide.ccs.ornl.gov), obtain the list of all PanDA jobs executed over the past weeks. Status: Completed. The query is written and returns the list of past PanDA jobs for any time span. On average, we observe more than 3,000 PanDA jobs executed in a two-week period.
2. Feed the PanDA job list to the Job I/O query in GUIDE and automate the process. This will provide, per PanDA job, the data written and read, the average I/O size, and the OST usage in 5-minute intervals. Status: In progress. The current mechanism is manual and labor intensive. A Python script is needed to automate it.
3. Dump the Job I/O query results from step 2 to a flat file in matrix format for post-processing. Status: Not yet started.
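The flat-file dump in step 3 could look roughly like the sketch below. It is only an illustration under stated assumptions: the record fields (`job_id`, `interval_start`, `bytes_written`, `bytes_read`) and the function name are hypothetical, the sample records are synthetic, and the GUIDE query itself is not shown because its interface is not described in these notes. The sketch pivots per-interval records into one row per job and one column per 5-minute interval.

```python
import csv
import io

# Synthetic records standing in for Job I/O query output (field names are assumed).
records = [
    {"job_id": "1001", "interval_start": 0, "bytes_written": 10, "bytes_read": 5},
    {"job_id": "1001", "interval_start": 300, "bytes_written": 20, "bytes_read": 0},
    {"job_id": "1002", "interval_start": 300, "bytes_written": 7, "bytes_read": 1},
]


def write_job_io_matrix(records, out, field="bytes_written"):
    """Write a CSV matrix: one row per job, one column per interval start."""
    # Column order: every interval start seen in the data, ascending.
    intervals = sorted({r["interval_start"] for r in records})
    # rows[job_id][interval_start] -> summed value of the chosen field.
    rows = {}
    for r in records:
        row = rows.setdefault(r["job_id"], {})
        row[r["interval_start"]] = row.get(r["interval_start"], 0) + r[field]
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["job_id"] + intervals)
    for job_id in sorted(rows):
        # Intervals with no record for this job are written as 0.
        writer.writerow([job_id] + [rows[job_id].get(t, 0) for t in intervals])
    return intervals
```

Passing an open file instead of an in-memory buffer as `out` writes the same matrix to disk. One matrix per metric (for example a second call with `field="bytes_read"`) keeps the file format simple for post-processing.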
Specific tasks planned, continued
4. Statistically analyze the data obtained in step 3 to classify and identify the PanDA I/O characteristics. Status: Not yet started.
5. Correlate the data from step 4 with the Spider 2 file system I/O data and identify the impact of PanDA I/O on the file system.
6. Identify the start and stop times of the jobs documented in step 1 and correlate them with the GUIDE Titan LNET router activity. Currently this process is manual and labor intensive. A Python script is needed to automate it. Status: Not yet started.
7. Statistically analyze the data obtained in step 6 and identify possible Titan I/O interconnect contention. Status: Not yet started.
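One possible starting point for steps 6 and 7 is sketched below. It assumes job windows are available as (start, stop) pairs and router activity as (timestamp, traffic) samples; both layouts, the function names, and the sample values are hypothetical and synthetic. The sketch counts how many PanDA jobs are active at each router sample and computes a Pearson correlation between that count and the router traffic. A high correlation would only flag periods worth a closer look; it would not by itself demonstrate contention.

```python
from math import sqrt

# Synthetic inputs (assumed layouts): job (start, stop) times in seconds,
# and LNET router samples as (timestamp, traffic) pairs.
jobs = [(0, 600), (300, 900)]
samples = [(0, 100.0), (300, 200.0), (600, 100.0), (900, 0.0)]


def active_jobs(jobs, t):
    """Number of jobs running at time t (start inclusive, stop exclusive)."""
    return sum(1 for start, stop in jobs if start <= t < stop)


def pearson(xs, ys):
    """Pearson correlation coefficient, or None if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def correlate_jobs_with_router(jobs, samples):
    """Return per-sample active job counts and their correlation with traffic."""
    counts = [active_jobs(jobs, t) for t, _ in samples]
    traffic = [value for _, value in samples]
    return counts, pearson(counts, traffic)


counts, r = correlate_jobs_with_router(jobs, samples)
```

With real data, the same per-sample job counts could also be compared against Spider 2 file system metrics for step 5, provided both series share the same sampling interval.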