Published by Helen Anastasia Clark. Modified over 8 years ago.
1
CMS Computing Model Simulation
Stephen Gowdy, FNAL
9th Feb 2015
2
Want to look at different computing models
  To use caching
  Where to place caches
  How large they need to be
Discussion with others to possibly collaborate
Writing a basic Python simulation
  Could switch to C++ if better performance is needed
3
Event-driven discrete simulation
  Each job is an event
  Takes account of slots at sites
  Allows for perfect transfers between sites
  Can check the limit for a site’s internal bandwidth
    Information on this is not available in SiteDB
Code is at https://github.com/gowdy/sitesim
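The idea of treating each job as an event and accounting for slots at a site can be sketched as below. This is a minimal illustration only, not the code in the sitesim repository; the function name `simulate_site` and the `(submit_time, run_time)` job tuples are assumptions made for the example.

```python
import heapq


def simulate_site(jobs, slots):
    """Schedule jobs on a site with a fixed number of slots.

    jobs:  list of (submit_time, run_time) tuples (hypothetical format).
    slots: number of job slots at the site.
    Returns a list of (start, end) tuples in submit-time order.
    """
    running = []   # min-heap of end times of jobs currently holding a slot
    schedule = []
    for submit, run_time in sorted(jobs):
        # Release every slot whose job finished before this job arrives.
        while running and running[0] <= submit:
            heapq.heappop(running)
        if len(running) < slots:
            start = submit                  # a slot is free immediately
        else:
            start = heapq.heappop(running)  # wait for the earliest slot to free
        end = start + run_time
        heapq.heappush(running, end)
        schedule.append((start, end))
    return schedule


example = simulate_site([(0, 10), (0, 10), (5, 10)], slots=2)
```

With two slots, the third job in `example` has to wait until one of the first two finishes, so its start is pushed back from its submit time.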
4
Flat files are read to load site, network, job and file information
  Set up sites and links
  Next set up the catalogue of data
  Read in simulation parameters for CPU efficiency and remote read penalty
  Start processing jobs in sequence
Use list of jobs from the dashboard to feed the simulation
  See how it performs when processing current jobs
5
Class diagram (attributes of each class):
  Job: site, cpuTime, inputData, fractionRead, start, end, runTime, dataReadTime, dataReadCPUHit, theStore
  Site: name, disk, bandwidth, network [[site, bandwidth, quality, latency] … ], batch
  Batch: qjobs [ Job ], rjobs [ Job ], djobs [ Job ], cores, bandwidth
  EventStore: catalogue {lfn:[site…]}, files [(lfn, size) …]
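One way to picture the classes on this slide is as plain Python dataclasses. The attribute names are taken from the diagram; the types, the reading of qjobs/rjobs/djobs as queued, running and done jobs, and the helper methods `add_file` and `is_local` are assumptions for illustration, not the actual sitesim classes. Only a subset of the Job attributes is shown.

```python
from dataclasses import dataclass, field


@dataclass
class EventStore:
    catalogue: dict = field(default_factory=dict)  # lfn -> list of site names
    files: list = field(default_factory=list)      # list of (lfn, size) tuples

    def add_file(self, lfn, size, site):
        """Hypothetical helper: register a file and one replica location."""
        self.files.append((lfn, size))
        self.catalogue.setdefault(lfn, []).append(site)

    def is_local(self, lfn, site):
        """Hypothetical helper: True if the catalogue lists lfn at site."""
        return site in self.catalogue.get(lfn, [])


@dataclass
class Batch:
    cores: int
    bandwidth: float
    qjobs: list = field(default_factory=list)  # assumed: queued jobs
    rjobs: list = field(default_factory=list)  # assumed: running jobs
    djobs: list = field(default_factory=list)  # assumed: done jobs


@dataclass
class Site:
    name: str
    disk: float
    bandwidth: float
    batch: Batch
    network: list = field(default_factory=list)  # [site, bandwidth, quality, latency] entries


@dataclass
class Job:
    site: str
    cpuTime: float
    inputData: list      # lfns read by the job
    fractionRead: float
    start: float
    theStore: EventStore


store = EventStore()
store.add_file("/store/example.root", 2.0, "T1_Example")  # made-up lfn and site
job = Job("T1_Example", 3600.0, ["/store/example.root"], 1.0, 0.0, store)
```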
6
Extracted from SiteDB pledge database
  Use information for 2014, the most recent update
  If a site has no pledge, just assume 10TB and 100 slots
    Tier-2s default is larger, should probably update
  No internal bandwidth information, so assume 20GB/s at all sites
Recently started considering only US Tier-1 and Tier-2 sites
  Sizes taken by hand from REBUS (could probably also be automated)
  Vanderbilt assumed to be the same as the others
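The fallback rules above (10TB and 100 slots when a site has no pledge, 20GB/s internal bandwidth everywhere) might be applied roughly as follows. The function name, the dictionary keys and the site names are hypothetical; the layout of the real pledge data is not described on the slide.

```python
DEFAULT_DISK_TB = 10           # assumed when a site has no pledge
DEFAULT_SLOTS = 100            # assumed when a site has no pledge
INTERNAL_BANDWIDTH_GB_S = 20   # assumed for every site (no data available)


def site_resources(name, pledges):
    """Return the resources to simulate for a site.

    pledges: dict mapping site name -> {"disk_tb": ..., "slots": ...}
             (hypothetical layout for the extracted pledge data).
    """
    pledge = pledges.get(name, {})
    return {
        "name": name,
        "disk_tb": pledge.get("disk_tb", DEFAULT_DISK_TB),
        "slots": pledge.get("slots", DEFAULT_SLOTS),
        "bandwidth_gb_s": INTERNAL_BANDWIDTH_GB_S,
    }


pledges = {"T1_Example": {"disk_tb": 5000, "slots": 8000}}  # made-up numbers
with_pledge = site_resources("T1_Example", pledges)
without_pledge = site_resources("T2_Example", pledges)
```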
7
Site, Start Time, Wall Clock, CPU time, files read, percentage of file read
  The last of these isn’t available from the dashboard
  Possible to get from xrootd monitoring, but how to link the information?
  Just use the xrootd information statistically?
Extracted job information from dashboard
  From 8pm 22nd September till midnight
  About 4% of jobs have no site information (discarded)
  About 1% have no CPU time (use wall clock)
  About 2% have no start time (use CPU time before end time)
Will compare wall clock in simulation with actual as a check on the quality of the simulation
Compare overall simulated wall clock time to compare different scenarios
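The three clean-up rules for incomplete dashboard records can be sketched as a small function. The record keys (`site`, `cpu_time`, `wall_clock`, `start`, `end`) are assumed names, and the sketch assumes wall clock and end time are always present, which the slide does not state.

```python
def clean_job(record):
    """Apply the clean-up rules to one job record (a dict).

    Returns a cleaned copy, or None if the job should be discarded.
    """
    if not record.get("site"):
        return None                              # no site information: discard
    job = dict(record)
    if job.get("cpu_time") is None:
        job["cpu_time"] = job["wall_clock"]      # no CPU time: use wall clock
    if job.get("start") is None:
        job["start"] = job["end"] - job["cpu_time"]  # CPU time before end time
    return job


no_site = clean_job({"site": None, "cpu_time": 10, "wall_clock": 12,
                     "start": 0, "end": 12})
no_cpu = clean_job({"site": "T2_Example", "cpu_time": None, "wall_clock": 12,
                    "start": 0, "end": 12})
no_start = clean_job({"site": "T2_Example", "cpu_time": 10, "wall_clock": 12,
                      "start": None, "end": 100})
```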
8
Extract network mesh from PhEDEx
  Using the links interface
  Also get reliability information
    If not present, assume 99%
  No actual transfer rate information is available for links
    Use what is available to get a number between 1GB/s and 10GB/s; not at all accurate
    Default is 1GB/s
Extract file location information from PhEDEx
  No historical information is available
    When updating job information, need to get an update for file locations
  Only get information on files used by jobs
  740 of the 8939 looked like they read data remotely (but some will be due to stale PhEDEx info)
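The link defaults described above could be applied along these lines. How the bandwidth estimate is derived from the PhEDEx data is not given on the slide, so `estimate_gb_s` is simply a hypothetical input that gets clamped to the 1GB/s to 10GB/s range; the function and field names are also assumptions.

```python
DEFAULT_QUALITY = 0.99    # used when no reliability information is present
MIN_BANDWIDTH_GB_S = 1.0  # also the default when no estimate exists
MAX_BANDWIDTH_GB_S = 10.0


def make_link(src, dst, estimate_gb_s=None, quality=None):
    """Build a link record with the defaults applied."""
    if estimate_gb_s is None:
        bandwidth = MIN_BANDWIDTH_GB_S
    else:
        # Clamp whatever estimate is available into the allowed range.
        bandwidth = min(max(estimate_gb_s, MIN_BANDWIDTH_GB_S), MAX_BANDWIDTH_GB_S)
    return {
        "from": src,
        "to": dst,
        "bandwidth_gb_s": bandwidth,
        "quality": DEFAULT_QUALITY if quality is None else quality,
    }


bare = make_link("T1_Example", "T2_Example")
fast = make_link("T1_Example", "T2_Example", estimate_gb_s=40.0, quality=0.9)
```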
9
Startup output when only using US T1 and T2 sites:
    $ python python/Simulation.py
    Read in 9 sites.
    Read in 72 network links.
    Read in 9982 files.
    Read in 6728 locations.
    Read in 3 latency bins.
    Read in 10 job efficiency slots.
    About to read and simulate 2611 jobs...
    …
10
Need to add caching strategy later
  Including cache cleaning if getting full
  Cache hierarchy
Currently the simulation can run with no transfers or with transfers; it can also discard transfers
  Won’t transfer if there is no space available at a site
Implement different models
  With the new version of xrootd, can read while still transferring
  Actual current model of reading remotely if data is not present should be added
11
Run standard set of 2293 US jobs
With all data for a job transferred in serial, total wall clock time is ~86.4Ms
  249 jobs need to transfer at least one file, taking a total of 1263s
Enabling remote read increases total time to ~87.5Ms
  This is only affected by jobs that don’t have data locally
  Need to update to reflect actual transfer times
    Currently idealised, using the whole bandwidth for every transfer
Enabling parallel transfers (i.e. only considering the longest one per job) reduces time
  248 jobs need to transfer a file, taking a total of 641s
Fairly large variations due to random numbers; converted to use seeds
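The difference between serial and parallel transfers comes down to summing the per-file transfer times versus taking the longest one. A minimal sketch under the idealised assumption stated on the slide (each transfer gets the whole link bandwidth); the function names and the `(size, bandwidth)` tuples are made up for the example.

```python
def transfer_time(size_gb, bandwidth_gb_s):
    """Idealised transfer time: the file gets the whole link bandwidth."""
    return size_gb / bandwidth_gb_s


def job_transfer_time(transfers, parallel=False):
    """Total time a job spends waiting for its input transfers.

    transfers: list of (size_gb, bandwidth_gb_s) tuples, one per file.
    Serial:   the transfers run one after another, so the times add up.
    Parallel: only the longest transfer counts.
    """
    times = [transfer_time(size, bw) for size, bw in transfers]
    if not times:
        return 0.0
    return max(times) if parallel else sum(times)


files = [(2.0, 1.0), (4.0, 1.0)]  # two files of 2GB and 4GB over 1GB/s links
serial = job_transfer_time(files)
parallel = job_transfer_time(files, parallel=True)
```

For a job that transfers a single file the two modes give the same time, so only jobs with several missing files benefit from parallel transfers in this idealised picture.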
12
Put all disk at the T1
  Increases total job time to 99.3Ms
Add realistic transfer times
Reallocate some disk space to CPU
Increase the load on the system until it is full