OSG Area Coordinator’s Report: Workload Management
February 9th, 2011
Maxim Potekhin, BNL
Workload Management: Panda

Panda Monitoring:
- Closer integration of the existing Panda Monitoring System with the Global Dashboard
- Upgrade lowered in priority due to existing functionality in the Dashboard (ATLAS decision)

Scalability of Panda:
- Typical throughput almost doubled in the past 12 months, from about 250k jobs run globally per day to almost 500k per day, with a peak count of 713k in the final days of data reprocessing in Nov’10
- That puts more pressure on the database (Oracle), which is used for keeping the complete state of the system, monitoring, and data mining for performance analysis
  - Data is heavily indexed, and indexes can block during copying of data across tables
  - The DB engine sometimes makes suboptimal choices when confronted with multiple indexes
- In the fall of 2010, there were a few problem days after a series of network outages: the resulting imbalance of data distribution across tables left a large backlog to be copied, hence decreased performance
- Multiple DB optimizations have been implemented since, notably table partitions
  - Demonstrated increase in performance
  - Some queries are still problematic and require workarounds
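To put the throughput figures above in perspective, the daily job counts can be converted to average rates. This is a back-of-the-envelope sketch only: it assumes jobs are spread uniformly over 24 hours, which real workloads are not, and the function name `jobs_per_second` is made up for illustration.

```python
# Convert the daily job counts quoted on the slide into average rates.
# Assumption: jobs are spread evenly over the day (real load is burstier).
SECONDS_PER_DAY = 24 * 60 * 60


def jobs_per_second(jobs_per_day: float) -> float:
    """Average job rate implied by a daily job count."""
    return jobs_per_day / SECONDS_PER_DAY


daily_counts = {
    "typical, a year earlier": 250_000,
    "typical, now": 500_000,
    "peak, Nov'10 reprocessing": 713_000,
}

for label, count in daily_counts.items():
    print(f"{label}: {count:,} jobs/day ~ {jobs_per_second(count):.2f} jobs/s")
```

Each job produces several state updates and monitoring reads over its lifetime, so the actual database request rate is some multiple of these averages; that multiplier is not given in this report.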
Workload Management: WBS
- (new monitor code) on hold due to ATLAS management decision. To be dropped?
- (monitor integration) progressing with the existing (old) code base
- (Daya Bay/LBNE) progress, ready for production
- (CHARMM expansion to 20+ sites) done, researchers happy
- New item – Panda Scalability – new database options. To be added?
Workload Management: Panda

Scalability of Panda, cont’d:
- Along with DB optimization, alternatives are being considered for storage of finalized job data (archive), where Oracle is redundant; looking at noSQL solutions in particular, such as Cassandra, HBase, etc.
- noSQL advantages (e.g. Cassandra):
  - Compared to a traditional RDBMS, more cost-effective horizontal scaling with commodity hardware and media
  - Load-balanced, redundant, truly distributed system
  - Extremely fast sinking of data with proper configuration (important)
  - Demonstrated performance of noSQL solutions in industry (Amazon, Facebook, Twitter, Google, etc.)
- In December 2010, started an evaluation of Cassandra with a real Panda job data feed
  - Test cluster (3 nodes) located at CERN
  - Data repository at Amazon S3
  - First round of testing encouraging, data design ongoing
  - To be evaluated at the ATLAS Software Week at CERN in April
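The "load-balanced, redundant, truly distributed" property can be illustrated with a toy hash ring of the kind used by Dynamo-style stores. This is a simplified sketch, not Cassandra's actual partitioner or the Panda archive design: the node names, the `Ring` class, the number of virtual nodes and the replication factor are all assumptions made for illustration.

```python
import hashlib
from bisect import bisect_right
from collections import Counter


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 used only as a stable hash)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        self.nodes = list(nodes)
        # Each physical node owns several points on the ring to even out load.
        self.points = sorted(
            (token(f"{node}#{i}"), node) for node in self.nodes for i in range(vnodes)
        )
        self.tokens = [t for t, _ in self.points]

    def replicas(self, key: str, rf: int = 2):
        """Return the first `rf` distinct nodes clockwise from the key's token."""
        rf = min(rf, len(self.nodes))  # cannot have more replicas than nodes
        i = bisect_right(self.tokens, token(key))
        owners = []
        while len(owners) < rf:
            node = self.points[i % len(self.points)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners


# Hypothetical three-node cluster, mirroring the size of the test cluster.
ring = Ring(["node-a", "node-b", "node-c"])
primary_load = Counter(ring.replicas(f"job-{n}")[0] for n in range(3000))
print(primary_load)
print(ring.replicas("job-42"))
```

Adding a node to such a ring moves only the keys adjacent to its new points, which is what makes horizontal scaling on commodity hardware comparatively cheap.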
Workload Management: Engagement

CHARMM:
- Thanks to the 17+ active sites used, the recent run was expedient, according to the team
- Resource requirements turned out to be pretty precise (encouraging)
- The last wave of jobs is finishing right now and the data goes to the experimental group; only 408 jobs submitted in the past month

LBNE/Daya Bay:
- Jobs ran at PDSF and BNL (J. Caballero); a number of issues discovered and resolved, such as:
  - Peculiarities of WN configuration at PDSF (version of curl)
  - Suboptimal job configuration resulted in some jobs running out of memory, which is now fixed
- Additional software optimization was done by the researchers (MC)
- An announcement went out on the Daya Bay mailing list that the initial production run will start in a few days
- An additional cluster at IIT (Illinois) is under construction
- Panda user documentation is being reviewed as per researchers’ request