OSG Area Coordinator's Report: Workload Management
February 9th, 2011
Maxim Potekhin, BNL, 631-344-3621

2 Workload Management: Panda

Panda Monitoring:
- Closer integration of the existing Panda Monitoring System with the Global Dashboard
- Upgrade lowered in priority due to existing functionality in the Dashboard (ATLAS decision)

Scalability of Panda:
- Typical throughput has almost doubled in the past 12 months, from about 250k daily jobs run globally to almost 500k per day, with a peak count of 713k in the final days of data reprocessing in Nov '10
- That puts more pressure on the database (Oracle), which is used for keeping the complete state of the system, for monitoring, and for data mining for performance analysis
- The data is heavily indexed, and indexes can block during copying of data across tables
- The DB engine sometimes makes suboptimal choices when confronted with multiple indexes
- In the fall of 2010, there were a few problem days after a series of network outages: the resulting imbalance of data distribution across tables left a large backlog to be copied, hence decreased performance
- Multiple DB optimizations have been implemented since, notably table partitioning, with a demonstrated increase in performance
- Some queries are still problematic and require workarounds (one such workaround is sketched below)
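A typical workaround of this kind is to constrain every query on the partition key so Oracle can prune partitions instead of relying on a contended index. Below is a minimal sketch of the idea in Python with cx_Oracle; the table and column names (JOBSARCHIVED, MODIFICATIONTIME) and the connect string are illustrative assumptions, not the actual PanDA schema.

```python
# Minimal sketch: query a date-partitioned archive table so Oracle can
# prune partitions. Table/column names are illustrative, not the actual
# PanDA schema.
import datetime
import cx_Oracle

def jobs_finished_on(conn, day):
    """Count jobs archived on a given day, constraining the (assumed)
    partition key MODIFICATIONTIME so only one partition is scanned."""
    cur = conn.cursor()
    cur.execute(
        """SELECT COUNT(*) FROM JOBSARCHIVED
           WHERE MODIFICATIONTIME >= :d
             AND MODIFICATIONTIME <  :d + 1""",  # Oracle date arithmetic: +1 day
        d=day)
    (count,) = cur.fetchone()
    return count

if __name__ == "__main__":
    conn = cx_Oracle.connect("user/password@pandadb")  # placeholder DSN
    print(jobs_finished_on(conn, datetime.datetime(2010, 11, 25)))
```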

3 Workload Management: WBS
- (new monitor code): on hold due to ATLAS management decision; to be dropped?
- (monitor integration): progressing with the existing (old) code base
- (Daya Bay/LBNE): progress, ready for production
- (CHARMM expansion to 20+ sites): done, researchers happy
- New item: Panda Scalability, new database options; to be added?

4 Workload Management: Panda

Scalability of Panda, cont'd:
- Along with the DB optimization, alternatives are being considered for storage of finalized job data (the archive), where Oracle is redundant; NoSQL solutions in particular, such as Cassandra, HBase, etc.
- NoSQL advantages (taking Cassandra as an example):
  - Compared to a traditional RDBMS, more cost-effective horizontal scaling with commodity hardware and media
  - Load-balanced, redundant, truly distributed system
  - Extremely fast ingestion ("sinking") of data with proper configuration (important)
  - Demonstrated performance of NoSQL solutions in industry (Amazon, Facebook, Twitter, Google, etc.)
- In December 2010, started an evaluation of Cassandra with a real Panda job data feed (a sketch of the loading path follows this slide):
  - Test cluster (3 nodes) located at CERN
  - Data repository at Amazon S3
- First round of testing encouraging; data design ongoing
- To be evaluated at the ATLAS Software Week at CERN in April
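To give a flavor of the data design being explored, here is a minimal sketch of loading finalized job records into Cassandra with the pycassa client. The keyspace and column family names, the node list, and the record fields are hypothetical assumptions, not the schema used in the actual evaluation.

```python
# Minimal sketch of writing finalized Panda job records into Cassandra
# via the pycassa client. Keyspace/column family names, node list, and
# record layout are hypothetical, not the evaluation's actual schema.
import pycassa

# Connect to the (assumed) 3-node test cluster.
pool = pycassa.ConnectionPool('PandaArchive',
                              server_list=['node1:9160', 'node2:9160', 'node3:9160'])
jobs = pycassa.ColumnFamily(pool, 'jobs')

def archive_job(job):
    """Store one finalized job as a row keyed by its PandaID; each
    attribute becomes a column, so rows can be sparse."""
    row_key = str(job['pandaid'])
    columns = {k: str(v) for k, v in job.items() if k != 'pandaid'}
    jobs.insert(row_key, columns)

# Example record from the job data feed (fields are illustrative).
archive_job({'pandaid': 1234567,
             'jobstatus': 'finished',
             'computingsite': 'BNL_ATLAS',
             'modificationtime': '2010-11-25 12:00:00'})

# Point read back by key:
print(jobs.get('1234567'))
```

Keying each row by PandaID suits a write-once archive: inserts are fast and point reads by key are cheap, while ad-hoc analytical queries stay on the RDBMS side.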

5 Workload Management: Engagement

CHARMM:
- Thanks to the 17+ active sites used, the recent run was expedient, according to the team
- Resource requirement estimates turned out to be quite precise (encouraging)
- The last wave of jobs is finishing right now, and the data goes to the experimental group; only 408 jobs were submitted in the past month

LBNE/Daya Bay:
- Jobs ran at PDSF and BNL (J. Caballero); a number of issues were discovered and resolved, such as:
  - Peculiarities of the worker node configuration at PDSF (version of curl)
  - A suboptimal job configuration that caused some jobs to run out of memory, now fixed
- Additional software optimization was done by the researchers (MC)
- An announcement went out on the Daya Bay mailing list that the initial production run will start in a few days
- An additional cluster at IIT (Illinois) is under construction
- Panda user documentation is being reviewed at the researchers' request