Panda Notes for Discussion Torre Wenaus BNL ADC Retreat, Napoli Feb 3, 2011.

DB Performance/Scalability – Archive
• Major (biggest?) current issue
• Major performance problems in the archival DB; we have had to shut off functionality (restored after tuning where possible)
• Trajectory to resolving the problems and restoring performance and scalability not really visible
• Working closely with the DBAs, but we take a clear message – look at other potential solutions also
• Cf. Vincent's talk yesterday – noSQL
• Currently looking at Cassandra, Google BigTable (no), Amazon SimpleDB
• Job data for 2010 extracted to Amazon S3 as CSV; file table in progress
• Send data to noSQL prototypes in parallel to Oracle while we evaluate (see the sketch below); continue to work with the DBAs. Not running away from Oracle
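As an illustration of the parallel-write idea, a minimal Python sketch: the Oracle archive write stays authoritative (stubbed here), and the same record is mirrored best-effort to a Cassandra prototype via pycassa. The keyspace and column family names (PandaArchive, jobsarchived) and the record fields are assumptions for illustration, not the actual schema.

```python
# Sketch only: mirror archived job records to a Cassandra prototype in
# parallel with the authoritative Oracle write. Keyspace/column family
# names and fields are illustrative assumptions.
import logging
import pycassa

pool = pycassa.ConnectionPool('PandaArchive', ['cassandra-test:9160'])  # assumed test cluster
jobs_cf = pycassa.ColumnFamily(pool, 'jobsarchived')                    # assumed column family

def archive_job_oracle(job):
    """Placeholder for the existing Oracle archival path (authoritative)."""
    pass

def archive_job(job):
    archive_job_oracle(job)   # Oracle remains the system of record
    try:
        # Row key = PandaID; store a flat column set for evaluation queries
        jobs_cf.insert(str(job['PandaID']), {
            'prodUserName':     job.get('prodUserName', ''),
            'jobStatus':        job.get('jobStatus', ''),
            'computingSite':    job.get('computingSite', ''),
            'modificationTime': str(job.get('modificationTime', '')),
        })
    except Exception:
        # Best effort only: a prototype failure must never block archiving
        logging.exception('Cassandra mirror write failed (ignored)')
```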

DB Performance – Live DB
• Live DB (in-progress jobs and 3-day archive) is less of a critical issue, but also a serious scalability concern – deadly if we do hit serious problems
• Look at expanded use of memcached as the live job state repository, with periodic (~1 Hz?) flushes to Oracle (see the sketch below)
• The Panda server already has a memcached service across the servers – can meet the stateless requirement without relying solely on Oracle, preserving horizontal scalability
• Possibility of no Panda downtime during Oracle downtime
• Has to be tested and evaluated in a testbed
• An idea – not implemented
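A minimal sketch of the idea (not an implementation): job state updates go to memcached first, and a background thread flushes dirty entries to Oracle at roughly 1 Hz. The key scheme and the Oracle flush are placeholders; a real version would need a dirty list shared across all Panda servers rather than the per-process set used here.

```python
# Sketch only: memcached as the live job state store, with periodic
# flushes to Oracle. Key scheme and flush routine are placeholders.
import threading
import time
import memcache

mc = memcache.Client(['127.0.0.1:11211'])   # Panda servers' shared memcached pool (assumed address)
_dirty = set()                              # per-process only; a real version needs a shared dirty list
_lock = threading.Lock()

def update_job_state(panda_id, state):
    """Write the live job state to memcached and mark it for flushing."""
    mc.set('jobstate/%s' % panda_id, state)
    with _lock:
        _dirty.add(panda_id)

def flush_to_oracle(panda_id, state):
    """Placeholder for the periodic Oracle write (the durable copy)."""
    pass

def flusher(period=1.0):
    """Flush dirty job states to Oracle roughly once per second."""
    while True:
        time.sleep(period)
        with _lock:
            batch = list(_dirty)
            _dirty.clear()
        for pid in batch:
            state = mc.get('jobstate/%s' % pid)
            if state is not None:
                flush_to_oracle(pid, state)

threading.Thread(target=flusher, daemon=True).start()
```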

Scale via multiple server/DB instances?
• Consider revisiting >1 Panda server/DB, e.g. at a Tier 1?
• Please no
• Been down that road; don't go there
• No scalability gain, no real availability gain (if CERN dies we die anyway), big maintenance load
• Not necessary – better to scale the central services horizontally (and solve the DB issue thoroughly)

Data access/scaling (apart from PD2P)
• Event picking service – now extended to support tape retrieval of files with events on a selection list
• That covers sparse selections. What about processing that hits all or most files in a tape-resident sample? And maybe also sparse selections if they become very popular?
• Data carousel… sliding window… freight train scheme?
• Cycle through tape data, processing all queued jobs requiring currently staged data (next slide)
• Panda support for WAN direct access
• Adapt the memcached file catalog (*) as 'cache memory' for rebrokerage to maximize cache reuse (see the sketch below)
• (*) implemented for WN-level file brokerage but unused so far…?
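To make the cache-reuse idea concrete, a hedged sketch: rank candidate sites for rebrokerage by how many of a job's input files appear in a memcached file catalog. The key scheme (filecache/&lt;site&gt;/&lt;lfn&gt;) is an assumption for illustration; the existing WN-level catalog may be keyed differently.

```python
# Sketch only: cache-aware rebrokerage ranking. The memcached key scheme
# 'filecache/<site>/<lfn>' is an illustrative assumption.
import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def cache_hits(site, input_lfns):
    """Count how many of the job's input files are marked cached at a site."""
    keys = ['filecache/%s/%s' % (site, lfn) for lfn in input_lfns]
    found = mc.get_multi(keys)
    return len(found)

def rank_sites_by_cache(candidate_sites, input_lfns):
    """Prefer sites that already hold more of the job's inputs in cache."""
    return sorted(candidate_sites,
                  key=lambda site: cache_hits(site, input_lfns),
                  reverse=True)
```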

Efficient Tape-Resident Data Access
• If we need something like a data carousel…?
• Need a 'carousel engine': a special PandaMover-like? job queue regulating tape staging for efficient matching of data to jobs? (see the sketch below)
• Brokerage must be globally aware of all jobs hitting tape in order to aggregate those using staged data
• But (production) jobs are split between ProdDB and PandaDB – Bamboo can handle it, but is this a motivator for a merge? (next slide)
• Continue/complete the move of group production to the production system!
• Essential in order to process any tape-resident group production efficiently
• Is our task-to-job translation optimal? Rather than pre-defined jobs, are there advantages to dynamically defining jobs in an automated way based on efficiency considerations like cached data availability?
• Would this tie in with analysis interest in a higher-level task organization of user jobs?
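A toy sketch of what a carousel engine could look like: queued tape-resident jobs are grouped by input dataset, a sliding window of datasets is staged at a time, and all jobs needing a staged dataset are released together before the window advances. The staging and release calls are stubs, not existing PanDA or PandaMover interfaces.

```python
# Sketch only: a sliding-window 'data carousel' over tape-resident datasets.
# stage_dataset()/release_jobs() are stubs, not real PanDA interfaces.
from collections import defaultdict

def stage_dataset(dataset):
    """Placeholder: request that the dataset be staged from tape to disk."""
    pass

def release_jobs(jobs):
    """Placeholder: make these jobs eligible for brokerage/dispatch."""
    pass

def run_carousel(queued_tape_jobs, window_size=2):
    """Cycle through tape datasets, releasing all jobs per staged dataset."""
    # Global view: aggregate every queued job by its (tape) input dataset
    by_dataset = defaultdict(list)
    for job in queued_tape_jobs:
        by_dataset[job['inputDataset']].append(job)

    # Process datasets with the most waiting jobs first, one window at a time
    ordered = sorted(by_dataset, key=lambda ds: len(by_dataset[ds]), reverse=True)
    for i in range(0, len(ordered), window_size):
        window = ordered[i:i + window_size]
        for ds in window:
            stage_dataset(ds)
        for ds in window:
            release_jobs(by_dataset[ds])   # all jobs sharing the staged data run together
```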

ProdDB/PandaDB
• If it ain't broke, don't fix it. If it ain't well motivated, don't do it.
• Are some good motivations appearing, especially looking towards a long shutdown?
• The 'global job view' issue of carousel schemes
• DB scale concerns – clean up redundant info
• If we move away from utter dependence on Oracle, utter dependence remaining in ProdDB would be a tether
• Any DB re-engineering, e.g. to introduce some noSQL elements, should involve a design optimization cycle – it could make sense to include ProdDB/PandaDB rationalization

Analysis Issues
• Rebrokerage complements PD2P – good to reduce the time to rebrokerage and really try it! Monitoring a few days away
• Panda/DaTRI-mediated merging of small files (temporarily) for replication – where are we with this?
• Doesn't address small files on storage (files are unmerged on arrival) – do we need to?
• Use dispatch datasets for analysis dataset movement for greater efficiency? Containers make it practical

Panda in the Cloud
• Very grateful to the CERNVM folks for jumpstarting this – implemented an ATLAS evgen job in a CERNVM-resident production pilot managed by their CoPilot, with Panda-side work from Paul and Tadashi
• Then a quiet spell until Yushu Yao (LBNL) took it up on DOE's R&D cloud
• EC2-compatible interface (Eucalyptus)
• He's leaving for a promotion, sadly (for us!) – looking for a replacement
• Many possible approaches, everything on the table (next slide)
• Need dedicated, stable effort, if you believe like me that it's important!
• Learn for ourselves the utility, performance, cost
• Leverage free/available cloud resources

Panda in the Cloud Issues
• Some of the things on the table…
• Use Condor to submit pilots? Condor supports EC2 well. Same pilot factory as now.
• Or use a VM-resident pilot daemon? Requires some pilot adaptation, but should be minor
• Use CERNVM's CoPilot infrastructure? e.g. authorization management
• What infrastructure will manage VM creation and lifetime based on queue content? (see the sketch below)
• VM image generic or preloaded with some workload specifics?
• How to handle the storage? SRM in the cloud? (eck.) http! (rah.)
• WAN direct access particularly interesting here? Minimize cloud storage costs
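As a strawman for queue-driven VM management, a hedged sketch: poll the number of activated jobs for a cloud queue, compare it with the number of running pilot VMs, and start or terminate VMs to match. The Panda query and the EC2-compatible client are stubs (a real version would use, e.g., boto against the Eucalyptus endpoint); the queue name and scaling numbers are assumptions.

```python
# Sketch only: scale pilot VMs with Panda queue content. count_activated_jobs()
# and CloudClient are placeholders for the Panda API and an EC2-compatible
# client (e.g. boto against Eucalyptus); the numbers are illustrative.
import math
import time

JOBS_PER_VM = 1      # single-core pilot VM assumed
MAX_VMS = 50         # cost/quota cap

def count_activated_jobs(site):
    """Placeholder: query the Panda server for activated jobs at this queue."""
    return 0

class CloudClient:
    """Placeholder for an EC2-compatible API (start/list/terminate instances)."""
    def running_pilot_vms(self):
        return []
    def start_vm(self):
        pass
    def terminate_vm(self, vm):
        pass

def reconcile(site, cloud):
    """Start or retire pilot VMs so capacity tracks the activated-job count."""
    queued = count_activated_jobs(site)
    target = min(MAX_VMS, int(math.ceil(float(queued) / JOBS_PER_VM)))
    running = cloud.running_pilot_vms()
    if len(running) < target:
        for _ in range(target - len(running)):
            cloud.start_vm()
    else:
        for vm in running[target:]:
            cloud.terminate_vm(vm)   # in practice, only after the pilot drains

if __name__ == '__main__':
    cloud = CloudClient()
    while True:
        reconcile('ANALY_CLOUD_TEST', cloud)   # hypothetical queue name
        time.sleep(300)
```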

Performance/Behavior Studies
• Active program in mining the job history (with some Oracle struggle!) for studies of system performance and user behavior, e.g.
• Popularity study based on user counts – see Sergey's posted slides (a toy version of the counting is sketched below)
• Job splitting – are users over-splitting jobs and saddling themselves with inefficient throughput and long latency?
• Much more we can learn from the archival data
• Performance differentials among workload types, sites, data access variants
• User behavior to guide system optimization, user education, beneficial new features, …
• Depends on, and motivates, a robust, high-performance solution for the archival DB
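For illustration only, a toy version of a popularity count over the job archive exported to CSV (slide 2): the number of distinct users per input dataset. The column names (prodUserName, prodDBlock) and the file name are assumed for the sketch and may not match the actual export.

```python
# Sketch only: distinct-user popularity counts from an archived-job CSV export.
# Column names and file name are assumptions, not the actual schema.
import csv
from collections import defaultdict

def dataset_popularity(csv_path):
    """Return {input_dataset: number of distinct users} from the job archive."""
    users_per_dataset = defaultdict(set)
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            ds = row.get('prodDBlock')       # assumed: input dataset column
            user = row.get('prodUserName')   # assumed: user column
            if ds and user:
                users_per_dataset[ds].add(user)
    return {ds: len(users) for ds, users in users_per_dataset.items()}

if __name__ == '__main__':
    pop = dataset_popularity('jobsarchived_2010.csv')   # hypothetical export file
    for ds, n in sorted(pop.items(), key=lambda kv: kv[1], reverse=True)[:20]:
        print(n, ds)
```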

Security
• A lesson long since learned: we can't believe anything anyone tells us about glexec deployment schedules!
• Will it ever be not only broadly deployed but also performing sufficiently robustly and transparently that it'll be left on?
• What priority should we place on security measures in Panda to plug the security holes that using glexec would open?
• A limited-lifetime token exchanged between Panda server/factory/pilot to secure against fake pilots stealing the user proxies that have to be provided to pilots to support glexec (see the sketch below)
• Good security coverage on the web interface side with the new Panda monitor manpower
• Fast response to the incident last fall
• So far our only (known) hackers are our friends ;-)
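To illustrate the limited-lifetime token idea (not the actual Panda mechanism), a minimal HMAC sketch: the issuing side binds a pilot identifier to an expiry time, and the server verifies the token before handing out anything sensitive such as a user proxy. The shared secret and field layout are assumptions.

```python
# Sketch only: limited-lifetime HMAC token between server, factory and pilot.
# Shared-secret handling and the token fields are illustrative assumptions.
import hashlib
import hmac
import time

SECRET = b'shared-secret-between-server-and-factory'   # placeholder
LIFETIME = 3600   # seconds

def issue_token(pilot_id, now=None):
    """Issuing side: token = pilot_id:expiry:HMAC(pilot_id:expiry)."""
    expiry = int(now if now is not None else time.time()) + LIFETIME
    payload = '%s:%d' % (pilot_id, expiry)
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return '%s:%s' % (payload, sig)

def verify_token(token, now=None):
    """Server side: reject expired or forged tokens (e.g. from fake pilots)."""
    try:
        pilot_id, expiry, sig = token.rsplit(':', 2)
    except ValueError:
        return False
    payload = '%s:%s' % (pilot_id, expiry)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    return int(expiry) >= int(now if now is not None else time.time())
```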

Various
• LFC consolidation – move output file registration from the pilot to the Panda server, discussed yesterday
• dq2-based pilot data mover – discussed yesterday. Eager to have it!
• ActiveMQ – follow the DDMers in this? Improvements over http, especially for high-rate messaging? e.g. file-level callbacks
• Next-generation job recovery system in progress
• Adequacy of pilot testing tools? Use HammerCloud for all?
• Off-grid Tier 3s: supported (adequately?) but no usage or great apparent interest – is there demand?