General SAM Hints and Tips Adam Lyon (FNAL/CD/D0) GCAS Meeting 12/19/02 Removing Special RunsChecking Sam’s Health Cutting on Detector QualitySamRoot DB.

Slides:



Advertisements
Similar presentations
Microsoft ® Office Outlook ® 2007 Training Retrieve, back up, or share messages Sweetwater ISD presents:
Advertisements

© 2012 Entrinsik, Inc. Informer Administration Exploring the system menu and functions PRESENTER: Jason Vorenkamp| Informer Software Engineer| March 2012.
Implementing A Simple Storage Case Consider a simple case for distributed storage – I want to back up files from machine A on machine B Avoids many tricky.
Parasol Architecture A mild case of scary asynchronous system stuff.
OpenVMS System Management A different perspective by Andy Park TrueBit b.v.
Analysis and Performance Information Systems 337 Prof. Harry Plantinga.
Efficiently Sharing Common Data HTCondor Week 2015 Zach Miller Center for High Throughput Computing Department of Computer Sciences.
Backup and Recovery Part 1.
Manage your mailbox V: Retrieve, back up, or share messages Use your stored messages Whether you’re using the Personal Folders method or the Archive method.
Objectives of the Lecture :
How Do I Find a Job to Apply to?
Microsoft Windows 2003 Server. Client/Server Environment Many client computers connect to a server.
The SAM-Grid Fabric Services Gabriele Garzoglio (for the SAM-Grid team) Computing Division Fermilab.
I have attached a file to this by selecting the paperclip on the bottom of the page.
CLEO’s User Centric Data Access System Christopher D. Jones Cornell University.
Linux Operations and Administration
Copyright ®xSpring Pte Ltd, All rights reserved Versions DateVersionDescriptionAuthor May First version. Modified from Enterprise edition.NBL.
S. Veseli - SAM Project Status SAMGrid Developments – Part I Siniša Veseli CD/D0CA.
The SAMGrid Data Handling System Outline:  What Is SAMGrid?  Use Cases for SAMGrid in Run II Experiments  Current Operational Load  Stress Testing.
9 Chapter Nine Compiled Web Server Programs. 9 Chapter Objectives Learn about Common Gateway Interface (CGI) Create CGI programs that generate dynamic.
CDF data production models 1 Data production models for the CDF experiment S. Hou for the CDF data production team.
November 7, 2001Dutch Datagrid SARA 1 DØ Monte Carlo Challenge A HEP Application.
Building a distributed software environment for CDF within the ESLEA framework V. Bartsch, M. Lancaster University College London.
D0 SAM – status and needs Plagarized from: D0 Experiment SAM Project Fermilab Computing Division.
Operating Systems Lecture 2 Processes and Threads Adapted from Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard. Zhiqing Liu School of.
3rd June 2004 CDF Grid SAM:Metadata and Middleware Components Mòrag Burgon-Lyon University of Glasgow.
1 Instant Data Warehouse Utilities Extended (Again!!) 14/7/ Today I am pleased to announce the publishing of some fantastic new functionality for.
8th November 2002Tim Adye1 BaBar Grid Tim Adye Particle Physics Department Rutherford Appleton Laboratory PP Grid Team Coseners House 8 th November 2002.
- Iain Bertram R-GMA and DØ Iain Bertram RAL 13 May 2004 Thanks to Jeff Templon at Nikhef.
SamGrid– A Reality of “Grid” Computing –SamGrid– Adam Lyon (Fermilab Computing Division and DØ Experiment) GridKa School’04 September, 2004 Outline Introduction.
Computer Security! Emma Campbell, 8K VirusesHackingBackups.
The huge amount of resources available in the Grids, and the necessity to have the most up-to-date experimental software deployed in all the sites within.
Introduction to dCache Zhenping (Jane) Liu ATLAS Computing Facility, Physics Department Brookhaven National Lab 09/12 – 09/13, 2005 USATLAS Tier-1 & Tier-2.
Diagnostic Pathfinder for Instructors. Diagnostic Pathfinder Local File vs. Database Normal operations Expert operations Admin operations.
SAM Installation Lauri Loebel Carpenter and the SAM Team February
LCG Phase 2 Planning Meeting - Friday July 30th, 2004 Jean-Yves Nief CC-IN2P3, Lyon An example of a data access model in a Tier 1.
Distributed Backup And Disaster Recovery for AFS A work in progress Steve Simmons Dan Hyde University.
By Shanna Epstein IS 257 September 16, Cnet.com Provides information, tools, and advice to help customers decide what to buy and how to get the.
1 MONGODB: CH ADMIN CSSE 533 Week 4, Spring, 2015.
Elizabeth Gallas August 9, 2005 CD Support for D0 Database Projects 1 Elizabeth Gallas Fermilab Computing Division Fermilab CD Grid and Data Management.
EGEE-III INFSO-RI Enabling Grids for E-sciencE Overview of STEP09 monitoring issues Julia Andreeva, IT/GS STEP09 Postmortem.
SAM - Sequential Data Access via Metadata Schema Metadata Functionality Workshop Glasgow University April 26-28,2004.
DDM Monitoring David Cameron Pedro Salgado Ricardo Rocha.
1 LHCb on the Grid Raja Nandakumar (with contributions from Greig Cowan) ‏ GridPP21 3 rd September 2008.
AJAX 10 Most Common Mistakes. 1. Not giving immediate visual cues for clicking widgets. If something I'm clicking on is triggering Ajax actions, you have.
UTA MC Production Farm & Grid Computing Activities Jae Yu UT Arlington DØRACE Workshop Feb. 12, 2002 UTA DØMC Farm MCFARM Job control and packaging software.
1 Day 2 Logging in, Passwords, Man, talk, write. 2 Logging in Unix is a multi user system –Many people can be using it at the same time. –Connections.
PERFORMANCE AND ANALYSIS WORKFLOW ISSUES US ATLAS Distributed Facility Workshop November 2012, Santa Cruz.
RunControl status update Nicolas Lurkin School of Physics and Astronomy, University of Birmingham NA62 TDAQ Meeting – CERN, 14/10/2015.
Analysis Tools at D0 PPDG Analysis Grid Computing Project, CS 11 Caltech Meeting Lee Lueking Femilab Computing Division December 19, 2002.
November 1, 2004 ElizabethGallas -- D0 Luminosity Db 1 D0 Luminosity Database: Checklist for Production Elizabeth Gallas Fermilab Computing Division /
Internet Flow By: Terry Hernandez. Getting from the customers computer onto the internet Internet Browser
D0 File Replication PPDG SLAC File replication workshop 9/20/00 Vicky White.
Testing Infrastructure Wahid Bhimji Sam Skipsey Intro: what to test Existing testing frameworks A proposal.
Finding Data in ATLAS. May 22, 2009Jack Cranshaw (ANL)2 Starting Point Questions What is the latest reprocessing of cosmics? Are there are any AOD produced.
Introduction to the SAM System at DØ Physics 5391 July 1, 2002 Mark Sosebee U.T. Arlington.
Simulation Production System Science Advisory Committee Meeting UW-Madison March 1 st -2 nd 2007 Juan Carlos Díaz Vélez.
A Data Handling System for Modern and Future Fermilab Experiments Robert Illingworth Fermilab Scientific Computing Division.
Ch. 31 Q and A IS 333 Spring 2016 Victor Norman. SNMP, MIBs, and ASN.1 SNMP defines the protocol used to send requests and get responses. MIBs are like.
CDF SAM Deployment Status Doug Benjamin Duke University (for the CDF Data Handling Group)
Design and Implementation of a High-Performance distributed web crawler Vladislav Shkapenyuk and Torsten Suel Proc. 18 th Data Engineering Conf., pp ,
A Web Based Job Submission System for a Physics Computing Cluster David Jones IOP Particle Physics 2004 Birmingham 1.
SQL Replication for RCSQL 4.5
Cleveland SQL Saturday Catch-All or Sometimes Queries
David Adams Brookhaven National Laboratory September 28, 2006
1 VO User Team Alarm Total ALICE ATLAS CMS
Cloud based Open Source Backup/Restore Tool
UM D0RACE STATION Status Report Chunhui Han June 20, 2002
DØ MC and Data Processing on the Grid
Best Practices in Higher Education Student Data Warehousing Forum
Presentation transcript:

General SAM Hints and Tips Adam Lyon (FNAL/CD/D0) GCAS Meeting 12/19/02 Removing Special RunsChecking Sam’s Health Cutting on Detector QualitySamRoot DB Query TipsFuture tools UDP problem What to do if slow responseSam on clued0

2 A. Lyon (FNAL/D0) – 2002 Notes  I’m not a super-SAM expert  Have little experience with CAB (see Heidi’s talk)  But, I’m involved in tools development with SAM (so I go to the SAM meetings)  I know what the SAM team is worried about  Heidi will give some tips too (Parallel jobs, Resubmission)

3 A. Lyon (FNAL/D0) – 2002 Removing special runs from dataset  Problem: You want to make a dataset definition but exclude special runs  Solution: Use onl_run_type dimension. --dim=‘run_number and data_tier raw and onl_run_type Physics’ Valid values are Calibration, Cosmics, Physics, Special, Test. Note dimension queries are case-sensitive (use the capitalization above)  Confusion: There is also a run_type dimension. Valid values are: ( calibration, detector, detector data, monte carlo, online archive, other, physics data taking, run1, vlpc characterization ). Use to separate collider data from MC. Note all lowercase.

4 A. Lyon (FNAL/D0) – 2002 Cutting on detector quality  Problem: You want runs where the muon system had good quality  Solution: Use run_qual_group and run_quality dimensions together. --dim=' run_number and data_tier raw and (run_qual_group MUO and run_quality REASONABLE)‘  Unfortunately, you can’t do multiple groups (yet).

5 A. Lyon (FNAL/D0) – 2002 DB query tips  Avoid querying on the file name when you can use meta-data (use data_tier, physical_datastream_name, appl_name, run_number, … )  If you must query on the file name, never put % first (e.g. %168573%)  You lose all indexing – DB will have to scan EVERY entry (millions).  Will be horribly slow and you will bog down DB server  If you find you can’t avoid this usage, send mail to sam-admin so experts can come up with changes to the meta-data.

6 A. Lyon (FNAL/D0) – 2002 Use of __set__  To get list of files for a dataset from sam translate constraint –dim=“…”  To see the files in your snapshot (what you ran over last) do… dataset_def_name “my_data_set_name” This just retrieves the list of files in the snapshot It’s FAST! But you can’t apply additional constraints  To see if any new files would be included in the snapshot, use __set__ “my_data_set_name” This retrieves the constraints set for the dataset and runs it Can be VERY slow, especially with MC (since lots of MC parameters) SAM people are working to make this better!

7 A. Lyon (FNAL/D0) – 2002 SAM DB Schema Changes  CDF is asking for changes to the SAM DB  We have some of our own:  Separate MC and data more easily (store file type in main database table)  Faster lookup to RAW parent file  “Unofficial” file submission  Discussions still ongoing

8 A. Lyon (FNAL/D0) – 2002 UDP problem (CAB/CLUED0)  Jobs crash on CAB – a file already consumed appears to be delivered again  Some SAM communication is done with UDP packets.  When a file is delivered, your project sends a UDP acknowledge packet to the station to say it got the file.  For some reason, station never receives packet.  Station assumes file wasn’t delivered and it tries to deliver again.  But your job in fact consumed the file. When it is delivered again, your job throws an exception which is not caught by the framework.

9 A. Lyon (FNAL/D0) – 2002 UDP problem  Seen mostly on CAB  Seems to be correlated with CPU load  Increased timeout on station (d0mino station already upgraded). Stopped crashes.  Seen rarely on CLUED0  Load is much lighter  Long term solution is to move to CORBA for SAM communications  Not seen on d0mino: all communication is internal

10 A. Lyon (FNAL/D0) – 2002 What to do when response is slow  Problem: sam commands take a long time to respond  This may be a d0mino system problem. Check the d0mino performance metrics. A good day (12/17)A very bad day (12/12) A root job was writing to a full disk. Irix got confused and d0mino became bogged down. (Root/Irix problem)

11 A. Lyon (FNAL/D0) – 2002 Slow response continued  Problem: Misweb/DDE web sites are slow  Usually caused by usage of d0ora1 (production oracle DB server). Everybody talks to d0ora1.  Note that response will be slow early morning (FNAL time) due to backups and other jobs.  If CPU is maxed then DB response will certainly be slow. But DB can be slow for other reasons too.

12 A. Lyon (FNAL/D0) – 2002 Is SAM healthy?  A quick check is: > setup sam > time sam locate foo No such file: foo 1  Timing on clued0 should be around 2.5s user, 0.05s system d0mino is longer (8s user, 5s system) If the system time >> user time, something could be wrong with the system

13 A. Lyon (FNAL/D0) – 2002 SamRoot  Instead of copying root files to your area with SAM, run SAM within root (Sam-Root Serial Interface)  See  Use sam_client_api A user interface to SAM from C++/root  Caveats:  Only D0mino  Can only read files, not store  It’s version may be (slightly) different than D0RunII  Like runProject.py, but it runs within root!  Are people using this interface (Sam-team has gotten little feedback)? Please give it a try. Send experiences (good & bad) to sam-admin.

14 A. Lyon (FNAL/D0) – 2002 Sam response to problems  In the past, response by SAM shifters to problems ( to sam-admin) has been less than stellar – SAM team is trying to improve situation  Kin Yip is now SAM shift coordinator  Kin and Wyatt track mail to sam-admin and responses  Shifter training improving

15 A. Lyon (FNAL/D0) – 2002 Future user features  Working on tools for SAM project archeology  Error handling for runProject.py  Your job copies files and your disk fills up: SAM thinks you got all requested files  For now, you should be keeping track of copy or other system command errors

16 A. Lyon (FNAL/D0) – 2002 Sam on Clued0  Installing/Configuring/Testing SAM on clued0 has been a long process. Thanks to Sinisa, Lukas, Dugan, …  Clued0 officially opened for SAM business on December 6, 2002  20 stager nodes (so max 20 SAM jobs)  SAM only available through batch system  Limited to 6 simultaneous downloads from enstore/d0mino ( < 1TB/day)  No stagers in DAB due to network hubs  Clued0 SAM-admins are Adam Lyon & Jon Hays  If problems, send mail to We triage and will forward to sam-admin if we’re stumped.

17 A. Lyon (FNAL/D0) – 2002 Sam on Clued0  Turn on has been fairly smooth:  Stagers now automatically start when node reboots  Fcp daemon death halted deliveries on 12/12 – restarted and all was well  Usual batch system “features”  Make sure you change ‘ central-analysis ’ to ‘ clued0 ’ for station name  Make sure you submit your job from a NFS mountable directory  Don’t specify a destination machine – the batch system will land your job on a SAM stager node  If you kill your batch job, please also sam stop project –-project=yourName_12345

18 A. Lyon (FNAL/D0) – 2002 Sam on Clued0 Stats  From 12/6 – 12/18, 0.158TB (502 files) have been transferred to clued0 station  No one has tried storing files  We’re still small potatoes