CDF Offline Operations


Status
5.1.1c running in production:
- Remote database/monitor logging turned off.
- Fix in CdfMetModule.cc: check for multiple deletes. The -1 events are gone!
- Fixed uninitialised variables in CprClusterMaker.cc and CprWireCollectionMaker.cc.
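The two kinds of fix listed above (multiple deletes, uninitialised variables) can be illustrated with a minimal sketch. It is written in modern C++ (C++14) and is not the code in CdfMetModule.cc or the Cpr makers; `Cluster` and `ClusterHolder` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a reconstructed object; not a CDF class.
struct Cluster {
    // In-class initialisers give every member a defined value even if a
    // constructor forgets to set it.
    double energy = 0.0;
    int    nWires = 0;
};

// Hypothetical owner. std::unique_ptr frees the object exactly once.
class ClusterHolder {
public:
    void make() { cluster_ = std::make_unique<Cluster>(); }

    // reset() deletes the object and nulls the pointer, so calling
    // clear() twice is harmless: there is no double delete.
    void clear() { cluster_.reset(); }

    bool empty() const { return cluster_ == nullptr; }
    const Cluster* get() const { return cluster_.get(); }

private:
    std::unique_ptr<Cluster> cluster_;
};
```

With raw pointers the equivalent guard is to set the pointer to null immediately after `delete`, since deleting a null pointer is a no-op.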

5.1.1c_maxopt
- Got rid of severe error messages in PlugStripMaker.cc and PlugStripClusterMaker.cc.
- Found infinite loop in KalZ3DVertexFinder.cc → (Kurt and Thorsten):

    for (unsigned l3=l2+1; l3<l1; ++l3) {
      double leastdist = 1.0e10;
      int nearest = -1;
      for (unsigned int kh=0; kh<layerList[l3].size(); ++kh) {
        hit3 = layerList[l3][kh];
        zsearch = hit2->z() + (hit3->r() - hit2->r()) *
                  (hit1->z() - hit2->z()) / (hit1->r() - hit2->r());
        if (fabs(hit3->z() - zsearch) <= leastdist) {
          leastdist = fabs(hit3->z() - zsearch);
          nearest = kh;
        }
      }
    }

- All other crashes (>95%): duplicate events.

Hang and Crash → Bob and Beate

    0x8de1be5 in SimpleExtrapolatedTrack::helixZ (this=0xbfff9510, zCoord=185.39999389648438)
        at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/SimpleExtrapolatedTrack.cc:356
    356     while (_phi > 2.0*M_PI) { _phi -= 2.0*M_PI; }
    (gdb) where
    #0  0x8de1be5 in SimpleExtrapolatedTrack::helixZ (this=0xbfff9510, zCoord=185.39999389648438)
        at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/SimpleExtrapolatedTrack.cc:356
    #1  0x8ddef11 in SimpleExtrapolatedTrack::extrapolateZ (this=0xbfff9510, zCoord=185.39999389648438)
        at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/SimpleExtrapolatedTrack.cc:204
    #2  0x8d9c9db in CdfEmObject::maxPtTrack (this=0xd791d3c, __T165106692=0xbfff9ce0)
        at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/CdfEmObject.cc:767
    (gdb) p _phi
    $1 = 6.7514747645567823e+28
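Why the loop at line 356 hangs: near 6.75e28 adjacent doubles are about 8.8e12 apart, so subtracting 2π leaves `_phi` unchanged and the `while` condition never becomes false. The sketch below shows one possible guard, assuming the intent is to map the angle into [0, 2π). `wrapPhi` and `kTwoPi` are hypothetical names, not part of SimpleExtrapolatedTrack.cc. The guard only removes the hang; it does not explain how `_phi` became so large in the first place.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical constant; std::acos(-1.0) is pi.
const double kTwoPi = 2.0 * std::acos(-1.0);

// Hypothetical helper, not the CDF implementation.
// Maps phi into [0, 2*pi) in a bounded number of steps. Returns false and
// leaves phi untouched when phi is NaN or infinite, so the caller can
// reject the track instead of looping.
bool wrapPhi(double& phi) {
    if (!std::isfinite(phi)) return false;
    double wrapped = std::fmod(phi, kTwoPi);   // result lies in (-2*pi, 2*pi)
    if (wrapped < 0.0) wrapped += kTwoPi;      // shift negative angles up
    if (wrapped >= kTwoPi) wrapped = 0.0;      // guard against rounding up to 2*pi
    phi = wrapped;
    return true;
}
```

`std::fmod` does the reduction in one call regardless of magnitude, whereas the repeated subtraction needs on the order of `phi / (2π)` iterations even when it does make progress.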

Valgrind
Run valgrind over the other crashes:

Other: (Jason)

    ==18449== Conditional jump or move depends on uninitialised value(s)
    ==18449==    at 0x420A6879: __mktime_internal (in /lib/i686/libc-2.2.5.so)
    ==18449==    by 0x420A6EBE: timelocal (in /lib/i686/libc-2.2.5.so)
    ==18449==    by 0x9B0D0C1: DateUtil::time_from_string(char const *) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/TimeStamp.cc:264)
    ==18449==    by 0x904C794: ChipStatus::__ct(std::basic_string<char,std::char_traits<char>,std::allocator<char>>, int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:54)
    ==18449==    by 0x8F94AE5: PedestalUpdator::changed(void) (/home/cdfsoft/dist/packages/SvxDaqObjects/V00-00-74/src/PedestalUpdator.cc:226)

Other: (Jason)

    ==18449==    at 0x904EFBB: ChipStatus::putBit(char *, int, int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:133)
    ==18449==    by 0x904F372: ChipStatus::sortBitString(int, int, char *) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:252)
    ==18449==    by 0x904EC15: ChipStatus::makeMap(int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:212)
    ==18449==    by 0x904C8CC: ChipStatus::__ct(std::basic_string<char,std::char_traits<char>,std::allocator<char>>, int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:67)
    ==18449==    by 0x8F94AE5: PedestalUpdator::changed(void) (/home/cdfsoft/dist/packages/SvxDaqObjects/V00-00-74/src/PedestalUpdator.cc:226)

Valgrind
Still there (1X) (Aseet)

    ==6977== Conditional jump or move depends on uninitialised value(s)
    ==6977==    at 0x914484D: PadSqz::Huffman_T::operator<<( (PadSqz::BitStream_T &)) (/home/cdfsoft/dist/packages/PADSObjects/V00-00-23/src/Huffman.cc:368)
    ==6977==    by 0x9145E4C: PadSqz::PadRawBank::Fluff( (int)) (/home/cdfsoft/dist/packages/PADSObjects/V00-00-23/src/PadRawBank.cc:173)
    ==6977==    by 0x84CF42C: PadRawModule<PadSqz::COTQ>::event(EventRecord *) (/home/cdfsoft/dist/releases/5.1.1/include/PADSMods/PadRawModule.icc:57)

Valgrind
Valgrind error in DB

    ==4539== Invalid read of size 2
    ==4539==    at 0x40705BBC: lxpe2i (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==4539==    by 0x406F83A5: lxhci2h (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==4539==    by 0x405E9899: ttclxr (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==4539==    by 0x403A6217: OCISessionBegin (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==4539==    by 0x9B1918B: otl_connect::rlogon(char const *) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/utilsOTL.cc:420)
    ==4539==    by 0x9B14B12: OTLCon::getConnection(void) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/dbOTL.cc:328)
    ==4539==    by 0x9AEB5FC: OTLDriverInfo::checkConnection(void) (/home/cdfsoft/dist/packages/CalibDB/V00-00-85/src/OTL/OTLDriverInfo.cc:95)
    ==4539==    by 0x97C2A39: PASSESOTL::doGet(std::basic_string<char,std::char_traits<char>,std::allocator<char>> const &, std::vector<PASSES,std::allocator<PASSES>> *&) (/home/cdfsoft/dist/releases/5.1.1/tmp/Linux2-KCC_4_0/DBViews/PASSES.OTL.cc:106)
    ==4539== Address 0x57AFEE62 is 2 bytes after a block of size 200 alloc'd

DB Error messages

    ==19003== 1420 bytes in 5 blocks are still reachable in loss record 76 of 105
    ==19003==    at 0x40166BA0: malloc (vg_clientfuncs.c:103)
    ==19003==    by 0x4044B13F: ntpaini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==19003==    by 0x4044AFEF: ntgblini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==19003==    by 0x40432BEA: nsgblini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==19003==    by 0x4035A7DF: kpuatch (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==19003==    by 0x403A61C7: OCIServerAttach (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
    ==19003==    by 0x9B18FEF: otl_connect::rlogon(char const *) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/utilsOTL.cc:367)
    ==19003==    by 0x9B14B12: OTLCon::getConnection(void) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/dbOTL.cc:328)

Daily checking
New cron job → checks the log files for severe errors every hour.
Found the usual problems:

    %ERLOG-s : *Fluffed bank(s) != original(s) PadRawBanks
    %ERLOG-s L3 Trigger Bits not in event: no Level3Results or TL3D run = 159288 event = 1033557
    %ERLOG-s ROOT/TFile:error writing to file ./JET_CALIB_18651_temp_0 (No space left on device) JET_CALIB:write failed, event not written.
    %ERLOG-s CalDataMaker: unpack HATD bank : more than 8 hits in WHA (changed TDCs)
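A minimal sketch of the kind of scan such an hourly job could run: count the lines that carry the severe-error tag. The tag `%ERLOG-s` is taken from the messages above; the function name `countSevere` and the choice to count whole lines are assumptions, not a description of the actual cron job.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the number of lines in `log` that contain
// the severe-error tag. Works on any input stream (file or in-memory).
std::size_t countSevere(std::istream& log,
                        const std::string& tag = "%ERLOG-s") {
    std::size_t n = 0;
    std::string line;
    while (std::getline(log, line)) {
        if (line.find(tag) != std::string::npos) ++n;
    }
    return n;
}
```

Taking a `std::istream&` rather than a file name keeps the function usable on a `std::ifstream` opened on a production log as well as on a string in a unit test.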

Memory usage

Nodes last week

Nodes today

Farms
Farms are running out of disk space. This is worse for Stream G (13 output streams) than for Stream C (3 output streams).

Farms
- 10 nodes hang every day.
- Over 25 hung over the weekend.
- Running out of disk space for concatenation.

Production
Statistics of reprocessing with EXE: 5.1.1_maxopt
====================================================================================
Stream      To be processed   Processed last day   Processed today   Processed total
Stream a           20521173                    0                 0                 0
Stream b           80915268                    0                 0                 0
Stream c           57487182                    0                 0          57180498
Stream d           35100306                    0                 0                 0
Stream e           67452861                    0                 0                 0
Stream g          101170413              4674100           1813007          78111329
Stream h          155508683                    0                 0                 0
Stream j           70459709                    0                 0                 0
------------------------------------------------------------------------------------
Total             588615595              4674100           1813007         135291827
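The reprocessing numbers can be turned into a completion fraction and a rough time-to-finish. This is only a sketch: the function names are hypothetical, and the estimate assumes a constant daily rate, which the reprocessing statistics do not guarantee.

```cpp
#include <cassert>
#include <cmath>

// Fraction of events already reprocessed, in the range [0, 1].
double fractionDone(long long toProcess, long long processed) {
    return toProcess > 0
        ? static_cast<double>(processed) / static_cast<double>(toProcess)
        : 0.0;
}

// Days still needed at a constant rate of `perDay` events (an assumption).
// Returns -1 when the rate is zero or negative, i.e. no estimate possible.
double daysRemaining(long long toProcess, long long processed, long long perDay) {
    if (perDay <= 0) return -1.0;
    return static_cast<double>(toProcess - processed) / static_cast<double>(perDay);
}
```

With the reported totals, `fractionDone(588615595, 135291827)` is about 0.23, Stream g alone is about 77% done, and at the last-day rate of 4674100 events `daysRemaining(101170413, 78111329, 4674100)` gives roughly 5 days for the rest of Stream g.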

History Stream C Stream G

Meeting
Meeting on Monday with CDF farms.
Many ideas about the cause of the hangups (no real hint):
- Power distribution
- Temperature
- Network
- Linux kernel
- …
Immediate solution: reboot machines automatically.
Already monitoring each node every 10 min.
Try to get fbs log files.

Plans
Before the end of this week:
- Steve Timm's group will deploy the autoreboot for hung nodes. This will run once a day, probably at midnight, as a cron job.
- Suen et al. will figure out how to increase the space available to dfarm.
- Steve Timm's group has already implemented a way of saving the CDF code status when a node hangs, i.e. fbsng no longer cleans it all up before we can take a look at it. They will provide CDF with some examples so that we can try to figure out what might trigger this in the CDF software.

Plans
Farms history: CDF requested a list of dates when significant upgrades to the farms OS (or dfarm) were made. This list should go back to May 2003. CDF will try to do a statistical analysis of hangs vs. OS etc. A hang is defined as a software failure as reported on OSS's uptime web page.

Plans
Early next week, we will add the 3 fileservers fcdfdata053,55,57 to the production farm in order to get more stable operating conditions. The nodes need to be physically moved from FCC1 to FCC2 because of networking issues. Space and power need to be found. The goal is to increase the chances that at least 1 copy of each file in dfarm is always accessible, even if many nodes hang.
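How much extra copies help can be estimated with a simple model. The sketch below assumes each file's copies sit on distinct nodes chosen uniformly at random; that is an illustrative assumption, not a description of how dfarm places files. Under it, the chance that all k copies land on hung nodes is (h/N)·((h-1)/(N-1))·…·((h-k+1)/(N-k+1)) for h hung nodes out of N. The name `probAccessible` is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Probability that at least one of `copies` replicas is on a node that is
// still up, given `hung` hung nodes out of `nodes`.
// Model assumption: replicas on distinct, uniformly random nodes.
// Returns 0.0 for invalid input.
double probAccessible(int nodes, int hung, int copies) {
    if (nodes <= 0 || copies <= 0 || copies > nodes ||
        hung < 0 || hung > nodes) {
        return 0.0;
    }
    if (hung < copies) return 1.0;   // too few hung nodes to hold every copy
    double allDown = 1.0;
    for (int i = 0; i < copies; ++i) {
        allDown *= static_cast<double>(hung - i) / static_cast<double>(nodes - i);
    }
    return 1.0 - allDown;
}
```

As an example with made-up numbers: with 25 of 100 nodes hung, a single copy is reachable with probability 0.75, while two copies raise that to about 0.94.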

Data taking
New data soon; preparing for it. Cosmic runs processed.