CDF Offline Operations Status: 5.1.1c running in Production:
Remote database/monitor logging turned off.
Fix in CdfMetModule.cc: check for multiple deletes. -1 Events gone!
Fixed uninitialised variables in: CprClusterMaker.cc, CprWireCollectionMaker.cc
5.1.1c_maxopt
Got rid of severe error messages in: PlugStripMaker.cc, PlugStripClusterMaker.cc
Found infinite loop in KalZ3DVertexFinder.cc (Kurt and Thorsten):

    for (unsigned l3=l2+1; l3<l1; ++l3) {
      double leastdist = 1.0e10;
      int nearest = -1;
      for (unsigned int kh=0; kh< layerList[l3].size(); ++kh)
        hit3 = layerList[l3][kh];
        zsearch = hit2->z() + (hit3->r()-hit2->r())*
                  (hit1->z() - hit2->z())/(hit1->r() - hit2->r());
        if(fabs(hit3->z() - zsearch)<=leastdist){
          leastdist=fabs(hit3->z() - zsearch);
          nearest=kh;
        }

All other crashes (>95%): duplicate events.
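The zsearch expression in the excerpt is a straight-line extrapolation in the r-z plane from two seed hits to the radius of a third layer. The following is only a minimal sketch of that search, not the KalZ3DVertexFinder.cc code: the `Hit` struct and the functions `extrapolateZ` and `nearestHit` are hypothetical names introduced here, and the guard against equal radii is an added assumption, not something the slide shows.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the hit class in the excerpt; only r and z are used.
struct Hit {
  double r;
  double z;
};

// Predict z at radius r3 on the straight line through hit1 and hit2.
// Returns false when both seed hits sit at the same radius, because the
// slope (z1 - z2)/(r1 - r2) is then undefined.
bool extrapolateZ(const Hit& hit1, const Hit& hit2, double r3, double& zsearch) {
  const double dr = hit1.r - hit2.r;
  if (dr == 0.0) return false;
  zsearch = hit2.z + (r3 - hit2.r) * (hit1.z - hit2.z) / dr;
  return true;
}

// Index of the hit in `layer` whose z is closest to the prediction,
// or -1 if the layer is empty or the prediction cannot be computed.
int nearestHit(const Hit& hit1, const Hit& hit2, const std::vector<Hit>& layer) {
  double leastdist = 1.0e10;  // same starting value as in the excerpt
  int nearest = -1;
  for (std::size_t kh = 0; kh < layer.size(); ++kh) {
    double zsearch = 0.0;
    if (!extrapolateZ(hit1, hit2, layer[kh].r, zsearch)) return -1;
    const double dist = std::fabs(layer[kh].z - zsearch);
    if (dist <= leastdist) {
      leastdist = dist;
      nearest = static_cast<int>(kh);
    }
  }
  return nearest;
}
```

Every loop in this sketch is bounded by a container size that does not change inside the loop, so it cannot spin forever.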
Hang and Crash
Bob and Beate

0x8de1be5 in SimpleExtrapolatedTrack::helixZ (this=0xbfff9510,zCoord=185.39999389648438)
    at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/SimpleExtrapolatedTrack.cc:356
356 while (_phi > 2.0*M_PI) { _phi -= 2.0*M_PI; }
(gdb) where
#0 0x8de1be5 in SimpleExtrapolatedTrack::helixZ (this=0xbfff9510, zCoord=185.39999389648438)
    at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/SimpleExtrapolatedTrack.cc:356
#1 0x8ddef11 in SimpleExtrapolatedTrack::extrapolateZ (this=0xbfff9510, zCoord=185.39999389648438)
    at /home/cdfsoft/dist/packages/ElectronObjects/V00-00-70/src/SimpleExtrapolatedTrack.cc:204
#2 0x8d9c9db in CdfEmObject::maxPtTrack (this=0xd791d3c__T165106692=0xbfff9ce0)
    at /home/cdfsoft/dist/packages/ElectronObjects/V0-0070/src/CdfEmObject.cc:767
(gdb) p _phi
$1 = 6.7514747645567823e+28
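The printed value of _phi suggests why the loop at line 356 never ends: near 6.75e+28 adjacent doubles are roughly 1e13 apart, so subtracting 2π leaves the value unchanged. A minimal sketch of a wrap that cannot hang is shown below. The function names `wrapPhi` and `stepIsLost` are hypothetical, and this is not the fix actually applied in SimpleExtrapolatedTrack.cc. It also does not explain how _phi became that large in the first place.

```cpp
#include <cmath>

// Value of 2*pi as a double; written out so the sketch does not rely on M_PI.
const double kTwoPi = 2.0 * 3.14159265358979323846;

// True when subtracting 2*pi from phi does not change phi at all, which is
// the condition under which "while (phi > 2*pi) phi -= 2*pi;" spins forever.
bool stepIsLost(double phi) {
  return phi - kTwoPi == phi;
}

// Wrap phi into [0, 2*pi) in constant time using fmod.
// Returns false (and leaves `wrapped` untouched) for NaN or infinity, so the
// caller can decide how to treat a track with a meaningless angle.
bool wrapPhi(double phi, double& wrapped) {
  if (!std::isfinite(phi)) return false;
  double w = std::fmod(phi, kTwoPi);  // result lies in (-2*pi, 2*pi)
  if (w < 0.0) w += kTwoPi;           // shift negative remainders up
  if (w >= kTwoPi) w = 0.0;           // rounding can land exactly on 2*pi
  wrapped = w;
  return true;
}
```

For an angle this large the wrapped result carries no physical information, so rejecting out-of-range values upstream may be preferable to wrapping them silently.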
Valgrind
Run valgrind over the other crashes:

Other: (Jason)
==18449== Conditional jump or move depends on uninitialised value(s)
==18449==    at 0x420A6879: __mktime_internal (in /lib/i686/libc-2.2.5.so)
==18449==    by 0x420A6EBE: timelocal (in /lib/i686/libc-2.2.5.so)
==18449==    by 0x9B0D0C1: DateUtil::time_from_string(char const *) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/TimeStamp.cc:264)
==18449==    by 0x904C794: ChipStatus::__ct(std::basic_string<char,std::char_traits<char>,std::allocator<char>>, int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:54)
==18449==    by 0x8F94AE5: PedestalUpdator::changed(void) (/home/cdfsoft/dist/packages/SvxDaqObjects/V00-0074/src/PedestalUpdator.cc:226)

Other: (Jason)
==18449==    at 0x904EFBB: ChipStatus::putBit(char *, int, int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:133)
==18449==    by 0x904F372: ChipStatus::sortBitString(int, int, char *) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:252)
==18449==    by 0x904EC15: ChipStatus::makeMap(int) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:212)
==18449==    by 0x904C8CC: ChipStatus::__ct(std::basic_string<char,std::char_traits<char>,std::allocator<char>>, int ) (/home/cdfsoft/dist/packages/TrackingObjects/V00-01-73/src/ChipStatus.cc:67)
==18449==    by 0x8F94AE5: PedestalUpdator::changed(void) (/home/cdfsoft/dist/packages/SvxDaqObjects/V00-00-74/src/PedestalUpdator.cc:226)
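The first trace above reaches __mktime_internal from DateUtil::time_from_string. One common way to trigger this kind of report is to pass mktime a struct tm in which some field (often tm_isdst) was never set. Whether that is what happens at TimeStamp.cc:264 is not visible from the trace. The sketch below shows the usual defensive pattern under that assumption. The function `timeFromString` and the "YYYY-MM-DD HH:MM:SS" input format are hypothetical and are not taken from the CDF code.

```cpp
#include <cstdio>
#include <ctime>

// Hypothetical helper: parse "YYYY-MM-DD HH:MM:SS" as local time.
// Returns (time_t)-1 if the text is null or does not match the format.
std::time_t timeFromString(const char* text) {
  int year = 0, month = 0, day = 0, hour = 0, minute = 0, second = 0;
  if (text == nullptr ||
      std::sscanf(text, "%d-%d-%d %d:%d:%d",
                  &year, &month, &day, &hour, &minute, &second) != 6) {
    return static_cast<std::time_t>(-1);
  }

  std::tm parts = std::tm();  // value-initialise: every field starts at zero
  parts.tm_year = year - 1900;  // struct tm counts years from 1900
  parts.tm_mon = month - 1;     // and months from 0
  parts.tm_mday = day;
  parts.tm_hour = hour;
  parts.tm_min = minute;
  parts.tm_sec = second;
  parts.tm_isdst = -1;  // ask mktime to work out daylight saving itself

  return std::mktime(&parts);
}
```

With every field written before the call, mktime has no uninitialised input left to branch on.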
Valgrind
Still there (1X) (Aseet)
==6977== Conditional jump or move depends on uninitialised value(s)
==6977==    at 0x914484D: PadSqz::Huffman_T::operator<<( (PadSqz::BitStream_T &)) (/home/cdfsoft/dist/packages/PADSObjects/V00-00-23/src/Huffman.cc:368)
==6977==    by 0x9145E4C: PadSqz::PadRawBank::Fluff( (int)) (/home/cdfsoft/dist/packages/PADSObjects/V00-00-23/src/PadRawBank.cc:173)
==6977==    by 0x84CF42C: PadRawModule<PadSqz::COTQ>::event(EventRecord *) (/home/cdfsoft/dist/releases/5.1.1/include/PADSMods/PadRawModule.icc:57)
Valgrind
Valgrind error in DB
==4539== Invalid read of size 2
==4539==    at 0x40705BBC: lxpe2i (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==4539==    by 0x406F83A5: lxhci2h (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==4539==    by 0x405E9899: ttclxr (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==4539==    by 0x403A6217: OCISessionBegin (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==4539==    by 0x9B1918B: otl_connect::rlogon(char const *) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/utilsOTL.cc:420)
==4539==    by 0x9B14B12: OTLCon::getConnection(void) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/dbOTL.cc:328)
==4539==    by 0x9AEB5FC: OTLDriverInfo::checkConnection(void) (/home/cdfsoft/dist/packages/CalibDB/V00-00-85/src/OTL/OTLDriverInfo.cc:95)
==4539==    by 0x97C2A39: PASSESOTL::doGet(std::basic_string<char,std::char_traits<char>,std::allocator<char>> const &, std::vector<PASSES,std::allocator<PASSES>> *&) (/home/cdfsoft/dist/releases/5.1.1/tmp/Linux2-KCC_4_0/DBViews/PASSES.OTL.cc:106)
==4539== Address 0x57AFEE62 is 2 bytes after a block of size 200 alloc'd
DB Error messages
==19003== 1420 bytes in 5 blocks are still reachable in loss record 76 of 105
==19003==    at 0x40166BA0: malloc (vg_clientfuncs.c:103)
==19003==    by 0x4044B13F: ntpaini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.o.8.0)
==19003==    by 0x4044AFEF: ntgblini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==19003==    by 0x40432BEA: nsgblini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.o.8.0)
==19003==    by 0x4035A7DF: kpuatch (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.o.8.0)
==19003==    by 0x403A61C7: OCIServerAttach (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==19003==    by 0x9B18FEF: otl_connect::rlogon(char const *) (/home/cdfsoft/dist/packages/DBObjects/V00-0072/src/otl/utilsOTL.cc:367)
==19003==    by 0x9B14B12: OTLCon::getConnection(void) (/home/cdfsoft/dist/packages/DBObjects/V00-00-72/src/otl/dbOTL.cc:328)
Daily checking
A new cron job checks the log files for severe errors every hour. Found the usual problems:
%ERLOG-s : *Fluffed bank(s) != original(s) PadRawBanks
%ERLOG-s L3 Trigger Bits not in event: no Level3Results or TL3D run = 159288 event = 1033557
%ERLOG-s ROOT/TFile:error writing to file ./JET_CALIB_18651_temp_0 (No space left on device) JET_CALIB:write failed, event not written.
%ERLOG-s CalDataMaker: unpack HATD bank : more than 8 hits in WHA (changed TDCs)
Memory usage
Nodes last week
Nodes today
Farms
Farms are running out of disk space.
Worse for Stream G (13 output streams) than for Stream C (3 output streams).
Farms
10 nodes hang every day; over 25 hung over the weekend.
Running out of disk space for concatenation.
Production
Statistics of reprocessing with EXE: 5.1.1_maxopt
====================================================
           To be processed  processed last day     today      total
Stream a          20521173                   0         0          0
Stream b          80915268                   0         0          0
Stream c          57487182                   0         0   57180498
Stream d          35100306                   0         0          0
Stream e          67452861                   0         0          0
Stream g         101170413             4674100   1813007   78111329
Stream h         155508683                   0         0          0
Stream j          70459709                   0         0          0
---------------------------------------------------------------------
Total :          588615595             4674100   1813007  135291827
History Stream C Stream G
Meeting
Meeting on Monday with CDF farms.
Many ideas about the cause of the hangups (no real hint): power distribution, temperature, network, Linux kernel, …
Immediate solution: reboot machines automatically.
Already monitoring each node every 10 min.
Try to get fbs log files.
Plans
Before the end of this week:
Steve Timm's group will deploy the autoreboot for hung nodes. This will run once a day, probably at midnight, as a cron job.
Suen et al. will figure out how to increase the space available to dfarm.
Steve Timm's group has already implemented a way of saving the CDF code status when a node hangs, i.e. fbsng no longer cleans it all up before we can take a look at it. They will provide CDF with some examples so that we can try to figure out what might trigger this in the CDF software.
Plans
Farms history: CDF requested a list of dates when significant upgrades to the farms OS (or dfarm) were made. This list should go back to May 2003.
CDF will try to do a statistical analysis of hangs vs OS etc. A hang is defined as a software failure as reported on OSS's uptime web page.
Plans
Early next week, we will add the 3 fileservers fcdfdata053,55,57 to the production farm in order to get more stable operating conditions.
The nodes need to be physically moved from FCC1 to FCC2 because of networking issues. Space and power need to be found.
The goal of this is to increase the chances that at least 1 copy of each file in dfarm is always accessible, even if many nodes hang.
Data taking Soon new data. Preparing for it. Cosmic runs processed.