CDF Offline Operations

1 CDF Offline Operations
Status: 5.1.1c running in production:
Remote database/monitor logging turned off.
Fix in CdfMetModule.cc: check for multiple deletes. "-1 events" gone!
Fixed uninitialised variables in CprClusterMaker.cc and CprWireCollectionMaker.cc.

2 5.1.1c_maxopt
Got rid of severe error messages in PlugStripMaker.cc and PlugStripClusterMaker.cc.
Found infinite loop in KalZ3DVertexFinder.cc (Kurt and Thorsten):
    for (unsigned l3=l2+1; l3<l1; ++l3) {
      double leastdist = 1.0e10;
      int nearest = -1;
      for (unsigned int kh=0; kh< layerList[l3].size(); ++kh)
      hit3 = layerList[l3][kh];
      zsearch = hit2->z() + (hit3->r()-hit2->r())*
                (hit1->z() - hit2->z())/(hit1->r() - hit2->r());
      if(fabs(hit3->z() - zsearch)<=leastdist){
        leastdist=fabs(hit3->z() - zsearch);
        nearest=kh;
      }
All other crashes (>95%): duplicate events.

3 Hang and Crash (Bob and Beate)
0x8de1be5 in SimpleExtrapolatedTrack::helixZ (this=0xbfff9510, zCoord= )
    at /home/cdfsoft/dist/packages/ElectronObjects/V /src/SimpleExtrapolatedTrack.cc:356
while (_phi > 2.0*M_PI) { _phi -= 2.0*M_PI; }
(gdb) where
#0 0x8de1be5 in SimpleExtrapolatedTrack::helixZ (this=0xbfff9510, zCoord= )
    at /home/cdfsoft/dist/packages/ElectronObjects/V /src/SimpleExtrapolatedTrack.cc:356
#1 0x8ddef11 in SimpleExtrapolatedTrack::extrapolateZ (this=0xbfff9510, zCoord= )
    at /home/cdfsoft/dist/packages/ElectronObjects/V /src/SimpleExtrapolatedTrack.cc:204
#2 0x8d9c9db in CdfEmObject::maxPtTrack (this=0xd791d3c__T =0xbfff9ce0)
    at /home/cdfsoft/dist/packages/ElectronObjects/V0-0070/src/CdfEmObject.cc:767
(gdb) p _phi
$1 = e+28
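The hang above is consistent with floating-point behaviour: for a double with exponent e+28, subtracting 2*pi is far below the spacing between representable values, so `_phi -= 2.0*M_PI` leaves `_phi` unchanged and the loop never ends. The sketch below shows this and a constant-time alternative based on `std::fmod`. The names `wrapPhi` and `subtractionStalls` are hypothetical and not taken from SimpleExtrapolatedTrack.cc. An angle this large is almost certainly a symptom of a problem further upstream, so wrapping only removes the hang, not the cause.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from the CDF code): wraps a finite angle into
// [0, 2*pi) in constant time instead of subtracting 2*pi in a loop.
double wrapPhi(double phi) {
    const double twoPi = 2.0 * std::acos(-1.0);
    // NaN or infinity cannot be wrapped; callers should reject them earlier.
    assert(std::isfinite(phi));
    double r = std::fmod(phi, twoPi);  // r lies in (-2*pi, 2*pi)
    if (r < 0.0) r += twoPi;           // shift negative remainders up
    if (r >= twoPi) r = 0.0;           // guard against rounding up to 2*pi
    return r;
}

// Hypothetical diagnostic: true when phi - 2*pi rounds back to phi, i.e.
// when a loop of the form `while (phi > 2*pi) phi -= 2*pi;` cannot progress.
bool subtractionStalls(double phi) {
    const double twoPi = 2.0 * std::acos(-1.0);
    volatile double next = phi - twoPi;  // volatile forces rounding to double
    return next == phi;
}
```

With these definitions, `subtractionStalls(1e28)` is true while `subtractionStalls(10.0)` is false, and `wrapPhi(1e28)` returns immediately with a value in [0, 2*pi).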

4 Valgrind
Run valgrind over the other crashes:
==18449== Conditional jump or move depends on uninitialised value(s)
==18449== at 0x420A6879: __mktime_internal (in /lib/i686/libc so)
==18449== by 0x420A6EBE: timelocal (in /lib/i686/libc so)
==18449== by 0x9B0D0C1: DateUtil::time_from_string(char const *) (/home/cdfsoft/dist/packages/DBObjects/V /src/TimeStamp.cc:264)
==18449== by 0x904C794: ChipStatus::__ct(std::basic_string<char,std::char_traits<char>,std::allocator<char>>, int) (/home/cdfsoft/dist/packages/TrackingObjects/V /src/ChipStatus.cc:54)
==18449== by 0x8F94AE5: PedestalUpdator::changed(void) (/home/cdfsoft/dist/packages/SvxDaqObjects/V /src/PedestalUpdator.cc:226)
Other: (Jason)
==18449== at 0x904EFBB: ChipStatus::putBit(char *, int, int) (/home/cdfsoft/dist/packages/TrackingObjects/V /src/ChipStatus.cc:133)
==18449== by 0x904F372: ChipStatus::sortBitString(int, int, char *) (/home/cdfsoft/dist/packages/TrackingObjects/V /src/ChipStatus.cc:252)
==18449== by 0x904EC15: ChipStatus::makeMap(int) (/home/cdfsoft/dist/packages/TrackingObjects/V /src/ChipStatus.cc:212)
==18449== by 0x904C8CC: ChipStatus::__ct(std::basic_string<char,std::char_traits<char>,std::allocator<char>>, int ) (/home/cdfsoft/dist/packages/TrackingObjects/V /src/ChipStatus.cc:67)
==18449== by 0x8F94AE5: PedestalUpdator::changed(void) (/home/cdfsoft/dist/packages/SvxDaqObjects/V /src/PedestalUpdator.cc:226)
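The first trace reports an uninitialised value reaching `__mktime_internal` through `DateUtil::time_from_string`. The slide does not say what the cause was. One common way to get this report is a `struct tm` whose fields are only partly filled before the conversion call, for example with `tm_isdst` left unset. The sketch below is a hypothetical stand-in, not the DBObjects code; the function name `timeFromString` and the "YYYY-MM-DD HH:MM:SS" format are assumptions.

```cpp
#include <cassert>
#include <cstdio>
#include <ctime>

// Hypothetical stand-in for a time_from_string-style helper.
// Assumed input format: "YYYY-MM-DD HH:MM:SS" in local time.
// Returns (time_t)-1 when the text cannot be parsed.
std::time_t timeFromString(const char* text) {
    assert(text != nullptr);
    std::tm parts = {};  // value-initialise so no field is left undefined
    int year = 0;
    int month = 0;
    const int fields = std::sscanf(text, "%d-%d-%d %d:%d:%d",
                                   &year, &month, &parts.tm_mday,
                                   &parts.tm_hour, &parts.tm_min,
                                   &parts.tm_sec);
    if (fields != 6) return static_cast<std::time_t>(-1);
    parts.tm_year = year - 1900;  // struct tm counts years from 1900
    parts.tm_mon = month - 1;     // struct tm counts months from 0
    parts.tm_isdst = -1;          // let mktime decide whether DST applies
    return std::mktime(&parts);
}
```

Value-initialising the struct and setting `tm_isdst` explicitly means every field read by `mktime` has a defined value, which is what Memcheck checks for.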

5 Valgrind
Still there (1X) (Aseet)
==6977== Conditional jump or move depends on uninitialised value(s)
==6977== at 0x914484D: PadSqz::Huffman_T::operator<<( (PadSqz::BitStream_T &)) (/home/cdfsoft/dist/packages/PADSObjects/V /src/Huffman.cc:368)
==6977== by 0x9145E4C: PadSqz::PadRawBank::Fluff( (int)) (/home/cdfsoft/dist/packages/PADSObjects/V /src/PadRawBank.cc:173)
==6977== by 0x84CF42C: PadRawModule<PadSqz::COTQ>::event(EventRecord *) (/home/cdfsoft/dist/releases/5.1.1/include/PADSMods/PadRawModule.icc:57)

6 Valgrind
Valgrind error in DB
==4539== Invalid read of size 2
==4539== at 0x40705BBC: lxpe2i (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==4539== by 0x406F83A5: lxhci2h (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==4539== by 0x405E9899: ttclxr (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==4539== by 0x403A6217: OCISessionBegin (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==4539== by 0x9B1918B: otl_connect::rlogon(char const *) (/home/cdfsoft/dist/packages/DBObjects/V /src/otl/utilsOTL.cc:420)
==4539== by 0x9B14B12: OTLCon::getConnection(void) (/home/cdfsoft/dist/packages/DBObjects/V /src/otl/dbOTL.cc:328)
==4539== by 0x9AEB5FC: OTLDriverInfo::checkConnection(void) (/home/cdfsoft/dist/packages/CalibDB/V /src/OTL/OTLDriverInfo.cc:95)
==4539== by 0x97C2A39: PASSESOTL::doGet(std::basic_string<char,std::char_traits<char>,std::allocator<char>> const &, std::vector<PASSES,std::allocator<PASSES>> *&) (/home/cdfsoft/dist/releases/5.1.1/tmp/Linux2-KCC_4_0/DBViews/PASSES.OTL.cc:106)
==4539== Address 0x57AFEE62 is 2 bytes after a block of size 200 alloc'd

7 DB
Error messages
==19003== 1420 bytes in 5 blocks are still reachable in loss record 76 of 105
==19003== at 0x40166BA0: malloc (vg_clientfuncs.c:103)
==19003== by 0x4044B13F: ntpaini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==19003== by 0x4044AFEF: ntgblini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==19003== by 0x40432BEA: nsgblini (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==19003== by 0x4035A7DF: kpuatch (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==19003== by 0x403A61C7: OCIServerAttach (in /home/cdfsoft/products/oracle_client/v8_1_7_lite/Linux+2/lib/libclntsh.so.8.0)
==19003== by 0x9B18FEF: otl_connect::rlogon(char const *) (/home/cdfsoft/dist/packages/DBObjects/V /src/otl/utilsOTL.cc:367)
==19003== by 0x9B14B12: OTLCon::getConnection(void) (/home/cdfsoft/dist/packages/DBObjects/V /src/otl/dbOTL.cc:328)

8 Daily checking
A new cron job checks the log files for severe errors every hour.
Found the usual problems:
%ERLOG-s : *Fluffed bank(s) != original(s) PadRawBanks
%ERLOG-s L3 Trigger Bits not in event: no Level3Results or TL3D run = event =
%ERLOG-s ROOT/TFile:error writing to file ./JET_CALIB_18651_temp_0 (No space left on device) JET_CALIB:write failed, event not written.
%ERLOG-s CalDataMaker: unpack HATD bank : more than 8 hits in WHA (changed TDCs)
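The hourly check described above amounts to filtering log text for the severe-error marker. The sketch below illustrates that filtering step only, assuming the `%ERLOG-s` prefix seen in the messages on this slide marks a severe error. The function name `severeLines` is hypothetical, and the actual cron job is not shown in the slides.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns every line of logText that contains the
// severe-error marker. The default marker is the prefix seen on the slide.
std::vector<std::string> severeLines(const std::string& logText,
                                     const std::string& marker = "%ERLOG-s") {
    assert(!marker.empty());  // an empty marker would match every line
    std::vector<std::string> hits;
    std::istringstream in(logText);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find(marker) != std::string::npos) {
            hits.push_back(line);
        }
    }
    return hits;
}
```

A scheduled job could read each log file into a string, call `severeLines` on it, and report the matches whenever the returned vector is not empty.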

9 Memory usage

10 Nodes last week

11 Nodes today

12 Farms
Farms are running out of disk space.
Worse for Stream G (13 output streams) than for Stream C (3 output streams).

13 Farms
10 nodes hang every day.
Over 25 hung over the weekend.
Running out of disk space for concatenation.

14 Production
Statistics of reprocessing with EXE: 5.1.1_maxopt
[Table: columns To be processed, Processed last day, Today, Total; rows Stream a, Stream b, Stream c, Stream d, Stream e, Stream g, Stream h, Stream j, Total]

15 History Stream C Stream G

16 Meeting
Meeting on Monday with CDF farms.
Many ideas about the cause of the hangups (no real hint):
Power distribution
Temperature
Network
Linux kernel
Immediate solution: reboot machines automatically.
Already monitoring each node every 10 min.
Try to get fbs log files.

17 Plans
Before the end of this week:
Steve Timm's group will deploy the autoreboot for hung nodes. This will run once a day, probably at midnight, as a cron job.
Suen et al. will figure out how to increase the space available to dfarm.
Steve Timm's group has already implemented a way of saving the CDF code status when a node hangs, i.e. fbsng no longer cleans it all up before we can take a look at it. They will provide CDF with some examples so that we can try to figure out what might trigger this in the CDF software.

18 Plans
Farms history: CDF requested a list of dates when significant upgrades to the farms OS (or dfarm) were made. This list should go back to May.
CDF will try to do a statistical analysis of hangs vs. OS etc.
A hang is defined as a software failure on OSS's uptime web page information.

19 Plans
Early next week, we will add the 3 fileservers fcdfdata053,55,57 to the production farm in order to get more stable operating conditions.
The nodes need to be physically moved from FCC1 to FCC2 because of networking issues. Space and power need to be found.
The goal is to increase the chances that at least 1 copy of each file in dfarm is always accessible, even if many nodes hang.
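The goal of keeping at least one copy accessible can be put into a rough estimate. Assuming each node hangs independently with the same probability and that the copies of a file sit on distinct nodes, the chance that at least one copy is reachable is 1 - p^k for hang probability p and k copies. Both assumptions are illustrative: the slides do not give a replication factor, and hangs with a shared cause such as power or temperature would not be independent. The function name `availability` is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical estimate: probability that at least one of `copies` replicas
// is on a node that is up, assuming independent hangs with probability pHang
// per node and replicas placed on distinct nodes.
double availability(double pHang, int copies) {
    assert(pHang >= 0.0 && pHang <= 1.0 && copies >= 0);
    // All copies are unreachable with probability pHang^copies.
    return 1.0 - std::pow(pHang, copies);
}
```

Under these assumptions, a 10% per-node hang probability gives about 0.9 with one copy and about 0.99 with two, which is why placing a copy on more stable fileservers helps.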

20 Data taking
New data soon; preparing for it.
Cosmic runs processed.

