1
DAQ
Andrea Petrucci
6 May 2008 – CMS-UCSD meeting

OUTLINE
–Introduction
–SCX Setup
–Run Control
–Current Status of the Tests
–Summary
2
Introduction
–Started commissioning the Readout Builder at its full size
–Many people are working together to get this done
–For the first time we have almost two full DAQ slices to test
–For now, tests are limited to two slices of ~640 PCs (rows A, B, E and F)
–We are still gaining experience with the installation and maintenance of a cluster of O(1000) PCs
–For the XDAQ software and Run Control it is also the first time we work with ~1000 PCs communicating with each other
3
SCX layout
RU, 320 PCs:
–Row A with 2 rails (ru-c2a[1-4]-[1-20])
–Row B with 2 rails (ru-c2b[1-4]-[1-20])
–Row E with 2 rails (ru-c2e[1-4]-[1-20])
–Row F with 4 rails (ru-c2f[1-4]-[1-20])
BU-FU, 320 PCs:
–Row A with 2 rails (ru-c2a[5-8]-[1-20])
–Row B with 2 rails (ru-c2b[5-8]-[1-20])
–Row E with 2 rails (ru-c2e[5-8]-[1-20])
–Row F with 2 rails (ru-c2f[5-8]-[1-20])
Rows A and B are connected to one Force10 switch, and rows E and F to the other.
[Figure labels: rows F, E, B, A; 2 Force10 switches]
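The host name patterns on this slide can be expanded mechanically, which is a quick way to check the PC counts. The following is a minimal sketch, assuming that each `[lo-hi]` group is an inclusive integer range; the function name `expand` is hypothetical and not part of any DAQ tool.

```python
import itertools
import re

# Matches one bracketed inclusive range such as [1-20].
RANGE = re.compile(r"\[(\d+)-(\d+)\]")

def expand(pattern):
    """Expand every [lo-hi] range in a host pattern into concrete host names."""
    # With two capture groups, split() yields: literal, lo, hi, literal, lo, hi, ..., literal
    parts = RANGE.split(pattern)
    literals = parts[0::3]
    ranges = [range(int(lo), int(hi) + 1)
              for lo, hi in zip(parts[1::3], parts[2::3])]
    names = []
    for combo in itertools.product(*ranges):
        name = literals[0]
        for value, literal in zip(combo, literals[1:]):
            name += f"{value}{literal}"
        names.append(name)
    return names

# RU hosts of rows A, B, E and F as written on the slide.
ru_hosts = [host
            for row in "abef"
            for host in expand(f"ru-c2{row}[1-4]-[1-20]")]
print(len(ru_hosts))
```

Each pattern yields 4 x 20 = 80 names, so the four rows together give the 320 RU PCs quoted on the slide.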
4
SCX Setup
Used DummyRUs and ~200 FRLs (Tracker) for testing.
Different types of trapezoidal configurations:
–1 slice with 4 rails (68 DummyRUs x 224 BUs)
–1 slice with 4 rails (200 FRLs x 68 RUs x 224 BUs x 672 FUs), close to the final slice
–4 slices with 2 rails (per slice: 32 RUs x 47 BUs x 147 FUs)
–8 slices with 2 rails (per slice: 32 RUs x 47 BUs x 147 FUs)
A lot of different activities are going on in parallel:
–System and software installation/update
–System monitoring optimization
–…
Testing the first slice:
–The XDAQ installation is XDAQ build 6
–Monitoring system (slp, sentinel, …) is enabled
During the last months the system was down many times, and it takes some time to set it up again.
5
RU Builder Slices
6
DAQ Software Installation
All DAQ software installation is managed by a central Quattor server.
Quattor is a system administration toolkit providing a powerful, portable and modular tool suite for the automated installation, configuration and management of clusters and farms running Linux.
Quattor makes it possible to re-install a PC in a few minutes.
There is a different Quattor template for each type of PC:
–RU and BUFU PCs
–Run Control PCs
–FRL and FMM PCs
–Etc…
All the DAQ software developers have put a lot of effort into Quattorizing their software (RPM).
7
DAQ Configurator
Central DAQ System:
–Currently O(1000) hosts (~10% controlling custom hardware)
–O(10000) XDAQ applications
A DAQ Configuration contains:
–One XML configuration file per XDAQ executive
  –Including Myrinet FED-Builder configuration
  –Including O(100000) I2O connections
  –Up to several 100 MB of XML
–Control structure
  –Hierarchy of function managers
  –Executives and applications to be controlled
[Figure labels: 2 x 10^7 electronics channels, 40 MHz, 100 Hz]
8
DAQ Configurator Data Flow
[Diagram components: HWCfg Database (EQSet, FBSet, DPSet); Hardware Configuration API; Software Template API; Software Template DB; RS3; RS API; CMS DAQ Configurator; SWTemplate GUI; Configurator GUI; Configurator API; JAVA Fillers]
[Diagram steps, numbered 1 to 5 in the figure: Fill DB; Manage/create Software Templates; Create FEDBuilderSets & DAQPartitionSets; Select DAQPartition (Hardware Structure) & Software Template; Load configuration and configure the system]
9
Run Control and Monitor System
RCMS is integrated in the general CMS DAQ system and provides control and monitoring of the two other components:
–The DAQ components, which manage the main data flow. They include the Front End Drivers (FED), the Readout Units (RU), the Builder Units (BU), the Filter Units (FU), and the trigger and data flow control system.
–The Detector Control System (DCS), which manages the slow controls of the whole experiment.
The XML data format and the W3C standard SOAP protocol have been adopted as the main means of communication.
XDAQ is a C++ framework for distributed data acquisition systems. It implements:
–configuration (parameterization)
–communication over multiple network technologies concurrently
–high-level provision of system services (memory management, tasks, ...)
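To illustrate what XML over SOAP looks like for a control command, here is a minimal sketch that builds and parses a SOAP 1.1 envelope carrying a single command element. The command namespace `urn:example:daq-commands` and the helper names are hypothetical; the actual XDAQ and RCMS message schemas are not shown in this talk and may differ.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
CMD_NS = "urn:example:daq-commands"  # hypothetical namespace, NOT the real XDAQ one

def build_command(command):
    """Return a SOAP envelope (as bytes) whose body holds one command element."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    ET.SubElement(envelope, f"{{{SOAP_ENV}}}Header")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    ET.SubElement(body, f"{{{CMD_NS}}}{command}")
    return ET.tostring(envelope, encoding="utf-8")

def parse_command(message):
    """Extract the local name of the first element inside the SOAP body."""
    root = ET.fromstring(message)
    body = root.find(f"{{{SOAP_ENV}}}Body")
    first = body[0]
    return first.tag.split("}", 1)[1]

message = build_command("Configure")
print(message.decode("utf-8"))
print(parse_command(message))
```

In a real system the message would be posted over HTTP to the target application; that transport step is left out here.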
10
RCMS Services
–SECURITY SERVICE: login and user account management
–RESOURCE SERVICE (RS): information about DAQ resources and partitions
–INFORMATION AND MONITOR SERVICE (IMS): collects messages and monitor data and distributes them to the subscribers
–JOB CONTROL: starts, monitors and stops the software elements of RCMS, including the DAQ components
11
Logging System
The Log Collector collects log information from log4j-compliant applications (i.e. on-line processes): RCMS applications and XDAQ applications.
–Sends log information directly to a display system (Chainsaw).
–Stores log information in a database and visualizes it (LogDBViewer).
[Diagram labels: Publish Subscriber System; Storage System; Log Collector; Relational DB (Oracle, MySQL); access via JDBC; access via TCP]
12
Function Managers Control Structure
–User interaction is through a Web Browser (GUI) connected to the Level 0 FM.
–The Level 0 FM is the entry point to the Run Control System.
–Level 1 FMs interface to the Level 0 FM and have to implement a standard set of inputs and states.
–Level 2 FMs are sub-system specific custom implementations.
–Resources are on-line system components.
[Diagram labels: Web Browser (GUI); Level 0 FM; Level 1 FM; Level 2 FM; TOP; LTC, CSC, DAQ, RPC, DT, TRK, ECAL, HCAL; FB, RB, FF; Resources: FEC, FED]
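The hierarchy described above can be pictured as a tree in which a command entering at the Level 0 FM is passed down to every child. The following is a minimal sketch under stated assumptions: the class, the state names and the placement of DAQ, TRK, ECAL, FB, RB and FF in the tree are illustrative only and are not taken from the RCMS code.

```python
class FunctionManager:
    """Toy function manager: forwards a command to its children, then changes state."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.state = "Initial"  # hypothetical state name

    def send(self, command, target_state):
        # Children are driven first, so a parent only reports the new
        # state once its whole subtree has been asked to move there.
        for child in self.children:
            child.send(command, target_state)
        self.state = target_state

    def walk(self):
        """Yield this manager and all of its descendants."""
        yield self
        for child in self.children:
            yield from child.walk()

# Illustrative tree: which manager sits at which level is an assumption.
daq = FunctionManager("DAQ", [FunctionManager("FB"),
                              FunctionManager("RB"),
                              FunctionManager("FF")])
top = FunctionManager("TOP", [daq, FunctionManager("TRK"), FunctionManager("ECAL")])

top.send("Configure", "Configured")
for fm in top.walk():
    print(fm.name, fm.state)
```

Driving the children before updating the parent is one possible design choice; error handling and asynchronous state notifications are deliberately omitted.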
13
Run Control GUIs
1) RCMS GUI
2) Function Manager Level Zero GUI
3) FED and TTS GUI
14
Tests & Measurements DAQ System
GOALS
–Understand the problems of running a big DAQ system: reliability, scalability and the monitoring system.
–Measurements: determine whether the performance of the system is acceptable.
TESTED CONFIGURATIONS
Different configurations have been tested:
A. 68 dummy RUs x 224 BUs, 4 rails from the RUs and 2 rails to the BUs.
B. 68 dummy RUs x 224 BUs x 672 FUs, 4 rails from the RUs and 2 rails to the BUs.
C. 8 slices with GTPe and ~200 FRLs, per slice: 32 RUs x 47 BUs x 147 FUs (CMSSW locally).
D. 4 slices with GTPe and ~100 FRLs, per slice: 32 RUs x 47 BUs x 147 FUs (CMSSW NFS).
Test B should perform almost the same as the final slice configuration (72 RUs x 288 BUs x 864 FUs).
For these tests I created a stand-alone Java application. It controls the Level Zero FM through the following commands:
Create, Initialize, Connect, Configure, Get Ready, Start, Stop, Destroy
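The measurement loop over these commands can be sketched as follows. This is a Python illustration, not the author's Java application: `send_command` is a hypothetical callable standing in for the call to the Level Zero FM, and the policy of skipping straight to Destroy after a failure is an assumption.

```python
import time

COMMANDS = ["Create", "Initialize", "Connect", "Configure",
            "Get Ready", "Start", "Stop", "Destroy"]

def run_loop(send_command, iterations, clock=time.perf_counter):
    """Time every command over several iterations and count failures.

    send_command(command) is a hypothetical hook; it is assumed to block
    until the command completes and to raise RuntimeError on failure.
    """
    durations = {command: [] for command in COMMANDS}
    failures = {command: 0 for command in COMMANDS}
    for _ in range(iterations):
        failed = False
        for command in COMMANDS:
            # Assumption: after a failure, only Destroy is still issued.
            if failed and command != "Destroy":
                continue
            start = clock()
            try:
                send_command(command)
            except RuntimeError:
                failures[command] += 1
                failed = True
                continue
            durations[command].append(clock() - start)
    return durations, failures

def fake_send(command):
    """Stand-in controller that always fails on Start, for demonstration."""
    if command == "Start":
        raise RuntimeError("could not reach running state")

durations, failures = run_loop(fake_send, iterations=3)
print(failures)
```

The collected durations per command are the raw input for summary rows such as MAX, MIN and AVERAGE.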
15
Test A: Only EVB
Setup parameters:
–Dummy events are created in the BUs in generation mode.
–1 slice with 1x1 FED Builders; events are dropped at the BUs.
–68 dummy RUs x 224 BUs, 4 rails from the RUs and 2 rails to the BUs.
–Used rows E and F (~320 PCs).
–Controlled 293 XDAQ executives and 585 XDAQ applications (ATCPs, EVM, RUs and BUs).
–XDAQ Monitor Application enabled.
–50 iterations of the measurement loop (Create, Initialize, Connect, Configure, Get Ready, Start, Stop and Destroy).
Time per command (seconds):
                Create Initialize    Connect  Configure  Get Ready      Start       Stop    Destroy
MAX              2.546     43.771     15.465      3.368      1.321     27.506      4.024      9.317
MIN              0.614     19.949      6.852      1.777      0.797     15.982      1.463      5.779
AVERAGE          1.419     26.361      8.153      1.978      0.991     17.168      2.108      6.842
AVEDEV           0.419      2.781      0.960      0.156      0.098      0.837      0.375      0.526
N. FAILED            0          0          0          0          0          0          0          0
Results:
–RU throughput at 16 and 32 kByte fragment size: ~480 MB/s.
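The MAX, MIN, AVERAGE and AVEDEV rows can be computed from the per-iteration timings as in the sketch below. It assumes that AVEDEV means the spreadsheet-style average absolute deviation from the mean; the slide does not define it. The sample values are made up for illustration and are not measurements from the tests.

```python
from statistics import mean

def avedev(samples):
    """Average absolute deviation from the mean (assumed meaning of AVEDEV)."""
    centre = mean(samples)
    return mean(abs(x - centre) for x in samples)

def summarize(samples):
    """Return the four summary statistics used in the result tables."""
    return {
        "MAX": max(samples),
        "MIN": min(samples),
        "AVERAGE": mean(samples),
        "AVEDEV": avedev(samples),
    }

# Made-up timings in seconds, only to show the call.
example_timings = [1.0, 2.0, 3.0, 4.0]
print(summarize(example_timings))
```

Unlike the standard deviation, this measure weights all deviations linearly, so a single slow iteration shifts it less.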
16
Test B: EVB & Filter Farms
Setup parameters:
–Dummy events are created in the BUs in generation mode.
–1 slice with 1x1 FED Builders; events are dropped at the FUs.
–68 dummy RUs x 224 BUs, 4 rails from the RUs and 2 rails to the BUs.
–3 FUs per BU and 1 Storage Manager.
–Used rows E and F (~320 PCs).
–Controlled 965 XDAQ executives and 1539 XDAQ applications (ATCPs, EVM, RUs, BUs, FUResourceBrokers and FUEventProcessors).
–All libraries were loaded from local disk.
–XDAQ Monitor Application enabled.
–100 iterations of the measurement loop (Create, Initialize, Connect, Configure, Get Ready, Start and Destroy).
Time per command (seconds):
                Create Initialize    Connect  Configure  Get Ready      Start    Destroy
MAX             48.658    132.875     17.107     17.089      2.557      Error     31.220
MIN              2.404     55.062      7.330     13.157      0.881      Error     21.949
AVERAGE          4.487     62.195     11.399     14.020      1.072      Error     24.675
AVEDEV           2.011      5.947      2.436      0.758      0.072      Error      1.452
N. FAILED            0          1          0          0          0         99          0
Results:
–Could not reach the running state because the Filter Farm applications crashed.
17
Test C: Full System with 8 Slices
Setup parameters:
–Events are generated in ~200 FRLs, using the GTPe.
–8 slices with 8x8 FED Builders; events are sent to the Storage Manager.
–2 rails from the RUs and the BUs.
–Per slice: 32 RUs x 47 BUs x 147 FUs.
–Used rows A, B, E and F (~640 PCs) for Event Builder and Filter Farm.
–Controlled 1976 XDAQ executives and 3202 XDAQ applications (ATCPs, FRLs, EVM, RUs, BUs, FUResourceBrokers, FUEventProcessors and Storage Managers).
–XDAQ Monitor Application enabled; all libraries were loaded from local disk.
–83 iterations of the measurement loop (Create, Initialize, Connect, Configure, Get Ready, Start, Stop and Destroy).
Time per command (seconds):
                Create Initialize    Connect  Configure  Get Ready      Start       Stop    Destroy
MAX              1.440     91.498     11.795     62.797      1.502     37.201     90.906     40.907
MIN              0.403     64.609      7.669     38.936      1.155     28.617     41.798     21.154
AVERAGE          0.475     71.340      8.589     42.028      1.235     31.547     46.517     25.165
AVEDEV           0.070      3.360      1.041      1.591      0.061      1.397      4.329     11.703
N. FAILED            0          0          0          3          0         10         30          0
Results:
–240 MB/s throughput all the way to the Storage Manager disk (event size 480k).
18
Test D: Full System with 4 Slices
Setup parameters:
–Events are generated in ~100 FRLs, using the GTPe.
–4 slices with 4x4 FED Builders; events are sent to the Storage Manager.
–2 rails from the RUs and the BUs.
–Per slice: 32 RUs x 47 BUs x 147 FUs.
–Used rows E and F (~320 PCs) for Event Builder and Filter Farm.
–Controlled 988 XDAQ executives and 1601 XDAQ applications (ATCPs, FRLs, EVM, RUs, BUs, FUResourceBrokers, FUEventProcessors and Storage Managers).
–XDAQ Monitor Application enabled; Filter Farm libraries were loaded from NFS.
–100 iterations of the measurement loop (Create, Initialize, Connect, Configure, Get Ready, Start, Stop and Destroy).
Time per command (seconds):
                Create Initialize    Connect  Configure  Get Ready      Start       Stop    Destroy
MAX              5.668    174.592     11.203     33.958      3.176     28.234     41.345     61.813
MIN              0.923     86.504      7.935     30.027      1.198     25.113     37.398     38.825
AVERAGE          1.463    105.880      9.596     31.030      1.415     25.928     38.423     43.572
AVEDEV           0.384     18.896      0.566      0.765      0.151      0.620      0.641      3.425
N. FAILED            0          0          0          0          0          3         28          0
Results:
–The system gets slower and less reliable if we load libraries from NFS.
19
Tests summary
Average time per command (seconds):
                                 Total     Create Initialize    Connect  Configure  Get Ready      Start
A  Only EVB (~320 PCs)          55.079      1.419     26.361      8.153      1.978      0.991     17.168
B  EVB+FF (~320 PCs)                 -      4.487     62.195     11.399     14.020      1.072          -
C  8 slices (~640 PCs)         154.739      0.475     71.340      8.589     42.028      1.235     31.547
D  4 slices NFS (~320 PCs)     175.312      1.463    105.880      9.596     31.030      1.415     25.928
Performance:
–Configuration B (close to final slice): reasonable time to initialize, connect and configure.
–Configuration C: the system scales well.
–Configuration D: the system loses performance if it loads libraries from NFS disk (~2 times slower).
20
Problems during the tests
Problems observed during the tests:
–~15% of the time the system failed to initialize. The XDAQ executive could not start because the HTTP address was already in use. The ATCP application had the same problem.
  FIXED: It was enough to set the XDAQ HTTP port outside the UNIX ephemeral port range.
–The system could not reach the running state because of a segmentation fault in the communication between BU and FUResourceBroker.
  FIXED: A bug was found and is fixed in CMSSW version 2.0.4.
–The system gets stuck in the configuring state ~5% of the time. This is reproducible only with the big system (8 slices and all rows A, B, E and F).
  Work in progress: the problem seems to be in the RunControl Framework.
–The system fails to start (~5% of the time) and to stop (~40% of the time).
  Work in progress: DAQ function managers need to be improved.
–The XDAQ monitor system has a latency of 2 to 3 minutes.
  Work in progress: XDAQ developers are working to improve it.
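The first fix relies on the fact that the kernel hands out ports from the ephemeral range to outgoing connections, so a server port inside that range may already be taken when the server starts. A minimal sketch of such a check on Linux follows; the fallback range and the example port numbers are assumptions, not values from the tests.

```python
from pathlib import Path

# Linux exposes the ephemeral range here as "low high".
PROC_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")
FALLBACK_RANGE = (32768, 60999)  # assumed default, used if the file is unreadable

def ephemeral_range():
    """Return the (low, high) ephemeral port range of this host."""
    try:
        low, high = PROC_FILE.read_text().split()
        return int(low), int(high)
    except (OSError, ValueError):
        return FALLBACK_RANGE

def is_safe_listen_port(port, port_range=None):
    """True if the port lies outside the ephemeral range (bounds inclusive)."""
    low, high = port_range or ephemeral_range()
    return not (low <= port <= high)

# Hypothetical port numbers, only to show the call.
for candidate in (1972, 40000):
    print(candidate, is_safe_listen_port(candidate))
```

A check like this could run when configurations are generated, before any executive is started.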
21
ATCP application
–Reasonable time to connect all the sockets (max. 15 sec. for 1 slice).
–Solved the “address already in use” problem when starting the listening socket.
–Created a new HyperDAQ interface:
  –Added “Standard configuration” parameters.
  –Added “debug” page.
  –Integrated into the XDAQ monitor system.
22
Summary
RU Builder Commissioning
–For the first time used an RU Builder configuration almost the same as the final slice
  –It seems to work fine at 20 kHz per slice, with a maximum throughput on the RUs of ~480 MB/s
–FUs and monitor system applications are included
–Reasonable time to initialize and start the system
–Some things are not yet understood (e.g. failures to start and stop)
Main worries are system instabilities
–Cooling and its monitoring
–Power cuts
–Quattor installation
–System configuration
–Difficulties issuing commands on many PCs at the same time