Jefferson Lab Site Report
Kelvin Edwards
Thomas Jefferson National Accelerator Facility
Newport News, Virginia, USA
HEPiX – October 2004
Central Computing
– Distracted by spam problem
– Evaluated and purchased MX Logic offsite solution; filters viruses/spam before mail reaches the Lab
– Upgraded our hardware
Windows builds
– Purchased MS Enterprise Agreement
– Developed an automatic build process
– Upgrading all of our systems to Windows XP
– Still evaluating SP2; problems with CAD, etc.
File Server Storage
Adaptec 2200S RAID and Linux XFS
– Linux kernel 2.6 with Adaptec firmware (build 7244) does not work (I/O errors, etc.)
– Red Hat EL3 WS kernel works fine, but has no XFS support
– Tested ext3: performance unacceptable (20 MB/s read, 34 MB/s write)
– XFS performance: approx. 100 MB/s read/write
– Dropped back to the prior Adaptec BIOS; the 2.6 kernel then works fine
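The ext3 vs. XFS numbers above come from sequential-throughput testing. As an illustration only (the actual benchmark tool used is not stated in the report), a minimal single-stream test of this kind can be sketched in Python; real measurements would use something like bonnie++ or iozone and a file larger than RAM to defeat the page cache:

```python
import os
import time

def sequential_throughput(path, size_mb=256, block_kb=1024):
    """Rough single-stream sequential write/read test, in MB/s.

    Illustrative only: serious filesystem benchmarks control
    caching far more carefully than this sketch does.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = size_mb * 1024 // block_kb

    # Write phase: fsync at the end so the data actually hits disk.
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    write_mbps = size_mb / (time.time() - start)

    # Read phase: the page cache may still hold the file, so use a
    # file larger than RAM (or drop caches) for honest numbers.
    start = time.time()
    with open(path, "rb") as f:
        while f.read(block_kb * 1024):
            pass
    read_mbps = size_mb / (time.time() - start)

    os.remove(path)
    return write_mbps, read_mbps
```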
File Server Storage (cont.)
Purchased 2 StorageTek B280 systems
– 14 TB of disk space
– 4 Sun V210 head units
– Stable, but slow, NFS performance
– Aggregate: 6 MB/s write, 63 MB/s read
– Each node: MB/s write, 1.4 MB/s read average
File Server Storage (cont.)
Evaluating 10 TB Panasas system
– Tested 2 protocols (directFLOW and NFS)
– No directFLOW problems
– NFS finally stable at version 2.1.4c
– Good performance with either
– Aggregate: MB/s write, MB/s read
– Each node: MB/s write, MB/s read
Jasmine Changes
Jasmine is JLab’s mass storage system (disk + tape); it stores ~1 PB and can routinely move 20 TB/day.
Disk cache system recently rewritten for performance and reliability
– I/O load spread over a pool of many disk servers
– Files belong to file groups (per experiment) with quotas
– Quotas may be exceeded if there is enough disk space, allowing more flexible use of disk
– Files deleted from servers in a modified LRU fashion
– Files may be pinned until used by the batch farm
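Jasmine itself is a Java system, and its cache internals are not shown in this report; purely as an illustration of the eviction policy described above, a "modified LRU" with pinning might look like the following sketch (all names hypothetical; per-group quota bookkeeping omitted for brevity):

```python
from collections import OrderedDict

class CachePool:
    """Sketch of a disk-cache pool with modified-LRU eviction:
    least-recently-used files are evicted first, but files pinned
    for the batch farm are never evicted."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.files = OrderedDict()   # name -> (size, group), in LRU order
        self.pinned = set()

    def touch(self, name):
        self.files.move_to_end(name)  # mark as recently used

    def pin(self, name):
        self.pinned.add(name)         # protect until the farm job runs

    def unpin(self, name):
        self.pinned.discard(name)

    def add(self, name, size, group):
        # Evict unpinned files in LRU order until the new file fits.
        for victim in list(self.files):
            if self.used + size <= self.capacity:
                break
            if victim in self.pinned:
                continue              # pinned files survive eviction
            vsize, _ = self.files.pop(victim)
            self.used -= vsize
        if self.used + size > self.capacity:
            raise RuntimeError("cache is full of pinned files")
        self.files[name] = (size, group)
        self.used += size
```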
Jasmine Changes (2)
New programmatic interfaces for:
– Batch farm (Auger)
– Other services that need to move files (SRM, DAQ, LQCD disk cache)
More reliance on MySQL database; concurrency and load are challenging
Writing 9940B tapes
Experiment data rates now ~30 MB/s
Auger Changes
Auger is JLab’s batch farm management system. It uses LSF to run jobs and keeps accounting in a database for web or command-line presentation.
Users can submit thousands of jobs using a compact job description that includes file retrieval and storage.
Interfaces with Jasmine to stage files to disk before the job runs on the farm, keeping CPUs busy.
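The stage-before-dispatch idea can be sketched as follows. This is not Auger's real API or job-description syntax (which the report does not show); every name here is illustrative:

```python
def run_jobs(jobs, cache, submit):
    """Hypothetical sketch of the Auger/Jasmine interplay: stage (and
    pin) each job's input files on cache disk first, then hand the job
    to the batch system, so farm CPUs never sit idle waiting on tape."""
    for job in jobs:
        for f in job["inputs"]:
            cache.stage(f)   # recall from tape to disk if needed
            cache.pin(f)     # keep the file until the job has run
        submit(job)          # e.g. dispatch to LSF
```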
Jasmine & Auger Web Interface
– Java Server Pages
Projects
Upgrade
– Still evaluating software/hardware
Desktop systems
– Mac OS X
– Linux, Unix
– Windows
Power/cooling issues
– Reached limit of current computer room
– New Computer Center to open in Jan 2006
– Increased power requirements for 800 MHz FSB systems
– 1.3 A to 2.1 A (single CPU)
– 1.6 A to 2.8 A (dual CPU)
– Shutdown problems with non-ACPI-enabled systems