Compute and Storage for the Farm at JLab

Compute and Storage for the Farm at JLab
Kurt J. Strosahl, Scientific Computing, Thomas Jefferson National Accelerator Facility

Overview
- Compute nodes to process data and run simulations
  - Single- and multi-thread capable
  - Linux (CentOS)
- Disk and tape storage
  - Disk space for immediate access
  - Tape space for long-term storage
- Interactive nodes
  - Code testing
  - Review of data
- InfiniBand network
  - Provides high-speed data transfer between nodes and disk space

Compute nodes
- Run CentOS Linux (based on Red Hat)
  - CentOS 7.2 primarily (10,224 total job slots)
  - CentOS 6.5 being phased out (544 total job slots)
- Many software libraries available
  - Python
  - Java
  - GCC Fortran
- Span several generations of hardware
  - Newer systems have more memory, larger local disks, and faster processors

File Systems
- CNI provided:
  - /group - a space for groups to put software and some files, backed up by CNI
  - /home - your home directory, backed up by CNI
- /cache - write-through cache
- /volatile - acts as a scratch space
- /work - unmanaged outside of quotas / reservations

/cache
- A write-through cache (files get written to tape)
- 450 TB of total space
- Resides on the Lustre file system
- Usage is managed by software
  - Quota: a limit that can be reached when Lustre usage as a whole is low
  - Reservation: space that will always be available
  - Groups are allowed to exceed these limits if other groups are under theirs (a conceptual sketch follows below)
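The quota/reservation behavior above can be pictured roughly as follows. This is a conceptual sketch only, not the actual management software; the 80% threshold, function names, and example numbers are assumptions made for illustration.

```python
# Conceptual sketch of the /cache quota/reservation policy described above.
# Not the real management software; the threshold and names are illustrative.

def may_write(group_usage, group_quota, group_reservation,
              total_usage, total_capacity, others_under_limits):
    """Decide whether a group may keep writing to /cache."""
    if group_usage < group_reservation:
        # Reserved space is always available to the group.
        return True
    if group_usage < group_quota and total_usage / total_capacity < 0.80:
        # The quota can be reached while overall Lustre usage is low
        # (0.80 is an assumed threshold).
        return True
    # Groups may exceed their limits only if other groups are under theirs.
    return others_under_limits

# Example: a group at 40 TB with a 30 TB reservation and a 60 TB quota,
# while the file system is half full.
print(may_write(40e12, 60e12, 30e12, 225e12, 450e12, others_under_limits=True))
```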

/volatile
- 200 TB of space total
- Uses a separate set of quotas and reservations
- Files are not automatically moved to tape
- Files are deleted if they have not been accessed in six months
- When a project reaches its quota, files are deleted based on a combination of age and use (oldest, least used first)
- If Lustre usage becomes critical, files are deleted using the same policy (see the sketch below)
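A rough illustration of the cleanup ordering described above. This is not the actual cleanup tool; the data layout and six-month constant are assumptions for the sketch.

```python
# Conceptual sketch of /volatile cleanup ordering: stale files first,
# then oldest and least recently used. Not the actual cleanup tool.
import time
from dataclasses import dataclass

@dataclass
class FileInfo:
    path: str
    created: float       # creation time (epoch seconds)
    last_access: float   # last access time (epoch seconds)

SIX_MONTHS = 182 * 24 * 3600  # assumed definition of "six months"

def cleanup_candidates(files, now=None):
    """Order files for deletion: untouched for six months first,
    then oldest creation time and oldest access time."""
    now = now or time.time()
    stale = [f for f in files if now - f.last_access > SIX_MONTHS]
    rest = [f for f in files if now - f.last_access <= SIX_MONTHS]
    rest.sort(key=lambda f: (f.created, f.last_access))
    return stale + rest
```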

/work
- Exists on both the Lustre file system and independent file servers (ZFS)
- Has quotas and reservations
- Is otherwise managed by users
- Non-Lustre work space is for operations that are heavy in small I/O (e.g., code compilation), databases, and operations that require file locking
- Is not backed up automatically
- Can be backed up manually using the Jasmine commands

Lustre file system
- A clustered, parallel file system
  - Clustered: many file servers are grouped together to provide one "namespace"
  - Parallel: each file server performs its own reads and writes to clients
  - A metadata system keeps track of file locations and coordinates access
- Lustre has a quota system, but it applies to the whole Lustre file system, not to subdirectories
  - You can check this quota using: lfs quota -g <group name> (usage sketch below)
  - This quota is the sum of all quotas (volatile, cache, work) and serves as a final brake against a project overrunning Lustre
- Best performance comes from large reads and writes
- Uses ZFS to provide data integrity
  - Copy on Write (COW)
  - Data is checksummed as it is written, and that checksum is then stored for later verification
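A minimal sketch of checking that group quota from a script. It assumes the Lustre client tool `lfs` is on the PATH and that the file system is mounted at /lustre; the actual mount point and group name on the farm may differ.

```python
# Minimal sketch: query the Lustre group quota from a script.
# Assumes `lfs` is on PATH and a /lustre mount point (both assumptions).
import subprocess

def lustre_group_quota(group, mount_point="/lustre"):
    """Run `lfs quota -g <group> <mount point>` and return its output."""
    result = subprocess.run(
        ["lfs", "quota", "-g", group, mount_point],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(lustre_group_quota("myexperiment"))  # group name is an example
```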

Tape Storage
- IBM tape library, with LTO4 through LTO6 tapes
- Long-term storage
- Accessed by bringing files from tape to disk storage
- Jasmine commands request specific files to be sent to or retrieved from tape
- Can be accessed through Auger and SWIF submission scripts
- Maximum file size is 20 GB (see the pre-flight check below)
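A small sketch of checking files against the 20 GB limit before archiving them. The `jput <source...> <destination>` form shown is an assumption about the Jasmine put command, the /mss path is made up, and decimal gigabytes are assumed; check the Jasmine documentation for the exact syntax and limits.

```python
# Sketch: enforce the 20 GB tape limit before handing files to Jasmine.
# `jput` usage and the /mss destination below are assumptions for illustration.
import os
import subprocess

MAX_TAPE_FILE = 20 * 1000**3  # 20 GB maximum file size on tape (decimal GB assumed)

def put_to_tape(paths, destination="/mss/hallx/user/"):
    for path in paths:
        size = os.path.getsize(path)
        if size > MAX_TAPE_FILE:
            raise ValueError(f"{path} is {size} bytes; split it before archiving")
    subprocess.run(["jput", *paths, destination], check=True)
```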

Auger, or how you use the batch farm
- Auger provides an intelligent system to coordinate data and jobs
- Takes either plain-text or XML input files that describe:
  - What files are needed for a job
  - What resources a job is going to need: memory, number of processors (threads), on-node disk space, time needed
  - What to do with output files
- Full details of the Auger command files and the options available can be found at https://scicomp.jlab.org/docs/auger_command_files
- An illustrative submission sketch follows below
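As a rough illustration only: the sketch below writes a plain-text command file and submits it. The keyword names, the `jsub` command, and every path are assumptions made for illustration; the documentation link above is the authoritative reference for the real syntax.

```python
# Illustrative sketch of a batch submission: write a plain-text command file
# and hand it to Auger. The keywords, the `jsub` command name, and all paths
# are assumptions; see https://scicomp.jlab.org/docs/auger_command_files
# for the real syntax.
import subprocess

command_file = """\
PROJECT: myproject
TRACK: analysis
JOBNAME: example_job
COMMAND: ./run_analysis.sh
MEMORY: 2 GB
CPU: 4
INPUT_FILES: /mss/hallx/raw/run_0001.dat
OUTPUT_DATA: result.root
OUTPUT_TEMPLATE: /volatile/hallx/user/result.root
"""

with open("example.jsub", "w") as f:
    f.write(command_file)

subprocess.run(["jsub", "example.jsub"], check=True)
```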

Links
- https://scicomp.jlab.org/
  - System status
  - Where your jobs are
  - Status of tape requests
  - Important news items
- https://scicomp.jlab.org/docs/FarmUsersGuide
  - Documentation of commands available