Composition and Operation of a Tier-3 cluster on the Open Science Grid


Composition and Operation of a Tier-3 Cluster on the Open Science Grid
Florida Academy of Sciences 81st Annual Conference, 9-10 March 2017
R. Wojtyla, R. Carlson, H. Blackburn, M. Hohlmann
Department of Physics and Space Sciences, Florida Institute of Technology

Definition of a Cluster
Outline: What is a computing cluster? What does our cluster do? How is our cluster put together? How do we maintain our cluster?
- A cluster is several independent computers connected together and acting as one.
- There is one head node through which all traffic flows; it routes jobs to the worker nodes, and the NASs are mounted on it.
- The worker nodes and storage bays are connected through a private, local network.
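The head node's role as the single point through which traffic flows can be illustrated with a small reachability check run from it. A minimal sketch, assuming Rocks-style compute-0-* worker hostnames (the actual names on any given cluster may differ):

```python
import subprocess

# 20 worker nodes, named per the Rocks convention (an assumption here)
workers = [f"compute-0-{i}" for i in range(20)]

for host in workers:
    # "-c 1": one ICMP echo request; "-W 2": two-second timeout (Linux ping)
    up = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                        capture_output=True).returncode == 0
    print(f"{host}: {'up' if up else 'DOWN'}")
```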

Benefits of Cluster Computing
- Scalability: computing power can be added as needed, and upgrades are easy.
- Redundancy: if one node fails, the others can carry the load; the cluster remains operational with several nodes offline, giving several points of acceptable failure.
- Flexibility: resources can be reallocated at will, and nodes can be assigned to different tasks on the fly. In our cluster, the nodes are usually reserved for computing incoming grid jobs, but are sometimes reserved for local computation.
- Reliability: components act independently of one another, and redundancy improves reliability; the cluster can stay online (with reduced performance) through almost any number of node failures, as the sketch below illustrates.
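The redundancy claim is easy to quantify. A back-of-the-envelope sketch, assuming independent node failures with an illustrative 5% per-node failure probability (not a measured value from this cluster):

```python
p_node_down = 0.05   # assumed, illustrative probability a given node is offline
n_nodes = 20

p_all_down = p_node_down ** n_nodes          # probability the whole cluster is offline
expected_up = (1 - p_node_down) * n_nodes    # nodes available on average

print(f"P(all {n_nodes} nodes down) = {p_all_down:.3g}")  # ~9.5e-27
print(f"expected nodes available  = {expected_up:.1f}")   # 19.0
```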

Open Science Grid (OSG)
- Provides computing resources to scientists working on a wide variety of projects, such as CMS (our affiliation) and ATLAS.
- Users submit their jobs to the grid, and the grid sends each job to one of the OSG sites.

Tiers of OSG Computing
- Tier 0: CERN; just one site, a large data center that services the Tier 1 sites.
- Tier 1: national centers; 7 sites (e.g., Fermilab in Illinois) that support the Tier 2 sites.
- Tier 2: regional centers; 53 sites (the closest is at the University of Florida) that support the Tier 3 sites.
- Tier 3: 64 smaller sites that perform fewer computations and are not always running. FIT is a Tier 3 site.

United States Tier 3 OSG Sites
[Map of US Tier 3 OSG sites, including the Florida Institute of Technology in Melbourne, FL] There are very many sites, though not all are functioning.

CMS Project at CERN
- CMS (Compact Muon Solenoid) is a general-purpose detector at the Large Hadron Collider (LHC).
- It has a broad field of uses, from studying the Standard Model of particle physics to searching for the particles that make up dark matter.
- CMS is a major user of the grid and of our cluster, and CMS research is done at FIT.

Providing High-Throughput Computing to All
- High-performance computing: a large amount of computing power needed for a short time (hours or days).
- High-throughput computing: large amounts of computing over long periods of time (months or years). Our cluster is high-throughput.
- Computing resources are very expensive and time-consuming to maintain, so it is not viable for individual researchers to maintain their own.
- The Open Science Grid connects researchers with computing resources.
- Our cluster can also provide computing resources to groups on campus, should they need them.

Specifications of FIT's Cluster
- Computing power: 160 2.33 GHz cores across 20 nodes, with one 8-core processor per node.
- Storage capacity: 10 TB of user storage (NAS-0) and 51 TB of data storage (NAS-1).
- LAN speed: the components of the cluster communicate at 1 Gb/s.
- Job computation: about 680 jobs per day, averaging 108 computing seconds every real-time second. Processing runs on many different cores at the same time; on average, 108 of the 160 cores are processing data simultaneously.
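A quick worked check of these numbers, using only the figures quoted on this slide:

```python
cores = 160
comp_sec_per_sec = 108     # computing seconds per real-time second
jobs_per_day = 680

utilization = comp_sec_per_sec / cores          # fraction of cores kept busy
core_hours_per_day = comp_sec_per_sec * 24      # 108 busy cores for 24 hours

print(f"average utilization: {utilization:.1%}")                        # 67.5%
print(f"core-hours per day:  {core_hours_per_day}")                     # 2592
print(f"avg core-hours/job:  {core_hours_per_day / jobs_per_day:.1f}")  # ~3.8
```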

2017 Jobs Completed
[Chart: jobs completed per day in 2017] The number of jobs per day is inconsistent. The "glow" user is the Grid Laboratory Of Wisconsin.

2017 Total Hours Completed
[Chart: total computing hours completed in 2017]

[Diagram: which software is installed on which parts of the cluster. We will look at each part in turn.]

Rocks, Nodes, & NAS
- Rocks: cluster-building software that makes the various components work as a single unit; any OS can be installed on top of it (the default is CentOS).
- Nodes: the 20 individual nodes perform the actual job computation; the CE routes jobs to them.
- NAS (network-attached storage): rather than storing data on their own drives, the nodes store data on large storage units, making the data accessible from every part of the cluster.
  - NAS-0: 10 TB in RAID 6 across 16 750 GB drives; the decrease from the raw size is due to RAID, whose two parity stripes allow two drive failures without data loss. Stores the user home directories; mounted on the CE and the compute nodes; one of the most important parts of the cluster.
  - NAS-1: 51 TB in RAID 60, i.e., 4 RAID 6 groups of 9 drives each, giving greater speed and reliability than plain RAID 6; up to two drives per group can fail without data loss. Primary storage for the cluster, holding local computational data and computer backups.
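The usable capacities follow from standard RAID 6 accounting (two parity drives per group). A minimal sketch; the 2 TB drive size for NAS-1 is an assumption inferred from the 51 TB total, not stated in the talk:

```python
def raid6_usable_bytes(n_drives, drive_bytes, n_groups=1):
    """Usable bytes of a RAID 6 array (n_groups=1) or RAID 60 (n_groups>1)."""
    data_drives = n_drives - 2 * n_groups   # two parity drives per RAID 6 group
    return data_drives * drive_bytes

TB = 10**12    # decimal terabyte, as drive vendors use
TiB = 2**40    # binary terabyte, as operating systems often report

nas0 = raid6_usable_bytes(16, 750 * 10**9)           # 16 x 750 GB, RAID 6
nas1 = raid6_usable_bytes(36, 2 * TB, n_groups=4)    # 4 RAID 6 groups of 9 drives
                                                     # (2 TB/drive is assumed)
print(f"NAS-0: {nas0 / TiB:.1f} TiB usable")   # ~9.5 TiB, quoted as ~10 TB
print(f"NAS-1: {nas1 / TiB:.1f} TiB usable")   # ~50.9 TiB, quoted as ~51 TB
```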

Compute Element (CE)
The CE manages the cluster using several different pieces of software:
- Globus Gatekeeper: listens on its port and grants access only to properly authenticated users.
- Squid: an HTTP proxy that optimizes traffic; it enhances performance by caching copies of often-requested documents, reducing bandwidth consumption.
- GUMS (Grid User Management Service): a very important part of the grid-computing aspect of our cluster; it authenticates outside users and maps them to a local account.
- Tomcat: Apache Tomcat powers web services such as GUMS.
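A client can be pointed at the site Squid so that repeated downloads are served from the cache. A minimal sketch, with a hypothetical proxy host and URL:

```python
import requests

# 3128 is Squid's default port; "squid.local" is a placeholder hostname.
proxies = {"http": "http://squid.local:3128"}

# Repeated requests for the same document are answered from the Squid cache,
# reducing outbound bandwidth.
resp = requests.get("http://example.org/conditions-data.txt",
                    proxies=proxies, timeout=30)
resp.raise_for_status()
print(resp.headers.get("X-Cache", "no cache header"))  # Squid reports HIT or MISS here
```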

Compute Element (CE), continued
- GridFTP: an extension of FTP (File Transfer Protocol) built specifically for grid computing; used to transfer data over the grid; works with BeStMan (described later).
- HTCondor: designed for high-throughput rather than high-performance computing; manages the cluster's jobs, queuing them and assigning them to nodes; integral to the operation of the cluster.
- RSV (Resource Service Validation): run by OSG; tests the cluster's various services (certificates, GridFTP, Java, job management) to ensure they are operational.
- Ganglia: monitoring software; keeps track of how many jobs run, the computing hours, and the performance of the nodes.
- CRAB (CMS Remote Analysis Builder): remote CMS jobs created by CRAB are sent to the cluster.
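HTCondor's queue can also be driven from Python via the bindings that ship with it. A minimal sketch of submitting and inspecting a job using the classic transaction-based API; the script path and arguments are hypothetical placeholders:

```python
import htcondor

sub = htcondor.Submit({
    "executable": "/home/user/analyze.sh",   # placeholder job script
    "arguments": "run42",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

schedd = htcondor.Schedd()                   # the scheduler daemon on the CE
with schedd.transaction() as txn:
    cluster_id = sub.queue(txn)              # enqueue one job
print(f"submitted as cluster {cluster_id}")

# Ask the scheduler which jobs are queued or running.
for ad in schedd.query(projection=["ClusterId", "JobStatus"]):
    print(ad["ClusterId"], ad["JobStatus"])  # JobStatus: 1 = idle, 2 = running
```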

Storage Element (SE)
An externally addressable computer, separate from the CE, that specifically manages data transfer.
- XrootD: manages access to grid file repositories; gets files from, and distributes files to, the Open Science Grid.
- PhEDEx (Physics Experiment Data Export): the transfer-management database the CMS project at CERN uses to move data; manages the movement of files between computing sites.
- BeStMan (Berkeley Storage Manager): allows communication between local storage and the rest of the grid; being phased out later this year and replaced by HDFS (Hadoop Distributed File System). Keeping software up to date is an important part of maintaining a cluster.
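Files published through XrootD can be fetched with the xrdcp client that comes with it. A minimal sketch; the redirector host and file path are hypothetical:

```python
import subprocess

src = "root://xrootd.example.org//store/user/sample.root"  # placeholder URL
dst = "/tmp/sample.root"

# xrdcp copies files over the XRootD protocol and exits non-zero on failure.
subprocess.run(["xrdcp", src, dst], check=True)
print(f"copied {src} -> {dst}")
```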

Maintenance
- Updating software: make sure all software is up to date. BeStMan, for example, is being phased out, so HDFS (Hadoop Distributed File System) will have to be installed on the SE and integrated with GridFTP. Operating system updates can bring software compatibility issues: the latest version of ANTLR does not work well with the current versions of GUMS and Tomcat, so ANTLR has to be kept at an older version.
- Monitoring: OSG tests; SAM (Site Availability Metric) tests, used by CMS, which check whether the cluster's various services and functions are available for use; and the RSV tests described earlier.
- Maintaining hardware: dealing with hard drive failures, maintaining the uninterruptible power supply, and managing power maintenance for the building in which the cluster is housed.
- Managing security: ensure file permissions are correct; ensure the access permissions for the NFS NASs are correct (without proper permissions, anyone could mount the NASs, because NFS is not very secure); maintain the firewall. A sketch of such a check appears below.
- Documentation: write down everything, as if providing instructions for someone new (a 2700-line log file since I started); write readable, well-commented scripts.
- Diagnostics website: the cluster monitoring hub, where everything mentioned above can be seen; it lets us know what fails in which part of the cluster and is crucial to effective cluster monitoring.
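A small health check of the kind a diagnostics page might run, verifying that the NAS mounts are present, not overly full, and not world-writable. A minimal sketch with hypothetical mount points and thresholds:

```python
import os
import shutil
import stat

MOUNTS = {"/mnt/nas0": 0.90, "/mnt/nas1": 0.90}   # path -> maximum fill fraction

for path, limit in MOUNTS.items():
    if not os.path.ismount(path):
        print(f"ALERT: {path} is not mounted")
        continue

    usage = shutil.disk_usage(path)
    frac = usage.used / usage.total
    status = "ALERT" if frac > limit else "ok"
    print(f"{status}: {path} is {frac:.0%} full")

    # World-writable storage would let anyone alter data; flag it.
    if os.stat(path).st_mode & stat.S_IWOTH:
        print(f"ALERT: {path} is world-writable")
```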

Questions?
Florida Academy of Sciences 81st Annual Conference, 9-10 March 2017
Florida Tech – R. Wojtyla, R. Carlson, H. Blackburn, M. Hohlmann