Composition and Operation of a Tier-3 cluster on the Open Science Grid

1 Composition and Operation of a Tier-3 cluster on the Open Science Grid
Florida Academy of Sciences 81st Annual Conference, 9-10 March 2017
R. Wojtyla, R. Carlson, H. Blackburn, M. Hohlmann
Department of Physics and Space Sciences, Florida Institute of Technology

2 Definition of a Cluster
A computing cluster is several independent computers connected together and acting as one.
- One head node through which all traffic flows: it routes jobs to the worker nodes, and the NASs are mounted on it.
- Worker nodes and storage bays are connected through a private, local network.
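A minimal sketch of this layout (the node labels are illustrative, following the usual Rocks naming; they are not taken from the slides):

    [Open Science Grid / Internet]
                  |
          [Head node (CE)] ---- [NAS-0] [NAS-1]
                  |
        private local network (1 Gb/s)
                  |
    [compute-0-0] [compute-0-1] ... [compute-0-19]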

3 Benefits of Cluster Computing
- Scalability: computing power can be added as needed, and the cluster is easy to upgrade.
- Redundancy: if one node fails, the others can carry the load; the cluster can stay operational with several nodes offline, so there are several points of acceptable failure.
- Flexibility: resources can be reallocated at will, and nodes can be assigned to different tasks on the fly. In our cluster, the nodes are reserved most of the time to compute incoming grid jobs, but are sometimes reserved for local computation.
- Reliability: components act independently of one another, and the redundancy improves reliability; the cluster can stay online (with reduced performance) through almost any number of node failures.

4 Open Science Grid (OSG)
- Provides computing resources to scientists working on a wide variety of projects, such as CMS (our affiliation) and ATLAS.
- Users submit their jobs to the grid, and the grid sends each job to one of the OSG sites.

5 Tiers of OSG Computing
- Tier 0: CERN. Just one site, a large data center that services the Tier 1 sites.
- Tier 1: national centers. 7 sites, e.g., Fermilab (Illinois); they support the Tier 2 sites.
- Tier 2: regional centers. 53 sites; the closest is at the University of Florida; they support the Tier 3 sites.
- Tier 3: 64 smaller sites that perform fewer computations and are not always running. FIT is a Tier 3 site.

6 United States Tier 3 OSG Sites
Our site: Florida Institute of Technology, Melbourne, FL. There are very many Tier 3 sites, and not all of them are functioning.

7 CMS Project at CERN
- CMS = Compact Muon Solenoid, a general-purpose detector at the Large Hadron Collider (LHC).
- It has a broad field of uses, such as studying the Standard Model of particle physics and searching for the particles that make up dark matter.
- CMS is a major user of the OSG and of our cluster; CMS research is done at FIT.

8 Providing High Throughput Computing to All
- High-performance computing: a large amount of computing power needed for a short amount of time (hours/days).
- High-throughput computing: large amounts of computing over long periods of time (months/years). Our cluster is high-throughput.
- High-throughput computing is very expensive and time-consuming to maintain, so it is not viable for individual researchers to maintain their own computing resources.
- The Open Science Grid connects researchers with computing resources, and our cluster can also provide computing resources to groups on campus should they need it.

9 Specifications of FIT’s Cluster
- Computing power: 160 cores across 20 nodes (one 8-core processor per node).
- Storage capacity: 10 TB of user storage (NAS-0) and 51 TB of data storage (NAS-1).
- LAN speed: the components of the cluster communicate at 1 Gb/s.
- Job computation: about 680 jobs per day, and about 108 computing seconds for every real-time second. Processing is done on many different cores at the same time: on average, 108 of the 160 cores are processing data simultaneously.
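A quick consistency check on those averages: 108 busy cores out of 160 corresponds to 108/160 ≈ 68% average core utilization, and 680 jobs per day works out to roughly one job completing every two minutes.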

10 2017 Jobs Completed
- The number of jobs completed per day is inconsistent.
- The "glow" user is the Grid Laboratory Of Wisconsin.

11 2017 Total Hours Completed

12 How is our cluster put together?
- Which software is installed on which parts of the cluster; we will look at each part in turn.

13 Rocks, Nodes, & NAS
- Rocks: cluster-building software that makes the various components work as a single unit; any OS can be installed on top of it (the default is CentOS).
- Nodes: 20 individual nodes that do the actual job computation; the CE routes jobs to the nodes.
- NAS (Network Attached Storage): rather than storing data on their own drives, the nodes store data on large storage units, so the data is accessible from every part of the cluster.
  - NAS-0: 10 TB, RAID 6 over 16 750 GB drives (the decrease from the raw size is due to RAID; two parity stripes allow two drive failures without data loss). It stores the user home directories and is mounted on the CE and the compute nodes (see the mount sketch below); it is one of the most important parts of the cluster.
  - NAS-1: 51 TB, RAID 60 (4 RAID 6 groups of 9 drives each), which gives greater speed and reliability than RAID 6; up to 10 drives can fail without data loss. It is the primary storage for the cluster, holding local computational data and computer backups.
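As a quick capacity check for NAS-0: 16 drives x 750 GB is 12 TB raw, and RAID 6 reserves two drives' worth of space for parity, leaving about (16 - 2) x 750 GB ≈ 10.5 TB usable, consistent with the quoted 10 TB. A minimal sketch of how such a NAS export might be mounted on a node over NFS (the host name and paths are hypothetical):

    # /etc/fstab entry on the CE or a compute node (hypothetical names)
    nas-0-0:/export/home   /home   nfs   defaults   0 0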

14 Compute Element (CE)
The CE manages the cluster using several different pieces of software:
- Globus Gatekeeper: listens on its port and allows access only to properly authenticated users.
- Squid: an HTTP proxy that caches copies of often-requested documents, optimizing traffic and reducing bandwidth consumption.
- GUMS (Grid User Management Service): a very important part of the grid-computing side of the cluster; it authenticates outside grid users and maps them to a local account (see the example mapping below).
- Tomcat: Apache Tomcat powers web services such as GUMS.
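Conceptually, the mapping GUMS performs pairs a user's grid certificate identity (DN) with a local account, similar in spirit to a classic grid-mapfile entry (the DN and account name here are hypothetical):

    "/DC=org/DC=cilogon/C=US/O=Example University/CN=Jane Doe" cmsuser01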

15 Compute Element (CE), continued
- GridFTP: an extension of FTP (File Transfer Protocol) built specifically for grid computing; used to transfer data over the grid; works with BeStMan (discussed later).
- HTCondor (High-Throughput Condor): designed for high-throughput rather than high-performance computing; it manages the cluster's jobs, queuing them and assigning them to nodes, and is integral to the operation of the cluster (a minimal submit file is sketched below).
- RSV (Resource Service Validation): run by the OSG; tests the various services on the cluster (certificates, GridFTP, Java, job management) to ensure they are operational.
- Ganglia: monitoring software that keeps track of how many jobs are running, computing hours used, and the performance of the nodes.
- CRAB (CMS Remote Analysis Builder): remote CMS jobs created by CRAB are sent to the cluster.
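As an illustration of how work enters HTCondor, a minimal submit description file might look like the following (the executable and file names are hypothetical). It is submitted with condor_submit, and HTCondor queues the job and matches it to an available worker node:

    # job.sub -- minimal HTCondor submit file (hypothetical names)
    universe     = vanilla
    executable   = analyze.sh
    arguments    = input.root
    output       = job.$(Cluster).out
    error        = job.$(Cluster).err
    log          = job.log
    request_cpus = 1
    queue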

16 Storage Element (SE)
The SE is an externally addressable computer, separate from the CE, that specifically manages data transfer.
- XrootD: manages access to grid file repositories; it lets the cluster get and distribute files from the Open Science Grid (see the example command below).
- PhEDEx (Physics Experiment Data Export): the transfer-management database that the CMS project at CERN uses to move files between computing sites.
- BeStMan (Berkeley Storage Manager): allows communication between local storage and the rest of the grid; it is being phased out later this year and replaced by HDFS (Hadoop Distributed File System). Keeping software up to date is an important part of maintaining a cluster.
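For example, a user could copy a file out of the site's XrootD storage with xrdcp (the host name and file path are hypothetical):

    xrdcp root://se.example.edu//store/user/jdoe/ntuple.root /tmp/ntuple.root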

17 Maintenance
- Updating software: keep all software up to date. For example, BeStMan is being phased out, so HDFS (Hadoop Distributed File System) will have to be installed on the SE and integrated with GridFTP. Operating-system updates can cause compatibility issues; for example, the latest version of ANTLR does not work well with the current versions of GUMS and Tomcat, so ANTLR has to be kept at an older version.
- Monitoring: OSG tests, including SAM (Site Availability Metric) tests, which check whether the cluster's services and functions are available for use (used by CMS), and the RSV tests discussed earlier.
- Maintaining hardware: hard drive failures, uninterruptible power supply maintenance, and managing power maintenance for the building in which the cluster is housed.
- Managing security: ensure file permissions are correct; ensure access permissions for the NFS NASs are correct, since NFS is not very secure and without proper permissions anyone could mount the NASs (an example export restriction is sketched below); maintain the firewall.
- Documentation: write down everything, as if providing instructions for someone new (a 2700-line log file since I started); write readable scripts with many comments.
- Diagnostics website: the cluster monitoring hub where everything mentioned above can be seen; it lets us know what fails in which part of the cluster and is crucial to effective cluster monitoring.
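As an example of tightening NFS access, a NAS export can be restricted to the cluster's private subnet in /etc/exports (the export path and subnet are hypothetical):

    # /etc/exports on the NAS (hypothetical path and subnet)
    /export/home  10.1.1.0/24(rw,sync,root_squash)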

18 Questions?
Florida Academy of Sciences 81st Annual Conference, 9-10 March 2017

