Utilizing Condor and HTC to address archiving online courses at Clemson on a weekly basis
Sam Hoover
Project Blackbird
Computing, Systems, and Operations, Clemson Computing and Information Technology

Project Blackbird 3
Blackboard at Clemson
- End-of-semester archives of all online courses in Blackboard since implementation
- GB Oracle DB tied to a 1.2 TB Content system with over 13 million files
- Spring 2010: 4,610 active Blackboard courses; 31,372 total courses in Blackboard
- Full system backups once a week; nightly incremental backups of the entire system

Condor at Clemson 4
Clemson has deployed a Condor pool consisting of Windows Vista machines in the public computer labs and several other groups of machines (Linux, Solaris, etc.). These machines are available to Clemson faculty, students, and staff with high-throughput computing needs. Users can create their own Condor submit machines by downloading the appropriate software, and can even contribute their own idle cycles to the pool.
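For example, standard HTCondor commands let a user inspect what the pool currently offers (a sketch; the exact output format varies by Condor version):

# Summarize all slots currently advertised in the pool
condor_status -total

# List only Linux slots with more than 1 GB of memory
condor_status -constraint 'OpSys == "LINUX" && Memory > 1024'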

Condor at Clemson 5
The Palmetto Cluster is a dedicated Linux cluster of 1,111 nodes; each node has 8 cores and GB of RAM. Nodes are sold as "Condos," so that each owner gets a guaranteed slice of time every week based on the number of nodes they own. Clemson Condor users get time on the system when it is not in use by a Condo owner. We also share cycles via OSG, as the lowest-priority user on the system.

Blackbird Archive 7
- Blackboard provides a script for executing batch archives given a list of courses as input.
- The weekly archive process at Clemson began in Fall 2006, after an accidental deletion of many courses.
- We started by splitting the course list into four equal chunks and giving each server 1/4 of the total course list.
- All four servers usually finished within 2 hours of each other; total time for the batch was < 24 hours.
- By Fall 2008, archiving the active courses took 85.5 hours, and the servers finished at widely varying times.

Blackbird Archive 8
/usr/local/blackboard/apps/content-exchange/bin/batch_ImportExport.sh
Archive/Restore: the Archive Course function creates a record of the course, including user interactions. It is most useful for recalling student performance or interactions at a later time. The archive package is saved as a .ZIP file that can be restored to the Blackboard system at another time. In effect, Archive/Restore acts as a backup tool at the individual-course level.
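A minimal sketch of how a batch invocation might look; the -f (course-list file) and -t (operation type) flags are assumptions about Blackboard's batch tool, so verify them against the documentation for your Blackboard version:

# Archive every course named in the list file (flags assumed, not verified)
/usr/local/blackboard/apps/content-exchange/bin/batch_ImportExport.sh \
  -f /usr/local/CMSIntegration/files/courses.txt -t archive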

Project Blackbird 9 (figure only)

Project Blackbird 10 (figure only)

Project Blackbird 11 (figure only)

Multiple servers, but 3 cores idle 12 (figure only)

Multiple servers, all cores in use 13 (figure only)

Project Blackbird 14 (figure only)

Project Blackbird 15 (figure only)

Project Blackbird 16
Steps in the weekly archive process:
1. Determine what to archive (active courses, orgs)
2. Build a course list
3. Create Blackbird submit files
4. Submit the DAGMan job to Condor
5. Monitor the Condor queue
6. Receive notification when all courses have been archived
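A condensed bash sketch of how steps 1-4 could be chained together; build_course_list.sh, the @COURSE@ template placeholder, and the file names are hypothetical stand-ins, not the deck's actual scripts:

#!/bin/bash
# weekly_archive.sh -- hypothetical weekly driver (illustrative only)
WORKDIR=/usr/local/CMSIntegration/files
DAG="$WORKDIR/blackbird.dag"

# Steps 1-2: build the course list (stand-in for the real DB query)
/usr/local/CMSIntegration/bin/build_course_list.sh > "$WORKDIR/courses.txt"

# Step 3: one submit file per course, plus a DAG node for each
: > "$DAG"
n=0
while IFS= read -r course; do
  sub="$WORKDIR/archive_$n.sub"
  sed "s|@COURSE@|$course|" "$WORKDIR/archive.sub.template" > "$sub"
  echo "JOB archive_$n $sub" >> "$DAG"
  n=$((n + 1))
done < "$WORKDIR/courses.txt"

# Step 4: hand the whole batch to DAGMan for scheduling
condor_submit_dag "$DAG"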

Project Blackbird 17
What did it take to implement?
- Have one or more multi-core machines
- Choose one machine as your Central Manager
- Install and configure Condor on each machine
- Automate course list creation (query the DB or directory)
- Automate creation of the Condor submit files and the Condor DAGMan file
- Automate the whole thing with cron
- Check log files for errors upon archive completion
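The cron and log-check pieces might look like this (the schedule, paths, and driver script name are illustrative, not taken from the deck):

# crontab entry: start the archive run at 2 AM every Saturday
0 2 * * 6 /usr/local/CMSIntegration/bin/weekly_archive.sh

# after completion, flag any archive logs containing errors
grep -il "error" /usr/local/logs/bbCondorLogs/*.log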

Project Blackbird 18
Custom Condor configuration:

DAGMAN_MAX_JOBS_IDLE = 25
DAGMAN_MAX_JOBS_SUBMITTED = 50
## Force Condor to use Blackboard Private Network
NETWORK_INTERFACE = Private Blackboard Net

DAGMan example

# Filename: /usr/local/CMSIntegration/files/Blackbird condor.sub
# Generated by condor_submit_dag /usr/local/CMSIntegration/files/Blackbird
universe = scheduler
executable = /usr/local/condor/bin/condor_dagman
getenv = True
output = /usr/local/CMSIntegration/files/Blackbird lib.out
error = /usr/local/CMSIntegration/files/Blackbird lib.err
log = /usr/local/CMSIntegration/files/Blackbird dagman.log
remove_kill_sig = SIGUSR1
# Note: default on_exit_remove expression:
# ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >= 0 && ExitCode <= 2))
# attempts to ensure that DAGMan is automatically
# requeued by the schedd if it exits abnormally or
# is killed (e.g., during a reboot).
on_exit_remove = ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >= 0 && ExitCode <= 2))
copy_to_spool = False
arguments = "-f -l . -Debug 3 -Lockfile /usr/local/CMSIntegration/files/Blackbird lock -AutoRescue 1 -DoRescueFrom 0 -Dag /usr/local/CMSIntegration/files/Blackbird -CsdVersion $CondorVersion:' '7.2.4' 'Jun' '15' '2009' 'BuildID:' '159529' '$"
environment = _CONDOR_DAGMAN_LOG=/usr/local/CMSIntegration/files/Blackbird dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
notification = Complete
queue
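The submit file above runs DAGMan itself; the DAG input file it names with -Dag is just a list of independent per-course jobs. A sketch of generating such a file (node and file names are illustrative): no PARENT/CHILD lines are needed because the archives are independent, and the DAGMAN_MAX_JOBS_* settings on the earlier slide throttle how many run at once.

# Write a minimal DAG: one node per course archive job (illustrative names)
cat > /usr/local/CMSIntegration/files/blackbird.dag <<'EOF'
JOB archive_0 /usr/local/CMSIntegration/files/archive_0.sub
JOB archive_1 /usr/local/CMSIntegration/files/archive_1.sub
JOB archive_2 /usr/local/CMSIntegration/files/archive_2.sub
EOF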

Condor Submit example 20

universe = vanilla
requirements = (OpSys == "LINUX") && (Memory > 100) && ((Arch == "INTEL") || (Arch == "X86_64"))
executable = /usr/local/bin/condorSubmitArchive.pl
arguments = shoover-S0000BKBRD_401001,/san/weeklyArchives/ /
getenv = True
log = /usr/local/logs/bbCondorLogs/archive log
notification = Error
notify_user =
transfer_executable = False
when_to_transfer_output = ON_EXIT
queue 1
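The arguments line packs a course ID and a destination directory into one comma-separated value. The deck does not show condorSubmitArchive.pl itself; a hypothetical bash rendering of what such a wrapper plausibly does:

#!/bin/bash
# Hypothetical stand-in for condorSubmitArchive.pl (not the actual script).
# $1 arrives as "<course_id>,<destination_dir>"
IFS=',' read -r course dest <<< "$1"

# Hand a one-line course list to Blackboard's batch archiver
# (flags assumed, as in the earlier sketch)
listfile=$(mktemp)
echo "$course" > "$listfile"
/usr/local/blackboard/apps/content-exchange/bin/batch_ImportExport.sh \
  -f "$listfile" -t archive

# The resulting .ZIP would then be moved into "$dest"; the archiver's
# output location is version-specific, so that step is omitted here.
rm -f "$listfile"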

Blackbird Benefits 25
- Reduced total archive time from > 85 hours to < 24 hours
- Job scheduling: all servers finish at about the same time
- Zero impact on Blackboard performance
- Automatic suspension/resumption of archives if load reaches a threshold on any core (see the policy sketch below)
- Notification upon completion of all archives
- Load balancing: archive jobs are distributed as cores become available
- Takes advantage of all available CPU cores instead of just one core per server
- ClassAds are used to specify architecture and memory requirements for large archive jobs
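The suspend/resume behavior corresponds to standard condor_startd policy expressions. A minimal sketch of the idea; LoadAvg, CondorLoadAvg, SUSPEND, and CONTINUE are standard Condor names, but the thresholds, config path, and this specific policy are assumptions, not the deck's actual configuration:

# Append a startd policy to the local config (path varies by install)
cat >> /etc/condor/condor_config.local <<'EOF'
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
SUSPEND  = ($(NonCondorLoadAvg) > 0.5)
CONTINUE = ($(NonCondorLoadAvg) < 0.3)
EOF
condor_reconfig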

Project Blackbird 26
Recent updates:
- 64-bit Red Hat 5.4 OS and JVM 1.6
- Maximum (affordable) RAM per machine: 32 GB
- Web page to view queue and status

What's next?
- Add out-of-warranty machines to the Blackboard Condor pool (and keep users off of them)
- Monitoring of the queue
- Automate installation and configuration

Project Blackbird 27
Questions?
Sam Hoover