PCGRID ‘08 Workshop, Miami, FL, April 18, 2008
Preston Smith
Implementing an Industrial-Strength Academic Cyberinfrastructure at Purdue University

Introduction
– Environment
– Motivation
Challenges
– Infrastructure
– Usage Tracking
– Storage
– Staffing
Future Work
Results
BoilerGrid

BoilerGrid - Growth

How did we get from here… to here?

BoilerGrid - Rosen Center for Advanced Computing

Research computing arm of ITaP - Information Technology at Purdue
Clusters in RCAC are arranged in larger “Community Clusters”
– One cluster, one configuration, many owners
– Leverages economies of scale for purchasing, and provides expertise in systems engineering, user support, and networking

BoilerGrid - Motivation

Early on, we recognized that the diverse owners of the community clusters don’t use the machines at 100% capacity
– Community clusters used approximately 70% of their capacity
– Condor was installed on the community clusters to cycle-scavenge from PBS, the primary scheduler (a configuration sketch follows)
Goal: provide a general-purpose high-throughput computing resource on existing hardware
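
One common way to let Condor scavenge cycles behind PBS (a sketch only, not necessarily the exact mechanism Purdue used) is to have a PBS prologue/epilogue hook advertise a custom machine ClassAd attribute and key the startd policy off it. PBS_JOB_RUNNING below is a hypothetical attribute name, not a Condor built-in:

  # Hypothetical attribute advertised by a PBS prologue/epilogue hook;
  # Condor only starts work when PBS does not own the node, and backs off quickly.
  START   = (PBS_JOB_RUNNING =!= True)
  SUSPEND = (PBS_JOB_RUNNING =?= True)
  PREEMPT = (PBS_JOB_RUNNING =?= True)
  # Give a vacating job a few minutes before a hard kill.
  KILL    = $(PREEMPT) && ((CurrentTime - EnteredCurrentState) > 300)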

BoilerGrid - Challenges

In 2005, the Condor deployment at Purdue was unable to scale to the size of the clusters, and ran on an old version of the software
An overhaul of the Condor infrastructure was needed!

BoilerGrid - Keep Condor Up-to-date

Upgrading Condor
– In late 2005, we were running Condor version 6.6.5, which was 1.5 years old.
– First, we needed to upgrade!
In a large, busy Condor grid, we found it’s usually advantageous to run the development release of Condor
– Early access to new features and scalability improvements

BoilerGrid - Pool Design

Use many machines
– In 2005, we ran a single Condor pool with ~1800 machines; at the time, the largest single Condor pools in existence were ~1000 machines.
– We implemented BoilerGrid as a flock of 4 pools of up to 1200 machines each (a configuration sketch follows).
– Implementing BoilerGrid today? It would look much different!
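
A minimal sketch of how such a flock is wired together in the Condor configuration; the central-manager hostnames below are placeholders rather than Purdue’s real machine names:

  # On every submit host in pool A: where jobs may flock when pool A cannot run them.
  FLOCK_TO = cm-poolb.rcac.purdue.edu, cm-poolc.rcac.purdue.edu, cm-poold.rcac.purdue.edu

  # On the central manager and execute nodes of each pool: which remote submit
  # hosts may flock in, and grant them write access.
  FLOCK_FROM = *.rcac.purdue.edu
  HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), $(FLOCK_FROM)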

BoilerGrid - Submit Hosts

Many submit hosts
– In 2005, a single host ran the Condor schedd and could submit jobs
– Today, any machine in RCAC used for user login, and in many cases end-user desktops, can submit Condor jobs (a configuration sketch follows)
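
Turning a login node or desktop into a submit-only host mostly amounts to running a schedd that reports to the pool’s collector; a minimal sketch, with a placeholder central-manager name:

  # condor_config on the new submit host
  CONDOR_HOST = cm-poola.rcac.purdue.edu    # placeholder central manager
  DAEMON_LIST = MASTER, SCHEDD              # no startd: submit-only machine
  # The central manager's HOSTALLOW_WRITE must also include this host.

Users on such a machine can then run condor_submit and condor_q as they would on any cluster front-end.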

BoilerGrid - Challenges

Usage Tracking
– Tracking job-level accounting with a large Condor pool is difficult
– Job history resides on every submit host
– Recent versions of Condor’s Quill software allow for a central database holding job (and machine) information
  – Deploying this on BoilerGrid now (a configuration sketch follows)
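
With Quill, each schedd streams its job queue and history into a shared PostgreSQL server. A sketch using knob names from the Quill-era Condor manual; the hostnames and database name are placeholders:

  # condor_config on each submit host
  DAEMON_LIST      = $(DAEMON_LIST), QUILL
  QUILL_ENABLED    = TRUE
  QUILL_NAME       = quill@submit01.rcac.purdue.edu    # placeholder
  QUILL_DB_TYPE    = PGSQL
  QUILL_DB_NAME    = quill
  QUILL_DB_IP_ADDR = quill-db.rcac.purdue.edu:5432     # placeholder central database
  QUILL_DB_QUERY_PASSWORD = changeme                    # read-only password used by query tools

Pool-wide questions (“what ran last month, and for whom?”) can then be answered from one database rather than from the history files scattered across every submit host.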

BoilerGrid - Storage

If your users expect to run jobs using a shared filesystem, a large Condor installation can overwhelm NFS servers.
DAGMan and user logs on NFS can cause problems
– The defaults don’t allow this for a reason!
Train users to rely less on the shared filesystem and take advantage of Condor’s ability to transfer files (a sample submit file follows)
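
A minimal vanilla-universe submit file that leans on Condor’s file transfer instead of NFS; the executable and file names are only illustrative:

  universe                = vanilla
  executable              = analyze                  # illustrative program name
  arguments               = input.$(Process).dat
  transfer_input_files    = input.$(Process).dat
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  output                  = analyze.$(Cluster).$(Process).out
  error                   = analyze.$(Cluster).$(Process).err
  log                     = /tmp/analyze.$(Cluster).log   # keep the user log off NFS
  queue 100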

BoilerGrid - Expansion

Successful use of Condor in the clusters led us to identify partners around campus
– Student computer labs operated by a sister unit in ITaP (2500 machines and growing)
– Library terminals (200 machines)
– Other campuses (500+ machines)
Management support is critical!
– Purdue’s CIO supports using Condor on many machines run by ITaP, including the one on his own desk

BoilerGrid - Expansion

An even better route of expansion
– Condor users adding their own resources
  – Machines in their own lab
  – All the machines in their department
With distributed ownership come new challenges
– Regular contact with the owners’ system administration staff
– Ensure that owners are able to set their own policies (a policy sketch follows)
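
Letting owners set policy largely comes down to handing them the startd policy expressions for their machines; a sketch of a typical desktop-friendly policy, with illustrative thresholds:

  # Only run jobs when the console has been idle for a while and the machine
  # is not busy with the owner's own work.
  START    = (KeyboardIdle > 15 * 60) && ((LoadAvg - CondorLoadAvg) < 0.3)
  # Step aside as soon as the owner returns.
  SUSPEND  = (KeyboardIdle < 60)
  CONTINUE = (KeyboardIdle > 15 * 60)
  # Evict a job that has been suspended for more than ten minutes.
  PREEMPT  = (Activity == "Suspended") && ((CurrentTime - EnteredCurrentState) > 600)
  # Owners can also prefer their own users' jobs; the user name is hypothetical.
  RANK     = (Owner == "labuser")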

BoilerGrid - Staffing

Implementing BoilerGrid required minimal staff effort
– Assuming an IT infrastructure already exists that can operate many machines
– 0.25 FTE ongoing to maintain Condor and coordinate with distributed Condor installations
With success comes more demand, and the end-user support to go along with it
– 1.5 FTE of science support consultants assist with porting codes and training users to use Condor effectively

BoilerGrid - Future Work

TeraGrid (NSF HPCOPS) - portal for submission and monitoring of Condor jobs
Centralized Quill database for job and machine state
– An excellent source of data for future research in distributed systems

BoilerGrid - Results

[Table: BoilerGrid by year - pool size, jobs run, hours delivered, and unique users; the numeric values are garbled in this transcript, apart from 63 unique users so far in the current year]

BoilerGrid - Results

BoilerGrid - Conclusions

Condor is a powerful tool for getting real science done on otherwise unused hardware

Questions?