
Dynamic Virtual Clusters in a Grid Site Manager Jeff Chase, David Irwin, Laura Grit, Justin Moore, Sara Sprenkle Department of Computer Science Duke University

Dynamic Virtual Clusters and Grid Services (diagram slide)

Motivation: The Next-Generation Grid
- Flexibility: dynamic instantiation of software environments and services
- Predictability: resource reservations for predictable application service quality
- Performance: dynamic adaptation to changing load and system conditions
- Manageability: data center automation

Cluster-On-Demand (COD)
(Diagram: the COD manager, backed by DHCP, DNS, NIS, and NFS services and a COD database of templates and status, hosts Virtual Cluster #1 and Virtual Cluster #2.)
- Vclusters differ in: OS (Windows, Linux), attached file systems, applications, user accounts
- Goals for this talk
  - Explore virtual cluster provisioning
  - Middleware integration (feasibility, impact)

Cluster-On-Demand and the Grid
- Safe to donate resources to the grid
  - Resource peering between companies or universities
  - Isolation between local users and grid users
  - Balance local vs. global use
- Controlled provisioning for grid services
  - Service workloads tend to vary with time
  - Policies reflect priority or peering arrangements
  - Resource reservations
- Multiplex many Grid PoPs
  - Avaki and Globus on the same physical cluster
  - Multiple peering arrangements

Outline
- Overview: Motivation, Cluster-On-Demand
- System Architecture: Virtual Cluster Managers, Example Grid Service (SGE), Provisioning Policies
- Experimental Results
- Conclusion and Future Work

System Architecture
(Diagram: the COD manager, guided by a provisioning policy, reallocates nodes among three isolated vclusters A, B, and C, each running a Sun GridEngine batch pool. A VCM in each vcluster bridges the COD manager's XML-RPC interface and the GridEngine middleware layer through GridEngine commands.)

Virtual Cluster Manager (VCM)
- Communicates with the COD manager
- Supports graceful resizing of vclusters
- Simple extensions for well-structured grid services
  - Support is already present: the software handles membership changes (node failures and incremental growth)
  - Application services can handle this gracefully
(Diagram: the COD manager invokes add_nodes, remove_nodes, and resize on the VCM, which drives the service inside the vcluster. A sketch of this interface follows.)
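The slide names three calls the COD manager makes on a VCM: add_nodes, remove_nodes, and resize. Below is a minimal Python sketch of that interface; the class layout and docstrings are our illustration, not COD's actual code.

```python
class VCM:
    """Base virtual cluster manager: adapts COD resize requests to a
    specific grid service running inside the vcluster."""

    def resize(self):
        """Called each epoch; returns how many nodes this vcluster wants
        to acquire (positive) or is willing to give up (negative)."""
        raise NotImplementedError

    def add_nodes(self, nodes):
        """Join newly allocated nodes to the service."""
        raise NotImplementedError

    def remove_nodes(self, nodes):
        """Drain and release nodes that the COD manager is reclaiming."""
        raise NotImplementedError
```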

Sun GridEngine
- Ran GridEngine middleware within vclusters
- Wrote wrappers around the GridEngine scheduler; did not alter GridEngine itself
- Most grid middleware can support such modules
(Diagram: the same COD manager/VCM loop as above, with the VCM driving GridEngine through its qconf and qstat commands; a sketch follows.)
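A hedged sketch of such a wrapper, shelling out to the stock SGE command-line tools as the slides describe. The "qw" parsing heuristic and the minimal exec-host template are our simplifications; real SGE host templates carry more attributes (see sge_host_conf(5)).

```python
import subprocess
import tempfile

def _exec_host_template(node):
    # Minimal exec-host template naming the node; unset attributes
    # take SGE defaults. Illustrative only.
    f = tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False)
    f.write(f"hostname {node}\n")
    f.close()
    return f.name

class GridEngineVCM:
    """Implements the add_nodes/remove_nodes calls from the VCM sketch
    above by wrapping unmodified GridEngine commands."""

    def pending_jobs(self):
        # Heuristic: count queued jobs (state "qw") in plain qstat output.
        out = subprocess.run(["qstat"], capture_output=True, text=True).stdout
        return sum(1 for line in out.splitlines() if " qw " in line)

    def add_nodes(self, nodes):
        for node in nodes:
            # The slides gloss this as "qconf add_host"; -Ae registers
            # an execution host non-interactively from a template file.
            subprocess.run(["qconf", "-Ae", _exec_host_template(node)],
                           check=True)

    def remove_nodes(self, nodes):
        for node in nodes:
            # "qconf remove_host" in the slides; -de deletes an exec host.
            subprocess.run(["qconf", "-de", node], check=True)
```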

Pluggable Policies
- Local policy (see the sketch after this list)
  - Request a node for every x jobs in the queue
  - Relinquish a node after it has been idle for y minutes
- Global policies
  - Simple policy: each vcluster has a priority; higher-priority vclusters can take nodes from lower-priority vclusters
  - Minimum reservation policy: each vcluster is guaranteed a percentage of the nodes upon request, which prevents starvation
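The local policy is simple enough to state in a few lines. This sketch follows the two rules above; the parameter names jobs_per_node and idle_limit are ours.

```python
import time

def local_policy(queued_jobs, idle_since, jobs_per_node=5, idle_limit=10 * 60):
    """Return (nodes_wanted, nodes_to_release).

    queued_jobs: number of jobs waiting in the batch queue
    idle_since:  {node: timestamp when it last finished a job}, idle nodes only
    """
    # Request a node for every `jobs_per_node` jobs in the queue.
    nodes_wanted = queued_jobs // jobs_per_node
    # Relinquish nodes idle longer than `idle_limit` seconds.
    now = time.time()
    nodes_to_release = [n for n, t in idle_since.items() if now - t > idle_limit]
    return nodes_wanted, nodes_to_release
```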

Outline
- Overview: Motivation, Cluster-On-Demand
- System Architecture: Virtual Cluster Managers, Example Grid Service (SGE), Provisioning Policies
- Experimental Results
- Conclusion and Future Work

Experimental Setup
- Live testbed
  - Devil Cluster (IBM, NSF): 71-node COD prototype
  - Trace-driven; traces sped up to execute in 12 hours
  - Ran synthetic applications
- Emulated testbed
  - Emulates the output of SGE commands; invisible to the VCM that is using SGE
  - Trace-driven; facilitates fast, large-scale tests
  - Real batch traces from the Architecture, BioGeometry, and Systems groups

Live Test (results figure)

Architecture Vcluster (results figure)

Emulation Architecture
Each epoch:
1. The COD manager calls the resize module
2. This pushes the emulation forward one epoch
3. qstat returns the new state of the cluster
4. add_node and remove_node alter the emulator
- The emulated GridEngine front end answers qstat against trace-driven load generation (Architecture, Systems, and BioGeometry traces)
- The COD manager and VCM are unmodified from the real system; they communicate over the same XML-RPC interface, guided by the provisioning policy (a sketch of the emulated front end follows)
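A minimal sketch of what such an emulated front end could look like: it replays a batch trace and answers qstat-style queries, so the unmodified VCM cannot tell it from the real scheduler. The class name, trace format, and return shape are all assumptions.

```python
class EmulatedGridEngine:
    """Replays a batch trace and mimics the SGE front end."""

    def __init__(self, trace):
        self.trace = trace   # list of queued-job counts, one per epoch
        self.epoch = 0
        self.hosts = set()   # emulated execution hosts

    def advance(self):
        # Step 2: the resize call pushes the emulation forward one epoch.
        self.epoch += 1

    def qstat(self):
        # Step 3: report the emulated cluster state for the current epoch.
        i = min(self.epoch, len(self.trace) - 1)
        return {"epoch": self.epoch,
                "queued": self.trace[i],
                "hosts": sorted(self.hosts)}

    def add_node(self, host):
        # Step 4: allocation decisions alter the emulator directly.
        self.hosts.add(host)

    def remove_node(self, host):
        self.hosts.discard(host)
```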

Minimum Reservation Policy (results figure)

Emulation Results
- Minimum reservation policy
  - Example policy change
  - Removed the starvation problem
- Scalability
  - Ran the same experiment with 1000 nodes in 42 minutes, making all node transitions that would have occurred in 33 days
  - 3.7 node transitions per second, resulting in approximately 37 database accesses per second (roughly ten database accesses per node transition)
  - Database is scalable to large clusters

Related Work
- Cluster management: NOW, Beowulf, Millennium, Rocks
  - Homogeneous software environment for specific applications
- Automated server management: IBM's Oceano and Emulab
  - Target specific applications (Web services, network emulation)
- Grid
  - COD can support GARA for reservations
  - SNAP combines SLAs of resource components; COD controls resources directly

Future Work
- Experiment with other middleware
- Economics-based policy for batch jobs
- Distributed market economy using vclusters
  - Maximize profit based on the utility of applications
  - Trade resources among Web services, grid services, batch schedulers, etc.

Conclusion
- No change to GridEngine middleware; important for grid services
- Isolates grid resources from local resources
- Enables policy-based resource provisioning; policies are pluggable
- Prototype system with Sun GridEngine as middleware
- Emulated system enables fast, large-scale tests of policy and scalability

Example Epoch
(Diagram: the COD manager coordinates Sun GridEngine batch pools within three isolated vclusters, Architecture, Systems, and BioGeometry, through their VCMs; the numbered steps below are read off the diagram. A sketch of the manager's side follows.)
1. The COD manager calls resize on each VCM (1a, 1b, 1c)
2. Each VCM runs qstat against its batch pool (2a, 2b, 2c)
3. The vclusters answer: Architecture needs nothing (3a), Systems requests nodes (3b), BioGeometry offers to remove nodes (3c)
4. The VCMs format and forward the requests
5. The COD manager makes allocations, updates the database, and configures nodes
6. The decisions are formatted and forwarded back to the VCMs
7. The VCMs apply them via add_node (7b) and remove_node (7c)
8. GridEngine membership is updated with qconf add_host (8b) and qconf remove_host (8c), completing the node reallocation
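The epoch above is driven from the COD manager. This sketch captures only the sequence fixed by the slide (resize, requests, policy decision, add/remove); the policy and database interfaces are our stand-ins.

```python
def run_epoch(vclusters, policy, db):
    # Steps 1-4: poll each vcluster's VCM for its demand over XML-RPC.
    requests = {vc: vc.vcm.resize() for vc in vclusters}

    # Step 5: the global provisioning policy turns demands and free
    # capacity into per-vcluster allocation deltas.
    deltas = policy.allocate(requests, db.free_nodes())

    # Steps 5-8: record each decision, then push it to the VCM, which
    # applies it to GridEngine via qconf add_host / remove_host.
    for vc, delta in deltas.items():
        db.record_allocation(vc, delta)
        if delta > 0:
            vc.vcm.add_nodes(db.assign_nodes(vc, delta))
        elif delta < 0:
            vc.vcm.remove_nodes(db.reclaim_nodes(vc, -delta))
```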

New Cluster Management Architecture
- Cluster-On-Demand
  - Secure isolation of multiple user communities
  - Custom software environments
  - Dynamic policy-based resource provisioning
  - Acts as a grid site manager
- Virtual clusters: host different user groups and software environments in isolated partitions
- Virtual Cluster Manager (VCM): coordinates between local and global clusters

Dynamic Virtual Clusters
- Demand varies over time
  - Negotiate resource provisioning by interfacing with an application-specific service manager
  - Logic for monitoring load and changing membership
  - Fundamental for the next-generation grid
- COD controls local resources
  - Exports a resource negotiation interface to local grid service middleware
  - Vclusters encapsulate batch schedulers, Web services, and grid services
  - No need to place more complicated resource management into grid service middleware

Resource Negotiation
- Flexible, extensible policies for resource management
- Secure Highly Available Resource Peering (SHARP)
  - Secure external control of site resources
  - Soft-state reservations of resource shares for specific time intervals
- The COD manager and VCM communicate through an XML-RPC interface (a minimal sketch follows)
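A minimal sketch of the VCM end of that XML-RPC channel using Python's standard library. The three method names come from the slides; the host, port, and wiring are assumptions.

```python
from xmlrpc.server import SimpleXMLRPCServer

def serve_vcm(vcm, host="0.0.0.0", port=8000):
    """Expose exactly the calls the COD manager makes each epoch."""
    server = SimpleXMLRPCServer((host, port), allow_none=True)
    server.register_function(vcm.resize, "resize")
    server.register_function(vcm.add_nodes, "add_nodes")
    server.register_function(vcm.remove_nodes, "remove_nodes")
    server.serve_forever()

# COD-manager side: a proxy object stands in for the remote VCM.
#   import xmlrpc.client
#   vcm = xmlrpc.client.ServerProxy("http://vcluster-head:8000")
#   request = vcm.resize()
```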

Cluster-On-Demand (COD)
(Diagram: the COD manager, backed by DHCP, DNS, NIS, and NFS services and a COD database of templates and status, hosts a Web Services vcluster and a batch scheduler vcluster behind VCMs, serving clients.)
- Vclusters differ in: OS (Windows, Linux), attached file systems, applications, user accounts
- Goals
  - Explore virtual cluster provisioning
  - Middleware integration (feasibility, impact)
- Non-goals
  - Mechanism for managing and switching configurations

Example Node Reconfiguration
1. Node comes online
2. DHCP queries status from the database
3. If the node needs a new configuration, it loads a minimal trampoline OS via PXELinux
   - Generic x86 Linux kernel and RAM-based root file system
4. The trampoline sends a summary of the node's hardware to confd
5. Confd directs the trampoline to partition drives and install images (from the database); a sketch of this decision follows below
6. COD assigns IP addresses within a subnet for each vcluster
   - Each vcluster occupies a private DNS domain (MyDNS)
   - Nodes execute within a predefined NIS domain, enabling access for user identities
   - COD exports NFS file storage volumes; nodes obtain the NFS mount map through NIS
7. A Web interface exposes the resulting vcluster differences: OS (Windows, Linux), attached file systems, applications, user accounts
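A loose sketch of the confd decision in steps 4 and 5: compare the node's reported state against the database and tell the trampoline what to do. The function and field names are hypothetical; only the overall flow comes from the slide.

```python
def handle_boot_report(db, node, hardware_summary):
    """Confd-side handler for a trampoline's hardware report."""
    desired = db.lookup_config(node)      # template assigned to this node
    current = db.lookup_installed(node)   # image the node last installed
    if current == desired:
        # Nothing changed: boot the already-installed image from disk.
        return {"action": "boot_local"}
    # New configuration: repartition and pull the image from the database.
    return {"action": "install",
            "partitions": desired.partition_plan,
            "image": desired.image_source,
            "hardware": hardware_summary}  # recorded alongside the node
```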

System Architecture
(Diagram: users submit load via qsub to Sun GridEngine batch pools within three isolated vclusters: Architecture, Systems, and BioGeometry. Each vcluster's VCM applies a local provisioning policy and drives the GridEngine middleware layer with qconf and qstat. The COD manager applies a global provisioning policy and reallocates nodes, speaking to the VCMs through the XML-RPC interface: add_nodes, remove_nodes, resize.)

Outline
- Overview: Motivation, Cluster-On-Demand
- System Architecture: System Design, Provisioning Policies
- Experimental Results
- Conclusion and Future Work