Designing High Availability Networks, Systems, and Software for the University Environment Deke Kassabian and Shumon Huque The University of Pennsylvania.

Presentation transcript:

Designing High Availability Networks, Systems, and Software for the University Environment Deke Kassabian and Shumon Huque The University of Pennsylvania January 14, 2004

Copyright D.Kassabian and S.Huque [2004]. This work is the intellectual property of the authors. Permission is granted for this material to be shared for non-commercial, educational purposes, provided that this copyright statement appears on the reproduced materials and notice is given that the copying is by permission of the author. To disseminate otherwise or to republish requires written permission from the authors.

About Penn: The University of Pennsylvania was founded by Ben Franklin in 1740. Penn is part of the Ivy League. Located in West Philadelphia. Community of more than 35,000 people.

General Goals: Networked services available as expected by our users. Minimized time to repair (TTR) when outages do occur. Ability to perform maintenance and upgrades (planned downtime) non-disruptively. Cost effectiveness in meeting these goals.

Definitions: Availability; High Availability (HA); Rapid Recovery (RR); Disaster Recovery (DR); Basic Systems

Definitions: Disaster Recovery (DR) - the process of restoring a service to full operation after an interruption in service

Definitions: Basic System - a {Network, System, Service} with only the most basic of protections against outages. Examples: a network recoverable using spare parts; a single computer system with RAID disk; a service recoverable from tape backups.

Definitions: Availability - the percentage of total time that a {Network, System, Service} is available for use. Related points: advertised periods of availability; availability as advertised; absolute availability.
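The distinction between absolute availability and availability within an advertised period can be illustrated with a small calculation. The following is only a sketch: the `Poll` record, the hourly polling interval, and the 08:00 to 20:00 service window are assumptions made for the example, not details from the presentation.

```python
from dataclasses import dataclass


@dataclass
class Poll:
    """One polling result (hypothetical record layout)."""
    hour_of_day: int  # 0-23, local hour at which the poll ran
    up: bool          # True if the service answered the poll


def availability(polls):
    """Percentage of polls that found the service up."""
    if not polls:
        raise ValueError("no polls to measure")
    return 100.0 * sum(p.up for p in polls) / len(polls)


def advertised_availability(polls, start_hour, end_hour):
    """Availability counted only inside the advertised service window."""
    in_window = [p for p in polls if start_hour <= p.hour_of_day < end_hour]
    return availability(in_window)


# One poll per hour for a day; the service is down from 02:00 to 03:59.
polls = [Poll(hour_of_day=h, up=h not in (2, 3)) for h in range(24)]

absolute = availability(polls)                      # counts all 24 hours
advertised = advertised_availability(polls, 8, 20)  # assumed 08:00-20:00 window
print(f"absolute: {absolute:.2f}%  advertised: {advertised:.2f}%")
```

In this made-up day the overnight outage lowers absolute availability but leaves availability within the advertised window untouched, which is why it matters which of the two figures is being reported.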

Definitions: High Availability (HA) - a {Network, System, Service} with specific design elements intended to keep availability above a high threshold (e.g., 99.99%)
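To make a threshold such as 99.99% concrete, it helps to convert it into a yearly downtime budget. This is a minimal sketch; the function name and the choice of a 365-day year are assumptions for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(availability_percent):
    """Downtime allowed per year at a given availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


for target in (99.0, 99.9, 99.99):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions 99.99% leaves a little under an hour of downtime per year, while 99.0% leaves more than three and a half days.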

Definitions: Rapid Recovery (RR) - a {Network, System, Service} with specific design elements intended to recover from downtime very quickly (e.g., 15 minutes)

Metrics: Economics of high availability (the costs of non-availability). Calculating availability. How availability measurements are performed.

Economics of high availability: What is the cost of an outage in your: student courseware systems and student record systems; financial systems; primary campus web site and servers; DNS, DHCP and AuthN systems; Internet connection(s); Development / Gifts systems? How much should you be willing to spend to minimize downtime of any or all of these?

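Calculating availability: Availability can be measured directly through periodic polling (e.g., SNMP, Mon, Nagios). A formula for predicting availability of a single component: $A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{TTR}}$ or, equivalently, $A = 1 - \frac{\mathrm{TTR}}{\mathrm{MTBF} + \mathrm{TTR}}$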
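A short worked example of the single-component prediction, reading the formula as availability = MTBF / (MTBF + TTR), where MTBF is mean time between failures and TTR is time to repair. The figures used (10,000 hours MTBF, 4 hours TTR) are invented for illustration and do not come from the presentation.

```python
def predicted_availability(mtbf_hours, ttr_hours):
    """Predicted availability of one component, as a fraction of time."""
    if mtbf_hours <= 0 or ttr_hours < 0:
        raise ValueError("MTBF must be positive and TTR non-negative")
    return mtbf_hours / (mtbf_hours + ttr_hours)


# Hypothetical component: fails every 10,000 hours, takes 4 hours to repair.
mtbf, ttr = 10_000.0, 4.0
a = predicted_availability(mtbf, ttr)

# The second form, 1 - TTR / (MTBF + TTR), gives the same value.
a_alternate = 1.0 - ttr / (mtbf + ttr)

print(f"predicted availability: {100 * a:.2f}%")
```

The formula also shows the two levers available to a designer: raise MTBF (more reliable parts) or lower TTR (faster repair), which is the idea behind the Rapid Recovery approach.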

Design Principles Towards HA: Minimize points of catastrophic failure. Maximize redundancy. Minimize fault zones. Minimize complexity and cost. Applying the above principles to: Networks; Systems; Services.
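The effect of redundancy and of single points of failure can be estimated with the standard series and parallel availability formulas. This is a sketch under a strong assumption that component failures are independent; shared power, shared fiber paths, or common software bugs would make real results worse. The component availabilities are invented for the example.

```python
import math


def series_availability(availabilities):
    """All components are required: availabilities multiply."""
    return math.prod(availabilities)


def parallel_availability(availability, replicas):
    """Any one of several identical replicas is enough.

    The service is down only when every replica is down at once,
    assuming the replicas fail independently.
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - availability) ** replicas


# A chain of two required components, each 99.9% available.
chain = series_availability([0.999, 0.999])

# Three replicated servers, each only 99% available on its own.
replicated = parallel_availability(0.99, 3)

print(f"chain of two: {100 * chain:.4f}%")
print(f"three replicas: {100 * replicated:.4f}%")
```

Chaining required components lowers availability, while replication raises it sharply, which is one way to read the advice to minimize points of failure and maximize redundancy.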

Specific examples at Penn: High Availability Services; Rapid Recovery Services

High Availability Design: Strategies employed to achieve HA: server redundancy; hardware component redundancy; storage redundancy (RAID); network redundancy; redundant power, A/C, cooling, etc.; application protocols that can transparently fail over to alternate servers; secondary offsite hosting (of some services like DNS).

Rapid Recovery Design: Strategies employed to achieve RR: standby servers and storage; some HA design elements (hardware redundancy, storage redundancy, network redundancy, power and A/C redundancy, etc.). Note: services deployed in the RR model typically don't have an easy way to transparently fail over to alternate servers (eg. , Web etc)

Network Aggregation Point (NAP): Machine rooms in separate campus locations that house critical network electronics and servers. Good environmentals and connectivity to the campus fiber-optic cable plant. Both HA and RR services utilize multiple NAPs.

Central Infrastructure Networks, AKA NOC Networks (historical name): 3 highly redundant IP networks that house systems providing critical infrastructure services. Each network is triply connected to the campus routing core via distinct NAP locations. Use of router redundancy protocols (VRRP) and Layer-2 path redundancy (802.1D) for high availability.

HA Server Platforms: Two sets of three replicated servers. 3 KDC servers: central authentication. 3 NOC servers: everything else. Kerberos runs on separate systems mainly for security reasons.

High Availability: KDCs. KDCs (3): 3 distinct machines (kdc1, kdc2, kdc3), each located in a different campus machine room, each connected to a distinct IP network via a distinct IP core router. Additionally, each network is triply connected to the campus routing core via 3 NAPs.

High Availability: NOCs. 3 NOC systems (a historical name). Provide: DNS, DHCP, NTP, RADIUS plus a few homegrown services. Same physical and network connectivity as the KDCs. In addition: some servers have a secondary interface on a different NOC network (for reasons to be explained later).

HA Application Failover: Kerberos; DNS; RADIUS; NTP; DHCP (current spec supports only 2 failover systems). Non-HA homegrown services: PennNames.
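The general idea behind application-level failover is that the client holds a list of equivalent servers and moves on to the next one when a server does not answer. The sketch below shows that pattern in isolation. It is not how any particular Kerberos, DNS, or RADIUS client is implemented (those protocols have their own retry logic in their client libraries), and the `example.edu` hostnames and the `fake_query` function are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def query_with_failover(servers: Sequence[str], query: Callable[[str], T]) -> T:
    """Try each server in order and return the first successful answer."""
    errors = []
    for server in servers:
        try:
            return query(server)
        except OSError as exc:  # refused, timed out, unreachable, ...
            errors.append(f"{server}: {exc}")
    raise ConnectionError("all servers failed: " + "; ".join(errors))


def fake_query(server: str) -> str:
    """Stand-in for a real protocol exchange; the first server is 'down'."""
    if server.startswith("kdc1."):
        raise ConnectionRefusedError("connection refused")
    return f"answer from {server}"


# Hypothetical hostnames for three replicated servers.
servers = ["kdc1.example.edu", "kdc2.example.edu", "kdc3.example.edu"]
result = query_with_failover(servers, fake_query)
print(result)
```

Because the retry happens inside the client, the user sees at most a short delay rather than an outage, which is what makes the failover transparent.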

Rapid Recovery service: Example: and Web service. A set of servers and storage is replicated at two sites: primary and standby. Primary site: active servers and storage. Secondary site: standby servers and replicated storage. Data from the primary site is synchronously replicated to the secondary site. Two separate Fibre Channel networks interconnect systems and storage at both sites. Catastrophic failure event: the system can be manually reconfigured to use the standby servers and/or secondary storage (~30 minutes). Servers are located on the HA primary infrastructure network.
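Why synchronous replication matters for recovery can be shown with a toy model. In the sketch below, in-memory dictionaries stand in for the primary and standby storage; the `ReplicatedStore` class and its asynchronous mode are assumptions added for contrast and are not a description of the storage product used at Penn.

```python
class ReplicatedStore:
    """Toy model of primary storage with a standby copy at a second site."""

    def __init__(self, synchronous=True):
        self.synchronous = synchronous
        self.primary = {}
        self.standby = {}
        self.pending = []  # writes not yet copied (asynchronous mode only)

    def write(self, key, value):
        self.primary[key] = value
        if self.synchronous:
            # Synchronous: the write completes only once both sites have it.
            self.standby[key] = value
        else:
            # Asynchronous: the standby copy is updated some time later.
            self.pending.append((key, value))

    def flush(self):
        """Copy queued writes to the standby site."""
        for key, value in self.pending:
            self.standby[key] = value
        self.pending.clear()

    def read_after_failover(self, key):
        """What the standby site can serve if the primary site is lost."""
        return self.standby.get(key)


sync_store = ReplicatedStore(synchronous=True)
sync_store.write("msg-1", "hello")

async_store = ReplicatedStore(synchronous=False)
async_store.write("msg-1", "hello")  # primary fails before flush() runs

print(sync_store.read_after_failover("msg-1"))   # data survived
print(async_store.read_after_failover("msg-1"))  # data was lost
```

With synchronous replication the standby site holds every acknowledged write, so a manual switch to the standby servers loses no data; the cost is that each write waits on the second site.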

Experiences at Penn: Where these approaches have been helpful: higher availability, non-disruptive maintenance. Where they have not: complexity can be hard to manage! Where cost has been high: replicated systems and networks, high-end storage solutions. Real availability experience: DNS, a critical service, went from 99.0% to % availability!

Future Enhancements: Making RR services highly available: clustering, IETF rserpool, etc. Metropolitan area DR (or better). Others: IP Multipathing. Trunking links to servers: 802.3ad, SMLT, DMLT or similar. Rapid Spanning Tree (IEEE 802.1w). Multi-master KADM service. Improved management and monitoring infrastructure.

Feedback: Questions, comments. Your designs, experiences, successes. Contact Info: