

Infrastructure Strategies for Success Behind the University of Florida's Sakai Implementation
Chris Cuevas, Systems Administrator
Martin Smith, Systems Administrator

12th Sakai Conference – Los Angeles, California – June
What is a design pattern?
"A general reusable solution to a commonly occurring problem." [1]

Patterns for…
Change control, build promotion, deployment

Pattern: Baseline set of artifacts for a change
What do we consider a complete build?
  o Version number
  o Readme file
  o Change log
  o SQL scripts
  o Sakai 'binary' distribution
Having all of these reduces ambiguity and recovery time, and improves the chance of catching errors early.
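A completeness check like this can be automated. The following is a minimal sketch, not the presenters' tooling: the slide lists the artifact types, and every file name below is a hypothetical placeholder.

```python
from pathlib import Path

# Hypothetical file names; the slide names the artifact types only.
REQUIRED_ARTIFACTS = {
    "version number": "VERSION",
    "readme file": "README",
    "change log": "CHANGELOG",
    "SQL scripts": "sql",
    "Sakai 'binary' distribution": "sakai-bin.tar.gz",
}


def missing_artifacts(build_dir):
    """Return the names of baseline artifacts absent from build_dir."""
    root = Path(build_dir)
    return [name for name, rel in REQUIRED_ARTIFACTS.items()
            if not (root / rel).exists()]
```

A build would be rejected for promotion whenever `missing_artifacts(...)` returns a non-empty list.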

Pattern: Build promotion process
All changes are load tested and functionally tested against monitoring scripts (i.e., our test cluster is the same size as our prod cluster, and it is monitored like prod).
All changes require:
  o A full two weeks of testing time
  o A go/no-go decision at least 4 days before the change (this allows us to announce it)
  o A maintenance window of at least 2 hours
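The three timing rules can be expressed as a simple schedule check. This is an illustrative sketch only; it assumes that testing time is measured from the start of testing to the start of the maintenance window, which the slide does not state, and the function name is made up.

```python
from datetime import datetime, timedelta

MIN_TESTING = timedelta(weeks=2)   # full two weeks of testing
MIN_NOTICE = timedelta(days=4)     # go/no-go at least 4 days ahead
MIN_WINDOW = timedelta(hours=2)    # maintenance window of at least 2 hours


def promotion_problems(test_start, decision, window_start, window_end):
    """Return a list of rule violations; an empty list means the plan passes."""
    problems = []
    if window_start - test_start < MIN_TESTING:
        problems.append("less than two weeks of testing")
    if window_start - decision < MIN_NOTICE:
        problems.append("go/no-go decision less than 4 days before the change")
    if window_end - window_start < MIN_WINDOW:
        problems.append("maintenance window shorter than 2 hours")
    return problems
```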

Pattern: Maintenance for a new build
During a deployment/build promotion, we have two strategies:
  o Rolling restart: Quiesce nodes, upgrade them, and reintroduce them
  o Full outage: Stop all nodes, upgrade in chunks, apply any SQL, and start them all
Session replication is key here for seamless upgrades (and with Sakai, we don't have it).
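The rolling-restart strategy can be sketched as a loop over nodes. All four callables here (`quiesce`, `upgrade`, `reintroduce`, `is_healthy`) are hypothetical hooks standing in for whatever the load balancer and deployment tooling provide; the health gate before reintroduction is an added assumption, not something the slide describes.

```python
def rolling_restart(nodes, quiesce, upgrade, reintroduce, is_healthy):
    """Upgrade nodes one at a time; stop at the first node that fails its check."""
    done = []
    for node in nodes:
        quiesce(node)       # stop sending new traffic to the node
        upgrade(node)       # install the new build
        if not is_healthy(node):
            raise RuntimeError(f"{node} unhealthy after upgrade; halting rollout")
        reintroduce(node)   # put the node back into rotation
        done.append(node)
    return done
```

Without session replication, users still attached to a quiesced node lose their session when it restarts, which is why the full-outage strategy remains an option.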

Patterns for…
Other software (OS/DB/etc. patches, updates)

Patterns: Other updates
High-risk packages are identified and only updated by those who know the application best
All other packages are updated (at least) quarterly
Database patches are done best-effort (for now)
Rarely, infrastructure-wide changes will affect a particular service worse than others
We reserve a weekly maintenance window
Least well understood at this time

Patterns for…
Traffic management

Pattern: Application stack
User traffic dispatching:
  o Sticky TCP traffic to Apache httpd frontends based on perceived health
  o Cookie-based route from httpd to Tomcat, with the ability to select a node
  o Neither of these fails over session information well
We’re considering a design pattern where we combine the httpd+Tomcat stack and do full NAT dispatching so that we can get more change flexibility
Compare other architectures
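Cookie-based routing can be illustrated with a small sketch. It assumes the common Tomcat convention in which the configured route name is appended to the session ID after a dot (for example `ABC123.node2`); the function, the route names, and the "first healthy node" fallback are all hypothetical and do not describe the presenters' httpd configuration.

```python
def pick_backend(session_id, routes, healthy):
    """Pick a Tomcat node for a request.

    session_id: value of the JSESSIONID cookie, or None for a new visitor
    routes:     mapping of route name -> backend address
    healthy:    set of route names currently passing health checks
    """
    if session_id and "." in session_id:
        route = session_id.rsplit(".", 1)[1]
        if route in routes and route in healthy:
            return routes[route]          # sticky: same node as before
    candidates = sorted(r for r in routes if r in healthy)
    if not candidates:
        raise RuntimeError("no healthy backends")
    # New or orphaned session: the user lands on another node and, without
    # session replication, loses whatever state the old node held.
    return routes[candidates[0]]
```

The fallback branch is where the failover weakness shows up: the request is served, but the session is not.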

[Figure: Current cluster layout]

[Figure: Current cluster layout as two sites]

[Figure: Site-local dispatching]

[Figure: Combining more of the stack]

Pattern: Resource clustering
Database failover is now automatic with Oracle & JDBC
File tier still doesn't do failover in any nice way
Application+web tier no longer has complex dependencies (all state for a user now lives on a single server)
Split presence across two sites for database (Oracle Data Guard), file storage (EMC Celerra), and app/web tier (VMware)

Patterns for…
Monitoring and logging

Pattern: System health checks
Overall:
  o Fully synthetic login to Sakai
  o Cluster checks on Apache and Tomcat (more than X out of Y servers in the cluster in a bad state)
  o Wget?
Individual server checks for web, app, db tiers:
  o Database connection pool
  o Clock, SNMP, Ping, Disk
  o Java processes, Apache configtest
  o AJP and web response time and status codes
  o Replication health, available storage growth
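The "more than X out of Y" cluster check reduces to a threshold on the number of failing nodes. A minimal sketch, with a hypothetical function name and input shape:

```python
def cluster_alert(node_ok, max_bad):
    """Return True when more than max_bad nodes are in a bad state.

    node_ok: mapping of node name -> bool (True means the node passed its check)
    max_bad: the X in "more than X out of Y servers"
    """
    bad = [name for name, ok in node_ok.items() if not ok]
    return len(bad) > max_bad
```

With `max_bad=1`, a single failed node is left to the individual server checks, while two or more failures raise the cluster-level alert.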

Pattern: Interventions
Fully automated functional test that authenticates and requests some course sites
Response time is as important as success or failure
We’re hesitant to automatically restart application nodes: since session replication isn’t available, this would be a major interruption to our users
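Grading a functional test on latency as well as outcome could look like the sketch below. The `action` callable is a hypothetical stand-in for the scripted login and course-site requests, and the three result labels are illustrative.

```python
import time


def timed_check(action, max_seconds, clock=time.monotonic):
    """Run a functional test and grade it on both outcome and latency.

    action: hypothetical callable that logs in and fetches some course
            sites, returning True on success.
    Returns "ok", "slow" or "failed".
    """
    start = clock()
    try:
        succeeded = bool(action())
    except Exception:
        succeeded = False   # any error in the test counts as a failure
    elapsed = clock() - start
    if not succeeded:
        return "failed"
    return "slow" if elapsed > max_seconds else "ok"
```

Note that the function only reports a result; consistent with the slide, it does not restart anything.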

Pattern: Collecting data
Collect the usual suspects: Sakai events, automatic (?) thread dumps to detect stuck processes, server-status results
Sakai health: a .jsp file that dumps many data points (JVM memory, ehcache stats, database pools, etc.)
For anything we can pull from the JVM or Sakai APIs, we use that .jsp file and collectd

Pattern: Application responsiveness
Also known as "Get close to the user"
Bug reports are aggregated using a shared mailbox; we send daily/weekly/yearly reports with buckets for browser, user, course site, tool, stack trace hash, etc.
Redirection for 4XX/5XX HTTP status codes as much as possible, with explanations
Timeouts for long-running activities, to make sure traffic isn’t waiting forever
Watch for AJP errors from specific application servers

Summary of weekly Sakai bug reports for :

browser-id => count:
  Mac-Mozilla => 377
  Win-InternetExplorer => 356
  Win-Mozilla => 194
  UnknownBrowser => 33
  empty => 12

service-version => count:
  [r329] => 967
  empty => 8

user => count:
  atorres78 (Alina Torres) => 32
  lisareeve (Lisa Jacobs) => 26
  ziggy41 (Stefan Katz) => 15
  ngrosztenger (Nathalie Grosz-Tenger) => 14
  agabriel2450 (Gabriel Arguello) => 12

stack-trace-digest => count:
  41D7C94702B20B270953EBB00ECA9F5C1388A393 => 180
  DEB88C2307DA572C9C1EFE1E8E17828DC29A7C00 => 154
  A600DAE1792C82B1472C9980EED8938E5F39B4F0 =>
  E2F E1BC1A24DF953560B7845BDCE =>
  CF39E8D34570CD3D79152B757A090AB6AB39F => 24

app-server => count:
  sakaiapp-prod06.osg.ufl.edu => 154
  sakaiapp-prod02.osg.ufl.edu => 146
  sakaiapp-prod04.osg.ufl.edu => 118
  sakaiapp-prod05.osg.ufl.edu => 96
  sakaiapp-prod03.osg.ufl.edu => 83
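A report of this shape can be produced by counting bug reports per bucket. The sketch below is not the presenters' script: the input format and field handling are hypothetical, and the use of SHA-1 is an assumption based only on the intact digests above being 40 hexadecimal characters long.

```python
import hashlib
from collections import Counter


def stack_trace_digest(trace):
    """Upper-case SHA-1 hex digest of a stack trace's text (assumed scheme)."""
    return hashlib.sha1(trace.encode("utf-8")).hexdigest().upper()


def summarize(reports, fields=("browser-id", "app-server")):
    """Count bug reports per value of each field, plus per stack-trace digest.

    reports: iterable of dicts; missing or blank values are counted as "empty".
    """
    summary = {f: Counter() for f in fields}
    summary["stack-trace-digest"] = Counter()
    for report in reports:
        for f in fields:
            summary[f][report.get(f) or "empty"] += 1
        trace = report.get("stack-trace")
        key = stack_trace_digest(trace) if trace else "empty"
        summary["stack-trace-digest"][key] += 1
    return summary
```

Hashing the trace lets identical failures collapse into one bucket, so the most frequent errors surface at the top of the report via `Counter.most_common()`.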

Patterns for…
Backup and recovery

Pattern: Backing up for DR
File tier is backed up every 4 hours, with a 2-week retention window
Database tier is backed up daily, with archived redo logs every 4 hours, and a 2-week retention window
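The schedule and retention figures lend themselves to two small checks: which backups have aged out, and whether the newest backup is overdue. This is a sketch with hypothetical function names; it operates on timestamps only and does not touch any backup tool.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(weeks=2)   # 2-week retention window from the slide


def expired_backups(backup_times, now):
    """Return the backups older than the retention window, oldest first."""
    cutoff = now - RETENTION
    return sorted(t for t in backup_times if t < cutoff)


def is_stale(latest_backup, now, interval):
    """True when the newest backup is older than the expected interval."""
    return now - latest_backup > interval
```

For the file tier, `is_stale(latest, now, timedelta(hours=4))` flags a missed 4-hourly backup; for the daily database backup the interval would be `timedelta(days=1)`.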

Pattern: Backing up user data
Hoping this comes from application-specific operations to back up and restore (and delete!) user-specific data
Can't do a full restore of your files and database every time a user deletes a site by accident
Strive for reasonable windows of retention (e.g. hardware, software, application-level data)
This is supposedly coming in Sakai 2.x

Pattern: Multi-site replication
Database and file tier are both replicated to a second site
The file tier is also redundant internally; some manual intervention is still required there

Pattern: Bringing production to test
We use ‘snapshot standby’ in Oracle RDBMS to take read-consistent copies of production for reloading test and development copies
We use rsync to copy over the file storage tier
With our full set of build artifacts from earlier, we can always build a complete version of what's in prod

Thank you! Questions?