Infrastructure Strategies for Success Behind the University of Florida's Sakai Implementation
Chris Cuevas, Systems Administrator
Martin Smith, Systems Administrator
12th Sakai Conference – Los Angeles, California – June
What is a... design pattern?
"A general reusable solution to a commonly occurring problem." [1]
[1]
Patterns for… change control, build promotion, deployment
Pattern: Baseline set of artifacts for a change
What do we consider a complete build?
  o Version number
  o Readme file
  o Change log
  o SQL scripts
  o Sakai 'binary' distribution
A fixed baseline reduces ambiguity and recovery time, and improves the chance of catching errors early
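The baseline list lends itself to an automated completeness check before a build is accepted. The following is a minimal sketch, not the presenters' tooling: the file and directory names (`VERSION`, `README`, `CHANGELOG`, `sql`, `sakai-bin.tar.gz`) are hypothetical, since the slide names the kinds of artifact but not how they are stored.

```python
from pathlib import Path

# Hypothetical names: the slide lists artifact kinds, not file names.
REQUIRED_ARTIFACTS = {
    "version number": "VERSION",
    "readme file": "README",
    "change log": "CHANGELOG",
    "SQL scripts": "sql",
    "Sakai binary distribution": "sakai-bin.tar.gz",
}


def missing_artifacts(build_dir):
    """Return the sorted kinds of artifact that are absent from build_dir."""
    root = Path(build_dir)
    return sorted(
        kind
        for kind, name in REQUIRED_ARTIFACTS.items()
        if not (root / name).exists()
    )
```

A build would be considered complete only when `missing_artifacts(path)` returns an empty list.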
Pattern: Build promotion process
All changes are load tested and functionally tested against monitoring scripts (our test cluster is the same size as our prod cluster, and it is monitored like prod)
All changes require:
  o a full two weeks of testing time
  o a go/no-go decision at least 4 days before the change (this allows us to announce it)
  o a maintenance window of at least 2 hours
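The three timing rules can be checked mechanically when a change is scheduled. This sketch assumes the two weeks of testing are measured from the start of testing to the start of the maintenance window; the slide does not say exactly where the testing period ends. The function and parameter names are illustrative.

```python
from datetime import datetime, timedelta

MIN_TESTING = timedelta(weeks=2)   # full two weeks of testing
MIN_NOTICE = timedelta(days=4)     # go/no-go at least 4 days before
MIN_WINDOW = timedelta(hours=2)    # maintenance window of at least 2 hours


def promotion_problems(testing_start, decision, window_start, window_end):
    """Return a list of rule violations; an empty list means the plan is valid."""
    problems = []
    # Assumption: testing time runs from testing_start to window_start.
    if window_start - testing_start < MIN_TESTING:
        problems.append("less than two weeks of testing")
    if window_start - decision < MIN_NOTICE:
        problems.append("go/no-go decision less than 4 days before the change")
    if window_end - window_start < MIN_WINDOW:
        problems.append("maintenance window shorter than 2 hours")
    return problems
```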
Pattern: Maintenance for a new build
During a deployment/build promotion, we have two strategies:
  o Rolling restart: quiesce nodes, upgrade them, and reintroduce them
  o Full outage: stop all nodes, upgrade in chunks, apply any SQL, and start them all
Session replication is key for seamless upgrades (and with Sakai, we don't have it).
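The rolling restart strategy can be sketched as a loop over batches of nodes. This is an illustration only: the four callables (`quiesce`, `upgrade`, `reintroduce`, `is_healthy`) are hypothetical hooks standing in for whatever the load balancer and deployment tooling actually provide, and the health gate before reintroduction is an added assumption.

```python
def rolling_restart(nodes, quiesce, upgrade, reintroduce, is_healthy, batch_size=1):
    """Upgrade nodes batch by batch; stop at the first node that fails its check.

    Returns the list of nodes that were upgraded and put back into service.
    """
    done = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            quiesce(node)        # stop sending new traffic to the node
        for node in batch:
            upgrade(node)        # install the new build
        for node in batch:
            if not is_healthy(node):
                # Assumed policy: halt so the rest of the cluster keeps serving.
                raise RuntimeError(f"{node} failed its health check after upgrade")
            reintroduce(node)    # put the node back behind the dispatcher
            done.append(node)
    return done
```

Without session replication, users still attached to a quiesced node lose their sessions unless the node is drained first, which is why the slide calls replication key for a seamless upgrade.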
Patterns for… other software (OS/DB/etc. patches, updates)
Patterns: Other updates
High-risk packages are identified and updated only by those who know the application best
All other packages are updated (at least) quarterly
Database patches are done best-effort (for now)
Rarely, infrastructure-wide changes will affect a particular service worse than others
We reserve a weekly maintenance window
Least well understood at this time
Patterns for… traffic management
Pattern: Application stack
User traffic dispatching
  o Sticky TCP traffic to Apache httpd frontends based on perceived health
  o Cookie-based route from httpd to Tomcat, with the ability to select a node
  o Neither of these fails over session information well
We’re considering a design pattern where we combine the httpd+Tomcat stack and do full NAT dispatching so that we can get more change flexibility
Compare other architectures
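Cookie-based routing, and why failover loses the session, can be illustrated with a small sketch. It assumes the common Tomcat convention in which the session ID ends with `.<route>` (the node's `jvmRoute`); the slides do not show the actual cookie format, and the node names below are hypothetical.

```python
def pick_node(session_id, healthy_nodes):
    """Return (node, session_preserved) for a request.

    Assumes the route is the text after the last '.' in the session ID.
    """
    route = None
    if session_id and "." in session_id:
        route = session_id.rpartition(".")[2]
    if route in healthy_nodes:
        return route, True            # sticky: same node, session intact
    candidates = sorted(healthy_nodes)
    if not candidates:
        raise RuntimeError("no healthy application nodes")
    # Without session replication, the new node knows nothing about the user.
    return candidates[0], False
```

For example, `pick_node("ABC123.app2", {"app1", "app2"})` keeps the user on `app2`, while the same request with `app2` out of service lands on `app1` with `session_preserved` set to `False`.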
Current cluster layout
Current cluster layout as two sites
Site-local dispatching
Combining more of the stack
Pattern: Resource clustering
Database failover is now automatic with Oracle and JDBC
File tier still doesn't do failover in any nice way
Application+web tier no longer has complex dependencies (all state for a user now lives on a single server)
Split presence across two sites for database (Data Guard), file storage (EMC Celerra), and app/web tier (VMware)
Patterns for… monitoring and logging
Pattern: System health checks
Overall:
  o Fully synthetic login to Sakai
  o Cluster checks on Apache and Tomcat (more than X out of Y servers in the cluster in a bad state)
  o Wget?
Individual server checks for web, app, db tiers
  o Database connection pool
  o Clock, SNMP, ping, disk
  o Java processes, Apache configtest
  o AJP and web response time and status codes
  o Replication health, available storage growth
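The "more than X out of Y servers in a bad state" cluster check reduces to a threshold over per-server results. A minimal sketch, in which the server names and the way individual health results are gathered are assumptions:

```python
def cluster_alert(states, max_bad):
    """Decide whether a cluster-level alert should fire.

    states maps server name -> True (healthy) or False (bad).
    Returns (alert, bad_servers); alert is True when more than max_bad are bad.
    """
    bad = sorted(name for name, healthy in states.items() if not healthy)
    return len(bad) > max_bad, bad
```

With `max_bad=1`, a single failed Tomcat node is tolerated and left to the individual server checks, while two simultaneous failures raise the cluster-level alert.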
Pattern: Interventions
Fully automated functional test that authenticates and requests some course sites
Response time is as important as success or failure
We’re hesitant to automatically restart application nodes, since session replication isn’t available; this would be a major interruption to our users
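Treating response time as being as important as success or failure suggests a probe with three outcomes rather than two. The sketch below is an assumption-laden illustration: the step callables (log in, request a course site) are hypothetical, the 5-second threshold is arbitrary, and the probe only reports a state. It deliberately does not restart anything, in line with the hesitation described above.

```python
import time


def run_probe(steps, slow_after=5.0, clock=time.monotonic):
    """Run (name, callable) steps in order and classify the result.

    Each callable returns True on success. States: OK, SLOW, or FAIL.
    """
    start = clock()
    for name, step in steps:
        try:
            ok = step()
        except Exception:
            ok = False
        if not ok:
            return {"state": "FAIL", "step": name, "elapsed": clock() - start}
    elapsed = clock() - start
    # A run that succeeds but takes too long is reported separately.
    state = "SLOW" if elapsed > slow_after else "OK"
    return {"state": state, "step": None, "elapsed": elapsed}
```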
Pattern: Collecting data
Collect the usual suspects: Sakai events, automatic (?) thread dumps to detect stuck processes, server-status results
Sakai health: a .jsp file that dumps many data points (JVM memory, ehcache stats, database pools, etc.)
For anything we can pull from the JVM or Sakai APIs, we’ll use that .jsp file and collectd
Pattern: Application responsiveness
Also known as "Get close to the user"
Bug reports are aggregated in a shared mailbox; we send daily/weekly/yearly reports with buckets for browser, user, course site, tool, stack trace hash, etc.
Redirect 4XX/5XX HTTP status codes to pages with explanations as much as possible
Set timeouts for long-running activities, to make sure traffic isn’t waiting forever
Watch for AJP errors from specific application servers
Summary of weekly Sakai bug reports for :

browser-id => count:
  Mac-Mozilla => 377
  Win-InternetExplorer => 356
  Win-Mozilla => 194
  UnknownBrowser => 33
  empty => 12
service-version => count:
  [r329] => 967
  empty => 8
user => count:
  atorres78 (Alina Torres) => 32
  lisareeve (Lisa Jacobs) => 26
  ziggy41 (Stefan Katz) => 15
  ngrosztenger (Nathalie Grosz-Tenger) => 14
  agabriel2450 (Gabriel Arguello) => 12
stack-trace-digest => count:
  41D7C94702B20B270953EBB00ECA9F5C1388A393 => 180
  DEB88C2307DA572C9C1EFE1E8E17828DC29A7C00 => 154
  A600DAE1792C82B1472C9980EED8938E5F39B4F0 =>
  E2F E1BC1A24DF953560B7845BDCE =>
  CF39E8D34570CD3D79152B757A090AB6AB39F => 24
app-server => count:
  sakaiapp-prod06.osg.ufl.edu => 154
  sakaiapp-prod02.osg.ufl.edu => 146
  sakaiapp-prod04.osg.ufl.edu => 118
  sakaiapp-prod05.osg.ufl.edu => 96
  sakaiapp-prod03.osg.ufl.edu => 83
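A report in this shape can be produced by counting bug reports per bucket and hashing each stack trace so that identical traces collapse into one line. The sketch below is not the presenters' script. The record field names are hypothetical, and SHA-1 is an assumption based only on the 40-hex-character digests shown in the report.

```python
import hashlib
from collections import Counter


def stack_digest(trace):
    """Hash a stack trace so identical traces share a bucket (SHA-1 assumed)."""
    return hashlib.sha1(trace.encode("utf-8")).hexdigest().upper()


def summarize(reports, keys=("browser-id", "user", "app-server")):
    """Count reports per value for each key, plus per stack-trace digest."""
    buckets = {key: Counter() for key in keys}
    buckets["stack-trace-digest"] = Counter()
    for report in reports:
        for key in keys:
            # Missing or blank values are counted under 'empty'.
            buckets[key][report.get(key) or "empty"] += 1
        digest = stack_digest(report.get("stack-trace", ""))
        buckets["stack-trace-digest"][digest] += 1
    return buckets


def format_summary(buckets):
    """Render buckets as 'key => count:' sections, most frequent first."""
    lines = []
    for key, counter in buckets.items():
        lines.append(f"{key} => count:")
        for value, count in counter.most_common():
            lines.append(f"  {value} => {count}")
    return "\n".join(lines)
```

Grouping by digest rather than by raw trace keeps the report short and makes a sudden spike in one failure mode easy to spot.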
Patterns for… backup and recovery
Pattern: Backing up for DR
File tier is backed up every 4 hours, with a 2-week retention window
Database tier is backed up daily, with archived redo logs every 4 hours, and a 2-week retention window
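The interval and retention figures determine how many copies are on hand at any time and which ones are due for pruning. A small sketch of that arithmetic follows; the function names are illustrative and say nothing about how the actual backup products enforce retention.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(weeks=2)  # retention window from the slide


def copies_in_window(interval, retention=RETENTION):
    """Number of backups retained at steady state for a given interval."""
    return retention // interval


def expired(backup_times, now, retention=RETENTION):
    """Return the backup timestamps that fall outside the retention window."""
    cutoff = now - retention
    return [t for t in backup_times if t < cutoff]
```

With a 4-hour interval, `copies_in_window(timedelta(hours=4))` gives 84 file-tier copies (14 days × 6 per day); the daily database backups give 14.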
Pattern: Backing up user data
We hope this will come from application-specific operations to back up and restore (and delete!) user-specific data
You can't do a full restore of your files and database every time a user deletes a site by accident
Strive for reasonable windows of retention (e.g. hardware, software, application-level data)
This is supposedly coming in Sakai 2.x
Pattern: Multi-site replication
Database and file tier are both replicated to a second site
File tier is also redundant internally; some manual intervention is still required there
Pattern: Bringing production to test
We use ‘snapshot standby’ in Oracle RDBMS to take read-consistent copies of production for reloading test and development copies
We use rsync to copy over the file storage tier
With our full set of build artifacts from earlier, we can always build a complete version of what's in prod
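The rsync step for the file tier might be wrapped as follows. This is a sketch under assumptions: the source and destination paths are hypothetical, the slides do not say which rsync options are used, and mirroring deletions with `--delete` is left off unless requested. The function only builds the argument list; running it would be a call such as `subprocess.run(cmd, check=True)`.

```python
def rsync_command(src, dest, delete=False, dry_run=True):
    """Build an rsync argument list for refreshing a test copy from production.

    -a preserves permissions, times, and directory structure.
    dry_run defaults to True so a refresh can be previewed first.
    """
    cmd = ["rsync", "-a"]
    if delete:
        cmd.append("--delete")    # also remove files that no longer exist in src
    if dry_run:
        cmd.append("--dry-run")   # report what would change without copying
    # A trailing slash on the source copies its contents, not the directory itself.
    cmd += [src.rstrip("/") + "/", dest]
    return cmd
```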
Thank you! Questions?