Automation, Virtualization, and Integration of a Digital Repository Server Architecture or How to Deploy Three Production DSpaces in One Night and Be Home for Dinner TAMU Libraries Digital Initiatives James Creel, Micah Cooper, Jeremy Huff TCDL 2015 Austin, TX
Talk Outline The OAK Trust digital repository Technical Debts The cost of customization The evolution of server architecture A Trio of Innovations Automation of deployments Virtualization of infrastructure Modularization of customizations Lessons Learned for IT in libraries
The OAK Trust Digital Repository A brief overview and history
OAK Trust A branded, customized DSpace instance hosted in-house at TAMU Libraries Launched as “The Texas A&M University Digital Repository” in 2005 with an eye toward archiving ETDs Rebranded with the launch of the OAK (Open Access to Knowledge) Fund, which underwrites TAMU researchers’ publication fees for open-access journals if they agree to submit their articles to the repository Has grown to host ~70,000 items, including articles, books, maps, and photography from diverse sources
OAK Trust - Hosting From inception to 2013, hosted on dedicated Solaris hardware. Database, assetstore, and SSO authentication all hosted on their own separate hardware.
OAK Trust Customizations Over half a dozen XMLUI themes: TAMU, Image Gallery, Primeros Libros, Periodicals, Geofolios, ESL, Capstone, Fanzine Extensive custom Java Expanding/collapsing Community/collection browser Links to collection handles from group listing on profile page Record context to keep page on login Metadata-tree browser within collection Export metadata from search results
Technical Debt Cumulative Costs of Customization
Manually Upgrading DSpace Preparations in Development Environment Compare old and new configuration files line by line Get a realistic duplicate of the production db mvn package, ant install Tweak configurations as needed Test basic things Themes look good, widgets work, search and browse work, webapps run ok This can be an iterative process instead of three steps; you might end up having to go back into development after problems become apparent in pre-production Configuration tweaks include server addresses and directory paths
Problems with the Development Deployment Configuration files are big, and the old config and new config must be compared line by line Java files reference each other’s contents in structurally and nominally particular ways A change to core code on which your customization depends requires that the customization be rewritten Coding to Java interfaces helps, but interfaces change too
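The line-by-line comparison of old and new configuration files can be partly mechanized. The following is a minimal sketch, assuming the files use simple `key = value` lines with `#` comments; it does not handle continuation lines, and the function names and sample keys are illustrative rather than anything from the talk.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def compare_configs(old_text, new_text):
    """Report keys that were removed, added, or changed between two configs."""
    old = parse_properties(old_text)
    new = parse_properties(new_text)
    shared = set(old) & set(new)
    return {
        "removed": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "changed": sorted(key for key in shared if old[key] != new[key]),
    }


if __name__ == "__main__":
    # Hypothetical sample content, not taken from a real deployment.
    old_cfg = "# old release\ndspace.dir = /dspace\nmail.server = smtp.example.org\n"
    new_cfg = "# new release\ndspace.dir = /opt/dspace\nhandle.prefix = 123456789\n"
    for kind, keys in compare_configs(old_cfg, new_cfg).items():
        print(kind, keys)
```

A report like this narrows the manual review to the keys that actually differ, though someone still has to decide what each difference means.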
Manually Upgrading DSpace Preparations in Pre-production Mount assetstore and log directory mvn package, ant install Tweak configurations and environment as needed Test more extensive things Authentication works, communication with other servers works Configuration tweaks include server addresses and directory paths Environment changes may include java version, tomcat version, build tools versions
Problems with the Pre-production Deployment Pre-production environment on a physically provisioned machine is rather different from your development one Surprises in the tweaks (e.g. “Oh, we need Java 1.7 not 1.6”) must be meticulously recorded in anticipation of the ultimate production deployment We typically develop on Macs, and historically deployed to Solaris.
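Surprises such as the Java 1.7 versus 1.6 mismatch can be recorded as an executable check instead of a note. Below is a minimal sketch, assuming each tool's version banner has already been captured as a string; the function names, the `required` mapping, and the sample banners are hypothetical.

```python
import re


def parse_version(text):
    """Extract the first dotted number in a string as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def check_environment(required, reported):
    """Return a list of mismatch messages; an empty list means all is well.

    A tool matches when its reported version starts with the required
    prefix, so requiring "1.7" accepts 1.7.0 but rejects 1.6.0.
    """
    problems = []
    for tool, wanted in required.items():
        banner = reported.get(tool)
        if banner is None:
            problems.append(f"{tool}: not installed or not reported")
            continue
        want = parse_version(wanted)
        have = parse_version(banner)
        if have[:len(want)] != want:
            found = ".".join(str(part) for part in have)
            problems.append(f"{tool}: need {wanted}, found {found}")
    return problems


if __name__ == "__main__":
    # Hypothetical requirements and banners for illustration only.
    required = {"java": "1.7", "tomcat": "7.0"}
    reported = {"java": 'java version "1.6.0_45"', "tomcat": "Apache Tomcat/7.0.54"}
    print(check_environment(required, reported))
```

Running the same check in development, pre-production, and production turns each discovered requirement into something that is verified rather than remembered.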
Manually Upgrading DSpace in Production Announce plans for downtime to customers and family members Mount assetstore and log directory mvn package, ant install Tweak configurations and environment as needed Test even more things Authentication still works, handle server works, statistics still showing up, all the webapps reply Configuration tweaks include server addresses and directory paths Environment changes may include java version, tomcat version, build tools versions
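The "all the webapps reply" check is a natural candidate for a small script. This is a minimal sketch using only the Python standard library; the base URL and the webapp paths passed in are assumptions chosen by the caller, and a reply with HTTP 200 is only a rough proxy for a working application.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no reply arrived."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def smoke_test(base_url, paths, fetch=http_status):
    """Map each webapp path to True if it replied with HTTP 200."""
    results = {}
    for path in paths:
        url = base_url.rstrip("/") + "/" + path.lstrip("/")
        results[path] = fetch(url) == 200
    return results
```

A call such as `smoke_test("https://repository.example.edu", ["/xmlui", "/oai"])` (hypothetical host and paths) returns a dictionary that can be reviewed or asserted on after each deployment. The `fetch` parameter exists so the logic can be exercised without a network.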
Problems with the Production Deployment The expanded to-do list for the pre-production deployment may be lengthy – the team is then expected to perform the procedure identically on the production box with minimal downtime Production environment on a physically provisioned machine is always at least a little different from the pre-production one
Summary of Problems Rewriting features to work after changes in the stock code base Hardware and software environment differences Reproducing an extensive, detailed process perfectly, by hand, late into the evening
Three Remedies to the Deployment Problems Relieving the burden of technical debt Modularization of Code Virtualization of Infrastructure Automation of Deployment
Modularization of code Problem: Rewriting custom features to work after changes in the stock code base Solution: separate out customizations and cleanly integrate them with core code Solution in context: DSpace modules, modularizing XSL, pull requests to DuraSpace
Modularization of code DSpace Modules: Since DSpace 3.x, customizations to core DSpace are possible by overriding core files with custom files placed in a modules directory. Adding your own customization need not disturb the core code base.
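The override principle can be illustrated in a few lines. This is a conceptual sketch only, not DSpace's actual build mechanism: files are modeled as a mapping from relative path to content, and the paths shown are hypothetical.

```python
def build_overlay(core_files, module_files):
    """Merge two path-to-content mappings, letting module files win.

    Neither input is modified, which mirrors the point of the slide:
    customizations sit beside the core code rather than inside it.
    """
    merged = dict(core_files)
    merged.update(module_files)
    return merged


def overridden_paths(core_files, module_files):
    """List the core paths that a module file replaces."""
    return sorted(set(core_files) & set(module_files))


if __name__ == "__main__":
    # Hypothetical file names for illustration.
    core = {"themes/base/page.xsl": "stock page", "themes/base/item.xsl": "stock item"}
    modules = {"themes/base/item.xsl": "custom item"}
    print(build_overlay(core, modules))
    print(overridden_paths(core, modules))
```

Because the custom files are kept apart, an upgrade replaces the core mapping and only the overridden paths need to be re-examined.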
Modularization of code Modularizing XSL: We have continued in this principle of hierarchical modularization by putting empty placeholders in stock XSL, enabling extension in sub-themes. This has led to an extreme reduction in redundant code, with some files being reduced by more than 90%. [Figure: BEFORE / AFTER comparison]
Modularization of code Pull requests to DuraSpace: Technical debt can be further reduced by adopting the open source mindset of developing for the larger community first, as opposed to an institutionally centric approach If a custom feature is integrated into the core code, it need not be locally rewritten when upgrading
Virtualization of Infrastructure Problem: Server environments are inevitably unique and idiosyncratic on physically provisioned hardware for development, pre-production, and production Solution: Deploy virtual machines with standardized environments, abstracting away hardware concerns Solution in context: OpenStack, VMware, Vagrant
Virtualization of Infrastructure VMware: a framework for the creation and management of completely virtualized sets of hardware. Vagrant: lightweight, reproducible and portable virtual development environment.
Automation of Deployment Problem: People make mistakes when forced to execute detailed procedures in a hurry, and it’s stressful anyway! Solution: Script the deployment so it is programmatically identical with each execution Solution in context: Chef
Automation of Deployment Chef: “Infrastructure as Code” – Chef is a framework for the scripted automation of application deployment. It is: Versionable Testable Repeatable
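The talk uses Chef for this; the idea of a deployment that is programmatically identical on every run can also be shown in plain Python. The following is a minimal sketch, not a Chef recipe: the step list reuses the `mvn package` and `ant install` commands named in the manual procedure, while the function names, the `runner` parameter, and the omission of working directories are simplifying assumptions.

```python
import subprocess

# Steps run in this fixed order on every deployment.
DEPLOY_STEPS = [
    ("build", ["mvn", "package"]),
    ("install", ["ant", "install"]),
]


def run_step(command):
    """Run one command and raise CalledProcessError if it fails."""
    subprocess.run(command, check=True)


def deploy(steps=DEPLOY_STEPS, runner=run_step, dry_run=False):
    """Execute each step in order, stopping at the first failure.

    Returns the names of the steps that completed. With dry_run=True the
    commands are printed instead of executed.
    """
    completed = []
    for name, command in steps:
        if dry_run:
            print("would run:", " ".join(command))
        else:
            runner(command)
        completed.append(name)
    return completed


if __name__ == "__main__":
    deploy(dry_run=True)
```

Because the steps live in a file, they can be kept under version control and rehearsed in pre-production before the same file is run against production.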
Amusing Anecdotes and Takeaways
Big Technical Changes are Expensive Implementing a virtual infrastructure and automating deployment is a huge cultural and technical shift. Many stakeholders have to buy in to the long-term investment. Lots of work has to be done before any benefit is realized.
Production deployments of DSpace at TAMU are now fast! The work is nearly all front-loaded The production deployment is a “one-click” process, undertaken with a higher degree of certainty
Thanks for coming! Any questions?