Where’s My Data? Using MetriDoc to Manage Data Integration Headaches
Joe Zucca, Tommy Barker
Sponsored by
The Problem
- The request seems simple, but the solution is complex
- We are generally asked "who did / used x?", which leads to other questions
  - Where’s the data?
  - What’s the grain of the answer?
- So how do we answer these questions?
  - If lucky, run a script / query against a database and generate a report
  - If not lucky, build an application to answer the question
- This is what MetriDoc is built for
Current Solution - Datafarm
Datafarm = Crontab + Perl + CGI = Spaghetti
[Diagram: Datafarm at the center, connected to data sources (Voyager, Blackboard, COUNTER, DLA logs, Gate Count, Ezproxy, Penn Community, Borrow Direct) and to applications (App 1, App 2, App 3, ... App n)]
Datafarm Shortcomings
- Maintainability issues
- Not shareable
- Not reusable
MetriDoc = Datafarm 2.0
- As our system grew, we began creating MetriDoc to address Datafarm’s problems
  - Needed a scheduler more sophisticated than cron
  - Needed languages more maintainable than Perl
  - Needed integration tools to simplify data gathering across disparate systems
- We built prototypes and services to help us evaluate technologies
- Received a grant from IMLS to speed up development
- Hired another programmer
MetriDoc Philosophy
- Keep it simple
  - Sometimes a script is all you need
  - Ease of use is more important than performance
- Don’t reinvent the wheel
- 100% open source
- Shareable data
MetriDoc – How it Works
- MetriDoc’s core is built around database schemas
- A MetriDoc implementation consists of loading tables and normalized tables
  - Loading tables prime the repository; the user is responsible for populating them
  - Normalized tables are built from the data in the loading tables; MetriDoc takes care of this
- Conforming to similar schemas opens up interesting possibilities
  - Sharing data is easy
  - Sharing a single repository is easy (think Amazon Web Services)
  - Collaboration is easier
- From a user’s perspective
  - MetriDoc has tools to get your data into the loading tables
  - Ultimately you just need to get it in there, so you can use whatever tools you like
  - Use the MetriDoc tools to manage your integration needs
  - They are useful for getting, transforming / resolving, moving and loading data
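The loading-table / normalized-table split can be sketched in a few lines. This is only an illustration, not MetriDoc's actual schema or code: the table names `ezproxy_loading` and `ezproxy_normalized`, their columns, and the function `build_repository` are all hypothetical, and Python with SQLite stands in for the Groovy tooling and production database.

```python
import sqlite3
from collections import Counter
from urllib.parse import urlsplit


def build_repository(raw_rows):
    """Fill a loading table, then derive a normalized table from it."""
    db = sqlite3.connect(":memory:")

    # Loading table (hypothetical schema): the user fills this however they like.
    db.execute("CREATE TABLE ezproxy_loading (patron_id TEXT, url TEXT)")
    db.executemany("INSERT INTO ezproxy_loading VALUES (?, ?)", raw_rows)

    # Normalization step: clean the patron id, reduce each URL to its host,
    # and count hits per (patron, host) pair.
    counts = Counter()
    for patron_id, url in db.execute("SELECT patron_id, url FROM ezproxy_loading"):
        host = urlsplit(url).hostname or ""
        counts[(patron_id.strip().lower(), host)] += 1

    # Normalized table (hypothetical schema): built only from the loading table.
    db.execute(
        "CREATE TABLE ezproxy_normalized (patron_id TEXT, host TEXT, hits INTEGER)"
    )
    db.executemany(
        "INSERT INTO ezproxy_normalized VALUES (?, ?, ?)",
        [(patron, host, hits) for (patron, host), hits in sorted(counts.items())],
    )
    db.commit()
    return db
```

Because the normalized table is derived entirely from the loading table, two sites that agree on the loading schema could in principle share the normalization step unchanged.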
MetriDoc – Core Technologies
- JVM
  - Java is used for infrastructure
  - Groovy is the primary language
- Master Scheduler
  - Essentially the brains of MetriDoc
  - Using Hudson for now
- Integration Tooling
  - Tooling built on top of Apache Camel
  - Helps move data from one place to another
  - Really helpful for batch processing
- Resolution / Transformation Tools
  - Patron anonymization, text normalization, resource id to title resolution, etc.
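To make the resolution / transformation idea concrete, here is a minimal sketch of two such steps: patron anonymization and text normalization. It does not reproduce MetriDoc's own tools. The function names, the choice of a keyed HMAC-SHA256 hash, and the 16-character truncation are assumptions made for illustration, and Python is used only for brevity.

```python
import hashlib
import hmac
import re


def anonymize_patron(patron_id: str, secret: bytes) -> str:
    """Replace a patron id with a stable pseudonym (assumed approach: keyed hash).

    The same id and secret always give the same pseudonym, so usage can still
    be counted per patron without storing the real id.
    """
    cleaned = patron_id.strip().lower().encode("utf-8")
    return hmac.new(secret, cleaned, hashlib.sha256).hexdigest()[:16]


def normalize_text(text: str) -> str:
    """Collapse runs of whitespace and lower-case the text for easier matching."""
    return re.sub(r"\s+", " ", text).strip().lower()
```

A keyed hash is used rather than a plain hash so that someone without the secret cannot recompute pseudonyms from a list of known ids; whether that level of protection is sufficient depends on local policy.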
The MetriDoc Solution
MetriDoc = Hudson + Java / Groovy + Apache Camel = Integration Nirvana
Step 1: Fill the loading tables
[Diagram: Hudson runs the jobs Load Ezproxy, Load Patron Info and Load Counter, which move data from Voyager, Ezproxy and COUNTER into the Loading Tables]
Loading Tables
Sample raw Ezproxy log line (fields separated by ||; some values were lost in extraction):
||Philadelphia||PA||United States||Default+datasets+documents+pwp+vanwert||jsmith||[19/Jan/2011:00:01: ]||GET|| 10X%2329%23266%232&_version=1&md5=8e47306a7f3a7da8a6fe7b521a7a149b||302||0|| cine&volume=29&issue=2&date= &atitle=An+adjuvanted+pandemic+influenza+H1N1+vaccine+provides+early+and+long+term+protection+in+health+care+workers.&spage=266&sid=EBSCO:aph&pid=Madhun%2c+Abdullah+S.%3bAkselsen%2c+Per+Espen%3bSjursen%2c+Haakon%3bPedersen%2c+Gabriel%3bSvindland%2c+Signe%3bN%c3%b8stbakken%2c+Jane+Kristin%3bNilsen%2c+Mona%3bMohn%2c+Kristin%3bJul-Larsen%2c+%c3%85sne%3bSmith%2c+Ingrid%3bMajor%2c+Diane%3bWood%2c+John%3bCox%2c+Rebecca+J aph||Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv: ) Gecko/ Firefox/3.0.5 (.NET CLR )]||Re07OuEIyQo8X6w||UPennLibrary=AAAAAUkQ36AAAFTaAwO7Ag==; __utma= ; __utmc= ; __utmz= utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=upenn; WRUID=0; __utmv= |1=User-Type=Current%20Students=1,; __utma= ; __utmc= ; __utmz= utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=upenn%20blackboard; hp=/vanpelt/; __utma= ; __utmc= ; __utmz= utmcsr=library.upenn.edu|utmccn=(referral)|utmcmd=referral|utmcct=/biomed/; proxySessionID= ; ezproxy=Re07OuEIyQo8X6w; ARPT=MWPYIPS108CWYL; EHost2=sid=49d81d dbd-b94f- __utmb= ; __utmb= ; __utmb= ; ASPSESSIONIDCCAQQCRC=AHJAGJMDDPNIIMLMHBCPCHBL
Loading table columns: Patron_id, Patron_ip, url, Ref_url, Proxy_id, Ezproxy_id
Sample row (remaining values lost in extraction): jsmith, http://www…
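A log line like the one above has to be split into fields before it can go into a loading table. The sketch below shows one way to do that. The field order in `FIELD_NAMES`, the helper names, and the shortened sample line (including its complete timestamp and its `example.org` URL) are assumptions made for illustration. They are not taken from MetriDoc or from a documented Ezproxy log format, which is configurable per site.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed field order, guessed from the sample line; a real log format may differ.
FIELD_NAMES = [
    "city", "state", "country", "groups", "patron_id",
    "timestamp", "method", "url", "status", "bytes",
]


def parse_log_line(line: str) -> dict:
    """Split a '||'-delimited log line into named fields."""
    values = line.rstrip("\n").split("||")
    return dict(zip(FIELD_NAMES, values))


def article_title(url: str):
    """Pull the URL-decoded 'atitle' query parameter out of a URL, if present."""
    titles = parse_qs(urlsplit(url).query).get("atitle")
    return titles[0] if titles else None


# Made-up, shortened sample in the same style as the slide.
SAMPLE = (
    "Philadelphia||PA||United States||Default+datasets||jsmith||"
    "[19/Jan/2011:00:01:00]||GET||"
    "http://example.org/openurl?volume=29&issue=2"
    "&atitle=An+adjuvanted+pandemic+influenza+H1N1+vaccine&spage=266||302||0"
)

record = parse_log_line(SAMPLE)
title = article_title(record["url"])
```

Here `record["patron_id"]` holds the login name and `title` holds the decoded article title, which are the kinds of values a loading table row would carry.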
The MetriDoc Solution
MetriDoc = Hudson + Java / Groovy + Apache Camel = Integration Nirvana
Step 2: Populate the normalized tables
[Diagram: Hudson runs the jobs Normalize Ezproxy, Normalize Patron Info and Normalize Counter, which move data from the Loading Tables into the Repository]
Jenkins – Death to Cron
- Generally used for building software, but a fantastic cron replacement
- Can run arbitrary scripts locally and remotely
- Supports a master / slave distribution model seamlessly
- Can be managed entirely via REST
- Extensible
- Helps with job dependencies
- It is simple and free
- Active community with a huge collection of plugins
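As an illustration of the REST management point, the sketch below prepares a request that would trigger a job through Jenkins' remote API (`POST <base>/job/<name>/build`). The server URL, job name, user and token are placeholders, and the helper name is hypothetical. The request is only constructed here, not sent, and a given installation may impose extra requirements such as CSRF crumbs, depending on its version and security settings.

```python
import base64
from urllib.parse import quote
from urllib.request import Request


def build_trigger_request(base_url: str, job_name: str, user: str, api_token: str) -> Request:
    """Prepare (but do not send) a POST that asks Jenkins to start a job."""
    url = f"{base_url.rstrip('/')}/job/{quote(job_name, safe='')}/build"
    credentials = base64.b64encode(f"{user}:{api_token}".encode("utf-8")).decode("ascii")
    return Request(
        url,
        data=b"",  # empty body; the job is assumed to take no parameters
        method="POST",
        headers={"Authorization": f"Basic {credentials}"},
    )


# Placeholder values; nothing is contacted until the request is actually opened.
request = build_trigger_request(
    "http://jenkins.example.org/", "load-ezproxy", "joe", "not-a-real-token"
)
```

Passing `request` to `urllib.request.urlopen` would send it; a scheduler-side script could use the same pattern to chain jobs that Jenkins itself does not know are related.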
A Little Groovy
The Metridoc Job Framework
Metrics on the Cheap
Where we are…