HathiTrust Research Center Architecture Data subsystem.


Architecture overview (diagram). Components shown:
- Agent framework; agent instance
- Page/volume tree (file system)
- Authoritative volume store (Cassandra)
- Replicated volume stores
- Solr indexes
- HathiTrust corpus (rsync); University of Michigan
- SEASR analytics service; desktop SEASR client; Meandre orchestration
- Web portal
- Task deployment
- WSO2 registry: services, collections, data capsule images
- WSO2 Enterprise Service Bus
- Programmatic access (e.g., Bamboo)
- CI logon (NCSA)
- Access control (e.g., Grouper)
- Non-consumptive data capsules
- Compute resources: Future Grid, NCSA local resources, Penguin on Demand, NCSA HPC resources

Solr quick introduction
- Lucene is a high-performance, full-featured text search engine library.
- Solr is a web service front end to Lucene.
- An index consists of documents, and each document consists of fields, which are name/value pairs.

HTRC Solr
- Holds both bibliographic information and the full text from OCR scans.
  - 29 fields: volume ID, title, author, several reference IDs (ISBN, ISSN, call number, etc.), and full text
- Supports basic searches such as term, wildcard, fuzzy, phrase, and range queries.
  - Example: "OCR: war" finds documents containing the word "war" in the text
- Term vectors are enabled to return the frequency of each word and where it occurs.
  - Occurrences: position and offset
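As a rough illustration of the two kinds of request described above, the sketch below only builds the request URLs and does not contact a server. The base URL, the field names `ocr` and `id`, and the `/tvrh` handler path are assumptions (the handler has to be configured in Solr for term-vector requests to work); the `tv.*` parameters are those of Solr's Term Vector Component.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real HTRC Solr endpoint is not given in the slides.
SOLR_BASE = "http://localhost:8983/solr/htrc"


def build_search_url(field, term, rows=10):
    """Build a /select URL for a simple term query such as ocr:war."""
    params = {"q": f"{field}:{term}", "rows": rows, "wt": "json"}
    return f"{SOLR_BASE}/select?{urlencode(params)}"


def build_term_vector_url(volume_id, field="ocr"):
    """Build a term-vector request asking for frequency, positions and offsets."""
    params = {
        "q": f'id:"{volume_id}"',  # "id" as the volume ID field is an assumption
        "tv": "true",
        "tv.fl": field,
        "tv.tf": "true",
        "tv.positions": "true",
        "tv.offsets": "true",
        "wt": "json",
    }
    return f"{SOLR_BASE}/tvrh?{urlencode(params)}"


print(build_search_url("ocr", "war"))
print(build_term_vector_url("example-volume"))
```

The resulting URLs could then be fetched with any HTTP client once a real endpoint and schema are known.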

Filtered Term Vector
- The default term vector is massive.
  - On the order of 5 MB per volume
  - Extremely slow response for multiple volumes
- We extended Solr to filter out unwanted words, which significantly improves response speed.
  - Reduced term vector size to the order of 80 KB per volume
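The idea of filtering a term vector can be sketched in a few lines. This is a client-side illustration only, not the HTRC Solr extension itself, which ran inside Solr; the data layout (a dict from term to frequency and positions), the `unwanted` set, and the `min_tf` threshold are all assumptions made for the example.

```python
def filter_term_vector(term_vector, unwanted, min_tf=1):
    """Return a copy of term_vector without unwanted or rare terms.

    term_vector maps each term to a dict holding a "tf" count and a
    list of positions (a hypothetical layout chosen for this sketch).
    """
    return {
        term: info
        for term, info in term_vector.items()
        if term.lower() not in unwanted and info["tf"] >= min_tf
    }


vector = {
    "the": {"tf": 3, "positions": [0, 4, 9]},
    "war": {"tf": 2, "positions": [1, 7]},
    "of": {"tf": 1, "positions": [5]},
}
filtered = filter_term_vector(vector, unwanted={"the", "of"})
print(sorted(filtered))
```

Dropping very common words removes most of the position and offset data, which is where the bulk of a term vector's size comes from.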

Ingest Procedure
- Use rsync to pull file system data from the main HathiTrust (HT) collection.
  - Too many small text files...
- Parse structural metadata (METS): page ordering, page checksums (and their verification); some metadata is stored to NoSQL.
- Analyze delta logs to push incremental changes to the NoSQL store.
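The METS step can be sketched as follows, assuming a much-simplified METS fragment in which each `mets:file` carries `SEQ` and `CHECKSUM` attributes and an `mets:FLocat` child with the file name. Real HathiTrust METS files are more elaborate, and the file names and page texts here are invented for the example.

```python
import hashlib
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def page_checksums(mets_xml):
    """Return (sequence, file name, expected MD5) tuples in page order."""
    root = ET.fromstring(mets_xml)
    pages = []
    for file_el in root.iterfind(".//mets:file", NS):
        href = file_el.find("mets:FLocat", NS).get(XLINK_HREF)
        pages.append((int(file_el.get("SEQ")), href, file_el.get("CHECKSUM")))
    return sorted(pages)


def verify_page(content, expected_md5):
    """Check page bytes against the MD5 recorded in the METS file."""
    return hashlib.md5(content).hexdigest() == expected_md5


# Invented sample data: two pages listed out of order.
PAGE_ONE, PAGE_TWO = b"page one", b"page two"
SAMPLE_METS = f"""<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                             xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="ocr">
      <mets:file ID="OCR2" SEQ="00000002" CHECKSUMTYPE="MD5"
                 CHECKSUM="{hashlib.md5(PAGE_TWO).hexdigest()}">
        <mets:FLocat LOCTYPE="OTHER" xlink:href="00000002.txt"/>
      </mets:file>
      <mets:file ID="OCR1" SEQ="00000001" CHECKSUMTYPE="MD5"
                 CHECKSUM="{hashlib.md5(PAGE_ONE).hexdigest()}">
        <mets:FLocat LOCTYPE="OTHER" xlink:href="00000001.txt"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""

pages = page_checksums(SAMPLE_METS)
print([name for _, name, _ in pages])
print(verify_page(PAGE_ONE, pages[0][2]))
```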

HTRC Text Corpora Ingest Workflow (diagram)
- HathiTrust (remote): bib metadata; collection namespace 1, collection namespace 2, …; pairtree_root; pairtree
- HathiTrust Research Center (local): bib metadata; collection namespace 1, collection namespace 2, …; pairtree_root; pairtree; split pairtree list; delta logs; Cassandra NoSQL repository
- Steps labelled in the diagram:
  - Rsync root
  - Rsync split pairtree list
  - Parallel rsync of the rest using the split tree list
  - Push modified volume contents from pairtree to NoSQL
  - Update collections list
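One way to picture the "split list, then parallel rsync" step is the sketch below. It only divides a list of paths among workers and builds the rsync command lines without running them. The worker count, the source and destination strings, and the list-file names are hypothetical; `-a` and `--files-from` are standard rsync options.

```python
def split_list(paths, n_workers):
    """Distribute paths round-robin into at most n_workers sub-lists."""
    chunks = [[] for _ in range(n_workers)]
    for i, path in enumerate(paths):
        chunks[i % n_workers].append(path)
    return [chunk for chunk in chunks if chunk]


def rsync_command(list_file, source, dest):
    """Build (but do not run) an rsync call that transfers the listed files."""
    return ["rsync", "-a", f"--files-from={list_file}", source, dest]


# Hypothetical paths, source and destination.
paths = ["ns1/pairtree_root/aa", "ns1/pairtree_root/ab", "ns2/pairtree_root/aa"]
chunks = split_list(paths, n_workers=2)
commands = [
    rsync_command(f"split_{i}.txt", "remote::corpus/", "/local/corpus/")
    for i in range(len(chunks))
]
print(chunks)
print(commands[0])
```

Each command could then be launched as its own process, one per sub-list, so that transfers of many small files overlap.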

NoSQL Repository
- Uses Cassandra as the storage space for our text collections and related metadata.
  - Aggregates small texts
- Allows us to manage flexible schemas.
- Key-value based column store.
- Offers good scalability, redundancy, and performance.

Cassandra Schema
- Each row represents a volume.
  - Row key is the volume ID
  - Each row contains many columns
  - First column contains metadata attributes about the volume
  - Each subsequent column family is a page; its key is the page ID
  - Page-specific columns contain the page contents and metadata about the page

Key: (volume ID)

Row key | Column   | Attributes
Inu     | metadata | copyright: public; page count: 16
        | Inu /001 | content: What's up doc?; size: 12; MD f
        | Inu /xxx | content: Rabbits; size: 7; MD5: aabbcc
Inu     | metadata | copyright: In-copyright; page count: 2406
        | Inu /001 | content: 2b|!2b; size: 6; MD5: 7effdd
        | Inu /xxx | content: A question; size: 10; MD5: deadbeef
…
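The row layout can be modelled with nested dictionaries to make the access pattern concrete. This is an in-memory sketch, not Cassandra client code. The volume IDs `vol-1` and `vol-2` and the page keys are placeholders, since the IDs are cut off in the transcript; the other values are taken from the slide, and the MD5 that is truncated there is left out.

```python
# One dict entry per row (volume); one nested dict per column group.
store = {
    "vol-1": {  # placeholder volume ID
        "metadata": {"copyright": "public", "page_count": 16},
        "001": {"content": "What's up doc?", "size": 12},  # MD5 truncated on slide
        "xxx": {"content": "Rabbits", "size": 7, "md5": "aabbcc"},
    },
    "vol-2": {  # placeholder volume ID
        "metadata": {"copyright": "in-copyright", "page_count": 2406},
        "001": {"content": "2b|!2b", "size": 6, "md5": "7effdd"},
        "xxx": {"content": "A question", "size": 10, "md5": "deadbeef"},
    },
}


def volume_metadata(rows, volume_id):
    """Fetch only the volume-level metadata column."""
    return rows[volume_id]["metadata"]


def page_contents(rows, volume_id):
    """Return page texts in page-ID order, skipping the metadata column."""
    row = rows[volume_id]
    return [row[key]["content"] for key in sorted(row) if key != "metadata"]


print(volume_metadata(store, "vol-2"))
print(page_contents(store, "vol-1"))
```

Note that `page_contents` still touches each page's `size` and `md5` entries while reading the row, which mirrors the drawback that page metadata travels with page contents.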

Cassandra Schema
Pros
- Works well for all access primitives
- Well-organized metadata, no repetitions
- Volume-level versioning could follow a similar schema, but the version number needs to be concatenated to the volume ID for historical versions
Cons
- Subcolumn families cannot be indexed
- Extra metadata are picked up even when only page contents are needed
- Historical versions of volumes must be stored as deltas; a naïve translation of the above format to historical versioning would have a high cost in space
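The versioning remarks can be illustrated with a small sketch: a row key formed by concatenating the volume ID and a version number, and a delta that records only the pages that changed. The `@v` separator, the delta layout, and the sample pages are assumptions made for the example, not the HTRC design.

```python
def versioned_key(volume_id, version):
    """Concatenate volume ID and version number into one row key."""
    return f"{volume_id}@v{version}"  # "@v" separator is hypothetical


def page_delta(old_pages, new_pages):
    """Record only the pages that changed, were added, or were removed."""
    changed = {
        pid: text for pid, text in new_pages.items() if old_pages.get(pid) != text
    }
    removed = sorted(pid for pid in old_pages if pid not in new_pages)
    return {"changed": changed, "removed": removed}


def apply_delta(old_pages, delta):
    """Rebuild the newer version from the older one plus its delta."""
    pages = {
        pid: text for pid, text in old_pages.items() if pid not in delta["removed"]
    }
    pages.update(delta["changed"])
    return pages


v1 = {"001": "What's up doc?", "002": "Rabbits", "003": "Carrots"}
v2 = {"001": "What's up, doc?", "002": "Rabbits"}
delta = page_delta(v1, v2)
print(versioned_key("vol-1", 2), delta)
```

Storing `delta` under the versioned key avoids writing a second full copy of the unchanged pages, which is the space cost the slide warns about.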