HathiTrust Research Center Architecture Data subsystem.

1 HathiTrust Research Center Architecture Data subsystem

2 Architecture overview (diagram)
Components shown: agent framework and agent instances; page/volume tree (file system); authoritative volume store (Cassandra); replicated volume stores; Solr indexes; SEASR analytics service; Meandre orchestration; web portal; desktop SEASR client; task deployment; WSO2 registry (services, collections, data capsule images); WSO2 Enterprise Service Bus; programmatic access (e.g., Bamboo); CILogon (NCSA); access control (e.g., Grouper); non-consumptive data capsules; FutureGrid; NCSA local resources; NCSA HPC resources; Penguin on Demand; HathiTrust corpus at the University of Michigan, pulled via rsync.

5 Solr quick introduction
Lucene is a high-performance, full-featured text search engine library
Solr is a web service front end to Lucene
An index consists of documents, and each document consists of fields, which are name/value pairs
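To make the document/field model concrete, here is a minimal sketch of documents as name/value fields and a toy inverted index over one field. The document IDs, field names, and contents are made up for illustration, and Lucene's real index structures are far more elaborate than this.

```python
from collections import defaultdict

# Hypothetical documents: each one is a set of name/value fields.
documents = [
    {"id": "vol-1", "title": "War and Peace", "ocr": "the war began"},
    {"id": "vol-2", "title": "Gardening", "ocr": "peace in the garden"},
]


def build_inverted_index(docs, field):
    """Map each lowercased token of `field` to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc in docs:
        for token in doc[field].lower().split():
            index[token].add(doc["id"])
    return index


ocr_index = build_inverted_index(documents, "ocr")
print(sorted(ocr_index["war"]))
```

Looking up a token in `ocr_index` returns the IDs of the documents whose `ocr` field contains it, which is the basic operation behind a term query.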

6 HTRC Solr
Has both bibliographic information and full-text OCR scans
– 29 fields
– Volume ID, title, author, several reference IDs (ISBN, ISSN, call number, etc.), and full text
Basic search: term, wildcard, fuzzy, phrase, and range queries
– Example: “OCR: war” searches for documents containing the word “war” in the text
Term vectors are enabled to get the frequency and offsets of each word
– Occurrences
– Position and offset
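A term query and a term vector request could be expressed as Solr request URLs along the following lines. This is only a sketch: the base URL, the field names `ocr` and `id`, and the `/tvrh` handler path are assumptions (the term vector handler has to be configured in `solrconfig.xml`), not details taken from the HTRC deployment. The `q`, `rows`, `wt`, and `tv.*` parameters are standard Solr request parameters.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # hypothetical endpoint


def select_url(field, term, rows=10):
    """Build a term query URL such as q=ocr:war against the select handler."""
    params = {"q": f"{field}:{term}", "rows": rows, "wt": "json"}
    return f"{SOLR_BASE}/select?{urlencode(params)}"


def term_vector_url(volume_id, field):
    """Build a term vector request for one volume (assumes a /tvrh handler exists)."""
    params = {
        "q": f"id:{volume_id}",   # "id" as the volume ID field is an assumption
        "tv": "true",
        "tv.fl": field,
        "tv.tf": "true",          # term frequency (occurrences)
        "tv.positions": "true",
        "tv.offsets": "true",
        "wt": "json",
    }
    return f"{SOLR_BASE}/tvrh?{urlencode(params)}"


print(select_url("ocr", "war"))
print(term_vector_url("inu.320001", "ocr"))
```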

7 Filtered Term Vector
The default term vector is massive
– On the order of 5 MB per volume
– Extremely slow response for multiple volumes
We extended Solr to filter out unwanted words, which significantly improves response speed
– Reduced term vector size to the order of 80 KB per volume
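The filtering idea can be illustrated outside Solr with a plain dictionary. The sketch below is not the HTRC extension itself (which runs inside Solr); the term vector layout and the list of unwanted words are assumptions chosen for the example.

```python
def filter_term_vector(term_vector, unwanted):
    """Return a copy of the term vector without the unwanted terms."""
    return {
        term: info
        for term, info in term_vector.items()
        if term.lower() not in unwanted
    }


# Hypothetical term vector: term -> frequency and character offsets.
term_vector = {
    "the": {"tf": 3, "offsets": [(0, 3), (20, 23), (41, 44)]},
    "war": {"tf": 1, "offsets": [(4, 7)]},
    "of": {"tf": 2, "offsets": [(8, 10), (30, 32)]},
}
unwanted_words = {"the", "of", "and"}  # assumed stop list

filtered = filter_term_vector(term_vector, unwanted_words)
print(filtered)
```

Because very frequent words carry the longest position and offset lists, dropping them shrinks the response far more than their share of the vocabulary would suggest.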

9 Ingest Procedure
Use rsync to pull file system data from the HathiTrust (HT) main collection
Too many small text files...
Parse structural metadata (METS)
– Page ordering and page checksums (with verification); some metadata is stored in the NoSQL store
Analyze delta logs to push incremental changes to the NoSQL store
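The METS parsing and checksum verification step might look roughly like this. It is a sketch under assumptions: the sample METS fragment is hand-written for the example, pages are taken in document order rather than from a structural map, and only MD5 checksums are handled. The actual HathiTrust METS layout may differ.

```python
import hashlib
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def page_checksums(mets_xml):
    """Return [(file name, MD5 hex digest)] for each MD5-checksummed file, in document order."""
    root = ET.fromstring(mets_xml)
    pages = []
    for file_el in root.iterfind(".//mets:file", NS):
        if file_el.get("CHECKSUMTYPE") != "MD5":
            continue
        locat = file_el.find("mets:FLocat", NS)
        if locat is None:
            continue
        pages.append((locat.get(XLINK_HREF), file_el.get("CHECKSUM")))
    return pages


def verify_page(content, expected_md5):
    """Check page bytes against the checksum recorded in the METS file."""
    return hashlib.md5(content).hexdigest() == expected_md5.lower()


# Hand-written sample; the checksum is computed here so the example is consistent.
page_bytes = b"What's up doc?"
sample_mets = f"""<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                             xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="ocr">
      <mets:file ID="OCR00000001" CHECKSUMTYPE="MD5"
                 CHECKSUM="{hashlib.md5(page_bytes).hexdigest()}">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM" xlink:href="00000001.txt"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""

pages = page_checksums(sample_mets)
print(pages[0][0], verify_page(page_bytes, pages[0][1]))
```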

10 HTRC Text Corpora Ingest Workflow (diagram)
HathiTrust (remote): rsync root containing bib metadata and collection namespaces (1, 2, …), each with a pairtree_root pairtree
HathiTrust Research Center (local): mirrors the same layout
Steps shown: rsync the root pairtree; rsync the split pairtree list; rsync the rest in parallel using the split tree list; use the delta logs to push modified volume contents from the pairtree to the Cassandra NoSQL repository; update the collections list
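Splitting the pairtree list for parallel transfer can be sketched as a round-robin partition. The paths and the worker count below are hypothetical, and the sketch only prepares the work lists; each sublist could then, for example, be written to a file and handed to a separate rsync process via its `--files-from` option.

```python
def split_tree_list(paths, n_workers):
    """Partition the path list round-robin into n_workers sublists."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return [paths[i::n_workers] for i in range(n_workers)]


# Hypothetical pairtree directories under one collection namespace.
tree_list = [
    "pairtree_root/32/00/01",
    "pairtree_root/32/00/02",
    "pairtree_root/32/00/03",
    "pairtree_root/32/00/04",
    "pairtree_root/32/00/05",
]

chunks = split_tree_list(tree_list, 2)
for worker, chunk in enumerate(chunks):
    print(worker, chunk)
```

Round-robin assignment keeps the sublists within one entry of each other in length, although it does not balance by the number or size of files under each directory.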

12 NoSQL Repository
Uses Cassandra as the storage space for our text collections and related metadata
– Aggregates small texts
Allows us to manage flexible schemas
Key-value based column store
Offers good scalability, redundancy, and performance

13 Cassandra Schema
Each row represents a volume
– Row key is the volume ID
– Each row contains many columns
– First column contains metadata attributes about the volume
– Each subsequent column family is a page, key is page ID
– Page-specific columns contain page contents and metadata about the page

Key: (volume ID)
Inu.320001
    metadata          copyright: public; page count: 16
    Inu.320001/001    content: What’s up doc?; size: 12; MD5: 12345f
    Inu.320001/xxx    content: Rabbits; size: 7; MD5: aabbcc
Inu.320002
    metadata          copyright: In-copyright; page count: 2406
    Inu.320002/001    content: 2b|!2b; size: 6; MD5: 7effdd
    Inu.320002/xxx    content: A question; size: 10; MD5: deadbeef
…
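The row layout can be modeled as nested dictionaries to show how the access primitives map onto it. This is an in-memory illustration only, not Cassandra client code; the function names are made up, and the sample values are the ones from the slide.

```python
# Row key (volume ID) -> column name -> attributes, mirroring the slide.
volumes = {
    "Inu.320001": {
        "metadata": {"copyright": "public", "page count": 16},
        "Inu.320001/001": {"content": "What’s up doc?", "size": 12, "MD5": "12345f"},
        "Inu.320001/xxx": {"content": "Rabbits", "size": 7, "MD5": "aabbcc"},
    },
    "Inu.320002": {
        "metadata": {"copyright": "In-copyright", "page count": 2406},
        "Inu.320002/001": {"content": "2b|!2b", "size": 6, "MD5": "7effdd"},
        "Inu.320002/xxx": {"content": "A question", "size": 10, "MD5": "deadbeef"},
    },
}


def volume_metadata(store, volume_id):
    """Fetch the volume-level metadata column of one row."""
    return store[volume_id]["metadata"]


def page_content(store, volume_id, page_id):
    """Fetch the content of a single page."""
    return store[volume_id][page_id]["content"]


def volume_pages(store, volume_id):
    """Fetch every page column of a volume, skipping the metadata column."""
    return {k: v for k, v in store[volume_id].items() if k != "metadata"}


print(volume_metadata(volumes, "Inu.320002")["page count"])
print(page_content(volumes, "Inu.320001", "Inu.320001/xxx"))
```

Each lookup needs only the row key plus, at most, one column name, which is why a single row per volume serves volume-level and page-level reads alike.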

14 Cassandra Schema
Pros
– Works well for all access primitives
– Well-organized metadata with no repetition
– Volume-level versioning could follow a similar schema, but the version number needs to be concatenated to the volume ID for historical versions
Cons
– Subcolumn families cannot be indexed
– Extra metadata is picked up even when only page contents are needed
– Historical versions of volumes must be stored as deltas; a naïve translation of the above format to historical versioning would have a high cost in space
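The versioning points can be sketched as a versioned row key plus a page-level delta between two versions of a volume. Both the `@v` key separator and the delta layout are assumptions made for illustration; the slides do not specify either.

```python
def versioned_key(volume_id, version):
    """Concatenate the version number to the volume ID (separator is hypothetical)."""
    return f"{volume_id}@v{version}"


def page_delta(old_pages, new_pages):
    """Keep only pages that were added or whose MD5 changed, plus removed page IDs."""
    changed = {
        page_id: page
        for page_id, page in new_pages.items()
        if page_id not in old_pages or old_pages[page_id]["MD5"] != page["MD5"]
    }
    removed = sorted(set(old_pages) - set(new_pages))
    return {"changed": changed, "removed": removed}


v1 = {
    "Inu.320001/001": {"content": "What’s up doc?", "MD5": "12345f"},
    "Inu.320001/xxx": {"content": "Rabbits", "MD5": "aabbcc"},
}
v2 = {
    "Inu.320001/001": {"content": "What’s up doc?", "MD5": "12345f"},
    "Inu.320001/xxx": {"content": "Hares", "MD5": "ddeeff"},  # made-up new checksum
}

delta = page_delta(v1, v2)
print(versioned_key("Inu.320001", 2), sorted(delta["changed"]))
```

Storing only `delta` under the versioned key avoids duplicating unchanged pages, which is the space cost the naive full-copy approach would incur.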

