Collection and Data Overview Jeremy York Stacy Kowalczyk.

Presentation transcript:

Collection and Data Overview Jeremy York Stacy Kowalczyk

HATHITRUST A Shared Digital Repository HathiTrust Data Overview September 10, 2012 Jeremy York Project Librarian, HathiTrust

Content and Metadata

Outline Content and Metadata – Data formats Repository Organization Data availability – Availability mechanisms – Rights and agreements

Content Books and journals – Pilots around images, audio, born-digital Digitization sources – Google (96.8%, 10,162,104) – Internet Archive (2.9%, 301,972) – Local (0.3%, 31,840)
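As a quick check on the source breakdown above, the percentages can be recomputed from the volume counts. This is a minimal sketch that uses only the three figures on the slide; the variable names are illustrative.

```python
# Volume counts by digitization source, as listed on the slide.
counts = {
    "Google": 10_162_104,
    "Internet Archive": 301_972,
    "Local": 31_840,
}

total = sum(counts.values())

# Share of each source as a percentage, rounded to one decimal place.
shares = {source: round(100 * n / total, 1) for source, n in counts.items()}

for source, pct in shares.items():
    print(f"{source}: {counts[source]:,} volumes ({pct}%)")
print(f"Total: {total:,} volumes")
```

The three counts sum to 10,495,916 volumes, and the rounded shares match the slide's 96.8%, 2.9%, and 0.3%.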

Content Sources

Content Package (diagram): images, text, bib data, Source METS, HT METS, Zip

Metadata Bibliographic Structural Rights Administrative (preservation) Holdings

Repository Organization

Source Bibliographic Data Content Package Indiana Michigan Bib Data Data Management Rights Data Storage Access Ingest Catalog Full-text Search PageTurner APIs Collections Holdings Data Datasets

File System images bib data bib data Source METS text HT METS../uc1/pairtree_root/b3/54/34/86/b b zip b mets.xml Example ids: wu mdp uc2.ark:/1390/t miua.aaj
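The file-system layout above follows the Pairtree convention: the part of a volume id after the namespace is cleaned and split into two-character directory names. The sketch below illustrates the idea under stated assumptions. It applies only the three single-character substitutions of the Pairtree convention and skips the hex-encoding of rarer characters. The `.zip` and `.mets.xml` file names are inferred from the listing on the slide. The ids and the `xx` namespace are hypothetical.

```python
def clean_id(local_id: str) -> str:
    """Apply the Pairtree single-character substitutions.

    Simplified: the full convention also hex-encodes a few rarer
    characters, which this sketch does not handle.
    """
    return local_id.replace("/", "=").replace(":", "+").replace(".", ",")


def pairtree_path(volume_id: str) -> str:
    """Map 'namespace.local_id' to its object directory."""
    namespace, local_id = volume_id.split(".", 1)
    cleaned = clean_id(local_id)
    # Split the cleaned id into two-character directory names.
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([namespace, "pairtree_root", *pairs, cleaned])


def package_files(volume_id: str) -> tuple[str, str]:
    """Return the assumed paths of the zip package and the METS file."""
    directory = pairtree_path(volume_id)
    name = directory.rsplit("/", 1)[1]
    return f"{directory}/{name}.zip", f"{directory}/{name}.mets.xml"


# Hypothetical ids, for illustration only.
print(pairtree_path("xx.ab123456"))
print(pairtree_path("xx.ark:/99999/x1"))
print(package_files("xx.ab123456"))
```

Short directory names keep any single directory from holding millions of entries, which is the main reason for the convention.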

Data Availability

APIs Data API – Zip package – Single page images or OCR – Volume and rights metadata (XML) Bib API (JSON) – Volume and rights metadata – MarcXML

Data Feeds OAI – MarcXML – Dublin Core Hathifiles – Tab-delimited inventory files – Contain Identifiers Limited bibliographic information Rights, language, gov docs status information
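A tab-delimited inventory file like the Hathifiles described above can be read with a standard CSV parser. The sketch below is illustrative only. The column names, their order, and the sample rows are hypothetical and do not reproduce the real file layout, which should be taken from the HathiTrust documentation.

```python
import csv
import io

# Hypothetical column names; real inventory files define their own layout.
COLUMNS = ["volume_id", "rights", "title", "language", "us_gov_doc"]

# Hypothetical sample rows standing in for a downloaded file.
SAMPLE = (
    "xx.0001\tpd\tAn Example Title\teng\t0\n"
    "xx.0002\tic\tAnother Example\tger\t0\n"
    "xx.0003\tpd\tA Government Report\teng\t1\n"
)


def read_inventory(stream):
    """Parse tab-delimited rows into dictionaries keyed by column name."""
    reader = csv.DictReader(
        stream,
        fieldnames=COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    return list(reader)


rows = read_inventory(io.StringIO(SAMPLE))

# Select volumes by rights status and by government-document flag.
public_domain = [r["volume_id"] for r in rows if r["rights"] == "pd"]
gov_docs = [r["volume_id"] for r in rows if r["us_gov_doc"] == "1"]

print(public_domain)
print(gov_docs)
```

For a real file, `io.StringIO(SAMPLE)` would be replaced by an opened file handle.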

Datasets Content Bibliographic data Content organization

Content Distribution

HathiTrust Research Collection Overview Stacy Kowalczyk

The HTRC Collection Public Domain Materials of the HathiTrust – 2,592,097 Volumes – Gigabytes 2.3 TB in raw OCR’d text 3.7 TB of managed OCR’d text 1.85 TB Solr index – Monthly Updates And irregular data ‘take down’ requests

Exploring the Collection Publication Data – Date of publication – Country – Publisher Language Topical Coverage Authors

Publication Dates 2,562,283 Bib records with pub dates

Country of Publication – 244 different countries of publication – 2,578,341 bib records – 400,000 records have more than one country of publication – The top 11 countries accounted for nearly 90% – 229 countries accounted for 6% – Unknown country indicated 5%

Country of Publication

Topical Coverage Call numbers – 335,446 unique call numbers – 691,131 bib records Topic Strings – 589,428 unique subject headings – 1,948,999 bib records – 2,315,070 occurrences

Call Number Distribution

Standard Numbers SuDocs – 117,095 unique SuDoc numbers – 259,718 bib records ISBN – 23,765 ISBN numbers – 34,855 bib records ISSN – 8,658 unique ISSN numbers – 234,554 bib records OCLC numbers – 434,589 unique OCLC numbers – 1,112,499 bib records LCCN – 432,563 unique LCCNs – 1,104,696 bib records

Authors 849,753 unique author strings 2,410,788 bibliographic records Organized into subcategories – US governmental agencies – US state and local governments – Foreign country and city governments – Companies – Associations/societies – Academic Institutions, Libraries, Museums – Individual Authors

Authors

Collection Access Known item – Title – Author – Standard number Key word access – All words in OCR’d text – All words in bibliographic data Sparsely populated data

To Learn More Sessions tomorrow Data in Detail – Jeremy York and J. Stephen Downie – 9:30 am Main Lobby/Atrium – 1 pm Main Lobby/Atrium Building Collections and Analyzing Data – 1 pm Flex Lab 005

HATHITRUST A Shared Digital Repository HathiTrust Research Center Architecture Overview Robert H. McDonald Executive Committee-HathiTrust Research Center (HTRC) Deputy Director-Data to Insight Center Associate Dean-University Libraries Indiana University

Follow Along

HTRC Architecture Group Indiana University Beth Plale, Lead Yiming Sun Stacy Kowalczyk Aaron Todd Jiaan Zeng Guangchen Ruan Zong Peng Swati Nagde University of Illinois J. Stephen Downie Loretta Auvil Boris Capitanu Kirk Hess Harriett Green

Presentation Overview Considerations for Current Architecture Architecture - Use Case Methodology Technical Overview UnCamp Sessions for Further Review

Main Case – Data Near Computation (diagram): HT Volume Store (UM), HT Volume Store (IUPUI), HTRC Volume Store and Index (IUB), FutureGrid Computation Cloud, XSEDE Compute Allocation, IU Compute Allocation, UIUC Compute Allocation

Non-Consumptive Research Paradigm No action or set of actions on part of users, either acting alone or in cooperation with other users over duration of one or multiple sessions can result in sufficient information gathered from collection of copyrighted works to reassemble pages from collection. Definition disallows collusion between users, or accumulation of material over time. Differentiates human researcher from proxy which is not a user. Users are human beings.

Amicus Brief and NCR Jockers, Sag, Schultz – http://tinyurl.com/cy34hhr

Use Cases for Phase 1 Architecture Use Case #1 - Previously registered user submitted algorithm retrieved and run with results set Use Case #2 - HTRC applications/portal access (SEASR) Use Case #3 – Blacklight Lucene/Solr faceted access Use Case #4 - Direct programmatic access through Secure Data API (right now only for UnCamp and open content)

HTRC Current Infrastructure Servers – 14 production-level quad-core servers 16 – 32GB of memory 250 – 500GB of local disk each – 6-node Cassandra cluster for volume store – Ingest service and secure Data API access point Storage (IU University Infrastructure) – 13TB of 15,000 RPM SAS disk storage – Increase up to 17TB by end of 2012 – 500TB available in late year 2-year 3

Key Components of Architecture Portal Access Blacklight Access Agent Registry Secured Data API Access Solr Proxy

HTRC Architecture Data API access interface Portal Access Direct programmatic access (by programs running on HTRC machines) Security (OAuth2) Audit Cassandra cluster volume store Solr index Algorithms Result Sets Meandre Workflows Registry (WSO2) Compute resources Storage resources Agent Job Submission Collection building Collections Blacklight Solr Proxy

HTRC Architecture: Portal Access – HTRC Portal – Blacklight App – SEASR App – Blacklight

HTRC Architecture: Agent – HTRC Agent – Job Submission – Collection building

HTRC Architecture: HTRC Registry – Algorithms – Result Sets – Meandre Workflows – Registry (WSO2) – Collections

HTRC Architecture: Secure Data API RESTful Web Service – Language agnostic – Clients don’t have to deal with Cassandra Simple OAuth2 authentication HTTP over SSL Audits client access Protected behind firewall, accessible only to authorized IPs
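Because the Secure Data API is a RESTful service secured with OAuth2 over SSL, a client in any language only needs to send an HTTPS request carrying an access token. The sketch below builds such a request without sending it. The base URL, the `volumeIDs` parameter name, the `|` separator, and the token value are all hypothetical placeholders, not the actual HTRC interface. Only the `Authorization: Bearer` header follows standard OAuth2 bearer-token usage.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://data-api.example.org/volumes"


def build_volume_request(volume_ids, access_token):
    """Build (but do not send) an HTTPS GET request with a bearer token."""
    # Hypothetical parameter name and separator.
    query = urlencode({"volumeIDs": "|".join(volume_ids)})
    return Request(
        f"{BASE_URL}?{query}",
        headers={"Authorization": f"Bearer {access_token}"},
    )


req = build_volume_request(["xx.0001", "xx.0002"], "example-token")
print(req.get_method())
print(req.full_url)
print(req.get_header("Authorization"))
```

The client never talks to Cassandra directly. The service resolves the ids against the volume store and can record each call in its audit log.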

HTRC Architecture: RFS distributed file system – Solr proxy – Solr service

NoSQL Methodology Currently HT content is stored in a pair-tree file system convention (CDL) Moving these files into a NoSQL store like Cassandra enabled HTRC to aggregate them into larger sets of files for use in retrieval Use of Cassandra enabled HTRC to share content over a commodity-based Cassandra cluster of virtual machines Originally investigated use of MongoDB, CouchDB, HBase and Cassandra

HTRC Solr Proxy + Solr Service Preserves all query syntax of original Solr Prevents user from modification Hides the host machine and port number HTRC Solr is actually running on Creates audit log of requests Provides filtered term vector for words starting with user-specified letter Filters out “dangerous” requests to Solr Adds additional features to Solr – E.g. Term Vectors
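The filtering and auditing roles of the proxy can be sketched as a small request check. This is an illustration of the idea, not the HTRC implementation. Which handlers are allowed and which parameters count as "dangerous" are assumptions made here, and the function and variable names are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit

# Assumption: only the read-only search handler is exposed.
ALLOWED_HANDLERS = {"/select"}

# Assumption: parameters that could redirect the query or make Solr
# read remote or local content are rejected.
BLOCKED_PARAMS = {"qt", "shards", "stream.url", "stream.file", "stream.body"}


def check_request(path_and_query, audit_log):
    """Return True if the request may be forwarded, and log the decision."""
    parts = urlsplit(path_and_query)
    params = parse_qsl(parts.query, keep_blank_values=True)
    allowed = parts.path in ALLOWED_HANDLERS and not any(
        name in BLOCKED_PARAMS for name, _ in params
    )
    # Every request is recorded, whether or not it is forwarded.
    audit_log.append((path_and_query, "allowed" if allowed else "rejected"))
    return allowed


log = []
print(check_request("/select?q=title:whale&rows=10", log))
print(check_request("/update?commit=true", log))
print(check_request("/select?q=*:*&stream.url=http://example.org/x", log))
print(log)
```

An allowed request is forwarded with its query string unchanged, which preserves the original Solr query syntax while keeping the real host and port hidden behind the proxy.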

Non-Consumptive Research: Secure Data Capsule (diagram). The Data Capsules VM cluster provides a secure VM, which scholars reach over Remote Desktop or VNC. Scholars submit secure capsule map/reduce Data Capsule images to FutureGrid, then receive and review results. Components shown: FutureGrid Computation Cloud, HTRC Volume Store and Index.

Sessions for Further Review For more on Secure Data API – Tues Topic I/II (Yiming Sun) For more on Portal/SEASR – Tues Topic II (Loretta Auvil) For more on Portal/Blacklight – Tues Topic III (Stacy Kowalczyk)

Contact Information Robert H. McDonald – Chat – rhmcdonald on googletalk | skype – Twitter Hashtag: #HTRC12