HathiTrust Large Scale Search Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan www.hathitrust.org/blogs.

Slides:



Advertisements
Similar presentations
Digital Projects from Government Libraries Bernie Gloyn June 1, 2012 Statistics Canada Library Digitization Project.
Advertisements

Multidimensional Index Structures One dimensional index structures assume a single search key, and retrieve records that match a given search-key value.
Setting up repositories: Technical Requirements, Repository Software, Metadata & Workflow. Repository services Iryna Kuchma, eIFL Open Access program manager,
Universal Access to All Internet Archive: Non-Profit Library.
1 Use of Electronic Resources in Research Prof. Dr. Khalid Mahmood Department of Library & Information Science University of the Punjab.
Web Archives and Large-Scale Data: Preliminary Techniques for Facilitating Research Nicholas Woodward Latin American Network Information Center
Preservation of the Texas Agricultural Experiment Station Bulletin in the Digital Repository By Dr. Rob McGeachin Texas A&M University Libraries June,
Lucid Imagination, Inc. – 1 Big, Bigger Biggest Large scale issues: Phrase queries and common words OCR Tom Burton West.
Beyond the Google Book: the Future of the Digital Library Cory Snavely Library IT Core Services manager University of Michigan April 20, 2010.
HATHI TRUST A Shared Digital Repository Building A Future By Preserving Our Past The Preservation Infrastructure of HathiTrust Digital Library Jeremy York.
KAT HAGEDORN HATHITRUST SPECIAL PROJECTS COORDINATOR UNIVERSITY OF MICHIGAN LIBRARIES OCTOBER 9, 2009 Seamless Sharing: NYU, HathiTrust, ReCAP and the.
HathiTrust: Building the Universal Collection John Wilkin 18 May 2009.
Cory Snavely Library IT Core Services manager University of Michigan September 2010.
Building the Universal Library: The Promise and Challenges of HathiTrust John Wilkin 2 April 2009.
What is HathiTrust and How Can it Make a Difference? Sourcing and Scaling brought to the collective collection.
HathiTrust and Print Storage Building around a digital core.
HathiTrust: Building the Organization, Building Services Christenson, Burton-West, Chapman, Feeman, Karle- Zenith & Wilkin 4 May 2009.
HATHI TRUST A Shared Digital Repository HathiTrust, Collections, and Collaboration COLD 2011 Spring Meeting Jeremy York May 20, 2011.
KAT HAGEDORN HATHITRUST SPECIAL PROJECTS COORDINATOR UNIVERSITY OF MICHIGAN LIBRARIES OCTOBER 9, 2009 Seamless Sharing: NYU, HathiTrust, ReCAP and the.
Korean Rare Materials Digitization Projects April 2008 Hye Eun LEE Korea Research Institute for Library and Information (KRILI) The.
OCLC Online Computer Library Center Parallel Text Searching on a Beowulf Cluster using SRW Ralph LeVan OCLC Research.
Chapter 3: Utilization and Volume. 26 Chartbook 2000 Community hospital acute care admissions declined 15 percent between 1980 and 1994 and then began.
UKPMC and Dryad Dryad-UK meeting: 28 th April 2010 Robert Kiley, Head Digital Services, Wellcome Library
Mr. Noritada Otaki Deputy Librarian National Diet Library, Japan August 6, 2003 CDNL, Berlin Beginning of a New Era for the National Diet Library : Opening.
Mythbusters Debunking Common SharePoint Farm Misconceptions ITP361 Spencer Harbar.
How Big is an Exabyte? 5 Exabytes = 37,000 new libraries
Digital Collections: Storage and Access Jon Dunn Assistant Director for Technology IU Digital Library Program
HathiTrust Digital Library: Enrich Your Research and Scholarship Doreen Bradley Chris Powell University Library May 2011.
Introduction to Indexes Rui Zhang The University of Melbourne Aug 2006.
HathiTrust and the Ecology of Shared Collections Paul N. Courant 21 May 2009.
IMLS National Leadership Grant: CRMS World Bobby Glushko University of Michigan Copyright Office.
HATHI TRUST A Shared Digital Repository Digital Repositories for Preservation and Access Digital Directions 2013 Jeremy York July 22, 2013 Unless otherwise.
Information Analysis at Scale: HathiTrust Research Center Beth Plale Director, Data to Insight Center Co-Director, HathiTrust Research Center November.
October 24, 2006Merit Technical Staff Meeting1 The Google Project at the University of Michigan Perry Willett Head, Digital Library Production Service.
Digital archival storage for the University of Michigan Library collections.
DIGITIZATION OF LOCAL HISTORY COLLECTIONS IN PUBLIC LIBRARY “VLADISLAV PETKOVIC DIS” IN CHACHAK: DIGITIZATION OF THE NEWSPAPER “THE VOICE OF CHACHAK” Bogdan.
Multilingual Information Access in a Digital Library Vamshi Ambati, Rohini U, Pramod, N Balakrishnan and Raj Reddy International Institute of Information.
HathiTrust Large Scale Search: Scalability meets Usability Tom Burton-West Information Retrieval Programmer Digital Library Production Service University.
Intelligent Information Retrieval CS 336 Lisa Ballesteros Spring 2006.
Enhance legal retrieval applications with an automatically induced knowledge base Ka Kan Lo.
assumes basic arithmetic
Co-funded by the European Union under FP7-ICT Co-ordinated by aparsen.eu #APARSEN Storage Solutions The use case at the National Library of the.
Indexing Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.
Utilizing Mind-Maps for Information Retrieval and User Modelling Joeran Beel, Stefan Langer, Marcel Genzmehr, Bela Gipp 1www.docear.org Doc – The Academic.
Communications Technology 2104 Mercedes Lahey. Bit 1. bit=From a shortening of the words “binary digit” 2. the basic unit of information for computers.
HathiTrust – How To By Dr. Rob McGeachin 20 th Annual AgNIC Meeting May 7, 2015.
HathiTrust Digital Library. Overview ›Began in 2008 ›Large scale digital preservation repository ›Partnership of major research libraries ›Focus on both.
Mass digitisation? Astrid Verheusen Projectmanager Research & Development Division National library of the Netherlands LIBER-EBLIDA Workshop on Digitisation.
University of North Texas Libraries Building Search Systems for Digital Library Collections Mark E. Phillips Texas Conference on Digital Libraries May.
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Breana McCracken University of Illinois at Urbana-Champaign HathiTrust and Copyright Future Implications - Strong precedent for libraries to continue to.
Datasets of the KB Steven Claeyssens – 19 September 2013.
 CIKM  Implementation of Smoothing techniques on the GPU  Re running experiments using the wt2g collection  The Future.
Work performed pupil 8B class: Danil Kozlov Supervisor: Lepeshkina Natalia Valeryevna RussiaVoronezh.
Search Gotchas Sharon Richardson Joining Dots. Indexing Architecture There can be only one… …indexing server.
HATHI TRUST A Shared Digital Repository Use of PREMIS for Internet Archive AIPs September 22, 2010.
K-tree/forest: Efficient Indexes for Boolean Queries Rakesh M. Verma and Sanjiv Behl University of Houston
HathiTrust: Collaboration in Building the Universal Collection John Wilkin 1 October 2009.
Efficient Result-set Merging Across Thousands of Hosts Simulating an Internet-scale GIR application with the GOV2 Test Collection Christopher Fallen Arctic.
HathiTrust: Possibilities Metadata Working Group Cornell University Library March 21, 2014.
HathiTrust--a GovDocs Repository? Brian Vetruba, Catalog Librarian/Germanic Studies Librarian Washington University in St. Louis Leveraging.
HathiTrust Digital Library Interface and Services
Digital Video Library - Jacky Ma.
Text Indexing and Search
Building Search Systems for Digital Library Collections
Making the Migration Moving Legacy Collections into Hydra
Thanks to Bill Arms, Marti Hearst
Kolb Anton Spectrum, Belarus
HathiTrust And Its Research Center
digital archival storage
Presentation transcript:

HathiTrust Large Scale Search Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan October 7th 2010 w ww.hathi trust.org

HathiTrust HathiTrust is a shared digital repository 30+ member libraries Large Scale Search is one of many services built on top of the repository Currently about 6.5 million books 250 Terabytes –Preservation page images;jpeg 2000, tiff (244TB) –OCR and Metadata about (6TB) 3

Challenges Goal: Design a system for full-text search that will scale to 7 million -20 million volumes (at a reasonable cost.) Challenges: –Must scale to 20 million full-text volumes –Very long documents compared to most large-scale search applications –Multilingual collection (400+ languages) –OCR quality varies

Long Documents Average HathiTrust document is 700KB containing over 100,000 words. –Estimated size of 7 million Document collection is 4.5TB. Average HathiTrust document is about 38 times larger than the average document size of 18KB used in Large Research test collections CollectionSizeDocumentsAverage Doc size HathiTrust4.5 TB (projected)7 million700 KB TREC GOV TB25 million18 KB SPIRIT1 TB94 million10 KB NW1000G TB*100 million16 KB

Index Size, Caching, and Memory Our 6 million document index is between 3 and 4 terabytes. –About 450 GB per million documents Large index means disk I/O is bottleneck Tradeoff JVM vs OS memory –Solr uses OS memory (disk I/O caching) for caching of postings –Memory available for disk I/O caching has most impact on response time (assuming adequate cache warming) 6

Response Time Varies with Query Average: 673 Median: th : th : 7,504

Slowest 5% of queries

Standard index vs. CommonGrams Standard IndexCommon Grams WORD TOTAL OCCURRENCES IN CORPUS (MILLIONS) NUMBER OF DOCS (THOUSANDS) the2, of1, and literature4210 lives2194 generation2199 beat TOTAL4,176 TERM TOTAL OCCURRENCES IN CORPUS (MILLIONS) NUMBER OF DOCS (THOUSANDS) of-the generation the-lives literature-of lives-and and-literature the-beat TOTAL450

CommonGrams Comparison of Response time (ms) averagemedian90th99thslowest query Standard Index , ,595 Common Grams683712,2267,800

Thank You ! Tom Burton-West w ww.hathi trust.org