Download presentation
Presentation is loading. Please wait.
Published byLiberty Gore Modified over 9 years ago
1
Information Analysis at Scale: HathiTrust Research Center Beth Plale Director, Data to Insight Center Co-Director, HathiTrust Research Center November 12, 2012
2
HathiTrust Objective: Contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge. Launched in October 2008 – University of Michigan – Indiana University Expanded to include content from – CIC Member Libraries – UC System Libraries – Internet Archive – Now includes more than 70 partner institutions 2
3
HathiTrust Research Center Mission: provision of computational access to comprehensive body of published works for scholarship and education. HTRC is founded as a joint venture between Indiana University and the University of Illinois Urbana-Champaign, aimed at solving the difficult challenges of increasing computational access to the public domain and copyrighted material in HathiTrust. 3
4
HTRC Long-term goals Support innovation in Cyberinfrastructure to deliver optimal access and use of HathiTrust corpus. Implement “Non-consumptive” research: a technical and intellectual challenge Identify and host existing data analysis, text mining and retrieval tools that are of interest to the community. Stimulate development of new analytical methods and tools. We hope that the scale of the HTRC will promote new levels of collaboration in tool development. 4
5
The HTRC Collection Public Domain Materials of the HathiTrust – 2,592,097 Volumes – Data Size 2.3 TB in raw OCR’d text 3.7 TB of managed OCR’d text 1.85 TB Solr Index – Monthly Updates And irregular data delete requests 5
6
Primary Functions To Support Data discovery – Searching – Creating and saving collections Service Discovery – Algorithms, analyses, and processes – Parameters and outputs Data storage and retrieval Computational resources APIs for controlled access to data and services 6
7
Experiment: Large Scale Data Analysis on XSEDE
8
Experimental Environment and Results Dataset 2,592,210 volumes, in total 2.1 TB, divided into 1024 partitions of 2GB each Computation platform XSEDE Blacklight, 1024-core of each 2.27 GHz, 8192 GB memory. Each core processes one partition Results Whole corpus word count finished in 1,454 seconds or 24.23 minutes
9
Computation Time Distribution
10
Word Frequency Distribution
11
HTRC UnCamp Sept 2012
12
DEMO 12
13
13
14
Issues in Developing Research Collections 14
15
15
16
16
17
17
18
18
19
19
20
20
21
Issues in Developing Research Collections Search Methods – Known item search via fielded bibliographic data Title Author Standard number – Sparsely populated data – Full text access All words in OCR’d text All words in bibliographic data 21
22
OCR Errors in Public Domain Corpus Kinds of errors – Hyphens – Rules and Dictionaries – Pagination – Multiple language texts – Page headers, footers, marginalia – Non-text elements – Tables and highly structured document elements – Imbedded images – Scientific notations – Book design decorative elements 22
23
Automatic Text Correction 99,427 human expert verified rules were applied 256,000 volumes – 70 million pages – 270 words per page – 19 billion words – 125 million words required correction Total percentage of words needing correction was 0.65%. Probability that any volume has one or more errors was 84.9%. 23
24
Automatic Correction Details Average hit rate per page = 1.898 Average hit rate per word = 0.006 Total Bytes Read = 121 GB Total Execution Time = 3.80 hrs 24
25
Problem Examples 25 Hyphens: 1.Combining words 2.Crossing pages 3.Page headers
26
Issues with Text Original A D 1817 57°GEO III C xxix 653 the faicl Commifiioners or Truftees or other Perfons as aforefaid to iffue their Preceptor Precepts to the Sheriff or Sheriffs or Bailiff or other proper Officer of the City Borough or County wherein fuch parochial or other Diftrift hall be fituate Rules-based Correction A D 1817 57°GEO III C xxix 653 the faicl Commifiioners or #trustees# or other #persons# as #aforesaid# to #issue# their Preceptor Precepts to the Sheriff or Sheriffs or Bailiff or other proper Officer of the City Borough or County wherein #such# parochial or other #district# hall be #situate# 26
27
HTRC Research Sandbox A small subset of the collection – Bibliographic data (in MarcXML) – Images – OCR text 250,000 non-Google digitized public domain volumes (or the 35,000 IU digitized public domain collection) Image training sets – Clean page images and text – Bad page images and text 27
28
Metadata as Research Object An IU researcher is interested in publishing trends from 1700 – 1899. – Publication dates – Publishers – Publication location A research team from UWashington is interested in ontology developments – Subject headings – Terms in notes and other fields 28
29
Possible Metadata Research Using the full text and images to identify – Content metadata Topics, keywords, and subjects Genre Publication data Additional characteristics for content identification and author demographics – Structural component identification Chapters (first page of each chapter) Index Preface Table of Contents Genre as structure 29
30
Possible Text Research OCR correction algorithms – Era based rules Dictionary Hyphenation – Page awareness Page numbers Running heading Cross page hyphenation issues New OCR algorithms – Era based rules – Identification of non-text elements – Page awareness 30
31
plale@cs.indiana.edu http://d2i.indiana.edu Visit http://www.hathitrust.org/htrc for more informationhttp://www.hathitrust.org/htrc 31
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.