Published by Amberlynn Taylor. Modified over 9 years ago.
1
HTRC
Loretta Auvil, Boris Capitanu
University of Illinois at Urbana-Champaign
lauvil@illinois.edu, capitanu@ncsa.uiuc.edu
2
Outline
HTRC Overview
HTRC Workset Creation
–Blacklight
HTRC Analysis
–Tagclouds
–Dunning Loglikelihood Comparison
–Entity Extraction
3
HTRC Overview
Provides research access to the public domain text of the HathiTrust Digital Library.
A collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library.
Helps researchers meet the technical challenges of working with massive amounts of digital text by developing software tools and cyberinfrastructure that enable advanced computational access to the growing body of digital materials.
Provides an infrastructure to search, collect, analyze, and visualize the full text of nearly 3 million public domain works. It is intended for nonprofit and educational researchers.
4
The HTRC Collection
Public Domain Materials of the HathiTrust (Sept 2012)
–2,592,097 volumes
–Size: 2.3 TB of raw OCR’d text, 3.7 TB of managed OCR’d text, 1.85 TB Solr index
–Monthly updates, plus irregular data ‘take down’ requests
5
HTRC Architecture (diagram)
Components labeled in the diagram:
–Portal access; direct programmatic access (by programs running on HTRC machines)
–Data API access interface; security (OAuth2); audit
–Cassandra cluster volume store; Solr index; Solr proxy
–Blacklight; collection building; collections
–Agent; job submission; algorithms; Meandre workflows; result sets
–Registry (WSO2); compute resources; storage resources
6
HTRC Links
FAQ
–http://wiki.htrc.illinois.edu/display/COM/HTRC+User+Getting+Started+FAQ
Portal
–http://htrc2.pti.indiana.edu
–The portal allows you to browse volume lists and algorithms, execute algorithms, and view results.
Blacklight
–http://htrc2.pti.indiana.edu/blacklight
–The Blacklight search interface allows you to search for volumes and create volume lists that can be used by algorithms. It provides a graphical interface to our Solr index.
7
Summary of Steps
Create a workset
Select algorithm
–Provide job name
–Select workset
–Adjust parameters
Run
View results
8
Workset Creation
Based on the Blacklight open source search interface
Provides a familiar search interface to the entire HTRC collection
Searches based on full text, author, title, or subject
Allows for creating and updating worksets
9
Blacklight: Workset Creation Search
10
Blacklight: Search Results Search based on full text, author, title, or subject Facets on left to further filter results
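Since Blacklight is a front end to the Solr index, a search with facet filters can be sketched as a Solr query string. This is only an illustration: `q`, `fq`, `rows`, and `wt` are standard Solr request parameters, but the base URL and the `author` field name below are hypothetical, and the slides do not show the actual HTRC schema or endpoint.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, facet_filters=None, rows=10):
    """Build a Solr select URL for a text search narrowed by facet filters.

    base_url and the facet field names are assumptions for illustration;
    they are not taken from the HTRC Solr configuration.
    """
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    for field, value in (facet_filters or {}).items():
        # Escape backslashes and quotes so the value stays one quoted phrase.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'{field}:"{escaped}"'))
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and field name, for illustration only.
url = build_solr_query(
    "http://localhost:8983/solr/select",
    "othello",
    facet_filters={"author": "Shakespeare"},
    rows=5,
)
print(url)
```

Each facet becomes its own `fq` parameter, which mirrors how selecting facets on the left of the results page narrows the result set step by step.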
11
Blacklight: Manage Worksets Modify an existing workset that you own Remove a workset
12
Blacklight: Create Workset Click on “Selected Items (X)” from top right menu Update existing workset or create a new one
13
HTRC Analysis Current list of Algorithms
14
Algorithm Info
Meandre_Topic_Modeling
<param name="input_collection" type="collection" required="true">
Please select a collection for analysis
The collection containing the volume IDs to be used for analysis.
15
Algorithm Execution run_HTRC_Meandre_Topic_Modeling.sh HTRC_Meandre_Topic_Modeling.properties $input_collection
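The command line above contains a `$input_collection` placeholder that matches the `input_collection` parameter name. How HTRC performs the substitution is not shown in the slides; the following is a minimal sketch that assumes a simple `$name` template, using Python's `string.Template`. The `build_command` helper is hypothetical.

```python
import shlex
from string import Template

# Command line as shown on the slide, with its $input_collection placeholder.
COMMAND_TEMPLATE = Template(
    "run_HTRC_Meandre_Topic_Modeling.sh "
    "HTRC_Meandre_Topic_Modeling.properties $input_collection"
)


def build_command(input_collection):
    """Fill in the placeholder and return an argument list.

    Hypothetical helper: the value is shell-quoted before substitution so a
    workset name containing spaces stays a single argument after splitting.
    """
    filled = COMMAND_TEMPLATE.substitute(
        input_collection=shlex.quote(input_collection)
    )
    return shlex.split(filled)


print(build_command("my workset"))
```

Returning a list rather than a string means the result could be passed to a process launcher without going through a shell.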
16
Algorithm Results
17
HTRC Algorithm UI
18
Meandre Flow Encapsulation and integration environment for tools and algorithms
19
Results
20
Tagcloud Results
21
Tagcloud with Cleaning in HTRC
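The slides do not list the cleaning steps the Meandre flow applies. As a rough sketch, assume cleaning means lowercasing, dropping very short tokens, and removing stop words before counting; the stop word list and the `min_length` parameter below are illustrative assumptions, not the HTRC defaults.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not the list HTRC uses).
STOP_WORDS = frozenset({"the", "and", "of", "to", "a", "in", "is", "it", "that"})


def clean_token_counts(text, stop_words=STOP_WORDS, min_length=3):
    """Count tokens after lowercasing, length filtering, and stop word removal."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(
        token
        for token in tokens
        if len(token) >= min_length and token not in stop_words
    )


counts = clean_token_counts("The Moor and the handkerchief; the Moor!")
print(counts.most_common(5))
```

The resulting counts are what a tagcloud would scale its word sizes by; without the cleaning step, words such as "the" would dominate the cloud.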
22
Tagcloud with Cleaning Results in HTRC
23
Examples on the Wall at UnCamp 2012
24
Dunning Loglikelihood Comparison
25
Dunning Loglikelihood Words more frequent in Othello than Shakespeare’s other tragedies
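To make the comparison concrete, here is a minimal sketch of the log-likelihood ratio (G²) for one word, computed from a 2×2 table of word counts versus all other tokens in an analysis text and a reference text. This is a generic implementation of the statistic, not the SEASR component itself; the sign convention (positive when the word is relatively more frequent in the analysis text) and the toy counts in the usage lines are assumptions for illustration.

```python
import math


def _term(observed, expected):
    """One cell's contribution, observed * ln(observed / expected); 0 if empty."""
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def dunning_g2(count_a, total_a, count_b, total_b):
    """Signed G^2 for a word seen count_a times in total_a tokens (analysis)
    and count_b times in total_b tokens (reference). Totals must be positive."""
    n = total_a + total_b
    word_total = count_a + count_b
    other_total = n - word_total
    # Expected cell counts if the word were equally likely in both texts.
    g2 = 2.0 * (
        _term(count_a, total_a * word_total / n)
        + _term(count_b, total_b * word_total / n)
        + _term(total_a - count_a, total_a * other_total / n)
        + _term(total_b - count_b, total_b * other_total / n)
    )
    # Sign marks direction: positive means over-represented in the analysis text.
    over_represented = count_a / total_a >= count_b / total_b
    return g2 if over_represented else -g2


# Made-up counts, for illustration only.
score = dunning_g2(count_a=30, total_a=1000, count_b=10, total_b=1000)
print(f"G2 = {score:.2f}")
```

Ranking words by this score and keeping the highest positive values gives the set of words "more frequent" in the analysis text, which is what the tagclouds display.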
26
Dunning in HTRC
27
Dunning Loglikelihood Comparison
Comparisons were run in SEASR on words (not lemmas), ignoring proper nouns. This is not an equal comparison, since it uses individual documents instead of the collection.
Tagclouds show words more common in Othello.
Panels: Othello – Shakespeare Tragedies; Othello – Hamlet; Othello – Macbeth
28
Dunning Loglikelihood Results in HTRC
Notice the similarity to the WordHoard results
29
Meandre Flow
30
Dunning Loglikelihood References
http://gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html
http://wordhoard.northwestern.edu/userman/analysis-comparewords.html
http://sappingattention.blogspot.com/2011/10/comparing-corpuses-by-word-use.html
http://bookworm.culturomics.org/
http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf
http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
31
Entity Extraction
32
Flow for Dates to Simile
33
Date Extraction to Simile Timeline
34
Date to Simile in HTRC
35
Date to Simile Results in HTRC
36
HTRC Notes
There is a one-hour time limit, counted from the time you log in
You must save to ensure results are retained through a restart of the service
37
Demonstration
HTRC Blacklight
HTRC Portal
–Tagclouds
–Date Entities to Simile
–Dunning Loglikelihood
38
Learning Exercises (1)
1. Sign up for HTRC Account
  A. Go to Portal: https://htrc2.pti.indiana.edu/HTRC-UI-Portal2/
  B. Click “SIGN-UP”
  C. Complete the form, click “Submit”
39
Learning Exercises (2)
2. Click on “Worksets”
  A. Click on the “Create a New Workset” button, which opens the Blacklight user interface
  B. Click “Login” and follow the steps
  C. Enter a search term and click the “Search” button
  D. Make some selections by clicking the box at the far right of each result
  E. Click “Selected Items..” from the menu bar at top right
  F. Click “Create/Update Workset” at the bottom left, in the gray box
  G. Complete the form and click the “Create” button
40
Learning Exercises (3)
3. Run Tagcloud Algorithm
  A. Go to Portal: https://htrc2.pti.indiana.edu/HTRC-UI-Portal2/
  B. Click on “Algorithms”
  C. Click on “Meandre_Tagcloud”
    1. Provide Job Name (required)
    2. Select a Workset (required)
    3. Click “Submit” button
  D. Once Job finishes, select Job Name
  E. View Results by clicking on “tagcloudtokencounts.html”
41
Learning Exercises (4)
4. Run Meandre_Tagcloud_with_Cleaning Algorithm
  A. Click on “Algorithms”
  B. Click on “Meandre_Tagcloud_with_Cleaning”
    1. Provide Job Name (required)
    2. Select a Workset (required)
    3. Adjust Additional Parameters (optional)
    4. Click “Submit” button
  C. Once Job finishes, select Job Name
  D. View Results by clicking on “tagcloudcleantokencounts.html”
42
Learning Exercises (5)
5. Run Meandre_Dunning_LogLikelihood_to_Tagcloud Algorithm
  A. Click on “Algorithms”
  B. Click on “Meandre_Dunning_LogLikelihood_to_Tagcloud”
    1. Provide Job Name (required)
    2. Select a Workset (required)
    3. Select a collection for analysis (required)
    4. Select a collection for reference (required)
    5. Adjust Additional Parameters (optional)
    6. Click “Submit” button
  C. Once Job finishes, select Job Name
  D. View Results by clicking on “dunningtagcloud.html”
43
Learning Exercises (6)
6. Run Meandre_OpenNLP_Entities_List Algorithm
  A. Click on “Algorithms”
  B. Click on “Meandre_OpenNLP_Entities_List”
    1. Provide Job Name (required)
    2. Select a Workset (required)
    3. Adjust Additional Parameters (optional)
      a. Provide a comma-separated list of entity types to be extracted
    4. Click “Submit” button
  C. Once Job finishes, select Job Name
  D. View Results by clicking on “named_entities_list.html”
44
Learning Exercises (7)
7. Run Meandre_OpenNLP_Date_Entities_To_Simile Algorithm
  A. Click on “Algorithms”
  B. Click on “Meandre_OpenNLP_Date_Entities_To_Simile”
    1. Provide Job Name (required)
    2. Select a Workset (required)
    3. Adjust Additional Parameters (optional)
      a. Provide a minimum year for filtering (default: 1800)
      b. Provide a maximum year for filtering (default: 1900)
    4. Click “Submit” button
  C. Once Job finishes, select Job Name
  D. View Results by clicking on “date_entity_simile.html”
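The minimum and maximum year parameters in this exercise can be illustrated with a small sketch. The HTRC flow uses OpenNLP to find date entities; the regular expression below is only a stand-in for that step, and treating both bounds as inclusive is an assumption. The defaults of 1800 and 1900 are the ones listed in the exercise.

```python
import re

# Stand-in for OpenNLP date extraction: match bare four-digit years 1000-2099.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def extract_years(text, min_year=1800, max_year=1900):
    """Return years found in text that fall within [min_year, max_year].

    Inclusive bounds are an assumption; the slides only give the defaults.
    """
    years = (int(match.group(1)) for match in YEAR_PATTERN.finditer(text))
    return [year for year in years if min_year <= year <= max_year]


sample = "Born in 1769, crowned in 1804, died in 1821; reprinted 1923."
print(extract_years(sample))  # [1804, 1821]
```

Narrowing the year range is a simple way to drop OCR noise and stray numbers that would otherwise clutter the Simile timeline.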
45
Attendee Project Plan
Study/Project Title
Team Members and their Affiliation
Procedural Outline of Study/Project
–Research Question/Purpose of Study
–Data Sources
–Analysis Tools
Activity Timeline or Milestones
Report or Project Outcome(s)
Ideas on what your team needs from SEASR staff to help you achieve your goal
Identify Research Question
46
Discussion Questions
What analytical tools or applications do you want to utilize with HT data?