HTRC Loretta Auvil, Boris Capitanu University of Illinois at Urbana-Champaign


1 HTRC Loretta Auvil, Boris Capitanu University of Illinois at Urbana-Champaign lauvil@illinois.edu, capitanu@ncsa.uiuc.edu

2 Outline
HTRC Overview
HTRC Workset Creation
–BlackLight
HTRC Analysis
–Tagclouds
–Dunning Loglikelihood Comparison
–Entity Extraction

3 HTRC Overview
Provides research access to the public domain text of the HathiTrust Digital Library.
Collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library.
Helps researchers meet the technical challenges of working with massive amounts of digital text by developing cutting-edge software tools and cyberinfrastructure that enable advanced computational access to the growing body of digital materials.
Provides an infrastructure to search, collect, analyze, and visualize the full text of nearly 3 million public domain works. It is intended for nonprofit and educational researchers.

4 The HTRC Collection
Public Domain Materials of the HathiTrust (Sept 2012)
–2,592,097 Volumes
–Size: 2.3 TB of raw OCR’d text, 3.7 TB of managed OCR’d text, 1.85 TB Solr index
–Monthly updates, plus irregular data ‘take down’ requests

5 HTRC Architecture
[Architecture diagram. Component labels: Data API access interface; Portal Access; Direct programmatic access (by programs running on HTRC machines); Security (OAuth2); Audit; Cassandra cluster volume store; Solr index; Algorithms; Result Sets; Meandre Workflows; Registry (WSO2); Compute resources; Storage resources; Agent; Job Submission; Collection building; Collections; Blacklight; Solr Proxy]

6 HTRC Links
FAQ
–http://wiki.htrc.illinois.edu/display/COM/HTRC+User+Getting+Started+FAQ
Portal
–http://htrc2.pti.indiana.edu
–The portal allows you to browse volume lists and algorithms, execute algorithms, and view results.
Blacklight
–http://htrc2.pti.indiana.edu/blacklight
–The Blacklight search interface allows you to search for volumes and create volume lists that can be used by algorithms. It provides a GUI to our Solr index.

7 Summary of Steps
Create a workset
Select algorithm
–Provide job name
–Select workset
–Adjust parameters
Run
View results

8 Workset Creation
Based on the Blacklight open source search interface
Provides a familiar search interface to the entire HTRC collection
Searches based on full text, author, title, or subject
Allows for creating/updating of worksets

9 Blacklight: Workset Creation Search

10 Blacklight: Search Results
Search based on full text, author, title, or subject
Facets on left to further filter results

11 Blacklight: Manage Worksets
Modify an existing workset that you own
Remove a workset

12 Blacklight: Create Workset
Click on “Selected Items (X)” from top right menu
Update existing workset or create a new one

13 HTRC Analysis Current list of Algorithms

14 Algorithm Info Meandre_Topic_Modeling <param name="input_collection" type="collection" required="true"> Please select a collection for analysis The collection containing the volume ids to be used for analysis.

15 Algorithm Execution run_HTRC_Meandre_Topic_Modeling.sh HTRC_Meandre_Topic_Modeling.properties $input_collection
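
The command line on this slide appears to carry a placeholder, $input_collection, that matches the parameter name shown on the Algorithm Info slide. The slides do not show how HTRC fills it in. A minimal sketch, assuming plain string substitution; the function name `build_command` and the workset name used below are hypothetical:

```python
from string import Template

# Command template copied from the slide; $input_collection is the placeholder.
COMMAND = Template(
    "run_HTRC_Meandre_Topic_Modeling.sh "
    "HTRC_Meandre_Topic_Modeling.properties $input_collection"
)


def build_command(params):
    """Fill the placeholder from user-supplied parameters.

    Raises KeyError if a required parameter is missing, which mirrors
    the required="true" attribute on the slide.
    """
    return COMMAND.substitute(params)


if __name__ == "__main__":
    # "my_workset" is a made-up workset name for illustration only.
    print(build_command({"input_collection": "my_workset"}))
```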

16 Algorithm Results

17 HTRC Algorithm UI

18 Meandre Flow Encapsulation and integration environment for tools and algorithms

19 Results

20 Tagcloud Results

21 Tagcloud with Cleaning in HTRC

22 Tagcloud with Cleaning Results in HTRC

23 Examples on the Wall at UnCamp 2012

24 Dunning Loglikelihood Comparison

25 Dunning Loglikelihood Words more frequent in Othello than Shakespeare’s other tragedies

26 Dunning in HTRC

27 Dunning Loglikelihood Comparison
Comparisons were run in SEASR on words (not lemmas), ignoring proper nouns. Not an equal comparison: individual documents were used instead of the collection.
Tagclouds show words more common in Othello
Othello – Shakespeare Tragedies
Othello – Hamlet
Othello – MacBeth
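
To make the comparison concrete, here is a minimal sketch of a Dunning log-likelihood score for word counts in an analysis text versus a reference text. It uses the common two-term form of the statistic (the terms for "all other words" are left out) and is not the SEASR or HTRC implementation; the function names and the toy token lists are assumptions for illustration.

```python
import math
from collections import Counter


def dunning_g2(a, b, total_a, total_b):
    """Two-term log-likelihood score for one word.

    a, b: occurrences of the word in the analysis and reference texts.
    total_a, total_b: total token counts of the two texts.
    """
    # Expected counts if the word were equally frequent in both texts.
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2.0 * g2


def overused_words(analysis_tokens, reference_tokens, top_n=10):
    """Rank words that are relatively more frequent in the analysis text."""
    counts_a = Counter(analysis_tokens)
    counts_b = Counter(reference_tokens)
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    scored = []
    for word, a in counts_a.items():
        b = counts_b.get(word, 0)
        # Keep only words with a higher relative frequency in the analysis text.
        if a / total_a > b / total_b:
            scored.append((word, dunning_g2(a, b, total_a, total_b)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    analysis = "handkerchief jealous moor handkerchief love".split()
    reference = "love ghost king love crown".split()
    for word, score in overused_words(analysis, reference):
        print(word, round(score, 3))
```

The highest-scoring words are the ones a tagcloud of "words more common in Othello" would display most prominently.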

28 Dunning Loglikelihood Results in HTRC
Notice the similarity to the WordHoard results

29 Meandre Flow

30 Dunning Loglikelihood References
http://gautam.lis.illinois.edu/monkmiddleware/public/analytics/dunnings.html
http://wordhoard.northwestern.edu/userman/analysis-comparewords.html
http://sappingattention.blogspot.com/2011/10/comparing-corpuses-by-word-use.html
http://bookworm.culturomics.org/
http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf
http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

31 Entity Extraction

32 Flow for Dates to Simile

33 Date Extraction to Simile Timeline

34 Date to Simile in HTRC

35 Date to Simile Results in HTRC

36 HTRC Notes
Sessions have a one-hour time limit, counted from the time you log in
You must save to ensure results are retained through a restart of the service

37 Demonstration
HTRC Blacklight
HTRC Portal
–Tagclouds
–Date Entities to Simile
–Dunning Loglikelihood

38 Learning Exercises (1)
1. Sign up for HTRC Account
   A. Go to Portal: https://htrc2.pti.indiana.edu/HTRC-UI-Portal2/
   B. Click “SIGN-UP”
   C. Complete the form, click “Submit”

39 Learning Exercises (2)
2. Click on “Worksets”
   A. Click on the “Create a New Workset” button, which opens the Blacklight user interface
   B. Click “Login” and follow the steps
   C. Enter a search term and click the “Search” button
   D. Select volumes by checking the box at the far right of each result
   E. Click “Selected Items..” from the menu bar at top right
   F. Click “Create/Update Workset” at the bottom left, in the gray box
   G. Complete the form and click the “Create” button

40 Learning Exercises (3)
3. Run Tagcloud Algorithm
   A. Go to Portal: https://htrc2.pti.indiana.edu/HTRC-UI-Portal2/
   B. Click on “Algorithms”
   C. Click on “Meandre_Tagcloud”
      1. Provide Job Name (required)
      2. Select a Workset (required)
      3. Click “Submit” button
   D. Once Job finishes, select Job Name
   E. View Results by clicking on “tagcloudtokencounts.html”

41 Learning Exercises (4)
4. Run Meandre_Tagcloud_with_Cleaning Algorithm
   A. Click on “Algorithms”
   B. Click on “Meandre_Tagcloud_with_Cleaning”
      1. Provide Job Name (required)
      2. Select a Workset (required)
      3. Adjust Additional Parameters (optional)
      4. Click “Submit” button
   C. Once Job finishes, select Job Name
   D. View Results by clicking on “tagcloudcleantokencounts.html”
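
The slides do not say which cleaning steps Meandre_Tagcloud_with_Cleaning applies. As a rough illustration of what "token counts with cleaning" could mean, the sketch below lowercases the text, keeps alphabetic tokens only, and drops short words and stop words. The stop-word list, the `min_length` parameter, and the function name are all assumptions, not HTRC settings.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the actual cleaning rules are not given in the slides.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "is", "it"}


def clean_token_counts(text, min_length=3, top_n=5):
    """Count tokens after a simple, assumed cleaning pass."""
    # Lowercase the text and keep only runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop short tokens and stop words before counting.
    kept = [t for t in tokens if len(t) >= min_length and t not in STOP_WORDS]
    return Counter(kept).most_common(top_n)


if __name__ == "__main__":
    sample = "The Moor and the handkerchief: the Moor is jealous."
    print(clean_token_counts(sample))
```

The resulting (word, count) pairs are the kind of input a tagcloud renderer sizes its words by.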

42 Learning Exercises (5)
5. Run Meandre_Dunning_LogLikelihood_to_Tagcloud Algorithm
   A. Click on “Algorithms”
   B. Click on “Meandre_Dunning_LogLikelihood_to_Tagcloud”
      1. Provide Job Name (required)
      2. Select a Workset (required)
      3. Select a collection for analysis (required)
      4. Select a collection for reference (required)
      5. Adjust Additional Parameters (optional)
      6. Click “Submit” button
   C. Once Job finishes, select Job Name
   D. View Results by clicking on “dunningtagcloud.html”

43 Learning Exercises (6)
6. Run Meandre_OpenNLP_Entities_List Algorithm
   A. Click on “Algorithms”
   B. Click on “Meandre_OpenNLP_Entities_List”
      1. Provide Job Name (required)
      2. Select a Workset (required)
      3. Adjust Additional Parameters (optional)
         a. Provide a comma-separated list of entity types to be extracted
      4. Click “Submit” button
   C. Once Job finishes, select Job Name
   D. View Results by clicking on “named_entities_list.html”

44 Learning Exercises (7)
7. Run Meandre_OpenNLP_Date_Entities_To_Simile Algorithm
   A. Click on “Algorithms”
   B. Click on “Meandre_OpenNLP_Date_Entities_To_Simile”
      1. Provide Job Name (required)
      2. Select a Workset (required)
      3. Adjust Additional Parameters (optional)
         a. Provide a minimum year for filtering (default: 1800)
         b. Provide a maximum year for filtering (default: 1900)
      4. Click “Submit” button
   C. Once Job finishes, select Job Name
   D. View Results by clicking on “date_entity_simile.html”
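
The minimum and maximum year parameters can be pictured with a small sketch. The HTRC algorithm extracts date entities with OpenNLP; the version below swaps in a plain regular expression for four-digit years, so it only illustrates the filtering step with the default range of 1800 to 1900. The pattern and function name are assumptions.

```python
import re
from collections import Counter

# Simplified stand-in for date entity extraction: four-digit years 1000-2099.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def years_in_range(text, min_year=1800, max_year=1900):
    """Count four-digit years that fall within [min_year, max_year]."""
    years = (int(match) for match in YEAR_PATTERN.findall(text))
    return Counter(y for y in years if min_year <= y <= max_year)


if __name__ == "__main__":
    sample = "Born in 1812, moved in 1850 and again in 1850; died 1905."
    for year, count in sorted(years_in_range(sample).items()):
        print(year, count)
```

Years outside the range (1905 in the sample) are dropped before anything is placed on a timeline.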

45 Attendee Project Plan
Study/Project Title
Team Members and their Affiliation
Procedural Outline of Study/Project
–Research Question/Purpose of Study
–Data Sources
–Analysis Tools
Activity Timeline or Milestones
Report or Project Outcome(s)
Ideas on what your team needs from SEASR staff to help you achieve your goal.
Identify Research Question

46 Discussion Questions What analytical tools or applications do you want to utilize with HT data?

