LEADS-4-NDP Fellowship
LEADS Fellow: Sam Grabus, Drexel University, Metadata Research Center
LEADS Site: Temple University Library, Digital Scholarship Center
Mentor: Peter Logan

Indexing the Data Set of 19th Century Knowledge
Sam Grabus, Drexel University; Peter Logan, Temple University; Jane Greenberg

PROJECT GOALS
- Temple’s broad data science question: investigate how the specification of concepts changes over time across 4 historical editions of the Encyclopedia Britannica (1797-1911)
- Use automatic indexing to create descriptive metadata for individual entries that can be used for analysis across all 4 editions

APPROACH
- Identify encyclopedia entry terms that exist in all 4 editions of the Encyclopedia Britannica
- Automatically index entries with HIVE using contemporary LCSH and keyword extraction algorithms

ACCOMPLISHMENTS
- Data cleaning with R
- Intersected 4 lists of entry terms to determine which terms appear in all 4 editions of the encyclopedia; created TXT files for each entry
- Ran sample TXT files through HIVE to generate automatic indexing results
- Tested 3 keyword extraction algorithms: Kea, Maui, and RAKE
- Compared LCSH and Agrovoc vocabularies
- Identified challenges and next steps for optimizing RAKE algorithm parameters and adding historical controlled vocabularies to HIVE

AUTOMATIC INDEXING WITH HIVE
*Acknowledgements to Joan Boone

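The intersection step listed under ACCOMPLISHMENTS (finding the entry terms that appear in all 4 editions) can be sketched as follows. This is a minimal illustration in Python, not the project's R workflow: the function names, the normalization rule (uppercase, collapsed whitespace), the edition labels, and the sample terms are all assumptions made for the example.

```python
def normalize(term: str) -> str:
    """Uppercase a term and collapse whitespace so trivial variants match.

    This normalization rule is an assumption, not the project's cleaning step.
    """
    return " ".join(term.strip().upper().split())


def shared_terms(editions: dict[str, list[str]]) -> list[str]:
    """Return the sorted terms that occur in every edition's term list."""
    term_sets = [
        {normalize(t) for t in terms if t.strip()}
        for terms in editions.values()
    ]
    if not term_sets:
        return []
    return sorted(set.intersection(*term_sets))


# Hypothetical sample lists; real lists would come from the cleaned edition data.
editions = {
    "edition_a": ["Rum", "Raleigh (Sir Walter)", "Abacus"],
    "edition_b": ["rum ", "Raleigh (Sir  Walter)", "Zinc"],
    "edition_c": ["RUM", "Raleigh (Sir Walter)", "Abacus", "Zinc"],
    "edition_d": ["Rum", "raleigh (sir walter)", "Zinc"],
}

common = shared_terms(editions)
print(common)  # ['RALEIGH (SIR WALTER)', 'RUM']
```

Each shared term could then be written to its own TXT file per edition before indexing. How aggressively to normalize is a design choice: stricter matching misses spelling variants across editions, while looser matching risks merging distinct entries.
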
NEXT STEPS: MOVING ONWARDS WITH AN NEH GRANT
- Identify appropriate historical knowledge organization systems
  - Challenges of using contemporary LCSH for 19th century text, e.g., encyclopedia entries for Raleigh (Sir Walter), SIR (Information Retrieval System)
- Digitize one or more of the historical vocabularies into XML or SKOS, e.g., SKOS for “Rum” in LCSH
  - 1910-1914 LCSH
  - Universal Decimal Classification (UDC)
  - Dewey Decimal Classification (DDC)
- Relevancy testing to identify the most effective vocabularies for indexing these encyclopedia entries
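
To make the "digitize into XML or SKOS" step concrete, the sketch below builds a minimal SKOS concept in RDF/XML for the heading "Rum" using only the Python standard library. The concept and scheme URIs are hypothetical placeholders (example.org), not real LCSH identifiers, and the record carries only a preferred label and a scheme link; a real historical vocabulary entry would likely also need broader, narrower, and related terms.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def skos_concept(uri: str, pref_label: str, scheme_uri: str, lang: str = "en") -> str:
    """Serialize one skos:Concept with a preferred label as RDF/XML."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("skos", SKOS_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    concept = ET.SubElement(
        root, f"{{{SKOS_NS}}}Concept", {f"{{{RDF_NS}}}about": uri}
    )
    label = ET.SubElement(
        concept, f"{{{SKOS_NS}}}prefLabel", {f"{{{XML_NS}}}lang": lang}
    )
    label.text = pref_label
    # Link the concept to the vocabulary (concept scheme) it belongs to.
    ET.SubElement(
        concept, f"{{{SKOS_NS}}}inScheme", {f"{{{RDF_NS}}}resource": scheme_uri}
    )
    return ET.tostring(root, encoding="unicode")


# Hypothetical URIs used purely for illustration.
rum_xml = skos_concept(
    uri="http://example.org/historical-lcsh/rum",
    pref_label="Rum",
    scheme_uri="http://example.org/historical-lcsh",
)
print(rum_xml)
```

Generating records programmatically like this keeps the digitized vocabulary consistent in structure, so it could be loaded into an indexing tool alongside contemporary vocabularies for the relevancy comparison.
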