Mining Gazetteer Data from Digital Library Collections
David Smith
Perseus Project, Tufts University

18 July 2002, Perseus Project, JCDL
Corpus Preview

Preview:

What DLs can do for gazetteers
Directly manage gazetteers
Raw materials for gazetteers
  – Reference works
  – Monolingual and parallel corpora
Testbeds for improving these technologies
  – E.g. alignment helps name tagging, and name tagging helps alignment

Lexicographical parallels
Original “slipping” process
  – First, get a madman...
Creation of Brown and other corpora
  – Kučera and Francis
Cobuild dictionary and friends
But names “get no respect” in lexicography (McDonald, 1996)

Cultural dependencies

Toponym Results

Projection principles
Exploits asymmetry in human language technologies (Yarowsky, HLT 2001)
English, French, Chinese, Czech (!) have
  – POS taggers, morphological analyzers
  – Named entity identifiers
  – Parsers and bracketers
Parallel corpus alignment allows projection of these resources

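The projection idea can be illustrated with a minimal sketch, assuming a word-level alignment is already available: tags produced by an existing English tagger are copied across alignment links onto the tokens of a language that has no tagger of its own. The function name `project_tags`, the `PLACE`/`O` tag set, and the toy English/Latin sentence pair are all hypothetical, chosen for illustration only; they are not taken from the talk.

```python
from typing import List, Tuple


def project_tags(source_tags: List[str],
                 alignment: List[Tuple[int, int]],
                 target_len: int,
                 default: str = "O") -> List[str]:
    """Copy each source token's tag onto the target tokens aligned to it."""
    target_tags = [default] * target_len
    for src_i, tgt_j in alignment:
        tag = source_tags[src_i]
        # Only fill slots still holding the default tag, so the first
        # non-default projection onto a target token wins.
        if tag != default and target_tags[tgt_j] == default:
            target_tags[tgt_j] = tag
    return target_tags


# Hypothetical toy data: the English side is tagged, the Latin side is not.
english = ["Rome", "is", "a", "city"]
english_tags = ["PLACE", "O", "O", "O"]
latin = ["Roma", "est", "urbs"]
# (english_index, latin_index) links; English "a" has no Latin counterpart.
alignment = [(0, 0), (1, 1), (3, 2)]

latin_tags = project_tags(english_tags, alignment, len(latin))
print(list(zip(latin, latin_tags)))
```

Unaligned target tokens simply keep the default tag, which is one reason projected annotations tend to be noisier than the source-side annotations they come from.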
Projection on the cheap
Align texts at coarse structural level
Geocode source text (English)
Optionally winnow target text (e.g. drop non-capitalized words where applicable)
Calculate mutual information (Church & Hanks, 1990)
Transliteration may be too ad hoc

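The steps above can be sketched roughly as follows. This is a minimal illustration, not the Perseus implementation: it assumes each aligned segment has already been reduced to a set of geocoded English toponyms and a set of winnowed target-language candidate words, and it scores each pair with pointwise mutual information in the style of Church & Hanks, log2(P(x, y) / (P(x) P(y))), with probabilities estimated from segment counts. The function names and the transliterated toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def pmi_table(aligned_segments):
    """Score (source toponym, target word) pairs by pointwise mutual information.

    aligned_segments: list of (source_toponyms, target_candidates) pairs,
    each a set of strings drawn from one coarsely aligned segment.
    """
    n = len(aligned_segments)
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_names, tgt_words in aligned_segments:
        src_count.update(src_names)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_names, tgt_words))
    # PMI = log2( P(s, t) / (P(s) * P(t)) ), with P estimated as count / n.
    return {
        (s, t): math.log2(c * n / (src_count[s] * tgt_count[t]))
        for (s, t), c in pair_count.items()
    }


def best_match(scores, source_name):
    """Return the target word with the highest PMI for one source toponym."""
    candidates = {t: v for (s, t), v in scores.items() if s == source_name}
    return max(candidates, key=candidates.get) if candidates else None


# Hypothetical toy corpus: four aligned segments, target side transliterated.
segments = [
    ({"Athens"}, {"Athenai", "Perikles"}),
    ({"Athens", "Sparta"}, {"Athenai", "Sparte"}),
    ({"Sparta"}, {"Sparte", "Perikles"}),
    ({"Athens"}, {"Athenai"}),
]

scores = pmi_table(segments)
for name in ("Athens", "Sparta"):
    print(name, "->", best_match(scores, name))
```

Because the score depends only on co-occurrence counts, no transliteration rules are needed. One known weakness of pointwise mutual information is that it overrates rare pairs, so a real run would probably want a minimum-count threshold, which this sketch omits.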
Preliminary results
Greek/English testbed
98% precision
70.8% recall (Why?)
Ethnic designations present interesting problems
  – “Stephanus of Byzantium”
Morphology outside of English

Proposals
Preservation of gazetteer source materials
DLs as home for gazetteer “slips”
Parallel texts as key resource
  – (also cf. Berkeley TIDES work)
Persistent documents as training sets for automatic methods