
7/16/2002, JCDL 2002, Ray Larson: The “Entry Vocabulary Index” Approach to Multilingual Search. Ray R. Larson, Fredric Gey, Aitao Chen, Michael Buckland University.


1 The “Entry Vocabulary Index” Approach to Multilingual Search
Ray R. Larson, Fredric Gey, Aitao Chen, Michael Buckland
University of California, Berkeley, School of Information Management and Systems and UC Data
Harvesting Translingual Vocabulary Mappings for Multilingual Digital Libraries

2 Overview
What are Entry Vocabulary Indexes?
–EVI research at Berkeley
–Notion of an EVI
–How are EVIs built
Berkeley Multilingual EVI
–Technology components
–Database
–Examples of operation
Ongoing research

3 Entry Vocabulary Index Research Projects at Berkeley
DARPA Information Management Program
–“Search Support for Unfamiliar Metadata Vocabularies”
Institute of Museum and Library Services
–“Seamless Searching of Numeric and Textual Resources”
DARPA TIDES program
–“Translingual Information Management Using Domain Ontologies”
NSF/NASA/DARPA: DLI-2 (IDL)
–“Discovery and Use of Textual, Numeric and Spatial Data”

4 The IMLS project
Goal: to demonstrate improved access to written material and numerical data on the same topic when searching two very different kinds of database:
–books, articles, and their bibliographic records;
–numerical data in socio-economic databases.
PHASE I: A library gateway providing search support for both text and socio-economic numeric databases. The gateway would accept a query in the library user’s own terms and suggest which terms to use from the specialized categorization of the resource being searched.
PHASE II: Demonstration of a library gateway supporting searches between text and numeric databases. If you found something interesting in a socio-economic database, the gateway would help you find documents on the same topic in a text database, and vice versa.

5 TIDES Project
Translingual Information Detection, Extraction and Summarization
–Building EVIs to map across languages
–Using the same notion, with training data in different languages
–Using Library of Congress Subject Headings from the CDL MELVYL database

6 What is an Entry Vocabulary Index?
EVIs are a means of mapping from a user’s vocabulary to the controlled vocabulary of a collection of documents…

7 Start with a collection of documents.

8 Classify and index with controlled vocabulary. Ideally, use a database that is already indexed. (Diagram: Index)

9 Problem: Controlled vocabularies can be difficult for people to use. Example index term: “pass mtr veh spark ign eng” (Diagram: Index)

10 Solution: Entry Level Vocabulary Indexes.
EVI: “pass mtr veh spark ign eng” = “Automobile” (Diagram: Index, EVI)

11 EVI example
User query: “Automobile”
EVI 1 index term: “pass mtr veh spark ign eng”
EVI 2 index term: “automobiles” OR “internal combustion engines”
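The lookup on this slide can be sketched as one user query fanned out to several EVIs, each answering in the vocabulary of its own index. This is only a minimal illustration: the `EVIS` table and the `expand_query` function are hypothetical names, and the data is limited to the example terms shown on the slide.

```python
from typing import Dict, List

# Hypothetical toy data: each EVI maps a lower-cased user term to the
# controlled-vocabulary terms of the one index it fronts.
EVIS: Dict[str, Dict[str, List[str]]] = {
    "EVI 1": {"automobile": ["pass mtr veh spark ign eng"]},
    "EVI 2": {"automobile": ["automobiles", "internal combustion engines"]},
}


def expand_query(evis: Dict[str, Dict[str, List[str]]], query: str) -> Dict[str, str]:
    """Translate a user query into one OR-joined index query per EVI."""
    key = query.strip().lower()
    return {
        name: " OR ".join(f'"{term}"' for term in table[key])
        for name, table in evis.items()
        if key in table  # EVIs that do not know the term are skipped
    }


for evi_name, index_query in expand_query(EVIS, "Automobile").items():
    print(evi_name, "->", index_query)
```

A real EVI would return ranked suggestions with association weights rather than a fixed list; the dictionary here only stands in for that ranking.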

12 But why stop there? (Diagram: Index, EVI)

13 “Which EVI do I use?” (Diagram: three Index and EVI pairs)

14 EVI to EVIs (Diagram: three Index and EVI pairs, plus EVI 2)

15 Why not treat language the same way?
Find “Plutonium” in: Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil

16 Find “Plutonium” in: Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil
(Diagram: Statistical association; Digital library resources)

17 Background on Online Library Catalogs
Library catalogs have been automated at a furious pace worldwide since the late ’70s
Library objects (books, maps, pictures) in 400+ languages
Bibliographic descriptions contain one or more sentences from a particular language (transliterated)
Objects have been classified by subject by librarians
–Library of Congress Subject Heading (Islamic Fundamentalism)
–Library of Congress Classification (BP60, BP63, KF27)
–Dewey Decimal Classification (297.2, 306.6, 320.5)
International standard (MARC) for catalog metadata
Huge number of remotely searchable catalogs worldwide, accessible using the international search/retrieve protocol Z39.50

18 What can libraries and their catalogs provide?
Millions of sentences in multiple languages
Sentences with topical content identified from 150,000 Library of Congress Subject Headings
Transfer point (interlingua) between English topics and words in other languages
Can be used to create:
–Bilingual dictionaries
–Query expansion in cross-language information retrieval

19 Search: SUBJECT “Islamic Fundamentalism” and LANGUAGE “Arabic”
Yield: 119 Arabic language samples on topic “Islamic Fundamentalism”
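The search on this slide amounts to selecting catalog records by subject heading and language, then keeping their titles as training text. The sketch below assumes a simplified, hypothetical `Record` type with three fields and three-letter language codes in the style of the MARC language code list; the titles are placeholders, not catalog data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """Hypothetical, heavily simplified stand-in for a MARC catalog record."""
    title: str
    language: str        # e.g. "ara" for Arabic, "fre" for French
    subjects: List[str]  # subject headings assigned by catalogers


# Placeholder records for illustration only.
RECORDS = [
    Record("placeholder title 1", "ara", ["Islamic fundamentalism"]),
    Record("placeholder title 2", "fre", ["Islamic fundamentalism"]),
    Record("placeholder title 3", "ara", ["Plutonium"]),
]


def harvest_samples(records: List[Record], subject: str, language: str) -> List[str]:
    """Return titles of records that carry the subject heading in the given language."""
    wanted = subject.casefold()
    return [
        r.title
        for r in records
        if r.language == language
        and any(s.casefold() == wanted for s in r.subjects)
    ]


print(harvest_samples(RECORDS, "Islamic Fundamentalism", "ara"))
```

Matching headings case-insensitively is a design choice of this sketch; a production system would more likely match on normalized heading identifiers.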

20 Our Training Set and Prototype
University of California/CDL MELVYL catalog
Private copy, 10 million+ records (5 million non-English)
Records in over 100 languages
Obtained in MARC database standard format
Foreign language titles use Library of Congress transliteration (Romanization) standard
Prototype search software maps from/to English and
–Arabic, Chinese, French, German
–Italian, Japanese, Russian, Spanish

21 Technical Details: Building an Entry Vocabulary Module (EVI)
Start from an Internet DB indexed with a controlled vocabulary.
Download a set of training data.
Part of speech tagging (for noun phrases).
Extract terms (words and noun phrases) from titles and abstracts.
Build associations between extracted terms & controlled vocabularies.
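The counting part of this pipeline can be sketched roughly as follows. This is an assumption-laden simplification: `extract_terms` uses plain word tokens with a tiny hypothetical stopword list instead of part-of-speech tagging and noun-phrase extraction, and the training records are assumed to arrive as `(text, headings)` pairs.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Hypothetical, deliberately tiny stopword list for the illustration.
STOPWORDS: Set[str] = {"the", "of", "and", "in", "a", "an", "to", "for", "on"}


def extract_terms(text: str) -> Set[str]:
    """Lower-cased word tokens minus stopwords (stand-in for POS-based extraction)."""
    return {w for w in re.findall(r"[^\W\d_]+", text.lower()) if w not in STOPWORDS}


def count_pairs(records: Iterable[Tuple[str, List[str]]]):
    """Count, per record, how often terms, headings, and (term, heading) pairs occur."""
    term_counts: Counter = Counter()
    class_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    n_records = 0
    for text, headings in records:
        n_records += 1
        terms = extract_terms(text)
        classes = set(headings)
        term_counts.update(terms)
        class_counts.update(classes)
        pair_counts.update((t, c) for t in terms for c in classes)
    return n_records, term_counts, class_counts, pair_counts
```

These four counts are all that an association measure over a 2x2 table needs, so the expensive pass over the training data happens only once.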

22 Association Measure
        C    ¬C
 t      a    b
¬t      c    d
where t is the occurrence of a term and C is the occurrence of a classification in the training set.
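As a small sketch of how the four cells could be filled, assume record-level counts are already available: the number of records with both the term and the classification, with the term, with the classification, and in total. The `Contingency` type and `contingency` function are hypothetical names used only for this illustration.

```python
from typing import NamedTuple


class Contingency(NamedTuple):
    a: int  # records with term t and classification C
    b: int  # records with term t but not C
    c: int  # records with C but not t
    d: int  # records with neither t nor C


def contingency(n_both: int, n_term: int, n_class: int, n_total: int) -> Contingency:
    """Derive the 2x2 table cells from marginal and joint record counts."""
    if n_both < 0 or n_both > min(n_term, n_class):
        raise ValueError("joint count must lie between 0 and both marginals")
    d = n_total - n_term - n_class + n_both
    if d < 0:
        raise ValueError("marginal counts exceed the total number of records")
    return Contingency(a=n_both, b=n_term - n_both, c=n_class - n_both, d=d)


table = contingency(n_both=2, n_term=3, n_class=4, n_total=10)
print(table)
```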

23 Association Measure
Maximum likelihood ratio (see Dunning):
$$W(C,t) = 2\left[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)\right]$$
where
$$\log L(p, k, n) = k\log(p) + (n-k)\log(1-p)$$
and
$$p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d}$$
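A minimal sketch of this measure, assuming the cell counts a, b, c, d from the contingency table on the previous slide. It reads each call in W as passing the probability, then the number of successes k, then the number of trials n, and it treats 0·log(0) as 0 so that empty cells do not raise errors. The function names are hypothetical.

```python
import math


def log_l(p: float, k: int, n: int) -> float:
    """Binomial log-likelihood k*log(p) + (n-k)*log(1-p), with 0*log(0) taken as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def association_weight(a: int, b: int, c: int, d: int) -> float:
    """Likelihood-ratio weight W(C, t) for a 2x2 term/classification table."""
    n_total = a + b + c + d
    if n_total == 0:
        raise ValueError("empty contingency table")
    p1 = a / (a + b) if a + b else 0.0  # P(C | t)
    p2 = c / (c + d) if c + d else 0.0  # P(C | not t)
    p = (a + c) / n_total               # P(C) overall
    return 2.0 * (
        log_l(p1, a, a + b) + log_l(p2, c, c + d)
        - log_l(p, a, a + b) - log_l(p, c, c + d)
    )


# Independent term and class give a weight of zero; perfect association a large one.
print(association_weight(10, 10, 10, 10))
print(association_weight(10, 0, 0, 10))
```

The statistic is symmetric in direction: it is large both when a term strongly attracts a classification and when it strongly repels it, so a ranking built on it would typically also check that p1 exceeds p2.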

24 Example: Library of Congress Subject Heading “Islamic Fundamentalism” yields most closely associated words in multiple languages

25 Non-English words can be mapped to English subject headings

26 Examples

27 Catalog Languages vs. FBIS Languages
(University of California online catalog, 10 million records)
Approx. language distribution (Berkeley: # sentences; FBIS: est. # lines of source)

Language          Berkeley      FBIS
German             840,032    49,872
Spanish            614,025   388,772
French             609,089     2,871
Russian            341,050    15,415
Italian            266,424       254
Portuguese         149,389    24,930
Chinese            127,636   246,549
Japanese           110,956
Arabic              96,124   (8263)*
Dutch               90,170
Latin               88,818
Polish              81,698
Indonesian          59,445
Swedish             53,854    16,652
Hungarian           46,330     6,631
Hindi               42,886
Danish              41,517    18,688
Hebrew              41,468     3,500
Czech               35,432     3,647
Urdu                30,206
Turkish             30,015
Bulgarian           27,850
Norwegian           26,478    13,596
Korean              25,979    68,607
Rumanian            25,874
Finnish             25,027     8,187
Thai                24,693
Serbo-Croatian      24,601    36,139
Greek               23,926
Bengali             23,430
Catalan             20,392
Tamil               20,232

*English only, no source text
106 languages with > 500 records

28 Future Research
Add content from other online library catalogs
–RLIN (>30M records, >900K Chinese, >250K Arabic)
–COPAC [UK] (9M records, 40K Arabic)
Transliteration and back-transliteration for scripted languages
Phrase mapping (POS tagging for English, bigram-trigram identification for target languages using mutual information)
Further evaluation (TREC, CLEF, NTCIR and local analysis)
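The slide does not say which mutual-information formulation is planned for bigram identification. One common choice is pointwise mutual information over adjacent token pairs, sketched below under that assumption; the function name, the `min_count` threshold, and the pre-tokenized input format are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bigram_pmi(sentences: List[List[str]], min_count: int = 2) -> Dict[Tuple[str, str], float]:
    """Score adjacent token pairs by pointwise mutual information (log base 2)."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores: Dict[Tuple[str, str], float] = {}
    for (x, y), n_xy in bigrams.items():
        if n_xy < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = n_xy / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

Pairs with high scores co-occur far more often than their individual frequencies predict and are candidate phrases; the count threshold matters because PMI overrates pairs seen only once or twice.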

29 Prototype available
http://otlet.sims.berkeley.edu/mulevm2.html

