7/16/2002JCDL 2002, Ray Larson The “Entry Vocabulary Index” Approach to Multilingual Search Ray R. Larson, Fredric Gey, Aitao Chen, Michael Buckland University.

Presentation transcript:

7/16/2002, JCDL 2002, Ray Larson
The “Entry Vocabulary Index” Approach to Multilingual Search
Ray R. Larson, Fredric Gey, Aitao Chen, Michael Buckland
University of California, Berkeley, School of Information Management and Systems and UC Data
Harvesting Translingual Vocabulary Mappings for Multilingual Digital Libraries

Overview
What are Entry Vocabulary Indexes?
 – EVI research at Berkeley
 – Notion of an EVI
 – How EVIs are built
Berkeley Multilingual EVI
 – Technology components
 – Database
 – Examples of operation
Ongoing research

Entry Vocabulary Index Research Projects at Berkeley
DARPA Information Management Program
 – “Search Support for Unfamiliar Metadata Vocabularies”
Institute of Museum and Library Services
 – “Seamless Searching of Numeric and Textual Resources”
DARPA TIDES program
 – “Translingual Information Management Using Domain Ontologies”
NSF/NASA/DARPA: DLI-2 (IDL)
 – “Discovery and Use of Textual, Numeric and Spatial Data”

The IMLS project
To demonstrate improved access to written material and numerical data on the same topic when searching two very different kinds of database:
 – books, articles, and their bibliographic records;
 – numerical data in socio-economic databases.
PHASE I: A library gateway providing search support for both text and socio-economic numeric databases. The gateway would accept a query in the library user’s own terms and suggest which terms to use from the specialized categorization of the resource to be searched.
PHASE II: Demonstration of a library gateway supporting searches between text and numeric databases. If you found something interesting in a socio-economic database, the gateway would help you find documents on the same topic in a text database, and vice versa.

TIDES Project
Translingual Information Detection, Extraction and Summarization
 – Building EVIs to map across languages
Using the same notion with training data in different languages
Using Library of Congress Subject Headings from the CDL MELVYL database

What is an Entry Vocabulary Index?
EVIs are a means of mapping from the user’s vocabulary to the controlled vocabulary of a collection of documents…

Start with a collection of documents.

Classify and index with controlled vocabulary.
Ideally, use a database already indexed.
[Diagram: Index]

Problem: Controlled vocabularies can be difficult for people to use.
Example index term: “pass mtr veh spark ign eng”
[Diagram: Index]

Solution: Entry Level Vocabulary Indexes.
“pass mtr veh spark ign eng” = “Automobile”
[Diagram: Index, EVI]

EVI example
User query: “Automobile”
EVI 1 index term: “pass mtr veh spark ign eng”
EVI 2 index term: “automobiles” OR “internal combustion engines”
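The lookup in this example can be sketched in a few lines of Python. This is only an illustration of the idea, not the Berkeley prototype: the dictionaries `EVI_1` and `EVI_2` and the function `suggest_terms` are hypothetical names, and a real EVI would rank candidate terms by statistical association rather than store a fixed mapping.

```python
# Hypothetical toy EVIs: each maps an ordinary-language query word to the
# controlled-vocabulary terms of one particular index (values from the slide).
EVI_1 = {"automobile": ["pass mtr veh spark ign eng"]}
EVI_2 = {"automobile": ["automobiles", "internal combustion engines"]}


def suggest_terms(evi, query):
    """Return the controlled-vocabulary terms suggested for a user query.

    Assumption: queries are matched case-insensitively on the whole string;
    an unknown query yields an empty list.
    """
    return evi.get(query.strip().lower(), [])


if __name__ == "__main__":
    for name, evi in (("EVI 1", EVI_1), ("EVI 2", EVI_2)):
        print(name, suggest_terms(evi, "Automobile"))
```

The same user query yields different index terms depending on which collection's EVI is consulted, which is the point of the example.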

But why stop there?
[Diagram: Index, EVI]

“Which EVI do I use?”
[Diagram: three indexes, each with its own EVI]

EVI to EVIs
[Diagram: three indexes, each with its own EVI, plus an EVI 2 over the EVIs]

Find Plutonium
In Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil
Why not treat language the same way?

Find Plutonium
In Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil
Statistical association
Digital library resources

Background on Online Library Catalogs
Library catalogs have been automated at a furious pace worldwide since the late ’70s
Library objects (books, maps, pictures) in 400+ languages
Bibliographic descriptions contain one or more sentences from a particular language (transliterated)
Objects have been classified by subject by librarians
 – Library of Congress Subject Heading (Islamic Fundamentalism)
 – Library of Congress Classification (BP60, BP63, KF27)
 – Dewey Decimal Classification (297.2, 306.6, 320.5)
International standard (MARC) for catalog metadata
Huge number of remotely searchable catalogs worldwide, accessible using the international search/retrieve protocol Z39.50

What can libraries and their catalogs provide?
Millions of sentences in multiple languages
Sentences with topical content identified from 150,000 Library of Congress Subject Headings
Transfer point (interlingua) between English topics and words in other languages
Can be used to create:
 – Bilingual dictionaries
 – Query expansion in cross-language information retrieval

Search: SUBJECT “Islamic Fundamentalism” and LANGUAGE “Arabic”
Yield: 119 Arabic language samples on topic “Islamic Fundamentalism”

Our Training Set and Prototype
University of California/CDL MELVYL catalog
Private copy, 10 million+ records (5 million non-English)
Records in over 100 languages
Obtained in MARC database standard format
Foreign language titles use Library of Congress transliteration (Romanization) standard
Prototype search software maps from/to English and
 – Arabic, Chinese, French, German
 – Italian, Japanese, Russian, Spanish

Technical Details: Building an Entry Vocabulary Module (EVI)
Internet DB indexed with a controlled vocabulary.
Download a set of training data.
Part of speech tagging (for noun phrases).
Extract terms (words and noun phrases) from titles and abstracts.
Build associations between extracted terms & controlled vocabularies.
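As a rough sketch of the term-extraction and association-counting steps, the Python below tallies, for every (term, heading) pair seen in a training set, the four co-occurrence counts that an association measure needs. Everything here is an assumption for illustration: the record layout (a title plus a list of subject headings), the function names, and especially the term extraction, which simply splits titles into lowercase words instead of doing part-of-speech tagging and noun-phrase extraction.

```python
import re
from collections import Counter


def extract_terms(title):
    """Simplified stand-in for term extraction: the set of lowercase words."""
    return set(re.findall(r"\w+", title.lower()))


def contingency_counts(records):
    """Count term/heading co-occurrence over (title, headings) records.

    Returns {(term, heading): (a, b, c, d)} where
      a = records with the term and the heading
      b = records with the term but not the heading
      c = records with the heading but not the term
      d = records with neither
    """
    n = 0
    term_df, head_df, pair_df = Counter(), Counter(), Counter()
    for title, headings in records:
        n += 1
        terms = extract_terms(title)
        heads = set(headings)
        term_df.update(terms)
        head_df.update(heads)
        pair_df.update((t, h) for t in terms for h in heads)

    table = {}
    for (t, h), a in pair_df.items():
        b = term_df[t] - a
        c = head_df[h] - a
        d = n - a - b - c
        table[(t, h)] = (a, b, c, d)
    return table


if __name__ == "__main__":
    # Made-up training records, for illustration only.
    records = [
        ("Islamic fundamentalism in Egypt", ["Islamic fundamentalism"]),
        ("Fundamentalism and politics", ["Islamic fundamentalism", "Politics"]),
        ("History of automobiles", ["Automobiles"]),
    ]
    counts = contingency_counts(records)
    print(counts[("fundamentalism", "Islamic fundamentalism")])
```

Only pairs that co-occur at least once are stored, which keeps the table small; pairs that never co-occur carry no positive evidence of association anyway.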

Association Measure

        C     ¬C
 t      a     b
¬t      c     d

where t is the occurrence of a term and C is the occurrence of a classification in the training set.

Association Measure: Maximum Likelihood Ratio (after Dunning)

\[ W(C,t) = 2\,[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)] \]

where

\[ \log L(p, k, n) = k \log p + (n - k) \log(1 - p) \]

and

\[ p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d} \]
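A minimal Python sketch of this measure, computed from the four contingency counts a, b, c, d. The function names are hypothetical, and the sketch makes two assumptions that the slide does not state: the second and third arguments of logL are read as k successes out of n trials (so that logL(p1, a, a+b) means a successes in a+b trials), and 0·log 0 is treated as 0.

```python
import math


def log_l(p, k, n):
    """Binomial log-likelihood k*log(p) + (n - k)*log(1 - p).

    Terms with a zero count are skipped, i.e. 0*log(0) is taken to be 0.
    """
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def association_weight(a, b, c, d):
    """Likelihood-ratio weight W(C, t) from the 2x2 contingency counts.

    Assumes a + b > 0 and c + d > 0 (both table rows are non-empty).
    """
    p1 = a / (a + b)
    p2 = c / (c + d)
    p = (a + c) / (a + b + c + d)
    return 2.0 * (
        log_l(p1, a, a + b)
        + log_l(p2, c, c + d)
        - log_l(p, a, a + b)
        - log_l(p, c, c + d)
    )


if __name__ == "__main__":
    print(association_weight(10, 10, 10, 10))  # term independent of class
    print(association_weight(10, 0, 0, 10))    # term and class always together
```

When the term and the classification are statistically independent (p1 = p2 = p) the weight is zero, and it grows as the two co-occur more often than chance would predict.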

Example: Library of Congress Subject Heading “Islamic Fundamentalism” yields the most closely associated words in multiple languages.

Non-English words can be mapped to English subject headings

Examples

Catalog Languages vs. FBIS Languages
(University of California online catalog, 10 million records)
Approximate language distribution (Berkeley: number of sentences; FBIS: estimated number of source lines)

Language          Berkeley    FBIS
German            840,032     49,872
Spanish           614,025     388,772
French            609,089     2,871
Russian           341,050     15,415
Italian           266,
Portuguese        149,389     24,930
Chinese           127,636     246,549
Japanese          110,956
Arabic            96,124      (8263)*
Dutch             90,170
Latin             88,818
Polish            81,698
Indonesian        59,445
Swedish           53,854      16,652
Hungarian         46,330      6,631
Hindi             42,886
Danish            41,517      18,688
Hebrew            41,468      3,500
Czech             35,432      3,647
Urdu              30,206
Turkish           30,015
Bulgarian         27,850
Norwegian         26,478      13,596
Korean            25,979      68,607
Rumanian          25,874
Finnish           25,027      8,187
Thai              24,693
Serbo-Croatian    24,601      36,139
Greek             23,926
Bengali           23,430
Catalan           20,392
Tamil             20,232

*English only, no source text
106 languages with > 500 records

Future Research
Add content from other online library catalogs
 – RLIN (>30M records, >900K Chinese, >250K Arabic)
 – COPAC [UK] (9M records, 40K Arabic)
Transliteration and back-transliteration for scripted languages
Phrase mapping (POS tagging for English, bigram-trigram identification for target languages using mutual information)
Further evaluation (TREC, CLEF, NTCIR and local analysis)
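The mutual-information idea mentioned for phrase mapping could be sketched as follows. This is one common formulation (pointwise mutual information over adjacent word pairs), offered as an assumption; the slides do not say which variant the authors intended, and the function name, the `min_count` threshold and the toy input are all hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (base 2).

    sentences: iterable of token lists.
    Pairs seen fewer than min_count times are dropped, since PMI is
    unreliable for rare events.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    # Toy English input for readability; the slide targets non-English titles.
    sample = [
        ["new", "york", "city"],
        ["new", "york", "times"],
        ["the", "city"],
    ]
    for pair, score in bigram_pmi(sample).items():
        print(pair, round(score, 3))
```

High-scoring pairs are candidates for being treated as single phrase terms before association with subject headings; trigrams could be handled the same way by treating an accepted bigram as one token.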

Prototype available