2006.04.27 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday.

Presentation transcript:

SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday 10:30 am - 12:00 pm Spring 2006 Principles of Information Retrieval Lecture 28: CLIR

SLIDE 2IS 240 – Spring 2006 Mini-TREC Proposed Schedule –February – Database and previous Queries –March 2 – report on system acquisition and setup –March 2, New Queries for testing… –April 20, Results due –April 25, Results and system rankings (sort of) –May 9, Group reports and discussion

SLIDE 3IS 240 – Spring 2006 Results (with bad runs)

SLIDE 4IS 240 – Spring 2006 With new runs…

SLIDE 5IS 240 – Spring 2006 Mean Average Precision

SLIDE 6IS 240 – Spring 2006 Today Review –NLP for IR –Text Summarization Cross-Language Information Retrieval –Introduction –Cross-Language EVIs Credit for some of the material in this lecture goes to Doug Oard (University of Maryland) and to Fredric Gey and Aitao Chen

SLIDE 7IS 240 – Spring 2006 Today Review –NLP for IR –Text Summarization Cross-Language Information Retrieval –Introduction –Cross-Language EVIs Credit for some of the material in this lecture goes to Doug Oard (University of Maryland) and to Fredric Gey and Aitao Chen

SLIDE 8IS 240 – Spring 2006 Natural Language Processing and IR The main approach in applying NLP to IR has been to attempt to address –Phrase usage vs individual terms –Search expansion using related terms/concepts –Attempts to automatically exploit or assign controlled vocabularies

SLIDE 9IS 240 – Spring 2006 NLP and IR Much early research showed that (at least in the restricted test databases tested) –Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically) –Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements

SLIDE 10IS 240 – Spring 2006 NLP and IR Not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods –E.g. Use of syntactic role relations between terms has shown no improvement in performance over “bag of words” approaches

SLIDE 11IS 240 – Spring 2006 General Framework of NLP
Stages: Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context processing Interpretation
Example: "John runs." → John run+s (P-N; V 3-pre, N plu) → S [NP [P-N John]] [VP [V run]] → Pred: RUN, Agent: John → "John is a student. He runs."
Slide from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 12IS 240 – Spring 2006 Using NLP Strzalkowski (in Reader) [Diagram: Text → NLP → repres → Dbase search, where the NLP stage is TAGGER → PARSER → TERMS]

SLIDE 13IS 240 – Spring 2006 Using NLP INPUT SENTENCE The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin. TAGGED SENTENCE The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per

SLIDE 14IS 240 – Spring 2006 Using NLP TAGGED & STEMMED SENTENCE the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per

SLIDE 15IS 240 – Spring 2006 Using NLP PARSED SENTENCE [assert [[perf [have]][[verb[BE]] [subject [np[n PRESIDENT][t_pos THE] [adj[FORMER]][adj[SOVIET]]]] [adv EVER] [sub_ord[SINCE [[verb[INVADE]] [subject [np [n TANK][t_pos A] [adj [RUSSIAN]]]] [object [np [name [WISCONSIN]]]]]]]]]

SLIDE 16IS 240 – Spring 2006 Using NLP EXTRACTED TERMS & WEIGHTS President soviet President+soviet president+former Hero hero+local Invade tank Tank+invade tank+russian Russian wisconsin
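A minimal sketch of how head+modifier terms like those above might be pulled from the tagged and stemmed sentence. This is a simplification, not Strzalkowski's parser: it only pairs adjectives (`jj`) with the noun (`nn`) that follows them, so subject-verb pairs such as tank+invade, which need the full parse, are not produced. The function names are hypothetical.

```python
def parse_tagged(sentence):
    """Split whitespace-separated 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def head_modifier_terms(pairs):
    """Emit each noun, plus noun+adjective for every adjective directly before it."""
    terms = []
    pending = []  # adjectives seen since the last non-adjective token
    for word, tag in pairs:
        if tag == "jj":
            pending.append(word)
        elif tag == "nn":
            terms.append(word)
            terms.extend(f"{word}+{adj}" for adj in pending)
            pending = []
        else:
            pending = []  # any other tag breaks the adjective run
    return terms


tagged = ("the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt "
          "local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn "
          "invade/vbd wisconsin/np ./per")
terms = head_modifier_terms(parse_tagged(tagged))
```

The result covers the adjective-noun terms on the slide (president+former, president+soviet, hero+local, tank+russian); weights are left out because the slide does not say how they are computed.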

SLIDE 17IS 240 – Spring 2006 NLP & IR Indexing –Use of NLP methods to identify phrases Test weighting schemes for phrases –Use of more sophisticated morphological analysis Searching –Use of two-stage retrieval Statistical retrieval Followed by more sophisticated NLP filtering

SLIDE 18IS 240 – Spring 2006 NLP & IR New “Question Answering” track at TREC has been exploring these areas –Usually statistical methods are used to retrieve candidate documents –NLP techniques are used to extract the likely answers from the text of the documents

SLIDE 19IS 240 – Spring 2006 Today Review –NLP for IR –Text Summarization Cross-Language Information Retrieval –Introduction –Cross-Language EVIs Credit for some of the material in this lecture goes to Doug Oard (University of Maryland) and to Fredric Gey and Aitao Chen

SLIDE 20IS 240 – Spring 2006 Introduction to CLIR Slides from Doug Oard…

SLIDE 21IS 240 – Spring 2006 Cross-Language IR Given a query expressed in one language Find info that may be expressed in another –Electronic texts –Document images –Recorded speech [101] –Sign language [Diagram: English Query → Retrieval System → French Documents]

SLIDE 22IS 240 – Spring 2006 Why Do Cross-Language IR? When users can read several languages –Eliminates multiple queries –Query in most fluent language Monolingual users can also benefit –If translations can be provided –If it suffices to know that a document exists –If text captions are used to search for images

SLIDE 23IS 240 – Spring 2006 What We Know Dictionaries are very useful –Easily get to 50% of monolingual IR effectiveness –We can get to about 75% using: Part-of-speech tags Pseudo-relevance feedback Phrase indexing Multilingual training corpora are also useful –When the corpus is from the right domain

SLIDE 24IS 240 – Spring 2006 Related Issues Multiscript text processing [12] –Character sets, writing system, direction,... Language identification [109] –Markup, detection Language-specific processing [103] –Stemming, morphological roots, compounds, … Document translation [51]

SLIDE 25IS 240 – Spring 2006 [Diagram: taxonomy of Cross-Language Text Retrieval approaches. Labels: Controlled Vocabulary; Free Text; Knowledge-based; Corpus-based; Ontology-based; Dictionary-based; Thesaurus-based; Parallel (Term-aligned, Sentence-aligned, Document-aligned); Comparable; Unaligned; Query Translation; Document Translation; Text Translation; Vector Translation]

SLIDE 26IS 240 – Spring 2006 Free Text Developments 1970, 1973 Salton –Hand coded bilingual dictionaries 1990 Latent Semantic Indexing [53] –French/English using Hansard training corpus 1994 European multilingual IR project [84] –Medium-scale recall/precision evaluation 1996 SIGIR Cross-lingual IR workshop –And over 10 conferences and workshops since!

SLIDE 27IS 240 – Spring 2006 How Controlled Vocabulary Works Thesaurus design [102] –Design a knowledge structure for domain –Assign a unique “descriptor” to each concept Include “scope notes” and “lead-in vocabulary” Document indexing –Read the document, assign appropriate descriptors Retrieval –Select desired descriptors, use exact match retrieval

SLIDE 28IS 240 – Spring 2006 Multilingual Thesauri Adapt the knowledge structure –Cultural differences influence indexing choices Use language-independent descriptors –Matched to a unique term in each language Three construction techniques [46] –Build it from scratch –Translate an existing thesaurus –Merge monolingual thesauri

SLIDE 29IS 240 – Spring 2006 Advantages over Free Text High-quality concept-based indexing –Descriptors need not appear in the document Knowledge-guided searching –Good thesauri capture expert domain knowledge Excellent cross-language effectiveness –Up to 100% of monolingual effectiveness Understandable retrieval results Efficient implementation

SLIDE 30IS 240 – Spring 2006 Limitations Costly to create –Design knowledge structure, index each document Costly to maintain –Document indexing, vocabulary and concept change Hard to use –Vocabulary choice, knowledge structure navigation Limited scope –Domain must be chosen at design time

SLIDE 31IS 240 – Spring 2006 Query vs. Document Translation Query translation –Very efficient for short queries Not as big an advantage for relevance feedback –Hard to resolve ambiguous query terms Document translation –May be needed by the selection interface And supports adaptive filtering well –Slow, but only need to do it once per document Poor scale-up to large numbers of languages

SLIDE 32IS 240 – Spring 2006 Document Translation Example Approach –Select a single query language –Translate every document into that language –Perform monolingual retrieval Long documents provide enough context –And many translation errors do not hurt retrieval Much of the generation effort is wasted –And choosing a single translation can hurt

SLIDE 33IS 240 – Spring 2006 Query Translation Example –Select controlled vocabulary search terms –Retrieve documents in desired language –Form monolingual query from the documents –Perform a monolingual free text search [Diagram labels: Information Need; Thesaurus; Controlled Vocabulary; Multilingual Text Retrieval System; French Query Terms; English Abstracts; Alta Vista; English Web Pages]

SLIDE 34IS 240 – Spring 2006 Machine Readable Dictionaries Based on printed bilingual dictionaries –Becoming widely available Used to produce bilingual term lists –Cross-language term mappings are accessible Sometimes listed in order of most common usage –Some knowledge structure is also present Hard to extract and represent automatically The challenge is to pick the right translation

SLIDE 35IS 240 – Spring 2006 Unconstrained Query Translation Replace each word with every translation –Typically 5-10 translations per word About 50% of monolingual effectiveness –Main problem is ambiguity –Example: Fly (English) 8 word senses (e.g., to fly a flag) 13 Spanish translations (enarbolar, ondear, …) 38 English retranslations (hoist, brandish, lift…)
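A minimal sketch of unconstrained query translation as described above: every query word is replaced by all of its dictionary translations, with no attempt at disambiguation. The dictionary here is a toy assumption (only `enarbolar` and `ondear` come from the slide), and `translate_query` is a hypothetical name.

```python
# Toy English-to-Spanish term list; a real bilingual term list would be far larger.
TOY_DICT = {
    "fly": ["volar", "enarbolar", "ondear"],
    "flag": ["bandera"],
}


def translate_query(query, bilingual_dict):
    """Replace each query word with every translation listed for it.

    Words missing from the dictionary are passed through unchanged,
    which is one common fallback (names often survive translation).
    """
    translated = []
    for word in query.lower().split():
        translated.extend(bilingual_dict.get(word, [word]))
    return translated


spanish_query = translate_query("fly flag", TOY_DICT)
```

Even in this tiny case the translated query mixes the "travel by air" and "hoist" senses of fly, which is the ambiguity problem the slide points to.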

SLIDE 36IS 240 – Spring 2006 Phrase Indexing Improves retrieval effectiveness two ways –Phrases are less ambiguous than single words –Idiomatic phrases translate as a single concept Three ways to identify phrases –Semantic (e.g., appears in a dictionary) –Syntactic (e.g., parse as a noun phrase) –Cooccurrence (words found together often) Semantic phrase results are impressive

SLIDE 37IS 240 – Spring 2006 Types of Bilingual Corpora Parallel corpora: translation-equivalent pairs –Document pairs –Sentence pairs –Term pairs Comparable corpora –Content-equivalent document pairs Unaligned corpora –Content from the same domain

SLIDE 38IS 240 – Spring 2006 Generating Parallel Corpora Parallel corpora are naturally domain-tuned –Finding one for the right domain may be hard Alternative is to build one –Start with a monolingual corpus –Automatic machine translation for second language Worthwhile when IR technique is faster than MT –If translation errors don’t hurt the IR technique Good results with Latent Semantic Indexing

SLIDE 39IS 240 – Spring 2006 Pseudo-Relevance Feedback –Enter query terms in French –Find top French documents in parallel corpus –Construct a query from English translations –Perform a monolingual free text search [Diagram labels: French Query Terms; French Text Retrieval System; Parallel Corpus; Top ranked French Documents; English Translations; Alta Vista; English Web Pages]
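The pseudo-relevance feedback steps on this slide can be sketched roughly as follows. The parallel corpus, the stopword list, and the overlap-count ranking are all toy assumptions chosen for illustration; a real system would use a proper ranking function for the French retrieval step.

```python
from collections import Counter

# Toy parallel corpus of (French, English) document pairs.
PARALLEL = [
    ("le chat noir dort", "the black cat sleeps"),
    ("le chat mange le poisson", "the cat eats the fish"),
    ("la bourse monte", "the stock market rises"),
]
STOPWORDS = {"the"}


def cross_language_prf(french_query, parallel, top_docs=2, top_terms=3):
    """Build an English query from the translations of the top French documents."""
    query_terms = set(french_query.lower().split())
    # Step 1: rank French documents by how many query terms they contain.
    ranked = sorted(
        parallel,
        key=lambda pair: len(query_terms & set(pair[0].split())),
        reverse=True,
    )
    # Step 2: count terms in the English sides of the top-ranked pairs.
    counts = Counter()
    for _, english in ranked[:top_docs]:
        counts.update(w for w in english.split() if w not in STOPWORDS)
    # Step 3: the most frequent terms become the monolingual English query.
    return [term for term, _ in counts.most_common(top_terms)]


english_query = cross_language_prf("chat", PARALLEL)
```

The English query can then be sent to any monolingual free text search engine, which is the final step on the slide.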

SLIDE 40IS 240 – Spring 2006 Similarity-Based Dictionaries Automatically developed from aligned documents –Reflects language use in a specific domain For each term, find most similar in other language –Retain only the top few (5 or so) Performs as well as dictionary-based techniques –Evaluated on a comparable corpus of news stories [98] Stories were automatically linked based on date and subject
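One way to picture a similarity-based dictionary is the sketch below. It assumes document-aligned corpora and uses cosine similarity over binary document-occurrence vectors; the slide does not say which similarity measure was used, so that choice and all names here are assumptions.

```python
import math


def occurrence_vectors(docs):
    """Map each term to the set of document indices it occurs in."""
    vectors = {}
    for index, text in enumerate(docs):
        for term in set(text.split()):
            vectors.setdefault(term, set()).add(index)
    return vectors


def cosine(a, b):
    """Cosine similarity between two binary vectors stored as index sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def similarity_dictionary(source_docs, target_docs, keep=5):
    """For each source term, keep the `keep` most similar target terms.

    source_docs[i] and target_docs[i] are assumed to be an aligned pair.
    """
    source = occurrence_vectors(source_docs)
    target = occurrence_vectors(target_docs)
    table = {}
    for term, doc_ids in source.items():
        ranked = sorted(target, key=lambda t: cosine(doc_ids, target[t]), reverse=True)
        table[term] = ranked[:keep]
    return table


english = ["cat sleeps", "dog barks", "cat eats"]
french = ["chat dort", "chien aboie", "chat mange"]
table = similarity_dictionary(english, french, keep=1)
```

With so little data many candidates tie, which is why real training sets need to be large and from the right domain.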

SLIDE 41IS 240 – Spring 2006 Latent Semantic Indexing Designed for better monolingual effectiveness –Works well across languages too [27] Cross-language is just a type of term choice variation Produces short dense document vectors –Better than long sparse ones for adaptive filtering Training data needs grow with dimensionality –Not as good for retrieval efficiency Always 300 multiplications, even for short queries

SLIDE 42IS 240 – Spring 2006 Cooccurrence-Based Dictionaries Align terms using cooccurrence statistics –How often do a term pair occur in sentence pairs? Weighted by relative position in the sentences –Retain term pairs that occur unusually often Useful for query translation –Excellent results when the domain is the same Also practical for document translation –Term use variations to reinforce good translations

SLIDE 43IS 240 – Spring 2006 Language Identification Can be specified using metadata –Included in HTTP and HTML Determined using word-scale features –Which dictionary gets the most hits? Determined using subword features –Letter n-grams in electronic and printed text –Phoneme n-grams in speech
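As a rough illustration of identification from subword features, the sketch below compares letter trigrams in an input against per-language trigram profiles. The two one-sentence profiles are toy assumptions; usable profiles would be trained on much more text, and the overlap score is just one simple choice.

```python
from collections import Counter


def trigrams(text):
    """Count letter trigrams, padding with spaces so word edges are captured."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def identify(text, profiles):
    """Return the language whose profile shares the most trigram mass with text."""
    grams = trigrams(text)

    def score(language):
        profile = profiles[language]
        return sum(min(count, profile[g]) for g, count in grams.items())

    return max(profiles, key=score)


# Toy profiles built from a single sentence each.
PROFILES = {
    "english": trigrams("the quick brown fox jumps over the lazy dog and the cat"),
    "french": trigrams("le renard brun saute par dessus le chien paresseux et le chat"),
}

guess_en = identify("the dog and the fox", PROFILES)
guess_fr = identify("le chien et le chat", PROFILES)
```

The same scoring would apply to phoneme n-grams for speech, given a phoneme transcription in place of letters.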

SLIDE 44IS 240 – Spring 2006 Research Directions User needs assessment Evaluation Corpus construction Word sense disambiguation System integration Probabilistic models Adaptive filtering

SLIDE 45IS 240 – Spring 2006 Evaluation Most critical need is for side by side tests –TREC-did this for French/German/Italian Domain shift metric –Domain shift hurts corpus-based techniques –Need a way to measure severity of the shift Test collections for adaptive filtering –From cross-language recall/precision evaluation

SLIDE 46IS 240 – Spring 2006 Corpus Construction Corpus-based techniques have great potential Parallel corpora are rare and expensive –Find it, reverse engineer the links, clean it up Unlinked corpora are of limited value –Context linking research could change that [77] Comparable corpora offer middle ground –Need to develop automatic linking techniques –Also need a metric for degree of comparability

SLIDE 47IS 240 – Spring 2006 TIDES: Find and Interpret Information Vital to National Security –Find and retrieve information in unfamiliar languages –Translate it into English –Extract and correlate its content against other materials Example: The Tamil National leader, Mr. V. Pirapaharan, delivered a speech on 13 May 1998, the anniversary of the launch of Sri Lanka's biggest and longest assault on the Tamil homelands, describing how the LTTE defended against Sri Lanka's latest military ambitions. Here's what he said: 62 million people in South India and Sri Lanka can read this

SLIDE 48IS 240 – Spring 2006 The Challenges
[Diagram: Tamil document analysis feeding four tasks: Translation, Topic Detection, Summarization, Extraction. Annotations on the slide: (manual), (experimental), (special-purpose), (key sentences)]
Sample text: "Today is a significant day in the history of our national liberation struggle, it marks the end of a year during which we have resisted and fought against the biggest ever offensive operation launched by the Sri Lankan armed forces code named "Jayasikuru"..." "The objective of the Sinhala chauvinists was to utilize maximum man power and fire power to destroy the military capability of the LTTE and to bring an end to the Tamil freedom movement. Before the launching of the operation "Jayasikuru" the Sri Lankan political and military high command miscalculated the military strength and determination of the LTTE."
Topic labels: Liberation Tigers of Tamil Eelam (LTTE); Sri Lanka; Velupillai Pirapaharan; Rebellion
Extracted table:
Org       Leader        HQ      Losses
Sinhala   Kumaratunga           3000
LTTE      Pirapaharan   Wanni   1300

SLIDE 49IS 240 – Spring 2006 Cross-Language IR on the Web –Most workshop proceedings –Lots of papers and project descriptions –Links to working systems Including 2 web search engines –Useful linguistic resources –BibTeX for the attached bibliography

SLIDE 50IS 240 – Spring 2006 Today Review –NLP for IR –Text Summarization Cross-Language Information Retrieval –Introduction –Cross-Language EVIs Credit for some of the material in this lecture goes to Doug Oard (University of Maryland) and to Fredric Gey and Aitao Chen

SLIDE 51IS 240 – Spring 2006 The “Entry Vocabulary Index” Approach to Multilingual Search Ray R. Larson, Fredric Gey, Aitao Chen, Michael Buckland University of California, Berkeley School of Information Management and Systems and UC Data Harvesting Translingual Vocabulary Mappings for Multilingual Digital Libraries Note: This talk was presented at the 2002 JCDL (Joint Conference on Digital Libraries)

SLIDE 52IS 240 – Spring 2006 Overview What are Entry Vocabulary Indexes? –EVI Research at Berkeley –Notion of an EVI –How are EVIs Built Berkeley Multilingual EVI –Technology components –Database –Examples of operation Ongoing research

SLIDE 53IS 240 – Spring 2006 Entry Vocabulary Index Research Projects at Berkeley DARPA Information Management Program –“Search Support for Unfamiliar Metadata Vocabularies” Institute for Museum and Library Services –“Seamless Searching of Numeric and Textual Resources” DARPA TIDES program –“Translingual Information Management Using Domain Ontologies” NSF/NASA/DARPA: DLI-2 (IDL) –“ Discovery and Use of Textual, Numeric and Spatial Data”

SLIDE 54IS 240 – Spring 2006 The IMLS project: To demonstrate improved access to written material and numerical data on the same topic when searching two very different databases: --- books, articles, and their bibliographic records; --- numerical data in socio-economic databases. PHASE I: A library gateway providing search support for both text and socio-economic numeric databases. The gateway would accept a query in the library users’ own terms and would suggest which terms to use from the specialized categorization of the resource to be searched. PHASE II: Demonstration of a library gateway supporting searches between text and numeric databases. If you found something interesting in a socio-economic database, the gateway would help you to find documents on the same topic in a text database, and vice versa.

SLIDE 55IS 240 – Spring 2006 TIDES Project Translingual Information Detection, Extraction and Summarization –Building EVIs to map across languages Using same notion with training data in different languages Using Library of Congress Subject Headings from the CDL MELVYL database

SLIDE 56IS 240 – Spring 2006 What is an Entry Vocabulary Index? EVIs are a means of mapping from user’s vocabulary to the controlled vocabulary of a collection of documents…

SLIDE 57IS 240 – Spring 2006 Start with a collection of documents.

SLIDE 58IS 240 – Spring 2006 Classify and index with controlled vocabulary. Index Ideally, use a database already indexed

SLIDE 59IS 240 – Spring 2006 Problem: Controlled Vocabularies can be difficult for people to use. “pass mtr veh spark ign eng” Index

SLIDE 60IS 240 – Spring 2006 Solution: Entry Level Vocabulary Indexes. Index EVI “pass mtr veh spark ign eng” = “Automobile”

SLIDE 61IS 240 – Spring 2006 EVI example EVI 1 Index term: “pass mtr veh spark ign eng” User Query “Automobile” EVI 2 Index term: “automobiles” OR “internal combustion engines”

SLIDE 62IS 240 – Spring 2006 But why stop there? Index EVI

SLIDE 63IS 240 – Spring 2006 “Which EVI do I use?” Index EVI Index EVI Index EVI

SLIDE 64IS 240 – Spring 2006 EVI to EVIs Index EVI Index EVI Index EVI EVI 2

SLIDE 65IS 240 – Spring 2006 Find Plutonium In Arabic Chinese Greek Japanese Korean Russian Tamil Why not treat language the same way?

SLIDE 66IS 240 – Spring 2006 Find Plutonium In Arabic Chinese Greek Japanese Korean Russian Tamil Statistical association Digital library resources

SLIDE 67IS 240 – Spring 2006 Background on Online Library Catalogs Library catalogs have been automated at a furious pace worldwide since the late ’70s Library objects (books, maps, pictures) in 400+ languages Bibliographic descriptions contain one or more sentences from a particular language (transliterated) Objects have been classified by subject by librarians –Library of Congress Subject Heading (Islamic Fundamentalism) –Library of Congress Classification (BP60, BP63, KF27) –Dewey Decimal Classification (297.2, 306.6, 320.5) International standard (MARC) for catalog metadata Huge number of remotely searchable catalogs worldwide accessible using the international search/retrieve protocol Z39.50

SLIDE 68IS 240 – Spring 2006 What can libraries and their catalogs provide? Millions of sentences in multiple languages Sentences with topical content identified from 150,000 Library of Congress Subject Headings Transfer point (interlingua) between English topics and words in other languages Can be used to create: –Bilingual dictionaries –Query expansion in cross-language information retrieval

SLIDE 69IS 240 – Spring 2006 Search: SUBJECT “Islamic Fundamentalism” and LANGUAGE “Arabic” Yield: 119 Arabic language samples on topic “Islamic Fundamentalism”

SLIDE 70IS 240 – Spring 2006 Our Training Set and Prototype University of California/CDL MELVYL catalog Private copy, 10 million+ records (5 million non- English) Records in over 100 languages Obtained in MARC database standard format Foreign language titles use Library of Congress transliteration (Romanization) standard Prototype search software maps from/to English and –Arabic, Chinese, French, German –Italian, Japanese, Russian, Spanish

SLIDE 71IS 240 – Spring 2006 Technical Details: Building an Entry Vocabulary Module (EVI) –Start from an Internet DB indexed with a controlled vocabulary. –Download a set of training data. –Extract terms (words and noun phrases) from titles and abstracts, using part of speech tagging for noun phrases. –Build associations between extracted terms & controlled vocabularies.
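A minimal sketch of the "build associations" step, under the assumption that each training record is a (title, subject heading) pair and that terms are single lowercased title words. It only gathers the co-occurrence counts that an association measure would consume; noun phrase extraction is omitted, and the records and function name are hypothetical.

```python
from collections import Counter


def contingency_counts(records):
    """Return {(term, heading): (a, b, c, d)} for every co-occurring pair.

    a: records with the term and the heading
    b: records with the term but a different heading
    c: records with the heading but without the term
    d: records with neither
    """
    total = len(records)
    term_count, heading_count, joint = Counter(), Counter(), Counter()
    for title, heading in records:
        heading_count[heading] += 1
        for term in set(title.lower().split()):
            term_count[term] += 1
            joint[(term, heading)] += 1
    table = {}
    for (term, heading), a in joint.items():
        b = term_count[term] - a
        c = heading_count[heading] - a
        d = total - a - b - c
        table[(term, heading)] = (a, b, c, d)
    return table


# Toy training records: (title, controlled vocabulary heading).
records = [
    ("Islamic revival", "Islamic fundamentalism"),
    ("Islamic law", "Islamic law"),
    ("Modern poetry", "Poetry"),
]
counts = contingency_counts(records)
```

Pairs that never co-occur are left out of the table, since a term with no observed link to a heading would not be suggested for it anyway.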

SLIDE 72IS 240 – Spring 2006 Association Measure
         C    ¬C
  t      a    b
 ¬t      c    d
Where t is the occurrence of a term and C is the occurrence of a classification in the training set

SLIDE 73IS 240 – Spring 2006 Association Measure Maximum Likelihood ratio
$$W(C,t) = 2\,[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)]$$
where
$$\log L(p, k, n) = k \log p + (n-k)\log(1-p)$$
and
$$p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d}$$
(viz. Dunning)
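The likelihood ratio on this slide can be computed directly from the four contingency counts. The sketch below follows the formula as given, reading the second and third arguments of logL as the number of successes k and the number of trials n; it assumes both rows of the table are non-empty (a+b > 0 and c+d > 0), and the function names are hypothetical.

```python
import math


def _xlogy(x, y):
    """x * log(y), treating 0 * log(0) as 0 so boundary cases stay finite."""
    return 0.0 if x == 0 else x * math.log(y)


def log_l(p, k, n):
    """Binomial log-likelihood: k successes in n trials with probability p."""
    return _xlogy(k, p) + _xlogy(n - k, 1 - p)


def association(a, b, c, d):
    """W(C, t) for a 2x2 table with rows t / not-t and columns C / not-C."""
    p1 = a / (a + b)                # P(C | t)
    p2 = c / (c + d)                # P(C | not t)
    p = (a + c) / (a + b + c + d)   # P(C) overall
    return 2 * (log_l(p1, a, a + b) + log_l(p2, c, c + d)
                - log_l(p, a, a + b) - log_l(p, c, c + d))


independent = association(10, 10, 10, 10)
associated = association(30, 10, 10, 50)
```

When the term and the classification are independent the score is zero, and it grows as the term becomes a better predictor of the classification, so candidate headings can be ranked by it.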

SLIDE 74IS 240 – Spring 2006 Example: Library of Congress Subject Heading “Islamic Fundamentalism” yields most closely associated words in multiple languages

SLIDE 75IS 240 – Spring 2006 Non-English words can be mapped to English subject headings

SLIDE 76IS 240 – Spring 2006 Examples

SLIDE 77IS 240 – Spring 2006 Catalog Languages vs. FBIS Languages (University of California online catalog. 10 million records) Approx. language distribution (Berkeley # sentences, FBIS est. # lines source)
Language         Berkeley    FBIS
German           840,032     49,872
Spanish          614,025     388,772
French           609,089     2,871
Russian          341,050     15,415
Italian          266,
Portuguese       149,389     24,930
Chinese          127,636     246,549
Japanese         110,956
Arabic           96,124      (8263)*
Dutch            90,170
Latin            88,818
Polish           81,698
Indonesian       59,445
Swedish          53,854      16,652
Hungarian        46,330      6,631
Hindi            42,886
Danish           41,517      18,688
Hebrew           41,468      3,500
Czech            35,432      3,647
Urdu             30,206
Turkish          30,015
Bulgarian        27,850
Norwegian        26,478      13,596
Korean           25,979      68,607
Rumanian         25,874
Finnish          25,027      8,187
Thai             24,693
Serbo-Croatian   24,601      36,139
Greek            23,926
Bengali          23,430
Catalan          20,392
Tamil            20,232
*English only, no source text
106 languages with > 500 records

SLIDE 78IS 240 – Spring 2006 Future Research Add content from other online library catalogs –RLIN (>30M records, >900K Chinese, >250K Arabic) –COPAC [UK] (9M records, 40K Arabic) Transliteration and back-transliteration for scripted languages Phrase mapping (POS tagging for English, bigram-trigram identification for target languages using mutual information) Further evaluation (TREC, CLEF, NTCIR and local analysis)