Concept Unification of Terms in Different Languages for IR Qing Li, Sung-Hyon Myaeng (1), Yun Jin (2),Bo-yeong Kang (3) (1) Information & Communications.

Slides:



Advertisements
Similar presentations
Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
Advertisements

Chapter 5: Introduction to Information Retrieval
Indexing. Efficient Retrieval Documents x terms matrix t 1 t 2... t j... t m nf d 1 w 11 w w 1j... w 1m 1/|d 1 | d 2 w 21 w w 2j... w 2m 1/|d.
A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng City University of Hong Kong WWW 2007 Session: Similarity Search April.
Language Model based Information Retrieval: University of Saarland 1 A Hidden Markov Model Information Retrieval System Mahboob Alam Khalid.
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
A Markov Random Field Model for Term Dependencies Donald Metzler and W. Bruce Croft University of Massachusetts, Amherst Center for Intelligent Information.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) IR Queries.
Ch 4: Information Retrieval and Text Mining
1 Ranked Queries over sources with Boolean Query Interfaces without Ranking Support Vagelis Hristidis, Florida International University Yuheng Hu, Arizona.
Chapter 5: Information Retrieval and Web Search
Aparna Kulkarni Nachal Ramasamy Rashmi Havaldar N-grams to Process Hindi Queries.
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
Multi-Style Language Model for Web Scale Information Retrieval Kuansan Wang, Xiaolong Li and Jianfeng Gao SIGIR 2010 Min-Hsuan Lai Department of Computer.
MediaEval Workshop 2011 Pisa, Italy 1-2 September 2011.
Probabilistic Model for Definitional Question Answering Kyoung-Soo Han, Young-In Song, and Hae-Chang Rim Korea University SIGIR 2006.
MINING RELATED QUERIES FROM SEARCH ENGINE QUERY LOGS Xiaodong Shi and Christopher C. Yang Definitions: Query Record: A query record represents the submission.
Basic Web Applications 2. Search Engine Why we need search ensigns? Why we need search ensigns? –because there are hundreds of millions of pages available.
An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.
CLEF 2005: Multilingual Retrieval by Combining Multiple Multilingual Ranked Lists Luo Si & Jamie Callan Language Technology Institute School of Computer.
Redeeming Relevance for Subject Search in Citation Indexes Shannon Bradshaw The University of Iowa
AnswerBus Question Answering System Zhiping Zheng School of Information, University of Michigan HLT 2002.
Learning Phonetic Similarity for Matching Named Entity Translation and Mining New Translations Wai Lam, Ruizhang Huang, Pik-Shan Cheung ACM SIGIR 2004.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Query Routing in Peer-to-Peer Web Search Engine Speaker: Pavel Serdyukov Supervisors: Gerhard Weikum Christian Zimmer Matthias Bender International Max.
A Markov Random Field Model for Term Dependencies Donald Metzler W. Bruce Croft Present by Chia-Hao Lee.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
Querying Structured Text in an XML Database By Xuemei Luo.
Translating Unknown Queries with Web Corpora for Cross- Language Information Retrieval Pu-Jen Cheng, Jei-Wen Teng, Ruei- Cheng Chen, Jenq-Haur Wang, Wen-
Chapter 6: Information Retrieval and Web Search
Information retrieval 1 Boolean retrieval. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text)
1 Computing Relevance, Similarity: The Vector Space Model.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Binxing Jiao et. al (SIGIR ’10) Presenter : Lin, Yi-Jhen Advisor: Dr. Koh. Jia-ling Date: 2011/4/25 VISUAL SUMMARIZATION OF WEB PAGES.
Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Mining Translations of OOV Terms from the Web through Crosslingual Query Expansion Ying Zhang Fei Huang Stephan Vogel SIGIR 2005.
Personalization with user’s local data Personalizing Search via Automated Analysis of Interests and Activities 1 Sungjick Lee Department of Electrical.
Cluster-specific Named Entity Transliteration Fei Huang HLT/EMNLP 2005.
Information Retrieval at NLC Jianfeng Gao NLC Group, Microsoft Research China.
Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.
Vector Space Models.
Learning Phonetic Similarity for Matching Named Entity Translations and Mining New Translations Wai Lam Ruizhang Huang Pik-Shan Cheung Department of Systems.
Information Retrieval using Word Senses: Root Sense Tagging Approach Sang-Bum Kim, Hee-Cheol Seo and Hae-Chang Rim Natural Language Processing Lab., Department.
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
Survey Jaehui Park Copyright  2008 by CEBT Introduction  Members Jung-Yeon Yang, Jaehui Park, Sungchan Park, Jongheum Yeon  We are interested.
Learning in a Pairwise Term-Term Proximity Framework for Information Retrieval Ronan Cummins, Colm O’Riordan Digital Enterprise Research Institute SIGIR.
Combining Text and Image Queries at ImageCLEF2005: A Corpus-Based Relevance-Feedback Approach Yih-Cheng Chang Department of Computer Science and Information.
DISTRIBUTED INFORMATION RETRIEVAL Lee Won Hee.
A Framework to Predict the Quality of Answers with Non-Textual Features Jiwoon Jeon, W. Bruce Croft(University of Massachusetts-Amherst) Joon Ho Lee (Soongsil.
The Development of a search engine & Comparison according to algorithms Sung-soo Kim The final report.
ENHANCING CLUSTER LABELING USING WIKIPEDIA David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab SIGIR’09.
Question Answering Passage Retrieval Using Dependency Relations (SIGIR 2005) (National University of Singapore) Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan,
CS791 - Technologies of Google Spring A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.
LEARNING IN A PAIRWISE TERM-TERM PROXIMITY FRAMEWORK FOR INFORMATION RETRIEVAL Ronan Cummins, Colm O’Riordan (SIGIR’09) Speaker : Yi-Ling Tai Date : 2010/03/15.
IR 6 Scoring, term weighting and the vector space model.
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Text Based Information Retrieval
Information Retrieval and Web Search
CSCI 5417 Information Retrieval Systems Jim Martin
Do-Gil Lee1*, Ilhwan Kim1 and Seok Kee Lee2
Compact Query Term Selection Using Topically Related Text
Applying Key Phrase Extraction to aid Invalidity Search
Chapter 5: Information Retrieval and Web Search
Information Retrieval and Web Design
Presentation transcript:

Concept Unification of Terms in Different Languages for IR Qing Li, Sung-Hyon Myaeng (1), Yun Jin (2),Bo-yeong Kang (3) (1) Information & Communications University, Korea (2) Chungnam National University, Korea (3) Seoul National University,Korea ACL 2006

Problem English terms and their equivalences in a local language refer to the same concept, they are erroneously treated as independent index units in traditional MIR Such separation of semantically identical words in different languages may limit retrieval performance

Problem - example Viterbi Algorithm, 韦特比算法 two same semantic Chinese terms “ 维特比 ” and “ 韦特比 ” correspond to one English term “Viterbi”.

Concept Unification For concept unification in index, firstly key English phrases should be extracted from local Web pages. After translating them into the local language,the English phrase and their translations are treated as the same index units for IR.

Method - steps Key Phrase Extraction Translation of English Phrases  Extraction of Candidates for Selection  Selection of candidates Statistical model Phonetic and semantic model

Method - steps Key Phrase Extraction Translation of English Phrases  Extraction of Candidates for Selection  Selection of candidates Statistical model Phonetic and semantic model

Key Phrase Extraction to decide what kinds of key English phrases in local Web pages are necessary to be conceptually unified only select the English phrases that occurred within the special marks including quotation marks and parenthesis

Method - steps Key Phrase Extraction Translation of English Phrases  Extraction of Candidates for Selection  Selection of candidates Statistical model Phonetic and semantic model

Translation of English Phrases query the search engine with English phrases to retrieve the local Web pages containing them  Extraction of Candidates for Selection  Selection of candidates

Method - steps Key Phrase Extraction Translation of English Phrases  Extraction of Candidates for Selection  Selection of candidates Statistical model Phonetic and semantic model

Extraction of Candidates for Selection get the snippets (title and summary) of Web texts in the returned search result pages to extract translation candidates within a window of a limited size  employ all sequential combinations of the words as the candidates  Example 韦特比 : “ 韦 ”,“ 特 ”, “ 比 ”, “ 韦特 ”, “ 特比 ”, and “ 韦特比 ”

Extraction of Candidates for Selection window size k number of words n predefined maximum length of candidate m n≤m  all adjacent sequential combinations n>m  only adjacent sequential combinations of which the word number is less than m are regarded as candidates example: if set n to 4 and m to 2, the window “w1 w2 w3 w4 ” only “w1 w2 ”,“w2 w3 ”, “w3 w4 ”, “w1 ”, “w2 ” , “w3 ”, “w4 ”

Method - steps Key Phrase Extraction Translation of English Phrases  Extraction of Candidates for Selection  Selection of candidates Statistical model Phonetic and semantic model

Selection of candidates In selection step, we first remove most of the noise candidates based on the statistical method and re-rank the candidates based on the semantic and phonetic similarity

Statistical model the frequency of co-occurrence and the textual distance Assumption : if the candidate is the right translation  tends to co-occur with the key phrase frequently  location tends to be close to the key phrase  the longer the candidates’ length, the higher the chance to be the right translation d k ( q, c) is the word distance between the English phrase q and the candidate c i in the k-th occurrence of candidate in the search-result pages

Selection of candidates In selection step, we first remove most of the noise candidates based on the statistical method and re-rank the candidates based on the semantic and phonetic similarity Example : “Attack of the clones”  Phonetic : clones as “ 克隆人 ”  Semantic : Attack as “ 進攻 ”

Phonetic and semantic model SSP : the sum of the semantic and phonetic mapping weights with English word in the key phrase and the local language word candidates English key phrase, E, represented as a sequence of tokens candidate in local language, C, be represented as a sequence of tokens An edge (ewi,cwj) is formed with the weight ω(ewi,cwj) which is calculated as the average of normalized semantic and phonetic values

Phonetic weight adopt the method in (Jeong et al.,1999) with some adjustments English phrase consists of a string of English alphabets e1,...,em the candidate in the local language is comprised of a string of phonetic elements. c1,...,ck. For Chinese language, the phonetic elements mean the elements of “pinying” gi is a pronunciation unit comprised of one or more English alphabets bi-grams

semantic weight is calculated from the bilingual dictionary The weight relies on the degree of overlaps between an English translation and the candidate Example : English phrase “Inha University” and its candidate “ 인하 대 ” “University” is translated into “ 대학교 ”, therefore, the semantic weight between “University” and “ 대 ” is about 0.33

Selection of candidates The strategy for us to pick up the final translation is distinct on two different aspects from the others. If the SSP values of all candidates are less than the threshold  the top one obtained by statistical model is selected as the final translation Otherwise, re-rank the candidates according to the SSP value. Then look down through the new rank list and  if there is a big jump of SSP value, draw a “virtual” line  If there is no big jump of SSP values, the “virtual” line is drawn at the bottom of the new rank list  the candidates above the “virtual” line are all selected as the final translations.

Experiment Data set :  downloaded 32 sites with 5,847 Korean Web pages  74 sites with 13,765 Chinese Web pages  232 and 746 English terms were extracted from Korean Web pages and Chinese Web pages, respectively

Experiment-1 The translation results The errors were brought in by containing additional related information or incomplete translation. Ex : “blue chip”, “ SIGIR 2005”

Experiment-2 compare our approach with two well known translation systems LiveTrans (Cheng et al., 2004) provided by the WKD lab in Academia Sinica “chi-square” method ( χ 2 ),and “context-vector” method (CV) and “chi-square” method ( χ 2 ) together the overall performance of LiveTrans’ combined method ( χ 2 +CV) is better than the simple method ( χ 2 ) in both Table 2 and 3, the same doesn’t hold for each individual Examine the contribution of distance information As in both Table 2 and Table 3, the later case shows better performance The statistical methods seem to be able to give a rough estimate for potential translations without giving high precision

Experiment-3 ran monolingual retrieval experiment to examine the impact of our concept unification on IR employed the standard tf × idf scheme, on KT-SET test collection The baseline against which we compared our approach applied a relatively simple indexing technique we obtained 14.9 % improvement based on mean average 11-pt precision

Future Work This is along the line of work where researchers attempt to index documents with concepts rather than words