Keyword Translation Accuracy and Cross-Lingual Question Answering in Chinese and Japanese Teruko Mitamura Mengqiu Wang Hideki Shima Frank Lin In CMU EACL.

Slides:



Advertisements
Similar presentations
WWW 2014 Seoul, April 8 th SNOW 2014 Data Challenge Two-level message clustering for topic detection in Twitter Georgios Petkos, Symeon Papadopoulos, Yiannis.
Advertisements

SEARCHING QUESTION AND ANSWER ARCHIVES Dr. Jiwoon Jeon Presented by CHARANYA VENKATESH KUMAR.
Chapter 11 Beyond Bag of Words. Question Answering n Providing answers instead of ranked lists of documents n Older QA systems generated answers n Current.
Parametric search and zone weighting Lecture 6. Recap of lecture 4 Query expansion Index construction.
The Informative Role of WordNet in Open-Domain Question Answering Marius Paşca and Sanda M. Harabagiu (NAACL 2001) Presented by Shauna Eggers CS 620 February.
Properties of Text CS336 Lecture 3:. 2 Information Retrieval Searching unstructured documents Typically text –Newspaper articles –Web pages Other documents.
Automatic Set Expansion for List Question Answering Richard C. Wang, Nico Schlaefer, William W. Cohen, and Eric Nyberg Language Technologies Institute.
ITCS 6010 Natural Language Understanding. Natural Language Processing What is it? Studies the problems inherent in the processing and manipulation of.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
On Burstiness-Aware Search for Document Sequences Theodoros Lappas Benjamin Arai Manolis Platakis Dimitrios Kotsakos Dimitrios Gunopulos SIGKDD 2009.
(Some issues in) Text Ranking. Recall General Framework Crawl – Use XML structure – Follow links to get new pages Retrieve relevant documents – Today.
1 UCB Digital Library Project An Experiment in Using Lexical Disambiguation to Enhance Information Access Robert Wilensky, Isaac Cheng, Timotius Tjahjadi,
Overview of Search Engines
1 Statistical NLP: Lecture 13 Statistical Alignment and Machine Translation.
Information Retrieval in Practice
Disambiguation of References to Individuals Levon Lloyd (State University of New York) Varun Bhagwan, Daniel Gruhl (IBM Research Center) Varun Bhagwan,
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Probabilistic Model for Definitional Question Answering Kyoung-Soo Han, Young-In Song, and Hae-Chang Rim Korea University SIGIR 2006.
CLEF – Cross Language Evaluation Forum Question Answering at CLEF 2003 ( Bridging Languages for Question Answering: DIOGENE at CLEF-2003.
Leveraging Conceptual Lexicon : Query Disambiguation using Proximity Information for Patent Retrieval Date : 2013/10/30 Author : Parvaz Mahdabi, Shima.
Survey of Semantic Annotation Platforms
Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval Doctorate Course Web Information Retrieval Speaker Gaia Trecarichi.
Concept Unification of Terms in Different Languages for IR Qing Li, Sung-Hyon Myaeng (1), Yun Jin (2),Bo-yeong Kang (3) (1) Information & Communications.
CLEF 2005: Multilingual Retrieval by Combining Multiple Multilingual Ranked Lists Luo Si & Jamie Callan Language Technology Institute School of Computer.
AnswerBus Question Answering System Zhiping Zheng School of Information, University of Michigan HLT 2002.
Learning Phonetic Similarity for Matching Named Entity Translation and Mining New Translations Wai Lam, Ruizhang Huang, Pik-Shan Cheung ACM SIGIR 2004.
PAUL ALEXANDRU CHIRITA STEFANIA COSTACHE SIEGFRIED HANDSCHUH WOLFGANG NEJDL 1* L3S RESEARCH CENTER 2* NATIONAL UNIVERSITY OF IRELAND PROCEEDINGS OF THE.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Question Answering.  Goal  Automatically answer questions submitted by humans in a natural language form  Approaches  Rely on techniques from diverse.
Querying Structured Text in an XML Database By Xuemei Luo.
A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering Jeongwoo Ko, Luo Si, Eric Nyberg (SIGIR ’ 07) Speaker: Cho, Chin Wei Advisor:
Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.
Carnegie Mellon School of Computer Science Copyright © 2001, Carnegie Mellon. All Rights Reserved. JAVELIN Project Briefing 1 AQUAINT Phase I Kickoff December.
Structured Use of External Knowledge for Event-based Open Domain Question Answering Hui Yang, Tat-Seng Chua, Shuguang Wang, Chun-Keat Koh National University.
Retrieval Models for Question and Answer Archives Xiaobing Xue, Jiwoon Jeon, W. Bruce Croft Computer Science Department University of Massachusetts, Google,
A Language Independent Method for Question Classification COLING 2004.
An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information Yong-Gu Lee
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Playing Biology ’ s Name Game: Identifying Protein Names In Scientific Text Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen and Ralf Zimmer Pac Symp.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
JAVELIN Project Briefing AQUAINT Program 1 AQUAINT 6-month Meeting 10/08/04 JAVELIN II: Scenarios and Variable Precision Reasoning for Advanced QA from.
Mining Translations of OOV Terms from the Web through Crosslingual Query Expansion Ying Zhang Fei Huang Stephan Vogel SIGIR 2005.
1 Thi Nhu Truong, ChengXiang Zhai Paul Ogilvie, Bill Jerome John Lafferty, Jamie Callan Carnegie Mellon University David Fisher, Fangfang Feng Victor Lavrenko.
1 FollowMyLink Individual APT Presentation Third Talk February 2006.
JAVELIN Project Briefing AQUAINT Program 1 AQUAINT Workshop, October 2005 JAVELIN Project Briefing Eric Nyberg, Teruko Mitamura, Jamie Callan, Robert Frederking,
Cluster-specific Named Entity Transliteration Fei Huang HLT/EMNLP 2005.
Querying Web Data – The WebQA Approach Author: Sunny K.S.Lam and M.Tamer Özsu CSI5311 Presentation Dongmei Jiang and Zhiping Duan.
Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.
Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Model Chun-Jen Lee Jason S. Chang Thomas C. Chuang AMTA 2004.
Vector Space Models.
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
From Text to Image: Generating Visual Query for Image Retrieval Wen-Cheng Lin, Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information.
Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering Mark A. Greenwood and Robert Gaizauskas Natural Language.
1 Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying Department of Computing The Hong Kong Polytechnic University Chinese Core Ontology Construction from a Bilingual.
Automatic Question Answering  Introduction  Factoid Based Question Answering.
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.
DISTRIBUTED INFORMATION RETRIEVAL Lee Won Hee.
Multilingual Information Retrieval using GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung.
An evolutionary approach for improving the quality of automatic summaries Constantin Orasan Research Group in Computational Linguistics School of Humanities,
1 Question Answering and Logistics. 2 Class Logistics  Comments on proposals will be returned next week and may be available as early as Monday  Look.
The Effect of Database Size Distribution on Resource Selection Algorithms Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University.
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Statistical NLP: Lecture 13
A research literature search engine with abbreviation recognition
CS246: Information Retrieval
Information Retrieval and Web Design
Presentation transcript:

Keyword Translation Accuracy and Cross-Lingual Question Answering in Chinese and Japanese Teruko Mitamura Mengqiu Wang Hideki Shima Frank Lin In CMU EACL 2006

Introduction JAVELIN is a modular and extensible architecture for building QA systems. Since JAVELIN is language independent, this paper extend the original English version for CLQA in Chinese and Japanese.

JAVELIN The JAVELIN architecture: Question Analyzer  Producing keywords Retrieval Strategist  Finding documents Information Extractor  Extracting answers Answer Generator  Ranking answers

JAVELIN Extension Extension for Cross-Lingual QA:

Question Analyzer Input questions are processed using the RASP parser (Korhonen and Briscoe, 2004). The module output contains three main components: a) selected keywords b) the answer type (e.g. numeric-expression, person-name, location) c) the answer subtype (e.g. author, river, city) The QA module is extended with a keyword translation sub-module.

Translation Module The TM selects the best combination of translated keywords from several sources:  Machine Readable Dictionaries (MRDs)  Machine Translation systems (MTs)  Web-mining-Based Keyword Translators (WBMTs) (Nagata et al., 2001, Li et al., 2003). For E-to-J translation, two MRDs, eight MTs and one WBMT are used.  If none of them return a translation, the word is transliterated into kana for Japanese. For E-to-C translation, one MRD, three MTs and one WBMT are used.

Translation Module After gathering all possible translations for every keyword, the TM uses a noisy channel model to select the best combination of translated keywords: The TM estimates model statistics using the World Wide Web:

TM: Smoothing When the target keyword co-occurrence count with n keywords is below a set threshold, a moving window of size n-1 would “move” through the keywords in sequence. If the co-occurrence count of all of these sets is above the threshold, return the product of the score of these sets as the language model score. If not, decrease the window and repeat until either all of the split sets are above the threshold or n = 1. Drawbacks: 1. "Moving-window smoothing" assumes that keywords that are next to each other are also more semantically related, which may not always be the case. 2. "Moving-window smoothing" tends to give the keywords near the middle of the question more weight, which may not be desirable.

TM: Pruning 1. Early Pruning:  Possible translations of the individual keywords are pruned before being combined.  A very simple pruning heuristic via a word frequency list is used. Very rare translations produced by a resource are not considered. 2. Late Pruning:  Possible translation candidates of the entire set of keywords are pruned after calculating translation probabilities.  Since the calculation of the translation probabilities requires little access to the web, we can calculate only the language model score for the top N candidates with the highest translation score and prune the rest.

Retrieval Strategist Lemur 3.0 toolkit (Ogilvie and Callan, 2001) is used. Lemur supports structured queries using operators such as Boolean AND, Synonym, Ordered/Un-Ordered Window and NOT. For example: The RS uses an incremental relaxation technique:  It starts from an initial query that is highly constrained.  The algorithm searches for all the keywords and data types in close proximity to each other.  The priority is based on a function of the likely answer type, keyword type (word, proper name, or phrase) and the inverse document frequency of each keyword.  The query is gradually relaxed until the desired number of relevant documents is retrieved.

Information Extractor The Light IX module:  The algorithm considers only those terms that are tagged as named entities which match the desired answer type.  E-C uses character-based tokenization and the linear Dist measure.  E-J uses word-based tokenization and logarithmic Dist measure.

Answer Generator The task of the AG module is to produce a ranked list of answer candidates from the IX output. The AG is designed to resolve representational differences and combine answer candidates that differ only in surface form. Even though the AG module plays an important role in JAVELIN, its full potential is not used in the E-C and E-J systems, since some language-specific resources required for multilingual answer merging are lacked.

Experiment Settings Three different runs were carried out for both the E- C and E-J systems. The same 200-question test set and the document corpora provided by the NTCIR CLQA task is used. The first run:  A fully automatic run using the original translation module in the CLQA system. The second run:  The keywords selected by the Question Analyzer module are manually translated. The third run:  The corresponding term for each English keyword in the translations for the English questions provided by NTCIR organizers is looked up.

Experiment Results

Discussions Chinese:  Word sense disambiguation for keywords e.g. “ 埋 ” and “ 葬 ”  Regional language difference e.g. “ 软件 ” and “ 軟體 ” Japanese:  Representational gaps e.g. “ ヴェルナー ・ シュピース ” and “ ヴェルナー シュピース ”. This problem can be resolved by Lemur.  Transliteration in WBMT Japanese nouns written in romaji (e.g. Funabashi) are transliterated them into hiragana for a better result in WBMT. Since there may be higher positive co-occurrence between kana and kanji (i.e. “ ふ なばし ” and “ 船橋 ”) than between romaji and kanji (i.e. “funabashi” and “ 船橋 ”).  Document Retrieval in Kana Transliteration from kana into kanji is more ambiguous than transliteration from romanji into kana. Therefore, indexing kana readings in the corpus and querying in kana is sometimes a useful technique for CLQA.