CLEF2003 Forum/ August 2003 / Trondheim / page 1

Report on CLEF-2003 ML4 experiments
Extracting multilingual resources from corpora

N. Cancedda, H. Dejean, E. Gaussier, J-M Renders (Xerox Research Center Europe)
Alexei Vinokourov (Royal Holloway University of London)

Agenda
• Objective and means
• Linguistic preprocessing
• Methods
  – Canonical Correlation Analysis (CCA) for CLIR
  – Combining lexicons automatically extracted from parallel and comparable corpora
• Results

Objectives and Means
• How to improve the adequacy of existing resources (dictionaries) to translate queries:
  – Coverage?
  – Precision (translation adapted to the corpus)?
• First way: exploit parallel corpora
  – Extract semantic, language-independent representation
  – Extract bilingual lexicons
• Second way: exploit comparable corpora
  – Extract (probabilistic) translation relationships
  – Must be combined with other translation resources (parallel)

The Task (first participation)
• Multi-lingual 4:
  – English, German, Spanish, French
• Fully automatic approach (no manual processing of the queries)
• Query language:
  – English
• Performance measure:
  – Non-interpolated average precision (not limited to 1000 documents)
  – Macro-average on all queries
• Before submission (training): from 2000 to 2002 (140 queries)
• After submission (evaluation): from 2001 to 2003
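The performance measure named on this slide can be sketched in Python as follows. This is a minimal illustration, assuming each query has a ranked list of document ids and a set of relevant ids; the function names and data layout are hypothetical and do not come from the slides.

```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision of one ranked list."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def macro_average(rankings, judgments):
    """Mean of per-query average precision; every query counts equally."""
    scores = [average_precision(rankings[q], judgments[q]) for q in judgments]
    return sum(scores) / len(scores)
```

Because the ranking is not truncated inside `average_precision`, a cut-off such as 1000 documents would be applied by the caller before scoring.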

Resources we used
• General Dictionary: ELRA (40k entries)
• Parallel corpora:
  – Hansard corpus (for CCA): only French-English
  – JOC corpus (for lexicon extraction): 300,000 sentences
• Comparable corpora:
  – The CLEF2003 corpora

Summary of approaches
• Semantic Projection
  – A semantic, language-independent space is extracted from a parallel training corpus
  – Language-dependent projection matrices are built
  – Both documents and queries are projected
  – Standard cosine measure is then used in the new space to perform IR
• Query translation
  – A probabilistic translation matrix is extracted from a parallel training corpus and the comparable CLEF corpus
  – Queries are translated by these translation matrices
  – Standard cosine measure is then used between the original documents and the translated query
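The query-translation approach can be sketched as below. This is an illustrative sketch, not the authors' implementation: it assumes sparse term-weight dictionaries for queries and documents and a nested dictionary holding P(t|s), and all function names are hypothetical.

```python
import math


def translate_query(query, translation):
    """Map source-term weights to target-term weights through P(t|s)."""
    translated = {}
    for s, weight in query.items():
        for t, prob in translation.get(s, {}).items():
            translated[t] = translated.get(t, 0.0) + weight * prob
    return translated


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, translation, documents):
    """Rank target-language documents against the translated query."""
    q_t = translate_query(query, translation)
    scored = [(cosine(q_t, vec), doc_id) for doc_id, vec in documents.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

Source terms missing from the translation matrix simply drop out of the translated query, which is one way the coverage problem raised earlier shows up in practice.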

Linguistic Preprocessing
• Lemmatized and (POS-)tagged corpora
• Partial segmentation of German compounds (lexicon-based) + some simple heuristics
• Normalization of spelling and accentuation (e.g. umlaut and eszett)
• POS-based word filtering (N, V, AD)
• Single-word entries only (for the dictionaries, queries and documents). Note that the adopted approaches for translation are context-dependent to some extent.
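One possible form of the spelling and accent normalization step is sketched below. The slides do not say which mapping was used; this sketch assumes that diacritics are stripped (so "ü" becomes "u" rather than "ue") and that eszett becomes "ss". The function name is hypothetical.

```python
import unicodedata


def normalize_spelling(word):
    """Lowercase, expand eszett, and strip combining accents (assumed policy)."""
    word = word.lower().replace("ß", "ss")
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Whatever mapping is chosen, it has to be applied identically to dictionaries, queries and documents so that terms still match after normalization.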

CCA for CLIR
• Given a set of paired observations (paired sentences or paragraphs), Canonical Correlation Analysis finds maximally correlated projections

[Figure: paired observations s1, s2, s3 and t1, t2, t3]

CCA for CLIR (II)
• CCA looks for particular combinations of terms that appear to have the same co-occurrence patterns in both languages
• Hypothesis: the only thing both languages have in common is their meaning (conditional independence)
• Then, these (linear) combinations of terms are able to locate the underlying semantics
• Results in language-independent concepts and the corresponding (language-dependent) projection operators
• Both queries and documents are projected. Traditional similarity measures (cosine) are then used for retrieval
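As background, the standard textbook objective of CCA can be written as follows. The symbols are not taken from the slides: $w_s$ and $w_t$ denote the projection directions for the source and target language, $C_{ss}$ and $C_{tt}$ the within-language covariance matrices of the term vectors, and $C_{st}$ the cross-language covariance estimated from the paired observations.

```latex
\rho \;=\; \max_{w_s,\, w_t}\;
  \frac{w_s^{\top} C_{st}\, w_t}
       {\sqrt{w_s^{\top} C_{ss}\, w_s}\;\sqrt{w_t^{\top} C_{tt}\, w_t}}
```

Further pairs of directions are obtained under the constraint that their projections are uncorrelated with the earlier ones; stacking the directions for each language gives the language-dependent projection operators mentioned above.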

Extraction of bilingual resources
• Upper bound of the coverage for the CLEF200x English query terms
• Automatically extracted lexicons provide better coverage, but translation accuracy can be degraded
• Use of some form of trade-off between the resources (manual/automatic)

Extracting lexicons from parallel corpora
• Statistical alignment methods:
  – Starting from alignment at the sentence level
  – Iterative Proportional Fitting Procedure (normalizing and restoring consistency in the raw co-occurrence matrix of source/target terms in aligned sentences)
  – Probabilistic translation matrix: $P_1(t|s)$
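A generic iterative proportional fitting loop is sketched below to show the kind of computation involved. How the procedure was set up in the experiments (which target margins, how many iterations) is not stated in the slides; the margins passed in here, the iteration count and the function names are assumptions made for illustration.

```python
def ipf(matrix, row_targets, col_targets, iterations=50):
    """Rescale a non-negative matrix so its margins approach the targets."""
    m = [row[:] for row in matrix]
    for _ in range(iterations):
        # Row step: make each row sum match its target.
        for i, row in enumerate(m):
            total = sum(row)
            if total > 0:
                m[i] = [x * row_targets[i] / total for x in row]
        # Column step: make each column sum match its target.
        for j in range(len(col_targets)):
            total = sum(row[j] for row in m)
            if total > 0:
                for row in m:
                    row[j] *= col_targets[j] / total
    return m


def row_normalize(matrix):
    """Turn each row (source term) into a conditional distribution P(t|s)."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([x / total if total > 0 else 0.0 for x in row])
    return result
```

Here rows stand for source terms and columns for target terms of the raw co-occurrence matrix; `row_normalize` is one simple way to read a translation matrix off the fitted counts.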

Extracting lexicons from comparable corpora
• Assumption: if 2 words are mutual translations, their more frequent collocates are likely to be mutual translations as well
• Corresponding method:
  – Build context vectors for source words s: CV(s)
  – Build context vectors for target words t: CV(t)
  – Translate the context vectors using a standard dictionary (as a bootstrap): TR(CV(t))
  – Compute the similarity between s and t by cos(CV(s), TR(CV(t)))
  – Normalize the similarities to yield a probabilistic translation lexicon $P_2(t|s)$
  – NB: CVs are based on windows centered on s or t, and weighted by some association measure (such as Mutual Information); the word itself is included in the CV, which introduces a bias toward dictionary entries
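The steps of this method can be sketched as follows. This is a simplified illustration: raw co-occurrence counts stand in for the mutual-information weighting, `window` is taken to be a half-width, the seed dictionary maps each target term to a list of source terms, and all function names are hypothetical.

```python
import math
from collections import Counter


def context_vectors(sentences, window=2):
    """CV(w): counts of tokens within +/- window of w (w itself included)."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            vectors.setdefault(word, Counter()).update(tokens[lo:hi])
    return vectors


def translate_vector(vector, dictionary):
    """TR(CV(t)): map target-language context terms into the source language."""
    translated = Counter()
    for term, weight in vector.items():
        for source_term in dictionary.get(term, []):
            translated[source_term] += weight
    return translated


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def translation_lexicon(cv_source, cv_target, dictionary):
    """Normalize cos(CV(s), TR(CV(t))) over t to obtain P2(t|s)."""
    lexicon = {}
    for s, vec_s in cv_source.items():
        sims = {t: cosine(vec_s, translate_vector(vec_t, dictionary))
                for t, vec_t in cv_target.items()}
        total = sum(sims.values())
        if total > 0:
            lexicon[s] = {t: sim / total for t, sim in sims.items() if sim > 0}
    return lexicon
```

Target words absent from the seed dictionary can still receive probability mass through their translated contexts, which is how the method extends coverage beyond the dictionary.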

Hybrid Method: model combination
• In some cases, the information provided by the comparable corpus is more reliable; in other cases, the information extracted from the parallel one is best.
• We adopted a simple linear combination scheme, but more elaborate approaches exist

$q_t = (\lambda P_1(t|s) + (1 - \lambda) P_2(t|s))\, q_s$

• We optimized $\lambda$ on the queries (performance measure: average precision)
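The linear combination and the tuning of its mixing weight can be sketched as follows. The grid search and the `evaluate` callback (assumed to return mean average precision on the training queries for a given lexicon) are illustrative assumptions; the slides do not describe how the optimization was carried out.

```python
def combine(p1, p2, lam):
    """Entry-wise lam * P1(t|s) + (1 - lam) * P2(t|s) over nested dicts."""
    combined = {}
    for s in set(p1) | set(p2):
        row1, row2 = p1.get(s, {}), p2.get(s, {})
        combined[s] = {
            t: lam * row1.get(t, 0.0) + (1.0 - lam) * row2.get(t, 0.0)
            for t in set(row1) | set(row2)
        }
    return combined


def tune_lambda(p1, p2, evaluate, grid=None):
    """Return the grid value whose combined lexicon scores highest."""
    if grid is None:
        grid = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    return max(grid, key=lambda lam: evaluate(combine(p1, p2, lam)))
```

With `lam = 1.0` only the parallel-corpus lexicon is used and with `lam = 0.0` only the comparable-corpus one, so the two pure systems are the end points of the search.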

Multilingual merging
• As we used consistent translation matrices and weighting schemes for all languages, only length normalization was performed before merging the scores
• We also extracted a $P_2(t|s)$ translation matrix for English; this amounts to a form of query expansion based on contextual similarity.

Weighting schemes
• For submission:
  – Documents: ltc
  – Query before translation: ntn
  – Query after translation: nnc
• After submission:
  – Documents: Lnu
  – Query: ntn (before), nic (after)
• Measure of association in the context vector:
  – Mutual information
  – Window size: 5
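The three-letter codes follow SMART notation: term-frequency component, document-frequency component, normalization. A minimal sketch of the schemes used for submission is given below, assuming the usual definitions (l = 1 + ln tf, n = none or raw, t = ln(N/df), c = cosine normalization). The function name is hypothetical, and the L and u components of Lnu are not covered.

```python
import math


def smart_weights(tf, df, n_docs, scheme):
    """Weight one term-frequency dict with a three-letter SMART scheme."""
    tf_code, idf_code, norm_code = scheme
    if tf_code not in "ln" or idf_code not in "nt" or norm_code not in "nc":
        raise ValueError("unsupported scheme: " + scheme)
    weights = {}
    for term, count in tf.items():
        if count <= 0:
            continue
        # 'l' dampens the raw count; 'n' keeps it as is.
        w = 1.0 + math.log(count) if tf_code == "l" else float(count)
        if idf_code == "t":
            w *= math.log(n_docs / df[term])
        weights[term] = w
    if norm_code == "c":
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {term: w / norm for term, w in weights.items()}
    return weights
```

Under these definitions, documents would be weighted with `"ltc"`, the source-language query with `"ntn"`, and the translated query with `"nnc"`, so that idf is applied only once on the query side.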

Results (1)
• CCA: failed
  – Only bilingual
  – Based on a small set of Hansard (disjoint from CLEF2003)
  – The training corpus was reduced to 1000 paragraphs to be practically feasible and to provide results on time
  – To be extended in the future

Results (II) – 2000, 2001 and 2002 queries

Results (Details) – 2000, 2001, 2002 queries

[Table: average precision. Columns: ELRA, Parallel, Comparable, Hybrid, Monolingual. Rows: Bilingual (before merging) for ENG, FRE, GER, SPA; Multilingual (after merging).]

Results of hybridization parallel/comparable

[Figure panels: bilingual, multilingual]

Results (details) … after submission
• Mainly focused on changing the weighting scheme (Lnu)
• Average precision (retrieval limited to 1000 documents):

[Table: Setting vs. Average Precision. Settings: ltc/ntn/nnc (submitted); Lnu/ntn/nnc (same tuning as submission); Lnu/ntn/ntc (re-optimised tuning).]

Conclusions
• Clearly, exploiting parallel and comparable corpora to enhance query translation improves CLIR performance
• Relative to the monolingual reference line, there is still room for improvement
• Also, different merge strategies must be investigated