CLEF2003 Forum / August 2003 / Trondheim

Report on CLEF-2003 ML4 experiments
Extracting multilingual resources from corpora

N. Cancedda, H. Dejean, E. Gaussier, J-M Renders (Xerox Research Center Europe)
Alexei Vinokourov (Royal Holloway University of London)

Agenda
- Objective and means
- Linguistic preprocessing
- Methods
  – Canonical Correlation Analysis (CCA) for CLIR
  – Combining lexicons automatically extracted from parallel and comparable corpora
- Results

Objectives and Means
- How to improve the adequacy of existing resources (dictionaries) to translate queries:
  – Coverage?
  – Precision (translation adapted to the corpus)?
- First way: exploit parallel corpora
  – Extract a semantic, language-independent representation
  – Extract bilingual lexicons
- Second way: exploit comparable corpora
  – Extract (probabilistic) translation relationships
  – Must be combined with other translation resources (parallel)

The Task (first participation)
- Multi-lingual 4:
  – English, German, Spanish, French
- Fully automatic approach (no manual processing of the queries)
- Query language:
  – English
- Performance measure:
  – Non-interpolated average precision (not limited to 1000 documents)
  – Macro-average on all queries:
- Before submission (training): from 2000 to 2002 (140 queries)
- After submission (evaluation): from 2001 to 2003
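The performance measure can be sketched in a few lines. This is a minimal illustration of non-interpolated average precision and its macro-average over queries, not the evaluation tool used for the experiments; the function names and the toy document identifiers are assumptions made for the example.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Non-interpolated average precision for a single query.

    Precision is sampled at the rank of every relevant document that is
    retrieved, and the sum is divided by the total number of relevant
    documents (retrieved or not).
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def macro_average_precision(runs):
    """Mean of the per-query average precisions.

    runs: list of (ranked_doc_ids, relevant_ids) pairs, one per query.
    """
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data (hypothetical document ids).
toy_runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = (1/2) / 1
]
toy_map = macro_average_precision(toy_runs)
```

Every query counts equally in the macro-average, regardless of how many relevant documents it has.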

Resources we used
- General dictionary: ELRA (40k entries)
- Parallel corpora:
  – Hansard corpus (for CCA), French-English only
  – JOC corpus (for lexicon extraction), 300,000 sentences
- Comparable corpora:
  – The CLEF2003 corpora

Summary of approaches
- Semantic projection
  – A semantic, language-independent space is extracted from a parallel training corpus
  – Language-dependent projection matrices are built
  – Both documents and queries are projected
  – The standard cosine measure is then used in the new space to perform IR
- Query translation
  – A probabilistic translation matrix is extracted from a parallel training corpus and the comparable CLEF corpus
  – Queries are translated by these translation matrices
  – The standard cosine measure is then used between the original documents and the translated query
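The query translation route can be illustrated with sparse vectors stored as dictionaries. This is a sketch under stated assumptions, not the authors' implementation: the function names, the toy translation probabilities and the toy documents are all hypothetical.

```python
import math


def translate_query(query_weights, translation_probs):
    """Map a source-language query vector into the target language.

    query_weights: {source_term: weight}
    translation_probs: {source_term: {target_term: P(t|s)}}
    Each source weight is spread over its candidate translations.
    """
    translated = {}
    for s, weight in query_weights.items():
        for t, prob in translation_probs.get(s, {}).items():
            translated[t] = translated.get(t, 0.0) + weight * prob
    return translated


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy English-to-French example (made-up probabilities).
toy_probs = {
    "bank": {"banque": 0.8, "rive": 0.2},
    "loan": {"pret": 1.0},
}
toy_query = translate_query({"bank": 1.0, "loan": 1.0}, toy_probs)
finance_doc = {"banque": 1.0, "pret": 1.0}
river_doc = {"rive": 1.0}
scores = {
    "finance_doc": cosine(toy_query, finance_doc),
    "river_doc": cosine(toy_query, river_doc),
}
```

Because the translated query keeps every candidate translation with its probability, an ambiguous source term contributes to several target terms instead of forcing a single choice.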

Linguistic Preprocessing
- Lemmatized and POS-tagged corpora
- Partial segmentation of German compounds (lexicon-based) + some simple heuristics
- Normalization of spelling and accentuation (e.g. umlaut and eszett)
- POS-based word filtering (N, V, AD)
- Single word entries only (for the dictionaries, queries and documents)
  – Note that the adopted approaches for translation are context-dependent to some extent.

CCA for CLIR
- Given a set of paired observations (paired sentences or paragraphs), Canonical Correlation Analysis finds maximally correlated projections
[Figure: paired observations s1, s2, s3 and t1, t2, t3]

CCA for CLIR (II)
- CCA looks for particular combinations of terms that appear to have the same co-occurrence patterns in both languages
- Hypothesis: the only thing both languages have in common is their meaning (conditional independence)
- Then, these (linear) combinations of terms are able to locate the underlying semantics
- Results in language-independent concepts and the corresponding (language-dependent) projection operators
- Both queries and documents are projected
  – Traditional similarity measures (cosine) are then used for retrieval
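For reference, the textbook CCA objective can be written as follows. The notation is an assumption (it does not appear on the slides): x_s and x_t are the term vectors of a paired observation in the two languages, C_ss and C_tt their covariance matrices, and C_st their cross-covariance matrix. The first pair of projection directions maximizes the correlation between the projected observations:

```latex
\rho \;=\; \max_{w_s,\, w_t}\;
\frac{w_s^{\top} C_{st}\, w_t}
     {\sqrt{w_s^{\top} C_{ss}\, w_s}\;\sqrt{w_t^{\top} C_{tt}\, w_t}}
```

Further pairs of directions are obtained in the same way, under the constraint that each new projection is uncorrelated with the earlier ones. Stacking the first k directions as the columns of W_s and W_t gives the language-dependent projection operators: a query q_s is mapped to W_s^T q_s, a document d_t to W_t^T d_t, and the two are compared by cosine in the shared k-dimensional space.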

Extraction of bilingual resources
- Upper bound of the coverage for the CLEF200x English query terms
- Automatically extracted lexicons provide better coverage, but translation accuracy can be degraded
- Use of some form of trade-off between the resources (manual/automatic)

Extracting lexicons from parallel corpora
- Statistical alignment methods:
  – Starting from alignment at the sentence level
  – Iterative Proportional Fitting Procedure (normalizing and restoring consistency in the raw co-occurrence matrix of source/target terms in aligned sentences)
  – Probabilistic translation matrix: P_1(t|s)
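The generic iterative proportional fitting step can be sketched as below. The slides do not say which marginals the co-occurrence matrix is fitted to, so the uniform row and column targets used here are an illustrative assumption, as are the function names and the toy counts.

```python
def iterative_proportional_fitting(counts, row_targets, col_targets, iterations=50):
    """Rescale a non-negative matrix so that its row and column sums
    approach the requested targets (rows = source terms, columns =
    target terms). Rows and columns are rescaled alternately."""
    m = [list(row) for row in counts]
    for _ in range(iterations):
        # Row step: make each row sum match its target.
        for i, row in enumerate(m):
            total = sum(row)
            if total > 0:
                factor = row_targets[i] / total
                m[i] = [x * factor for x in row]
        # Column step: make each column sum match its target.
        for j in range(len(col_targets)):
            total = sum(row[j] for row in m)
            if total > 0:
                factor = col_targets[j] / total
                for row in m:
                    row[j] *= factor
    return m


def row_normalize(m):
    """Turn each row into a distribution, read here as P(t|s)."""
    result = []
    for row in m:
        total = sum(row)
        result.append([x / total if total > 0 else 0.0 for x in row])
    return result


# Toy co-occurrence counts for 2 source terms x 2 target terms.
raw_counts = [[8.0, 2.0], [3.0, 7.0]]
fitted = iterative_proportional_fitting(raw_counts, [1.0, 1.0], [1.0, 1.0])
p1 = row_normalize(fitted)
```

The alternating rescaling keeps the association pattern of the raw counts while preventing very frequent terms from dominating every row and column.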

Extracting lexicons from comparable corpora
- Assumption: if two words are mutual translations, their most frequent collocates are likely to be mutual translations as well
- Corresponding method:
  – Build context vectors for source words s: CV(s)
  – Build context vectors for target words t: CV(t)
  – Translate the context vectors using a standard dictionary (as a bootstrap): TR(CV(t))
  – Compute the similarity between s and t by cos(CV(s), TR(CV(t)))
  – Normalize the similarities to yield a probabilistic translation lexicon P_2(t|s)
  – NB: CVs are based on windows centered on s or t, and weighted by some association measure (such as Mutual Information); the word itself is included in the CV → bias for dictionary entries
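The steps of the comparable-corpus method can be sketched on a toy example. This is a simplified illustration, not the authors' code: raw co-occurrence counts stand in for the association measure, the window is reduced to one token on each side, and the corpora, the seed dictionary and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=1):
    """Count co-occurrences within +/- window tokens of each word.
    The word itself is counted too, as noted on the slide."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                vectors[word][tokens[j]] += 1
    return vectors


def translate_vector(vector, seed_dictionary):
    """TR(CV(t)): map a target-language context vector into the source
    language using a seed dictionary {target_word: [source_words]}."""
    translated = Counter()
    for t_word, weight in vector.items():
        for s_word in seed_dictionary.get(t_word, []):
            translated[s_word] += weight
    return translated


def cosine(u, v):
    dot = sum(w * v.get(term, 0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def translation_lexicon(source_cv, target_cv, seed_dictionary):
    """Normalize cos(CV(s), TR(CV(t))) over t to obtain P(t|s)."""
    translated = {t: translate_vector(cv, seed_dictionary) for t, cv in target_cv.items()}
    lexicon = {}
    for s, cv_s in source_cv.items():
        sims = {t: cosine(cv_s, tr) for t, tr in translated.items()}
        total = sum(sims.values())
        if total > 0:
            lexicon[s] = {t: sim / total for t, sim in sims.items() if sim > 0}
    return lexicon


# Toy comparable corpora; "impot" is missing from the seed dictionary.
english = [["government", "raises", "tax"], ["government", "cuts", "budget"]]
french = [["gouvernement", "augmente", "impot"], ["gouvernement", "reduit", "budget"]]
seed = {
    "gouvernement": ["government"],
    "augmente": ["raises"],
    "reduit": ["cuts"],
    "budget": ["budget"],
}
p2 = translation_lexicon(context_vectors(english), context_vectors(french), seed)
best_for_tax = max(p2["tax"], key=p2["tax"].get)
best_for_government = max(p2["government"], key=p2["government"].get)
```

In this toy run "tax" is matched to "impot" through shared context alone, while "government" is matched to "gouvernement" with maximal similarity because the word itself sits in its own context vector, which is the bias towards dictionary entries mentioned in the NB.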

Hybrid Method: model combination
- In some cases, the information provided by the comparable corpus is more reliable; in other cases, the information extracted from the parallel one is best.
- We adopted a simple linear combination scheme, but more elaborate approaches exist

  $q_t = (\alpha\, P_1(t|s) + (1-\alpha)\, P_2(t|s))\, q_s$

- We optimized $\alpha$ on the queries 2000-2002 (performance measure: average precision)
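The linear combination and the tuning of its mixing weight can be sketched as follows. The weight is called alpha here, the grid of candidate values is an assumption, and `evaluate` is a hypothetical callback standing in for a full retrieval run scored by average precision on the training queries.

```python
def combine_lexicons(p1, p2, alpha):
    """Return alpha * P1(t|s) + (1 - alpha) * P2(t|s).

    Lexicons are nested dicts {source_term: {target_term: probability}};
    a missing entry counts as probability 0.
    """
    combined = {}
    for s in set(p1) | set(p2):
        row1, row2 = p1.get(s, {}), p2.get(s, {})
        combined[s] = {
            t: alpha * row1.get(t, 0.0) + (1.0 - alpha) * row2.get(t, 0.0)
            for t in set(row1) | set(row2)
        }
    return combined


def tune_alpha(p1, p2, evaluate, grid=None):
    """Grid search: keep the alpha whose combined lexicon scores best.

    evaluate: callable taking a combined lexicon and returning a score
    (for instance mean average precision on held-out queries).
    """
    if grid is None:
        grid = [i / 10 for i in range(11)]
    return max(grid, key=lambda a: evaluate(combine_lexicons(p1, p2, a)))


# Toy lexicons (made-up probabilities).
parallel_lex = {"tax": {"impot": 1.0}}
comparable_lex = {"tax": {"impot": 0.5, "taxe": 0.5}}
mixed = combine_lexicons(parallel_lex, comparable_lex, 0.5)
```

With alpha = 1 only the parallel-corpus lexicon is used, and with alpha = 0 only the comparable-corpus one.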

Multilingual merging
- As we used consistent translation matrices and weighting schemes for all languages, only length normalization was performed before merging the scores
- We also extracted a P_2(t|s) translation matrix for English; this realizes some kind of query expansion based on contextual similarity.
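One way to read this merging step is a plain raw-score merge: if the per-language scores are already length-normalized (cosine values) and therefore comparable, the ranked lists can simply be pooled and sorted. The sketch below illustrates that reading only; the function name, the cutoff parameter and the toy runs are assumptions.

```python
def merge_runs(runs_by_language, k=1000):
    """Pool per-language result lists and keep the k best scores.

    runs_by_language: {language: [(doc_id, score), ...]}
    Scores are assumed to be comparable across languages.
    Returns a list of (doc_id, language, score), best first.
    """
    pooled = [
        (score, language, doc_id)
        for language, run in runs_by_language.items()
        for doc_id, score in run
    ]
    pooled.sort(key=lambda item: item[0], reverse=True)
    return [(doc_id, language, score) for score, language, doc_id in pooled[:k]]


# Toy runs (hypothetical document ids and scores).
toy_runs_by_language = {
    "fr": [("f1", 0.9), ("f2", 0.2)],
    "de": [("g1", 0.5)],
}
merged_top2 = merge_runs(toy_runs_by_language, k=2)
```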

Weighting schemes
- For submission:
  – Documents: ltc
  – Query before translation: ntn
  – Query after translation: nnc
- After submission:
  – Documents: Lnu
  – Query: ntn (before), nic (after)
- Measure of association in the context vector:
  – Mutual information
  – Window size: 5
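Assuming the three-letter codes follow the usual SMART convention (term-frequency component, document-frequency component, normalization), the schemes ltc, ntn and nnc can be sketched as below. Only the letters n, l, t and c are covered; Lnu is left out because its pivoted normalization needs extra parameters that the slides do not give. The function name and the toy statistics are hypothetical.

```python
import math


def smart_weights(term_freqs, doc_freqs, n_docs, scheme):
    """Weight a bag of words with a three-letter SMART code.

    term_freqs: {term: raw frequency in the document or query}
    doc_freqs:  {term: number of documents containing the term}
    Supported letters (an assumption about the notation):
      tf:   'n' raw frequency, 'l' 1 + log(tf)
      idf:  'n' none,          't' log(N / df)
      norm: 'n' none,          'c' cosine (unit length)
    """
    tf_code, idf_code, norm_code = scheme
    weights = {}
    for term, tf in term_freqs.items():
        if tf <= 0:
            continue
        if tf_code == "n":
            w = float(tf)
        elif tf_code == "l":
            w = 1.0 + math.log(tf)
        else:
            raise ValueError("unsupported tf code: " + tf_code)
        if idf_code == "t":
            w *= math.log(n_docs / doc_freqs[term])
        elif idf_code != "n":
            raise ValueError("unsupported idf code: " + idf_code)
        weights[term] = w
    if norm_code == "c":
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
    elif norm_code != "n":
        raise ValueError("unsupported normalization code: " + norm_code)
    return weights


# Toy collection statistics (made up).
toy_df = {"tax": 10, "budget": 50}
doc_ltc = smart_weights({"tax": 3, "budget": 1}, toy_df, 100, "ltc")
query_ntn = smart_weights({"tax": 2}, toy_df, 100, "ntn")
query_nnc = smart_weights({"tax": 3, "budget": 4}, toy_df, 100, "nnc")
```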

Results (1)
- CCA: failed
  – Only bilingual
  – Based on a small set of Hansard (disjoint from CLEF2003)
  – The training corpus was reduced to 1000 paragraphs to be practically feasible and to provide results on time
  – To be extended in the future

Results (II) – 2000, 2001 and 2002 queries

Results (Details) – 2000, 2001, 2002 queries

Average precision             ELRA    Parallel  Comparable  Hybrid  Monolingual
Bilingual (before merging)    0.29    0.365     0.228       0.388   0.444
Multilingual (after merging)  0.192   0.289     0.165       0.302   0.361
ENG                           0.35    -         0.364       0.378   0.363
FRE                           0.271   0.362     0.188       0.389   0.449
GER                           0.276   0.361     0.203       0.38    0.475
SPA                           0.304   0.411     0.221       0.431   0.439

Results of hybridization parallel/comparable
[Figure: two panels, bilingual and multilingual]

Results (details) … after submission
- Mainly focused on changing the weighting scheme (Lnu)
- Average precision (retrieval limited to 1000 documents):

Setting                              Average precision
ltc/ntn/nnc (submitted)              0.1860
Lnu/ntn/nnc (same tuning as subm.)   0.2118
Lnu/ntn/ntc (re-optimised tuning)    0.2341

Conclusions
- Clearly, exploiting parallel and comparable corpora to enhance query translation improves CLIR performance
- Compared with the monolingual reference line, there is still room for improvement
- Also, different merging strategies must be investigated
