1
Identifying Translations
Philip Resnik, Noah Smith
University of Maryland
2
Reasons to identify translations
Locating parallel text on the Web
Filtering out poor-quality translations
Cross-language duplicate detection/caching
3
Identifying translations using structure
STRAND (Resnik, 1999)

Comparison      N     %      κ
J1,J2           267   0.98   0.95
J1,STRAND       273   0.88   0.70
J2,STRAND       315   0.88   0.69
J1J2,STRAND     261   0.90   0.75
4
Related Work
Web mining for parallel text (Nie et al. 1999)
Sentence alignment (Fluhr et al. 2000)
Duplicate detection (e.g. Broder et al. 1997)
5
Translational Equivalence as a Function over Sets
Broder et al. (1997): document representation as a set of “shingles” S(D)
$$r(D_1, D_2) = \frac{|S(D_1) \cap S(D_2)|}{|S(D_1) \cup S(D_2)|}$$
Cross-language generalization: partial equality $e =_t f$ with confidence value $t(e,f)$, used to define $\cap_t$ and $\cup_t$
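The monolingual resemblance measure on this slide can be sketched as follows. This is a minimal illustration, assuming whitespace-tokenized documents and word-level shingles of width `w`; the function names and the default width are hypothetical choices, not taken from the slides. The cross-language versions of intersection and union are not defined in enough detail here to implement, so the sketch covers only the exact-equality case.

```python
def shingles(tokens, w=3):
    """Return the set of contiguous w-token shingles of a token sequence.

    Assumption: documents shorter than w tokens yield one shingle
    holding the whole document (or no shingles if empty).
    """
    if len(tokens) < w:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(tokens1, tokens2, w=3):
    """r(D1, D2) = |S(D1) & S(D2)| / |S(D1) | S(D2)| over shingle sets."""
    s1, s2 = shingles(tokens1, w), shingles(tokens2, w)
    union = s1 | s2
    if not union:
        return 0.0  # both documents empty: define resemblance as 0
    return len(s1 & s2) / len(union)


d1 = "a b c d".split()
d2 = "a b c e".split()
print(resemblance(d1, d2, w=2))
```

With width 2, the two toy documents share two of four distinct shingles, so the score is one half.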
6
Ways of computing equivalence
Bilingual dictionaries
  – t(e,f) = 1 if (e,f) present in dictionary, 0 otherwise
Translation model (Melamed 2000, model A)
  – t(e,f) = Pr(e,f)
String similarity for cognates
  – t(e,f) = Longest common substring ratio (LCSR) variant
  – Trained on non-zero entries in translation model
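Two of these scoring functions can be sketched in a few lines. The slide does not say which LCSR variant was used, so the code below is an assumption: it implements the form of LCSR common in the cognate literature, the length of the longest common subsequence divided by the length of the longer string. The names `dictionary_score`, `lcs_length`, and `lcsr` are hypothetical, and the trained variant mentioned on the slide is not reproduced.

```python
def dictionary_score(e, f, dictionary):
    """t(e, f) = 1 if the pair (e, f) is in the bilingual dictionary, else 0."""
    return 1.0 if (e, f) in dictionary else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """Assumed LCSR form: LCS length over the length of the longer string."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


toy_dictionary = {("house", "maison")}  # illustrative entry only
print(dictionary_score("house", "maison", toy_dictionary))
print(lcsr("government", "gouvernement"))
```

A cognate pair such as "government" and "gouvernement" scores high under this measure, while unrelated strings score near zero.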
7
Evaluation task
Given segmented corpus C1 in L1, C2 in L2
  – Assume each segment has 0 or 1 translation equivalents
  – Match up the equivalents
Equivalent to maximum bipartite matching problem
  – Exhaustive solution available for small sets
  – Approximated using competitive linking (Melamed)
True equivalence pairs give precision/recall curve
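The greedy approximation named above can be sketched as follows. This is a simplified reading of competitive linking: repeatedly link the highest-scoring pair whose two segments are both still unlinked. The function name, the `threshold` parameter, and the score dictionary layout are assumptions for illustration and do not come from the slides.

```python
def competitive_linking(scores, threshold=0.0):
    """Greedy one-to-one matching over scored segment pairs.

    scores maps (i, j) -> similarity between segment i of C1 and
    segment j of C2. Pairs scoring at or below threshold are never
    linked, which lets a segment end up with zero equivalents.
    """
    links = []
    used_src, used_tgt = set(), set()
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j), score in ranked:
        if score <= threshold:
            break  # everything after this point scores too low
        if i in used_src or j in used_tgt:
            continue  # one side is already linked
        links.append((i, j, score))
        used_src.add(i)
        used_tgt.add(j)
    return links


toy_scores = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.85, (1, 1): 0.1}
print(competitive_linking(toy_scores))
```

On the toy scores the greedy pass links (0, 0) and then (1, 1), for a total of 1.0, whereas the exhaustive optimum links (0, 1) and (1, 0) for 1.65. That gap is what makes this an approximation to maximum bipartite matching rather than an exact solution.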
8
Some results: sentence matching
Task corpora:
  – Chinese-English: Hong Kong Laws sentences
    5622 training sentences, 191 test sentences
  – Spanish-English: U.N. Parallel Corpus
    4695 training sentences, 200 test sentences
[Figure: results panels for English-Chinese and English-Spanish]
9
Some results: document matching
Task corpora:
  – 232 English-French Web documents
10
New directions
Exploiting the Internet Archive
  – 100-200 million pages (4TB) on disk
  – Exhaustive URL matching within site
  – STRAND now adapted for disk-based access
Combining structure and content
  – Improving document-level matching
  – Selecting good chunks within documents