Identifying Translations
  Philip Resnik, Noah Smith
  University of Maryland
Reasons to identify translations
  Locating parallel text on the Web
  Filtering out poor-quality translations
  Cross-language duplicate detection/caching
Identifying translations using structure
  STRAND (Resnik, 1999)
  [Table: Comparison (J1J2, STRAND; J2, STRAND; J1, STRAND; J1, J2) against κ, %, N]
Related Work
  Web mining for parallel text (Nie et al. 1999)
  Sentence alignment (Fluhr et al. 2000)
  Duplicate detection (e.g. Broder et al. 1997)
Translational Equivalence as a Function over Sets
  Broder et al. (1997): document representation as a set of “shingles” S(D)
    $r(D_1, D_2) = \frac{|S(D_1) \cap S(D_2)|}{|S(D_1) \cup S(D_2)|}$
  Cross-language generalization: partial equality $e =_t f$ with confidence value $t(e,f)$, used to define $\cap_t$ and $\cup_t$
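The resemblance measure above can be sketched in a few lines. This is a minimal illustration covering only the monolingual case with exact shingle equality. The names `shingles` and `resemblance`, the whitespace tokenization, and the shingle width `w=3` are assumptions made for the example, not details from the slides.

```python
def shingles(text, w=3):
    """Return the set of contiguous w-token shingles of a text.

    Tokenization by lowercasing and whitespace splitting is an
    assumption made for this sketch.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(d1, d2, w=3):
    """r(D1, D2) = |S(D1) & S(D2)| / |S(D1) | S(D2)|."""
    s1, s2 = shingles(d1, w), shingles(d2, w)
    union = s1 | s2
    if not union:
        # Both documents empty: define resemblance as 0 (a choice of this sketch).
        return 0.0
    return len(s1 & s2) / len(union)


r = resemblance("a rose is a rose is a rose",
                "a rose is a flower which is a rose")
```

Because the documents are reduced to sets, repeated shingles count once, so `r` depends only on which shingles occur and not on how often.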
Ways of computing equivalence
  Bilingual dictionaries
    – t(e,f) = 1 if (e,f) present in dictionary, 0 otherwise
  Translation model (Melamed 2000, Model A)
    – t(e,f) = Pr(e,f)
  String similarity for cognates
    – t(e,f) = Longest common substring ratio (LCSR) variant
    – Trained on non-zero entries in translation model
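Two of these scoring functions are easy to sketch. The block below shows a dictionary indicator and a plain LCSR score, assuming the commonly used definition of LCSR as the length of the longest common subsequence divided by the length of the longer string. The trained "variant" mentioned above is not reproduced here. The function names and the representation of the dictionary as a set of pairs are hypothetical.

```python
def dictionary_score(e, f, dictionary):
    """t(e, f) = 1 if the pair is in the bilingual dictionary, else 0.

    `dictionary` is assumed to be a set of (e, f) tuples.
    """
    return 1.0 if (e, f) in dictionary else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(e, f):
    """Plain LCSR: LCS length over the length of the longer string."""
    if not e or not f:
        return 0.0
    return lcs_length(e, f) / max(len(e), len(f))


toy_dictionary = {("house", "maison")}  # hypothetical entry
d = dictionary_score("house", "maison", toy_dictionary)
c = lcsr("government", "gouvernement")
```

A cognate score like this gives graded values for word pairs the dictionary has never seen, which is what makes it useful as a complement to the 0/1 dictionary lookup.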
Evaluation task
  Given segmented corpus C1 in L1, C2 in L2
    – Assume each segment has 0 or 1 translation equivalents
    – Match up the equivalents
  Equivalent to maximum bipartite matching problem
    – Exhaustive solution available for small sets
    – Approximated using competitive linking (Melamed)
  True equivalence pairs give precision/recall curve
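The greedy approximation can be sketched as follows: sort all candidate pairs by score, then repeatedly link the best remaining pair whose two segments are both still unlinked. This is a simplified reading of competitive linking. The score dictionary, the `threshold` parameter, and the helper `precision_recall` are assumptions of the example; sweeping the threshold is one plausible way to trace out a precision/recall curve.

```python
def competitive_linking(scores, threshold=0.0):
    """Greedy one-to-one matching.

    scores: dict mapping (left_id, right_id) -> similarity score.
    Pairs scoring at or below `threshold` are never linked.
    """
    linked_left, linked_right = set(), set()
    links = []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (left, right), score in ranked:
        if score <= threshold:
            break  # all remaining pairs score no higher
        if left in linked_left or right in linked_right:
            continue  # one side already has a partner
        links.append((left, right))
        linked_left.add(left)
        linked_right.add(right)
    return links


def precision_recall(predicted, gold):
    """Precision and recall of predicted pairs against true pairs.

    Returns 0.0 for an undefined ratio (a convention of this sketch).
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy scores between segments e1..e3 and f1..f3 (made-up values).
toy_scores = {
    ("e1", "f1"): 0.9, ("e1", "f2"): 0.8,
    ("e2", "f1"): 0.7, ("e2", "f2"): 0.1,
    ("e3", "f3"): 0.05,
}
toy_gold = {("e1", "f1"), ("e2", "f2")}
links = competitive_linking(toy_scores, threshold=0.5)
p, r = precision_recall(links, toy_gold)
```

On these toy scores with `threshold=0.0`, the greedy procedure links e1 with f1 and e2 with f2 for a total score of 1.0, whereas linking e1 with f2 and e2 with f1 would total 1.5. That gap is why the procedure is an approximation to the maximum-weight matching rather than a solution of it.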
Some results: sentence matching
  Task corpora:
    – Chinese-English: Hong Kong Laws sentences (5622 training sentences, 191 test sentences)
    – Spanish-English: U.N. Parallel Corpus (4695 training sentences, 200 test sentences)
  [Figure panels: English-Chinese, English-Spanish]
Some results: document matching
  Task corpora:
    – 232 English-French Web documents
New directions
  Exploiting the Internet Archive
    – million pages (4 TB) on disk
    – Exhaustive URL matching within site
    – STRAND now adapted for disk-based access
  Combining structure and content
    – Improving document-level matching
    – Selecting good chunks within documents