Published by Debra Farmer. Modified over 9 years ago.
1
An Effective Approach for Searching Closest Sentence Translations from the Web
Ju Fan, Guoliang Li, and Lizhu Zhou
Database Research Group, Tsinghua University
DASFAA 2011 – Apr. 23, Hong Kong
2
Outline: Introduction; Overview of Our Approach; Phrase-Based Similarity Model; Phrase Selection; Experiments; Conclusion
10/13/2015 SCST@DASFAA 2011
3
Outline: Introduction; Overview of Our Approach; Phrase-Based Similarity Model; Phrase Selection; Experiments; Conclusion
4
Background
Parallel sentences on the Web
▪ Sentences with a well-translated counterpart
▪ An English-to-Chinese example:
  Obama said he hopes to get Congress to approve it next year
  奥巴马总统说他争取让希望国会明年批准该协议。
  -- blog.hjenglish.com
A rich source for translation
Commercial Systems
5
Background
▪ Web → Parallel Sentence Discovery and Extraction → Parallel Sentence Database: Sen 1 (E-C), Sen 2 (E-C), Sen 3 (E-C), ……, Sen n (E-C)
▪ Parallel sentences, e.g., "The result is good" / 结果很好
▪ Sentence-Level Translation Aid: Query Sentence (English) → Sentence Matching → Closest Sentences with Translation
Research Issue: an effective similarity model between sentences in the source language (e.g., English sentences)
6
Motivation
Existing approaches:
▪ Word-based, e.g., translation model, edit distance, …
▪ Gram-based, e.g., N-gram, V-gram
▪ All subsequences of a sentence
Limitations (in the order listed on the slide):
▪ Cannot capture the order of words
▪ Do not consider syntactic information
▪ Too expensive
We propose a phrase-based similarity model that uses:
1. Syntactic information
2. Frequency information
3. Lengths of phrases
7
Outline: Introduction; Overview of Our Approach; Phrase-Based Similarity Model; Phrase Selection; Experiments; Conclusion
8
Problem Definition
▪ Data: a database of parallel sentences (Sentence 1: English - Chinese; Sentence 2: English - Chinese; Sentence 3: English - Chinese; …)
▪ Query: a query sentence (English) from a translator
▪ Answer: sentences with their translations
9
Phrase-Based Sentence Matching
[Diagram: offline, Phrase Selection turns the Parallel Sentences into a Phrase Database, so each sentence s has phrases f'_1, f'_2, ……, f'_n; online, Phrase Selection turns the query q into phrases f_1, f_2, ……, f_n; the Similarity Model compares the phrases of q and s]
10
Outline: Introduction; Overview of Our Approach; Phrase-Based Similarity Model; Phrase Selection; Experiments; Conclusion
11
Phrase-Based Similarity Model
[Diagram: offline, Phrase Selection turns the Parallel Sentences into a Phrase Database, so each sentence s has phrases f'_1, f'_2, ……, f'_n; online, Phrase Selection turns the query q into phrases f_1, f_2, ……, f_n; the Similarity Model compares the phrases of q and s]
12
Similarity Model
sim(q, s) = ∑_{f ∈ F_q ∩ F_s} φ(q, f) · φ(s, f) · w(f)
▪ q: query sentence, with phrase set F_q = {f_1, f_2, f_3, ……, f_m}
▪ s: a sentence in the DB, with phrase set F_s = {f'_1, f'_2, f'_3, ……, f'_n}
▪ f ∈ F_q ∩ F_s: shared phrases
▪ φ(q, f): syntactic importance of f to q
▪ φ(s, f): syntactic importance of f to s
▪ w(f): weight of f (IDF)
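A minimal Python sketch of the sum sim(q, s) = ∑ φ(q, f) · φ(s, f) · w(f) over the shared phrases f ∈ F_q ∩ F_s. The function name, the dictionary representation, and all numeric values are assumptions made for illustration; the slides do not specify an implementation.

```python
from typing import Dict


def similarity(phi_q: Dict[str, float],
               phi_s: Dict[str, float],
               w: Dict[str, float]) -> float:
    """Sum phi(q, f) * phi(s, f) * w(f) over the phrases shared by q and s.

    phi_q and phi_s map each phrase of a sentence to its syntactic
    importance; w maps a phrase to its weight (e.g., an IDF value).
    """
    shared = phi_q.keys() & phi_s.keys()  # F_q ∩ F_s
    return sum(phi_q[f] * phi_s[f] * w.get(f, 0.0) for f in shared)


# Hypothetical importances and weights, chosen only to exercise the formula.
phi_q = {"he eat apple": 0.5, "he have": 0.2}
phi_s = {"he eat apple": 0.4, "red apple": 0.9}
w = {"he eat apple": 2.0, "he have": 0.1, "red apple": 1.5}

score = similarity(phi_q, phi_s, w)  # only "he eat apple" is shared
```

Only phrases present in both sentences contribute, so a sentence that shares no phrase with the query scores 0.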
13
Syntactic Importance of Phrases
φ(q, f) = Π_m α_m · Π_g β_g
▪ α_m: syntactic weight of a matched term
▪ β_g: penalty (constant) for a term in a gap
▪ d: a decay factor
Example: sentence q = "He has eaten an apple"; phrase f = "he eaten apple"; gap terms: "has", "an"
[Figure: dependency tree of q over the terms eaten, he, apple, has, an; the weights shown on the tree are α_0, d·α_0, d²·α_0]
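The product φ(q, f) = Π_m α_m · Π_g β_g can be sketched in Python as follows. Several things here are assumptions, not taken from the slides: the weight of a matched term is modeled as α_0 · d^depth using its depth in the dependency tree, the depths in the example are written by hand (the tree layout is not recoverable from the slide text), and the default values of alpha0, d, and beta are arbitrary.

```python
from typing import Dict, List


def syntactic_importance(sentence: List[str],
                         phrase: List[str],
                         depth: Dict[str, int],
                         alpha0: float = 1.0,
                         d: float = 0.5,
                         beta: float = 0.8) -> float:
    """phi = product of matched-term weights times product of gap penalties."""
    # Locate the phrase terms in the sentence, in order (a subsequence match).
    positions = []
    start = 0
    for term in phrase:
        idx = sentence.index(term, start)  # ValueError if the phrase does not match
        positions.append(idx)
        start = idx + 1

    matched = set(positions)
    score = 1.0
    for i in positions:
        # Assumed form of alpha_m: base weight decayed by tree depth.
        score *= alpha0 * d ** depth[sentence[i]]
    for i in range(positions[0], positions[-1] + 1):
        if i not in matched:
            score *= beta  # constant penalty for each term in a gap
    return score


q = ["he", "has", "eaten", "an", "apple"]
f = ["he", "eaten", "apple"]
# Hypothetical depths in the dependency tree (root "eaten" at depth 0).
depth = {"eaten": 0, "he": 1, "has": 1, "apple": 1, "an": 2}

phi = syntactic_importance(q, f, depth)
```

With these assumed values, the matched terms contribute 0.5 · 1 · 0.5 and the two gap terms ("has", "an") contribute 0.8 each.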
14
Features of the Similarity Model
More General
▪ Subsumes Jaccard, Cosine similarity, …
Syntactic Information
▪ Weight of matched terms
▪ Weight of terms in the gap
Frequency Information
▪ Weight of phrases
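One way to see the "subsumes cosine similarity" claim, as an illustration that is not taken from the slides: if every phrase is a single term, φ(x, f) is the term frequency of f in x divided by the Euclidean norm of x, and w(f) = 1, then the general sum over shared phrases equals the cosine similarity. The helper names below are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def cosine(a: List[str], b: List[str]) -> float:
    """Plain cosine similarity over term-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)


def phi_tf(tokens: List[str]) -> Dict[str, float]:
    """Assumed phi: term frequency normalized by the vector's length."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: c / norm for t, c in counts.items()}


def general_sim(phi_q: Dict[str, float],
                phi_s: Dict[str, float],
                w: Callable[[str], float]) -> float:
    """Sum of phi(q, f) * phi(s, f) * w(f) over shared phrases."""
    return sum(phi_q[f] * phi_s[f] * w(f) for f in phi_q.keys() & phi_s.keys())


q = "he has eaten an apple".split()
s = "he has a pencil".split()

lhs = general_sim(phi_tf(q), phi_tf(s), lambda f: 1.0)
rhs = cosine(q, s)
```

Under these choices `lhs` and `rhs` agree up to floating-point rounding.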
15
Outline: Introduction; Overview of Our Approach; Phrase-Based Similarity Model; Phrase Selection; Experiments; Conclusion
16
High-Quality Phrase Selection
[Diagram: offline, Phrase Selection turns the Parallel Sentences into a Phrase Database, so each sentence s has phrases f'_1, f'_2, ……, f'_n; online, Phrase Selection turns the query q into phrases f_1, f_2, ……, f_n; the Similarity Model compares the phrases of q and s]
17
High-Quality Phrase
Extend grams by allowing discontinuous terms
A heuristic for selecting phrases
▪ Gap constraint: syntactic relationship of discontinuous terms
▪ Frequency constraint: infrequent (large IDF)
▪ Maximum constraint: 1) not a prefix; 2) max. length
Example: sentence q = "He has eaten an apple"; phrase "he eaten apple" (discontinuous terms joined by syntactic relationships)
Frequency: # of sentences in the DB having the phrase
18
Phrase Selection
Selecting phrases with gap and maximum constraints
Example: sentence s = "He ate a red apple"; sentence graph over the terms he, eat, red, apple
Edges of the sentence graph:
1) Sequential relationship
2) Syntactic relationship
Longest path from a node = a phrase satisfying the gap constraint and the maximum constraint
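The "longest path from a node" idea can be sketched in Python as below. This is a simplified reading under stated assumptions: terms are graph nodes in sentence order, sequential edges link adjacent terms, syntactic edges are supplied by the caller as forward index pairs (the pairs in the example are hypothetical, not parser output), and the maximum constraint is modeled only as a cap on phrase length. The "not a prefix" condition is not handled.

```python
from typing import Dict, List, Set, Tuple


def longest_phrases(terms: List[str],
                    syntactic_edges: List[Tuple[int, int]],
                    max_len: int) -> List[str]:
    """Return, for every start term, the longest paths of at most max_len terms."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    n = len(terms)
    succ: Dict[int, Set[int]] = {i: set() for i in range(n)}
    for i in range(n - 1):
        succ[i].add(i + 1)                  # sequential relationship
    for i, j in syntactic_edges:
        if not 0 <= i < j < n:
            raise ValueError("syntactic edges must point forward in the sentence")
        succ[i].add(j)                      # syntactic relationship

    def longest_from(i: int, budget: int) -> List[List[int]]:
        # All longest paths that start at node i and use at most `budget` nodes.
        if budget == 1 or not succ[i]:
            return [[i]]
        best: List[List[int]] = []
        for j in sorted(succ[i]):
            for tail in longest_from(j, budget - 1):
                path = [i] + tail
                if not best or len(path) > len(best[0]):
                    best = [path]
                elif len(path) == len(best[0]):
                    best.append(path)
        return best

    phrases = []
    for start in range(n):
        for path in longest_from(start, max_len):
            phrases.append(" ".join(terms[k] for k in path))
    return phrases


terms = ["he", "eat", "red", "apple"]       # terms of "He ate a red apple"
edges = [(0, 1), (1, 3)]                    # hypothetical: he-eat, eat-apple
phrases = longest_phrases(terms, edges, max_len=3)
```

With the length cap of 3, the syntactic edge between "eat" and "apple" lets the discontinuous phrase "he eat apple" appear next to the contiguous "he eat red".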
19
Phrase Selection
Select phrases with frequency constraint (threshold = 2)
Sentences in the DB: "He has an apple"; "He ate a red apple"; "He has a pencil"; "He has"
Use a frequency trie; prune frequent phrases
[Figure: frequency trie over the terms he, have, eat, red, apple, pencil, with "#" end markers; node labels carry frequencies, e.g., N0(8), N1(4), N2(3), N4(1), N5(0), N9(1), N11(1), N13(1), N14(0), N15(1), N27(1), N28(0), N29(0)]
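A small Python sketch of counting phrase frequencies in a trie and keeping only infrequent phrases. The class names, the per-sentence candidate phrases, and the rule "keep a phrase when its frequency is at most max_freq" are assumptions; the slide does not state whether the threshold is inclusive, and the node counts in its figure are not reproduced here.

```python
from typing import Dict, Iterable, List, Tuple

Phrase = Tuple[str, ...]


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count = 0  # number of sentences containing the phrase ending here


class FrequencyTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def add_sentence(self, phrases: Iterable[Phrase]) -> None:
        """Count each distinct phrase of one sentence once."""
        for phrase in set(phrases):
            node = self.root
            for term in phrase:
                node = node.children.setdefault(term, TrieNode())
            node.count += 1

    def frequency(self, phrase: Phrase) -> int:
        node = self.root
        for term in phrase:
            if term not in node.children:
                return 0
            node = node.children[term]
        return node.count

    def infrequent(self, max_freq: int) -> List[Phrase]:
        """Phrases seen at least once and at most max_freq times (assumed rule)."""
        kept: List[Phrase] = []
        stack: List[Tuple[Phrase, TrieNode]] = [((), self.root)]
        while stack:
            prefix, node = stack.pop()
            if 0 < node.count <= max_freq:
                kept.append(prefix)
            for term, child in node.children.items():
                stack.append((prefix + (term,), child))
        return kept


# Hypothetical candidate phrases for the four sentences on the slide.
trie = FrequencyTrie()
trie.add_sentence([("he", "have"), ("he", "have", "apple")])
trie.add_sentence([("he", "eat"), ("he", "eat", "red", "apple")])
trie.add_sentence([("he", "have"), ("he", "have", "pencil")])
trie.add_sentence([("he", "have")])

kept = trie.infrequent(max_freq=2)
```

Here ("he", "have") occurs in three sentences and is pruned, while the rarer phrases are kept.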
20
Outline: Introduction; Overview of Our Approach; Phrase-Based Similarity Model; Phrase Selection; Experiments; Conclusion
21
Experiment Setup
Data Sets
▪ D_I: 520,899 parallel sentences from ICIBA
▪ D_C: 800,000 parallel sentences from CNKI
Baseline Methods
▪ Jaccard Coefficient, Edit Distance, Cosine Similarity
▪ Translation Model Methods (TM)
▪ Cosine Similarity with VGRAM
22
Experiment Setup
Evaluation Metrics
▪ BLEU
  ◦ A well-known metric for machine translation
  ◦ Example: q: "He has eaten an apple" (reference translation: 他吃了一个苹果); s: "He has a pencil" (translation: 他有一支铅笔); BLEU compares the translation with the reference translation
▪ Precision
  ◦ A user study to label whether the translations are useful
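To make the BLEU comparison concrete, here is a simplified single-reference, sentence-level BLEU in Python: the geometric mean of clipped n-gram precisions multiplied by the brevity penalty. It is a sketch, not the evaluation script used in the experiments; the character-level tokenization of the Chinese sentences and the choice of max_n = 1 in the example are assumptions, and no smoothing is applied.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Single-reference BLEU without smoothing (returns 0 if any precision is 0)."""
    if not candidate:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_sum += math.log(clipped / total) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_sum)


reference = list("他吃了一个苹果")   # reference translation of q, one token per character
translation = list("他有一支铅笔")   # translation attached to s

bleu1 = simple_bleu(translation, reference, max_n=1)
```

The two translations share only the characters 他 and 一, so the unigram score is low; with higher-order n-grams and no smoothing this pair would score 0.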
23
Effects of Phrase Selection
[Figures: effect of max. length on D_I; effect of freq. threshold on D_C]
24
Comparison with Similarity Models
[Figure: comparison on the D_I data set]
25
Comparison with Existing Methods
[Figure: comparison on the D_C data set]
26
User Studies
Methods used in commercial systems
[Figure: comparison on the D_I data set]
27
Outline: Introduction; Overview of Our Approach; Phrase-Based Similarity Model; Phrase Selection; Experiments; Conclusion
28
Conclusion
▪ Searching closest sentence translations from the Web
▪ A phrase-based sentence similarity model
▪ High-quality phrase selection methods
▪ Extensive experiments and user studies
29
Thanks
My Homepage: http://dbgroup.cs.tsinghua.edu/fanju
30
Frequency Constraint
Index structures
▪ Inverted index: Phrase → Sentence
▪ Frequent phrases → large inverted index
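The phrase-to-sentence index can be sketched in Python as an inverted index from each phrase to the ids of the sentences containing it. The function names and the example data are hypothetical; the sketch only illustrates why frequent phrases make the index large, since their posting lists grow with the database.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_inverted_index(sentence_phrases: Dict[int, Iterable[str]]) -> Dict[str, Set[int]]:
    """Map every phrase to the set of sentence ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for sentence_id, phrases in sentence_phrases.items():
        for phrase in phrases:
            index[phrase].add(sentence_id)
    return index


def candidates(index: Dict[str, Set[int]], query_phrases: Iterable[str]) -> Set[int]:
    """Sentences sharing at least one phrase with the query."""
    result: Set[int] = set()
    for phrase in query_phrases:
        result |= index.get(phrase, set())
    return result


# Hypothetical phrase sets for four database sentences.
db = {
    1: ["he have", "he have apple"],
    2: ["he eat", "he eat red apple"],
    3: ["he have", "he have pencil"],
    4: ["he have"],
}
index = build_inverted_index(db)

posting_sizes = {phrase: len(ids) for phrase, ids in index.items()}
matches = candidates(index, ["he eat", "he have pencil"])
```

In this toy data the frequent phrase "he have" has the longest posting list, while the infrequent phrases each point to a single sentence.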