An Effective Approach for Searching Closest Sentence Translations from The Web. Ju Fan, Guoliang Li, and Lizhu Zhou. Database Research Group, Tsinghua University.


1 An Effective Approach for Searching Closest Sentence Translations from The Web. Ju Fan, Guoliang Li, and Lizhu Zhou. Database Research Group, Tsinghua University. DASFAA 2011 – Apr. 23, Hong Kong

2 Outline: Introduction; Overview of Our Approach; Phrase-Based Similarity Model; Phrase Selection; Experiments; Conclusion. 10/13/2015, SCST@DASFAA 2011

3 Outline: Introduction; Overview of Our Approach; Phrase-Based Similarity Model; Phrase Selection; Experiments; Conclusion

4 Background
Parallel sentences on the Web
▪ Sentences with well-translated counterparts
▪ An English-to-Chinese example: "Obama said he hopes to get Congress to approve it next year" / "奥巴马总统说他争取让希望国会明年批准该协议。" (blog.hjenglish.com)
A rich source for translation
Commercial Systems

5 Background
Parallel sentences, e.g., "The result is good" / "结果很好"
Parallel Sentence Discovery and Extraction
▪ Web → Parallel Sentence Database: Sen 1 (E-C), Sen 2 (E-C), Sen 3 (E-C), ……, Sen n (E-C)
Sentence-Level Translation Aid
▪ Query Sentence (English) → Sentence Matching → Closest Sentences with Translation
Research Issue
▪ An effective similarity model between sentences in the source language (e.g., English sentences)

6 Motivation
Existing approaches:
▪ Word-based, e.g., translation model, edit distance, …
▪ Gram-based, e.g., N-gram, V-gram
▪ All subsequences of a sentence
Limitations:
▪ Cannot capture the order of words
▪ Do not consider syntactic information
▪ Too expensive
We propose a phrase-based similarity model that uses:
1. Syntactic information
2. Frequency information
3. Lengths of phrases

7 Outline: Introduction; Overview of Our Approach; Phrase-Based Similarity Model; Phrase Selection; Experiments; Conclusion

8 Problem Definition
▪ Data: a database of parallel sentences (Sentence 1: English - Chinese; Sentence 2: English - Chinese; Sentence 3: English - Chinese; …)
▪ Query: a query sentence (English) from a translator
▪ Answer: sentences with their translations

9 Phrase-Based Sentence Matching
Offline
▪ Parallel Sentences → Phrase Selection → Phrase Database
Online
▪ Query q → Phrase Selection → phrases f_1, f_2, ……, f_n
▪ Sentence s with phrases f'_1, f'_2, ……, f'_n
▪ Similarity Model compares the phrases of q and s

10 Outline: Introduction; Overview of Our Approach; Phrase-Based Similarity Model; Phrase Selection; Experiments; Conclusion

11 Phrase-Based Similarity Model
Offline
▪ Parallel Sentences → Phrase Selection → Phrase Database
Online
▪ Query q → Phrase Selection → phrases f_1, f_2, ……, f_n
▪ Sentence s with phrases f'_1, f'_2, ……, f'_n
▪ Similarity Model compares the phrases of q and s

12 Similarity Model
$\mathrm{sim}(q,s) = \sum_{f \in F_q \cap F_s} \phi(q,f)\,\phi(s,f)\,w(f)$
▪ Query sentence q, with phrase set $F_q$ = {f_1, f_2, f_3, ……, f_m}
▪ A sentence in the DB, s, with phrase set $F_s$ = {f'_1, f'_2, f'_3, ……, f'_n}
▪ Shared phrases: $f \in F_q \cap F_s$
▪ φ(q,f): syntactic importance of f to q
▪ φ(s,f): syntactic importance of f to s
▪ w(f): weight of f (IDF)
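
The formula on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `idf_weight` and `phrase_similarity`, the dictionary representation of φ, and the specific IDF form log(N / df) are all introduced here for the example. The slides only say that w(f) is an IDF weight.

```python
import math
from typing import Dict


def idf_weight(num_sentences: int, doc_freq: int) -> float:
    """Assumed IDF form log(N / df); the slides do not give the exact formula."""
    return math.log(num_sentences / doc_freq)


def phrase_similarity(
    phi_q: Dict[str, float],  # phrase -> syntactic importance to the query q
    phi_s: Dict[str, float],  # phrase -> syntactic importance to the sentence s
    w: Dict[str, float],      # phrase -> weight (e.g., IDF)
) -> float:
    """sim(q, s): sum over shared phrases of phi(q,f) * phi(s,f) * w(f)."""
    shared = phi_q.keys() & phi_s.keys()  # F_q ∩ F_s
    return sum(phi_q[f] * phi_s[f] * w[f] for f in shared)
```

Only phrases present in both sentences contribute, so two sentences with no shared phrase get similarity 0 regardless of the weights.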

13 Syntactic Importance of Phrases
$\phi(q,f) = \prod_{m} \alpha_m \cdot \prod_{g} \beta_g$
▪ $\alpha_m$: syntactic weight of matched term m
▪ $\beta_g$: penalty for gap term g (constant)
▪ d: a decay factor
Example
▪ Sentence q: "He has eaten an apple"
▪ Phrase f: "he eaten apple"; gap terms: "has", "an"
▪ Dependency tree over eaten, he, apple, has, an, with syntactic weights $\alpha_0$, $d \cdot \alpha_0$, $d^2 \cdot \alpha_0$ at successive levels
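
A hedged sketch of how φ(q,f) might be computed for the example on this slide. Several details are assumptions made for illustration: the function name `syntactic_importance`, the greedy left-to-right matching of phrase terms, the rule that a matched term at tree depth k gets weight α_0 · d^k, the default parameter values, and the depth assigned to each word (the tree layout did not survive in the transcript). Terms are assumed to be distinct within a sentence.

```python
from typing import Dict, List


def syntactic_importance(
    sentence: List[str],
    phrase: List[str],
    depth: Dict[str, int],  # assumed dependency-tree depth of each term
    alpha0: float = 1.0,    # weight at the root (illustrative value)
    d: float = 0.5,         # decay factor (illustrative value)
    beta: float = 0.8,      # constant gap penalty (illustrative value)
) -> float:
    """phi = product of alpha_m over matched terms * product of beta over gap terms."""
    if not phrase:
        return 0.0
    positions = []
    start = 0
    for term in phrase:  # greedy in-order match of phrase terms
        try:
            idx = sentence.index(term, start)
        except ValueError:
            return 0.0   # the phrase does not occur in the sentence
        positions.append(idx)
        start = idx + 1
    score = 1.0
    for idx in positions:
        score *= alpha0 * d ** depth[sentence[idx]]
    # Gap terms: skipped terms between the first and last matched term.
    gaps = (positions[-1] - positions[0] + 1) - len(positions)
    return score * beta ** gaps


q = ["he", "has", "eaten", "an", "apple"]
f = ["he", "eaten", "apple"]
# Hypothetical depths: "eaten" is the root of the dependency tree.
depths = {"eaten": 0, "he": 1, "has": 1, "apple": 1, "an": 2}
phi = syntactic_importance(q, f, depths)
```

With these illustrative values, the two gap terms "has" and "an" each multiply the score by the penalty β, so a contiguous phrase scores higher than a discontinuous one over the same matched terms.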

14 Features of the Similarity Model
More General
▪ Subsumes Jaccard, Cosine similarity, …
Syntactic Information
▪ Weight of matched terms
▪ Weight of terms in the gap
Frequency Information
▪ Weight of phrases
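
One way to see the "more general" claim for the cosine case is sketched below. The parameter setting is an illustrative assumption, not taken from the slides: treat each sentence as a binary vector over its phrase set, give every phrase unit weight, and normalize φ by the size of the phrase set.

```latex
\phi(q,f) = \frac{1}{\sqrt{|F_q|}}, \qquad
\phi(s,f) = \frac{1}{\sqrt{|F_s|}}, \qquad
w(f) = 1
\;\Longrightarrow\;
\mathrm{sim}(q,s)
  = \sum_{f \in F_q \cap F_s} \frac{1}{\sqrt{|F_q|}\,\sqrt{|F_s|}}
  = \frac{|F_q \cap F_s|}{\sqrt{|F_q|\,|F_s|}}
```

The right-hand side is the cosine similarity of the two binary phrase-indicator vectors.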

15 Outline: Introduction; Overview of Our Approach; Phrase-Based Similarity Model; Phrase Selection; Experiments; Conclusion

16 High-Quality Phrase Selection
Offline
▪ Parallel Sentences → Phrase Selection → Phrase Database
Online
▪ Query q → Phrase Selection → phrases f_1, f_2, ……, f_n
▪ Sentence s with phrases f'_1, f'_2, ……, f'_n
▪ Similarity Model compares the phrases of q and s

17 High-Quality Phrase
Extend grams by allowing discontinuous terms
A heuristic for selecting phrases
▪ Gap constraint: discontinuous terms must have a syntactic relationship
▪ Frequency constraint: the phrase is infrequent (large IDF), where frequency is the # of sentences in the DB having it
▪ Maximum constraint: 1) not a prefix of another phrase; 2) max. length
Example
▪ Sentence q: "He has eaten an apple"
▪ Phrase: "he eaten apple"

18 Phrase Selection
Selecting phrases with gap and maximum constraints
▪ Sentence s: "He ate a red apple"; terms: he, eat, red, apple
▪ Sentence → Graph, with edges for 1) sequential relationships and 2) syntactic relationships
▪ Longest path from a node = a phrase satisfying the gap constraint and the maximum constraint
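
The "longest path from a node" idea can be sketched as a bounded depth-first search over the sentence graph. This is an illustration only: the function name `longest_path_from`, the adjacency-dictionary representation, the tie-breaking rule (first path found wins), and the particular edges in the example (including the assumed syntactic edge from "eat" to "apple") are hypothetical and not taken from the slides.

```python
from typing import Dict, List


def longest_path_from(
    start: int,
    edges: Dict[int, List[int]],  # term index -> indices reachable in one step
    max_len: int,                 # maximum number of terms in a phrase (>= 1)
) -> List[int]:
    """Return a longest path from `start` with at most `max_len` nodes.

    Edges are assumed to point forward in the sentence, so the graph is acyclic.
    """
    best = [start]

    def dfs(path: List[int]) -> None:
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        if len(path) == max_len:
            return  # length bound reached; do not extend further
        for nxt in edges.get(path[-1], ()):
            path.append(nxt)
            dfs(path)
            path.pop()

    dfs([start])
    return best


terms = ["he", "eat", "red", "apple"]
# Sequential edges 0->1, 1->2, 2->3, plus one assumed syntactic edge 1->3.
graph = {0: [1], 1: [2, 3], 2: [3]}
phrase = [terms[i] for i in longest_path_from(0, graph, max_len=3)]
```

Because every step follows either a sequential or a syntactic edge, any discontinuity in the resulting phrase is backed by a syntactic relationship, and taking the longest path avoids emitting phrases that are prefixes of longer ones from the same start node.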

19 Phrase Selection
Select phrases with frequency constraint (threshold = 2)
▪ Sentences in the DB: "He has an apple"; "He ate a red apple"; "He has a pencil"; "He has"
▪ Use a frequency trie
▪ Prune frequent phrases
[Figure: frequency trie over the terms he, have, eat, red, apple, pencil, and #, with nodes N0(8), N1(4), N2(3), N4(1), N5(0), N9(1), N11(1), N13(1), N14(0), N15(1), N27(1), N28(0), N29(0), …]
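
A minimal sketch of a frequency trie for this pruning step. It does not reproduce the exact trie in the figure. The class and function names, the choice to count how many inserted phrases pass through each node, and the reading of the threshold as "keep a phrase only if its count is strictly below the threshold" are assumptions made here for illustration.

```python
from typing import Dict, List, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count = 0       # number of inserted phrases passing through this node
        self.is_end = False  # True if an inserted phrase ends at this node


def insert(root: TrieNode, phrase: Tuple[str, ...]) -> None:
    node = root
    for term in phrase:
        node = node.children.setdefault(term, TrieNode())
        node.count += 1
    node.is_end = True


def infrequent_phrases(root: TrieNode, threshold: int) -> List[Tuple[str, ...]]:
    """Keep inserted phrases with count < threshold (assumed interpretation)."""
    kept = []
    stack = [(root, ())]
    while stack:
        node, prefix = stack.pop()
        if node.is_end and node.count < threshold:
            kept.append(prefix)
        for term, child in node.children.items():
            stack.append((child, prefix + (term,)))
    return sorted(kept)


root = TrieNode()
# Hypothetical candidate phrases, one per sentence on the slide
# (lemmatized, function words dropped).
for p in [("he", "have", "apple"), ("he", "eat", "red", "apple"),
          ("he", "have", "pencil"), ("he", "have")]:
    insert(root, p)
selected = infrequent_phrases(root, threshold=2)
```

In this toy run the phrase ("he", "have") is shared by three candidates and is pruned as frequent, while the three longer phrases each occur once and are kept.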

20 Outline: Introduction; Overview of Our Approach; Phrase-Based Similarity Model; Phrase Selection; Experiments; Conclusion

21 Experiment Setup
Data Sets
▪ $D_I$: 520,899 parallel sentences from ICIBA
▪ $D_C$: 800,000 parallel sentences from CNKI
Baseline Methods
▪ Jaccard Coefficient, Edit Distance, Cosine Similarity
▪ Translation Model Methods (TM)
▪ Cosine Similarity with VGRAM

22 Experiment Setup
Evaluation Metrics
▪ BLEU
◦ A well-known metric for machine translation
◦ Example: for q: "He has eaten an apple" (ref. translation: "他吃了一个苹果") and s: "He has a pencil" (translation: "他有一支铅笔"), BLEU compares the translation against the ref. translation
▪ Precision
◦ A user study to label whether the translations are useful

23 Effects of Phrase Selection
[Figures: effect of max. length on $D_I$; effect of freq. threshold on $D_C$]

24 Comparison with Similarity Models
[Figure: comparison on the $D_I$ data set]

25 Comparison with Existing Methods
[Figure: comparison on the $D_C$ data set]

26 User Studies
Methods used in commercial systems
[Figure: comparison on the $D_I$ data set]

27 Outline: Introduction; Overview of Our Approach; Phrase-Based Similarity Model; Phrase Selection; Experiments; Conclusion

28 Conclusion
▪ Searching closest sentence translations from the Web
▪ A phrase-based sentence similarity model
▪ High-quality phrase selection methods
▪ Extensive experiments and user studies

29 Thanks
My Homepage: http://dbgroup.cs.tsinghua.edu/fanju

30 Frequency Constraint
Index structures
▪ Phrase → Sentence
Frequent phrases → large inverted index
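
The phrase-to-sentence index on this slide can be sketched as a plain inverted index. The names `build_inverted_index`, `candidate_sentences`, and `posting_list_sizes`, as well as the toy data, are hypothetical and only illustrate why frequent phrases inflate the index.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_inverted_index(sentence_phrases: Dict[int, Iterable[str]]) -> Dict[str, Set[int]]:
    """Map each phrase to the set of sentence ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for sid, phrases in sentence_phrases.items():
        for phrase in phrases:
            index[phrase].add(sid)
    return dict(index)


def candidate_sentences(index: Dict[str, Set[int]], query_phrases: Iterable[str]) -> Set[int]:
    """Sentences sharing at least one phrase with the query."""
    result: Set[int] = set()
    for phrase in query_phrases:
        result |= index.get(phrase, set())
    return result


def posting_list_sizes(index: Dict[str, Set[int]]) -> Dict[str, int]:
    """Size of each posting list; frequent phrases have the longest lists."""
    return {phrase: len(sids) for phrase, sids in index.items()}


# Toy data: phrase sets for four sentences (illustrative only).
db = {
    1: ["he have", "he have apple"],
    2: ["he eat red apple"],
    3: ["he have", "he have pencil"],
    4: ["he have"],
}
index = build_inverted_index(db)
hits = candidate_sentences(index, ["he have apple", "he eat red apple"])
```

In the toy data the frequent phrase "he have" alone accounts for three postings, which is the effect the frequency constraint is meant to avoid.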

