ABC: A Phrase-to-Phrase Alignment Method
Integrating monolingual and bilingual information in sub-sentential phrase alignment
Ying Zhang (Joy), joy+@cs.cmu.edu
ISL, Carnegie Mellon University
July 08, 2002
Overview
– The advantages of phrase-to-phrase alignment
– Existing methods
– The algorithm
– Integrating bilingual information with monolingual information
– Experiments and results
– Discussion and future work
SMT and sub-sentential alignment
A Statistical Machine Translation (SMT) system is based on the noisy-channel model:

    T* = argmax_T P(T|S) = argmax_T P(S|T) * P(T)

where P(S|T) is the Translation Model and P(T) is the Language Model.
SMT and sub-sentential alignment (Cont.)
Through sub-sentential alignment we train the Translation Model (TM). In our system, the TM contains word-to-word and phrase-to-phrase transducers.
(Figure: example transducer entries.)
Why phrases?
Mismatch between languages.
Why phrases? (Cont.)
Phrases encapsulate the context of words
– Tense, e.g.:
(Figure: word-to-word alignment vs. phrase-to-phrase alignment.)
Why phrases? (Cont.)
Local reordering
– E.g. relative clauses in Chinese, which still need global reordering; global reordering is left as future work.
Why phrases? (Cont.)
For languages that need word segmentation, such as Chinese:
– The word segmenter cannot segment the sentence perfectly, due to the incomplete coverage of the word list and to segmentation ambiguity
– Previous work (Zhang 2001) tried to identify phrases in the corpus using only monolingual information and to augment the word list with the new phrases found
  Precision: hard to decide on phrase boundaries
  Prediction: the phrases identified may not occur in future test data
Why phrases? (Cont.)
Example of using phrases to mitigate word-segmentation failures.
Some alignment algorithms
– IBM models (Brown 93)
– HMM alignment: phrase to phrase (Vogel 96)
– Competitive linking: word to word (Melamed 97)
– Flow network (Gaussier 98)
– Bitext map (Melamed 01)
Algorithm
Given a sentence pair (S, T), S = s_1 ... s_m and T = t_1 ... t_n, where s_i / t_j are source/target words.
Given an m*n matrix B, where B(i,j) = co-occurrence(s_i, t_j), computed from the 2x2 contingency table of s_i and t_j over the corpus:

              t_j    ~t_j
    s_i        a       b
    ~s_i       c       d

    N = a + b + c + d
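A minimal sketch of one way to fill B from the contingency counts. The deck's exact co-occurrence formula did not survive this transcript, so the chi-square association score below is an assumption, not necessarily the measure used:

    def cooccurrence_score(a, b, c, d):
        """Chi-square association between s_i and t_j from the 2x2
        contingency table: a = both occur, b = s_i only, c = t_j only,
        d = neither; N = a + b + c + d."""
        n = a + b + c + d
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0:
            return 0.0
        return n * (a * d - b * c) ** 2 / denom

    # counts are gathered over all sentence pairs in the corpus, then:
    # B[i][j] = cooccurrence_score(a, b, c, d)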
Algorithm (Cont.)
Goal: find a partition of matrix B, under the constraint that one source/target word can align only to one target/source word or to one target/source phrase (an adjacent word sequence).
(Figures: a legal segmentation with an imperfect alignment, and an illegal segmentation with a perfect alignment.)
Algorithm (Cont.)

    while (some row or column is not yet aligned) {
        find cell[i,j] where B(i,j) is the maximum over all available (not yet aligned) cells;
        expand cell[i,j] with similarity sim_thresh into region[RowStart,RowEnd; ColStart,ColEnd];
        mark all the cells in the region as aligned;
    }
    output the aligned regions as phrases
Algorithm (Cont.)
Expand cell[i,j] with sim_thresh:

    current aligned region: region[RowStart=i, RowEnd=i; ColStart=j, ColEnd=j]
    while (still OK to expand) {
        if for all cells[m,n] with m = RowStart-1 and ColStart <= n <= ColEnd, B(m,n) is similar to B(i,j)
            then RowStart--;   // expand to north
        if for all cells[m,n] with m = RowEnd+1 and ColStart <= n <= ColEnd, B(m,n) is similar to B(i,j)
            then RowEnd++;     // expand to south
        ...                    // expand to east
        ...                    // expand to west
    }

Define similar(x,y) = true iff abs((x-y)/y) < 1 - similarity_thresh.
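A runnable sketch of the two slides above, assuming B is a NumPy float matrix; the helper names similar, expand_region, and segment are illustrative, not from the deck:

    import numpy as np

    def similar(x, y, sim_thresh):
        # similar(x, y) = true iff abs((x - y) / y) < 1 - sim_thresh
        return y != 0 and abs((x - y) / y) < 1 - sim_thresh

    def expand_region(B, aligned, i, j, sim_thresh):
        """Grow a rectangle around seed cell (i, j), one border strip at
        a time, while every cell in the new strip is unaligned and
        similar to the seed score B[i, j]."""
        top, bottom, left, right = i, i, j, j

        def strip_ok(cells):
            return all(not aligned[r, c] and similar(B[r, c], B[i, j], sim_thresh)
                       for r, c in cells)

        grew = True
        while grew:
            grew = False
            if top > 0 and strip_ok([(top - 1, c) for c in range(left, right + 1)]):
                top -= 1; grew = True                       # expand to north
            if bottom < B.shape[0] - 1 and strip_ok(
                    [(bottom + 1, c) for c in range(left, right + 1)]):
                bottom += 1; grew = True                    # expand to south
            if right < B.shape[1] - 1 and strip_ok(
                    [(r, right + 1) for r in range(top, bottom + 1)]):
                right += 1; grew = True                     # expand to east
            if left > 0 and strip_ok([(r, left - 1) for r in range(top, bottom + 1)]):
                left -= 1; grew = True                      # expand to west
        return top, bottom, left, right

    def segment(B, sim_thresh):
        """Greedy segmentation: seed at the largest unaligned score,
        expand it into a rectangular phrase region, repeat until every
        row and column is covered."""
        aligned = np.zeros(B.shape, dtype=bool)
        regions = []
        while (~aligned.any(axis=1)).any() or (~aligned.any(axis=0)).any():
            masked = np.where(aligned, -np.inf, B)
            i, j = np.unravel_index(np.argmax(masked), B.shape)
            t, b, l, r = expand_region(B, aligned, i, j, sim_thresh)
            aligned[t:b + 1, l:r + 1] = True
            regions.append((t, b, l, r))
        return regions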
Algorithm (Cont.)
(Figure: region expansion to the north, south, east, and west.)
Finding the best similarity threshold
The similarity_threshold is critical in this algorithm.
The algorithm described above uses ONE similarity_threshold value for all region expansions in the matrix, and the same ONE value for all sentence pairs.
Ideally one would use a different threshold value for each region and find the globally best segmentation of each matrix
– but that is a search tree: combinatorial explosion.
Finding the best similarity threshold (Cont.)
One practical solution, for one matrix B:

    for (st = 0.1; st <= 0.9; st += 0.1) {
        find the segmentation of B given similarity_threshold = st;
    }
    select the solution with the highest performance(solution)
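The same sweep in Python, reusing segment() from the earlier sketch. The deck does not define performance(), so the scoring function is left as a parameter you must supply; the helper name best_segmentation is illustrative:

    def best_segmentation(B, performance):
        """Try similarity thresholds 0.1 .. 0.9 and keep the segmentation
        that scores highest under the supplied performance() function."""
        best_regions, best_score = None, float("-inf")
        for st in (i / 10 for i in range(1, 10)):
            regions = segment(B, sim_thresh=st)
            score = performance(B, regions)
            if score > best_score:
                best_regions, best_score = regions, score
        return best_regions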
Integrating monolingual information
Motivation:
– Use more information in the alignment
– Makes it easier to align phrases
– There is much more monolingual data than bilingual data
(Figure: examples with Pittsburgh, Somerset, Union town; Los Angeles, Santa Monica, Santa Clarita, Corona.)
Integrating monolingual information (Cont.)
Given a sentence pair (S, T), S = s_1 ... s_m and T = t_1 ... t_n, where s_i / t_j are source/target words:
– Construct an m*m matrix A, where A(i,j) = collocation(s_i, s_j); only A(i,i-1) and A(i,i+1) have values
– Construct an n*n matrix C, where C(i,j) = collocation(t_i, t_j); only C(j-1,j) and C(j+1,j) have values
– Construct an m*n matrix B, where B(i,j) = co-occurrence(s_i, t_j)
Integrating monolingual information (Cont.)
Normalization:
– Assign a self2self value α(s_i) to A(i,i), with 0 <= α(s_i) <= 1
– Assign a self2self value β(t_j) to C(j,j), with 0 <= β(t_j) <= 1
– Normalize A so that each row sums to 1: Σ_j A(i,j) = 1
Integrating monolingual information (Cont.)
– Normalize C so that each column sums to 1: Σ_i C(i,j) = 1
– Normalize B so that its scores are comparable across cells
Integrating monolingual information (Cont.)
Calculate the new source-target matrix: B' = A * B * C
OK, that's it! Yes, that's the whole story!
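A sketch of the smoothing step B' = A * B * C. The row-stochastic construction of A below follows the normalization reconstructed on the previous slides (the original formulas were lost in this transcript), so treat it as one plausible reading, and build_A as an illustrative name:

    import numpy as np

    def build_A(colloc_left, colloc_right, alpha):
        """Tri-diagonal source collocation matrix: A[i,i] = alpha[i]
        (the self2self weight); the neighbour collocation scores share
        the remaining 1 - alpha[i] so that each row sums to 1.
        colloc_left[i]  ~ collocation(s_i, s_{i-1})
        colloc_right[i] ~ collocation(s_i, s_{i+1})"""
        m = len(alpha)
        A = np.zeros((m, m))
        for i in range(m):
            left = colloc_left[i] if i > 0 else 0.0
            right = colloc_right[i] if i < m - 1 else 0.0
            z = left + right
            if z == 0:
                A[i, i] = 1.0      # no neighbours: keep all weight on self
                continue
            A[i, i] = alpha[i]
            if i > 0:
                A[i, i - 1] = (1 - alpha[i]) * left / z
            if i < m - 1:
                A[i, i + 1] = (1 - alpha[i]) * right / z
        return A

    # C is built the same way column-wise from the target collocations
    # and beta; the smoothed source-target matrix is then:
    #   B_new = A @ B @ C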
Example
(Figures: the alignment with pure bilingual information, and after integration with monolingual information.)
Visualization
Left: using pure bilingual information. Right: integrated with monolingual information.
What is the self2self value?
Expanding B' = A * B * C, each B'(i,j) blends B(i,j) with the co-occurrence scores of the neighbouring words. The off-diagonal weight 1 - α(s_i) stands for how much word s_i should "make use of" its neighbours' relations with the target words.
For content words the self2self value should be higher, and for function words it should be lower.
How to set the self2self values
Well, this is tricky.
Before the June evaluation I set α = 0.6 for all source words and β = 0.48 for all target words
– Not good: "the" should have a lower self2self value and "Pittsburgh" a higher one.
Calculating self2self values
Observation: source-language content words tend to align to a few target words with high scores, while function words tend to align to many target words with low scores.
(Figure: co-occurrence score distributions for "the", "has", "in", "beijing", "computer", "bus".)
Calculating self2self values (Cont.)
Calculate the entropy of a word over the distribution of its normalized co-occurrence scores:
– Given word s_i, for every possibly co-occurring word t_j, the co-occurrence score is C(i,j)
– Let p(j|i) = C(i,j) / Σ_j C(i,j)
– Define H(s_i) = -Σ_j p(j|i) log p(j|i)
Map the score linearly to a value between 0 and 1.
Better: map the scores to a range narrower than 0~1, e.g. 0.45~0.85. Why?
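A sketch of this computation. The direction of the mapping (high entropy, i.e. function words, toward the low end) and the normalization by maximal entropy are assumptions consistent with the observation on the previous slide:

    import math

    def self2self(cooc_scores, lo=0.45, hi=0.85):
        """Entropy-based self2self value for one word.
        cooc_scores: co-occurrence scores of the word with every
        candidate translation. Flat, high-entropy distributions
        (function words) map near lo; peaky, low-entropy distributions
        (content words) map near hi."""
        total = sum(cooc_scores)
        probs = [c / total for c in cooc_scores if c > 0]
        if len(probs) <= 1:
            return hi                    # a single translation: content-like
        h = -sum(p * math.log(p) for p in probs)
        h_max = math.log(len(probs))     # assumption: scale by max entropy
        return hi - (hi - lo) * (h / h_max)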
A modification to the segmentation algorithm
The original algorithm calculates A*B*C only once. In the modified version:
– Set B[i,j] to 0 for all aligned cells whenever a new aligned region is found
– Re-calculate A*B*C
Motivation: once an aligned region has been found, the boundary of this phrase is known, so it should not affect its unaligned neighbours.
More computationally expensive, but experiments showed better performance.
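A sketch of the modified loop, reusing numpy and expand_region from the earlier segmentation sketch; segment_modified is an illustrative name:

    def segment_modified(A, B, C, sim_thresh):
        """Like segment(), but after fixing each region, zero its cells
        in the raw B and recompute A*B*C so the finished phrase no
        longer influences its unaligned neighbours."""
        B = B.copy()
        aligned = np.zeros(B.shape, dtype=bool)
        regions = []
        while (~aligned.any(axis=1)).any() or (~aligned.any(axis=0)).any():
            B_smoothed = A @ B @ C                  # recomputed every iteration
            masked = np.where(aligned, -np.inf, B_smoothed)
            i, j = np.unravel_index(np.argmax(masked), B.shape)
            t, b, l, r = expand_region(B_smoothed, aligned, i, j, sim_thresh)
            aligned[t:b + 1, l:r + 1] = True
            regions.append((t, b, l, r))
            B[t:b + 1, l:r + 1] = 0.0               # the boundary is now fixed
        return regions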
Updating bilingual information by iteration
Using EM to update the bilingual co-occurrence scores
– Doesn't help much.
Results
Dev-test on the small data track (3,540 sentence pairs of training data + 10K glossary):

                                NIST              Bleu
    Baseline (IBM1+Gloss)       6.0097            0.1231
    Original algorithm          6.3673 (+5.9%)    0.1478 (+20.0%)
    Modified algorithm          6.4310 (+7.0%)    0.1507 (+22.4%)

After LM-fill:

                                NIST              Bleu
    Baseline (IBM1+Gloss)+LM    6.3775            0.1417
    Original algorithm+LM       6.6754 (+4.7%)    0.1611 (+13.7%)
    Modified algorithm+LM       6.7987 (+6.6%)    0.1712 (+20.8%)
Results (Cont.)

                              No LM-fill          With LM-fill
                              NIST      Bleu      NIST      Bleu
    Baseline (IBM1+Gloss)     6.0097    0.1231    6.3775    0.1417
    HMM+IBM1+Gloss            6.1802    0.1305    6.4750    0.1459
    ARV+IBM1+Gloss            6.3636    0.1473    6.7405    0.1681
    JOY+IBM1+Gloss            6.4310    0.1507    6.7987    0.1712
    ARV+JOY+IBM1+Gloss        6.5117    0.1569    6.8790    0.1776
Conclusion
– Simple
– Efficient: unlike stochastic bracketing (Wu 95), which is O(m^3 n^3), the matrix segmentation algorithm is linear, O(min(m,n)); the construction of A*B*C is O(m*n)
– Effective: improved translation quality from the baseline (NIST = 6.0097, Bleu = 0.1231) to (NIST = 6.4310, Bleu = 0.1507) on the small-data-track dev-test
Future work
– Find a better segmentation algorithm (dynamic thresholds)
– Find a mathematically sounder method for setting the self2self values
– Investigate the possibility of using trigram or distance-bigram monolingual information
References
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.
E. Gaussier. 1998. Flow network models for word alignment and terminology extraction from bilingual corpora. In Proceedings of COLING-ACL-98, Montreal, pp. 444-450.
I. Dan Melamed. 1997. A word-to-word model of translational equivalence. In Proceedings of ACL-97, pp. 490-497, Madrid, Spain.
I. Dan Melamed. 2001. Empirical Methods for Exploiting Parallel Texts. MIT Press.
Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In COLING '96: The 16th International Conference on Computational Linguistics, pp. 836-841, Copenhagen, August.
Dekai Wu. 1995. An algorithm for simultaneously bracketing parallel texts by aligning words. In Proceedings of ACL-95, June.
Ying Zhang, Ralf D. Brown, Robert E. Frederking, and Alon Lavie. 2001. Pre-processing of bilingual corpora for Mandarin-English EBMT. In Proceedings of MT Summit VIII, September.
Acknowledgements
I would like to thank Stephan Vogel, Jian Zhang, Jie Yang, Jerry Zhu, Ashish, and others for their valuable advice and suggestions during this work.
Questions and Comments