Published by Warren Davis. Modified over 9 years ago.
1
Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Model
Chun-Jen Lee, Jason S. Chang, Thomas C. Chuang
AMTA 2004
2
Introduction
Focusing on extracting entity names (PER, LOC, ORG) from a bilingual corpus.
The feasibility of extracting interlingual NEs has seldom been addressed.
– Al-Onaizan and Knight 2002
– Huang and Vogel 2002
– Chen et al. 2003
– Moore 2003
– Kumano et al. 2004
– Lee et al. 2003 (Baseline Model)
Integrating approximate matching and personal name recognition into the baseline model.
3
Framework
1. Preprocess:
   1) Perform sentence alignment.
   2) Label English named entities.
2. Main process:
   1) For each labeled $NE_E$, apply the Statistical Probability Translation Model and Approximate Matching to find Chinese named-entity candidates $\{NE_A\}$ in $S_C$.
   2) For any word $W_E$ in $NE_E$ whose Chinese translation cannot be found in $S_C$, apply the proposed Statistical Transliteration Model, enhanced with Chinese Personal Name Recognition, to extract the corresponding Chinese transliterations $\{NE_B\}$ in $S_C$ with scores above a predefined threshold.
   3) Merge $\{NE_A\}$ with $\{NE_B\}$ into possible candidates $\{NE_C\}$.
   4) Rank $\{NE_C\}$ by their scores. The candidate with the maximum score is chosen as the answer.
4
SPTM
A noisy channel approach.
Translating an English phrase $e$ with $l$ words into a Mandarin Chinese phrase $f$ with $m$ words by decomposing the channel function into two independent probabilistic functions:
– Lexical translation probability function $P(f_{a_i} \mid e_i)$, where $e_i$ is the $i$-th word in $e$ and $e_i$ is aligned with $f_{a_i}$ in $f$ under the alignment $a$
– Alignment probability function $P(a \mid l, m) = P(a_1, a_2, \ldots, a_l \mid l, m)$
5
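SPTM
$E$ = “Ichthyosis Concern Association”
$F$ = “關懷 魚鱗癬 協會”
Correct alignment: ($a_1 = 2$, $a_2 = 1$, $a_3 = 3$).
The phrase translation probability is
$$P(F \mid E) = P(a_1 = 2, a_2 = 1, a_3 = 3 \mid l = 3, m = 3)\, P(f_2 \mid e_1)\, P(f_1 \mid e_2)\, P(f_3 \mid e_3)$$
Defining the scoring function as a log probability function:
$$Score(F \mid E) = \log P(a \mid l, m) + \sum_{i=1}^{l} \log P(f_{a_i} \mid e_i)$$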
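The scoring idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the score is the log of the alignment probability plus the sum of the log lexical translation probabilities, as the two-function decomposition suggests. The probability values, the `LEX_PROB` table, the `phrase_score` function, and the `floor` smoothing constant are all hypothetical and not taken from the presentation.

```python
import math

# Hypothetical lexical translation probabilities P(f | e); illustrative values only.
LEX_PROB = {
    ("Ichthyosis", "魚鱗癬"): 0.8,
    ("Concern", "關懷"): 0.5,
    ("Association", "協會"): 0.9,
}


def phrase_score(e_words, f_words, alignment, align_prob, floor=1e-6):
    """Return log P(a | l, m) + sum_i log P(f_{a_i} | e_i).

    `alignment` is 1-based: alignment[i] is the position in f_words
    aligned with e_words[i]. Unseen word pairs get the assumed `floor`.
    """
    score = math.log(align_prob)
    for i, a_i in enumerate(alignment):
        p = LEX_PROB.get((e_words[i], f_words[a_i - 1]), floor)
        score += math.log(p)
    return score


e = ["Ichthyosis", "Concern", "Association"]
f = ["關懷", "魚鱗癬", "協會"]

# The same (assumed) alignment probability is used for both alignments,
# so the difference comes only from the lexical terms.
correct = phrase_score(e, f, (2, 1, 3), align_prob=0.2)
monotone = phrase_score(e, f, (1, 2, 3), align_prob=0.2)
```

With these made-up numbers the alignment (2, 1, 3) scores higher than the monotone alignment (1, 2, 3), because the lexical terms reward pairing each English word with its actual translation.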
6
Estimating Lexical Translation Probability Based on Parallel Corpus
Adopting a word alignment module to automatically extract lexical translation probabilities (Wu and Chang 2003):
1. Develop a list of preferred part-of-speech (POS) patterns of collocations in both languages.
2. Match collocation candidates against the preferred POS patterns and apply N-gram statistics for both languages.
3. Apply the log-likelihood ratio statistic to pairs of consecutive words in both languages.
4. Finally, align content words using the Competitive Linking Algorithm (Melamed 1997).
To avoid introducing too much noise, only bilingual phrases with high probabilities are considered.
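The greedy linking step at the heart of competitive linking can be sketched as follows. This is a simplified illustration only: it assumes association scores for word pairs are already available and omits the score estimation and any re-estimation rounds. The `competitive_linking` function and the score values are hypothetical.

```python
def competitive_linking(scores):
    """Greedy one-to-one linking of word pairs.

    `scores` maps (english_word, chinese_word) to an association score.
    Pairs are visited from highest to lowest score; a pair is linked only
    if neither of its words has been linked already.
    """
    links = []
    used_e, used_f = set(), set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (e_word, f_word), _score in ranked:
        if e_word in used_e or f_word in used_f:
            continue
        links.append((e_word, f_word))
        used_e.add(e_word)
        used_f.add(f_word)
    return links


# Hypothetical association scores for one sentence pair.
scores = {
    ("Concern", "關懷"): 7.1,
    ("Concern", "協會"): 2.0,
    ("Association", "協會"): 9.3,
    ("Association", "關懷"): 1.5,
    ("Ichthyosis", "魚鱗癬"): 8.4,
    ("Ichthyosis", "協會"): 3.0,
}
links = competitive_linking(scores)
```

Because the strongest pairs claim their words first, weaker competing pairs such as ("Concern", "協會") are blocked once "協會" has been linked to "Association".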
7
Estimating Lexical Translation Probability Based on Transliteration Model
Adopting a Romanization system to represent a Chinese word.
$E$ and $F$ are assumed to be an English word and a Romanized Chinese character sequence, respectively.
The transliteration probability $P(F \mid E)$ can be approximated by decomposing $E$ and $F$ into transliteration units (TUs).
A word $E$ with $l$ characters and a Romanized word $F$ with $m$ characters are denoted by $e_1 e_2 \ldots e_l$ and $f_1 f_2 \ldots f_m$, respectively.
The mapping of $(E, F)$ can be represented as a sequence of $n$ matched TUs: $\{(u_1, v_1), (u_2, v_2), \ldots, (u_n, v_n)\}$.
The alignment $a$ between $E$ and $F$ can be represented as a sequence of match types $(m_1 m_2 \ldots m_n)$, where $m_i$ denotes the pair of lengths of $u_i$ and $v_i$.
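To make the TU decomposition concrete, here is a minimal dynamic-programming sketch that finds the most probable TU segmentation of a word pair. It assumes a best-path (Viterbi-style) approximation and only considers TUs of length one or more on both sides; the presentation's actual formula is not preserved in this transcript, so these are assumptions. The `TU_PROB` table, its values, the length caps, and the function name are hypothetical.

```python
import math

# Hypothetical TU probabilities P(v | u); illustrative values, not trained estimates.
TU_PROB = {
    ("t", "t"): 0.9,
    ("o", "ang"): 0.3,
    ("o", "a"): 0.4,
    ("m", "mu"): 0.6,
    ("m", "m"): 0.3,
}
MAX_U, MAX_V = 2, 3  # assumed caps on TU lengths (English side, Romanized side)


def best_tu_alignment(E, F):
    """Return (log-probability, TU list) for the best TU segmentation of (E, F)."""
    l, m = len(E), len(F)
    neg_inf = float("-inf")
    g = [[neg_inf] * (m + 1) for _ in range(l + 1)]
    back = [[None] * (m + 1) for _ in range(l + 1)]
    g[0][0] = 0.0
    for i in range(l + 1):
        for j in range(m + 1):
            if g[i][j] == neg_inf:
                continue
            # Try every match type (du, dv) starting at position (i, j).
            for du in range(1, MAX_U + 1):
                for dv in range(1, MAX_V + 1):
                    if i + du > l or j + dv > m:
                        continue
                    p = TU_PROB.get((E[i:i + du], F[j:j + dv]))
                    if p is None:
                        continue
                    cand = g[i][j] + math.log(p)
                    if cand > g[i + du][j + dv]:
                        g[i + du][j + dv] = cand
                        back[i + du][j + dv] = (du, dv)
    if g[l][m] == neg_inf:
        return neg_inf, []
    # Follow back-pointers from (l, m) to (0, 0) to recover the TUs.
    tus, i, j = [], l, m
    while (i, j) != (0, 0):
        du, dv = back[i][j]
        tus.append((E[i - du:i], F[j - dv:j]))
        i, j = i - du, j - dv
    return g[l][m], tus[::-1]


score, tus = best_tu_alignment("tom", "tangmu")
match_types = [(len(u), len(v)) for u, v in tus]
```

For the toy pair ("tom", "tangmu") the recovered TUs are (t, t), (o, ang), (m, mu), so the match-type sequence is (1,1), (1,3), (1,2).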
8
Estimating Lexical Translation Probability Based on Transliteration Model
9
NE alignment
1. $g(0,0) = 0$
2.
3.
Suppose that there is an entry $(e_i, w_f)$ in the bilingual dictionary. $Score_{lex}(f_{a_i} \mid e_i)$ is formulated as:
10
Approximate Matching
11
CPNR
Chinese surnames are used as anchor points.
The Chinese personal name recognizer is applied only when the given NE is a person name and $Score_{tm}(R(f_{a_i}) \mid e_i)$ is less than $Thr_1$.
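A rough sketch of surname-anchored candidate extraction is given below. The slide only says that surnames serve as anchor points; the rest is assumed for illustration: single-character surnames, given names of one or two characters, and the hypothetical names `SURNAMES`, `name_candidates`, and `should_apply_cpnr`. The threshold values in the usage lines are made up.

```python
# Hypothetical, tiny surname list; a real list would be much longer.
SURNAMES = {"王", "李", "張", "陳"}


def name_candidates(sentence, surnames=SURNAMES, max_given=2):
    """Propose personal-name candidates anchored at surname characters.

    For each surname character, emit the surname followed by 1..max_given
    characters (assumption: given names have one or two characters).
    """
    candidates = []
    for i, ch in enumerate(sentence):
        if ch not in surnames:
            continue
        for n in range(1, max_given + 1):
            if i + 1 + n <= len(sentence):
                candidates.append(sentence[i:i + 1 + n])
    return candidates


def should_apply_cpnr(ne_type, score_tm, thr1):
    """Gate from the slide: apply only for PER entities whose transliteration score is below Thr_1."""
    return ne_type == "PER" and score_tm < thr1


cands = name_candidates("昨天王小明到台北")
apply_per = should_apply_cpnr("PER", score_tm=-12.0, thr1=-8.0)
apply_loc = should_apply_cpnr("LOC", score_tm=-12.0, thr1=-8.0)
```

Here the surname "王" anchors two candidates, "王小" and "王小明"; a downstream scorer would still have to choose between them.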
12
Training Data
Noun phrases of the BDC Electronic Chinese-English Dictionary were used to train PTM.
– To train the transliteration model, 2,430 pairs of English names together with their Chinese transliterations and Chinese Romanization were used.
The LDC Central News Agency Corpus was used to extract keywords of entity names for identifying NE types.
We collected 117 bilingual keyword pairs from the corpora.
A list of Chinese surnames was also gathered to help identify and extract PER-type NEs.
The parallel corpus collected from Sinorama Magazine was used to construct the corpus-based lexicon and to estimate the LTP.
13
Experiments
275 aligned sentences from Sinorama were randomly selected.
Answer keys were manually prepared.
Each chosen aligned sentence contains at least one NE pair.
Currently, the lengths of English NEs are restricted to less than 6.
In total, 830 pairs of NEs were labeled.
The numbers of NE pairs for types PER, LOC, and ORG are 208, 362, and 260, respectively.