Learning Phonetic Similarity for Matching Named Entity Translation and Mining New Translations
Wai Lam, Ruizhang Huang, Pik-Shan Cheung
ACM SIGIR 2004
Introduction
Discovering translation pairs across languages, especially for named entities
Focusing on Chinese-English named entity (NE) translation
Combining both phonetic and semantic information, whereas most previous studies relied on a single source of evidence
Matching Model
Segmenting NE candidates into tokens
Computing token-to-token similarity scores based on either phonetic or semantic information
Treating the matching problem as a weighted bipartite matching
Using the score of the maximum weighted bipartite matching as the similarity measure between two NE candidates (see the sketch below)
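As a concrete illustration of the matching step, the sketch below assembles token-to-token scores into an entity-level score via maximum weighted bipartite matching. The solver choice (scipy's linear_sum_assignment), the `token_sim` callback, and the length normalization are assumptions made for illustration; the paper does not prescribe a particular implementation.

```python
# A minimal sketch of the bipartite-matching step, assuming token-to-token
# scores are already available.
import numpy as np
from scipy.optimize import linear_sum_assignment

def entity_similarity(en_tokens, zh_tokens, token_sim):
    """Maximum weighted bipartite matching over token pairs.

    token_sim(e, c) returns the phonetic or semantic similarity of an
    English token e and a Chinese token c (assumed callback).
    """
    scores = np.array([[token_sim(e, c) for c in zh_tokens] for e in en_tokens])
    # linear_sum_assignment minimizes cost, so negate to maximize similarity.
    rows, cols = linear_sum_assignment(-scores)
    total = scores[rows, cols].sum()
    # Normalize by the larger token count so entities of different lengths
    # are comparable (one plausible normalization; the slide does not say).
    return total / max(len(en_tokens), len(zh_tokens))
```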
Tokenization & Semantic Similarity Score
Looking up each token of an English candidate in the bilingual dictionary provided by LDC
Scanning the Chinese candidate for the segments that maximally match any of the Chinese translations in the dictionary
The semantic similarity score is defined as the number of matched characters divided by the total number of characters in the corresponding dictionary translation (see the sketch below)
Unmatched English terms are concatenated with adjacent unmatched terms into one token, and likewise for unmatched Chinese segments
For example:
– English candidate: Palo Alto Chamber of Commerce
– Chinese candidate: 帕洛奧托商會
– The Chinese translation of “commerce” is “商業”, and the segment “商” maximally matches this translation, so “商” is segmented as a token
– The semantic similarity score between “commerce” and “商” is then Len(“商”) / Len(“商業”) = 1 / 2 = 0.5
– The unmatched terms “Palo” and “Alto” are concatenated into one token
– Likewise, the unmatched segment “帕洛奧托” is treated as a single token
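A minimal sketch of the semantic token score described above, assuming a bilingual dictionary `en2zh` that maps an English word to a list of Chinese translations (e.g. from the LDC lexicon). The function name and the brute-force segment search are illustrative, not the paper's implementation.

```python
def semantic_score(en_word, zh_candidate, en2zh):
    """Return (matched_segment, score) for one English token.

    score = len(matched segment) / len(dictionary translation), as in the
    slide's example Len("商") / Len("商業") = 0.5.
    """
    best_seg, best_score = "", 0.0
    for trans in en2zh.get(en_word.lower(), []):
        # Longest contiguous piece of this translation found in the candidate.
        for length in range(len(trans), 0, -1):
            found = next((trans[i:i + length]
                          for i in range(len(trans) - length + 1)
                          if trans[i:i + length] in zh_candidate), None)
            if found:
                score = length / len(trans)
                if score > best_score:
                    best_seg, best_score = found, score
                break
    return best_seg, best_score

# semantic_score("commerce", "帕洛奧托商會", {"commerce": ["商業"]}) -> ("商", 0.5)
```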
Phonetic Similarity Score
Getting the phonetic representation of the English and Chinese candidates; for example, “father” is transformed to “faDR” and “港” to “gang3”
Splitting the phonetic representations into basic phoneme units
– Note: some details of this step are unclear in the original paper
Building a phoneme pronunciation similarity (PPS) table
Treating the comparison as a weighted longest common subsequence (LCS) problem and finding the optimal weighted LCS
Normalizing the optimal score by dividing by the length of the longer of the two sequences
Using the normalized score as the phonetic similarity score of the two representations (see the sketch below)
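The weighted-LCS step can be sketched as a standard dynamic program over the two phoneme sequences, scoring aligned phoneme pairs with the PPS table and normalizing by the longer sequence. Representing the PPS table as a dict keyed by (English phoneme, Chinese phoneme) is an assumption about data layout.

```python
def phonetic_similarity(ph_en, ph_zh, pps):
    """Weighted longest common subsequence over two phoneme sequences.

    pps[(p, q)] is the learned pronunciation similarity of an English
    phoneme p and a Chinese phoneme q (dict layout is an assumption).
    """
    m, n = len(ph_en), len(ph_zh)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = pps.get((ph_en[i - 1], ph_zh[j - 1]), 0.0)
            dp[i][j] = max(dp[i - 1][j],              # skip an English phoneme
                           dp[i][j - 1],              # skip a Chinese phoneme
                           dp[i - 1][j - 1] + pair)   # align the two phonemes
    # Normalize the optimal alignment score by the longer sequence.
    return dp[m][n] / max(m, n) if max(m, n) else 0.0
```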
Learning Phonetic Similarity
Using 20,000 English-Chinese person name pairs from the Chinese-English NE corpus provided by LDC
The names are transformed into basic phoneme units through the procedure described above
The objective is to learn PPS table entries that maximize the phonetic similarity scores of these known translation pairs
The Widrow-Hoff Algorithm
Minimizes the squared error between the computed similarity score and a target value z_k, where z_k is set to 1 for positive training samples and 0 for negative ones
Processes one pair of entities per iteration and updates the PPS table with the standard Widrow-Hoff (least-mean-squares) rule (see the sketch below)
Uses a validation set to implement the terminating condition
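The slide omits the loss and update formulas, so the sketch below shows the standard Widrow-Hoff (least-mean-squares) procedure applied to the PPS entries flattened into a weight vector. The feature vectors x_k (which make the similarity score linear in the PPS entries), the initialization, and the default learning rate are assumptions for illustration.

```python
import numpy as np

def widrow_hoff(features, targets, lr=1e-3, epochs=50, val=None):
    """Standard Widrow-Hoff / LMS training of a weight vector v.

    features[k] is an assumed feature vector x_k such that the phonetic
    similarity of training pair k is v . x_k; targets[k] is z_k in {0, 1}.
    The learning rate default is a placeholder, not the paper's value.
    """
    v = np.full(features.shape[1], 0.5)        # arbitrary initial PPS values
    best_v, best_err = v.copy(), float("inf")
    for _ in range(epochs):
        for x, z in zip(features, targets):
            v -= 2.0 * lr * (v @ x - z) * x    # LMS gradient step
        if val is not None:                    # early stopping on validation set
            vx, vz = val
            err = float(np.mean((vx @ v - vz) ** 2))
            if err >= best_err:
                return best_v
            best_v, best_err = v.copy(), err
    return v
```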
The Exponentiated-Gradient Algorithm
The top-level framework of EG is similar to that of WH
EG requires the weight vector V to lie on the probability simplex, so V is maintained on the simplex throughout training
Before V is used to estimate similarity scores, its elements are magnified (rescaled)
The PPS table is updated with a multiplicative (exponentiated-gradient) rule rather than an additive one (see the sketch below)
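A sketch of the exponentiated-gradient counterpart under the same assumptions as the WH sketch: the weight vector starts uniform on the probability simplex, is updated multiplicatively, and is renormalized after every step. The magnification factor applied before scoring is not given in the slide, so a placeholder `scale` argument is used.

```python
import numpy as np

def exponentiated_gradient(features, targets, lr=0.1, epochs=50):
    """EG training: v stays on the probability simplex during training."""
    d = features.shape[1]
    v = np.full(d, 1.0 / d)                       # uniform start on the simplex
    for _ in range(epochs):
        for x, z in zip(features, targets):
            grad = 2.0 * (v @ x - z) * x          # same squared-error gradient as WH
            v = v * np.exp(-lr * grad)            # multiplicative update
            v /= v.sum()                          # renormalize onto the simplex
    return v

def magnify(v, scale):
    """Before scoring, the simplex-constrained entries are magnified; the
    exact factor is omitted in the slide, so `scale` is a placeholder."""
    return np.clip(v * scale, 0.0, 1.0)
```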
The Genetic Algorithm
A chromosome represents all the elements of the PPS table; each gene corresponds to a particular table entry
An initial population of chromosomes is prepared
Standard genetic operators such as crossover and mutation are employed
The fitness function is the same learning objective: maximizing the phonetic similarity of the training translation pairs (see the sketch below)
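A generic genetic-algorithm sketch in which a chromosome is a flattened PPS table and `fitness` is the training objective to maximize. Truncation selection, one-point crossover, and per-gene mutation are generic choices, not necessarily the paper's operators; the crossover rate default of 0.8 follows the experiments, while the mutation rate default is a placeholder.

```python
import random

def genetic_search(dim, fitness, pop_size=50, generations=100,
                   crossover_rate=0.8, mutation_rate=0.01):
    """Generic GA over real-valued chromosomes (flattened PPS tables).

    fitness(chromosome) should return the training objective to maximize.
    """
    pop = [[random.random() for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]           # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            if random.random() < crossover_rate:   # one-point crossover
                cut = random.randrange(1, dim)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            for i in range(dim):                   # per-gene mutation
                if random.random() < mutation_rate:
                    child[i] = random.random()
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```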
Experiments
Experiment 1:
– 2,000 person name pairs, disjoint from the training and validation data, are used to evaluate the learning performance
– The learning rate of the WH algorithm is set to 5e
– The learning rate of the EG algorithm is set to
– The crossover rate and mutation rate of the genetic algorithm are tuned to 0.8 and respectively
Experiments
Experiment 2:
– 1,000 named entities are collected to evaluate the performance of the overall NE matching model
– A pure phonetic model and a pure semantic model are also evaluated for comparison
Mining New Entity Translations
An unsupervised learning technique using a bilingual dictionary is employed to detect comparable news
Person names, place names, and organization names are automatically extracted from the news content
For each NE, a cognate weight is computed, representing the NE's importance in the news cluster
Both the NE matching model and the cognate weights are used to discover new translations
Mining New Entity Translations
If an English and a Chinese named entity both have relatively high cognate weights in a particular news cluster, they are more likely to be matched
A cognate weight similarity score Sw(E, C) is computed from the two cognate weights
The final similarity score Sf(E, C) for E and C combines the matching model score with Sw(E, C) (see the sketch below)
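The exact formulas for Sw(E, C) and Sf(E, C) are not reproduced in the slides. The sketch below assumes a simple linear interpolation between the matching-model score and the cognate weight similarity, using the weight αh mentioned in the experiments; this combination rule is an assumption, not the paper's stated formula. Pairs whose final score exceeds the threshold φ would presumably be output as new translations.

```python
def final_similarity(match_score, cognate_sim, alpha_h=0.8):
    """Hypothetical combination of the NE matching score and the cognate
    weight similarity Sw(E, C). The linear interpolation below is an
    assumption; only the weight alpha_h = 0.8 and the threshold 0.5 are
    taken from the experiments described in the slides."""
    return alpha_h * match_score + (1.0 - alpha_h) * cognate_sim

# Example: a pair would be accepted as a new translation if
# final_similarity(sm, sw) exceeds the threshold (0.5 in the experiments).
```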
Mining New Entity Translations
News articles from 20 November 2003 to 20 December 2003 were collected: 1,599 English and 2,476 Chinese news articles in total
Each news batch contains news from four consecutive days, giving 28 batches in total
Comparable news clusters are generated for each batch
αh was set to 0.8, and the threshold φ was set to 0.5
Mining New Entity Translations
In total, 128 previously unseen name translations were discovered
Considering only those discovered Chinese NEs whose corresponding English entity appears in the output:
– The average ARR across all 28 batches for all named entities was
– The ARR for person names was
– The ARR for place and organization names was