Learning Phonetic Similarity for Matching Named Entity Translation and Mining New Translations Wai Lam, Ruizhang Huang, Pik-Shan Cheung ACM SIGIR 2004.

Presentation transcript:

Introduction
– Discovering translation pairs across languages, especially for named entities (NEs)
– Focusing on Chinese-English NE translation
– Combining phonetic and semantic information, whereas most previous studies relied on a single source of evidence

Matching Model
– Segmenting NE candidates into tokens
– Computing a token-to-token similarity score based on either phonetic or semantic information
– Treating the matching problem as a weighted bipartite matching problem
– Using the weight of the maximum weighted bipartite matching as the similarity measure between two NE candidates
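
The matching step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `max_weight_matching` and the example scores are hypothetical, and the brute-force search over assignments is only practical for a handful of tokens (a real system would more likely use the Hungarian algorithm).

```python
from itertools import permutations

def max_weight_matching(scores):
    """Return (best_total, pairs) for a token-to-token score matrix.

    scores[i][j] is the similarity between token i of one candidate and
    token j of the other. Every token of the shorter side is matched to a
    distinct token of the longer side. Brute force: small inputs only.
    """
    rows = len(scores)
    cols = len(scores[0]) if scores else 0
    if rows == 0 or cols == 0:
        return 0.0, []
    # Permute over the longer side so each shorter-side token gets a partner.
    transposed = rows > cols
    if transposed:
        scores = [list(col) for col in zip(*scores)]
        rows, cols = cols, rows
    best_total, best_pairs = float("-inf"), []
    for perm in permutations(range(cols), rows):
        total = sum(scores[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total = total
            # Report pairs as (row, column) indices of the original matrix.
            best_pairs = [(j, i) if transposed else (i, j)
                          for i, j in enumerate(perm)]
    return best_total, best_pairs

# Hypothetical scores for two tokens on each side.
total, pairs = max_weight_matching([[0.9, 0.1], [0.2, 0.5]])
```

Here the diagonal assignment wins, so `pairs` is `[(0, 0), (1, 1)]` and `total` is the sum of the two diagonal scores.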

Tokenization & Semantic Similarity Score
– Looking up each token of an English candidate in the bilingual dictionary provided by LDC
– Scanning the Chinese candidate for the segments that maximally match any of the Chinese translations in the dictionary
– The semantic similarity score is defined as the number of matched characters divided by the total number of characters in the corresponding translation.
– Adjacent unmatched English terms are concatenated into one token, and so are adjacent unmatched Chinese segments.
– For example:
  – English candidate: Palo Alto Chamber of Commerce
  – Chinese candidate: 帕洛奧托商會
  – The Chinese translation of “commerce” is “商業”, and the segment “商” maximally matches this translation, so “商” is segmented as a token.
  – The semantic similarity score between “commerce” and “商” is then Len(“商”) / Len(“商業”) = 1 / 2 = 0.5.
  – The unmatched terms “Palo” and “Alto” are concatenated into one token.
  – Likewise, the unmatched segment “帕洛奧托” is treated as a single token.
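
The score in the example above can be reproduced with a short sketch. It assumes that “maximally match” means the longest contiguous piece of the dictionary translation that also occurs in the Chinese candidate; that reading, and the function name `semantic_similarity`, are assumptions made for illustration.

```python
def semantic_similarity(chinese_candidate, translation):
    """Return (matched_segment, score) for one dictionary translation.

    The matched segment is the longest contiguous piece of `translation`
    found in `chinese_candidate` (an assumed reading of "maximally match").
    The score is the matched length divided by the translation length.
    """
    best = ""
    n = len(translation)
    for start in range(n):
        for end in range(start + 1, n + 1):
            piece = translation[start:end]
            if len(piece) > len(best) and piece in chinese_candidate:
                best = piece
    score = len(best) / n if n else 0.0
    return best, score

# The slide's example: "commerce" translates to "商業".
segment, score = semantic_similarity("帕洛奧托商會", "商業")
```

For the slide's example this yields the segment “商” with a score of 1 / 2 = 0.5.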

Phonetic Similarity Score
– Getting the phonetic representation of the English and Chinese candidates. For example, “father” is transformed to “faDR” and “港” is transformed to “gang3”.
– Splitting the phonetic representations into basic phoneme units
  – Note: there are some open questions about this step in the original paper.
– Building a phoneme pronunciation similarity (PPS) table
– Treating the problem as a weighted longest common subsequence problem
– Finding the optimal longest common subsequence
– Normalizing the score of the optimal solution by dividing it by the length of the longer of the two sequences
– Using the normalized score as the phonetic similarity score of the two representations
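
A minimal sketch of the weighted longest common subsequence step is shown below. The PPS table is modeled as a dictionary keyed by phoneme pairs; the phoneme symbols and scores in the example are invented for illustration and are not taken from the paper.

```python
def phonetic_similarity(seq_a, seq_b, pps):
    """Weighted LCS score normalized by the longer sequence length.

    seq_a, seq_b: lists of basic phoneme units.
    pps: dict mapping (unit_a, unit_b) to a similarity score; pairs that
         are absent from the table score 0.
    """
    if not seq_a or not seq_b:
        return 0.0
    m, n = len(seq_a), len(seq_b)
    # dp[i][j] = best total weight aligning seq_a[:i] with seq_b[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair_score = pps.get((seq_a[i - 1], seq_b[j - 1]), 0.0)
            dp[i][j] = max(dp[i - 1][j],                    # skip a unit of seq_a
                           dp[i][j - 1],                    # skip a unit of seq_b
                           dp[i - 1][j - 1] + pair_score)   # align the two units
    return dp[m][n] / max(m, n)

# Hypothetical PPS entries and phoneme sequences.
pps = {("g", "g"): 1.0, ("a", "a"): 1.0, ("N", "ng"): 0.8}
score = phonetic_similarity(["g", "a", "N"], ["g", "a", "ng"], pps)
```

With these made-up entries the optimal alignment has weight 2.8, and dividing by the longer length (3) gives the normalized score.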

Learning Phonetic Similarity
– Using 20,000 English-Chinese person name pairs from the C-E NE Corpus provided by LDC
– The names are transformed into basic phoneme units through the procedure described above.
– The target is to maximize: [formula missing in source]

The Widrow-Hoff Algorithm
– Minimize: [formula missing in source]
– Z_k is set to 1 for positive training samples and 0 for negative ones.
– One pair of entities is processed at each iteration, and the PPS table is updated with the following equation: [formula missing in source]
– A validation set is used to implement the terminating condition.
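
The update equation is not preserved in this transcript. For orientation only, the sketch below shows the textbook Widrow-Hoff (least-mean-squares) step for squared loss; it is not necessarily the exact formula used by the authors. It assumes the PPS table is flattened into a weight vector `v`, that each training pair is summarized by a feature vector `x`, and that `z` is the 0/1 target; all three names are hypothetical.

```python
def widrow_hoff_step(v, x, z, eta):
    """One textbook Widrow-Hoff update: v <- v - 2 * eta * (v.x - z) * x.

    v:   current weights (assumed: flattened PPS table entries)
    x:   feature vector for one training pair (assumed representation)
    z:   target, 1 for a positive pair and 0 for a negative one
    eta: learning rate
    """
    prediction = sum(vi * xi for vi, xi in zip(v, x))
    error = prediction - z
    return [vi - 2.0 * eta * error * xi for vi, xi in zip(v, x)]

# One positive example that uses only the first table entry.
v = widrow_hoff_step([0.5, 0.5], [1.0, 0.0], z=1.0, eta=0.1)
```

Only the entries that contribute to the prediction move; here the first weight rises toward the target while the second is unchanged.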

The Exponentiated-Gradient Algorithm
– The top-level framework of EG is similar to that of WH.
– EG requires V to lie in the probability simplex, so V is kept on the simplex throughout training.
– However, before being used to estimate a similarity score, the elements of V are magnified as: [formula missing in source]
– The updating formula is given by: [formula missing in source]
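
Because the slide's formulas are not preserved, the following is only the textbook exponentiated-gradient step for squared loss, shown to illustrate how a multiplicative update followed by renormalization keeps V on the probability simplex. The names `v`, `x`, `z`, and `eta` are assumptions, and the magnification step mentioned above is not reproduced.

```python
import math

def eg_step(v, x, z, eta):
    """One textbook EG update for squared loss.

    Each weight is multiplied by exp(-2 * eta * (v.x - z) * x_i) and the
    result is renormalized so the weights remain positive and sum to 1.
    """
    prediction = sum(vi * xi for vi, xi in zip(v, x))
    error = prediction - z
    unnormalized = [vi * math.exp(-2.0 * eta * error * xi)
                    for vi, xi in zip(v, x)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Start from the uniform point of the simplex.
v = eg_step([0.5, 0.5], [1.0, 0.0], z=1.0, eta=0.5)
```

Unlike the additive Widrow-Hoff step, the multiplicative form never produces a negative weight, which is why the simplex constraint is easy to maintain.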

The Genetic Algorithm
– A chromosome represents all the elements in the PPS table.
– Each gene in a chromosome corresponds to a particular element in the table.
– An initial population of chromosomes is prepared.
– Standard genetic operators such as crossover and mutation are employed.
– The target is to maximize: [formula missing in source]
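
A minimal genetic-algorithm sketch under stated assumptions: a chromosome is a list of floats in [0, 1) standing in for the PPS table entries, and the fitness function is supplied by the caller because the slide's objective is not preserved. Tournament selection, one-point crossover, elitism, the population size, and the mutation rate of 0.05 are illustrative choices, not values from the paper; only the crossover rate of 0.8 appears in the slides.

```python
import random

def evolve(fitness, gene_count, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Return the fittest chromosome found (higher fitness is better)."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(gene_count)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        next_population = [best[:]]  # elitism: carry the best one forward
        while len(next_population) < pop_size:
            # Tournament selection: the fitter of two random chromosomes.
            parent_a = max(rng.sample(population, 2), key=fitness)
            parent_b = max(rng.sample(population, 2), key=fitness)
            child = parent_a[:]
            if gene_count > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, gene_count)  # one-point crossover
                child = parent_a[:cut] + parent_b[cut:]
            # Mutation: occasionally replace a gene with a fresh random value.
            child = [rng.random() if rng.random() < mutation_rate else gene
                     for gene in child]
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness)
    return best

# Hypothetical fitness: prefer genes close to 0.5.
def toy_fitness(chromosome):
    return -sum((gene - 0.5) ** 2 for gene in chromosome)

best = evolve(toy_fitness, gene_count=4)
```

Because the best chromosome is always copied into the next generation, the best fitness never decreases from one generation to the next.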

Experiments
– Experiment 1:
  – 2,000 person name pairs, distinct from the training and validation data, are used to evaluate the learning performance.
  – The learning rate of the WH algorithm is set to 5e[exponent missing in source].
  – The learning rate of the EG algorithm is set to [value missing in source].
  – The crossover rate and mutation rate of the genetic algorithm are tuned to 0.8 and [value missing in source], respectively.

Experiments
– Experiment 2:
  – 1,000 named entities are collected to evaluate the performance of the overall NE matching model.
  – The pure phonetic model and the pure semantic model are also evaluated for comparison.

Mining New Entity Translations
– An unsupervised learning technique using a bilingual dictionary is employed to detect comparable news.
– Person names, place names, and organization names are automatically extracted from the news content.
– For each NE, a cognate weight is computed, which represents the NE’s importance in the news cluster.
– Both the NE matching model and the cognate weight are used to discover new translations.

Mining New Entity Translations
– If the English and the Chinese named entity both have relatively high cognate weights in a particular news cluster, they are more likely to be a match.
– The formula for the cognate weight similarity score, Sw(E,C), is defined as follows: [formula missing in source]
– The final similarity score Sf(E,C) for E and C is given as follows: [formula missing in source]

Mining New Entity Translations
– News articles from 20 November 2003 to 20 December 2003 were collected.
– There are 1,599 English and 2,476 Chinese news articles in total.
– Each news batch contains news from four consecutive days, resulting in 28 batches in total.
– Comparable news clusters are generated for each batch.
– αh was set to 0.8, and the threshold φ was set to 0.5.
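
The batch count can be checked with a short sketch. It assumes the batches are overlapping four-day windows that advance one day at a time, which is the reading consistent with 28 batches over the 31-day collection period (31 - 4 + 1 = 28); the slides do not state the stride explicitly, and the function name `sliding_batches` is hypothetical.

```python
from datetime import date, timedelta

def sliding_batches(start, end, window_days=4):
    """Return (first_day, last_day) for every window of `window_days`
    consecutive days between `start` and `end` inclusive, advancing by
    one day at a time (assumed stride)."""
    total_days = (end - start).days + 1
    return [
        (start + timedelta(days=offset),
         start + timedelta(days=offset + window_days - 1))
        for offset in range(total_days - window_days + 1)
    ]

batches = sliding_batches(date(2003, 11, 20), date(2003, 12, 20))
```

Under this assumption `len(batches)` is 28, the first batch covers 20 to 23 November, and the last covers 17 to 20 December.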

Mining New Entity Translations
– In total, 128 unseen name translations were discovered.
– Only those discovered Chinese NEs whose corresponding English entity appeared in the output are considered.
– The average ARR across all 28 batches for all the named entities was [value missing in source].
– The ARR for person names was [value missing in source], and the ARR for place and organization names was [value missing in source].
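
The slides do not expand the acronym ARR. Assuming it denotes average reciprocal rank, the usual ranking metric, it could be computed as in the sketch below; the ranks in the example are invented and do not reproduce any reported result.

```python
def average_reciprocal_rank(ranks):
    """Mean of 1/rank over the evaluated entities.

    ranks: 1-based rank of the correct translation for each entity, or
           None when the correct translation was not returned at all
           (scored as 0).
    """
    if not ranks:
        return 0.0
    reciprocal = [1.0 / rank if rank else 0.0 for rank in ranks]
    return sum(reciprocal) / len(reciprocal)

# Hypothetical ranks for three entities: first, second, and fourth place.
arr = average_reciprocal_rank([1, 2, 4])
```

An ARR of 1.0 would mean the correct translation is always ranked first; lower values indicate that correct translations tend to appear further down the candidate list.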

Mining New Entity Translations