Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.

Presentation transcript:

Fuzzy Translation of Cross-Lingual Spelling Variants (SIGIR '03)
Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology
Department of Information Management
Advisor: Dr. Hsu
Student: Sheng-Hsuan Wang

Slide 2: Outline
- Motivation
- Objective
- Introduction
- Method & Data
- Findings
- Discussion & Conclusions

Slide 3: Motivation
- Cross-language information retrieval (CLIR) performance is limited.
- Some terms are not found in translation dictionaries.
- Fuzzy matching, such as the n-gram method, is one way to handle them.

Slide 4: Objective
A two-step fuzzy translation technique for cross-lingual spelling variants, intended to improve CLIR performance:
1. Transformation rule based translation (TRT) turns source words into intermediate forms.
2. The intermediate forms are translated into the target language using fuzzy matching.

Slide 5: Introduction
- Technical terms and proper names are important text elements, but they are generally not found in the electronic translation dictionaries used by machine translation (MT) and CLIR.
- Many of them are cross-lingual spelling variants: translatable forms that are similar but not identical, e.g., Chernobyl – Tshernobyl.
- Related techniques: similarity measures, n-grams, fuzzy matching, transliteration.

Slide 6: Introduction
- The technique in this paper is transformation rule based translation (TRT).
- It is close to transliteration, but involves no phonetic elements.
- It is suitable for cross-lingual spelling variants.
- Example: Spanish embriologia => English embryology (i => y, ia => y).
- Problem: how can such rules be found automatically?
- Approach: equivalent term pairs are extracted from a translation dictionary and aligned pairwise using edit distance.

Slide 7: Introduction
Two-step fuzzy translation:
1. Source words are translated into intermediate forms based on TRT, in order to render a source word more similar to its target equivalent.
2. The intermediate forms are translated into target language equivalents through approximate string matching, i.e. fuzzy matching (here, n-gram based matching).

Slide 8: Method & Data - Overview
- Input: term pairs from a translation dictionary, e.g. (emcriologia, embryology), (emariolagia, embryology), (embrialagia, embryology), …
- Processing flow: translation dictionary => TRT => intermediate form => n-gram matching.
- Translation strategies: high confidence factor (HCF) and low confidence factor (LCF).
- Example: konvektio => convection, using the rules o – on (end), ko – co (beginning), ekt – ect (middle).

Slide 9: Method & Data - TRT
- Candidate term pairs come from the translation dictionary, e.g. (emcriologia, embryology), (emariolagia, embryology), (embrialagia, embryology), …
- An edit distance threshold selects similar pairs, e.g. (embriologia, embryology), (embriolagia, embryology), (embrialagia, embryology), …
- Error values used by the edit distance:
  0: the same character at the same position
  1: consonant-consonant or vowel-vowel substitution
  1: insertion or deletion of a character
  2: consonant-vowel or vowel-consonant substitution
- Selection of proper terms by error value: the transformation with the smallest sum of error values (minimum ED) is selected.
- Rule: on -> o ugh n at middle position
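The error values listed on this slide can be turned into a weighted edit distance. The following is only a minimal sketch under stated assumptions: the vowel set (including `y` and the Nordic vowels), the lowercasing, and the names `VOWELS`, `substitution_cost` and `weighted_edit_distance` are illustrative choices, not taken from the paper. Any character outside the vowel set is treated as a consonant.

```python
# Assumption: 'y' and the Nordic vowels count as vowels; everything else is a consonant.
VOWELS = set("aeiouyäöå")


def substitution_cost(x: str, y: str) -> int:
    """Error value for aligning character x with character y."""
    if x == y:
        return 0  # same character
    same_class = (x in VOWELS) == (y in VOWELS)
    return 1 if same_class else 2  # 1 within a class, 2 across classes


def weighted_edit_distance(a: str, b: str) -> int:
    """Smallest sum of error values needed to turn a into b."""
    a, b = a.lower(), b.lower()
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = i  # i deletions, each with error value 1
    for j in range(1, len(b) + 1):
        d[0][j] = j  # j insertions, each with error value 1
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,  # deletion
                d[i][j - 1] + 1,  # insertion
                d[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]),
            )
    return d[-1][-1]


print(weighted_edit_distance("embriologia", "embryology"))
print(weighted_edit_distance("konvektio", "convection"))
```

Under these assumptions a term pair would be kept for rule generation when its distance falls below the chosen threshold; the threshold value itself is not given in the transcript.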

Slide 10: Transformation Rule based Translation
- Edit distance
- Automatic generation of rules
  1. Extract similar terms from a dictionary using an edit distance threshold.
  2. Select the proper terms with the smallest sum of error values.
  3. Generate the transformation rules.
- Context information, frequency, and confidence factor
- Sample rules

Slide 11: Edit Distance
$ED(A, B) = \min\{N_{sub} + N_{ins} + N_{del}\}$
$d[i, j] = \min\{d[i-1, j] + 1,\; d[i, j-1] + 1,\; d[i-1, j-1] + cost\}$
where $cost = 0$ if $A[i] = B[j]$, and $cost = 1$ if $A[i] \neq B[j]$.
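The recurrence above is the standard dynamic-programming form of the unit-cost edit distance. A minimal sketch follows; the function name `edit_distance` is an illustrative choice, and the comparison is case-sensitive.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance: fewest substitutions, insertions and deletions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete the first i characters of a
    for j in range(cols):
        d[0][j] = j  # insert the first j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            # Strings are 0-indexed, so a[i - 1] plays the role of A[i].
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[-1][-1]


print(edit_distance("embriologia", "embryology"))  # 3
print(edit_distance("Chernobyl", "Tshernobyl"))    # 2
```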

Slide 12: A sample of Spanish-to-English rules

Slide 13: Translation Resources
- Multilingual medical dictionary by Andre Fairchild
- A Finnish list of medical terms (n=5970)
- A Swedish list of medical terms (n=657)
Language pairs:
- Finnish-English
- French-English
- German-English
- Spanish-English
- Swedish-English

Slide 14: Target Word List and Source Words
Target word list:
- The index of CLEF's LA Times collection, which contains […] words.
Source words:
- First source word list: 217 word tuples (72 training word tuples, 145 test word tuples).
- Second source word list: 126 test word tuples.
Experiment dataset:
- 5 languages × (145 + 126) words = 1355 words

Slide 15: N-gram Matching
- A similarity measure SIM is computed between the source and target words w_1 and w_2 from N_1 and N_2, where N_i is the set of n-grams derived from word w_i.
- Digrams vs. trigrams: trigrams generally performed worse than digrams, but sometimes gave better results.
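The similarity formula itself did not survive in this transcript. The sketch below assumes one common choice, the number of shared n-grams divided by the number of distinct n-grams in either word; this is an assumption, not a confirmed reading of the slide. The names `ngrams`, `sim` and `rank_targets`, the absence of padding characters, and the three-word target list are all illustrative.

```python
def ngrams(word: str, n: int = 2) -> set:
    """Set of contiguous character n-grams of a word (no padding, by assumption)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def sim(w1: str, w2: str, n: int = 2) -> float:
    """Assumed measure: shared n-grams over all distinct n-grams of both words."""
    n1, n2 = ngrams(w1, n), ngrams(w2, n)
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0


def rank_targets(source: str, targets: list, n: int = 2) -> list:
    """Target words sorted from most to least similar to the source word."""
    return sorted(targets, key=lambda t: sim(source, t, n), reverse=True)


# Hypothetical target word list, for illustration only.
targets = ["collection", "connection", "convection"]
print(rank_targets("konvektio", targets))       # digrams
print(rank_targets("konvektio", targets, n=3))  # trigrams
```

Switching `n` between 2 and 3 gives the digram and trigram variants compared on the slide.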

Slide 16: Translation Strategies - High confidence factor (HCF) strategy
- A relatively high confidence factor threshold, 50%, is used to minimize the number of incorrect transformations.
- Order in which rules are read:
  1. Location of the rule in the source word: end, beginning, then middle.
  2. Source string length: the longest first.
  3. Confidence factor: the highest first.
- Example: konvektio => convection, using o – on (end), ko – co (beginning), ekt – ect (middle).
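The reading order can be illustrated with a small sketch. It is a simplification under stated assumptions: every rule that passes the threshold is applied once, in sequence, to produce a single intermediate form. The `Rule` type, the function names, the confidence values and the `io => y` rule are hypothetical; only the three konvektio rules come from the slide.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    source: str    # substring in the source-language word
    target: str    # replacement in the intermediate form
    position: str  # "end", "beginning" or "middle"
    cf: float      # confidence factor in [0, 1]


POSITION_ORDER = {"end": 0, "beginning": 1, "middle": 2}


def apply_rule(word: str, rule: Rule) -> str:
    """Apply one rule at its position; return the word unchanged if it does not fit."""
    if rule.position == "end":
        if word.endswith(rule.source):
            return word[:len(word) - len(rule.source)] + rule.target
    elif rule.position == "beginning":
        if word.startswith(rule.source):
            return rule.target + word[len(rule.source):]
    else:
        # "middle": the match must not touch the first or last character.
        idx = word.find(rule.source, 1)
        if idx != -1 and idx + len(rule.source) < len(word):
            return word[:idx] + rule.target + word[idx + len(rule.source):]
    return word


def to_intermediate(word: str, rules: list, cf_threshold: float = 0.5) -> str:
    """Filter rules by confidence factor, then apply them in the reading order."""
    usable = [r for r in rules if r.cf >= cf_threshold]
    # End, beginning, middle; then longest source string; then highest confidence.
    usable.sort(key=lambda r: (POSITION_ORDER[r.position], -len(r.source), -r.cf))
    for rule in usable:
        word = apply_rule(word, rule)
    return word


RULES = [
    Rule("ekt", "ect", "middle", 0.7),   # confidence values are made up
    Rule("ko", "co", "beginning", 0.8),
    Rule("o", "on", "end", 0.9),
    Rule("io", "y", "end", 0.3),         # hypothetical low-confidence rule
]

print(to_intermediate("konvektio", RULES))                    # convection
print(to_intermediate("konvektio", RULES, cf_threshold=0.1))  # convecty
```

With the lower threshold the hypothetical low-confidence rule fires first and yields an incorrect form, which is the trade-off that a higher threshold is meant to avoid.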

Slide 17: Translation Strategies - Low confidence factor (LCF) strategy
- A confidence factor threshold of 10% was used to filter out unreliable rules.
- More intermediate forms were obtained than with HCF, but some of them may be incorrect transformations.
- In both HCF and LCF, rules whose frequency was < 50 were removed.

Slide 18: Evaluation
- For each word, precision was calculated from the position of the correct equivalent (pce) in the ranked result list of n-gram matching.
- When several words share the same SIM value as the correct equivalent:
  Worst position precision: the correct equivalent is placed last in the set of tied words.
  Average position precision: the correct equivalent is placed in the middle of the set of tied words.
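The handling of ties can be sketched as follows. This is one reading of the slide, not a confirmed implementation: the function name `tie_positions`, the 1-based positions, and the example scores are assumptions, and the step from position to a precision value is left out because the transcript does not state it.

```python
def tie_positions(scored: list, correct: str) -> tuple:
    """Return (worst, average) 1-based positions of the correct equivalent.

    scored is a list of (word, sim_value) pairs from n-gram matching.
    Words with the same SIM value as the correct equivalent form its tie set.
    """
    scores = dict(scored)
    target = scores[correct]
    better = sum(1 for _, s in scored if s > target)  # words ranked strictly above
    tied = sum(1 for _, s in scored if s == target)   # tie set, including the correct word
    worst = better + tied                 # last position of the tie set
    average = better + (tied + 1) / 2     # middle position of the tie set
    return worst, average


# Made-up scores for illustration only.
scored = [
    ("convection", 0.45),
    ("conviction", 0.45),
    ("connection", 0.23),
    ("collection", 0.21),
]
print(tie_positions(scored, "convection"))  # (2, 1.5)
```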

Slide 19: Findings
Five test word types:
- Medical, biological, and chemical terms (Bio terms), n=90
- Place names, n=55
- Economics, n=31
- Technology, n=36
- Miscellaneous, n=59
Five language pairs:
- Finnish-English
- French-English
- German-English
- Spanish-English
- Swedish-English

Slide 20: Findings – 1/3

Slide 21: Findings – 2/3

Slide 22: Findings – 3/3

Slide 23: Discussion & Conclusion
- Technical terms and proper names are often untranslatable due to the limited coverage of translation dictionaries.
- This study proposed two-step fuzzy translation:
  1. Automatically generated transformation rules (TRT)
  2. Fuzzy matching
- Two translation strategies were tested: HCF and LCF.
- Digram and trigram matching were tested in combination with TRT.

Slide 24: Discussion & Conclusion
- The effectiveness of fuzzy translation depends on:
  the frequency of identical terms shared by a source and a target language;
  the extent of variation in the spelling variants between a source and a target language.
- Fuzzy translation is well suited for language pairs with a high percentage of similar but non-identical terms.

Slide 25: Personal opinion
- How can we apply these ideas in our lab? TRT?