Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology
Automatic Extraction of Translational Japanese-KATAKANA and English Word Pairs from Bilingual Corpora
Advisor: Dr. Hsu
Graduate: Chun Kai Chen
Author: Keita Tsuji
ICCPOL (2001)

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Outline
Motivation
Objective
Introduction
Extraction of Translational Word Pairs
Experimental Results
Conclusions
Personal Opinion

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Motivation
■ Bilingual lexicons are in demand in many fields
─ cross-language information retrieval
─ machine translation
■ Bilingual corpora
─ have become widely available, reflecting changes in publishing activities

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Objective
■ Propose a method
─ to automatically extract translational Japanese-KATAKANA and English word pairs from bilingual corpora based on transliteration rules

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Introduction (1/2)
■ Many of the previously proposed methods
─ depend heavily on word frequency in the corpora
─ cannot treat low-frequency words properly
■ Low-frequency words
─ include newly coined words, which are especially in demand in many fields

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Introduction (2/2)
■ Against this background
─ we propose a method to extract translational KATAKANA-English word pairs from bilingual corpora
■ The method
─ applies all the existing transliteration rules to each mora unit in a KATAKANA word
─ extracts English words that match or partially match one of these transliteration candidates as translations

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Extraction of Translational Word Pairs (overview diagram)
Pipeline: Pick up J → Decompose J → Generate candidates (using transliteration rules) → Pick up E → Identify LCS → P(J, E) = max_i Dice()
Running example: グラフ → 'グ', 'ラ', 'フ'; T_i(J) = graf, graph, …; S('guraf', 'graph') = 'gra'; P(グラフ, graph) = 1, P(グラフ, library) = 0.36
The two sub-steps, Construction of Transliteration Rule and Measure for Matching, are detailed on the following slides.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Construction of Transliteration Rule (1/2)
1. Decompose the KATAKANA word into units
─ 'ディスパッチャー'
─ → 'ディ', 'ス', 'パッ' and 'チャー'
2. Extract a transliteration rule for each unit manually
─ from 'ディスパッチャー' and 'dispatcher':
─ 'ディ' = 'di'
─ 'ス' = 's'
─ 'パッ' = 'pat'
─ 'チャー' = 'cher'
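As a concrete illustration of step 1, here is a minimal Python sketch (not the author's code) of the unit decomposition, assuming the convention suggested by the examples on this slide: each full-size katakana starts a new unit, while small kana, the sokuon ッ and the long-vowel mark ー attach to the preceding unit. The exact segmentation used in the paper may differ in edge cases.

```python
# Hypothetical sketch of step 1 (mora-unit decomposition); not the paper's code.
SMALL_KANA = set("ァィゥェォャュョッー")  # small kana, sokuon, long-vowel mark

def decompose(katakana_word: str) -> list[str]:
    """Split a KATAKANA word into transliteration units."""
    units = []
    for ch in katakana_word:
        if units and ch in SMALL_KANA:
            units[-1] += ch      # attach to the preceding unit
        else:
            units.append(ch)     # a full-size kana starts a new unit
    return units

print(decompose("ディスパッチャー"))  # ['ディ', 'ス', 'パッ', 'チャー']
print(decompose("グラフ"))           # ['グ', 'ラ', 'フ']
```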

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Construction of Transliteration Rule (2/2)
3. Repeat (1) and (2) for all the word pairs in the list
─ count the frequency of each rule
─ rank the rules for each KATAKANA unit
4. Add Hepburn transliteration rules to the rules above
─ if the source list is large enough, this step might not be necessary
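Step 3 can be sketched as a simple frequency count over manually aligned (unit, alphabet string) pairs; the function name build_tr and the toy input below are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of step 3: rank each unit's transliteration rules by frequency.
from collections import Counter, defaultdict

def build_tr(aligned_pairs):
    """aligned_pairs: iterable of (KATAKANA unit, alphabet string) tuples."""
    counts = Counter(aligned_pairs)
    ranked = defaultdict(list)
    for (unit, alpha), _freq in counts.most_common():  # most frequent first
        ranked[unit].append(alpha)
    return dict(ranked)

pairs = [("ディ", "di"), ("ス", "s"), ("パッ", "pat"), ("チャー", "cher"),
         ("ス", "su"), ("ス", "s")]
print(build_tr(pairs)["ス"])  # ['s', 'su'] -- 's' observed twice, so ranked first
```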

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Assumptions
■ When Japanese borrows an English word
─ the word is transliterated on the basis of Japanese mora units
─ the correspondences between units are stable
■ These transliterations are context-free
─ they do not depend on the preceding or following mora units
■ The number of transliteration rules is small
■ The rules do not vary drastically over time or across domains

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Extraction of Translational Word Pairs
■ Pick up J and decompose it into units
─ according to the same framework used for TR construction
■ Using all the transliteration rules in TR
─ generate all the possible transliteration candidates for J
─ henceforth, the i-th transliteration candidate of J is denoted T_i(J)
■ Pick up each E that co-occurred with J
─ i.e. occurred in the same aligned segment of the corpus
─ identify the longest common subsequence of E with each T_i(J)
■ If the following P(J, E) exceeds a certain threshold
─ extract the pair (J, E) as a translation
P(J, E) = max_i Dice(L(S(T_i(J), E)), L(T_i(J)), L(E))
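The candidate-generation step on this slide amounts to a Cartesian product of the per-unit rules; a hedged sketch follows, with a toy rule table (the real TR is the one built from the term-pair list described earlier).

```python
# Hypothetical sketch of generating all transliteration candidates T_i(J).
from itertools import product

TR = {                      # toy rule table, for illustration only
    "グ": ["g", "gu"],
    "ラ": ["ra", "la", "r"],
    "フ": ["f", "ph", "hu"],
}

def candidates(units):
    """All T_i(J): one string per combination of per-unit rules."""
    options = [TR.get(u, [""]) for u in units]
    return ["".join(combo) for combo in product(*options)]

cands = candidates(["グ", "ラ", "フ"])
print(len(cands))                          # 2 * 3 * 3 = 18 candidates
print("graph" in cands, "graf" in cands)   # True True
```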

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Dice Measure
■ J: KATAKANA word in the corpus
■ E: English word in the corpus
■ L(w): number of characters of word w
■ T(J): transliteration candidate of word J
■ S(w1, w2): longest common subsequence of w1 and w2
─ e.g. S('guraf', 'graph') = 'gra'
■ Dice(k, m, n) = 2k / (m + n)
Worked example with T_i(J) = 'graf' and E = 'graph':
Dice(L(S(T_i(J), E)), L(T_i(J)), L(E)) = Dice(L(S('graf', 'graph')), L('graf'), L('graph')) = Dice(3, 4, 5) = 2*3 / (4 + 5) = 0.67

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Example
■ P(グラフ, graph)
= max( Dice(L(S('graf', 'graph')), L('graf'), L('graph')),
       Dice(L(S('graph', 'graph')), L('graph'), L('graph')),
       Dice(L(S('graff', 'graph')), L('graff'), L('graph')),
       …,
       Dice(L(S('gulerfe', 'graph')), L('gulerfe'), L('graph')) )
= max(0.67, 1.00, 0.60, …, 0.33, 0.33) = 1.00
■ P(グラフ, library) = max(0.36, 0.33, …, 0.29, 0.29) = 0.36
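The score itself only needs a standard longest-common-subsequence length plus the Dice formula above; the following sketch reproduces the numbers on this slide.

```python
# Sketch of P(J, E) = max_i Dice(L(S(T_i(J), E)), L(T_i(J)), L(E)).
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def dice(k: int, m: int, n: int) -> float:
    return 2 * k / (m + n)

def p_score(cands, english: str) -> float:
    """Maximum Dice score over all transliteration candidates of J."""
    return max(dice(lcs_len(t, english), len(t), len(english)) for t in cands)

print(round(dice(lcs_len("graf", "graph"), 4, 5), 2))   # 0.67
print(p_score(["graf", "graph", "graff"], "graph"))     # 1.0
```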

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Device for Time-saving (1/2)
■ The procedure requires much computational time when applied to actual data
─ using all the rules in TR often leads to a combinatorial explosion in the number of transliteration candidates
─ identifying the longest common subsequence often requires much time

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Device for Time-saving (2/2)
■ Apply fewer transliteration rules to longer KATAKANA words
─ at TR construction, the transliteration rules were ranked according to their frequencies
─ apply only the top 12/(number of units in J) + 1 rules to each unit of J
─ example: for a 3-unit word, 12/3 + 1 = 5 rules per unit, so e.g. 3*6*4 candidates are reduced to 3*5*4
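A tiny sketch of this heuristic, assuming integer division when the unit count does not divide 12 evenly (the slide only gives the 3-unit case):

```python
# Hypothetical sketch: keep only the top 12 // len(units) + 1 rules per unit.
def max_rules_per_unit(num_units: int) -> int:
    return 12 // num_units + 1          # e.g. 12 // 3 + 1 = 5 for a 3-unit word

def pruned_rule_options(units, ranked_tr):
    """ranked_tr maps each unit to its alphabet strings, most frequent first."""
    k = max_rules_per_unit(len(units))
    return [ranked_tr.get(u, [""])[:k] for u in units]

print(max_rules_per_unit(3))   # 5
print(max_rules_per_unit(6))   # 3
```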

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Other Alternatives (1/2)
■ Transliteration Rule (TR)
─ compare the effectiveness of TR and HR (the Hepburn transliteration rules)
─ HR is a well-known rule set for transliterating Japanese into alphabet strings and is easily available
─ HR gives each KATAKANA unit a unique alphabet string
─ TR usually generates many T(J), whereas HR generates only one T(J); if HR alone can produce good results, we can save computational time

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Other Alternatives (2/2)
■ Measure for Matching
─ our method uses the combination of Dice and NPT_score
─ Mdic was used in place of Dice: Mdic(k, m, n) = (1 + log k) * 2k / (m + n)
─ Bgrm was used in place of the combination of Dice and NPT_score: Bgrm(T(J), E) = |N_T(J) ∩ N_E| / |N_T(J) ∪ N_E|
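The two alternative measures can be sketched as below; the bigram definition of N_w and the natural-logarithm base in Mdic are assumptions, since the slide does not spell them out.

```python
# Hypothetical sketch of the alternative measures Mdic and Bgrm.
import math

def mdic(k: int, m: int, n: int) -> float:
    return (1 + math.log(k)) * 2 * k / (m + n)       # log base assumed natural

def bigrams(w: str) -> set:
    return {w[i:i + 2] for i in range(len(w) - 1)}   # N_w: character bigrams of w

def bgrm(t: str, e: str) -> float:
    nt, ne = bigrams(t), bigrams(e)
    return len(nt & ne) / len(nt | ne)

print(bgrm("graph", "graph"))            # 1.0
print(round(bgrm("graf", "graph"), 2))   # 0.4
```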

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiment Data (1/2)
■ Source data for TR
─ 3,742 transliterational term pairs from a dictionary of artificial intelligence
─ Hepburn transliteration rules were added to them
■ Bilingual corpora
─ eight bilingual corpora: Japanese-English parallel abstracts/titles of academic papers
─ the domains of these corpora are artificial intelligence (AI), forestry (FR), information processing (IP) and architecture (AC)

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experiment Data (2/2)
■ The purpose of using abstract and title corpora
─ examine the influence of the size and degree of parallelism of each segment on the extraction results
─ abstracts are larger and noisier than titles
■ The purpose of using four domains
─ examine the influence of domain differences on the results
─ the results for the other three domains do not differ significantly from those for artificial intelligence
─ so we need not worry much about the domain of the source data for TR construction

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Results (1/3)
■ TR is much more effective than the simple HR (Hepburn transliteration rules)
─ TR_Dice achieved 93% precision at 75% recall
─ HR_Dice remained at 4% precision at 75% recall
■ Dice is more effective than Mdic
─ the corpora contain many word pairs that are long and morphologically related but not translational
─ Mdic scores for such pairs tend to become higher than those of short translational pairs, which decreases precision

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Results (2/3)
■ The combination of Dice and NPT_score is more effective than Bgrm
■ Our method is more effective than the combination of NPT and Match
─ the method in [1] tends to generate strings that contain incorrect translations: 'グラフ' is transliterated into 'ghurlaoffphu', which contains not only the correct translation 'graph' but also the incorrect one 'hop'
─ Match depends only on the length of the English word and its matched part
─ therefore it tends to favor short English words: Match('ghurlaoffphu', 'hop') = 3/3 = 1
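For reference, one plausible reading of the Match measure from [1], consistent with the example on this slide (the matched part of the English word divided by its own length); this is an assumption for illustration, not a definition taken from [1].

```python
# Hypothetical reading of Match: matched characters of E divided by len(E).
def match(t: str, e: str) -> float:
    it = iter(t)
    matched = sum(1 for ch in e if ch in it)   # greedy subsequence matching
    return matched / len(e)

print(match("ghurlaoffphu", "hop"))   # 1.0 -- the short word is fully matched
```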

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Results (3/3)
■ The TR constructed from artificial intelligence term pairs also performed well on the other three domains
─ this indicates the domain independence of TR
─ we do not have to construct a TR for each domain
■ Our method achieved …% precision at 75% recall against the title corpora
■ Our method also performed well against the abstract corpora (83-93% precision at 75% recall)

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Error Analysis (1/2)
■ Most of the translational word pairs that were not extracted contained transliteration units that were not listed in TR
─ for instance, 'アーキテクチャ' and 'architecture' was not extracted because 'キ' = 'chi' and 'チャ' = 'ture' were not listed in TR
─ this problem can be solved by enriching TR

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Error Analysis (2/2)
■ A few of the translational word pairs that were not extracted were acronyms and their original forms
─ these cannot, by nature, be extracted based on transliteration
■ Many of the non-translational word pairs that were wrongly extracted were morphologically related pairs
─ for instance, 'プログラマ' (programmer) and 'program', or 'クラス' (class) and 'subclass'
─ emphasizing matches on the first and last units might be effective

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Conclusion
■ We proposed a method
─ for extracting translational KATAKANA-English word pairs from bilingual corpora
■ The experiments
─ show that our method is highly effective
■ By enriching TR and introducing some heuristics
─ the performance can be improved further

Intelligent Database Systems Lab N.Y.U.S.T. I. M. Personal Opinion
■ Advantages
─ Extraction of Translational Word Pairs
─ Error Analysis
■ Disadvantages
─ Construction of Transliteration Rule
─ Assumptions and limitations