Web-based acquisition of Japanese katakana variants


1 Web-based acquisition of Japanese katakana variants
Advisor: Dr. Hsu. Reporter: Wen-Hsiang Hu. Authors: Takeshi Masuyama; Hiroshi Nakagawa. SIGIR 2005.

2 Outline
Motivation
Objective
Introduction
ACQUISITION OF STRING PENALTY WITH WEB DATA
EXTRACTION OF KATAKANA VARIANT PAIRS
CONCLUSIONS AND FUTURE WORK
Personal Opinion

3 Motivation
Previous work manually did one of the following:
defined Katakana rewrite rules, e.g. ベ (be) and ヴェ (ve) being replaceable with each other;
defined the weight of each operation needed to edit one string into another in order to detect these variants, e.g. a weight of 0.8 for substituting ベ (be) and ヴェ (ve).
However, this previous research has not been able to keep up with the ever-increasing number of loanwords and their variants.
When we search with only one spelling of a loanword, we miss the information that would be retrieved by other spellings with the same meaning.

4 Objective
Acquire new weights of edit operations automatically, and keep up with new Katakana loanwords only by collecting text data from the Web.

5 ACQUISITION OF STRING PENALTY WITH WEB DATA
Collect candidate Katakana variant pairs. Threshold of edit distance: 2. Query to Google: "Vodka" and ウォッカ (wholtuka). Stop-words are used at this stage; the slide also labels a second threshold whose value is not in the transcript.
Calculate the string penalty (SP). CLC: character-level context, e.g. f(oltuka) = 2, f(oltuka, w↔u) = 1, f(oltuka, w↔v) = 1.
Extract Katakana variant pairs, e.g. (ウォッカ (wholtuka), ウォトカ (wholtoka)), (ウォッカ (wholtuka), ウオッカ (uoltuka)), (ウォッカ (wholtuka), ヴォッカ (voltuka)).
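The edit-distance filter on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes plain Levenshtein distance with unit costs, assumes the threshold of 2 is inclusive, and uses the hypothetical helper names `edit_distance` and `candidate_pairs`. The word list is made up from the spellings of "vodka" shown on the slide.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


MAX_DISTANCE = 2  # from the slide; treating it as inclusive is an assumption


def candidate_pairs(base, words, max_distance=MAX_DISTANCE):
    """Pair `base` with every other word within `max_distance` edits."""
    return [(base, w) for w in words
            if w != base and edit_distance(base, w) <= max_distance]


# Spellings from the slide plus one unrelated Katakana word.
words = ["ウォトカ", "ウオッカ", "ヴォッカ", "ミネラルウォーター"]
pairs = candidate_pairs("ウォッカ", words)
for pair in pairs:
    print(pair)
```

Each of the three "vodka" spellings differs from ウォッカ by a single character, so they pass the filter, while the unrelated word does not.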

6 EXTRACTION OF KATAKANA VARIANT PAIRS
We collect Katakana words from the corpus. We used pattern matching on a Katakana character set, e.g. ・ ("bullet"), ー ("macron-1"), − ("macron-2"), ― ("macron-3"), to collect Katakana words such as ミネラルウォーター (mineraruwho-ta- for "mineral water").
Extract candidate Katakana variant pairs. Threshold of string penalty (SP): 4.
Extract Katakana variant pairs. Threshold of cosine similarity: 0.05.
Example pair: ミネラルウォーター (mineraruwho-ta- for "mineral water") and ミネラルウオータ (mineraruuo-ta for "mineral water").
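The two ends of this pipeline, collecting Katakana words by pattern matching and comparing contexts by cosine similarity, can be sketched as below. The character range, the bag-of-words context representation, and the names `collect_katakana_words` and `cosine_similarity` are assumptions for illustration; the slide gives only the symbols and the 0.05 threshold.

```python
import math
import re
from collections import Counter

# Katakana letters U+30A1..U+30FA, the middle dot U+30FB and the long-vowel
# mark U+30FC, plus the minus sign U+2212 and horizontal bar U+2015 that
# appear as macron look-alikes. The exact character set is an assumption.
KATAKANA_WORD = re.compile("[\u30A1-\u30FC\u2212\u2015]+")


def collect_katakana_words(text):
    """Return every maximal run of Katakana-set characters in `text`."""
    return KATAKANA_WORD.findall(text)


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


COSINE_THRESHOLD = 0.05  # from the slide

found = collect_katakana_words("私はミネラルウォーターを飲む")

# Hypothetical context counts for two spellings of the same loanword.
ctx_a = Counter({"water": 2, "drink": 1})
ctx_b = Counter({"water": 1, "bottle": 1})
is_variant_pair = cosine_similarity(ctx_a, ctx_b) >= COSINE_THRESHOLD
```

Whether the comparison against 0.05 is inclusive is not stated on the slide; the sketch uses `>=`.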

7 Experiment
We conducted a paired t-test (rejection region: 5%) for the cases of SP = 1, 2, and 3, and no significant difference was detected.
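For readers unfamiliar with the test, the paired t statistic can be computed as in the sketch below. The scores are hypothetical (the transcript does not include the measured values), and `paired_t_statistic` is an illustrative helper name, not something from the paper.

```python
import math
import statistics


def paired_t_statistic(xs, ys):
    """t = mean(d) / (stdev(d) / sqrt(n)) for paired differences d = x - y."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(xs, ys)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


# Hypothetical paired scores for two SP settings on the same four test sets.
scores_a = [8, 7, 9, 8]
scores_b = [6, 8, 7, 9]
t = paired_t_statistic(scores_a, scores_b)

# Two-sided 5% critical value of Student's t with 3 degrees of freedom,
# taken from a standard t table.
T_CRITICAL_DF3 = 3.182
significant = abs(t) > T_CRITICAL_DF3
```

With these made-up numbers the statistic stays well below the critical value, which is the "no significant difference" outcome. The function raises `ZeroDivisionError` if all differences are identical.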

8 Introduction The pronunciation of loanwords does not necessarily coincide with that in their original language.

9 Introduction (cont.) We tried to find how many documents were retrieved by Google when each Katakana variant for spaghetti was used as a query.

10 Introduction (cont.) We first describe methods based on rewrite rules, which are listed in Table 3. Henceforth, ↔ denotes substitution, ∅ denotes an empty string,… For example, when ベネチア (benechia for "Venezia") was input into their system, which applies rewrite rules, it produced the variants ベネツィア (benetsia), ヴェネチア (venechia), and ヴェネツィア (venetsia).
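A rewrite-rule system of this kind can be sketched as a closure over bidirectional substitutions. Table 3 is not reproduced in this transcript, so the two rules below are assumptions read off the "Venezia" example, and `expand_variants` is a hypothetical name rather than the earlier system's API.

```python
# Bidirectional rewrite rules (a ↔ b), assumed from the "Venezia" example.
RULES = [
    ("ベ", "ヴェ"),   # be ↔ ve
    ("チ", "ツィ"),   # chi ↔ tsi
]


def expand_variants(word, rules=RULES):
    """Return every string reachable from `word` by applying the rules.

    Each step rewrites one occurrence of one side of a rule with the other
    side. This terminates for rule sets like the one above, where rewriting
    cannot grow the string without bound.
    """
    seen = {word}
    queue = [word]
    while queue:
        current = queue.pop()
        for a, b in rules:
            for src, dst in ((a, b), (b, a)):
                start = current.find(src)
                while start != -1:
                    candidate = current[:start] + dst + current[start + len(src):]
                    if candidate not in seen:
                        seen.add(candidate)
                        queue.append(candidate)
                    start = current.find(src, start + 1)
    return seen


variants = expand_variants("ベネチア") - {"ベネチア"}
```

For the input ベネチア the closure contains exactly the three other spellings given on the slide.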

11 Introduction (cont.) It is difficult to keep up with the ever-increasing number of loanwords and their variants, since they define rewrite rules manually or assign weights to the edit distance manually. We propose a method of mechanically determining the weights of the string penalty to overcome this problem.

12 Calculation of a string penalty
We used the following five types as character-level contexts (CLC) of each character targeted by the edit operation:
1. the preceding two characters of the target character,
2. the preceding character of the target character,
3. the succeeding two characters of the target character,
4. the succeeding character of the target character, and
5. the preceding character and the succeeding character of the target character.
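The five context types can be read off a word by simple slicing, as in this sketch. The function name, the dictionary keys, and the choice to truncate contexts at word boundaries are assumptions; the slide does not say how boundaries are handled.

```python
def character_level_contexts(word: str, i: int) -> dict:
    """Return the five CLC types for the character at index `i` of `word`.

    Contexts that would run past either end of the word are truncated,
    which is an assumption made for this sketch.
    """
    if not 0 <= i < len(word):
        raise IndexError("target index out of range")
    prev1 = word[max(0, i - 1):i]
    next1 = word[i + 1:i + 2]
    return {
        "prev2": word[max(0, i - 2):i],   # preceding two characters
        "prev1": prev1,                   # preceding character
        "next2": word[i + 1:i + 3],       # succeeding two characters
        "next1": next1,                   # succeeding character
        "around": (prev1, next1),         # preceding and succeeding character
    }


# Target: ウ at index 4 of ミネラルウォーター ("mineral water").
contexts = character_level_contexts("ミネラルウォーター", 4)
```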

13 Experimental evaluation of a string penalty
Table 6: Correlation of the mechanically determined SP and the manually determined SP.
Cov(X, Y) = E(XY) − E(X)E(Y)
We calculated the correlation coefficient for Table 6; the value was 0.76, which indicates a strong correlation.
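The covariance identity on this slide leads directly to the correlation coefficient, r = Cov(X, Y) / sqrt(Cov(X, X) · Cov(Y, Y)). A minimal sketch follows; the input values are made up, since Table 6 itself is not in the transcript, and the function names are illustrative.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance via Cov(X, Y) = E(XY) - E(X)E(Y)."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


def correlation(xs, ys):
    """Pearson correlation coefficient built from the covariance above."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical penalties: mechanically vs. manually determined SP.
mechanical_sp = [1, 2, 3, 4]
manual_sp = [2, 4, 6, 8]
r = correlation(mechanical_sp, manual_sp)
```

The toy data are perfectly linear, so r comes out as 1; real data such as Table 6 would give a value strictly between −1 and 1.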

14 Experimental evaluation of Katakana variant pairs (cont.)

15 Comparative results for task of detecting Katakana variants
Table 10 compares the results for Mechanical, Word, Google, and Yahoo! in terms of detecting Katakana variants of “spaghetti.”

16 Error Analyses Mechanical could not extract the variant pair グリズリーベア (gurizuri-bea) and グリズリー・ベア (gurizuri-・bea), both of which denote "grizzly bear," since their document-level contexts were completely different.

17 CONCLUSIONS AND FUTURE WORK
We proposed a method of mechanically determining the weight of each edit operation for identifying Katakana variants, based on Web data. Unlike methods presented in previous work, ours could easily keep up with the increasing number of loanwords. We also proposed a method of extracting Japanese Katakana variant pairs from a large corpus based on similarities in spelling and context. In our future work, we are planning to calculate SP with a list of words in other languages and Katakana loanwords.

18 Personal Opinion
Strength: automatic method.
Application: transliteration variants in Chinese, e.g. 柯林頓, 科林頓, 克林頓 (all renderings of "Clinton").

