Web-based acquisition of Japanese katakana variants


Web-based acquisition of Japanese katakana variants
Advisor: Dr. Hsu
Reporter: Wen-Hsiang Hu
Authors: Takeshi Masuyama; Hiroshi Nakagawa
Source: SIGIR 2005

Outline
Motivation
Objective
Introduction
ACQUISITION OF STRING PENALTY WITH WEB DATA
EXTRACTION OF KATAKANA VARIANT PAIRS
CONCLUSIONS AND FUTURE WORK
Personal Opinion

Motivation
Previous work relied on manual definitions:
Katakana rewrite rules were defined by hand, e.g. ベ (be) and ヴェ (ve) being replaceable with each other.
The weight of each operation needed to edit one string into another was defined by hand in order to detect these variants, e.g. the weight of substituting ベ (be) with ヴェ (ve) is 0.8.
However, this previous research has not been able to keep up with the ever-increasing number of loanwords and their variants.
When we search with only one spelling of a loanword, we miss the information that the other spellings with the same meaning would have retrieved.

Objective
Acquire new weights of edit operations automatically.
Keep up with new Katakana loanwords only by collecting text data from the Web.

ACQUISITION OF STRING PENALTY WITH WEB DATA
Example pairs: (ウォッカ (wholtuka), ウォトカ (wholtoka)), (ウォッカ (wholtuka), ウオッカ (uoltuka)), (ウォッカ (wholtuka), ヴォッカ (voltuka))
Collect candidate Katakana variant pairs (threshold of edit distance: 2)
Google: "Vodka" and ウォッカ (wholtuka); threshold: 0.00006
Calculate the string penalty (SP)
Stop-words
Extract Katakana variant pairs
CLC: character-level context, e.g. f(oltuka) = 2, f(oltuka, w↔u) = 1, f(oltuka, w↔v) = 1
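
The candidate-collection step can be sketched as follows. This is a minimal illustration, assuming a plain character-level Levenshtein distance with unit costs; the slides give only the threshold (2), so the function names `edit_distance` and `candidate_pairs` and the small word list are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Plain Levenshtein distance with unit costs (an assumed definition)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def candidate_pairs(words, max_distance=2):
    """Return every pair of distinct words within the distance threshold."""
    return [(a, b) for a, b in combinations(words, 2)
            if 0 < edit_distance(a, b) <= max_distance]


# Hypothetical word list: four spellings of "vodka" plus an unrelated word.
words = ["ウォッカ", "ウォトカ", "ウオッカ", "ヴォッカ", "ミネラル"]
pairs = candidate_pairs(words)
```

With this list, every pair of "vodka" spellings is kept as a candidate, while the unrelated word is too far from all of them. The later filtering steps (Google threshold, SP) are not modelled here.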

EXTRACTION OF KATAKANA VARIANT PAIRS
We collect Katakana words from the corpus. We used pattern matching on a Katakana character set that includes symbols, e.g. ・ ("bullet"), ー ("macron-1"), − ("macron-2"), and ― ("macron-3"), to collect Katakana words such as ミネラルウォーター (mineraruwho-ta- for "mineral water").
Extract candidate Katakana variant pairs (threshold of string penalty (SP): 4), e.g. ミネラルウォーター (mineraruwho-ta- for "mineral water") and ミネラルウオータ (mineraruuo-ta for "mineral water").
Extract Katakana variant pairs (threshold of cosine similarity: 0.05).
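
The Katakana word collection step might look like the sketch below. It assumes Unicode text and a single regular expression covering the Katakana letters plus the four symbols named on the slide; the exact character set used by the authors is not given, so the pattern and the name `extract_katakana_words` are assumptions.

```python
import re

# Katakana letters U+30A1..U+30F6, plus the symbols from the slide:
# U+30FB "bullet", U+30FC "macron-1", U+2212 "macron-2", U+2015 "macron-3".
KATAKANA_WORD = re.compile("[\u30A1-\u30F6\u30FB\u30FC\u2212\u2015]+")


def extract_katakana_words(text):
    """Return maximal runs of Katakana characters found in the text."""
    return KATAKANA_WORD.findall(text)


# Hypothetical sentence: the hiragana and kanji between the words are skipped.
sample = "ミネラルウォーターとウオッカを買った。"
found = extract_katakana_words(sample)
```

Including the bullet in the character set keeps spellings such as グリズリー・ベア as one word instead of splitting them in two.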

Experiment
We conducted a paired t-test (rejection region: 5%) for the cases of SP = 1, 2, and 3, and no significant difference was detected.

Introduction
The pronunciation of loanwords does not necessarily coincide with that in their original language.

Introduction (cont.)
We tried to find how many documents were retrieved by Google when each Katakana variant for spaghetti was used as a query.

Introduction (cont.)
We will first describe methods based on rewrite rules, which are described in Table 3. Henceforth, ↔ denotes substitution, ∅ denotes an empty string, …
For example, when they inputted ベネチア (benechia for "Venezia") into their system, which applies rewrite rules, the following variants were obtained:
ベネツィア (benetsia)
ヴェネチア (venechia)
ヴェネツィア (venetsia)
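
A rule-based system of this kind can be sketched as below. The two rules are assumptions made for illustration: ベ↔ヴェ is the replaceable pair named in the motivation, and チ↔ツィ is inferred from the example spellings. The actual rule set (Table 3) is not reproduced in the slides, and `expand_variants` is a hypothetical name.

```python
from itertools import product

# Assumed rules: each tuple lists spellings treated as interchangeable.
RULES = [("ベ", "ヴェ"), ("チ", "ツィ")]


def expand_variants(word, rules=RULES):
    """Generate every spelling reachable by applying the rewrite rules."""
    segments = []  # each entry is a tuple of alternatives for one segment
    i = 0
    while i < len(word):
        for group in rules:
            # Try the longest alternative first so "ヴェ" wins over a prefix.
            hit = next((alt for alt in sorted(group, key=len, reverse=True)
                        if word.startswith(alt, i)), None)
            if hit is not None:
                segments.append(group)
                i += len(hit)
                break
        else:
            segments.append((word[i],))  # character not covered by any rule
            i += 1
    return {"".join(choice) for choice in product(*segments)}


variants = expand_variants("ベネチア")
```

Because every rule has to be written by hand, a new loanword pattern is missed until someone adds a rule for it, which is the limitation the presentation goes on to address.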

Introduction (cont.)
It is difficult for previous methods to keep up with the ever-increasing number of loanwords and their variants, since they define rewrite rules manually or assign weights to the edit distance manually.
We propose a method of mechanically determining the weights of the string penalty to overcome this problem.

Calculation of a string penalty
We used the following five types as character-level contexts (CLC) of each character targeted by the edit operation:
the preceding two characters of the target character,
the preceding character of the target character,
the succeeding two characters of the target character,
the succeeding character of the target character, and
the preceding character and the succeeding character of the target character.
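
The five context types, and the frequency counts f(CLC) and f(CLC, operation) that appear in the acquisition slide, can be sketched as follows. This is an illustration only: the padding marker, the key names, and the observation format are assumptions, and the slides do not show how the counts are turned into a string penalty, so that step is omitted.

```python
from collections import Counter

PAD = "#"  # assumed boundary marker; word-edge handling is not specified


def character_level_contexts(word, index):
    """Return the five CLC types for the character at position `index`."""
    padded = PAD * 2 + word + PAD * 2
    k = index + 2  # position of the target character in the padded string
    return {
        "prev2": padded[k - 2:k],
        "prev1": padded[k - 1],
        "next2": padded[k + 1:k + 3],
        "next1": padded[k + 1],
        "prev1_next1": padded[k - 1] + "_" + padded[k + 1],
    }


def count_contexts(observations):
    """Count f(CLC) and f(CLC, op) over (word, index, op) observations."""
    clc_freq, clc_op_freq = Counter(), Counter()
    for word, index, op in observations:
        for kind, ctx in character_level_contexts(word, index).items():
            clc_freq[(kind, ctx)] += 1
            clc_op_freq[(kind, ctx, op)] += 1
    return clc_freq, clc_op_freq


# Two observations on the same target; the deletion is a hypothetical example.
observations = [("ウォッカ", 2, "ッ↔ト"), ("ウォッカ", 2, "ッ↔∅")]
clc_freq, clc_op_freq = count_contexts(observations)
```

Here the context is seen twice but each edit operation only once, which mirrors the pattern f(oltuka) = 2, f(oltuka, w↔u) = 1, f(oltuka, w↔v) = 1.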

Experimental evaluation of a string penalty
Table 6: Correlation of the mechanically determined SP and the manually determined SP.
Cov(X, Y) = E(XY) - E(X)E(Y)
We calculated the coefficient of correlation for Table 6, and the value was 0.76, which indicates a strong correlation.
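
The covariance identity on this slide leads directly to the correlation coefficient. The sketch below assumes the standard Pearson definition with population moments; the two lists are made-up values for illustration and are not the Table 6 data.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Cov(X, Y) = E(XY) - E(X)E(Y), using population moments."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


def correlation(xs, ys):
    """Pearson correlation: Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical penalties for the same edit operations under both schemes.
mechanical_sp = [1.0, 2.0, 3.0, 4.0]
manual_sp = [0.8, 1.0, 2.5, 3.0]
r = correlation(mechanical_sp, manual_sp)
```

A value of r close to 1 means the mechanically determined penalties rank the edit operations in much the same way as the manual ones.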

Experimental evaluation of Katakana variant pairs (cont.)

Comparative results for the task of detecting Katakana variants
Table 10 compares the results for Mechanical, Word, Google, and Yahoo! in terms of detecting Katakana variants of "spaghetti."

Error Analyses
Mechanical could not extract the variant pair グリズリーベア (gurizuri-bea) and グリズリー・ベア (gurizuri-・bea), both of which denoted "grizzly bear," since their document-level contexts were completely different.
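
Why completely different contexts lead to a miss can be seen from a cosine-similarity sketch. It assumes each word's document-level context is a bag of co-occurring terms, which the slides do not specify; the term lists are hypothetical, and 0.05 is the threshold quoted on the extraction slide.

```python
from collections import Counter
from math import sqrt

COSINE_THRESHOLD = 0.05  # threshold quoted on the extraction slide


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_variant_pair(context_a, context_b, threshold=COSINE_THRESHOLD):
    """Accept a candidate pair only if the contexts are similar enough."""
    return cosine_similarity(context_a, context_b) >= threshold


# Hypothetical contexts with no terms in common.
context_a = Counter(["zoo", "animal", "forest"])
context_b = Counter(["toy", "shop", "gift"])
accepted = is_variant_pair(context_a, context_b)
```

With no shared terms the similarity is exactly zero, so the pair is rejected however close the two spellings are.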

CONCLUSIONS AND FUTURE WORK
We proposed a method of mechanically determining the weight of each edit operation for identifying Katakana variants, based on Web data.
Unlike methods presented in previous work, ours could easily keep up with the increasing number of loanwords.
We also proposed a method of extracting Japanese Katakana variant pairs from a large corpus based on similarities in spelling and context.
In our future work, we are planning to calculate SP with a list of words in other languages and Katakana loanwords.

Personal Opinion
Strength: automatic method.
Application: Chinese transliteration variants, e.g. 柯林頓, 科林頓, and 克林頓 (all "Clinton").