Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Using the Web for Automated Translation Extraction in.

Slides:



Advertisements
Similar presentations
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 A 24-h forecast of solar irradiance using artificial neural.
Advertisements

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Presenter : Yu Cheng Chen Author: Hichem.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Validating Transliteration Hypotheses Using the Web: Web.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 On Rival Penalization Controlled Competitive Learning.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Web-Page Summarization Using Clickthrough Data Advisor.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Quality evaluation of product reviews using an information.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Extreme Re-balancing for SVMs: a case study Advisor :
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Presenter : Chien-Hsing Chen Author: Jong-Hoon Oh Key-Sun.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Probabilistic Model for Definitional Question Answering.
Intelligent Database Systems Lab Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Satoshi Oyama Takashi Kokubo Toru lshida 國立雲林科技大學 National Yunlin.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Chinese Word Segmentation and Statistical Machine Translation Presenter : Wu, Jia-Hao Authors : RUIQIANG.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 A Comparison of SOM Based Document Categorization Systems.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Web usage mining: extracting unexpected periods from web.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 A Comprehensive Comparison Study of Document Clustering.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Presenter : Chien Shing Chen Author: Wei-Hao.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. OpinionMiner: A Novel Machine Learning System for Web Opinion Mining and Extraction Presenter : Jiang-Shan.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Finding Terminology Translations From Hyperlinks On the.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Extracting meaningful labels for WEBSOM text archives Advisor.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology SIGIR1 Improving Web Search Results Using Affinity Graph.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. A semantic similarity metric combining features and intrinsic information content Presenter: Chun-Ping.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Automatic Recommendations for E-Learning Personalization.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. An IPC-based vector space model for patent retrieval Presenter: Jun-Yi Wu Authors: Yen-Liang Chen, Yu-Ting.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Concept similarity in Formal Concept Analysis-An information.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Learning Phonetic Similarity for Matching Named Entity.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology A Plagiarism Detection Technique for Java Program Using.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. How valuable is medical social media data? Content analysis of the medical web Presenter :Tsai Tzung.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 An Adaptation of the Vector-Space Model for Ontology-Based.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Manoranjan.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Mining Logs Files for Data-Driven System Management Advisor.
國立雲林科技大學 National Yunlin University of Science and Technology Self-organizing map learning nonlinearly embedded manifoldsmanifolds Author :Timo Simila.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 2007.SIGIR.8 New Event Detection Based on Indexing-tree.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Utilizing Marginal Net Utility for Recommendation in E-commerce.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Using Text Mining and Natural Language Processing for.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Presenter : Yu Cheng Chen Author: YU-SHENG.
Intelligent Database Systems Lab Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Juan D.Velasquez Richard Weber Hiroshi Yasuda 國立雲林科技大學 National.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 A text mining approach on automatic generation of web.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Fraud detection in online consumer reviews Presenter: Tsai Tzung Ruei Authors: Nan Hu, Ling Liu, Vallabh.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Iterative Translation Disambiguation for Cross-Language.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Chun Kai Chen Author : Qing.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Unsupervised word sense disambiguation for Korean through the acyclic weighted digraph using corpus and.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Psychiatric document retrieval using a discourse-aware model Presenter : Wu, Jia-Hao Authors : Liang-Chih.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Improving the performance of personal name disambiguation.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Region-based image retrieval using integrated color, shape,
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Presenter : Chien Shing Chen Author: Wei-Hao.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 A self-organizing map for adaptive processing of structured.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Translation of Web Queries Using Anchor Text Mining Advisor.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. 1 Mining concept maps from news stories for measuring civic scientific literacy in media Presenter :
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Direct mining of discriminative patterns for classifying.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 TIARA: A Visual Exploratory Text Analytic System Presenter.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Wei Xu,
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 A Study of Learning a Merge Model for Multilingual Information.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Concept Frequency Distribution in Biomedical Text Summarization.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology ACM SIGMOD1 Subsequence Matching on Structured Time Series.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Hierarchical model-based clustering of large datasets.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Text Classification Improved through Multigram Models.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Growing Hierarchical Tree SOM: An unsupervised neural.
Intelligent Database Systems Lab Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Tao-Hsing Chang Chia-Hoang Lee 國立雲林科技大學 National Yunlin University.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Dual clustering : integrating data clustering over optimization.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Key Blog Distillation: Ranking Aggregates Presenter : Yu-hui Huang Authors :Craig Macdonald, Iadh Ounis.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Discovering Interesting Usage Patterns in Text Collections:
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Text Classification, Business Intelligence, and Interactivity:
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Information Extraction from Wikipedia: Moving Down the Long.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Visualizing social network concepts Presenter : Chun-Ping Wu Authors :Bin Zhu, Stephanie Watts, Hsinchun.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Named Entity Disambiguation by Leveraging Wikipedia Semantic Knowledge Presenter : Jiang-Shan Wang Authors.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Electricity Based External Similarity of Categorical Attributes.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Enhancing Text Clustering by Leveraging Wikipedia Semantics.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 A New Cluster Validity Index for Data with Merged Clusters.
Presentation transcript:

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Using the Web for Automated Translation Extraction in Cross- Language Information Retrieval Advisor : Dr. Hsu Presenter : Zih-Hui Lin Author :Ying Zhang and Phil Vines

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 2  Motivation  Objective  Previous work  Methodology  Experiments and results  Conclusions Outline

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 3 Motivation  One of the major remaining reasons that CLIR does not perform as well as monolingual retrieval is the presence of out of vocabulary (OOV) terms. ─ it will not be recognized, and segmented into either smaller sequences of characters or individual characters ─ 北野武 →(north limit military)  Previous work has either relied on manual intervention or has only been partially successful in solving this problem.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 4 Objective  We propose a segmentation free method which can be applied to both Chinese-English and English-Chinese CLIR, correctly extracting translations of OOV terms from the Web automatically, and thus is a significant improvement on earlier work

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 5 English translation extraction in Chinese-English CLIR  Chinese OOV term detection ─ 北野武 (north limit military) → P value given by the HMM will be very low if P value < P min → contains OOV terms  web text extraction ─ we extract strings that contain the Chinese query terms and some English text from the Web.  collection of co-occurrence statistics,  translation selection. search for longest Chinese substring C t :search for the English term e t with the highest frequency: 1. |C targets | = max(|C ij |). 2. f(e t, C t ) = max(f(e i,C targets )). 3. Add (C t, e t ) into the translation dictionary. 1.f(e targets ) = max(f(e i )). 2.f(e t’,C t’ ) = max(f(e targets,C ij )). 3. if C t’ ≠ C t and e t’ ≠ e t, add (C t’, e t’ ) into the translation dictionary. 北野武北野武( Kitano Takeshi ) c 4 c 5 c 6 e 1 導演北野武 ( Kitano Takeshi ) c 2 c 3 c 4 c 5 c 6 e 1

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 6 Chinese translation extraction in English-Chinese CLIR  Extraction of web text ─ use Google to fetch the top100 Chinese documents with the English OOV term e oov as the query.  Collection of co-occurrence statistics ─ ─ accumulate the frequency f oov. ─ considering all substrings in S left and S right, and collecting the frequency f n and the length |s n | of each Chinese substring. ─  Translation selection ─ exclude any substring that already in the translation dictionary doesn’t occur in the document collection 與區域貿易 ……… 兩岸經貿關係( Canada and Cross Straits Economic Relations )三組發表九 ……… 英茂、加 S left ( 包含 20 個字 )S right ( 包含 20 個字 ) e oov Longest length Highest frequency

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 7 Experiments and results Chinese-English CLIREnglish-Chinese CLIR 30 10

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 8 Introduction  When translating from Chinese to English, a standard first step is to segment the text into words based on an existing segmentation dictionary.  However where an OOV term occurs, it will not be recognized, and segmented into either smaller sequences of characters or individual characters.  We propose a segmentation free method based on frequency and length analysis and corpus-based disambiguation

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 9 Previous work  Dictionary-based translation schemes need to address three major issues ─ phrase identification and translation ex. non proliferation treaty and cross straits. ─ translation ambiguity using techniques such as term co-occurrence, mutual information or language modeling. ─ out of vocabulary (OOV) terms. ex. Dioxin

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 10 Previous work- Existing approaches to the OOV problem  Depending on the language, it may be possible to deduce appropriate transliterated translations automatically. ─ that they successfully applied in English-Arabic CLIR.  However the issue is more difficult in Chinese as many characters have the same sound, and many English syllables do not have equivalent sounds in Chinese, meaning that selecting the correct characters to represent a transliterated word can be problematic. ─ cross straits( 兩岸 ) 、北野武 (north limit military)

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 11 Previous work- Segmentation free translation extraction  It is common to find a small amount of English text in Chinese web documents, but extremely rare to find Chinese text in English web documents.  We therefore rely on Chinese web documents to extract translations in both directions.  The problem is that the Chinese OOV term we are looking for is currently unknown, and thus we have no information about how it should be segmented. ─ In previous work, this problem was overcome by manual intervention to provide appropriate segmentation.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 12 Experiments and results  Chinese-English CLIR ─ retrieving English documents using Chinese queries

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 13 Experiments and results (cont.)  English-Chinese CLIR ─ retrieving Chinese documents using English queries. The aim of our work is to find appropriate Chinese translations of English OOV terms

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 14 Conclusions  We have also described improved ways to extract the translation of OOV terms from the Web in a way that does not rely on prior segmentation.  Although the Web is constantly changing, we were able to find most OOV terms, many of which related to news events up to 10 years ago.

Intelligent Database Systems Lab N.Y.U.S.T. I. M. 15 My opinion Advantage: Segmentation free translation extraction Disadvantage: Apply: 線上翻譯 ………..