Presentation transcript:

Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology
Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology
Authors: Yu-Sheng Lai and Chung-Hsien Wu
ACM Transactions on Asian Language Information Processing, 2002, pages 34-64
Advisor: Dr. Hsu
Presenter: Yu Cheng Chen

Outline
- Motivation
- Objective
- Introduction
- System Overview
- Term Extraction and Selection
- Discriminative Term Selection
- Indexing and Classification
- Experimental Result
- Conclusions
- Personal Opinion

Motivation
- In text categorization, terms are extracted from documents and used to estimate the textual similarity between documents.
- The extracted terms often determine system performance.
- N-grams are typically employed for textual indexing, but:
  - they need comparatively more storage space;
  - an n-gram is not a meaningful unit in linguistics;
  - they suffer from an inconsistency problem.
- The unknown words presented here are more domain-specific than traditional words.
  - Domain dependency

Objective
- Propose a method for extracting meaningful and highly domain-specific unknown words from Chinese text documents.

Introduction
- Two main methods for detecting unknown words:
  - Statistical
    - Some of these are restricted to a particular type of unknown word.
  - Rule-based
    - Use a dictionary
    - Need part-of-speech information
    - Limited to unknown words of restricted length

System Overview
[Figure: system overview. Recoverable labels: categories T1 新聞 (news) and T2 體育 (sports); n = 1 to 8; document j.]

System Overview

Term Extraction and Selection
- Phrase-like unit (PLU)
  - A PLU is a frequently occurring word sequence P.
  - If, for a word w_i in the sequence P, the preceding words w_1 w_2 … w_i are always followed by the word sequence w_i+1 w_i+2 …, then P is probably an unknown word or phrase.
  - For example, 陳水扁 (Chen Shui-bian).
- PLU-based likelihood ratio PLR(p)
  - Values shown on the slide: 陳水扁 250; 陳 1000; 水扁 200.

Term Extraction and Selection
- A word sequence p is considered an unknown word if:
  - n > 1
  - tf(p) >= c
  - PLR(p) >= 1 - ε, or PLR(p) * tf(p) >= d
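
The selection criteria above can be sketched as a small predicate. This is only an illustrative sketch: it assumes n is the number of words in p, that the three conditions are combined as n > 1 AND tf(p) >= c AND (PLR(p) >= 1 - ε OR PLR(p) * tf(p) >= d), and it takes PLR(p) as a precomputed input because the slide does not give its formula. The default threshold values are hypothetical, not taken from the paper.

```python
def is_unknown_word(n, tf, plr, c=5, epsilon=0.1, d=50.0):
    """Return True if a word sequence passes the slide's selection criteria.

    n       -- number of words in the sequence (assumed meaning of n)
    tf      -- term frequency tf(p) of the sequence
    plr     -- PLU-based likelihood ratio PLR(p), computed elsewhere
    c, epsilon, d -- thresholds; the defaults are hypothetical placeholders
    """
    # A single word cannot be a multi-word unknown term, and rare
    # sequences are rejected outright.
    if n <= 1 or tf < c:
        return False
    # Accept when the ratio is close to 1, or when a lower ratio is
    # compensated by a high frequency.
    return plr >= 1.0 - epsilon or plr * tf >= d


if __name__ == "__main__":
    print(is_unknown_word(n=3, tf=250, plr=0.95))  # passes the ratio test
    print(is_unknown_word(n=2, tf=100, plr=0.60))  # passes the product test
    print(is_unknown_word(n=2, tf=10, plr=0.60))   # fails both tests
```

The second test lets a frequent sequence through even when its ratio falls below 1 - ε, which is presumably why the slide lists it as an alternative.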

Further Purification
- Some PLUs are useless or interfering.
- Discard stop terms.
- Deal with cross-included terms.
- Reliability degree

Discriminative Term Selection
- Here the term "discriminative" indicates the utility of a term in distinguishing categories.
- For example, the term 陳水扁 (Chen Shui-bian) helps distinguish the 政治 (politics) class from the 體育 (sports) class.
- For a term t representing category g, the discriminability W(t, g) can be defined as: [formula not preserved in the transcript]

Indexing and Classification
- Index machine
  - Used to locate keywords in a text.
  - M = (S, I, g, f, s_0, O)
  - For example, "半自動套裝遊程" (roughly, "semi-automatic package tour")
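
The slide does not define the components of M. The tuple resembles a multi-keyword matching automaton in the Aho-Corasick style, so the sketch below assumes that reading: S is the set of states, I the input characters, g a goto function, f a failure function, s_0 the start state, and O the output function. This is an assumption made for illustration, not the authors' confirmed construction, and the keyword list is hypothetical.

```python
from collections import deque


def build_index_machine(keywords):
    """Build goto (g), failure (f), and output (O) tables; state 0 is s_0."""
    goto = [{}]    # goto[state][char] -> next state
    output = [[]]  # keywords recognized on entering each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    # Breadth-first pass: each state's failure link points to the longest
    # proper suffix of its path that is also a path from the start state.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # Inherit keywords that end at the suffix state.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def locate_keywords(text, machine):
    """Return (start_index, keyword) pairs for every keyword found in text."""
    goto, fail, output = machine
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((i - len(word) + 1, word))
    return hits


if __name__ == "__main__":
    # Hypothetical keyword list for the slide's example string.
    machine = build_index_machine(["半自動", "自動", "套裝", "遊程"])
    print(locate_keywords("半自動套裝遊程", machine))
    # [(0, '半自動'), (1, '自動'), (3, '套裝'), (5, '遊程')]
```

A single pass over the text finds all keywords, including overlapping ones such as 半自動 and 自動, which is the usual reason for choosing this kind of automaton over repeated substring search.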

Indexing and Classification
- To improve performance, the vector space model (VSM) is used.
  - Each document is represented as a vector.
  - Each element of the vector is a weighted indexing feature.
- Term weighting for training documents
  - There are K categories, with N_k documents in the kth category.
  - D_k,j is the jth document in the kth category.

Indexing and Classification
- Term weighting for training documents
  - S(w) is a smooth 0-1 function for avoiding the bias problem.
  - α is a constant.

Indexing and Classification
- Term weighting for unclassified documents
  - Because the category of an unclassified document is not known, each unclassified document is represented by multiple description vectors.
  - An unclassified document is represented as K vectors X_k, k = 1…K.

Indexing and Classification
- Classification function
  - Combine the vectors of each category into a mean vector.
  - The classification function f_Gk(X; A) is: [formula not preserved in the transcript]
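
The mean-vector step can be illustrated with a minimal sketch. The transcript does not preserve the actual classification function f_Gk(X; A), so the code below swaps in plain cosine similarity between each description vector X_k and the mean vector of category k. That choice, the function names, and the toy vectors are all assumptions for illustration, not the paper's method.

```python
import math


def mean_vector(vectors):
    """Combine a category's training vectors into one mean vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(description_vectors, centroids):
    """Pick the category whose mean vector best matches its own X_k.

    description_vectors[k] is the vector X_k of the unclassified document
    under category k's term weighting; centroids[k] is category k's mean.
    Cosine similarity stands in for the unrecovered f_Gk(X; A).
    """
    scores = {k: cosine(description_vectors[k], centroids[k]) for k in centroids}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy training vectors, invented for this example.
    training = {
        "politics": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
        "sports": [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
    }
    centroids = {k: mean_vector(v) for k, v in training.items()}
    # One description vector per category (identical here for simplicity).
    doc = {"politics": [0.9, 0.1, 0.0], "sports": [0.9, 0.1, 0.0]}
    label, scores = classify(doc, centroids)
    print(label)
```

Scoring X_k only against category k's mean reflects the slide's point that an unclassified document needs one description vector per category.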

Experimental Result
- Corpus
  - Min-Sheng Daily News (MSDN)
  - 44,675 text documents, consisting of over 35 million words
  - Documents from 1997 to April 1997 were used for training, and documents from 1999 to July 1999 were used for testing.
- Performance Evaluation

Experimental Result

Experimental Result
- Baseline performance
  - Uses the words defined in the dictionary.

Experimental Result
- Parameter Testing
  - The number of representative terms is varied.
  - The number of terms selected from each category is either constrained or not.
  - Examine the effect of discriminability (Nor) on performance.

Experimental Result
- Experimental results on the purification process

Experimental Result
- Combined approach (unknown word-based)

Experimental Result
- Comparative performance

Experimental Result
- Consistency between training and testing data

Conclusions
- The authors propose two new concepts: meaningful term extraction and discriminative term selection.
- PLUs improve the performance of text categorization.
- The purification process reduces the dimensionality of the feature space.

Personal Opinion
- Advantages
  - Takes into account meaningful and discriminative terms.
  - The purification process saves time.
  - Terms can be extracted automatically and systematically.
- Application
  - ICD9 code classification and so on.
  - May solve the problem of patient records that mix Chinese and English.
- Limitation
  - The sparse data problem still needs to be solved.