Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 An Efficient Concept-Based Mining Model for Enhancing.

Presentation transcript:

Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology
An Efficient Concept-Based Mining Model for Enhancing Text Clustering
Shady Shehata, Fakhri Karray, and Mohamed S. Kamel
TKDE, 2010
Presented by Wen-Chung Liao, 2010/11/03

Outline
• Motivation
• Objectives
• THEMATIC ROLES BACKGROUND
• CONCEPT-BASED MINING MODEL
• Experiments
• Conclusions
• Comments

Motivation
• Vector Space Model (VSM)
  ─ Represents each document as a feature vector of the terms (words or phrases) in the document.
  ─ Each feature vector contains term weights (usually term frequencies) of the terms in the document.
  ─ Term frequency captures the importance of a term within a document only.
• However, two terms can have the same frequency in their documents while one contributes more to the meaning of its sentences than the other.
• Thus, the underlying text mining model should indicate terms that capture the semantics of the text.
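One way to see the limitation of frequency-only weights is to build term-frequency vectors for two sentences that use the same words with different roles. The following is a minimal sketch; the function name `tf_vector`, the vocabulary, and the two example sentences are assumptions made for illustration, not part of the presentation.

```python
from collections import Counter

def tf_vector(document, vocabulary):
    """Return raw term frequencies of `document` over a fixed vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and documents, chosen for illustration only.
vocabulary = ["ball", "hits", "john", "the"]
d1 = "John hits the ball"
d2 = "The ball hits John"

v1 = tf_vector(d1, vocabulary)
v2 = tf_vector(d2, vocabulary)

# Both documents map to the same vector, although their meanings differ.
print(v1, v2)
```

The two vectors are identical, so any similarity computed from them cannot tell which term plays the agent role in each sentence.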

Objectives
• A new concept-based mining model is introduced. It:
  ─ captures the semantic structure of each term within a sentence and document, rather than only the frequency of the term within a document;
  ─ effectively discriminates between nonimportant terms and terms that hold the concepts representing the sentence meaning;
  ─ computes three measures for analyzing concepts at the sentence, document, and corpus levels;
  ─ proposes a new concept-based similarity measure, based on a combination of sentence-based, document-based, and corpus-based concept analysis;
  ─ has a more significant effect on clustering quality, owing to the similarity measure's insensitivity to noisy terms.

THEMATIC ROLES BACKGROUND
• Verb argument structure (e.g., "John hits the ball"):
  ─ "hits" is the verb.
  ─ "John" and "the ball" are the arguments of the verb "hits".
• Label: a label is assigned to an argument.
  ─ E.g., "John" has the subject (or agent) label; "the ball" has the object (or theme) label.
• Term: either an argument or a verb.
  ─ A term is either a word or a phrase.
• Concept: a labeled term.
• Generally, the semantic structure of a sentence can be characterized by a form of verb argument structure.
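The definitions above (term, label, concept, verb argument structure) can be sketched as a small data structure. The class and field names below are hypothetical and chosen only for illustration; the presentation does not prescribe a representation.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Concept:
    """A labeled term: the term is a word or phrase, the label is its role."""
    label: str
    term: str

@dataclass
class VerbArgumentStructure:
    """One verb together with its labeled arguments."""
    verb: Concept
    arguments: List[Concept]

    def concepts(self) -> List[Concept]:
        # Every term in the structure (verb or argument) is a concept.
        return [self.verb] + self.arguments

# The "John hits the ball" example from the slide.
structure = VerbArgumentStructure(
    verb=Concept("verb", "hits"),
    arguments=[Concept("subject", "John"), Concept("object", "the ball")],
)

for concept in structure.concepts():
    print(concept.label, "->", concept.term)
```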

CONCEPT-BASED MINING MODEL

CONCEPT-BASED MINING MODEL
• Sentence-Based Concept Analysis
  ─ Calculating ctf of concept c in sentence s
    • The conceptual term frequency, ctf, is the number of occurrences of concept c in the verb argument structures of sentence s.
    • A concept that appears frequently in these structures has the principal role of contributing to the meaning of s.
    • ctf is a local measure on the sentence level.
  ─ Calculating ctf of concept c in document d
    • Measures the overall importance of concept c to the meaning of its sentences in document d.
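The slide does not state how the document-level ctf is aggregated from the sentence-level values. The sketch below assumes it is the average of the sentence-level ctf values over the sentences that contain the concept, which is how the original paper is usually summarized; treat this aggregation rule, and the function name `ctf_document`, as assumptions to be checked against the paper.

```python
def ctf_document(ctf_per_sentence):
    """Average ctf over the sentences of a document that contain the concept.

    `ctf_per_sentence` holds one sentence-level ctf value per sentence.
    Sentences where the concept is absent carry 0 and are left out of the
    average (assumed aggregation rule, not stated on the slide).
    """
    containing = [value for value in ctf_per_sentence if value > 0]
    if not containing:
        return 0.0
    return sum(containing) / len(containing)

# Hypothetical document with four sentences; the concept is absent from one.
example = ctf_document([3, 0, 1, 2])
print(example)
```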

CONCEPT-BASED MINING MODEL
• Document-Based Concept Analysis
  ─ The concept-based term frequency, tf: the number of occurrences of a concept (word or phrase) c in the original document.
  ─ tf is a local measure on the document level.
• Corpus-Based Concept Analysis
  ─ The concept-based document frequency, df: the number of documents containing concept c.
  ─ df is used to reward concepts that appear in only a small number of documents.
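The two counts can be sketched directly from their definitions. The helper names, the whitespace tokenization, and the toy corpus are assumptions; the `idf` helper uses the common log(N/df) form as one plausible way to reward rare concepts, which the slide does not spell out.

```python
import math

def concept_tf(concept, document):
    """Count occurrences of a concept (word or phrase) in a document."""
    words = document.lower().split()
    target = concept.lower().split()
    n = len(target)
    if n == 0:
        return 0
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)

def concept_df(concept, corpus):
    """Count the documents that contain the concept at least once."""
    return sum(1 for document in corpus if concept_tf(concept, document) > 0)

def idf(concept, corpus):
    """log(N / df): larger for concepts found in fewer documents (assumed form)."""
    df = concept_df(concept, corpus)
    return math.log(len(corpus) / df) if df else 0.0

# Hypothetical three-document corpus.
corpus = [
    "the ball hit the wall",
    "john hits the ball",
    "mary reads a book",
]

print(concept_tf("the ball", corpus[0]), concept_df("the ball", corpus))
```

A concept such as "mary", which appears in one document, receives a larger `idf` value than "the ball", which appears in two.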

Example of Calculating ctf Measure
Texas and Australia researchers have created industry-ready sheets of materials made from nanotubes that could lead to the development of artificial muscles.
• Three verbs (colored red on the slide; marked TARGET below) represent the semantic structure of the meaning of the sentence.
• Each verb has its own arguments:
  ─ [ARG0 Texas and Australia researchers] have [TARGET created] [ARG1 industry-ready sheets of materials made from nanotubes that could lead to the development of artificial muscles].
  ─ Texas and Australia researchers have created industry-ready sheets of [ARG1 materials] [TARGET made] [ARG2 from nanotubes that could lead to the development of artificial muscles].
  ─ Texas and Australia researchers have created industry-ready sheets of materials made from [ARG1 nanotubes] [R-ARG1 that] [ARGM-MOD could] [TARGET lead] [ARG2 to the development of artificial muscles].
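The ctf values for this sentence can be worked out with a short script. This is a sketch under stated assumptions: the three structures are transcribed from the labeled sentence, a concept is counted once per verb argument structure in which it appears as a whole word or phrase, and the helper names `occurs_in` and `ctf` are hypothetical.

```python
# Verb argument structures transcribed from the slide: one list of
# labeled terms (verb plus arguments) per verb in the sentence.
structures = [
    ["Texas and Australia researchers", "created",
     "industry-ready sheets of materials made from nanotubes that could "
     "lead to the development of artificial muscles"],
    ["materials", "made",
     "from nanotubes that could lead to the development of artificial muscles"],
    ["nanotubes", "that", "could", "lead",
     "to the development of artificial muscles"],
]

def occurs_in(concept, term):
    """True if `concept` appears in `term` as a whole word or phrase."""
    return f" {concept.lower()} " in f" {term.lower()} "

def ctf(concept, structures):
    """Number of verb argument structures in which `concept` occurs."""
    return sum(
        1 for terms in structures
        if any(occurs_in(concept, term) for term in terms)
    )

for concept in ["created", "materials", "nanotubes", "artificial muscles"]:
    print(concept, ctf(concept, structures))
```

Under this counting rule, "created" appears in one structure, "materials" in two, and "nanotubes" and "artificial muscles" in all three, so the terms nested inside several structures receive the higher ctf.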

A cleaning step
• Remove stop words
• Stem the words
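A minimal sketch of such a cleaning step follows. The stop-word list is a small hand-picked assumption, and `naive_stem` is a toy plural stripper standing in for a real stemmer (such as the Porter stemmer); neither is taken from the presentation.

```python
# Hypothetical stop-word list covering the function words of the example.
STOP_WORDS = {"and", "have", "of", "from", "that", "could", "to", "the"}

def naive_stem(word):
    """Toy stemmer: strip a trailing plural 's' (stand-in for a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean(term):
    """Lowercase a term, drop stop words, and stem the remaining words."""
    words = term.lower().split()
    return [naive_stem(word) for word in words if word not in STOP_WORDS]

# ARG2 of the verb "lead" in the example sentence.
cleaned = clean("to the development of artificial muscles")
print(cleaned)
```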

A Concept-Based Similarity Measure
• The single-term similarity measure is: [formula not preserved in the transcript]
• The concept-based similarity between two documents, d1 and d2, is calculated over their m matching concepts (using the TF-IDF weighting scheme): [formula not preserved in the transcript]
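The formulas on this slide were images and are not in the transcript. The sketch below is therefore a reconstruction under assumptions and should be checked against the paper: the single-term measure is taken to be cosine similarity, and the concept-based measure is taken to sum, over matching concepts, the larger of the two length ratios (concept length over the length of its verb argument structure) times the two documents' concept weights, where each weight is (normalized tf + normalized ctf) × log(N/df). All function names and the input layout are hypothetical.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

def normalized(values):
    """Divide each value by the Euclidean length of the whole vector."""
    norm = math.sqrt(sum(v * v for v in values.values()))
    return {key: (v / norm if norm else 0.0) for key, v in values.items()}

def concept_weights(doc, df, n_docs):
    """Assumed weight: (normalized tf + normalized ctf) * log(N / df)."""
    tf_w, ctf_w = normalized(doc["tf"]), normalized(doc["ctf"])
    return {c: (tf_w[c] + ctf_w[c]) * math.log(n_docs / df[c]) for c in doc["tf"]}

def concept_similarity(doc1, doc2, df, n_docs):
    """Assumed form of the concept-based similarity over matching concepts."""
    w1 = concept_weights(doc1, df, n_docs)
    w2 = concept_weights(doc2, df, n_docs)
    total = 0.0
    for concept in set(w1) & set(w2):
        ratio = max(doc1["len_ratio"][concept], doc2["len_ratio"][concept])
        total += ratio * w1[concept] * w2[concept]
    return total

# Hypothetical inputs: one matching concept in a corpus of four documents.
doc1 = {"tf": {"nanotubes": 2}, "ctf": {"nanotubes": 3}, "len_ratio": {"nanotubes": 0.5}}
doc2 = {"tf": {"nanotubes": 1}, "ctf": {"nanotubes": 1}, "len_ratio": {"nanotubes": 0.25}}
sim = concept_similarity(doc1, doc2, df={"nanotubes": 2}, n_docs=4)
print(sim)
```

Unlike the cosine measure, this form lets the sentence-level ctf and the concept's share of its verb argument structure influence the score, which is the property the presentation attributes to the concept-based measure.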

Mathematical Framework
• Assume that the content of document d2 is changed by Δ.
• Sensitivity analysis:
  ─ Assume that each concept consists of one word. In this case, each concept is a word and A = 1. (?)
  ─ By approximation, the d1c value is bigger than the d1w value, and the Δd2c value is bigger than the Δd2w value.
  ─ Hence, the sensitivity of the concept-based similarity is higher than that of the cosine similarity.
  ─ This means that the concept-based model analyzes the similarity between two documents more deeply than the traditional approaches.

Concept-Based Analysis Algorithm
[Figure: the algorithm illustrated on documents d1 to d4; not recoverable from the transcript]

EXPERIMENTAL RESULTS
• Four data sets
  ─ 23,115 ACM abstract articles collected from the ACM digital library, in five main categories
  ─ 12,902 documents from the Reuters-21578 data set, in five category sets
  ─ 361 samples from the Brown corpus; the main categories were press: reportage; press: reviews, religion, skills and hobbies, popular lore, belles-lettres, and learned; fiction: science; fiction: romance and humor
  ─ 20,000 messages collected from 20 Usenet newsgroups
• Three standard document clustering techniques
  ─ Hierarchical Agglomerative Clustering (HAC)
  ─ Single-Pass Clustering
  ─ k-Nearest Neighbor (k-NN)
• Evaluation methods

Four different concept-based weighting schemes:

Conclusions
• Bridges the gap between the natural language processing and text mining disciplines. (?)
• By exploiting the semantic structure of the sentences in documents, a better text clustering result is achieved.
• There are a number of possibilities for extending this paper:
  ─ link this work to Web document clustering;
  ─ apply the same model to text classification.

Comments
• Advantages
  ─ Better similarity, because it considers the semantic structure of sentences in documents.
• Shortcomings
  ─ The algorithm is described ambiguously.
• Applications
  ─ Text clustering
  ─ Text classification