Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University.

Slides:



Advertisements
Similar presentations
Context-based object-class recognition and retrieval by generalized correlograms by J. Amores, N. Sebe and P. Radeva Discussion led by Qi An Duke University.
Advertisements

Multi-label Relational Neighbor Classification using Social Context Features Xi Wang and Gita Sukthankar Department of EECS University of Central Florida.
Social Media Mining Chapter 5 1 Chapter 5, Community Detection and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool, September, 2010.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
Disasters and Human Factors Literature Nestor L Osorio Northern Illinois University.
Discovering Overlapping Groups in Social Media Xufei Wang, Lei Tang, Huiji Gao, and Huan Liu Arizona State University.
Expectation Maximization Method Effective Image Retrieval Based on Hidden Concept Discovery in Image Database By Sanket Korgaonkar Masters Computer Science.
Vector Space Information Retrieval Using Concept Projection Presented by Zhiguo Li
(C) 2001 SNU CSE Biointelligence Lab Incremental Classification Using Tree- Based Sampling for Large Data H. Yoon, K. Alsabti, and S. Ranka Instance Selection.
Distributed Representations of Sentences and Documents
Scalable Text Mining with Sparse Generative Models
POTENTIAL RELATIONSHIP DISCOVERY IN TAG-AWARE MUSIC STYLE CLUSTERING AND ARTIST SOCIAL NETWORKS Music style analysis such as music classification and clustering.
Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yum, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007.
1 Zi Yang, Wei Li, Jie Tang, and Juanzi Li Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University, China {yangzi,
Dongyeop Kang1, Youngja Park2, Suresh Chari2
Introduction The large amount of traffic nowadays in Internet comes from social video streams. Internet Service Providers can significantly enhance local.
The use of machine translation tools for cross-lingual text-mining Blaz Fortuna Jozef Stefan Institute, Ljubljana John Shawe-Taylor Southampton University.
An Effective Fuzzy Clustering Algorithm for Web Document Classification: A Case Study in Cultural Content Mining Nils Murrugarra.
1 Scaling multi-class Support Vector Machines using inter- class confusion Author:Shantanu Sunita Sarawagi Sunita Sarawagi Soumen Chakrabarti Soumen Chakrabarti.
Non Negative Matrix Factorization
Data Mining and Machine Learning Lab Network Denoising in Social Media Huiji Gao, Xufei Wang, Jiliang Tang, and Huan Liu Data Mining and Machine Learning.
1 Comparison of Principal Component Analysis and Random Projection in Text Mining Steve Vincent April 29, 2004 INFS 795 Dr. Domeniconi.
No. 1 Classification and clustering methods by probabilistic latent semantic indexing model A Short Course at Tamkang University Taipei, Taiwan, R.O.C.,
Topic Modelling: Beyond Bag of Words By Hanna M. Wallach ICML 2006 Presented by Eric Wang, April 25 th 2008.
Data Mining and Machine Learning Lab Unsupervised Feature Selection for Linked Social Media Data Jiliang Tang and Huan Liu Computer Science and Engineering.
TEMPLATE DESIGN © Zhiyao Duan 1,2, Lie Lu 1, and Changshui Zhang 2 1. Microsoft Research Asia (MSRA), Beijing, China.2.
Pseudo-supervised Clustering for Text Documents Marco Maggini, Leonardo Rigutini, Marco Turchi Dipartimento di Ingegneria dell’Informazione Università.
STAT 3610/5610 – Time Series Analysis Topics to be covered Chapters 1 – 8 Chapter 9 – Skip Chapters 10 – 12.
Improving Web Search Results Using Affinity Graph Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma Microsoft Research.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
A Model for Learning the Semantics of Pictures V. Lavrenko, R. Manmatha, J. Jeon Center for Intelligent Information Retrieval Computer Science Department,
Clustering Sentence-Level Text Using a Novel Fuzzy Relational Clustering Algorithm.
Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts Ramin Homayouni, Kevin Heinrich, Lai Wei, and Michael W. Berry University of Tennessee.
No. 1 Knowledge Acquisition from Documents with both Fixed and Free Formats* Shigeich Hirasawa Department of Industrial and Management Systems Engineering.
Guided Learning for Role Discovery (GLRD) Presented by Rui Liu Gilpin, Sean, Tina Eliassi-Rad, and Ian Davidson. "Guided learning for role discovery (glrd):
Dual Transfer Learning Mingsheng Long 1,2, Jianmin Wang 2, Guiguang Ding 2 Wei Cheng, Xiang Zhang, and Wei Wang 1 Department of Computer Science and Technology.
A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining Farial Shahnaz.
Topic Models Presented by Iulian Pruteanu Friday, July 28 th, 2006.
1 Latent Concepts and the Number Orthogonal Factors in Latent Semantic Analysis Georges Dupret
Effective Automatic Image Annotation Via A Coherent Language Model and Active Learning Rong Jin, Joyce Y. Chai Michigan State University Luo Si Carnegie.
Multiple Instance Learning for Sparse Positive Bags Razvan C. Bunescu Machine Learning Group Department of Computer Sciences University of Texas at Austin.
Latent Friend Mining from Blog Data (ICDM’06). Outline About the Blog Problem statement Three kinds of Approaches Experiment Conclusion.
CoNMF: Exploiting User Comments for Clustering Web2.0 Items Presenter: He Xiangnan 28 June School of Computing National.
Threshold Setting and Performance Monitoring for Novel Text Mining Wenyin Tang and Flora S. Tsai School of Electrical and Electronic Engineering Nanyang.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Mining massive document collections by the WEBSOM method Presenter : Yu-hui Huang Authors :Krista Lagus,
Using Classification Trees to Decide News Popularity
Discovering Evolutionary Theme Patterns from Text - An Exploration of Temporal Text Mining Qiaozhu Mei and ChengXiang Zhai Department of Computer Science.
Event-Based Extractive Summarization E. Filatova and V. Hatzivassiloglou Department of Computer Science Columbia University (ACL 2004)
Virtual Examples for Text Classification with Support Vector Machines Manabu Sassano Proceedings of the 2003 Conference on Emprical Methods in Natural.
A Supervised Machine Learning Algorithm for Research Articles Leonidas Akritidis, Panayiotis Bozanis Dept. of Computer & Communication Engineering, University.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Wei Xu,
Unsupervised Streaming Feature Selection in Social Media
Memoryless Document Vector Dongxu Zhang Advised by Dong Wang
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
NTNU Speech Lab 1 Topic Themes for Multi-Document Summarization Sanda Harabagiu and Finley Lacatusu Language Computer Corporation Presented by Yi-Ting.
Dr. Yuying Chris Chang Project Writing ( 專題製作 )
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.
Image Retrieval and Ranking using L.S.I and Cross View Learning Sumit Kumar Vivek Gupta
No. 1 Classification Methods for Documents with both Fixed and Free Formats by PLSI Model* 2004International Conference in Management Sciences and Decision.
DeepWalk: Online Learning of Social Representations
Unsupervised Learning Part 2. Topics How to determine the K in K-means? Hierarchical clustering Soft clustering with Gaussian mixture models Expectation-Maximization.
Recent Trends in Text Mining
Information Storage and Retrieval Fall Lecture 1: Introduction and History.
Wei Wei, PhD, Zhanglong Ji, PhD, Lucila Ohno-Machado, MD, PhD
A Consensus-Based Clustering Method
Text Categorization Document classification categorizes documents into one or more classes which is useful in Information Retrieval (IR). IR is the task.
Fake News Detection - Social Article Fusion
Unsupervised Learning II: Soft Clustering with Gaussian Mixture Models
Bug Localization with Combination of Deep Learning and Information Retrieval A. N. Lam et al. International Conference on Program Comprehension 2017.
Presentation transcript:

Data Mining and Machine Learning Lab Document Clustering via Matrix Representation Xufei Wang, Jiliang Tang and Huan Liu Arizona State University

News Article Lead Paragraph Explanations Additional Information

Research Paper Introduction Related Work Problem Statement Solution Experiment Conclusion

Book Chapters References Appendix

Document Organization Not randomly organized Put relevant content together Logically independent segments 5

Matrix Space Model Represent a document as a matrix –Segment –Term Each segment is a vector of terms –Terms + frequency 6

Vector Space Model Oversimplify a document –Mixing topics –Word order is lost –Susceptible to noise 7

An Example The IEEE International Conference on Data Mining series (ICDM) has established itself as the world's premier research conference in data mining. It provides an international forum for presentation of original research results, as well as exchange and dissemination of innovative, practical development experiences. The conference covers all aspects of data mining, including algorithms, software and systems, and applications. Vancouver is a coastal seaport city on the mainland of British Columbia, Canada. It is the hub of Greater Vancouver, which, with over 2.3 million residents, is the third most populous metropolitan area in the country, and the most populous in Western Canada.

Matrix Space Model Segment 1 Segment 2 conferencedataminingresearchinternational vancouverpopulouscanadametropolitancolumbia 22211

Vector Space Model conferencedataminingresearchinternationalcanadavancouver

Pros of a Matrix Representation Interpretation: –Segments vs. topics Finer granularity for data management –Segments vs. document Multiple or Single class labels –Flexibility 11

Verify the Effectiveness via Clustering Information Retrieval –Indexing based on segments Classification –New approaches based on Matrix inputs Clustering –New approaches based on Matrix inputs 12

A Graphical Interpretation

Step 1: Obtaining Segments Many approaches for segmentation –Terms –Sentences (Choi et al. 2000) –Paragraphs (Tagarelli et al. 2008) Determining the number of segments –Open research problem 14

Step 2: Latent Topic Extraction Non-negative Matrix Approximation (NMA) LM i represents the probability of a term belonging to a latent topic M i R T represents the probability of a segment belonging to a latent topic 15

Step 3: Clustering Un-overlapping clustering Overlapping clustering 16

Datasets 20newsgroup –20 classes –6,038 documents Reuters –26 clusters –1,964 documents Classic –3 clusters –1,486 documents 17

Experimental Method Generate Datasets (by specifying k) Evaluate the accuracy Repeat 18

Number of Latent Topics

Number of Segments

Comparative Study

Conclusion Proposing a matrix representation for documents Significant improvements with MSM Information Retrieval, classification tasks 22

Questions

24

25