ENHANCING CLUSTERING BLOG DOCUMENTS BY UTILIZING AUTHOR/READER COMMENTS Beibei Li, Shuting Xu, Jun Zhang Department of Computer Science University of Kentucky.

ENHANCING CLUSTERING BLOG DOCUMENTS BY UTILIZING AUTHOR/READER COMMENTS
Beibei Li, Shuting Xu, Jun Zhang
Department of Computer Science, University of Kentucky
ACMSE’07

INTRODUCTION
- blogs: highly opinionated personal online commentary, including hyperlinks to other resources
- Technorati (July 2006)
  - tracking more than 50 million blogs
  - about 175,000 blogs were created daily
  - the size of the blogosphere doubles every six months
- how many blog authors update their blogs regularly is not clear

INTRODUCTION (CONT.)
- analysis of the blogosphere in 2004
  - more than two-thirds of public blogs are personal journals
  - knowledge blogs (k-blogs) make up a mere 3 percent
- due to the diverse backgrounds of blog authors and readers, the blogosphere has hyper-accelerated the spread of information

BLOGS VS. WEB PAGES
- the major differences between blogs and standard web pages
  - blogs are dated
  - most blogs allow readers to place comments on each blog document
    - this creates communication channels between the blog authors and the readers
  - blog authors can place individual blogs into different categories
    - according to some predefined categories
    - the definitions of the categories may differ from author to author

BLOG DOCUMENTS
- use the vector-space model to encode the blog web pages
  - each blog page can be viewed as a column vector
  - each word used can be considered as one row of the matrix
- consider a blog page as three parts
  - blog title
  - blog body: the content of the blog page
  - comments of the authors and/or the readers

A SAMPLE BLOG PAGE

HYPOTHESIS
- hypothesis
  - using title and comment words in the dataset will enhance the discrimination of the blog pages
  - and result in more accurate clustering solutions
- reason
  - the words in the comments reflect the specific views, questions, and answers of the authors and the readers
  - they may carry more weight in discriminating individual blog pages

DATA PREPARATION AND CLUSTERING
Data Preprocessing
- selected three categories of blog files
  - gun control
  - church
  - Alzheimer’s disease
- downloaded from Windows Live Spaces by searching with the keywords
- each entry has at least one comment
- each category has 70 files, for a total of 210 blog files
- parsing
  - convert each page into 3 parts
  - stemming
  - delete stop words
  - count the number of occurrences of each word
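The parsing steps on this slide can be sketched as follows. This is only an illustration: the slides do not say which stemmer or stop-word list was used, so `crude_stem`, `STOP_WORDS`, and `preprocess_page` are hypothetical stand-ins, not the authors' tools.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption, not from the slides).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "on", "for"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def count_terms(text):
    """Lowercase, tokenize, drop stop words, stem, and count occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if len(t) > 1 and t not in STOP_WORDS]
    return Counter(stems)


def preprocess_page(title, body, comments):
    """Split a blog page into its three parts and count terms in each part."""
    return {
        "title": count_terms(title),
        "body": count_terms(body),
        "comments": count_terms(" ".join(comments)),
    }
```

Keeping the three parts as separate counters lets each part be weighted independently later on.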

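DATA PREPROCESSING (CONT.)
- represent each document by three vectors: one each for the title, the body, and the comments
- the vector for the whole document is a weighted sum of all three vectors: $v = w_t v^{t} + w_b v^{b} + w_c v^{c}$
  - $w_t$: title weight
  - $w_b$: body weight
  - $w_c$: comment weight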
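A minimal sketch of the weighted sum, assuming each part is stored as a term-to-count mapping. The function name `combine_parts` and the default weights of 1.0 are illustrative assumptions; the slides do not give the weight values at this point.

```python
def combine_parts(title, body, comments, w_t=1.0, w_b=1.0, w_c=1.0):
    """Return the document vector w_t*title + w_b*body + w_c*comments.

    Each argument is a dict mapping a term to its occurrence count.
    """
    combined = {}
    for weight, counts in ((w_t, title), (w_b, body), (w_c, comments)):
        for term, n in counts.items():
            combined[term] = combined.get(term, 0.0) + weight * n
    return combined


# Example: title words count double, comment words count half.
doc = combine_parts({"gun": 1}, {"gun": 2, "law": 1}, {"law": 3},
                    w_t=2.0, w_b=1.0, w_c=0.5)
```

Setting one weight to zero drops that part entirely, which is one way to compare clustering on the title, body, or comments alone.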

DATA PREPROCESSING (CONT.)
- the word-page matrix $A$ is composed of a set of such document vectors: $A = (v_1, \ldots, v_m)$
  - $v_{ij}$ is the weighted number of occurrences of word $i$ in document $v_j$
- to balance the influence of small and large documents, scale each document vector $v_j$ so that its Euclidean norm equals 1
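The scaling step and the assembly of the word-page matrix might look like the sketch below. It assumes document vectors are term-to-weight dicts, and it stores the matrix as a list of rows (one per word) with one column per document. The names `unit_normalize` and `build_matrix` are hypothetical.

```python
import math


def unit_normalize(vec):
    """Scale a term->weight dict so that its Euclidean norm equals 1."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    if norm == 0:
        return dict(vec)  # leave an empty or all-zero document unchanged
    return {term: x / norm for term, x in vec.items()}


def build_matrix(doc_vectors):
    """Return (vocab, A), where A[i][j] is the weight of word i in document j."""
    vocab = sorted({term for vec in doc_vectors for term in vec})
    normalized = [unit_normalize(vec) for vec in doc_vectors]
    matrix = [[vec.get(term, 0.0) for vec in normalized] for term in vocab]
    return vocab, matrix


vocab, A = build_matrix([{"gun": 3.0, "law": 4.0}, {"church": 2.0}])
```

After normalization, a long page and a short page with the same word proportions map to the same column.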

FEATURE SELECTION
- tf-idf
- TI is the mean value of tf-idf over all the documents for each term
- use TI to measure the quality of the term
- the higher the TI value, the better the term is ranked
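A rough sketch of TI-based feature selection. The transcript does not preserve the exact tf-idf formula, so the variant below (term weight times log(m / document frequency)) is an assumption, as are the function names and the `fraction` parameter.

```python
import math


def mean_tfidf(matrix):
    """TI per term: the mean tf-idf over all documents.

    matrix[i][j] is the weight of term i in document j.
    Assumed variant: tfidf = weight * log(m / df).
    """
    m = len(matrix[0])
    scores = []
    for row in matrix:
        df = sum(1 for x in row if x > 0)  # documents containing the term
        idf = math.log(m / df) if df else 0.0
        scores.append(sum(x * idf for x in row) / m)
    return scores


def select_top_terms(vocab, matrix, fraction):
    """Keep the given fraction of terms with the highest TI, in vocab order."""
    ti = mean_tfidf(matrix)
    keep = max(1, round(fraction * len(vocab)))
    ranked = sorted(range(len(vocab)), key=lambda i: ti[i], reverse=True)
    return [vocab[i] for i in sorted(ranked[:keep])]
```

Under this variant a term that appears in every document gets a TI of zero and is ranked last, which matches the idea that it does little to discriminate pages.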

CLUSTERING
k-means algorithm
1. compute the Euclidean distance from each document to each cluster center; assign each document to the cluster with the smallest distance
2. recompute each cluster center as the mean of its constituent documents
3. repeat steps 1 and 2 until convergence is reached

CLUSTERING (CONT.)
- criterion function $f_r$ for convergence, where
  - $r$: the iteration step
  - $\mathrm{Edist}(v_i, c_j)$: the Euclidean distance from document $v_i$ to cluster center $c_j$
- given a convergence criterion $\varepsilon$, the k-means algorithm stops when $|f_{r+1} - f_r| < \varepsilon$
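The k-means loop and its stopping rule can be sketched as follows. Two things are assumptions here because the transcript does not preserve them: the criterion function is taken to be the sum of Euclidean distances from each document to its assigned center, and the initial centers are simply passed in by the caller. The `max_iter` cap is an added safeguard.

```python
import math


def edist(v, c):
    """Euclidean distance between a document vector and a cluster center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, c)))


def kmeans(docs, centers, eps=1e-6, max_iter=100):
    """Cluster docs (a list of equal-length vectors) around the given centers."""
    centers = [list(c) for c in centers]
    labels = []
    f_prev = None
    for _ in range(max_iter):
        # Step 1: assign each document to the nearest cluster center.
        labels = [
            min(range(len(centers)), key=lambda j: edist(v, centers[j]))
            for v in docs
        ]
        # Step 2: recompute each center as the mean of its member documents.
        for j in range(len(centers)):
            members = [v for v, lab in zip(docs, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        # Assumed criterion f_r: total distance of documents to their centers.
        f = sum(edist(v, centers[lab]) for v, lab in zip(docs, labels))
        # Step 3: stop once |f_{r+1} - f_r| < eps.
        if f_prev is not None and abs(f - f_prev) < eps:
            break
        f_prev = f
    return labels, centers


labels, centers = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]],
                         centers=[[0, 0], [10, 10]])
```

Here documents are passed as rows, so a word-page matrix with documents as columns would need to be transposed first.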

CLUSTERING METRICS
Entropy
- gauges the distribution of each class of documents within each cluster
- suppose there are $q$ classes and the clustering algorithm returns $k$ clusters
- the entropy $E$ of a cluster $S_r$ of size $n_r$ is computed from the counts $n_r^i$
  - $n_r^i$ is the number of documents in the $i$-th class that are assigned to the $r$-th cluster
- the entropy of the entire clustering solution is computed from the entropies of the individual clusters
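The entropy formulas themselves did not survive in the transcript. The sketch below uses a common normalized form, which is an assumption rather than a confirmed reading of the slide: $E(S_r) = -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r}$ for one cluster, and the size-weighted sum $\sum_{r=1}^{k} \frac{n_r}{n} E(S_r)$ for the whole solution, where $n$ is the total number of documents.

```python
import math
from collections import Counter


def cluster_entropy(class_labels, q):
    """Normalized entropy of one cluster, given its members' true classes."""
    if q < 2:
        return 0.0  # with a single class there is nothing to mix
    n_r = len(class_labels)
    counts = Counter(class_labels)  # n_r^i for each class i present
    h = -sum((c / n_r) * math.log(c / n_r) for c in counts.values())
    return h / math.log(q)


def total_entropy(true_classes, cluster_ids):
    """Size-weighted sum of the per-cluster entropies."""
    q = len(set(true_classes))
    n = len(true_classes)
    clusters = {}
    for cls, cid in zip(true_classes, cluster_ids):
        clusters.setdefault(cid, []).append(cls)
    return sum(len(m) / n * cluster_entropy(m, q) for m in clusters.values())
```

With this form, 0 means every cluster contains a single class and 1 means the classes are evenly mixed in every cluster, so lower is better.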

CLUSTERING METRICS (CONT.)
Purity
- purity is defined for each cluster $S_r$
- the purity of the entire clustering solution is computed from the purities of the individual clusters
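The purity formulas are also missing from the transcript. A commonly used definition, assumed here, is $P(S_r) = \frac{1}{n_r} \max_i n_r^i$ for one cluster and $\sum_{r=1}^{k} \frac{n_r}{n} P(S_r)$ for the whole solution, where $n_r^i$ counts the documents of class $i$ in cluster $r$ and $n$ is the total number of documents.

```python
from collections import Counter


def cluster_purity(class_labels):
    """Fraction of a cluster's documents that belong to its largest class."""
    counts = Counter(class_labels)
    return max(counts.values()) / len(class_labels)


def total_purity(true_classes, cluster_ids):
    """Size-weighted sum of the per-cluster purities."""
    n = len(true_classes)
    clusters = {}
    for cls, cid in zip(true_classes, cluster_ids):
        clusters.setdefault(cid, []).append(cls)
    return sum(len(m) / n * cluster_purity(m) for m in clusters.values())
```

Under this definition higher is better, and a value of 1 means every cluster holds documents from one class only.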

EXPERIMENTAL RESULTS
Influence of weight
- results are not very good if only one of the title, body, or comments is used
- clustering on the blog body alone is more accurate than on the title or the comments alone
- using all three parts improves the accuracy considerably

EXPERIMENTAL RESULTS
Feature Selection
- using only the title and the body for clustering
  - reducing the percentage of features used does not change the clustering accuracy
- applying feature selection to all the blog content, including the comments
  - with a certain percentage of features selected, the entropy value can be reduced
- making good use of the terms in comments can help increase clustering accuracy

SUMMARY
- utilizing a particular feature of blogs, the comments, to enhance the effectiveness of a clustering algorithm in classifying blog pages
Future work
- consider the timing effect of the blogs
  - better clustering of blog documents
  - finding blog communities
- the utilization of predefined category information may also improve the classification of blog files
- experimenting with other data mining algorithms on blog datasets