Clustering of Web Documents, Jinfeng Chen

Presentation transcript:

 Clustering of Web Documents Jinfeng Chen

Papers covered
• Zhong Su, Qiang Yang, Hongjiang Zhang, Xiaowei Xu and Yuhen Hu, "Correlation-based Document Clustering using Web Logs"
• Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma and Jinwen Ma, "Learning to Cluster Web Search Results"

Correlation-based Document Clustering using Web Logs: Introduction
• Web log data is used to construct the clusters.
• Frequent simultaneous visits to two seemingly unrelated documents should indicate that they are in fact closely related.
• The base algorithm is DBSCAN, which groups neighboring objects of the database into clusters based on local distance information.

DBSCAN
• Does not require the user to pre-specify the number of clusters.
• Needs only one scan through the database.
• Takes two parameters, a radius value ε and a value Mpts:
    ε: distance measure (radius)
    Mpts: minimum number of points that should occur within radius ε around a dense object

DBSCAN algorithm (cont'd)
• Algorithm DBSCAN(DB, ε, MinPts)
    for each o in DB do
      if o is not yet assigned to a cluster then
        if o is a core object then
          collect all objects density-reachable from o according to ε and MinPts
          assign them to a new cluster
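The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the presenter's implementation: it assumes the database is given as a symmetric distance matrix `dist`, that an object counts itself among its neighbors, and that objects never reached from a core object are labeled `-1` (noise). The names `region_query` and `dbscan` are hypothetical.

```python
from collections import deque


def region_query(dist, o, eps):
    """Indices of all objects within distance eps of object o (o included)."""
    return [p for p in range(len(dist)) if dist[o][p] <= eps]


def dbscan(dist, eps, min_pts):
    """Label each object with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(dist)
    labels = [None] * n              # None means "not yet visited"
    cluster_id = -1
    for o in range(n):
        if labels[o] is not None:
            continue                 # already assigned or marked as noise
        neighbors = region_query(dist, o, eps)
        if len(neighbors) < min_pts:
            labels[o] = -1           # not a core object; may become a border later
            continue
        cluster_id += 1              # o is a core object: start a new cluster
        labels[o] = cluster_id
        queue = deque(neighbors)
        while queue:                 # collect everything density-reachable from o
            p = queue.popleft()
            if labels[p] == -1:
                labels[p] = cluster_id   # former noise becomes a border object
            if labels[p] is not None:
                continue
            labels[p] = cluster_id
            p_neighbors = region_query(dist, p, eps)
            if len(p_neighbors) >= min_pts:
                queue.extend(p_neighbors)  # p is also a core object: keep expanding
    return labels


# Seven points on a line; the distance is the absolute difference.
positions = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(a - b) for b in positions] for a in positions]
labels = dbscan(dist, eps=1.5, min_pts=2)
```

With these toy points, the two groups of three form separate clusters and the isolated point at 50 is left as noise.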

Limitations of DBSCAN in clustering of web documents
• DBSCAN performs clustering using a fixed threshold value to determine "dense" regions in the document space.
• As a result, the algorithm often cannot distinguish between dense and loose points, and the entire document space is often lumped into a single cluster.

RDBC algorithm (recursive density-based clustering)
• The key difference between RDBC and DBSCAN is that in RDBC the identification of core points is performed separately from the clustering of the individual data points.
• Different values of ε and Mpts are used in RDBC to identify this core point set, Cset.

RDBC algorithm (cont'd)
To avoid connecting too many clusters through "bridges":
    Set initial values ε = ε1 and Mpts = Mpts1
    WebPageSet = web_log
    RDBC(ε, Mpts, WebPageSet) {
      use ε and Mpts to get the core point set Cset
      if size(Cset) > size(WebPageSet)/2
        { DBSCAN(ε, Mpts, WebPageSet) }
      else
        { ε = ε/2; Mpts = Mpts/4;
          RDBC(ε, Mpts, WebPageSet);
          collect all other points in (WebPageSet - Cset) around the clusters found in the last step, according to ε2
        }
    }
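The first two lines of the RDBC body (finding Cset and testing its size) can be sketched on their own. The snippet below covers only that decision step, not the full recursion or the final collection of non-core points. It assumes a distance matrix as input, counts a point as its own neighbor, and rounds Mpts/4 down with a floor of 1, which the slide does not specify. The function names are hypothetical.

```python
def core_point_set(dist, eps, min_pts):
    """Indices of points that have at least min_pts points (self included) within eps."""
    n = len(dist)
    return [
        i for i in range(n)
        if sum(1 for j in range(n) if dist[i][j] <= eps) >= min_pts
    ]


def rdbc_branch(dist, eps, min_pts):
    """Return which branch RDBC would take, plus the parameters for that branch."""
    cset = core_point_set(dist, eps, min_pts)
    if len(cset) > len(dist) / 2:
        # More than half of the pages are core points: run plain DBSCAN.
        return "dbscan", eps, min_pts, cset
    # Otherwise tighten the parameters before recursing.
    # Assumption: Mpts/4 is rounded down and never drops below 1.
    return "recurse", eps / 2, max(1, min_pts // 4), cset


positions = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(a - b) for b in positions] for a in positions]
loose = rdbc_branch(dist, eps=1.5, min_pts=2)
strict = rdbc_branch(dist, eps=1.5, min_pts=3)
```

With `min_pts=2`, six of the seven points are core points, so the DBSCAN branch is taken. With `min_pts=3`, only two are, so the parameters are tightened.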

Construct WebPageSet from web logs
• Step 1
• Step 2: Delete visits to image files.
• Step 3: Extract sessions from the data.
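Steps 2 and 3 can be illustrated with a short sketch. The slide does not give the log format or the session rule, so everything specific here is an assumption: records are `(user, timestamp_in_seconds, url)` tuples, image requests are recognised by file extension, and a new session starts after 30 minutes of inactivity.

```python
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".bmp")  # assumed list
SESSION_TIMEOUT = 30 * 60  # assumed idle gap, in seconds


def drop_image_requests(records):
    """Step 2: remove requests whose URL ends in an image extension."""
    return [r for r in records if not r[2].lower().endswith(IMAGE_EXTENSIONS)]


def extract_sessions(records, timeout=SESSION_TIMEOUT):
    """Step 3: group each user's requests into sessions split by idle gaps."""
    sessions = []
    last_seen = {}  # user -> (timestamp of last request, that user's open session)
    for user, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        previous = last_seen.get(user)
        if previous is None or ts - previous[0] > timeout:
            current = []             # first request or long gap: open a new session
            sessions.append(current)
        else:
            current = previous[1]
        current.append(url)
        last_seen[user] = (ts, current)
    return sessions


records = [
    ("a", 0, "/index.html"),
    ("a", 5, "/logo.gif"),
    ("a", 60, "/news.html"),
    ("a", 4000, "/index.html"),
    ("b", 10, "/faq.html"),
]
sessions = extract_sessions(drop_image_requests(records))
```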

Construct WebPageSet (cont'd)
• Step 4: Create a distance matrix
    1) Determine the size of a moving window, within which URL requests will be regarded as co-occurring.
    2) For each pair of URLs, calculate the co-occurrence count N_{i,j} and the individual counts N_i and N_j.

Construct WebPageSet (cont'd)
• Step 4: Create a distance matrix
    3) P(p_i | p_j) = N_{i,j} / N_j
    4) Three distance functions
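Sub-steps 1 to 3 of Step 4 can be sketched as below. The slide does not say exactly how occurrences are counted, so this sketch assumes that N_i is the number of window positions containing page p_i and N_{i,j} the number of window positions containing both pages. The three distance functions are not reproduced in the transcript, so they are left out. Function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sessions, window):
    """Slide a window over each session and count single pages and page pairs."""
    single = Counter()  # N_i: windows containing page i
    pair = Counter()    # N_{i,j}: windows containing both pages, keyed by frozenset
    for session in sessions:
        for start in range(max(1, len(session) - window + 1)):
            pages = set(session[start:start + window])
            single.update(pages)
            pair.update(frozenset(p) for p in combinations(sorted(pages), 2))
    return single, pair


def conditional_probability(single, pair, p_i, p_j):
    """P(p_i | p_j) = N_{i,j} / N_j, or 0.0 when p_j never occurs."""
    if single[p_j] == 0:
        return 0.0
    return pair[frozenset((p_i, p_j))] / single[p_j]


sessions = [["A", "B", "C"], ["A", "B"]]
single, pair = cooccurrence_counts(sessions, window=2)
p_a_given_b = conditional_probability(single, pair, "A", "B")  # 2 of B's 3 windows
p_b_given_a = conditional_probability(single, pair, "B", "A")  # both of A's windows
```

Note that the conditional probability is not symmetric, which is presumably why the slide goes on to define explicit distance functions on top of it.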

Experimental Validation

Conclusions
• A new algorithm for clustering web documents based only on the log data.
• Because it changes the parameters intelligently during the recursive process, RDBC can give clustering results superior to those of DBSCAN.

Learning to Cluster Web Search Results: Introduction
• The algorithm is based on salient phrases extracted from the document contents.
• It is fast enough to be used in an online calculation engine.

Characteristics of clustering web search results
• Existing search engines such as Google, Yahoo and MSN often return a long list of search results.
• Clustering similar search results helps users find relevant results.

Clustered Search results

Conventional Search results

Procedure of the algorithm
• Step 1: Search result fetching
• Step 2: Document parsing and phrase property calculation
• Step 3: Salient phrase ranking

Search result fetching
• Input a query to a conventional web search engine.
• Get the web page of results returned by the engine.
• Extract the titles and snippets.

Document parsing
• Step 1: Cleaning
    Stemming (using Porter's algorithm)
    Sentence boundary identification
• Step 2: Post-processing
    Punctuation elimination
    Filter out stop words, e.g. 'too', 'are'
    Filter out query words
    Example: Microsoft software is available to students.
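Part of the parsing pipeline above can be sketched with the standard library alone. This sketch covers sentence splitting, punctuation elimination, and stop-word and query-word filtering. Porter stemming is deliberately left out because it needs a third-party stemmer. The stop-word list and the function names are assumptions made for illustration.

```python
import re

STOP_WORDS = {"too", "are", "is", "to", "the", "a", "an", "of"}  # assumed short list


def split_sentences(text):
    """Sentence boundary identification: split on ., ! and ?."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def clean_tokens(sentence, query_words):
    """Drop punctuation, stop words and the query words themselves."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    blocked = STOP_WORDS | {q.lower() for q in query_words}
    return [t for t in tokens if t not in blocked]


snippet = "Microsoft software is available to students. Too many are free!"
sentences = split_sentences(snippet)
tokens = clean_tokens(sentences[0], query_words=["Microsoft"])
```

For the slide's example sentence with the query "Microsoft", the remaining tokens are `software`, `available` and `students`.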

Phrase property calculation
• Five properties
    1. Phrase Frequency / Inverted Document Frequency (TFIDF)
    2. Phrase Length: LEN = n, e.g. LEN("big") = 1
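The first two properties can be sketched as follows. The TFIDF formula itself is not in the transcript, so this sketch assumes the common form, phrase frequency times log(N / number of documents containing the phrase), where N is the number of documents (here, search result snippets). Documents are token lists, phrases are token tuples, and the function names are hypothetical.

```python
import math


def phrase_tfidf(phrase, documents):
    """Assumed form: frequency * log(N / document frequency)."""
    n = len(phrase)

    def occurrences(doc):
        return sum(1 for i in range(len(doc) - n + 1) if tuple(doc[i:i + n]) == phrase)

    counts = [occurrences(doc) for doc in documents]
    frequency = sum(counts)                      # total occurrences of the phrase
    doc_freq = sum(1 for c in counts if c > 0)   # documents containing the phrase
    if doc_freq == 0:
        return 0.0
    return frequency * math.log(len(documents) / doc_freq)


def phrase_length(phrase):
    """LEN: number of words in the phrase."""
    return len(phrase)


documents = [
    ["big", "data", "tools"],
    ["big", "data"],
    ["small", "data"],
    ["web", "search"],
]
score = phrase_tfidf(("big", "data"), documents)  # 2 occurrences in 2 of 4 documents
```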

Phrase property calculation (cont'd)
    3. Intra-Cluster Similarity (ICS)
       o: centroid
• Here d_i = {TFIDF_1, TFIDF_2, …}
• Each component of the vectors represents the TFIDF of a phrase.
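The ICS formula is not reproduced in the transcript. A common reading, assumed here, is that the centroid o is the mean of the document vectors d_i that contain the phrase, and ICS is the average cosine similarity between each d_i and o. The sketch below follows that assumption, and its function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def intra_cluster_similarity(vectors):
    """Average cosine similarity between each document vector and the centroid."""
    o = centroid(vectors)
    return sum(cosine(d, o) for d in vectors) / len(vectors)


tight = intra_cluster_similarity([[1.0, 2.0], [1.0, 2.0]])   # identical documents
spread = intra_cluster_similarity([[1.0, 0.0], [0.0, 1.0]])  # orthogonal documents
```

Documents that share a phrase and are otherwise similar give an ICS close to 1, while dissimilar documents pull it down.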

Phrase property calculation (cont'd)
    4. Cluster Entropy (CE)
    5. Phrase Independence (IND)
       Example contexts for the phrase "vectors":
         three "vectors" has…
         with some "vectors" be…
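Phrase independence can be illustrated with the slide's own example. The formula is not in the transcript, so this sketch assumes that independence is the entropy of the words immediately to the left and to the right of the phrase, averaged over the two sides. Cluster entropy is not covered here. Function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (natural log) of a list of occurrence counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)


def phrase_independence(phrase, documents):
    """Average of left-context and right-context entropy (assumed definition)."""
    n = len(phrase)
    left, right = Counter(), Counter()
    for doc in documents:
        for i in range(len(doc) - n + 1):
            if tuple(doc[i:i + n]) == phrase:
                if i > 0:
                    left[doc[i - 1]] += 1      # word just before the phrase
                if i + n < len(doc):
                    right[doc[i + n]] += 1     # word just after the phrase
    return (entropy(left.values()) + entropy(right.values())) / 2


varied = [["three", "vectors", "has"], ["with", "some", "vectors", "be"]]
fixed = [["support", "vector", "machine"], ["support", "vector", "machine"]]
ind_varied = phrase_independence(("vectors",), varied)
ind_fixed = phrase_independence(("vector",), fixed)
```

A phrase that appears in varied contexts, like "vectors" above, scores higher than a word that always sits inside the same longer phrase.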

Learning to rank key phrases
• A regression model combines the five properties above into a single salience score for each phrase.
• Regression is a technique that tries to determine the relationship between two random variables x = (x1, x2, …, xn) and y.
• Here x = (TFIDF, LEN, ICS, CE, IND).
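For the linear case, combining the five properties amounts to a weighted sum, y = b0 + b1·TFIDF + b2·LEN + b3·ICS + b4·CE + b5·IND. The sketch below scores and ranks candidate phrases with such a model. The coefficients and property values are made-up placeholders, not learned or reported values, and the function names are hypothetical.

```python
PROPERTY_NAMES = ("TFIDF", "LEN", "ICS", "CE", "IND")


def salience_score(x, coefficients, intercept):
    """Linear model: intercept + sum of coefficient * property value."""
    return intercept + sum(b * xi for b, xi in zip(coefficients, x))


def rank_phrases(candidates, coefficients, intercept):
    """Return phrase names sorted from most to least salient."""
    return sorted(
        candidates,
        key=lambda name: salience_score(candidates[name], coefficients, intercept),
        reverse=True,
    )


# Placeholder coefficients; in practice they would be fitted on labelled phrases.
coefficients = (0.2, 0.1, 0.3, 0.1, 0.3)
intercept = 0.0

# Placeholder property vectors in the order given by PROPERTY_NAMES.
candidates = {
    "data": (0.5, 1, 0.4, 0.9, 0.2),
    "big data": (1.0, 2, 0.8, 0.5, 0.6),
}
ranking = rank_phrases(candidates, coefficients, intercept)
```

The top-ranked phrases would then serve as cluster names, with each cluster holding the results that contain the phrase.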

Learning to rank key phrases (cont'd)
• Three regression models
    Linear Regression
    Logistic Regression
    Support Vector Regression

Evaluation

Conclusions
• Recasts the search result clustering problem as a supervised salient phrase ranking problem.
• Generates correct clusters with short names, which could improve users' efficiency in browsing search results.

Thanks!