Published by Joseph Parrish. Modified over 9 years ago.
1
Clustering of Web Documents Jinfeng Chen
2
Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu and Yuhen Hu, Correlation-based Document Clustering using Web Logs, 2001.
Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma and Jinwen Ma, Learning to Cluster Web Search Results.
3
Correlation-based Document Clustering using Web Logs
Introduction
Web log data is used to construct clusters. Frequent simultaneous visits to two seemingly unrelated documents indicate that they are in fact closely related. The basic algorithm is DBSCAN, which groups neighboring objects of the database into clusters based on local distance information.
4
DBSCAN
Does not require the user to pre-specify the number of clusters.
Needs only one scan through the database.
Takes two parameters: a radius value ε and a value Mpts.
ε: distance measure (radius)
Mpts: minimum number of points that should occur around a dense object
5
DBSCAN algorithm (cont'd)
Algorithm DBSCAN(DB, ε, MinPts)
  for each o in DB do
    if o is not yet assigned to a cluster then
      if o is a core object then
        collect all objects density-reachable from o according to ε and MinPts
        assign them to a new cluster
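The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the presenters' implementation: it assumes the objects are given as a symmetric distance matrix `dist`, counts an object as its own neighbor, and labels noise as -1. The names `region_query` and `dbscan` are chosen here for illustration.

```python
from collections import deque

def region_query(dist, o, eps):
    """Indices of all objects within distance eps of object o (o included)."""
    return [p for p in range(len(dist)) if dist[o][p] <= eps]

def dbscan(dist, eps, min_pts):
    """Return one label per object: 0, 1, 2, ... for clusters, -1 for noise."""
    n = len(dist)
    labels = [None] * n              # None means "not yet visited"
    cluster_id = 0
    for o in range(n):
        if labels[o] is not None:
            continue                 # already assigned or marked as noise
        neighbors = region_query(dist, o, eps)
        if len(neighbors) < min_pts:
            labels[o] = -1           # not a core object; may become a border object later
            continue
        # o is a core object: collect everything density-reachable from it
        labels[o] = cluster_id
        queue = deque(neighbors)
        while queue:
            p = queue.popleft()
            if labels[p] == -1:
                labels[p] = cluster_id   # former noise becomes a border object
            if labels[p] is not None:
                continue
            labels[p] = cluster_id
            p_neighbors = region_query(dist, p, eps)
            if len(p_neighbors) >= min_pts:
                queue.extend(p_neighbors)    # p is also a core object, keep expanding
        cluster_id += 1
    return labels

# Toy example: seven points on a line, two dense groups and one outlier.
points = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(a - b) for b in points] for a in points]
print(dbscan(dist, eps=1.5, min_pts=2))  # [0, 0, 0, 1, 1, 1, -1]
```

Each object's neighborhood is computed at most once while clusters are grown, which matches the "one scan" property claimed for DBSCAN.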
6
Limitations of DBSCAN in clustering web documents
DBSCAN performs clustering using a fixed threshold value to determine "dense" regions in the document space. As a result, the algorithm often cannot distinguish between dense and loose points, and the entire document space is often lumped into a single cluster.
7
RDBC algorithm (recursive density-based clustering)
The key difference between RDBC and DBSCAN is that in RDBC the identification of core points is performed separately from the clustering of the individual data points. Different values of ε and Mpts are used in RDBC to identify this core point set, Cset.
8
RDBC algorithm (cont'd)
To avoid connecting too many clusters through "bridges":
Set initial values ε = ε1 and Mpts = Mpts1;
WebPageSet = web_log
RDBC(ε, Mpts, WebPageSet) {
  use ε, Mpts to get the core point set Cset
  if size(Cset) > size(WebPageSet)/2
    { DBSCAN(ε, Mpts, WebPageSet) }
  else
    { ε = ε/2; Mpts = Mpts/4;
      RDBC(ε, Mpts, WebPageSet);
      Collect all other points in (WebPageSet - Cset)
      around clusters found in last step according to ε2 }
}
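One possible reading of the recursion on this slide is sketched below in Python. Several details are assumptions made only for illustration: the slide does not define ε2, so it is passed in as a separate parameter `eps2`; Mpts is kept an integer of at least 1 so that the recursion terminates; and "collect all other points" is interpreted as attaching each still-unclustered point outside Cset to the cluster of its nearest clustered point when that point lies within `eps2`. A compact DBSCAN is included so the block runs on its own.

```python
def neighbors(dist, o, eps):
    """Indices of all points within eps of point o (o included)."""
    return [p for p in range(len(dist)) if dist[o][p] <= eps]

def core_points(dist, eps, min_pts):
    """Cset: points with at least min_pts neighbors within eps."""
    return {o for o in range(len(dist)) if len(neighbors(dist, o, eps)) >= min_pts}

def dbscan(dist, eps, min_pts):
    """Compact DBSCAN over a distance matrix; noise is labeled -1."""
    labels = [None] * len(dist)
    cluster_id = 0
    for o in range(len(dist)):
        if labels[o] is not None:
            continue
        if len(neighbors(dist, o, eps)) < min_pts:
            labels[o] = -1
            continue
        labels[o] = cluster_id
        stack = neighbors(dist, o, eps)
        while stack:
            p = stack.pop()
            if labels[p] in (None, -1):
                was_unvisited = labels[p] is None
                labels[p] = cluster_id
                nbrs = neighbors(dist, p, eps)
                if was_unvisited and len(nbrs) >= min_pts:
                    stack.extend(nbrs)       # p is a core point, keep expanding
        cluster_id += 1
    return labels

def rdbc(dist, eps, min_pts, eps2):
    """Recursive density-based clustering, following the slide's outline."""
    n = len(dist)
    if n == 0:
        return []
    cset = core_points(dist, eps, min_pts)
    if len(cset) > n / 2:
        return dbscan(dist, eps, min_pts)
    # Too few core points: change the parameters as on the slide and recurse.
    labels = rdbc(dist, eps / 2, max(1, min_pts // 4), eps2)
    # Assumed collection step: attach leftover points to a nearby cluster.
    clustered = [p for p in range(n) if labels[p] != -1]
    for p in range(n):
        if labels[p] == -1 and p not in cset and clustered:
            nearest = min(clustered, key=lambda q: dist[p][q])
            if dist[p][nearest] <= eps2:
                labels[p] = labels[nearest]
    return labels

points = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(a - b) for b in points] for a in points]
print(rdbc(dist, eps=3, min_pts=8, eps2=40))  # [0, 0, 0, 1, 1, 1, 1]
```

In the toy run, no point is a core point at the initial parameters, so the function recurses once with ε = 1.5 and Mpts = 2, and the outlier at 50 is then collected into the nearer cluster because `eps2` is large enough.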
9
Construct WebPageSet from web logs
Step 1
Step 2: Delete visits to image files.
Step 3: Extract sessions from the data.
10
Construct WebPageSet (cont'd)
Step 4: Create a distance matrix
1) Determine the size of a moving window, within which URL requests will be regarded as co-occurring.
2) Calculate the co-occurrence count $N_{i,j}$ and the individual counts $N_i$, $N_j$ for each pair of URLs.
11
Construct WebPageSet (cont'd)
Step 4: Create a distance matrix
3) $P(p_i \mid p_j) = N_{i,j} / N_j$
4) Three distance functions
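The counting in sub-steps 1) to 3) can be sketched in Python as below. The slides do not say exactly how occurrences are counted, so this sketch assumes one convention: a window of fixed size slides over each session, N_i is the number of windows containing URL i, and N_i,j is the number of windows containing both URLs. The function names and the sample sessions are hypothetical. The three distance functions are not reproduced on the slide, so the sketch stops at the conditional probability.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sessions, window):
    """Count windows containing each URL (N_i) and each URL pair (N_i,j)."""
    single = Counter()   # N_i
    pair = Counter()     # N_i,j, keyed by an unordered pair of URLs
    for session in sessions:
        last_start = max(len(session) - window, 0)
        for start in range(last_start + 1):
            urls = set(session[start:start + window])
            single.update(urls)
            pair.update(frozenset(p) for p in combinations(sorted(urls), 2))
    return single, pair

def conditional_probability(single, pair, i, j):
    """P(p_i | p_j) = N_i,j / N_j, or 0.0 if URL j was never seen."""
    if single[j] == 0:
        return 0.0
    return pair[frozenset((i, j))] / single[j]

# Two made-up sessions, window of two consecutive requests.
sessions = [["a.html", "b.html", "c.html"], ["a.html", "b.html"]]
single, pair = cooccurrence_counts(sessions, window=2)
print(conditional_probability(single, pair, "b.html", "a.html"))  # 1.0
```

Counting per window keeps N_i,j no larger than N_j, so the ratio stays in [0, 1] and can be read as a probability.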
12
Experimental Validation
13
Conclusions
A new algorithm, RDBC, clusters web documents based only on the log data. Because it changes the parameters intelligently during the recursive process, RDBC can give better clustering results than DBSCAN.
14
Learning to Cluster Web Search Results
Introduction
This algorithm is based on salient phrases extracted from the contents of the documents. It is fast enough to be used for online calculation.
15
Characteristics of clustering web search results
Existing search engines such as Google, Yahoo and MSN often return a long list of search results. Clustering similar search results helps users find relevant results.
16
Clustered Search results
17
Conventional Search results
18
Procedure of the algorithm
Step 1: Search result fetching
Step 2: Document parsing and phrase property calculation
Step 3: Salient phrase ranking
19
Search result fetching
Input a query to a conventional web search engine.
Get the web page of results returned by the engine.
Extract the titles and snippets.
20
Document parsing
Step 1: Cleaning
Stemming (using Porter's algorithm)
Sentence boundary identification
Step 2: Post-processing
Punctuation elimination
Filter out stop words, e.g. 'too', 'are'
Filter out query words
Ex: Microsoft software is available to students.
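The parsing steps above can be sketched in Python as follows, applied to the slide's example sentence. This is a simplified illustration under stated assumptions: stemming is omitted (a Porter stemmer would come from a separate library), sentence boundaries are approximated by splitting on '.', '!' and '?', and `STOP_WORDS` is a tiny hypothetical list, not the one used by the authors.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "too"}

def parse_snippet(text, query):
    """Split text into sentences and return the kept words of each sentence."""
    query_words = {w.lower() for w in query.split()}
    # Sentence boundary identification (approximate).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    parsed = []
    for sentence in sentences:
        # Punctuation elimination: keep only alphanumeric tokens.
        words = re.findall(r"[A-Za-z0-9]+", sentence.lower())
        # Filter out stop words and query words.
        kept = [w for w in words if w not in STOP_WORDS and w not in query_words]
        if kept:
            parsed.append(kept)
    return parsed

print(parse_snippet("Microsoft software is available to students.", "Microsoft"))
# [['software', 'available', 'students']]
```

Candidate phrases would then be formed from the word sequences that remain within each sentence.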
21
Phrase property calculation
Five properties
1. Phrase Frequency / Inverted Document Frequency (TFIDF)
2. Phrase Length: LEN = n, e.g. LEN("big") = 1
22
Phrase property calculation (cont'd)
3. Intra-Cluster Similarity
o: centroid
Here $d_i = \{TFIDF_1, TFIDF_2, \ldots\}$; each component of the vector represents the TFIDF of a phrase.
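The formula for intra-cluster similarity did not survive on this slide. A common definition, assumed here for illustration, takes the centroid o as the mean of the document vectors d_i and the ICS as the average cosine similarity between each d_i and o. The Python sketch below follows that assumption; the function names are illustrative.

```python
import math

def centroid(vectors):
    """Component-wise mean of equally long vectors (the centroid o)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def intra_cluster_similarity(vectors):
    """Average cosine similarity between each document vector and the centroid."""
    o = centroid(vectors)
    return sum(cosine(d, o) for d in vectors) / len(vectors)

# Two made-up TFIDF vectors for documents that contain a candidate phrase.
docs = [[1.0, 0.0], [0.0, 1.0]]
print(intra_cluster_similarity(docs))
```

Under this definition a phrase whose documents all point in a similar direction scores close to 1, while a phrase spread over unrelated documents scores lower.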
23
Phrase property calculation (cont'd)
4. Cluster Entropy
5. Phrase Independence
Ex: three “vectors” has… with some “vectors” be…
24
Learning to rank key phrases
A regression model combines the five properties above into a single salience score for each phrase. Regression is a technique that tries to determine the relationship between two random variables x = (x1, x2, …, xn) and y. Here x = (TFIDF, LEN, ICS, CE, IND).
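As an illustration of how a fitted linear model would turn the five properties into one score, the Python sketch below computes y = bias + Σ w_k · x_k and ranks phrases by it. The weights, the candidate phrases and their feature values are all made up for this example; in the method described above the weights would be learned from labeled training data.

```python
PROPERTIES = ("TFIDF", "LEN", "ICS", "CE", "IND")

def salience(features, weights, bias=0.0):
    """Linear combination of the five phrase properties."""
    return bias + sum(weights[name] * features[name] for name in PROPERTIES)

def rank_phrases(candidates, weights, bias=0.0):
    """Return phrases sorted from most to least salient."""
    return sorted(
        candidates,
        key=lambda phrase: salience(candidates[phrase], weights, bias),
        reverse=True,
    )

# Hypothetical weights and feature values, for illustration only.
weights = {"TFIDF": 0.5, "LEN": 0.25, "ICS": 0.5, "CE": 0.25, "IND": 0.25}
candidates = {
    "big cat": {"TFIDF": 2.0, "LEN": 2, "ICS": 0.5, "CE": 1.0, "IND": 1.0},
    "car":     {"TFIDF": 1.0, "LEN": 1, "ICS": 0.5, "CE": 0.0, "IND": 0.0},
}
print(rank_phrases(candidates, weights))  # ['big cat', 'car']
```

The top-ranked phrases would then serve as cluster names, with each cluster holding the search results that contain the phrase.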
25
Learning to rank key phrases
Three regression models:
Linear Regression
Logistic Regression
Support Vector Regression
26
Evaluation
27
Conclusions
The search result clustering problem is recast as a supervised salient phrase ranking problem. The method generates correct clusters with short names, and could thus improve users' efficiency in browsing search results.
28
Thanks!