Clustering of Web Documents. Jinfeng Chen. Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu and Yuhen Hu, Correlation-based Document Clustering using.


1 Clustering of Web Documents
Jinfeng Chen

2 Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu and Yuhen Hu, Correlation-based Document Clustering using Web Logs, 2001.
Hua-Jun Zeng, Qicai He, Zheng Chen, Wei-Ying Ma and Jinwen Ma, Learning to Cluster Web Search Results.

3 Correlation-based Document Clustering using Web Logs: Introduction
- Web log data is used to construct the clusters.
- Frequent simultaneous visits to two seemingly unrelated documents indicate that they are in fact closely related.
- The base algorithm is DBSCAN, which groups neighboring objects of the database into clusters based on local distance information.

4 DBSCAN
- Does not require the user to pre-specify the number of clusters.
- Needs only one scan through the database.
- Takes two parameters, a radius value ε and a value Mpts:
  ε: the distance measure (radius) of the neighborhood
  Mpts: the minimum number of points that must occur around a dense object

5 DBSCAN algorithm (cont'd)
Algorithm DBSCAN(DB, ε, MinPts)   (MinPts is the same parameter as Mpts)
  for each o in DB do
    if o is not yet assigned to a cluster then
      if o is a core object then
        collect all objects density-reachable from o according to ε and MinPts
        assign them to a new cluster
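
The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the presenters' implementation: the names `region_query`, `dbscan` and `NOISE` are hypothetical, the neighborhood of a point is assumed to include the point itself, and the quadratic neighbor search stands in for whatever index a real system would use.

```python
from collections import deque

NOISE = -1  # label for points that end up in no cluster (assumed convention)

def region_query(points, dist, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

def dbscan(points, dist, eps, min_pts):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, dist, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not a core object; may become a border point later
            continue
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:  # collect everything density-reachable from point i
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, dist, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core object, keep expanding
        cluster += 1
    return labels
```

For example, `dbscan([0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 50.0], lambda a, b: abs(a - b), 1.5, 2)` groups the first three values, then the next three, and leaves the last value as noise.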

6 Limitations of DBSCAN in clustering web documents
- DBSCAN performs clustering with a fixed threshold value to determine "dense" regions in the document space.
- As a result, the algorithm often cannot distinguish between dense and loose points, and the entire document space is often lumped into a single cluster.

7 RDBC algorithm (recursive density-based clustering)
- The key difference between RDBC and DBSCAN is that in RDBC the identification of core points is performed separately from the clustering of the individual data points.
- RDBC uses different values of ε and Mpts to identify this core point set, Cset.

8 RDBC algorithm (cont'd)
To avoid connecting too many clusters through "bridges":
Set initial values ε = ε1 and Mpts = Mpts1
WebPageSet = web_log
RDBC(ε, Mpts, WebPageSet) {
  use ε and Mpts to get the core point set Cset
  if size(Cset) > size(WebPageSet)/2 {
    DBSCAN(ε, Mpts, WebPageSet)
  } else {
    ε = ε/2; Mpts = Mpts/4;
    RDBC(ε, Mpts, WebPageSet);
    collect all other points in (WebPageSet - Cset) around the clusters found in the last step, according to ε2
  }
}
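
The sketch below traces only the parameter schedule of the recursion above: it computes Cset at each level, applies the stopping test, and otherwise halves ε and quarters Mpts. It deliberately leaves out the DBSCAN call and the final collection step. The function names are hypothetical, and two details are assumptions not stated on the slide: Mpts is divided with integer division, and it is never allowed to drop below 1, which also guarantees termination.

```python
def core_point_set(points, dist, eps, min_pts):
    """Points that have at least min_pts points (itself included) within eps."""
    return [p for p in points
            if sum(1 for q in points if dist(p, q) <= eps) >= min_pts]

def rdbc_parameter_levels(points, dist, eps, min_pts):
    """Return (eps, min_pts, cset) for each recursion level, outermost first.

    The last entry is the level at which RDBC would hand over to DBSCAN,
    i.e. where more than half of the points are core points.
    """
    levels = []
    while True:
        cset = core_point_set(points, dist, eps, min_pts)
        levels.append((eps, min_pts, cset))
        # Stopping criterion from the slide, plus an assumed floor on min_pts.
        if len(cset) > len(points) / 2 or min_pts <= 1:
            return levels
        eps, min_pts = eps / 2, max(1, min_pts // 4)
```

With the values `[0, 1, 2, 10, 11, 12, 50]`, absolute difference as the distance, ε = 4.0 and Mpts = 8, the first level finds no core points, and the second level (ε = 2.0, Mpts = 2) makes six of the seven points core points, so the recursion stops there.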

9 Construct WebPageSet from web logs
- Step 1
- Step 2: Delete visits to image files.
- Step 3: Extract sessions from the data.
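
Steps 2 and 3 might look like the sketch below. The record layout `(client, timestamp_seconds, url)`, the list of image extensions, and the 30-minute inactivity timeout used to split sessions are all assumptions made for illustration; the slides do not say how sessions were delimited.

```python
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")  # assumed list of image types
SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session (assumed)

def extract_sessions(records, timeout=SESSION_TIMEOUT):
    """Turn (client, timestamp_seconds, url) records into a list of URL sessions."""
    by_client = {}
    for client, ts, url in records:
        if url.lower().endswith(IMAGE_EXTENSIONS):
            continue  # Step 2: drop requests for image files
        by_client.setdefault(client, []).append((ts, url))

    sessions = []
    for client in sorted(by_client):
        visits = sorted(by_client[client])  # order each client's visits by time
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:
                sessions.append(current)  # Step 3: a long gap starts a new session
                current = []
            current.append(url)
        sessions.append(current)
    return sessions
```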

10 Construct WebPageSet (cont'd)
- Step 4: Create a distance matrix
  1) Determine the size of a moving window, within which URL requests will be regarded as co-occurring.
  2) Calculate the co-occurrence count N_{i,j} and the individual counts N_i and N_j for each pair of URLs.
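
One way to compute these counts is sketched below. The counting convention is an assumption: each session contributes at most 1 to N_i (the URL appears in it) and at most 1 to N_{i,j} (the two URLs appear within `window` consecutive requests of each other), which keeps N_{i,j} no larger than N_i or N_j. The function name and the default window size are hypothetical.

```python
from collections import Counter

def cooccurrence_counts(sessions, window=3):
    """Return (n_single, n_pair) counters over a list of URL sessions.

    n_single[u] is the number of sessions containing URL u.
    n_pair[frozenset((u, v))] is the number of sessions in which u and v
    fall inside the same moving window of `window` consecutive requests.
    """
    n_single = Counter()
    n_pair = Counter()
    for session in sessions:
        n_single.update(set(session))
        pairs = set()
        for k, url in enumerate(session):
            # Pair each request with the requests that follow it inside the window.
            for other in session[k + 1 : k + window]:
                if other != url:
                    pairs.add(frozenset((url, other)))
        n_pair.update(pairs)
    return n_single, n_pair
```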

11 Construct WebPageSet (cont'd)
- Step 4: Create a distance matrix
  3) P(p_i | p_j) = N_{i,j} / N_j
  4) Three distance functions
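
The conditional probability in item 3 is a direct ratio of the counts. The sketch below computes it from two dictionaries of counts whose layout is assumed here (single counts keyed by URL, pair counts keyed by a `frozenset` of two URLs). The transcript does not preserve the three distance functions, so `illustrative_distance` is only a placeholder showing how a probability can be turned into a distance; it should not be read as one of the three.

```python
def conditional_probability(n_pair, n_single, i, j):
    """P(p_i | p_j) = N_ij / N_j, or 0.0 when p_j was never requested."""
    n_j = n_single.get(j, 0)
    if n_j == 0:
        return 0.0
    return n_pair.get(frozenset((i, j)), 0) / n_j

def illustrative_distance(n_pair, n_single, i, j):
    """Placeholder distance (an assumption, not taken from the slides).

    Pages that almost always co-occur get a distance near 0,
    pages that never co-occur get a distance of 1.
    """
    if i == j:
        return 0.0
    p_ij = conditional_probability(n_pair, n_single, i, j)
    p_ji = conditional_probability(n_pair, n_single, j, i)
    return 1.0 - max(p_ij, p_ji)
```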

12 Experimental Validation

13 Conclusions
- A new algorithm for clustering web documents based only on the log data.
- Because it changes the parameters intelligently during the recursive process, RDBC can give clustering results superior to those of DBSCAN.

14 Learning to Cluster Web Search Results: Introduction
- The algorithm is based on salient phrases extracted from the documents' contents.
- It is fast enough to be used in an online calculation engine.

15 Characteristics of clustering web search results
- Existing search engines such as Google, Yahoo and MSN often return a long list of search results.
- Clustering similar search results helps users find relevant results.

16 Clustered Search results

17 Conventional Search results

18 Procedure of the algorithm
- Step 1: Search result fetching
- Step 2: Document parsing and phrase property calculation
- Step 3: Salient phrase ranking

19 Search result fetching
- Input a query to a conventional web search engine.
- Get the web page of results returned by the engine.
- Extract the titles and snippets.

20 Document parsing
- Step 1: Cleaning
  Stemming (using Porter's algorithm)
  Sentence boundary identification
- Step 2: Post-processing
  Punctuation elimination
  Filter out stop words, e.g. 'too', 'are'
  Filter out query words
  Ex: Microsoft software is available to students.
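
A rough sketch of the parsing steps, applied to the example sentence above. It covers sentence splitting, punctuation elimination, stop-word filtering and query-word filtering. Stemming is left out to keep the block dependency-free; a Porter stemmer would be applied to each word in a fuller version. The stop-word list is a tiny assumed sample and `parse_snippet` is a hypothetical name.

```python
import re

STOP_WORDS = {"too", "are", "is", "to", "the", "a", "of"}  # tiny assumed sample

def parse_snippet(text, query):
    """Return one list of remaining words per sentence of the snippet."""
    query_words = set(query.lower().split())
    # Sentence boundary identification: split on runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    parsed = []
    for sentence in sentences:
        # Punctuation elimination: keep only alphanumeric runs, lowercased.
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        parsed.append([w for w in words
                       if w not in STOP_WORDS and w not in query_words])
    return parsed

print(parse_snippet("Microsoft software is available to students.", "microsoft"))
# prints [['software', 'available', 'students']]
```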

21 Phrase property calculation
- Five properties
  1. Phrase Frequency / Inverted Document Frequency
  2. Phrase Length: LEN = n, e.g. LEN("big") = 1
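
The formula for the first property is not preserved in the transcript. The sketch below assumes the common form, the phrase's total frequency multiplied by log(N / number of documents containing it), which may differ in detail from the presenters' definition. Phrases are represented as tuples of tokens and documents as token lists; both representations and the function names are illustrative choices.

```python
import math

def phrase_tfidf(phrase, documents):
    """Assumed TFIDF form: frequency * log(N / documents containing the phrase)."""
    n = len(phrase)
    frequency = 0   # total occurrences of the phrase across all documents
    containing = 0  # number of documents that contain the phrase at least once
    for doc in documents:
        count = sum(1 for k in range(len(doc) - n + 1)
                    if tuple(doc[k:k + n]) == phrase)
        frequency += count
        if count:
            containing += 1
    if containing == 0:
        return 0.0
    return frequency * math.log(len(documents) / containing)

def phrase_length(phrase):
    """LEN: the number of words in the phrase."""
    return len(phrase)
```

With this representation `phrase_length(("big",))` is 1, matching the LEN("big") = 1 example.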

22 Phrase property calculation (cont'd)
  3. Intra-Cluster Similarity
     o: centroid
- Here d_i = {TFIDF_1, TFIDF_2, …}.
- Each component of the vectors represents the TFIDF of a phrase.
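
The formula itself is missing from the transcript. A plausible reading, assumed here, is that intra-cluster similarity is the average cosine similarity between each document vector d_i (for the documents containing the phrase) and their centroid o. The sketch uses plain lists of TFIDF values; the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def intra_cluster_similarity(vectors):
    """Average cosine similarity between each document vector and the centroid."""
    if not vectors:
        return 0.0
    dim = len(vectors[0])
    # Centroid o: component-wise mean of the document vectors.
    centroid = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    return sum(cosine(v, centroid) for v in vectors) / len(vectors)
```

Documents with identical vectors give a value of about 1, while documents that share no phrases give a lower value, so a high score suggests that the phrase selects a tight group of documents.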

23 Phrase property calculation (cont'd)
  4. Cluster Entropy
  5. Phrase Independence
     Ex: three "vectors" has…
         with some "vectors" be…
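
Neither formula survives in the transcript, so the sketch below rests on assumed, entropy-based definitions. Cluster entropy is taken as the entropy of how the documents containing a phrase overlap with the documents of other phrases. Phrase independence is taken as the average entropy of the words immediately to the left and right of the phrase, so a phrase that appears in varied contexts (as in the "vectors" example) scores higher. All names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (natural log) of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c)

def phrase_independence(phrase, documents):
    """Average entropy of the left-context and right-context words (assumed form)."""
    n = len(phrase)
    left, right = Counter(), Counter()
    for doc in documents:
        for k in range(len(doc) - n + 1):
            if tuple(doc[k:k + n]) == phrase:
                if k > 0:
                    left[doc[k - 1]] += 1
                if k + n < len(doc):
                    right[doc[k + n]] += 1
    return (entropy(left.values()) + entropy(right.values())) / 2

def cluster_entropy(docs_with_phrase, docs_by_other_phrase):
    """Entropy of the overlap between D(w) and each other phrase's D(t) (assumed form)."""
    if not docs_with_phrase:
        return 0.0
    total = 0.0
    for docs_t in docs_by_other_phrase.values():
        p = len(docs_with_phrase & docs_t) / len(docs_with_phrase)
        if p > 0:
            total -= p * math.log(p)
    return total
```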

24 Learning to rank key phrases
- A regression model combines the above five properties into a single salience score for each phrase.
- Regression is a technique that tries to determine the relationship between two random variables x = (x1, x2, …, xn) and y.
- Here x = (TFIDF, LEN, ICS, CE, IND).

25 Learning to rank key phrases
- Three regression models
  Linear Regression
  Logistic Regression
  Support Vector Regression
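
Once a regression model has been trained, ranking reduces to scoring each phrase's feature vector and sorting. The sketch below shows that step for the linear and logistic cases only, with the weights passed in by the caller; the training step, the support vector variant, and any actual coefficient values are outside what the slides provide, so every name and number used with this code is hypothetical.

```python
import math

FEATURES = ("TFIDF", "LEN", "ICS", "CE", "IND")  # order of values in each vector

def linear_score(x, weights, bias=0.0):
    """Linear regression output: bias + sum of weight * feature."""
    return bias + sum(w * v for w, v in zip(weights, x))

def logistic_score(x, weights, bias=0.0):
    """Logistic regression output: the linear score squashed into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_score(x, weights, bias)))

def rank_phrases(features_by_phrase, weights, bias=0.0, score=linear_score):
    """Return phrases sorted from most to least salient."""
    return sorted(
        features_by_phrase,
        key=lambda phrase: score(features_by_phrase[phrase], weights, bias),
        reverse=True,
    )
```

Because the logistic function is monotonic, both scoring functions give the same ordering for the same weights; they differ in how the weights are fitted and in the scale of the score.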

26 Evaluation

27 Conclusions
- Recasts the search result clustering problem as a supervised salient phrase ranking problem.
- Generates correct clusters with short names, which could improve users' efficiency in browsing search results.

28 Thanks!

