Clustering of Web pages

Najlah Gali

Web page clustering
Organizing web pages into cohesive groups such that pages in the same cluster are more similar to each other than to those in other clusters.
(Figure: example clusters labeled Entertainment and Fitness.)

Motivation
Why is clustering needed, and where can we use it? Different kinds of applications and domains use clustering. For example:

Web search engines
Finding similar or related web pages.

Web page classification

Query similarity
Two queries that return different web pages within the same cluster can be recognized as similar. Example: Q1 = "ravintola" (Finnish for "restaurant") and Q2 = "lounas" ("lunch") lead to pages in the same cluster, so Q1 ≈ Q2.

How to cluster?
Trivial solutions, such as using the tags specified in the web page, are not perfect.

Clustering components
Web page features: words, phrases, links
Similarity measure: semantic similarity, syntactic similarity
Clustering algorithm: partitional, hierarchical, graph-based

Approaches to cluster web pages
Three approaches exist:
Link based: depends on the link structure between the pages (common neighbor, co-citation)
Text based: depends on the content of the web page
Hyper based: depends on both text and link structure

Link-based clustering: common neighbor
Two web pages are similar if they have neighbors in common. In the example, $\mathrm{Similarity}(a, b) = |O(a) \cap O(b)| = |\{c, d\}| = 2$.
(Figure: link graph over pages a, b, c, d, e, f, with in-links and out-links marked.)
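The common-neighbor count can be sketched in a few lines. This is a minimal illustration, assuming that O(x) denotes the set of pages that x links to; the `out_links` graph below is hypothetical and only chosen so that a and b share the neighbors c and d.

```python
def common_neighbor_similarity(out_links, a, b):
    """Number of pages that both a and b link to: |O(a) ∩ O(b)|."""
    return len(out_links.get(a, set()) & out_links.get(b, set()))


# Hypothetical link graph: page -> set of pages it links to.
out_links = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "f": {"a", "b"},
}

print(common_neighbor_similarity(out_links, "a", "b"))  # 2
```

Pages with no out-links in common get a similarity of 0, which is the weakness listed under the issues of link-based clustering.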

Link-based clustering: co-citation
Two web pages are similar if they are referenced (cited) by the same pages.
(Figure: example citation graphs over pages a to g.)
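A co-citation count can be sketched as follows. The graph is hypothetical (it does not reproduce the slide's figure), and `cocitation_count` is an illustrative helper name.

```python
def cocitation_count(out_links, a, b):
    """Number of pages that cite (link to) both a and b."""
    return sum(1 for targets in out_links.values() if a in targets and b in targets)


# Hypothetical citation graph: citing page -> set of cited pages.
out_links = {
    "e": {"a", "b"},
    "f": {"a", "b", "c"},
    "g": {"a"},
}

print(cocitation_count(out_links, "a", "b"))  # 2 (cited together by e and f)
```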

Co-citation analysis [Larson 1996]
1. Create a collection of pages P1, P2, P3, P4, ...
2. Construct the co-citation frequency matrix.
3. Convert the raw frequencies into a correlation matrix.
4. Apply a multidimensional scaling technique.
5. Apply agglomerative clustering.
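The matrix construction and the conversion to correlations might look roughly like this. It is a sketch under assumptions: the `citing` data is invented, and the correlation between two pages is taken as the Pearson correlation of their co-citation counts with the remaining pages, which is one common choice and not necessarily the exact procedure of the cited work. Multidimensional scaling and clustering are left out.

```python
from itertools import combinations
from math import sqrt


def cocitation_matrix(pages, citing):
    """counts[p][q] = number of citing pages that cite both p and q."""
    counts = {p: {q: 0 for q in pages} for p in pages}
    for cited in citing.values():
        for p, q in combinations(sorted(cited & set(pages)), 2):
            counts[p][q] += 1
            counts[q][p] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def profile_correlation(counts, p, q):
    """Correlate the co-citation profiles of p and q over all other pages."""
    others = [r for r in counts if r not in (p, q)]
    return pearson([counts[p][r] for r in others], [counts[q][r] for r in others])


# Hypothetical data: citing page -> set of collection pages it cites.
pages = ["P1", "P2", "P3", "P4"]
citing = {"x1": {"P1", "P2"}, "x2": {"P1", "P2", "P3"}, "x3": {"P3", "P4"}}

counts = cocitation_matrix(pages, citing)
print(counts["P1"]["P2"])                       # 2
print(profile_correlation(counts, "P1", "P2"))  # 1.0
```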

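Co-citation example, part 1
Collection: P1, P2, P3, P4, P5, P6
Retrieval strategy: for each pair of pages, count the pages that cite both, e.g. |pages citing P1 and P2|, |pages citing P1 and P3|, ...
Co-citation matrix (lower triangle):
        P1    P2    P3    P4    P5
P2     431
P3      19    27
P4     260   247   122
P5      18    31   103    23
P6      13, 110, 234 (column positions unclear)
The co-citation matrix is then converted into a correlation matrix.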

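Co-citation example, part 2
High correlation (P1 and P2), co-citation counts with the other pages:
        P1    P2
P3      19    27
P4     260   247
P5      18    31
P6      13
Low correlation (P3 and P4), co-citation counts with the other pages:
        P3    P4
P1      19   260
P2      27   431
P5     122    23
P6     110    18
Correlation matrix (lower triangle):
        P1    P2    P3    P4    P5
P2    0.95
P3    0.10  0.12
P4    0.69  0.65  0.24
P5    0.05  0.07  0.31  0.03
P6    0.57, 0.85 (column positions unclear)
Correlation matrix → Cluster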

Issues (link-based clustering)
Link-based clustering is useful when a web page lacks text content. However:
Web pages with insufficient in-links or out-links cannot be clustered;
Two web pages might be linked because they share a minor topic;
Links can be noisy (e.g., adverts);
No common links → similarity = 0!

Text-based clustering
Content source: entire text, main content, snippet, keywords
Feature extraction: binary, term frequency (TF), term frequency-inverse document frequency (TF-IDF)
Similarity measure: character-based, token-based
Clustering algorithm: partitional (K-means), hierarchical (agglomerative and divisive)

Content source
Keywords, main content, snippet, entire text.
(Figure: an example page about office equipment and supplies, such as shredders and laminators, with the keywords, main content, snippet, and entire text marked.)

Feature extraction: tokenization and stemming
“Keep your office running smoothly with our wide…”
Tokenize into words: keep, your, office, running, smoothly, with, our, wide
Stem: keep, your, offic, run, smoothli, with, our, wide
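Tokenization and stemming can be sketched as below. The stemmer is a toy: `toy_stem` applies a handful of Porter-like suffix rules that are enough for the slide's example and is not the full Porter algorithm. A real pipeline would normally use an established stemmer.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(word):
    """Apply a few Porter-like suffix rules (illustrative only)."""
    vowels = "aeiou"
    for suffix in ("ing", "ed"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and any(ch in vowels for ch in stem):
            # Undouble a trailing consonant pair: "runn" -> "run", "staff" -> "staf".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "lsz":
                stem = stem[:-1]
            return stem
    if word.endswith("y") and any(ch in vowels for ch in word[:-1]):
        return word[:-1] + "i"  # "smoothly" -> "smoothli"
    if word.endswith("e") and len(word) > 5:
        return word[:-1]  # "office" -> "offic"; short words such as "wide" are kept
    return word


tokens = tokenize("Keep your office running smoothly with our wide")
print([toy_stem(t) for t in tokens])
# ['keep', 'your', 'offic', 'run', 'smoothli', 'with', 'our', 'wide']
```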

Feature extraction: stop-word removal
“Keep your office running smoothly with our wide…”
Remove stop words (e.g., in, on, your, with, at): keep, offic, run, smoothli, wide
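Stop-word removal is a simple filter over the tokens. The `STOP_WORDS` set here is a small illustrative list (the slide's words plus "our" and "and"), not a standard stop list.

```python
# Illustrative stop list; real systems use much longer lists.
STOP_WORDS = {"in", "on", "your", "with", "at", "our", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words, preserving order."""
    return [t for t in tokens if t not in stop_words]


stems = ["keep", "your", "offic", "run", "smoothli", "with", "our", "wide"]
print(remove_stop_words(stems))
# ['keep', 'offic', 'run', 'smoothli', 'wide']
```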

Feature extraction: creation of the feature vector
Page 1: “Keep your office running smoothly with our wide…”
Page 2: “..staffed office, keeping your office clean and staffed”
Bag-of-words: [keep, offic, run, smoothli, wide, staf, clean]
Binary vector: 1 if the word occurs; 0 otherwise
P1 [1, 1, 1, 1, 1, 0, 0]
P2 [1, 1, 0, 0, 0, 1, 1]
TF vector: counts the number of occurrences of a word w in page p
P1 [1, 1, 1, 1, 1, 0, 0]
P2 [1, 2, 0, 0, 0, 2, 1]
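The binary and TF vectors above can be reproduced with a short sketch. The input lists are the two pages after tokenization, stemming, and stop-word removal; the function names are illustrative.

```python
from collections import Counter


def build_vocabulary(pages):
    """Collect the distinct terms of all pages in order of first appearance."""
    vocab = []
    for terms in pages:
        for term in terms:
            if term not in vocab:
                vocab.append(term)
    return vocab


def tf_vector(terms, vocab):
    """Number of occurrences of each vocabulary term in the page."""
    counts = Counter(terms)
    return [counts[term] for term in vocab]


def binary_vector(terms, vocab):
    """1 if the vocabulary term occurs in the page, 0 otherwise."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocab]


page1 = ["keep", "offic", "run", "smoothli", "wide"]
page2 = ["staf", "offic", "keep", "offic", "clean", "staf"]
vocab = build_vocabulary([page1, page2])

print(vocab)                        # ['keep', 'offic', 'run', 'smoothli', 'wide', 'staf', 'clean']
print(binary_vector(page2, vocab))  # [1, 1, 0, 0, 0, 1, 1]
print(tf_vector(page2, vocab))      # [1, 2, 0, 0, 0, 2, 1]
```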

Term frequency-inverse document frequency (TF-IDF)
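The weighting formula from this slide did not survive in the transcript. The sketch below assumes the common variant tf(w, p) × log(N / df(w)), where N is the number of pages and df(w) the number of pages containing w; other variants (smoothing, different log bases) exist.

```python
from math import log


def tfidf_vectors(tf_vectors):
    """Weight each TF entry by log(N / df), assuming that common TF-IDF variant."""
    n_pages = len(tf_vectors)
    n_terms = len(tf_vectors[0])
    # df[j]: number of pages in which term j occurs at least once.
    df = [sum(1 for vec in tf_vectors if vec[j] > 0) for j in range(n_terms)]
    idf = [log(n_pages / d) if d else 0.0 for d in df]
    return [[tf * w for tf, w in zip(vec, idf)] for vec in tf_vectors]


# TF vectors of the two example pages over
# [keep, offic, run, smoothli, wide, staf, clean].
tf = [[1, 1, 1, 1, 1, 0, 0], [1, 2, 0, 0, 0, 2, 1]]
weighted = tfidf_vectors(tf)
print(weighted[1])
```

With only two pages, terms that occur in both ("keep", "offic") get weight 0, while terms unique to one page keep a positive weight.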

Similarity measures
Similarity measures can be divided into four classes:
Character-based: treats strings as sequences of characters and compares them character by character. A single edit (insertion, deletion, substitution) is performed at a time to transform one string into another.
Q-gram: divides strings into substrings of length q. Example: "Machine Learning" with q = 3 gives mac, ach, chi, hin, ine, nel, ele, lea, ear, arn, ...
Token-based: treats strings as sequences of tokens and compares words instead of characters. Example: comparing "Machine Learning" with "Machine Learned" token by token scores 1 if the tokens match and 0 otherwise.
Hybrid: combines character- and token-based measures.
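The q-gram split and the edit-based comparison can be sketched as follows. Removing spaces and lowercasing before splitting is an assumption inferred from the slide's example output; `edit_distance` is the standard Levenshtein distance with unit costs.

```python
def qgrams(text, q=3):
    """Split a string into overlapping substrings of length q (spaces dropped)."""
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]


print(qgrams("Machine Learning")[:10])
# ['mac', 'ach', 'chi', 'hin', 'ine', 'nel', 'ele', 'lea', 'ear', 'arn']
print(edit_distance("Machine Learning", "Machine Learned"))  # 3
```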

Token-based measures

Results
(Figure: results rated on the scale excellent, good, poor.)

K-means
1. Start: select K random pages as centroids.
2. Assign the other pages to the nearest centroid.
3. Calculate new centroids.
4. Converged? If no (N), return to step 2; if yes (Y), stop.
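The K-means loop can be sketched in plain Python. Assumptions: pages are already numeric feature vectors, distance is squared Euclidean, and convergence means that no page changes cluster. The `seed` parameter and the toy data are only there to make the example repeatable.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(vectors, k, max_iter=100, seed=0):
    """Return (labels, centroids) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    # Select K random pages as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every page to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: squared_distance(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:  # converged: no assignment changed
            break
        labels = new_labels
        # Recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids


# Toy 2-D vectors forming two obvious groups.
vectors = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centroids = kmeans(vectors, k=2)
print(labels)
```

The result depends on the random initial centroids, which is why K-means is often run several times with different seeds.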

Clustering algorithms: hierarchical
(Figure: dendrogram over pages a, b, c, d, e, with the merges numbered 1 to 4.)
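Agglomerative (bottom-up) clustering can be sketched as below. The slide does not say which linkage is used; single linkage (distance between the closest members) is assumed here, and the one-dimensional positions of a to e are invented so that the merge order loosely follows a dendrogram of this kind.

```python
def single_linkage(points, distance, num_clusters=1):
    """Repeatedly merge the two closest clusters; return (clusters, merge history)."""
    clusters = [[name] for name in points]
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(distance(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges


# Hypothetical 1-D positions for pages a to e.
points = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 7.0, "e": 20.0}
clusters, merges = single_linkage(points, lambda x, y: abs(x - y))
for step, (left, right, d) in enumerate(merges, 1):
    print(step, left, right, d)
```

Stopping early with `num_clusters=2` cuts the hierarchy into two clusters instead of building the full tree.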

Issues (text-based clustering)
Developed for use on small, static, and homogeneous pages;
Web pages that lack text cannot be clustered.

Hyper-based clustering [Modha and Spangler 2000]
Represent the page as a triple of unit vectors (D, F, B):
D: word frequencies in a page
F: out-links
B: in-links
(Figure: link graph around the page collection Q, with nodes a, c, e, g, h, i, j, k, l, m, n.)

Out-links vector
Bag-of-nodes: pages that are pointed to by at least two pages in Q, here [g, i, j, m].
(Figure: link graph around Q.)

In-links vector
Bag-of-nodes: pages that point to at least two pages in Q, here [e, h, k, c].
(Figure: link graph around Q.)
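Selecting the bag-of-nodes for the out-link and in-link vectors can be sketched as follows. The graph and the page names q1 to q3 are hypothetical and do not reproduce the slide's figure; normalizing the resulting vectors to unit length is left out.

```python
from collections import Counter


def out_link_nodes(links, collection, min_pages=2):
    """Nodes that are pointed to by at least min_pages pages of the collection."""
    counts = Counter(t for page in collection for t in links.get(page, set()))
    return sorted(node for node, c in counts.items() if c >= min_pages)


def in_link_nodes(links, collection, min_pages=2):
    """Nodes that point to at least min_pages pages of the collection."""
    members = set(collection)
    return sorted(src for src, targets in links.items()
                  if len(targets & members) >= min_pages)


def out_link_vector(links, page, nodes):
    """Binary out-link vector of one page over the selected nodes."""
    return [1 if node in links.get(page, set()) else 0 for node in nodes]


# Hypothetical graph: page -> set of pages it links to.
Q = ["q1", "q2", "q3"]
links = {
    "q1": {"g", "i"},
    "q2": {"g", "i", "x"},
    "q3": {"g"},
    "e": {"q1", "q2"},
    "h": {"q3"},
}

print(out_link_nodes(links, Q))                               # ['g', 'i']
print(in_link_nodes(links, Q))                                # ['e']
print(out_link_vector(links, "q3", out_link_nodes(links, Q)))  # [1, 0]
```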

Similarity between two pages
Cosine similarity
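Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch, applied here to the TF vectors of the two example pages from the feature-vector slide:

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


p1 = [1, 1, 1, 1, 1, 0, 0]
p2 = [1, 2, 0, 0, 0, 2, 1]
print(round(cosine_similarity(p1, p2), 3))  # 0.424
```

For vectors that are already unit length, as in the (D, F, B) representation, the cosine reduces to the plain dot product.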

References
Oikonomakou, N., & Vazirgiannis, M. (2009). A review of web document clustering approaches. In Data Mining and Knowledge Discovery Handbook (pp ). Springer US.
Larson, R. R. (1996, October). Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure of cyberspace. In Proceedings of the Annual Meeting-American Society for Information Science (Vol. 33, pp ).
McCain, K. W. (1990). Mapping authors in intellectual space: A technical overview. Journal of the American Society for Information Science, 41(6), 433.
Modha, D. S., & Spangler, W. S. (2000, May). Clustering hypertext with applications to web searching. In Proceedings of the eleventh ACM on Hypertext and hypermedia (pp ). ACM.

Thank you!
