Clustering of Web pages


Clustering of Web pages Najlah Gali 21.3.2017

Web page clustering: organizing web pages into cohesive groups such that pages in the same cluster are more similar to each other than to pages in other clusters. (Figure: example clusters labeled Entertainment and Fitness.)

Motivation: Why is clustering needed, and where can it be used? Many kinds of applications and domains rely on it, for example:

Web search engines Finding similar or related web pages.

Web page classification

Query similarity: two queries that return different web pages within the same cluster can be recognized as similar. Example: Q1: "Ravintola" (Finnish for "restaurant") and Q2: "lounas" (Finnish for "lunch") lead to pages in the same cluster, so Q1 ≈ Q2.

How to cluster? Trivial solutions, such as relying on the tags specified in the web page, are not perfect. (Example figure not recovered.)

Clustering components. Web page features: words, phrases, links. Similarity measure: semantic similarity, syntactic similarity. Clustering algorithm: partitional, hierarchical, graph based.

Approaches to cluster web pages. Three approaches exist. Link based: depends on the link structure between the pages (common neighbor, co-citation). Text based: depends on the content of the web page. Hyper based: depends on both text and link structure.

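Link-based clustering: common neighbor. Two web pages are similar if they have neighbors in common. In the example, $\mathrm{Similarity}(a, b) = |O(a) \cap O(b)| = |\{c, d\}| = 2$, where $O(x)$ is the set of out-link neighbors of page $x$. (Figure: link graph over pages a to f, with in-links and out-links marked.)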
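The common-neighbor measure can be sketched in a few lines of Python. The slide's graph did not survive extraction, so the edge lists below are hypothetical, chosen only so that pages a and b share the neighbors c and d; the function name `common_neighbor_similarity` is likewise an illustration, not something from the original talk.

```python
# Hypothetical out-link sets (assumption: the real edges on the slide are unknown).
out_links = {
    "a": {"c", "d", "f"},
    "b": {"c", "d", "e"},
    "c": set(),
    "d": set(),
    "e": set(),
    "f": set(),
}


def common_neighbor_similarity(out_links, p, q):
    """Return |O(p) ∩ O(q)|: the number of pages that both p and q link to."""
    return len(out_links.get(p, set()) & out_links.get(q, set()))


print(common_neighbor_similarity(out_links, "a", "b"))
```

Pages with no out-links at all always score 0 against every other page, which is one of the weaknesses of link-based clustering.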

Link-based clustering: co-citation. Two web pages are similar if they are referenced (cited) by similar pages. (Figure: two example link graphs over pages a to g.)

Co-citation analysis [Larson 1996]: start → create a collection P1, P2, P3, P4… → construct the co-citation frequency matrix → convert raw frequencies into a correlation matrix → apply a multidimensional scaling technique → apply agglomerative clustering.
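The first two computational steps of this pipeline (co-citation counts, then correlation of co-citation profiles) can be sketched as follows. This is a minimal illustration, not Larson's implementation: the citing pages `s1` to `s4` are hypothetical, Pearson correlation is assumed as the correlation measure, and the multidimensional scaling and clustering steps are left out.

```python
from itertools import combinations
from math import sqrt


def cocitation_matrix(citations, pages):
    """Count, for every pair of collection pages, how many sources cite both.

    `citations` maps each citing page to the set of pages it cites.
    """
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    counts = [[0] * n for _ in range(n)]
    for cited in citations.values():
        in_collection = sorted(p for p in cited if p in index)
        for p, q in combinations(in_collection, 2):
            counts[index[p]][index[q]] += 1
            counts[index[q]][index[p]] += 1
    return counts


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical citing pages and the collection pages each one cites.
citations = {
    "s1": {"P1", "P2", "P3"},
    "s2": {"P1", "P2"},
    "s3": {"P3", "P4"},
    "s4": {"P1", "P2", "P4"},
}
pages = ["P1", "P2", "P3", "P4"]
counts = cocitation_matrix(citations, pages)
print(counts)
```

Two pages whose rows in the count matrix rise and fall together get a correlation near 1 and are candidates for the same cluster.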

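Co-citation example, part 1. Collection → retrieval strategy → co-citation matrix → correlation matrix. Each cell of the co-citation matrix over P1 to P6 holds a count such as |pages that cite P1 and P2| or |pages that cite P1 and P3|. Co-citation values as extracted from the slide (matrix layout lost): 431 19 27 260 247 122 18 31 103 23 13 110 234.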

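Co-citation example, part 2.
High correlation (co-citation profiles of P1 and P2):
      P1    P2
P3    19    27
P4   260   247
P5    18    31
P6    13   (not recovered)
Low correlation (co-citation profiles of P3 and P4):
      P3    P4
P1    19   260
P2    27   431
P5   122    23
P6   110    18
Correlation matrix over P1 to P6, values as extracted from the slide (matrix layout lost): 0.95 0.10 0.12 0.69 0.65 0.24 0.05 0.07 0.31 0.03 0.57 0.85. Highly correlated pages are placed in the same cluster.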

Issues (link-based clustering). It is useful when a web page lacks text content. However: web pages with insufficient in-links or out-links cannot be clustered; two web pages might be linked only because they share a minor topic; links can be noisy (adverts); no common links → similarity = 0!

Text-based clustering. Content source: entire text, main content, snippet, keywords. Feature extraction: binary, term frequency (TF), term frequency-inverse document frequency (TF-IDF). Similarity measure: character-based, token-based. Clustering algorithm: partitional (K-means), hierarchical (agglomerative and divisive).

Content source: keywords, main content, snippet, entire text. (Figure: an office-supplies web page, with text such as "Office Equipment Supplies" and "Shredder laminators", annotated with the four content sources.)

Feature extraction: tokenization and stemming. "Keep your office running smoothly with our wide…" Tokenize into words: Keep, your, office, running, smoothly, with, our, wide. Stem: keep, your, offic, run, smoothli, with, our, wide.

Feature extraction: stop-word removal. "Keep your office running smoothly with our wide…" Remove stop words (e.g. in, on, your, our, with, at): keep, offic, run, smoothli, wide.
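Tokenization and stop-word removal can be sketched with the Python standard library alone. The stop-word list here is a small illustrative assumption, and stemming is deliberately left out: it needs a real stemmer (the stems on the slides, such as "offic" and "smoothli", match what the Porter algorithm produces), so the output below is unstemmed.

```python
import re

# Small illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"in", "on", "your", "our", "with", "at"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


tokens = tokenize("Keep your office running smoothly with our wide…")
print(tokens)
print(remove_stop_words(tokens))
```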

Feature extraction: creation of the feature vector.
Page 1: "Keep your office running smoothly with our wide…"
Page 2: "..staffed office, keeping your office clean and staffed"
Bag-of-words: [keep, offic, run, smoothli, wide, staf, clean]
Binary vector (1 if the word occurs, 0 otherwise): P1 = [1, 1, 1, 1, 1, 0, 0], P2 = [1, 1, 0, 0, 0, 1, 1]
TF vector (number of occurrences of word w in page p): P1 = [1, 1, 1, 1, 1, 0, 0], P2 = [1, 2, 0, 0, 0, 2, 1]
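The binary and TF vectors above can be reproduced with a short sketch. It assumes the pages have already been tokenized, stemmed, and stripped of stop words (the token lists are typed in by hand here), and the helper names are illustrative.

```python
from collections import Counter


def build_vocabulary(pages):
    """Collect distinct tokens in order of first appearance across all pages."""
    vocab = []
    for tokens in pages:
        for t in tokens:
            if t not in vocab:
                vocab.append(t)
    return vocab


def binary_vector(tokens, vocab):
    """1 if the vocabulary word occurs in the page, 0 otherwise."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]


def tf_vector(tokens, vocab):
    """Number of occurrences of each vocabulary word in the page."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


# Preprocessed tokens of the two example pages (entered by hand).
p1 = ["keep", "offic", "run", "smoothli", "wide"]
p2 = ["staf", "offic", "keep", "offic", "clean", "staf"]
vocab = build_vocabulary([p1, p2])

print(vocab)
print(binary_vector(p2, vocab))
print(tf_vector(p2, vocab))
```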

Term frequency-inverse document frequency (TF-IDF)
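The formula on this slide did not survive extraction. As an illustration only, the sketch below uses one common TF-IDF variant, weight(w, p) = tf(w, p) × log(N / df(w)), where N is the number of pages and df(w) the number of pages containing w. Whether the talk used this exact variant (rather than, say, a smoothed or normalized one) is an assumption.

```python
from collections import Counter
from math import log


def tf_idf_vectors(pages, vocab):
    """Weight each term count by log(N / df), one common TF-IDF variant."""
    n = len(pages)
    df = {w: sum(1 for tokens in pages if w in tokens) for w in vocab}
    vectors = []
    for tokens in pages:
        counts = Counter(tokens)
        vectors.append(
            [counts[w] * log(n / df[w]) if df[w] else 0.0 for w in vocab]
        )
    return vectors


vocab = ["keep", "offic", "run", "smoothli", "wide", "staf", "clean"]
pages = [
    ["keep", "offic", "run", "smoothli", "wide"],
    ["staf", "offic", "keep", "offic", "clean", "staf"],
]
vectors = tf_idf_vectors(pages, vocab)
print(vectors)
```

With only two pages, words that occur in both ("keep", "offic") get weight 0 under this variant, because they do not help tell the pages apart.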

Similarity measures. Character-based: treats strings as sequences of characters and compares them character by character; for example, edit distance applies a single edit (insertion, deletion, substitution) at a time to transform one string into another. Q-gram: divides strings into substrings of length q, e.g. "Machine Learning" with q = 3 gives mac, ach, chi, hin, ine, nel, ele, lea, ear, arn ... Token-based: treats strings as sequences of tokens and compares words instead of characters, e.g. for "Machine Learning" versus "Machine Learned" each token pair scores 1 if it matches and 0 otherwise. Hybrid: combines character- and token-based measures.
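Two of the character-level ideas above, q-grams and edit distance, can be sketched as follows. The q-gram helper drops spaces and lowercases the input, which is an assumption inferred from the slide's example (where "ine" is followed by "nel"); the edit distance is the standard Levenshtein dynamic program with unit costs.

```python
def qgrams(text, q=3):
    """Split text into overlapping substrings of length q, ignoring spaces and case."""
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(qgrams("Machine Learning"))
print(edit_distance("Machine Learning", "Machine Learned"))
```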

Token-based measures  
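The formulas on this slide were not recovered, so which token-based measures the talk presented is unknown. As an illustration only, here are two widely used ones, Jaccard and Dice, computed over whitespace-separated lowercase tokens (the tokenization is an assumption).

```python
def _tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the token sets of the two strings."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|) over the token sets of the two strings."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


print(jaccard("Machine Learning", "Machine Learned"))
print(dice("Machine Learning", "Machine Learned"))
```

Because tokens must match exactly, "Learning" and "Learned" contribute nothing here, which is the gap that hybrid measures try to close.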

Results (figure not recovered; legend: excellent, good, poor)

K-means: start → select K random pages as centroids → assign the other pages to the nearest centroid → calculate new centroids → converged? If no (N), reassign the pages and repeat; if yes (Y), stop.
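The flowchart above can be sketched as a small K-means loop over numeric feature vectors. This is a minimal illustration using Euclidean distance; the optional `init` parameter (for reproducible starting centroids), the `max_iter` cap, and the rule of leaving an empty cluster's centroid unchanged are assumptions added here, not part of the slide.

```python
import random
from math import dist


def k_means(points, k, init=None, max_iter=100, seed=0):
    """Cluster points (tuples of numbers) into k groups; return (labels, centroids)."""
    rng = random.Random(seed)
    # Select K random pages as centroids unless explicit ones are supplied.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, 2, init=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```

With random initialization the result can depend on the starting centroids, so in practice the algorithm is often run several times.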

Clustering algorithms: hierarchical. (Figure: points a to e merged in four numbered steps, shown next to the corresponding dendrogram.)
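A bottom-up (agglomerative) hierarchy like the one in the figure can be sketched as below. The point coordinates are hypothetical one-dimensional values, and single linkage (distance between the closest members) is assumed as the merge criterion; the slide does not say which linkage was used.

```python
def agglomerative(points):
    """Single-link agglomerative clustering of named 1-D points.

    Returns the merge history as (cluster_1, cluster_2, distance) tuples.
    """
    clusters = [frozenset([name]) for name in points]
    merges = []

    def link(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(abs(points[a] - points[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        merges.append((set(clusters[i]), set(clusters[j]),
                       link(clusters[i], clusters[j])))
        merged = clusters[i] | clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical positions for five pages a to e.
points = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 5.5, "e": 9.0}
for left, right, d in agglomerative(points):
    print(sorted(left), sorted(right), d)
```

Reading the merge history from first to last merge gives the dendrogram from the leaves up; cutting it at a chosen distance yields a flat clustering.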

Issues (text-based clustering). Developed for use on small, static, and homogeneous pages; web pages that lack text cannot be clustered.

Hyper-based clustering [Modha and Spangler 2000]. Represent the page as a triple of unit vectors (D, F, B), where D: word frequencies in the page; F: out-links; B: in-links. (Figure: link graph around the page set Q, with nodes a, c, e, g, h, i, j, k, l, m, n.)

Out-links vector. Bag of nodes: pages that are pointed to by at least two pages in Q, here [g, i, j, m]. (Figure: link graph around Q with an example out-link vector; layout not recovered.)

In-links vector. Bag of nodes: pages that point to at least two pages in Q, here [e, h, k, c]. (Figure: link graph around Q with an example in-link vector; layout not recovered.)
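The two bags of nodes can be sketched as follows. The link graph is hypothetical (the slide's graph was not recovered), and encoding each page's link vector as a 0/1 indicator over the bag is an assumption made for illustration; the original method may weight and normalize the entries differently.

```python
from collections import Counter


def bag_of_out_nodes(out_links, Q, min_pages=2):
    """Pages that are pointed to by at least `min_pages` pages in Q."""
    counts = Counter(t for p in Q for t in out_links.get(p, set()))
    return sorted(t for t, c in counts.items() if c >= min_pages)


def bag_of_in_nodes(out_links, Q, min_pages=2):
    """Pages that point to at least `min_pages` pages in Q."""
    q = set(Q)
    return sorted(s for s, targets in out_links.items()
                  if len(targets & q) >= min_pages)


def link_vector(page_links, bag):
    """0/1 indicator of which bag nodes appear among a page's links."""
    return [1 if node in page_links else 0 for node in bag]


# Hypothetical graph: each page maps to the set of pages it links to.
out_links = {
    "a": {"g", "i"},
    "c": {"g", "j"},
    "e": {"a", "c"},
    "h": {"a"},
}
Q = ["a", "c"]

out_bag = bag_of_out_nodes(out_links, Q)
in_bag = bag_of_in_nodes(out_links, Q)
print(out_bag, link_vector(out_links["a"], out_bag))
print(in_bag)
```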

Similarity between two pages   Cosine similarity
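The formula on this slide was not recovered. Cosine similarity itself is standard, so the sketch below shows it on a single pair of vectors; how the three components (D, F, B) are weighted and combined in the hyper-based approach is not shown here and is not assumed.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Binary feature vectors of the two example pages from the feature-extraction slide.
p1 = [1, 1, 1, 1, 1, 0, 0]
p2 = [1, 1, 0, 0, 0, 1, 1]
print(cosine_similarity(p1, p2))
```

When the vectors are already scaled to unit length, as in the (D, F, B) representation, the cosine reduces to a plain dot product.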

References
Oikonomakou, N., & Vazirgiannis, M. (2009). A review of web document clustering approaches. In Data mining and knowledge discovery handbook (pp. 931-948). Springer US.
Larson, R. R. (1996, October). Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure of cyberspace. In Proceedings of the Annual Meeting-American Society for Information Science (Vol. 33, pp. 71-78).
McCain, K. W. (1990). Mapping authors in intellectual space: A technical overview. Journal of the American Society for Information Science, 41(6), 433.
Modha, D. S., & Spangler, W. S. (2000, May). Clustering hypertext with applications to web searching. In Proceedings of the eleventh ACM on Hypertext and hypermedia (pp. 143-152). ACM.

Thank you!