Final project: Web Page Classification. By: Xiaodong Wang, Yanhua Wang, Haitang Wang. University of Cincinnati.




Similar presentations
CS 478 – Tools for Machine Learning and Data Mining Clustering: Distance-based Approaches.

Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.
Albert Gatt Corpora and Statistical Methods Lecture 13.
1 Machine Learning: Lecture 10 Unsupervised Learning (Based on Chapter 9 of Nilsson, N., Introduction to Machine Learning, 1996)
Unsupervised learning
Data Mining Techniques: Clustering
Jean-Eudes Ranvier 17/05/2015Planet Data - Madrid Trustworthiness assessment (on web pages) Task 3.3.
A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng City University of Hong Kong WWW 2007 Session: Similarity Search April.
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
CS347 Review Slides (IR Part II) June 6, 2001 ©Prabhakar Raghavan.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
Creating Concept Hierarchies in a Customer Self-Help System Bob Wall CS /29/05.
1 Text Clustering. 2 Clustering Partition unlabeled examples into disjoint subsets of clusters, such that: –Examples within a cluster are very similar.
Ch 4: Information Retrieval and Text Mining
Learning to Advertise. Introduction Advertising on the Internet = $$$ –Especially search advertising and web page advertising Problem: –Selecting ads.
Information Retrieval Ch Information retrieval Goal: Finding documents Search engines on the world wide web IR system characters Document collection.
Semi-Supervised Clustering Jieping Ye Department of Computer Science and Engineering Arizona State University
Recommender systems Ram Akella November 26 th 2008.
Clustering Unsupervised learning Generating “classes”
Web Page Clustering based on Web Community Extraction Chikayama-Taura Lab. M2 Shim Wonbo.
Unsupervised Learning. CS583, Bing Liu, UIC 2 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate.
 Clustering of Web Documents Jinfeng Chen. Zhong Su, Qiang Yang, HongHiang Zhang, Xiaowei Xu and Yuhen Hu, Correlation- based Document Clustering using.
An Effective Fuzzy Clustering Algorithm for Web Document Classification: A Case Study in Cultural Content Mining Nils Murrugarra.
Text mining.
DATA MINING CLUSTERING K-Means.
Presented by Tienwei Tsai July, 2005
COMPUTER-ASSISTED PLAGIARISM DETECTION PRESENTER: CSCI 6530 STUDENT.
Web Categorization Crawler Mohammed Agabaria Adam Shobash Supervisor: Victor Kulikov Winter 2009/10 Design & Architecture Dec
SCATTER/GATHER : A CLUSTER BASED APPROACH FOR BROWSING LARGE DOCUMENT COLLECTIONS GROUPER : A DYNAMIC CLUSTERING INTERFACE TO WEB SEARCH RESULTS MINAL.
Text Clustering.
Clustering Supervised vs. Unsupervised Learning Examples of clustering in Web IR Characteristics of clustering Clustering algorithms Cluster Labeling 1.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
Clustering Algorithms k-means Hierarchic Agglomerative Clustering (HAC) …. BIRCH Association Rule Hypergraph Partitioning (ARHP) Categorical clustering.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
Amy Dai Machine learning techniques for detecting topics in research papers.
1 Automatic Classification of Bookmarked Web Pages Chris Staff Second Talk February 2007.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Ranking in Information Retrieval Systems Prepared by: Mariam John CSE /23/2006.
Text mining. The Standard Data Mining process Text Mining Machine learning on text data Text Data mining Text analysis Part of Web mining Typical tasks.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
Unsupervised Learning. Supervised learning vs. unsupervised learning.
Chapter 23: Probabilistic Language Models April 13, 2004.
Artificial Intelligence 8. Supervised and unsupervised learning Japan Advanced Institute of Science and Technology (JAIST) Yoshimasa Tsuruoka.
2005/12/021 Fast Image Retrieval Using Low Frequency DCT Coefficients Dept. of Computer Engineering Tatung University Presenter: Yo-Ping Huang ( 黃有評 )
Vector Space Models.
Clustering Clustering is a technique for finding similarity groups in data, called clusters. I.e., it groups data instances that are similar to (near)
Information Retrieval and Organisation Chapter 16 Flat Clustering Dell Zhang Birkbeck, University of London.
Mehdi Ghayoumi MSB rm 132 Ofc hr: Thur, a Machine Learning.
CS 8751 ML & KDDData Clustering1 Clustering Unsupervised learning Generating “classes” Distance/similarity measures Agglomerative methods Divisive methods.
LogTree: A Framework for Generating System Events from Raw Textual Logs Liang Tang and Tao Li School of Computing and Information Sciences Florida International.
Improved Video Categorization from Text Metadata and User Comments ACM SIGIR 2011:Research and development in Information Retrieval - Katja Filippova -
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
Clustering Algorithms Sunida Ratanothayanon. What is Clustering?
Clustering (1) Chapter 7. Outline Introduction Clustering Strategies The Curse of Dimensionality Hierarchical k-means.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Bringing Order to the Web : Automatically Categorizing Search Results Advisor : Dr. Hsu Graduate : Keng-Wei Chang Author : Hao Chen Susan Dumais.
1 Unsupervised Learning from URL Corpora Deepak P*, IBM Research, Bangalore Deepak Khemani, Dept. of CS&E, IIT Madras *Work done while at IIT Madras.
ENHANCING CLUSTERING BLOG DOCUMENTS BY UTILIZING AUTHOR/READER COMMENTS Beibei Li, Shuting Xu, Jun Zhang Department of Computer Science University of Kentucky.
Data Mining and Text Mining. The Standard Data Mining process.
Collection Synthesis Donna Bergmark Cornell Digital Library Research Group March 12, 2002.
Unsupervised Learning: Clustering
Information Organization: Overview
Semi-Supervised Clustering
Clustering of Web pages
Information Retrieval and Web Search
K-means and Hierarchical Clustering
Text Categorization Berlin Chen 2003 Reference:
Semi-Automatic Data-Driven Ontology Construction System
Information Organization: Overview
Presentation transcript:

Final project: Web Page Classification. By: Xiaodong Wang, Yanhua Wang, Haitang Wang. University of Cincinnati

Content
- Problem formulation
- Algorithms
- Implementation
- Results
- Discussion and future work

Problem
The World Wide Web can be clustered into different subsets and labeled accordingly; search engine users can then restrict their keyword searches to these specific subsets. Clustering of web pages can also be used to post-process search results. Efficient clustering of web pages is therefore important:
- Clustering accuracy: feature selection and exploitation of web-specific features
- Fast algorithms

Web clustering
Clustering is based on the similarity between web pages. Clustering can be done in supervised or unsupervised mode. In our project, we focus on unsupervised classification (no sample category labels are provided) and compare the efficiency of algorithms and features for clustering web pages.

Project overview
In this project, a platform for unsupervised clustering is implemented:
- Vector space model: TFIDF weighting (term frequency-inverse document frequency); text, meta information, links, and linked content can be configured as features
- Similarity measures: cosine similarity and Euclidean similarity
- Clustering algorithms: K-means and HAC (Hierarchical Agglomerative Clustering)
For a given link list, clustering accuracy and algorithm efficiency are compared. The platform is implemented in Java and can be extended easily.

User interface

Major functionalities
Web page preprocessing (a minimal sketch of the filtering step follows below):
- Downloading
- Parsing: link, meta, and text extraction
- Filtering of non-content words: stop-word removal and stemming
- Putting the resulting terms into a term pool
Clustering
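
As an illustration of the filtering step, here is a minimal Java sketch of stop-word removal on extracted page text. The class name, the tiny stop-word list, and the commented-out stemming call are illustrative assumptions rather than the project's actual parser code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Preprocessor {
    // A tiny stop-word list for illustration; the real list would be much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "and", "or", "of", "to", "in", "is"));

    // Tokenize extracted page text, drop stop words, and return the term pool.
    public static List<String> termsOf(String pageText) {
        List<String> terms = new ArrayList<>();
        for (String token : pageText.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            // token = stem(token);  // stemming (e.g. a Porter stemmer) omitted in this sketch
            terms.add(token);
        }
        return terms;
    }
}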

Feature selection
First, a naive approach borrowed from the ranking of query results is used:
- All the unique terms (after text extraction and filtering) form the feature terms. That is, if there are 1000 unique terms in total, the vector dimension will be 1000.
- This approach only works for small sets of links.
Then we use all the unique terms appearing in the meta information of web pages as feature terms:
- The dimension can be reduced dramatically.
- For 30 links, the dimension is 2384 for the naive method but is reduced to 408 when using meta terms.
Hyperlink exploitation:
- Links in a web page can also be features.
- The content or meta information of linked web pages can be treated as local content.
A sketch of the vocabulary construction is shown below.
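
A minimal sketch of how the feature vocabulary, and hence the vector dimension, could be built from per-document term lists; the class and method names are hypothetical and not taken from the project code.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FeatureSelector {
    // Naive approach: the feature space is the set of all unique terms across
    // the documents; its size is the vector dimension. Restricting the input
    // term lists to meta-tag terms only shrinks this set.
    public static Set<String> buildVocabulary(List<List<String>> termListsPerDocument) {
        Set<String> vocabulary = new LinkedHashSet<>();
        for (List<String> docTerms : termListsPerDocument) {
            vocabulary.addAll(docTerms);
        }
        return vocabulary;
    }
}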

TFIDF-based vector space model
TFIDF(i,j) = TF(i,j) * IDF(i)
- TF(i,j): the number of times word i occurs in document j
- DF(i): the number of documents in which word i occurs at least once
- IDF(i) can be calculated from the document frequency as IDF(i) = log(N / DF(i)), where N is the total number of documents in the collection.
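
A minimal Java sketch of this weighting, assuming the log-based IDF given above; the class and method names are illustrative.

public class TfIdf {
    // TFIDF(i,j) = TF(i,j) * IDF(i), with IDF(i) = log(N / DF(i)).
    // tf: number of times word i occurs in document j
    // df: number of documents containing word i
    // totalDocs: total number of documents in the collection
    public static double weight(int tf, int df, int totalDocs) {
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        double idf = Math.log((double) totalDocs / df);
        return tf * idf;
    }
}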

Similarity measures
Euclidean similarity: given the vector space defined by all terms, compute the Euclidean distance between each pair of documents and take its reciprocal.
Cosine similarity = numerator / denominator
- Numerator: the inner product of the two document vectors
- Denominator: the product of the Euclidean lengths of the two document vectors
A sketch of both measures follows below.
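
A sketch of both measures in Java. The handling of zero-length vectors and zero distance is an assumption added here; the slides only state that the reciprocal of the Euclidean distance is taken.

public class Similarity {
    // Cosine similarity: inner product divided by the product of vector lengths.
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, sumA = 0.0, sumB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            sumA += a[i] * a[i];
            sumB += b[i] * b[i];
        }
        if (sumA == 0.0 || sumB == 0.0) return 0.0;   // assumption: empty vectors get similarity 0
        return dot / (Math.sqrt(sumA) * Math.sqrt(sumB));
    }

    // Euclidean similarity: reciprocal of the Euclidean distance between the vectors.
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        double distance = Math.sqrt(sum);
        // Identical documents (distance 0) get the maximum similarity here (an assumption).
        return distance == 0.0 ? Double.MAX_VALUE : 1.0 / distance;
    }
}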

Cluster algorithms: Hierarchical Agglomerative Clustering (HAC)
- It starts with every document in its own cluster and successively merges the most similar clusters into groups within which inter-document similarity is high.
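
A simplified HAC sketch in Java that merges clusters until k remain, working from a precomputed document-to-document similarity matrix. Average linkage is an assumption; the slides do not specify which linkage criterion was used.

import java.util.ArrayList;
import java.util.List;

public class Hac {
    // Agglomerative clustering: start with one cluster per document and
    // repeatedly merge the most similar pair until k clusters remain.
    // sim[i][j] is a precomputed similarity between documents i and j (e.g. cosine).
    public static List<List<Integer>> cluster(double[][] sim, int k) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < sim.length; i++) {
            List<Integer> singleton = new ArrayList<>();
            singleton.add(i);
            clusters.add(singleton);
        }
        while (clusters.size() > k) {
            int bestA = 0, bestB = 1;
            double bestSim = Double.NEGATIVE_INFINITY;
            // Find the most similar pair of clusters under average linkage.
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double s = averageLinkage(clusters.get(a), clusters.get(b), sim);
                    if (s > bestSim) { bestSim = s; bestA = a; bestB = b; }
                }
            }
            // Merge the pair; bestB > bestA, so removing bestB keeps bestA valid.
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }

    // Average pairwise similarity between the members of two clusters.
    private static double averageLinkage(List<Integer> a, List<Integer> b, double[][] sim) {
        double sum = 0.0;
        for (int i : a) {
            for (int j : b) {
                sum += sim[i][j];
            }
        }
        return sum / (a.size() * b.size());
    }
}

Merging until exactly k clusters remain is one common stopping rule; cutting the resulting hierarchy at a similarity threshold is another.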

Cluster algorithms: K-means
K-means clustering is a nonhierarchical method:
- The final required number of clusters is chosen in advance.
- Each component in the population is examined and assigned to the cluster whose centroid is at minimum distance.
- The centroid's position is recalculated every time a component is added to the cluster, and this continues until all the components are grouped into the final required number of clusters.
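
A sketch of batch K-means in Java. The slides describe an incremental variant in which the centroid is recalculated after every single assignment; the batch update, the fixed random seed, and the initialization from randomly chosen documents shown here are simplifying assumptions.

import java.util.Arrays;
import java.util.Random;

public class KMeans {
    // Batch K-means: assign every document to its nearest centroid, then
    // recompute all centroids, and repeat until assignments stabilize.
    // docs[i] is the TFIDF vector of document i.
    public static int[] cluster(double[][] docs, int k, int maxIterations) {
        int n = docs.length, dim = docs[0].length;
        Random random = new Random(42);                       // fixed seed: an assumption
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = docs[random.nextInt(n)].clone();   // initialize from random documents
        }
        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            boolean changed = false;
            // Assignment step: nearest centroid by Euclidean distance.
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDistance = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = squaredDistance(docs[i], centroids[c]);
                    if (dist < bestDistance) { bestDistance = dist; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed) break;
            // Update step: each centroid becomes the mean of its assigned documents.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += docs[i][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}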

Complexity analysis
HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n^2). In K-means, each round compares n documents against k centroids, i.e., O(kn) per iteration, which is more efficient than the O(n^2) of HAC. In our experiments, however, we found that the clustering results of HAC make more sense than those of K-means.

Conclusion
The unique features of web pages should be exploited:
- Links and meta information
HAC is better than K-means in clustering accuracy.
Correct and robust parsing of web pages is important for web page clustering:
- Our parser does not work well on all the web pages tested.
The overall performance of our implementation is not yet satisfactory:
- The dimension is still large
- Space requirements are high
- Parsing accuracy is limited, and some pages do not have meta information