An Effective Fuzzy Clustering Algorithm for Web Document Classification: A Case Study in Cultural Content Mining Nils Murrugarra.

Outline 2
Introduction
Document Vector
Clustering Process
Experimental Evaluation
Conclusions

Introduction 3
Web crawler: a program that discovers and downloads documents from the web. Typically it simulates browsing by extracting links from pages, downloading the linked resources, and repeating the process many times.
Focused crawler: starts from a set of given pages and recursively explores the pages they link to. It explores only a small portion of the web, using best-first search.
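A minimal sketch of best-first exploration, assuming a toy in-memory link graph instead of real HTTP requests. The names `focused_crawl`, `TOY_LINKS`, and `TOY_SCORES` are hypothetical, and the relevance scores are made up for illustration; the slides do not say how the crawler ranks pages.

```python
import heapq

def focused_crawl(seeds, links, score, max_pages):
    """Best-first crawl over an in-memory link graph (no network access)."""
    # heapq is a min-heap, so scores are negated to pop the best page first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return visited

# Hypothetical link graph and relevance scores (higher = more cultural).
TOY_LINKS = {
    "seed": ["museum", "sports", "painting"],
    "museum": ["sculpture"],
    "painting": [],
    "sports": ["football"],
}
TOY_SCORES = {"seed": 1.0, "museum": 0.9, "painting": 0.8,
              "sculpture": 0.7, "sports": 0.1, "football": 0.0}

order = focused_crawl(["seed"], TOY_LINKS, TOY_SCORES.get, max_pages=4)
print(order)
```

With a page budget of four, the low-scoring "sports" branch is never visited, which is the sense in which a focused crawler explores only a small portion of the web.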

Introduction 4
Clustering: the assignment of a set of elements (documents) to subsets (clusters) so that elements in the same cluster are similar in some sense.
Purpose: the article introduces a novel focused crawler that extracts and processes cultural data from the web.
First phase: surf the web.
Second phase: separate the web pages into clusters by theme:
- Create a multidimensional document vector for each page
- Calculate the distances between the documents
- Group the documents into clusters

Retrieval of Web Documents and Calculation of Documents Distance Matrix 5

Document Vector 6
Example document: a b a b a c c d d c c d d c c d d c c
Term counts: [3a, 2b, 8c, 6d]
Sorted by frequency: [8c, 6d, 3a, 2b]
Keeping the T = 2 most frequent terms: [8c, 6d]
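The example can be reproduced in a few lines. This is a sketch that assumes whitespace tokenization and ignores any preprocessing (stop words, stemming) the authors may have applied; `document_vector` is a hypothetical helper name.

```python
from collections import Counter

def document_vector(tokens, t):
    """Return the t most frequent terms with their counts, most frequent first."""
    return Counter(tokens).most_common(t)

tokens = "a b a b a c c d d c c d d c c d d c c".split()
full = document_vector(tokens, t=4)   # all four terms, sorted by frequency
dv = document_vector(tokens, t=2)     # truncated to T = 2
print(full)
print(dv)
```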

Document Vectors Distance Matrix 7
Consider two strings S1 = {x1, x2, …, xn} and S2 = {y1, y2, …, yn}. The distance H between them sums, over all terms, the absolute difference of the term counts, as in these examples:
DV1 = [3a, 4b, 2c]
DV2 = [3a, 4b, 8c]
DV3 = [a, b, c]
DV4 = [d, e, f]
H(DV1, DV2) = |3-3| + |4-4| + |2-8| = 6
H(DV3, DV4) = |1-0| + |1-0| + |1-0| + |0-1| + |0-1| + |0-1| = 6
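A short sketch that reproduces the two example values, assuming each document vector is stored as a dictionary from term to count and that a missing term counts as 0. The function name `count_distance` is hypothetical.

```python
def count_distance(dv1, dv2):
    """Sum of absolute count differences over every term in either vector."""
    terms = set(dv1) | set(dv2)
    return sum(abs(dv1.get(t, 0) - dv2.get(t, 0)) for t in terms)

DV1 = {"a": 3, "b": 4, "c": 2}
DV2 = {"a": 3, "b": 4, "c": 8}
DV3 = {"a": 1, "b": 1, "c": 1}
DV4 = {"d": 1, "e": 1, "f": 1}

print(count_distance(DV1, DV2))
print(count_distance(DV3, DV4))
```

Both pairs come out at distance 6 even though DV1 and DV2 share every term while DV3 and DV4 share none, which motivates the weighted variant on the next slide.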

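Document Vectors Distance Matrix 8
The weighted distance WH(S1, S2) multiplies each term's contribution by a weight wi that depends on whether the term occurs in both strings (xi ∈ S2, yi ∈ S1). In the examples below, terms shared by both vectors have weight 0.5 and terms found in only one vector have weight 1:
DV1 = [3a, 4b, 2c]
DV2 = [3a, 4b, 8c]
DV3 = [a, b, c]
DV4 = [d, e, f]
WH(DV1, DV2) = 0.5 * |3-3| + 0.5 * |4-4| + 0.5 * |8-2| = 3
WH(DV3, DV4) = 1 * |1-0| + 1 * |1-0| + 1 * |1-0| + 1 * |0-1| + 1 * |0-1| + 1 * |0-1| = 6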
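The weighted examples can be checked with the sketch below. The weighting rule (0.5 for terms present in both vectors, 1 otherwise) is inferred from the example arithmetic, not taken from the slide's formula, which did not survive in the transcript; `weighted_distance` and its parameter names are hypothetical.

```python
def weighted_distance(dv1, dv2, shared_weight=0.5, unshared_weight=1.0):
    """Weighted sum of absolute count differences.

    Assumption: a term found in both vectors is weighted by shared_weight,
    a term found in only one vector by unshared_weight.
    """
    total = 0.0
    for term in set(dv1) | set(dv2):
        in_both = term in dv1 and term in dv2
        weight = shared_weight if in_both else unshared_weight
        total += weight * abs(dv1.get(term, 0) - dv2.get(term, 0))
    return total

DV1 = {"a": 3, "b": 4, "c": 2}
DV2 = {"a": 3, "b": 4, "c": 8}
DV3 = {"a": 1, "b": 1, "c": 1}
DV4 = {"d": 1, "e": 1, "f": 1}

print(weighted_distance(DV1, DV2))
print(weighted_distance(DV3, DV4))
```

Under this weighting, the pair that shares all its terms is now closer (3) than the pair that shares none (6).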

Clustering Process 9
1. Get the document vectors for all the documents.
2. Calculate the potential of the i-th document vector.
Note: a document vector with a high potential is surrounded by many document vectors.

Clustering Process 10
3. Set n = n + 1.
4. Calculate the maximum potential value, Z_max.
5. Select the document Ds that corresponds to Z_max.
6. Remove from X all documents whose similarity to Ds is greater than β and assign them to the n-th cluster.
7. If X is empty, stop; otherwise go to step 3.
Appealing features:
- It is a very fast procedure and easy to implement.
- There is no random selection of initial clusters.
- The centroids are selected based on the structure of the data set itself.
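The steps above can be sketched as follows. The slide's formulas for potential and similarity are not present in the transcript, so this sketch assumes an exponential form: similarity(i, j) = exp(-α · d(i, j)) and potential(i) = the sum of similarities from i to every document. Treat both as assumptions rather than the authors' definitions. The function name and the toy distance matrix are hypothetical.

```python
import math

def potential_clustering(dist, alpha=0.5, beta=0.5):
    """Cluster documents from a precomputed distance matrix.

    Assumed forms (not confirmed by the slides):
      similarity(i, j) = exp(-alpha * dist[i][j])
      potential(i)     = sum of similarity(i, j) over all documents j
    Returns a list of (center_index, sorted_member_indices).
    """
    n_docs = len(dist)
    sim = [[math.exp(-alpha * dist[i][j]) for j in range(n_docs)]
           for i in range(n_docs)]
    potential = [sum(row) for row in sim]      # step 2
    remaining = set(range(n_docs))             # the set X
    clusters = []
    while remaining:                           # steps 3 to 7
        center = max(sorted(remaining), key=lambda i: potential[i])
        members = sorted(j for j in remaining if sim[center][j] > beta)
        clusters.append((center, members))
        remaining -= set(members)
    return clusters

# Toy distances: documents 0-2 form one group, 3-4 another.
D = [
    [0, 1, 2, 20, 20],
    [1, 0, 1, 20, 20],
    [2, 1, 0, 20, 20],
    [20, 20, 20, 0, 1],
    [20, 20, 20, 1, 0],
]
clusters = potential_clustering(D, alpha=0.5, beta=0.5)
print(clusters)
```

Document 1 has the highest potential because it is close to both 0 and 2, so it becomes the first center without any random initialization. The center always joins its own cluster because its self-similarity is 1, so the loop terminates for any β below 1.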

Clustering Process 11

Clustering Process 12
How to decide the values for α and β?
- Performing simulations for all possible values is time consuming.
- Approach: set α = 0.5 and calculate the best value for β with a validity index.
Validity index, which uses two components:
- Compactness measure: the members of each cluster should be as close to each other as possible.
- Separation measure: the clusters should be well separated.

Clustering Process 13 Compactness Separation
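The compactness and separation formulas from this slide are not present in the transcript. As a stand-in, the sketch below uses a generic ratio-style index: compactness is the mean distance from members to their cluster center, separation is the smallest distance between two centers, and a lower ratio is better. This is an illustrative substitute, not the authors' index; `validity_index` and the toy data are hypothetical.

```python
def validity_index(dist, clusters):
    """Generic compactness / separation ratio (lower is better).

    clusters: list of (center_index, member_indices).
    """
    if len(clusters) < 2:
        raise ValueError("separation needs at least two clusters")
    n_members = sum(len(members) for _, members in clusters)
    compactness = sum(dist[c][m] for c, members in clusters
                      for m in members) / n_members
    centers = [c for c, _ in clusters]
    separation = min(dist[a][b]
                     for i, a in enumerate(centers)
                     for b in centers[i + 1:])
    return compactness / separation

D = [
    [0, 1, 2, 20, 20],
    [1, 0, 1, 20, 20],
    [2, 1, 0, 20, 20],
    [20, 20, 20, 0, 1],
    [20, 20, 20, 1, 0],
]
good = validity_index(D, [(1, [0, 1, 2]), (3, [3, 4])])
bad = validity_index(D, [(1, [0, 1, 3]), (2, [2, 4])])
print(good, bad)
```

To choose β with such an index, one would run the clustering for several candidate β values at α = 0.5 and keep the value whose clustering scores best.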

Experimental Evaluation 14
The evaluation was performed on 1000 web pages. The categories were:
1. Cultural conservation
2. Cultural heritage
3. Painting
4. Sculpture
5. Dancing
6. Cinematography
7. Architecture Museum
8. Archaeology
9. Folklore
10. Music
11. Theatre
12. Cultural Events
13. Audiovisual Arts
14. Graphics Design
15. Art History

Experimental Evaluation 15

Experimental Evaluation 16
Training:
1. Download 1000 web pages.
2. Select the 200 most frequent words.
3. Check whether 20% of their content is cultural terms.
4. For each word, compute a weight from:
   - the frequency of word w in all documents
   - the maximum frequency of any word in all documents
   - the number of documents in the whole collection
   - the number of documents that include word w
   Note: words that appear in the majority of the documents receive less weight.
5. Build the document vectors with T = 30.
6. Create the clusters and obtain their centroids.
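The four quantities listed for the word weight match the inputs of a normalized TF-IDF score, but the exact formula is not in the transcript. The sketch below therefore assumes one common form, (f_w / f_max) · log(N / n_w). The function and parameter names are hypothetical.

```python
import math

def word_weight(freq_w, max_freq, n_docs, n_docs_with_w):
    """Assumed TF-IDF-style weight: normalized frequency times log inverse
    document frequency. Not confirmed as the authors' exact formula."""
    return (freq_w / max_freq) * math.log(n_docs / n_docs_with_w)

# Same raw frequency, different spread across a 1000-document collection.
common = word_weight(freq_w=50, max_freq=100, n_docs=1000, n_docs_with_w=1000)
rare = word_weight(freq_w=50, max_freq=100, n_docs=1000, n_docs_with_w=10)
print(common, rare)
```

A word that occurs in every document gets weight 0 under this form, which is consistent with the note that words appearing in most documents receive less weight.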

Experimental Evaluation 17
Testing:
1. Download the web page.
2. Select the 200 most frequent words.
3. Check whether 20% of their content is cultural terms.
4. For each word, compute its weight.
5. Get the feature vector (FV) with T = 30.
6. Find the minimum distance from the FV to the centroids of each category.
7. Select the category with the minimum distance and assign it to the page.
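The last two test steps amount to nearest-centroid classification. The sketch below assumes feature vectors and centroids are term-to-count dictionaries and uses the unweighted sum of absolute differences as the distance; the slides do not state which distance is used at test time. The category centroids shown are invented for illustration.

```python
def count_distance(dv1, dv2):
    """Sum of absolute count differences over every term in either vector."""
    return sum(abs(dv1.get(t, 0) - dv2.get(t, 0)) for t in set(dv1) | set(dv2))

def assign_category(fv, centroids_by_category):
    """Pick the category whose closest centroid is nearest to fv."""
    best_per_category = {
        category: min(count_distance(fv, c) for c in centroids)
        for category, centroids in centroids_by_category.items()
    }
    return min(best_per_category, key=best_per_category.get)

# Hypothetical centroids; a category may own more than one.
CENTROIDS = {
    "Painting": [{"canvas": 3, "oil": 2}, {"fresco": 4}],
    "Music": [{"concert": 4, "violin": 1}],
}
category = assign_category({"canvas": 2, "oil": 2}, CENTROIDS)
print(category)
```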

Experimental Evaluation 18

Conclusions 19

Questions 20

References 21
1. D. Gavalas and G. Tsekouras (2013). An Effective Fuzzy Clustering Algorithm for Web Document Classification: A Case Study in Cultural Content Mining. International Journal of Software Engineering and Knowledge Engineering, Volume 23, Issue
2. G.E. Tsekouras, C.N. Anagnostopoulos, D. Gavalas, D. Economou (2007). Classification of Web Documents using Fuzzy Logic Categorical Data Clustering. Proceedings of the 4th IFIP Conference on Artificial Intelligence Applications and Innovations (AIAI'2007), Volume 247, pages