Prepared by: Mahmoud Rafeek Al-Farra
College of Science & Technology, Dept. of Computer Science & IT
BSc of Information Technology, Data Mining
Chapter 6_3: Clustering Methods

Course Outline
- Introduction
- Data Preparation and Preprocessing
- Data Representation
- Classification Methods
- Evaluation
- Clustering Methods
- Mid Exam
- Association Rules
- Knowledge Representation
- Special Case Study: Document Clustering
- Discussion of Case Studies by Students

Outline
- Cluster validation
  - Similarity measure
  - Overall similarity
  - Entropy
- Examples of document clustering algorithms
  - Suffix Tree Clustering algorithm
  - DIG for document representation
  - A SOM-based document clustering using phrases
  - Text clustering using SOM on semantic graphs
  - Graph-based Growing Hierarchal Self-Organizing Map

Cluster validation: introduction
- The results of any clustering algorithm should be evaluated using an informative quality measure that reflects the “goodness” of the resulting clusters.
- In addition, such a measure makes it possible to compare different clustering algorithms, since different approaches usually lead to different clusters.

Cluster validation: introduction
- Generally, there are two main types of measures for testing the quality of clusters; which one to use depends on whether we have labeled data or no prior knowledge about the classification of the data objects.

Measure Type             | Example Measure
Internal quality measure | Overall similarity
External quality measure | Entropy and F-measure

Similarity measure
- Similarity in the VSM is based on the distance or angle between two vectors. One of the most widely used distance measures is the family of Minkowski distances, defined as:

  \( d_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \)

- Where:
  - X and Y are the vectors of the two objects
  - i: the feature index; n: the number of features
  - p: assumes values greater than or equal to 1
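To make the definition concrete, here is a minimal Python sketch (not part of the original slides; the function name and the toy vectors are illustrative only) of the Minkowski distance:

```python
# A minimal sketch: Minkowski distance between two feature vectors.
# p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
def minkowski_distance(x, y, p=2):
    assert len(x) == len(y) and p >= 1
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

# Example with two toy term-weight vectors.
print(minkowski_distance([1.0, 0.0, 2.0], [0.0, 1.0, 2.0], p=2))  # ~1.414
```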

Similarity measure
- A more common similarity measure, used specifically in document clustering, is the cosine correlation measure, defined as:

  \( \cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \)

- Where:
  - (·) indicates the vector dot product: \( x \cdot y = \sum_{i=1}^{n} x_i y_i \)
  - ||·|| indicates the length (norm) of the vector: \( \lVert x \rVert = \sqrt{\sum_{i=1}^{n} x_i^2} \)
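A corresponding minimal sketch of the cosine measure, again with illustrative names and toy vectors:

```python
import math

# A minimal sketch: cosine similarity between two document vectors.
def cosine_similarity(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))        # x . y
    norm_x = math.sqrt(sum(xi * xi for xi in x))      # ||x||
    norm_y = math.sqrt(sum(yi * yi for yi in y))      # ||y||
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

print(cosine_similarity([1.0, 0.0, 2.0], [0.0, 1.0, 2.0]))  # 0.8
```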

Overall Similarity
- The overall similarity is an internal measure used for computing the cluster cohesiveness in the absence of any external knowledge.
- It uses the weighted (average) pairwise similarity inside the cluster, as in:

  \( \text{OverallSimilarity}(C_u) = \frac{1}{|C_u|^2} \sum_{O_1, O_2 \in C_u} \text{Sim}(O_1, O_2) \)

- Where:
  - C_u: the cluster under consideration
  - Sim(O_1, O_2): the similarity between the two objects O_1 and O_2, which belong to the cluster C_u
  - |C_u|: the number of documents in the cluster
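A minimal sketch of the overall-similarity computation, assuming it is the average pairwise similarity over all ordered pairs of objects in the cluster and using cosine similarity as Sim; all function names and the example cluster are illustrative:

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

# Intra-cluster (overall) similarity: the average pairwise similarity,
# i.e. (1 / |Cu|^2) * sum over O1, O2 in Cu of Sim(O1, O2).
def overall_similarity(cluster_vectors, sim=cosine_similarity):
    n = len(cluster_vectors)
    if n == 0:
        return 0.0
    return sum(sim(a, b) for a in cluster_vectors for b in cluster_vectors) / (n * n)

cluster = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
print(overall_similarity(cluster))
```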

Entropy Measure
- Entropy is one of the external measures; it provides a measure of “goodness” for un-nested clusters or for the clusters at one level of a hierarchical clustering.
- Using entropy alone, the best score is trivially obtained when each cluster contains exactly one data point.
- The lower the entropy, the better the quality of the clustering; the best possible entropy value is 0.
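A minimal sketch of the entropy computation, assuming the standard definition in which each cluster's entropy is computed over the distribution of true class labels of its members and the total is the size-weighted sum; the labels and names below are illustrative:

```python
import math
from collections import Counter

# For cluster j: E_j = -sum_i p_ij * log(p_ij), where p_ij is the fraction of
# the cluster's members whose true class is i. The total entropy is the
# size-weighted average of E_j over all clusters; lower is better, 0 is best.
def clustering_entropy(clusters):
    """clusters: list of clusters, each given as a list of true class labels."""
    total = sum(len(c) for c in clusters)
    entropy = 0.0
    for cluster in clusters:
        counts = Counter(cluster)
        e_j = -sum((c / len(cluster)) * math.log(c / len(cluster))
                   for c in counts.values())
        entropy += (len(cluster) / total) * e_j
    return entropy

# Example: two clusters with the true labels of their members.
print(clustering_entropy([["sports", "sports", "news"], ["news", "news"]]))
```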

Suffix Tree Clustering algorithm
- The STC method is basically based on using a compact tree structure (a suffix tree) to represent the phrases shared between documents, for example:
  - D1: cat ate cheese
  - D2: mouse ate cheese too
  - D3: cat ate mouse too
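The idea can be illustrated with a simplified sketch that skips the actual suffix-tree data structure: every phrase shared by two or more documents defines a base cluster of the documents containing it (STC then merges overlapping base clusters, which is not shown here). All names are illustrative:

```python
from collections import defaultdict

docs = {
    "D1": "cat ate cheese",
    "D2": "mouse ate cheese too",
    "D3": "cat ate mouse too",
}

# Map every phrase (contiguous word sequence) to the documents containing it.
phrase_to_docs = defaultdict(set)
for doc_id, text in docs.items():
    words = text.split()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            phrase_to_docs[" ".join(words[start:end])].add(doc_id)

# Phrases shared by at least two documents define the base clusters.
base_clusters = {p: sorted(d) for p, d in phrase_to_docs.items() if len(d) > 1}
for phrase, members in sorted(base_clusters.items()):
    print(f"{phrase!r}: {members}")   # e.g. 'cat ate': ['D1', 'D3']
```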

Document Index Graph for clustering (DIG)
- It is based on constructing an incremental cumulative graph that represents the collection of documents, such that each node represents a term and stores all the required information about that term, while the edges represent the relations between terms, capturing the phrases.
- Documents are then clustered incrementally using a histogram-based method that maximizes the tightness of clusters by carefully watching the similarity distribution inside each cluster.
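A minimal sketch in the spirit of DIG, assuming a simplified representation in which nodes are terms and an edge records the documents containing the corresponding two-word phrase; the real DIG stores richer positional information per node and edge, and all names here are illustrative:

```python
from collections import defaultdict

# Simplified cumulative document graph: nodes are terms; a directed edge
# (t1 -> t2) records the documents in which t2 immediately follows t1,
# i.e. the shared two-word phrases.
class DocumentIndexGraph:
    def __init__(self):
        self.nodes = defaultdict(set)   # term -> documents containing it
        self.edges = defaultdict(set)   # (t1, t2) -> documents with phrase "t1 t2"

    def add_document(self, doc_id, text):
        terms = text.lower().split()
        for t in terms:
            self.nodes[t].add(doc_id)
        for t1, t2 in zip(terms, terms[1:]):
            self.edges[(t1, t2)].add(doc_id)

    def shared_phrases(self, doc_a, doc_b):
        """Two-word phrases that appear in both documents."""
        return [edge for edge, ids in self.edges.items() if {doc_a, doc_b} <= ids]

dig = DocumentIndexGraph()
dig.add_document("D1", "cat ate cheese")
dig.add_document("D2", "mouse ate cheese too")
print(dig.shared_phrases("D1", "D2"))   # [('ate', 'cheese')]
```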

A SOM-based document clustering using phrases
- This algorithm represents each document as a vector of phrases instead of single terms.
- Phrases are extracted by a phrase grammar extraction technique based on mutual information.
- The documents are then represented in a phrase vector space and fed as input to a SOM.
- Pipeline: Documents → Phrase Grammar Extraction → Phrase Feature Vectors → SOM.
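A minimal sketch of one possible mutual-information-based phrase extraction, using pointwise mutual information of adjacent word pairs; the actual phrase grammar extraction technique referenced in the slide may differ, and all names and thresholds here are illustrative:

```python
import math
from collections import Counter

# Candidate phrases = bigrams with high pointwise mutual information,
# PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) ).
def extract_phrases(documents, min_pmi=1.0):
    unigrams, bigrams = Counter(), Counter()
    for text in documents:
        words = text.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    phrases = []
    for (w1, w2), count in bigrams.items():
        pmi = math.log((count / n_bi) /
                       ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))
        if pmi >= min_pmi:
            phrases.append((w1 + " " + w2, pmi))
    return sorted(phrases, key=lambda p: -p[1])

docs = ["data mining finds patterns", "text mining is data mining on text"]
print(extract_phrases(docs))
```

The extracted phrases would then replace single terms as the dimensions of the feature vectors that are fed to the SOM.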

Text Clustering using SOM on Semantic Graphs
- The semantic relations are captured by the Universal Networking Language (UNL), which expresses a document in the form of a semantic graph, with disambiguated words as nodes and the semantic relations between them as edges.
- These graphs are then converted into vectors, and a SOM is applied to them as the clustering method.
- Pipeline: Documents → Semantic Graphs (UNL) → Feature Vectors → SOM.

Graph-based Growing Hierarchal Self-Organizing Map
- [Architecture diagram] Pipeline: Web documents → preprocessing into well-structured XML documents → per-document and cumulative document graph representations → similarity measure → SOM with hierarchy growing (GHSOM) → document clusters.

Next: Association Rules

Thanks