
1 Chapter 6_3: Clustering Methods
Prepared by: Mahmoud Rafeek Al-Farra, College of Science & Technology, Dept. of Computer Science & IT, BSc of Information Technology, Data Mining, 2013, www.cst.ps/staff/mfarra

2 Course Outline
 Introduction
 Data Preparation and Preprocessing
 Data Representation
 Classification Methods
 Evaluation
 Clustering Methods
 Mid Exam
 Association Rules
 Knowledge Representation
 Special case study: Document clustering
 Discussion of case studies by students

3 Outline
 Cluster validation
 Similarity measure
 Overall similarity
 Entropy
 Examples of document clustering algorithms
 Suffix Tree Clustering algorithm
 DIG for document representation
 A SOM-based document clustering using phrases
 Text clustering using SOM on semantic graphs
 Graph-based Growing Hierarchical Self-Organizing Map

4 Cluster validation: introduction
 The results of any clustering algorithm should be evaluated using an informative quality measure that reflects the "goodness" of the resulting clusters.
 In addition, such a measure allows us to compare different clustering algorithms, since different approaches usually lead to different clusters.

5 Cluster validation: introduction
 Generally, there are two main kinds of measures for testing the quality of clusters; which of them is used depends on whether we have labeled data or no prior knowledge about the classification of the data objects.
 Internal quality measure, e.g. overall similarity
 External quality measure, e.g. entropy and F-measure

6 Similarity measure
 Similarity in the VSM is based on the distance or angle between two vectors. One of the most widely used distance measures is the family of Minkowski distances, defined as:
 $d_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
 Where:
 X and Y are the vectors of the two objects
 i: the feature index, n: the number of features
 p: assumes values greater than or equal to 1
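A minimal Python sketch of this distance (the sample vectors below are made up for illustration):

```python
import numpy as np

def minkowski_distance(x, y, p=2):
    """Minkowski distance between two feature vectors.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# Two small, hypothetical term-weight vectors
d1 = [0.2, 0.0, 0.5, 0.1]
d2 = [0.1, 0.3, 0.4, 0.0]
print(minkowski_distance(d1, d2, p=1))  # Manhattan distance
print(minkowski_distance(d1, d2, p=2))  # Euclidean distance
```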

7 Similarity measure
 A more common similarity measure, used specifically in document clustering, is the cosine correlation measure, defined as follows:
 $\cos(x, y) = \dfrac{x \cdot y}{\|x\| \, \|y\|}$
 Where $(\cdot)$ indicates the vector dot product and $\|\cdot\|$ indicates the length (norm) of the vector:
 $\|x\| = \sqrt{\sum_{i} x_i^2}$, $\quad x \cdot y = \sum_{i} x_i y_i$
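A corresponding Python sketch of the cosine measure, with the same illustrative vectors as above:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two document vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(np.dot(x, y) / denom) if denom else 0.0

print(cosine_similarity([0.2, 0.0, 0.5, 0.1], [0.1, 0.3, 0.4, 0.0]))
```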

8 Overall Similarity
 The overall similarity is an internal measure used for computing cluster cohesiveness in the absence of any external knowledge.
 The cohesiveness of a single cluster is the average pairwise similarity of its members:
 $\text{cohesiveness}(C_u) = \dfrac{1}{|C_u|^2} \sum_{o_1, o_2 \in C_u} \text{sim}(o_1, o_2)$
 The overall similarity is then the weighted (by cluster size) average of these per-cluster values.
 Where:
 $C_u$: the cluster under consideration
 $\text{sim}(o_1, o_2)$: the similarity between the two objects $o_1$ and $o_2$ belonging to cluster $C_u$
 $|C_u|$: the number of documents in $C_u$
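A sketch of this computation in Python, assuming the cosine measure from the previous slide as the pairwise similarity; the cluster contents are hypothetical:

```python
import itertools
import numpy as np

def cosine(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def cluster_cohesiveness(cluster, sim):
    """Average pairwise similarity over all object pairs inside one cluster."""
    pairs = itertools.product(cluster, repeat=2)
    return sum(sim(o1, o2) for o1, o2 in pairs) / (len(cluster) ** 2)

def overall_similarity(clusters, sim):
    """Per-cluster cohesiveness, weighted by each cluster's share of all documents."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_cohesiveness(c, sim) for c in clusters)

# Hypothetical clusters of tiny document vectors
clusters = [
    [[0.2, 0.0, 0.5], [0.1, 0.1, 0.4]],
    [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
]
print(overall_similarity(clusters, cosine))
```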

9 Entropy Measure
 Entropy is one of the external measures; it provides a measure of "goodness" for un-nested clusters or for the clusters at one level of a hierarchical clustering.
 Using entropy alone, the best score is obtained trivially when each cluster contains exactly one data point.
 The lower the entropy, the better the quality of the clustering; the best quality using entropy is 0.
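A sketch of the entropy computation, assuming each cluster is given as the list of true class labels of its members (the labels below are hypothetical):

```python
import math
from collections import Counter

def cluster_entropy(labels_in_cluster):
    """Entropy of the class-label distribution inside one cluster."""
    counts = Counter(labels_in_cluster)
    total = len(labels_in_cluster)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def total_entropy(clusters):
    """Size-weighted average of the per-cluster entropies."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)

# Each cluster lists the true class labels of its members
clusters = [["sports", "sports", "news"], ["news", "news"]]
print(total_entropy(clusters))  # 0 only when every cluster is pure
```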

10 Suffix Tree Clustering algorithm
 The STC method is based on using a compact tree structure to represent the phrases shared between documents.
 Example documents:
 D1: cat ate cheese
 D2: mouse ate cheese too
 D3: cat ate mouse too
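The sketch below only illustrates the idea of grouping documents by shared phrases for the three example documents; it enumerates word n-grams directly instead of building an actual suffix tree, which is what makes the real STC method efficient:

```python
from collections import defaultdict

docs = {
    "D1": "cat ate cheese",
    "D2": "mouse ate cheese too",
    "D3": "cat ate mouse too",
}

# Map every word n-gram (candidate shared phrase) to the documents containing it
phrase_index = defaultdict(set)
for doc_id, text in docs.items():
    words = text.split()
    for n in range(1, len(words) + 1):
        for i in range(len(words) - n + 1):
            phrase_index[" ".join(words[i:i + n])].add(doc_id)

# "Base clusters": phrases shared by at least two documents
for phrase, ids in sorted(phrase_index.items()):
    if len(ids) > 1:
        print(phrase, "->", sorted(ids))
# e.g. "ate cheese" -> ['D1', 'D2'], "cat ate" -> ['D1', 'D3']
```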

11 Document Index Graph for clustering (DIG)
 It is based on constructing an incremental, cumulative graph representing the collection of documents, such that each node represents a term and stores all the required information about that term, while the edges represent the relations between terms, so that phrases can be represented.
 Documents are then clustered incrementally using a histogram-based method that maximizes the tightness of clusters by carefully watching the similarity distribution inside each cluster.
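A highly simplified sketch of the graph idea: nodes are terms and each edge records the documents in which two terms appear consecutively, so shared phrase fragments between documents can be traced. The class and its methods are illustrative, not the original DIG implementation:

```python
from collections import defaultdict

class DocumentIndexGraph:
    """Minimal sketch: edges map consecutive term pairs to the documents containing them."""

    def __init__(self):
        self.edges = defaultdict(set)  # (term_i, term_j) -> {document ids}

    def add_document(self, doc_id, text):
        """Incrementally add one document's consecutive term pairs to the graph."""
        words = text.lower().split()
        for a, b in zip(words, words[1:]):
            self.edges[(a, b)].add(doc_id)

    def shared_edges(self, doc_a, doc_b):
        """Consecutive term pairs (phrase fragments) shared by two documents."""
        return [e for e, ids in self.edges.items() if {doc_a, doc_b} <= ids]

dig = DocumentIndexGraph()
dig.add_document("D1", "cat ate cheese")
dig.add_document("D2", "mouse ate cheese too")
print(dig.shared_edges("D1", "D2"))  # [('ate', 'cheese')]
```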

12 A SOM-based document clustering using phrases
 This algorithm represents each document as a vector of phrases instead of single terms.
 Phrases are extracted by a phrase grammar extraction technique based on mutual information.
 The documents are then represented in a phrase vector space and given as input to the SOM.
 Pipeline: Documents → Phrase Grammar Extraction → Phrase Feature Vectors → SOM
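A minimal numpy sketch of the SOM step only; the phrase grammar extraction is not shown, and the feature vectors here are random placeholders rather than real phrase vectors:

```python
import numpy as np

def train_som(vectors, grid=(3, 3), epochs=100, lr=0.5, seed=0):
    """Tiny Self-Organizing Map: learns one weight vector per grid cell."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, vectors.shape[1]))
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)],
                      dtype=float).reshape(rows, cols, 2)
    for epoch in range(epochs):
        sigma = max(0.5, (rows / 2) * (1 - epoch / epochs))  # shrinking neighbourhood
        rate = lr * (1 - epoch / epochs) + 0.01               # decaying learning rate
        for v in vectors:
            dists = np.linalg.norm(weights - v, axis=2)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)  # best-matching unit
            grid_dist = np.linalg.norm(coords - np.array(bmu, dtype=float), axis=2)
            h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]
            weights += rate * h * (v - weights)
    return weights

def winner(weights, v):
    """Grid cell whose weight vector is closest to v."""
    dists = np.linalg.norm(weights - v, axis=2)
    return np.unravel_index(np.argmin(dists), dists.shape)

# Hypothetical phrase feature vectors (one row per document)
phrase_vectors = np.random.default_rng(1).random((10, 8))
weights = train_som(phrase_vectors)
print([winner(weights, v) for v in phrase_vectors])  # documents sharing a cell cluster together
```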

13 Text Clustering using SOM on Semantic Graphs
 The semantic relations are captured by the Universal Networking Language (UNL), which expresses a document in the form of a semantic graph, with disambiguated words as nodes and the semantic relations between them as edges.
 These graphs are then converted into vectors, and SOM is applied to them as the clustering method.
 Pipeline: Documents → Semantic Graphs → Feature Vectors → SOM

14 Graph-based Growing Hierarchical Self-Organizing Map
 Pipeline (from the slide's diagram): Web documents → Preprocessing step into well-structured XML documents → Document graph representation (one graph per document plus a cumulative document graph) → Similarity measure → SOM → Growing hierarchical SOM → Document clusters

15 Next:  Association Rules

16 Thanks

