Clustering Documents Using the 3-Gram Graph Representation Model
PCI 2014, 18th Panhellenic Conference in Informatics
3/10/2014
SUPER
SUPER
Hurricane Sandy, 2012: 20 million tweets, 10 Instagram pictures per second.
Virginia, U.S. earthquake, 2011 (5.8 on the Richter scale): 40,000 tweets within the first minute.
Topic Communities
Goal: detect topic communities in social networks.
Available signals: users' texts, the social graph, and user actions (likes, follows).
Text Clustering
Users write texts about their interests, habits, and events in their lives.
Clustering texts into topics therefore clusters their writers into topic communities.
LDA: Latent Dirichlet Allocation
What is its weakness? It is a bag-of-words model.
Sequence of Words
The sequence of words carries valuable information. Furthermore, derivatives of a word are similar words.
We need a representation model that:
keeps the information of the word sequence, and
captures the similarity between derivatives of a word.
N-gram graphs are a good solution.
Overview: Basic Steps
Input: a corpus of texts and the number of clusters k.
1. Build an n-gram graph that represents the whole corpus.
2. Build an n-gram graph that represents each text.
3. Partition the corpus graph into k subgraphs.
4. Compare each text's graph with every partition.
5. Assign each text to the cluster whose partition gives the highest similarity.
Output: k clusters and the texts they contain.
A rough sketch of these steps is given below.
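The following Java sketch shows how the five steps could fit together. Every class, interface, and method name here is a placeholder for illustration; the slides do not show the authors' actual API, and the construction, partitioning, and comparison of n-gram graphs are assumed to exist behind the small interface.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    // Sketch of the five steps above. All names are placeholders, not the
    // authors' implementation.
    public class GraphClusteringSketch {

        // Minimal view of an n-gram graph: it can be split into k subgraphs
        // and compared with another graph (higher value = more similar).
        interface NGramGraph {
            List<NGramGraph> partition(int k);        // step 3
            double similarity(NGramGraph other);      // step 4
        }

        static List<List<String>> cluster(List<String> texts,
                                          Function<String, NGramGraph> textToGraph,
                                          NGramGraph corpusGraph,   // step 1, built elsewhere
                                          int k) {
            List<NGramGraph> partitions = corpusGraph.partition(k); // step 3
            List<List<String>> clusters = new ArrayList<>();
            for (int i = 0; i < k; i++) clusters.add(new ArrayList<>());

            for (String text : texts) {
                NGramGraph textGraph = textToGraph.apply(text);     // step 2
                int best = 0;
                double bestSim = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {                       // step 4
                    double sim = textGraph.similarity(partitions.get(p));
                    if (sim > bestSim) { bestSim = sim; best = p; }
                }
                clusters.get(best).add(text);                       // step 5
            }
            return clusters;
        }
    }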
N-Grams (1)
What is an N-gram? An N-gram is a contiguous sequence of N items from a given text. The items can be phonemes, syllables, letters, or words. In our research we use letters and N = 3.
Example: "home_phone" yields "hom", "ome", "me_", "e_p", "_ph", "pho", "hon", "one".
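A minimal Java sketch of character 3-gram extraction, reproducing the "home_phone" example above (the class and method names are illustrative, not from the paper):

    import java.util.ArrayList;
    import java.util.List;

    public class NGramExtraction {

        // Slide a window of size n one character at a time over the text and
        // collect every contiguous character n-gram.
        static List<String> extract(String text, int n) {
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + n <= text.length(); i++) {
                grams.add(text.substring(i, i + n));
            }
            return grams;
        }

        public static void main(String[] args) {
            // Prints [hom, ome, me_, e_p, _ph, pho, hon, one]
            System.out.println(extract("home_phone", 3));
        }
    }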
N-Grams (2)
N-grams are used in many applications:
approximate string matching,
finding likely candidates for the correct spelling of a misspelled word,
language identification,
species identification from a small DNA sequence.
N-Gram Graph
Nodes are all the n-grams of a text. Edges join only neighboring n-grams; how many edges are added is defined by a neighborhood threshold.
Edges can be weighted or unweighted, directed or undirected.
Example of 3-Gram Graph
The 3-gram graph of "home_phone". In this example the graph is undirected and weighted, each node is a 3-gram, and the neighborhood threshold is 3. A construction sketch follows.
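A sketch of how such an undirected, weighted 3-gram graph could be built in Java, assuming the threshold of 3 means that each 3-gram is connected to the 3-grams starting within the next 3 positions, and that edge weights count co-occurrences (both assumptions; the slides do not spell out the exact weighting scheme):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ThreeGramGraph {

        // Undirected weighted edges, keyed by a canonical "smaller|larger" label.
        final Map<String, Integer> edgeWeights = new HashMap<>();

        // Connect each 3-gram to the 3-grams that start within `window` positions
        // after it; repeated co-occurrences increase the edge weight.
        void addGrams(List<String> grams, int window) {
            for (int i = 0; i < grams.size(); i++) {
                for (int j = i + 1; j <= i + window && j < grams.size(); j++) {
                    String a = grams.get(i);
                    String b = grams.get(j);
                    String key = a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
                    edgeWeights.merge(key, 1, Integer::sum);
                }
            }
        }

        public static void main(String[] args) {
            ThreeGramGraph graph = new ThreeGramGraph();
            graph.addGrams(List.of("hom", "ome", "me_", "e_p", "_ph", "pho", "hon", "one"), 3);
            graph.edgeWeights.forEach((edge, w) -> System.out.println(edge + "  weight " + w));
        }
    }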
Graph Comparison
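One similarity measure commonly used with n-gram graphs is value similarity, which rewards shared edges whose weights are close. The slide does not name the exact measure used in the paper, so the sketch below is an assumed illustration with hypothetical names:

    import java.util.Map;

    public class GraphComparison {

        // Value-similarity-style comparison between two weighted edge maps:
        // each shared edge contributes min(w1, w2) / max(w1, w2), and the sum
        // is normalized by the size of the larger graph. This is an assumed
        // measure, not necessarily the one used in the paper.
        static double valueSimilarity(Map<String, Integer> g1, Map<String, Integer> g2) {
            double sum = 0.0;
            for (Map.Entry<String, Integer> e : g1.entrySet()) {
                Integer w2 = g2.get(e.getKey());
                if (w2 != null) {
                    sum += Math.min(e.getValue(), w2) / (double) Math.max(e.getValue(), w2);
                }
            }
            return sum / Math.max(g1.size(), g2.size());
        }

        public static void main(String[] args) {
            Map<String, Integer> a = Map.of("hom|ome", 1, "ome|me_", 1, "hom|me_", 1);
            Map<String, Integer> b = Map.of("hom|ome", 2, "ome|me_", 1);
            // (0.5 + 1.0) / 3 = 0.5
            System.out.println(valueSimilarity(a, b));
        }
    }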
Graph Partitioning
Split the graph into k subgraphs while minimizing the number of edges between them.
There are many graph partitioning algorithms:
the Kernighan-Lin algorithm,
partitioning by edge betweenness centrality,
the Fast Kernel-based Multilevel Algorithm for graph clustering.
A graph partition can represent a topic.
Fast Kernel-based Multilevel Algorithm
1. Start from a random initial partitioning.
2. For each node i, compute the cost of assigning node i to each cluster.
3. Assign node i to the cluster with the minimum cost.
4. Iterate until no node changes cluster.
A sketch of this reassignment loop is given below.
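The Java sketch below follows steps 1 to 4 with a deliberately simplified cost (minus the total edge weight from node i into each cluster, so a node prefers the cluster it is most strongly connected to). The actual algorithm optimizes a kernel k-means objective; this version only illustrates the iteration structure.

    import java.util.Arrays;
    import java.util.Random;

    public class IterativeReassignment {

        // w is a symmetric weighted adjacency matrix; returns a cluster id per node.
        static int[] partition(double[][] w, int k, long seed) {
            int n = w.length;
            Random rnd = new Random(seed);
            int[] cluster = new int[n];
            for (int i = 0; i < n; i++) cluster[i] = rnd.nextInt(k);   // step 1

            boolean changed = true;
            while (changed) {                                          // step 4
                changed = false;
                for (int i = 0; i < n; i++) {
                    double[] cost = new double[k];                     // step 2
                    for (int j = 0; j < n; j++) {
                        if (j != i) cost[cluster[j]] -= w[i][j];
                    }
                    int best = 0;
                    for (int c = 1; c < k; c++) if (cost[c] < cost[best]) best = c;
                    if (best != cluster[i]) {                          // step 3
                        cluster[i] = best;
                        changed = true;
                    }
                }
            }
            return cluster;
        }

        public static void main(String[] args) {
            double[][] w = {
                {0, 3, 0, 1},
                {3, 0, 0, 0},
                {0, 0, 0, 3},
                {1, 0, 3, 0}
            };
            System.out.println(Arrays.toString(partition(w, 2, 42)));
        }
    }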
Experimental Setup
Reuters-21578: the most widely used test collection for text categorization research.
Data set: 18,457 documents belonging to 428 labels.
Multi-label documents: each document belongs to 0-29 labels.
The complete method was implemented in Java SE.
Experimental Results

                Precision   Recall   F-measure
3-Gram Graph    0.2871      0.2046   0.2419
LDA             0.5758      0.0256   0.0498

3-Gram Graph: recognizes the clusters that contain many documents.
LDA: produces small clusters that look like broken-off parts of the gold-standard clusters.
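For reference, the sketch below applies the standard harmonic-mean F-measure, F = 2PR / (P + R). The slide does not state whether the reported F-measure is computed from the averaged precision and recall or averaged per cluster, so small differences from the table are possible.

    public class FMeasure {

        // Harmonic mean of precision and recall.
        static double f1(double precision, double recall) {
            return 2 * precision * recall / (precision + recall);
        }

        public static void main(String[] args) {
            // Roughly 0.049, close to the LDA row's reported 0.0498.
            System.out.println(f1(0.5758, 0.0256));
        }
    }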
Precision & Recall
Advantages of 3-Gram Graph Clustering
The method captures the sequence of words.
Derivatives of a word are not treated as different words.
Big clusters can be recognized, and more documents can be assigned to them.
It supports partial document matching and soft membership.
It can capture the writing characteristics of a writer.
General Notes
Future Work
Experiment with 4-grams, 5-grams, and 6-grams.
Experiment with various threshold sizes.
Experiment with various graph similarity functions.
Experiment with various graph partitioning algorithms.
Remove stop words.
Filter out edges that do not provide useful information.
Thank you for your attention!
SUPER