Chinese Whispers: an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. Chris Biemann, University of Leipzig, NLP Dept., Leipzig, Germany. June 9, 2006, TextGraphs 06, NYC, USA

Outline: Introduction to Graph Clustering; Chinese Whispers Algorithm; Experiments with Synthetic Data; Applications of CW to Language Separation, POS Clustering and Word Sense Induction; Extensions

Graph Clustering: find groups of nodes in undirected, weighted graphs. Hierarchical Clustering vs. Flat Partitioning. [Figure: small example graph with edge weights 3, 3, 3, 3, 4, 4, 3]

Desired outcomes? Colors symbolise partitions. [Figure: the example graph with nodes coloured by partition]

Chinese Whispers Algorithm
initialize:
  forall vi in V: class(vi) = i;
while changes:
  forall v in V, randomized order:
    class(v) = highest ranked class in neighborhood of v;
Nodes have a class and communicate it to their adjacent nodes. A node adopts one of the majority classes in its neighbourhood. Nodes are processed in random order for some iterations. [Figure: example neighbourhood with nodes A-E, class labels L1-L4, edge weights 5, 8, 6, 3 and node degrees]
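A minimal Python sketch of the update rule above, assuming the graph is given as a dict that maps each node to a dict of weighted neighbours; "highest ranked class" is read here as the class with the largest sum of edge weights in the neighbourhood, and all names are illustrative rather than taken from the original implementation.

import random
from collections import defaultdict

def chinese_whispers(graph, iterations=20, seed=None):
    """graph: dict node -> dict neighbour -> edge weight (undirected).
    Returns a dict node -> class label."""
    rng = random.Random(seed)
    label = {v: v for v in graph}               # initialize: every node is its own class
    for _ in range(iterations):
        nodes = list(graph)
        rng.shuffle(nodes)                      # randomized processing order
        changed = False
        for v in nodes:
            score = defaultdict(float)          # class -> summed edge weight in the neighbourhood
            for u, w in graph[v].items():
                score[label[u]] += w
            if not score:
                continue                        # isolated node keeps its class
            best = max(score, key=score.get)    # ties are broken arbitrarily
            if best != label[v]:
                label[v] = best
                changed = True
        if not changed:
            break                               # no change in a full pass: stop early
    return label

For unweighted graphs all weights can simply be set to 1.0. Because of the randomized order and arbitrary tie-breaking, different runs may produce different partitions, which is the non-determinism discussed on the next slides.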

Example: CW-Partitioning in two steps

Properties of CW
PRO:
Efficiency: CW is time-linear in the number of edges. This is bounded by n² with n = number of nodes, but real-world graphs are much sparser.
Parameter-free: this includes the number of clusters.
CON:
Non-deterministic: due to the randomized processing order and possible ties w.r.t. the majority class.
Does not converge: see the tie example.
However, the CONs are not severe for real-world data. Formally hard to analyse: perform experiments.

Experiment: Bipartite cliques, unweighted. Intuition: bipartite cliques should be split into two cliques. CW can split bipartite cliques into two parts or leave them whole. Measure how often CW succeeds: the larger the graph, the safer the split -> CW is meant for large graphs.
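A hedged sketch of that experiment, reusing the chinese_whispers function from the sketch above; the "bipartite clique" is read here as two n-cliques whose nodes are joined pairwise by a matching, which may differ in detail from the graphs used on the slide, and all helper names are illustrative.

def bipartite_clique(n):
    """Two unweighted n-cliques, joined by a one-to-one matching between them."""
    nodes = [("L", i) for i in range(n)] + [("R", i) for i in range(n)]
    graph = {v: {} for v in nodes}
    def connect(u, v):
        graph[u][v] = graph[v][u] = 1.0
    for side in ("L", "R"):                     # the two n-cliques
        for i in range(n):
            for j in range(i + 1, n):
                connect((side, i), (side, j))
    for i in range(n):                          # matching edges between the cliques
        connect(("L", i), ("R", i))
    return graph

def split_rate(n, runs=100):
    """Fraction of runs in which CW recovers exactly the two cliques."""
    hits = 0
    for r in range(runs):
        label = chinese_whispers(bipartite_clique(n), seed=r)
        left = {label[("L", i)] for i in range(n)}
        right = {label[("R", i)] for i in range(n)}
        if len(left) == 1 and len(right) == 1 and left != right:
            hits += 1
    return hits / runs

Plotting split_rate against n should then show the tendency claimed on the slide: the larger the graph, the safer the split.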

Co-occurrences: a source for graphs. The entirety of all significant co-occurrences forms a co-occurrence graph G(V,E), with V: vertices = words; E: edges (v1, v2, s) with v1, v2 words and s a significance value. The co-occurrence graph is weighted by significance (here: log-likelihood), undirected, and has the small-world property.
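A minimal sketch of how such a graph could be built from tokenized sentences, using Dunning's log-likelihood ratio on sentence-level co-occurrence counts; the significance threshold and all function names are illustrative assumptions, not taken from the slides.

import math
from collections import Counter
from itertools import combinations

def log_likelihood(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of co-occurrence counts."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    def term(k, e):
        return k * math.log(k / e) if k > 0 else 0.0
    return 2.0 * (term(k11, row1 * col1 / n) + term(k12, row1 * col2 / n) +
                  term(k21, row2 * col1 / n) + term(k22, row2 * col2 / n))

def cooccurrence_graph(sentences, min_sig=6.63):
    """sentences: iterable of token lists.  Returns dict word -> {word: significance},
    keeping only pairs above an (illustrative) significance threshold."""
    n = 0
    word_freq, pair_freq = Counter(), Counter()
    for tokens in sentences:
        n += 1
        types = set(tokens)
        word_freq.update(types)
        pair_freq.update(combinations(sorted(types), 2))
    graph = {}
    for (a, b), fab in pair_freq.items():
        fa, fb = word_freq[a], word_freq[b]
        sig = log_likelihood(fab, fa - fab, fb - fab, n - fa - fb + fab)
        if sig >= min_sig:
            graph.setdefault(a, {})[b] = sig
            graph.setdefault(b, {})[a] = sig
    return graph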

Application: Language Separation. Cluster the co-occurrence graph of a multilingual corpus. Use the words of each class as the lexicon of a language identifier. Almost perfect performance.
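A hedged sketch of how the resulting clusters could serve as lexica in a simple word-based language identifier; the function and its majority-vote decision are assumptions for illustration, not a description of the identifier used in the experiments.

from collections import Counter

def identify_language(tokens, word_to_cluster):
    """word_to_cluster: word -> CW cluster id from the multilingual co-occurrence graph.
    Each cluster is treated as the lexicon of one language; the text is assigned
    to the cluster that covers most of its words."""
    votes = Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)
    return votes.most_common(1)[0][0] if votes else None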

Application: Acquisition of POS classes. Distributional similarity: words that co-occur significantly with the same neighbours should be of the same POS. Clustering the second-order NB-co-occurrence graph of the BNC (excluding the 2,000 most frequent words).
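A hedged sketch of one way to build such a second-order graph: link two words by the number of significant neighbour co-occurrences they share. The exact similarity measure, cut-offs and the handling of left vs. right neighbours used on the slide may differ; min_shared and all names here are illustrative.

from itertools import combinations

def second_order_graph(neighbour_sets, min_shared=4):
    """neighbour_sets: dict word -> set of words it significantly co-occurs with
    as a direct neighbour.  Edge weight = number of shared neighbours.
    (Quadratic in the vocabulary; fine for a sketch, not for the full BNC.)"""
    graph = {}
    for a, b in combinations(neighbour_sets, 2):
        shared = len(neighbour_sets[a] & neighbour_sets[b])
        if shared >= min_shared:
            graph.setdefault(a, {})[b] = float(shared)
            graph.setdefault(b, {})[a] = float(shared)
    return graph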

Results: POS clusters. In total: 282 clusters, of which 26 have more than 100 members. Syntacto-semantic motivation. Purity: 88%

Application: Word Sense Induction. Co-occurrence graphs of ambiguous words can be partitioned [Dorow & Widdows 03]: leave out the focus word; the clusters contain context words for disambiguation.
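A minimal sketch of that procedure on top of the co-occurrence graph and chinese_whispers sketches above: take the neighbourhood of the focus word, leave the focus word out, and partition what remains. The function name and return format are illustrative.

def induce_senses(focus, cooc_graph):
    """Partition the co-occurrence neighbourhood of `focus` (focus word removed).
    Returns a list of word sets, one per induced sense."""
    neighbours = set(cooc_graph.get(focus, {}))
    subgraph = {v: {u: w for u, w in cooc_graph[v].items() if u in neighbours}
                for v in neighbours}
    labels = chinese_whispers(subgraph)
    senses = {}
    for word, cls in labels.items():
        senses.setdefault(cls, set()).add(word)
    return list(senses.values())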

Unsupervised WSI Evaluation Framework
Evaluation: for unambiguous words, merge their co-occurrence graphs and try to split them back into the original parts.
Retrieval precision (rP): similarity of the found sense with the gold-standard sense.
Retrieval recall (rR): amount of words that have been correctly assigned to the gold-standard sense.
Precision (P): fraction of correctly found disambiguations.
Recall (R): fraction of correctly found senses.
45 test words of different POS and frequency bands.
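A hedged sketch of the pseudo-word construction behind that evaluation: the co-occurrence neighbourhoods of two unambiguous words are merged under one artificial node, and the WSI task is then to split them apart again. The scoring of rP, rR, P and R is not reproduced here because the slide does not give enough detail; all names are illustrative.

def merge_pseudoword(word_a, word_b, cooc_graph, pseudo=None):
    """Return a copy of cooc_graph in which word_a and word_b are replaced by a
    single pseudo-word whose neighbourhood is the union of both originals."""
    pseudo = pseudo or word_a + "|" + word_b
    merged = {}
    for v, nbrs in cooc_graph.items():
        if v in (word_a, word_b):
            continue
        new_nbrs = {}
        for u, w in nbrs.items():
            key = pseudo if u in (word_a, word_b) else u
            new_nbrs[key] = max(w, new_nbrs.get(key, 0.0))   # keep the stronger edge on a clash
        merged[v] = new_nbrs
    combined = {}
    for source in (word_a, word_b):
        for u, w in cooc_graph.get(source, {}).items():
            if u not in (word_a, word_b):
                combined[u] = max(w, combined.get(u, 0.0))
    merged[pseudo] = combined
    return merged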

Results: WSI. No parameter for the expected number of clusters. CW scores comparable to an algorithm specifically designed for WSI.

hip (four figure slides; graphics not included in the transcript)

Conclusion
Very effective graph partitioning algorithm for weighted, undirected graphs. Possible to process really large graphs. Fuzzy partitioning and hierarchical clustering possible. Especially suited for small-world graphs (sparse adjacency matrix). Useful in NLP applications such as Language Separation, POS clustering, Word Sense Induction. Download a GUI implementation in Java of Chinese Whispers (Open Source) at http://wortschatz.informatik.uni-leipzig.de/~cbiemann/software/CW.html

Questions? THANK YOU

Experiment: Convergence. Weighted graphs converge much faster (fewer ties). For weighted graphs, 15 iterations were enough to partition the 1.7M nodes / 56M edges co-occurrence graph of our main German corpus. Larger graphs result in less uncertainty.

Experiment: Small World Mixtures. CW can separate well if the merge rate is not too high. Different sizes of the original SWs do not pose a problem.

Usages of hip
FIGHT: The punching hip, be it the leading hip of a front punch or the trailing hip of a reverse punch, must swivel forwards, so that your centre-line directly faces the opponent.
MUSIC: This hybrid mix of reggae and hip hop follows acid jazz, Belgian New Beat and acid swing the wholly forgettable contribution of Jive Bunny as the sound to set disco feet tapping.
DANCER: Sitting back and taking it all in is another former hip hop dancer, Moet Lo, who lost his Wall Street messenger job when his firm discovered his penchant for the five-finger discount at Polo stores.
HOORAY: Ho, hey, ho hi, ho, hey, ho, hip hop hooray, funky, get down, a-boogie, get down.
MEDICINE: We treated orthopaedic screening as a distinct category because some neonatal deformations (such as congenital dislocation of the hip) represent only a predisposition to congenital abnormality, and surgery is avoided by conservative treatment.
BODYPART-INJURY: I had a hip replacement operation on my left side, after which I immediately broke my right leg.
BODYPART-CLOTHING: At his hip he wore a pistol in an ancient leather holster.