Correlation Clustering


Correlation Clustering
Shuchi Chawla, Carnegie Mellon University
Joint work with Nikhil Bansal and Avrim Blum

Document Clustering
Given a collection of documents, classify them into salient topics.
Typical characteristics:
- No well-defined "similarity metric"
- Number of clusters is unknown
- No predefined topics; it is desirable to figure them out as part of the algorithm

Research Communities
Given data on research papers, divide researchers into communities by co-authorship.
Typical characteristics:
- How to divide really depends on the given set of researchers
- Fuzzy boundaries

Traditional Approaches to Clustering
- Approximation algorithms: k-means, k-median, k-min-sum
- Matrix methods: spectral clustering
- AI techniques: EM, classification algorithms

Problems with Traditional Approaches
Dependence on an underlying metric:
- Objective functions are meaningless without a metric, e.g. k-means
- Some algorithms work only on specific metrics (such as Euclidean), e.g. spectral methods

Problems with Traditional Approaches
Fixed number of clusters:
- Meaningless without a prespecified number of clusters
- E.g. for k-means or k-median, if k is unspecified, it is best to put every point in its own cluster

Problems with Traditional Approaches
No clean notion of the "quality" of a clustering:
- Objective functions do not directly translate to how many items have been grouped wrongly
- Heuristic approaches
- Objective functions derived from generative models

Cohen, McCallum & Richman's Idea
- "Learn" a similarity measure f(x,y) on documents, giving the amount of similarity between x and y; it may not be a metric!
- Use labeled data to train up this function
- Classify all pairs with the learned function
- Find the "most consistent" clustering (our task)

An Example
Nodes are references such as Harry B., Harry Bovik, H. Bovik, and Tom X.; each pair is labeled '+' (same) or '-' (different).
A consistent clustering has '+' edges inside clusters and '-' edges between clusters.
When no clustering agrees with every label, each violated edge is a disagreement.
Task: find the most consistent clustering, i.e., one with the fewest possible disagreements (equivalently, the maximum possible agreements).

Correlation Clustering
- Given a complete graph, with each edge labeled '+' or '-'
- Our measure of a clustering: how many labels does it agree with?
- The number of clusters depends on the edge labels
- NP-complete; we consider approximations
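As a concrete check of this objective, here is a minimal Python sketch (not from the talk; the function name and edge representation are illustrative) that counts the disagreements of a clustering against '+'/'-' edge labels:

```python
from itertools import combinations

def disagreements(labels, clustering):
    """Count edges whose label the clustering violates.

    labels: dict mapping frozenset({u, v}) -> '+' or '-'
    clustering: dict mapping node -> cluster id
    """
    bad = 0
    for u, v in combinations(sorted(clustering), 2):
        same = clustering[u] == clustering[v]
        sign = labels[frozenset((u, v))]
        # A '+' edge should lie inside a cluster, a '-' edge between clusters.
        if (sign == '+') != same:
            bad += 1
    return bad
```

For example, with labels {a-b: '+', a-c: '-', b-c: '-'}, the clustering {a, b} vs {c} has 0 disagreements, while putting all three nodes in one cluster has 2.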

Compared to Traditional Approaches…
- No need to specify k
- No condition on weights; they can be arbitrary
- Clean notion of the quality of a clustering: the number of examples where the clustering differs from f
- If a good (perfect) clustering exists, it is easy to find

Some Machine Learning Justification
Noise removal:
- There is some true classification function f, but there are a few errors in the data
- We want to find the true function
Agnostic learning:
- There is no inherent clustering
- Try to find the best representation using a hypothesis with limited expressivity

Our Results
- A constant-factor approximation for minimizing disagreements
- A PTAS for maximizing agreements
- Results for the random-noise case

Minimizing Disagreements
Goal: a constant-factor approximation.
Problem: even if we find a cluster as good as one in OPT, we are headed towards a log n approximation (a set-cover-like bound).
Idea: lower bound D_OPT, the number of disagreements made by the optimal clustering.

Lower Bounding Idea: Bad Triangles
A "bad triangle" has two '+' edges and one '-' edge.
Any clustering has to disagree with at least one of these three edges.

Lower Bounding Idea: Bad Triangles
If there are several edge-disjoint bad triangles, then any clustering makes a mistake on each one.
Example: (1,2,3) and (1,4,5) are edge-disjoint bad triangles.
Hence D_OPT ≥ #{edge-disjoint bad triangles}.
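The bound above can be sketched in code. The greedy collection below is my own assumption for picking edge-disjoint bad triangles (the slide does not specify how they are found); it only illustrates that each picked triangle forces at least one mistake:

```python
from itertools import combinations

def bad_triangle_lower_bound(nodes, labels):
    """Greedily collect edge-disjoint bad triangles (two '+' edges, one '-').

    Each collected triangle forces at least one mistake in any clustering,
    so the count lower-bounds the optimal number of disagreements.
    labels: dict mapping frozenset({u, v}) -> '+' or '-'
    """
    used = set()   # edges already committed to a chosen triangle
    count = 0
    for tri in combinations(sorted(nodes), 3):
        edges = [frozenset(pair) for pair in combinations(tri, 2)]
        signs = [labels[e] for e in edges]
        is_bad = signs.count('+') == 2 and signs.count('-') == 1
        if is_bad and not any(e in used for e in edges):
            used.update(edges)
            count += 1
    return count
```

On the slide's example, triangles (1,2,3) and (1,4,5) share only a vertex, not an edge, so the bound is 2.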

Using the Lower Bound
δ-clean cluster: a cluster C in which each node has fewer than δ|C| "bad" edges ('-' edges inside C or '+' edges leaving C).
δ-clean clusters have few bad triangles, and hence few mistakes.
Possible solution: find a δ-clean clustering.
Caveat: it may not exist.
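As an illustration, a checker for the δ-clean condition might look like the sketch below; reading a "bad" edge as a '-' edge inside the cluster or a '+' edge leaving it is my interpretation of the definition, not code from the talk:

```python
def is_delta_clean(cluster, rest, labels, delta):
    """Check that every node in `cluster` has fewer than delta * |cluster|
    bad edges: '-' edges to other nodes in the cluster, plus '+' edges
    to nodes outside it.

    cluster, rest: disjoint collections of nodes
    labels: dict mapping frozenset({u, v}) -> '+' or '-'
    """
    size = len(cluster)
    for u in cluster:
        bad = sum(1 for v in cluster
                  if v != u and labels[frozenset((u, v))] == '-')
        bad += sum(1 for v in rest
                   if labels[frozenset((u, v))] == '+')
        if bad >= delta * size:
            return False
    return True
```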

Using the Lower Bound
Caveat: a δ-clean clustering may not exist.
We show: there exists a clustering whose clusters are either δ-clean or singletons.
Further, it has few mistakes, and its nice structure helps us find it easily.

Maximizing Agreements
It is easy to obtain a 2-approximation:
- If #(positive edges) > #(negative edges), put everything in one cluster
- Otherwise, use n singleton clusters
Either way, we get at least half the edges correct; since the maximum possible score is the total number of edges, this is a 2-approximation!
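The trivial 2-approximation above is easy to write down; a minimal sketch (the function name is mine):

```python
def two_approx_agreements(nodes, labels):
    """Return a clustering agreeing with at least half of all edge labels:
    one big cluster if '+' edges are in the majority, else all singletons.

    labels: dict mapping frozenset({u, v}) -> '+' or '-'
    """
    positives = sum(1 for sign in labels.values() if sign == '+')
    if positives > len(labels) - positives:
        return {u: 0 for u in nodes}               # everything in one cluster
    return {u: i for i, u in enumerate(nodes)}     # n singleton clusters
```

With a majority of '-' labels it returns singletons, which agree with every '-' edge and therefore with at least half of all edges.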

Maximizing Agreements
Maximum possible score ≈ n²/2.
Goal: obtain an additive approximation of εn².
Standard approach: draw a small sample, guess a partition of the sample, then compute a partition of the remainder.
The running time is doubly exponential in 1/ε, or singly exponential with a bad exponent.

Extensions & Open Problems
- Weighted edges or incomplete graphs: recent work by Bartal et al. gives a log-approximation based on multiway cut
- A better constant for the unweighted case: can we use bad triangles (or polygons) more directly for a tighter bound?
- Experimental performance

Other Problems I Have Worked On
- Game theory and mechanism design
- Approximations for Orienteering and related problems
- Online search algorithms based on machine-learning approaches
- Theoretical properties of power-law graphs
- Currently working on privacy with Cynthia

Thanks!

Using the lower bound: δ-clean clusters. Prove that the cost of a δ-clean clustering is ≤ 4·OPT.

Proof outline: a δ-clean clustering costs ≤ 4·OPT, but a δ-clean clustering may not exist! Show that there is a clustering that is close to δ-clean: its clusters are either δ-clean or singletons. Such a clustering exists close to OPT, and we will try to find it.

Existence of OPT(δ): proof.

The algorithm, pictorially: a brief outline of how it finds this clustering.

Bounding the cost: clusters that contain OPT(δ)'s clusters are ¼-clean; the rest have at most as many mistakes as OPT(δ).

Random noise?