Semi-Supervised Learning With Graphs William Cohen

Administrivia
- Last assignment (approx PR / sweep / visualize): 48hr extension for all
- Final proposal: still due Tues 1:30; I've given everyone comments (often brief) and/or questions
- Final project: we're asking everyone to keep in touch
  - We're assigning each project a mentor (soon)
  - Feel free to discuss issues in office hours
  - Send a few bullet points each Tuesday to your mentor outlining progress (or lack of it); these are required but won't be graded

Graph topics so far
Algorithms:
- exact PageRank: with nodes in memory; with nothing in memory
- approximate personalized PageRank (apr) using repeated "push"
- a "sweep" to find a coherent subgraph from a personalized PageRank vector/ranking
- iterative averaging: clamping some nodes, or no clamping
Applications:
- web search, etc.
- semi-supervised learning (the MRW algorithm), etc.
- sampling a graph (with apr)
- SSL
- power iteration clustering

Streaming PageRank
Assume we can store v but not W in memory.
Repeat until converged:
- Let v^(t+1) = c*u + (1-c)*W*v^(t)
Store A as a row matrix: each line is "i j_(i,1) ... j_(i,d)" (node i followed by its d out-neighbors).
Store v' and v in memory; v' starts out as c*u.
For each line "i j_(i,1) ... j_(i,d)":
- For each j in j_(i,1), ..., j_(i,d): v'[j] += (1-c)*v[i]/d
Everything needed for the update is right there in the row.
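A minimal Python sketch of this streaming update, assuming a hypothetical adjacency-list file where each line is node i followed by its out-neighbors; only v and v' are held in memory.

```python
import numpy as np

def streaming_pagerank_pass(graph_path, v, c=0.15, u=None):
    """One update v' = c*u + (1-c)*W*v, streaming adjacency lists from disk.

    Each line of graph_path is assumed to be: i j_1 j_2 ... j_d
    (node i followed by its d out-neighbors). Only v and v' are in memory.
    """
    n = len(v)
    if u is None:
        u = np.full(n, 1.0 / n)          # uniform teleport vector
    v_new = c * u                        # v' starts out as c*u
    with open(graph_path) as f:
        for line in f:
            parts = line.split()
            i, neighbors = int(parts[0]), [int(j) for j in parts[1:]]
            d = len(neighbors)
            for j in neighbors:          # push (1-c)*v[i]/d along each out-edge
                v_new[j] += (1 - c) * v[i] / d
    return v_new
```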

Recap: PageRank algorithm
Repeat until converged:
- Let v^(t+1) = c*u + (1-c)*W*v^(t)
Pure streaming: use a table mapping nodes to degree + PageRank; lines are "i: degree=d, pr=v".
For each edge (i,j):
- Send to i's row in the (degree/PageRank) table: "outlink j"
For each line "i: degree=d, pr=v":
- Send to i: "incrementVBy c"
- For each message "outlink j": send to j: "incrementVBy (1-c)*v/d"
For each line "i: degree=d, pr=v":
- Sum up the incrementVBy messages to compute v'
- Output the new row: "i: degree=d, pr=v'"
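An in-memory stand-in for one such message-passing pass (a real streaming version would sort and stream the messages rather than buffer them in a dict); `rows` and `edges` are hypothetical stand-ins for the node table and edge stream.

```python
from collections import defaultdict

def pure_streaming_pass(rows, edges, c=0.15):
    """One PageRank pass in the slide's message-passing style.
    rows: dict node -> (degree, pr); edges: iterable of (i, j) pairs."""
    incoming = defaultdict(float)
    for i, (d, v) in rows.items():
        incoming[i] += c                     # the slide's "incrementVBy c"
    for i, j in edges:
        d, v = rows[i]
        incoming[j] += (1 - c) * v / d       # "incrementVBy (1-c)*v/d"
    # output the new rows, degrees unchanged
    return {i: (d, incoming[i]) for i, (d, v) in rows.items()}
```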

Recap: Approximate Personalized PageRank
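For concreteness, a sketch of the repeated-"push" computation of approximate personalized PageRank, in the style of Andersen, Chung & Lang (2006); the lecture's exact variant and constants may differ. It maintains an estimate p and a residual r, and pushes any node whose residual per degree exceeds eps.

```python
from collections import defaultdict

def approx_personalized_pagerank(adj, seed, alpha=0.15, eps=1e-4):
    """Push-based approximate PPR sketch.
    adj: dict node -> list of neighbors (no isolated nodes assumed)."""
    p = defaultdict(float)
    r = defaultdict(float)
    r[seed] = 1.0
    frontier = [seed]
    while frontier:
        u = frontier.pop()
        d = len(adj[u])
        if r[u] / d <= eps:
            continue                          # stale frontier entry
        p[u] += alpha * r[u]                  # push: keep alpha of the residual
        share = (1 - alpha) * r[u] / (2 * d)  # spread half the rest to neighbors
        r[u] = (1 - alpha) * r[u] / 2         # lazy walk: keep half at u
        for v in adj[u]:
            old = r[v]
            r[v] += share
            if old / len(adj[v]) <= eps < r[v] / len(adj[v]):
                frontier.append(v)            # v just crossed the threshold
        if r[u] / d > eps:
            frontier.append(u)
    return p, r
```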

Graph topics so far
Algorithms:
- exact PageRank: with nodes in memory; with nothing in memory
- approximate personalized PageRank (apr) using repeated "push"
- a "sweep" to find a coherent subgraph from a personalized PageRank vector/ranking
- iterative averaging: clamping some nodes, or no clamping
Applications:
- web search
- semi-supervised learning (the MRW algorithm), etc.
- sampling a graph (with apr)
- SSL
- power iteration clustering

Learning to Search [SIGIR 2006, CEAS 2006, WebKDD/SNA 2007]
[Figure omitted: a CALO email graph fragment — a "proposal" message from William, dated 6/17/07 and 6/18/07, linked to CMU by typed edges such as Sent-To and Term-In-Subject.]

Q: "what are Jason's aliases?"
[Figure omitted: the query term "Jason" connects through messages (Msg 2, Msg 5, Msg 18) and addresses at andrew.cmu.edu and cs.cmu.edu, via typed edges like Sent-From, Sent-To, Has-Term, Alias-Of, and Similar-To, to the person node Jason Ernst.]
Basic idea: searching personal information is querying a graph for information, and the query is done with personalized PageRank plus filtering the output on a type.

Tasks that are like similarity queries
- Person name disambiguation: [term "andy", file msgId] → "person"
- Threading: what are the adjacent messages in this thread? A proxy for finding "more messages like this one": [file msgId] → "file"
- Alias finding: what are the email addresses of Jason? [term Jason] → "email-address"
- Meeting attendees finder: which email addresses (persons) should I notify about this meeting? [meeting mtgId] → "email-address"
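A sketch of this query-as-PPR-plus-type-filter idea: run personalized PageRank from the query nodes and keep only answers of the requested type. Here `node_type` and the `ppr` argument are hypothetical stand-ins (e.g., the push-based sketch above).

```python
def typed_graph_search(adj, node_type, query_nodes, answer_type, ppr):
    """Rank nodes of type answer_type by summed PPR score from query_nodes.
    node_type: dict node -> type string (hypothetical schema);
    ppr: any routine (adj, seed) -> (score dict, residual)."""
    scores = {}
    for q in query_nodes:
        p, _ = ppr(adj, q)                    # personalized PageRank from q
        for node, s in p.items():
            scores[node] = scores.get(node, 0.0) + s
    # filter the ranking to the requested answer type
    ranked = [(n, s) for n, s in scores.items() if node_type[n] == answer_type]
    return sorted(ranked, key=lambda x: -x[1])
```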

Results: person name disambiguation (Mgmt. game corpus)
[Four result plots omitted.]

Graph topics so far
Algorithms:
- exact PageRank: with nodes in memory; with nothing in memory
- approximate personalized PageRank (apr) using repeated "push"
- a "sweep" to find a coherent subgraph from a personalized PageRank vector/ranking
- iterative averaging: clamping some nodes, or no clamping
Applications:
- web search
- semi-supervised learning (the MRW algorithm), etc.
- sampling a graph (with apr)
- SSL
- power iteration clustering

Semi-Supervised Bootstrapped Learning via Label Propagation
[Figure omitted: a bipartite graph of noun phrases (Paris, San Francisco, Austin, Pittsburgh, Seattle; anxiety, denial, selfishness, arrogance) and extraction patterns ("live in arg1", "mayor of arg1", "traits such as arg1", "arg1 is home of"), with nodes "near" the seeds vs. nodes "far from" the seeds.]
Information from other categories tells you "how far" (when to stop propagating).

RWR — fixpoint of v = c*u + (1-c)*W*v
Seed selection:
1. order by PageRank, degree, or randomly
2. go down the list until you have at least k examples per class
(a sketch of the resulting classifier follows)
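A sketch of the MultiRankWalk-style classifier built on this: one random walk with restart per class, restarting on that class's seeds, with each node labeled by the highest-scoring walk. The dense matrix and fixed iteration count are illustrative choices, not the paper's exact setup.

```python
import numpy as np

def multirankwalk(W, seeds_by_class, c=0.15, iters=50):
    """W: column-stochastic transition matrix (n x n numpy array).
    seeds_by_class: dict class -> list of seed node indices."""
    n = W.shape[0]
    scores = {}
    for cls, seeds in seeds_by_class.items():
        u = np.zeros(n)
        u[seeds] = 1.0 / len(seeds)      # restart on this class's seeds
        v = u.copy()
        for _ in range(iters):           # power iteration to the RWR fixpoint
            v = c * u + (1 - c) * W @ v
        scores[cls] = v
    classes = list(scores)
    mat = np.vstack([scores[cls] for cls in classes])
    return [classes[k] for k in mat.argmax(axis=0)]  # label = best walk
```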

MultiRankWalk vs wvRN/HF/CoEM

Graph topics so far
Algorithms:
- exact PageRank: with nodes in memory; with nothing in memory
- approximate personalized PageRank (apr) using repeated "push"
- a "sweep" to find a coherent subgraph from a personalized PageRank vector/ranking
- iterative averaging: clamping some nodes, or no clamping
Applications:
- web search
- semi-supervised learning (the MRW algorithm)
- sampling a graph (with apr)
- SSL: HF/CoEM/wvRN
- power iteration clustering

Repeated averaging with neighbors on a sample problem…
- Create a graph, connecting all points in the 2-D initial space to all other points, weighted by distance
- Run power iteration for 10 steps
- Plot node id x vs. v^(10)(x), with nodes ordered by actual cluster number
[Plot omitted: the blue, green, and red cluster members already fall into distinct value ranges.]

Repeated averaging with neighbors on a sample problem…
[Plots omitted: the same plot over successive iterations — the spread of values within each cluster keeps shrinking, eventually becoming very small.]

PIC: Power Iteration Clustering
Run power iteration (repeated averaging with neighbors) with early stopping.
- v_0: random start, or the "degree matrix" D, or …
- Easy to implement and efficient
- Very easily parallelized
- Experimentally, often better than traditional spectral methods
- Surprising, since the embedded space is 1-dimensional!
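A compact numpy sketch of PIC as summarized above; the stopping rule (watching the change in per-node "velocity") and the tiny 1-D k-means are simplified stand-ins for the published details.

```python
import numpy as np

def power_iteration_clustering(A, k, max_iters=100, tol=1e-5):
    """PIC sketch: power iteration on W = D^-1 A with early stopping,
    then k-means on the 1-D embedding. A: symmetric affinity matrix."""
    W = A / A.sum(axis=1, keepdims=True)   # row-normalize: W = D^-1 A
    v = A.sum(axis=1) / A.sum()            # degree-based start vector
    prev_delta = None
    for _ in range(max_iters):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()       # keep v at unit L1 norm
        delta = np.abs(v_new - v).max()    # per-step "velocity"
        if prev_delta is not None and abs(prev_delta - delta) < tol:
            break                          # early stop: velocity has flattened
        prev_delta, v = delta, v_new
    # cluster the 1-D embedding v with a tiny k-means
    centers = np.linspace(v.min(), v.max(), k)
    for _ in range(20):
        assign = np.abs(v[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = v[assign == j].mean()
    return assign
```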

Harmonic Functions/CoEM/wvRN
Run iterative averaging — v^(t+1)(i) = the average of v^(t)(j) over i's neighbors j — then replace v^(t+1)(i) with the seed values +1/-1 for labeled data; repeat for 5-10 iterations. Classify data using the final values from v.
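A short sketch of this clamped iterative averaging, assuming binary +1/-1 seeds encoded in a label vector (0 for unlabeled).

```python
import numpy as np

def harmonic_function_ssl(A, labels, iters=10):
    """HF/wvRN/CoEM-style sketch. A: affinity matrix;
    labels: array with +1/-1 on seeds, 0 on unlabeled nodes."""
    W = A / A.sum(axis=1, keepdims=True)   # each row averages over neighbors
    seeds = labels != 0
    v = labels.astype(float)
    for _ in range(iters):                 # the slide suggests 5-10 iterations
        v = W @ v
        v[seeds] = labels[seeds]           # clamp seed values back to +1/-1
    return np.sign(v)                      # classify by final sign
```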

Experiments
"Network" problems (natural graph structure):
- PolBooks: 105 political books, 3 classes, linked by co-purchaser
- UMBCBlog: 404 political blogs, 2 classes, blogroll links
- AGBlog: 1222 political blogs, 2 classes, blogroll links
"Manifold" problems (cosine distance between classification instances):
- Iris: 150 flowers, 3 classes
- PenDigits01 / PenDigits17: 200 handwritten digits each, 2 classes (0 vs. 1, or 1 vs. 7)
- 20ngA: 200 docs, misc.forsale vs soc.religion.christian
- 20ngB: 400 docs, misc.forsale vs soc.religion.christian
- 20ngC: 20ngB plus docs from talk.politics.guns
- 20ngD: 20ngC plus docs from rec.sport.baseball

Experimental results: best-case assignment of class labels to clusters [results table omitted]

Why I'm talking about graphs
Lots of large data is graphs:
- Facebook, Twitter, citation data, and other social networks
- The web, the blogosphere, the semantic web, Freebase, Wikipedia, Twitter, and other information networks
- Text corpora (like RCV1), large datasets with discrete feature values, and other bipartite networks: nodes = documents or words; links connect document → word or word → document
- Computer networks; biological networks (proteins, ecosystems, brains, …); …
- Heterogeneous networks with multiple types of nodes: people, groups, documents

Learning to Search [SIGIR 2006, CEAS 2006, WebKDD/SNA 2007]
[Figure omitted: the CALO email graph fragment again — the "proposal" message from William, dated 6/17/07 and 6/18/07, linked to CMU by typed edges such as Sent-To and Term-In-Subject.]

Simplest Case: Bi-partite Graphs
[Figure omitted: the same CALO email graph fragment, redrawn as a bipartite graph.]

Outline
- Background on spectral clustering
- "Power Iteration Clustering": motivation; experimental results
- Analysis: PIC vs spectral methods
- PIC for sparse bipartite graphs: "lazy" distance computation; "lazy" normalization; experimental results

Motivation: experimental datasets are…
"Network" problems (natural graph structure):
- PolBooks: 105 political books, 3 classes, linked by co-purchaser
- UMBCBlog: 404 political blogs, 2 classes, blogroll links
- AGBlog: 1222 political blogs, 2 classes, blogroll links
- Also: Zachary's karate club, citation networks, …
"Manifold" problems (cosine distance between all pairs of classification instances):
- Iris: 150 flowers, 3 classes
- PenDigits01 / PenDigits17: 200 handwritten digits each, 2 classes (0 vs. 1, or 1 vs. 7)
- 20ngA: 200 docs, misc.forsale vs soc.religion.christian
- 20ngB: 400 docs, misc.forsale vs soc.religion.christian
- …and this gets expensive fast

Spectral Clustering: Graph = Matrix
A*v_1 = v_2 "propagates weights from neighbors"
[Figure omitted: a 10-node example graph (A-J) with its 0/1 adjacency matrix A and a worked product — each entry of v_2 = A*v_1 sums the v_1 values of that node's neighbors, e.g. 2*1 + 3*1 + 0*1 for node A.]

Spectral Clustering: Graph = Matrix
W*v_1 = v_2 "propagates weights from neighbors"
[Figure omitted: the same graph with the normalized matrix W (entries like .3, .5) and a worked product — each entry of v_2 = W*v_1 is a weighted combination of the neighbors' v_1 values.]
W: normalized so columns sum to 1; W = D^-1*A, where D is the degree matrix, D[i,i] = degree(i) (so D^-1[i,i] = 1/degree(i)).
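A tiny numpy check of the "graph = matrix" view; this version row-normalizes, so each new value is the average of the node's neighbors (the slide's columns-sum-to-1 convention is the transpose).

```python
import numpy as np

# Toy 4-node graph: an undirected path A-B-C-D, as an adjacency matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D_inv = np.diag(1.0 / A.sum(axis=1))   # D[i,i] = degree(i), inverted
W = D_inv @ A                          # rows sum to 1
v1 = np.array([3.0, 2.0, 3.0, 1.0])
v2 = W @ v1                            # each entry = average of neighbors' values
print(v2)                              # e.g. B's new value: (3.0 + 3.0) / 2 = 3.0
```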

Lazy computation of distances and normalizers
Recall PIC's update is
- v^(t) = W*v^(t-1) = D^-1*A*v^(t-1)
- …where D is the (diagonal) degree matrix: D = diag(A*1), with 1 a column vector of 1's
My favorite distance metric for text is length-normalized TFIDF:
- Definition: A(i,j) = <v_i, v_j> / (||v_i||*||v_j||), where <.,.> is the inner product and ||.|| the L2 norm
- Let N(i,i) = ||v_i||, and N(i,j) = 0 for i != j
- Let F(i,k) = TFIDF weight of word w_k in document v_i
- Then: A = N^-1*F*F^T*N^-1

Lazy computation of distances and normalizers
Recall PIC's update is
- v^(t) = W*v^(t-1) = D^-1*A*v^(t-1), where D is the (diagonal) degree matrix
- Let F(i,k) = TFIDF weight of word w_k in document v_i
- Compute N(i,i) = ||v_i||, with N(i,j) = 0 for i != j
- Don't compute A = N^-1*F*F^T*N^-1 explicitly
- Let D = diag(N^-1*F*F^T*N^-1*1), where 1 is an all-1's vector; computed as D = N^-1*(F*(F^T*(N^-1*1))) for efficiency
- New update: v^(t) = D^-1*A*v^(t-1) = D^-1*N^-1*F*F^T*N^-1*v^(t-1)
Equivalent to using TFIDF/cosine on all pairs of examples, but requires only sparse matrices.
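A sketch of this lazy update with scipy sparse matrices: four sparse matrix-vector products per iteration, and the dense doc-doc matrix A is never formed.

```python
import numpy as np
from scipy.sparse import csr_matrix

def lazy_pic_update(F, v):
    """One lazy PIC update v' = D^-1 A v, with A = N^-1 F F^T N^-1
    never materialized. F: sparse docs-x-words TFIDF matrix (CSR)."""
    norms = np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())  # N(i,i)=||v_i||
    n_inv = 1.0 / norms
    ones = np.ones(F.shape[0])
    d = n_inv * (F @ (F.T @ (n_inv * ones)))   # degree vector D = A * 1
    av = n_inv * (F @ (F.T @ (n_inv * v)))     # A * v, right-to-left mat-vecs
    return av / d                              # v' = D^-1 A v

# Usage sketch: F = csr_matrix(tfidf_weights); v = lazy_pic_update(F, v)
```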

Experimental results
RCV1 text classification dataset:
- 800k+ newswire stories
- Category labels from an industry vocabulary
- Took single-label documents and categories with at least 500 instances
- Result: 193,844 documents, 103 categories
Generated 100 random category pairs:
- Each is all documents from two categories
- Range in size and difficulty
- Pick category 1, with m_1 examples; pick category 2 such that 0.5*m_1 < m_2 < 2*m_1

Results
- NCUTevd: Ncut with exact eigenvectors
- NCUTiram: Ncut with the implicitly restarted Arnoldi method
- No statistically significant differences between NCUTevd and PIC

Results

Linear run-time implies a constant number of iterations
The number of iterations to "acceleration-convergence" is hard to analyze:
- Faster than a single complete run of power iteration to convergence
- On our datasets, a small number of iterations is typical, and a large number is exceptional

Implicit Manifolds on the NELL datasets
[Figure omitted: the NELL label-propagation graph again — noun phrases (Paris, San Francisco, Austin, Pittsburgh, Seattle; anxiety, denial, selfishness, arrogance) linked to patterns like "live in arg1", "mayor of arg1", "traits such as arg1", "arg1 is home of" — with nodes "near" vs. "far from" the seeds.]

Using the Manifold Trick for SSL

A smoothing trick:

Using the Manifold Trick for SSL