Semi-Supervised Learning With Graphs William Cohen


Semi-Supervised Learning With Graphs William Cohen

HW7 Hints: low-rank matrix factorization. V is an n x m ratings matrix (n users, m movies) with V[i,j] = user i's rating of movie j; approximate V ~ W H, where W is n x r and H is r x m.

Homework 7 - Hints What’s parallel and what’s sequential? for epoch t=1,….,T in sequence – for stratum s=1,…,K in sequence for block b=1,… in stratum s in parallel – for triple (i,j,rating) in block b in sequence » do SGD step for (i,j,rating) Hint 1: You need some loops around rdd operations Hint 2: Look at rdd.mapPartitions() to see how to control a parallel/sequential breakdown like the inner loop only needs part of W,H this needs r/w access to (part of) W and H How? 3

Homework 7 - Hints What’s in memory and what’s on disk? ratings triples (i,j,r) on disk – W,H in memory as shared synchronized memory: you should worry about transmission cost and blocking Abhinav hint: Python dictionaries are thread-safe 4

Homework 7 - Hints What’s in memory and what’s on disk? ratings triples (i,j,r) on disk – W,H in memory as broadcast memory: workers get unsynchronized copies of the data: worry here is extra broadcast cost hint: mapPartitions ~= Hadoop map 5

Homework 7 - Hints: What is rdd.mapPartitions(foo)? The function foo takes an iterator over the entries in a partition (aka shard) of the rdd and yields a sequence of entries:

    def foo(iterator):
        doSomething()          # setup
        for x in iterator:
            ...
            yield ...
        doAnotherThing()       # teardown

Homework 7 - Hints: What is rdd.mapPartitions(foo)? For this assignment, foo's input is the broadcast W and H plus the ratings V[i,j] in its partition, and its output is the updated (modified) entries of W and H:

    def foo(iterator):
        doSomething()          # setup: modifiedEntries = set()
        for (i, j, r) in iterator:
            ...
            yield ...          # update local copies of W[i,*], H[*,j]
        doAnotherThing()       # teardown: output the modified entries of W, H
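A fuller sketch of what such a partition function might look like for HW7 (hypothetical, not the reference solution; it assumes W and H are numpy arrays that would arrive as broadcast values, and that the driver merges the yielded rows/columns back into its master copies between strata):

    import numpy as np

    rank, lr, lam = 10, 0.01, 0.1
    W = np.random.rand(50, rank)         # stand-ins for the broadcast values;
    H = np.random.rand(rank, 40)         # in Spark these would be W_b.value and H_b.value

    def sgd_partition(iterator):
        Wl, Hl = W.copy(), H.copy()      # setup: local copies of the broadcast factors
        rows, cols = set(), set()
        for (i, j, r) in iterator:       # sequential SGD steps over this block's triples
            err = r - Wl[i].dot(Hl[:, j])
            w_old = Wl[i].copy()
            Wl[i] += lr * (err * Hl[:, j] - lam * Wl[i])
            Hl[:, j] += lr * (err * w_old - lam * Hl[:, j])
            rows.add(i); cols.add(j)
        for i in rows:                   # teardown: emit only the modified entries
            yield ("W", i, Wl[i])
        for j in cols:
            yield ("H", j, Hl[:, j])

    # it is an ordinary generator, so it can be tested locally on a plain iterator,
    # exactly as Spark would call it on one partition:
    updates = list(sgd_partition(iter([(0, 1, 4.0), (2, 3, 2.0), (0, 3, 5.0)])))

In Spark the same function would be passed to stratum.mapPartitions(sgd_partition), and the driver would apply the yielded ("W", i, row) / ("H", j, col) updates to its master copies of W and H.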

Homework 7 - Hints What’s in memory and what’s on disk? ratings triples (i,j,r) on disk – W,H in memory shared memory individual copies of H, W memory – W,H on disk, joined in with (i,j,r) should worry about inflating the size of the ratings – tis scheme inserts many copies of each W – Hybrid schemes: triples (i,j,r)  (i,{(j1,r1),(j2,r2),…,}, W[i,*]) H in memory 8

Homework 7 - Hints: Change the grain size of the computation. Rather than streaming through individual (i, j, r) triples:
– create SGD subtasks (W1, H1, V11), (W2, H2, V22), (W3, H3, V33), … (a block of V plus the matching slices of W and H)
– each map task does SGD on one subtask
You should worry about subtask size, since part of the ratings matrix is then held in memory. (A sketch follows below.)
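A minimal sketch of that coarser-grained scheme (hypothetical; each RDD element is one complete subtask, so a plain map() is enough):

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="hw7-subtask-sketch")
    rank, lr, lam, K = 5, 0.01, 0.1, 3

    def make_subtask(b):
        # one subtask: a block of ratings plus the matching slices of W and H
        Wb = np.random.rand(20, rank)
        Hb = np.random.rand(rank, 15)
        Vb = [(i, j, 3.0) for i in range(20) for j in range(0, 15, 4)]  # block-local indices
        return (b, Wb, Hb, Vb)

    def sgd_on_subtask(task):
        b, Wb, Hb, Vb = task
        for (i, j, r) in Vb:                      # this block's ratings are all in memory
            err = r - Wb[i].dot(Hb[:, j])
            w_old = Wb[i].copy()
            Wb[i] += lr * (err * Hb[:, j] - lam * Wb[i])
            Hb[:, j] += lr * (err * w_old - lam * Hb[:, j])
        return (b, Wb, Hb)

    subtasks = sc.parallelize([make_subtask(b) for b in range(K)], K)
    updated = subtasks.map(sgd_on_subtask).collect()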

Homework 7 - Hints What’s in memory and what’s on disk? ratings triples (i,j,r) on disk – W,H in memory shared memory individual copies of H, W memory – W,H on disk, joined in with (i,j,r) should worry about inflating the size of the ratings – tis scheme inserts many copies of each W – Hybrid schemes: triples (i,j,r)  (i,{(j1,r1),(j2,r2),…,}, W[i,*]) H in memory 10

Homework 7 – Avuncular Advice
This is a really interesting assignment – think and learn! … but don't stress out. We know this assignment is challenging and we'll curve it appropriately, and HW8 is being reduced in scope. PS: if you're an undergrad doing carnival stuff, talk to me about the (useless) extension.

Semi-Supervised Learning With Graphs William Cohen

Flashback – Graph Algorithms so far….
– PageRank, and how to scale it up
– Personalized PageRank / Random Walk with Restart: how to implement it, and how to use it for extracting part of a graph
– Other uses for graphs? – not so much

Main topics today
– Scalable semi-supervised learning on graphs: SSL with RWR; SSL with coEM/wvRN/HF
– Scalable unsupervised learning on graphs: power iteration clustering; …

Semi-supervised learning
– A pool of labeled examples L
– A (usually larger) pool of unlabeled examples U
– Can you improve accuracy somehow using U?

Semi-Supervised Bootstrapped Learning / Self-training. Task: extract cities. [Figure: a bipartite graph of noun phrases (Paris, Pittsburgh, Seattle, Cupertino, San Francisco, Austin, Berlin, denial, anxiety, selfishness) linked to the contexts they occur in ("mayor of arg1", "live in arg1", "arg1 is home of", "traits such as arg1").]

Semi-Supervised Bootstrapped Learning via Label Propagation. [Figure: the same noun-phrase / context graph, with labels propagating from seed cities (Paris, San Francisco, Austin, Pittsburgh, Seattle) through shared contexts such as "live in arg1" and "mayor of arg1".]

Semi-Supervised Bootstrapped Learning via Label Propagation. [Same graph, distinguishing nodes "near" the seeds from nodes "far from" the seeds.] Information from other categories (e.g., traits such as arrogance, denial, selfishness) tells you "how far" is too far, i.e., when to stop propagating.

ASONAM-2010 (Advances in Social Networks Analysis and Mining)

Network Datasets with Known Classes: UBMCBlog, AGBlog, MSPBlog, Cora, Citeseer

RWR – fixpoint of the random-walk-with-restart update [equation shown as an image on the slide]. Seed selection: 1. order nodes by PageRank, degree, or randomly; 2. go down the list until you have at least k examples per class. (A sketch of the propagation follows below.)
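A minimal sketch of RWR-style scoring in numpy. The update used here, v = (1-d)·S·v + d·u with a column-normalized adjacency matrix S and a restart vector u over the seeds, is a guess at the equation the slide shows as an image, not a copy of it:

    import numpy as np

    def rwr_scores(A, seeds, d=0.15, iters=50):
        """A: n x n adjacency matrix; seeds: seed node ids for one class."""
        n = A.shape[0]
        S = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)  # column-normalize
        u = np.zeros(n)
        u[seeds] = 1.0 / len(seeds)            # restart distribution over the seeds
        v = u.copy()
        for _ in range(iters):
            v = (1 - d) * S.dot(v) + d * u     # power iteration toward the RWR fixpoint
        return v

    # tiny example: a 4-node path graph, seeded at node 0
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    print(rwr_scores(A, seeds=[0]))

Roughly, MultiRankWalk runs one such walk per class (seeded as described on this slide) and labels each node with the class whose walk assigns it the highest score.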

Results – Blog data. [Chart comparing the seed-selection strategies: Random, Degree, PageRank.] We'll discuss this soon….

Results – More blog data. [Same comparison: Random, Degree, PageRank.]

Results – Citation data. [Same comparison: Random, Degree, PageRank.]

Seeding – MultiRankWalk

Seeding – HF/wvRN

What is HF aka coEM aka wvRN?

CoEM/HF/wvRN
One definition [Macskassy & Provost, JMLR 2007]: …
Another definition: a harmonic field – the score of each node in the graph is the harmonic (linearly weighted) average of its neighbors' scores [X. Zhu, Z. Ghahramani, and J. Lafferty, ICML 2003].
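Written out (standard notation, not copied from the slides: w_ij is the weight of edge (i, j), f is the node score, y_i is the seed label), the harmonic-field condition is:

    f(i) = ( \sum_j w_{ij} f(j) ) / ( \sum_j w_{ij} )   for every unlabeled node i
    f(i) = y_i                                          for every labeled (seed) node i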

CoEM/wvRN/HF. Another justification of the same algorithm: start with co-training with a naïve Bayes learner….

CoEM/wvRN/HF
One algorithm with several justifications…. One is to start with co-training with a naïve Bayes learner, and compare it to an EM version of naïve Bayes:
– E: soft-classify the unlabeled examples with the NB classifier
– M: re-train the classifier with the soft-labeled examples

CoEM/wvRN/HF
A second experiment:
– each + example: concatenate features from two documents, one of class A+, one of class B+
– each - example: concatenate features from two documents, one of class A-, one of class B-
– features are prefixed with "A" or "B", so the two feature sets are disjoint

CoEM/wvRN/HF
A second experiment (same setup as above, with the "A"/"B" prefixes making the two views disjoint): NOW co-training outperforms EM.

CoEM/wvRN/HF
Co-training with a naïve Bayes learner (incremental hard assignments) vs. an EM version of naïve Bayes (iterative soft assignments):
– E: soft-classify the unlabeled examples with the NB classifier
– M: re-train the classifier with the soft-labeled examples

Co-Training Rote Learner. [Figure: a bipartite graph linking pages to the hyperlinks that point to them, e.g. anchor text "my advisor".]

Co-EM Rote Learner: equivalent to HF on a bipartite graph of NPs and contexts (e.g., the NP "Pittsburgh" and the context "lives in _").

What is HF aka coEM aka wvRN? Algorithmically:
– HF propagates weights and then resets the seeds to their initial value
– MRW propagates weights and does not reset the seeds
(A sketch of both follows below.)
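A minimal sketch of that algorithmic difference (hypothetical numpy code, not from the slides; normalization conventions differ across papers, so row-normalization is used for both variants just to keep the sketch short):

    import numpy as np

    def propagate(A, Y0, iters=30, reset_seeds=True, d=0.15):
        """HF-style if reset_seeds=True; MRW/RWR-style (restart weight d) if False."""
        S = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # row-normalize
        seeds = Y0.sum(axis=1) > 0
        Y = Y0.astype(float).copy()
        for _ in range(iters):
            if reset_seeds:
                Y = S.dot(Y)                      # harmonic-field averaging step...
                Y[seeds] = Y0[seeds]              # ...then clamp the seeds back
            else:
                Y = (1 - d) * S.dot(Y) + d * Y0   # MRW: seeds act as restart mass, never reset
        return Y.argmax(axis=1)                   # predicted class per node

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 1],
                  [1, 0, 0, 1],
                  [0, 1, 1, 0]], dtype=float)
    Y0 = np.array([[1, 0], [0, 1], [0, 0], [0, 0]], dtype=float)  # node 0 seeds class 0, node 1 class 1
    print(propagate(A, Y0, reset_seeds=True), propagate(A, Y0, reset_seeds=False))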

MultiRank Walk vs HF/wvRN/CoEM. [Figure: two small example graphs with the seeds marked S, showing the labelings produced by MRW and by HF.]

Back to Experiments: Network Datasets with Known Classes – UBMCBlog, AGBlog, MSPBlog, Cora, Citeseer

MultiRankWalk vs wvRN/HF/CoEM

How well does MRW work?

Parameter Sensitivity

Semi-supervised learning
– A pool of labeled examples L
– A (usually larger) pool of unlabeled examples U
– Can you improve accuracy somehow using U?
These methods are different from EM, which optimizes Pr(Data | Model). How do SSL methods like label propagation relate to optimization?

SSL as optimization (slides from Partha Talukdar)


yet another name for HF/wvRN/coEM

[Objective shown on the slide, with its three terms labeled: match the seeds; smoothness; prior.] (A typical written-out form follows below.)
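As a stand-in for the objective on the slide, a typical graph-SSL objective with this three-part shape is (standard notation, an assumption rather than a copy of the slide: \hat{Y} are the learned label scores, Y the seed labels, S the set of seed nodes, W the edge weights, R a prior / regularization target):

    min_{\hat{Y}}   \mu_1 \sum_{i \in S} || \hat{Y}_i - Y_i ||^2            (match the seeds)
                  + \mu_2 \sum_{i,j} W_{ij} || \hat{Y}_i - \hat{Y}_j ||^2   (smoothness)
                  + \mu_3 \sum_i || \hat{Y}_i - R_i ||^2                    (prior)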


How to do this minimization? First, differentiate to find where the minimum is [equation on slide]. Jacobi method: to solve Ax = b for x, iterate [update on slide] … or [matrix form on slide]. (The standard Jacobi update is written out below.)
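Filling in the update the slide shows as an image, the standard Jacobi iteration for Ax = b is (textbook form; the slide's notation may differ):

    x_i^{(t+1)} = ( b_i - \sum_{j \neq i} a_{ij} x_j^{(t)} ) / a_{ii}

or, in matrix form with A = D + R split into its diagonal and off-diagonal parts:

    x^{(t+1)} = D^{-1} ( b - R x^{(t)} )

Applied to a graph-SSL objective like the one above, each Jacobi pass amounts to re-averaging every node's score over its neighbors plus its seed/prior terms, which is why these optimization passes look just like label-propagation passes.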


[Results chart: precision-recall break-even point for the compared methods (…/HF/…).]


…from mining patterns like "musicians such as Bob Dylan", and from HTML tables on the web that are used for data, not formatting.


More recent work (AISTATS 2014)
Propagating labels usually requires only a small number of optimization passes – basically like label-propagation passes. Each pass is linear in:
– the number of edges
– and the number of labels being propagated
Can you do better? The basic idea: store each node's labels in a count-min sketch, which is basically a compact approximation of an object -> double mapping.

Flashback: CM Sketch Structure
Each string s is mapped by hash functions h_1(s), …, h_d(s) to one bucket per row. Estimate A[j] by taking min_k { CM[k, h_k(j)] }. Errors are always over-estimates. Sizes: d = log(1/δ) rows and w = 2/ε columns, so the error is usually less than ε·||A||_1 (with probability at least 1 − δ). (from: Minos Garofalakis)
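A minimal count-min sketch in Python, as an illustration of the structure described above (not the AISTATS 2014 code); it also shows the linear-combination property used on the next slide:

    import math
    import random

    class CountMinSketch:
        def __init__(self, eps=0.01, delta=0.01, seed=0):
            self.w = int(math.ceil(2.0 / eps))               # buckets per row (w = 2/eps)
            self.d = int(math.ceil(math.log(1.0 / delta)))   # rows / hash functions (d = log 1/delta)
            rng = random.Random(seed)
            self.salts = [rng.randrange(1 << 30) for _ in range(self.d)]
            self.table = [[0.0] * self.w for _ in range(self.d)]

        def _bucket(self, k, key):
            return hash((self.salts[k], key)) % self.w

        def add(self, key, value=1.0):
            for k in range(self.d):
                self.table[k][self._bucket(k, key)] += value

        def estimate(self, key):
            # min over rows: always an over-estimate of the true total
            return min(self.table[k][self._bucket(k, key)] for k in range(self.d))

    def combine(a, b, alpha=1.0, beta=1.0):
        """sketch(alpha*v + beta*w) = alpha*sketch(v) + beta*sketch(w), entry-wise."""
        out = CountMinSketch.__new__(CountMinSketch)
        out.w, out.d, out.salts = a.w, a.d, a.salts          # must share sizes and hash functions
        out.table = [[alpha * x + beta * y for x, y in zip(ra, rb)]
                     for ra, rb in zip(a.table, b.table)]
        return out

    cm1, cm2 = CountMinSketch(seed=1), CountMinSketch(seed=1)
    cm1.add("label:city", 0.7); cm2.add("label:city", 0.1); cm2.add("label:person", 0.4)
    mixed = combine(cm1, cm2, alpha=0.5, beta=0.5)
    print(mixed.estimate("label:city"), mixed.estimate("label:person"))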

More recent work (AISTATS 2014)
Propagating labels usually requires only a small number of optimization passes – basically like label-propagation passes. Each pass is linear in:
– the number of edges
– the number of labels being propagated
– and the sketch size
Sketches can be combined linearly without "unpacking" them: sketch(a*v + b*w) = a*sketch(v) + b*sketch(w), and sketches are good at storing skewed distributions.

More recent work (AISTATS 2014)
Label distributions are often very skewed:
– sparse initial labels
– community structure: labels from other subcommunities have small weight

More recent work (AISTATS 2014). [Results on the Freebase and Flick-10k datasets; "self-injection": similarity computation.]

More recent work (AISTATS 2014). [More Freebase results.]

More recent work (AISTATS 2014): 100 Gb available.