Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data Frank Lin and William W. Cohen School of Computer Science, Carnegie Mellon University.

Presentation transcript:

Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data. Frank Lin and William W. Cohen, School of Computer Science, Carnegie Mellon University. KDD-MLG, San Diego, CA, USA.

Overview: Graph-based SSL; The Problem with Text Data; Implicit Manifolds (Two GSSL Methods, Three Similarity Functions); A Framework; Results; Related Work.

Graph-based SSL: Graph-based semi-supervised learning methods can often be viewed as methods for propagating labels along the edges of a graph, so they are naturally a good fit for network data. Given a node labeled class A and a node labeled class B, how do we label the rest of the nodes in the graph?

The Problem with Text Data: Documents are often represented as feature vectors of words, e.g.: "The importance of a Web page is an inherently subjective matter, which depends on the readers…"; "In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use…"; "You're not cool just because you have a lot of followers on twitter, get over yourself…" (the slide shows these as rows of a sparse document-term matrix over a small vocabulary).

The Problem with Text Data: Feature vectors are often sparse, but the similarity matrix is not! The feature matrix is mostly zeros – any document contains only a small fraction of the vocabulary – while the similarity matrix is mostly non-zero – any two documents are likely to have a word in common.

The Problem with Text Data: A similarity matrix is the input to many GSSL methods, but it takes O(n^2) time to construct, O(n^2) space to store, and more than O(n^2) time to operate on – too expensive, and it does not scale up to big datasets. How can we apply GSSL methods efficiently?
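To make the scale concrete: with n = 100,000 documents, a dense n-by-n similarity matrix already has 10^10 entries, roughly 80 GB at 8 bytes per entry, before any of the more-than-O(n^2) operations on it even start.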

The Problem with Text Data – Solutions: 1. Make the matrix sparse – a lot of cool work has gone into how to do this (if you do SSL on large-scale non-network data, there is a good chance you are using it already). 2. Implicit manifolds – and this is what we'll talk about!

Implicit Manifolds: A pairwise similarity matrix is a manifold on which the data points "reside". It is implicit because we never explicitly construct the manifold (the pairwise similarities), although the results we get are exactly the same as if we did. "What do you mean by that?" "Sounds too good – what's the catch?"

Implicit Manifolds: Two requirements for using implicit manifolds on text(-like) data: 1. The GSSL method can be implemented with matrix-vector multiplications. 2. We can decompose the dense similarity matrix into a series of sparse matrix-matrix multiplications. As long as these are met, we can obtain the exact same results without ever constructing a similarity matrix! Let's look at some specifics, starting with the two GSSL methods.
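As a rough illustration of how the two requirements fit together (a minimal sketch with made-up sizes, not the paper's code): if the dense similarity matrix factors as S = F F^T for a sparse feature matrix F, then any product S v can be computed as F (F^T v) using only sparse matrix-vector operations, so S never needs to be materialized.

```python
import numpy as np
from scipy import sparse

# Toy sparse document-feature matrix F (n docs x m features); sizes are arbitrary.
F = sparse.random(1000, 5000, density=0.01, format="csr", random_state=0)
v = np.random.rand(1000)

# What we avoid: S = (F @ F.T) is a mostly non-zero 1000 x 1000 matrix.
# What we do instead: the same product, one sparse mat-vec at a time.
Sv = F @ (F.T @ v)   # F^T v first (length 5000), then F times that (length 1000)
```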

Two GSSL Methods: the harmonic functions method (HF) and MultiRankWalk (MRW). If you've never heard of them, you might have heard of one of their cousins. HF's cousins (propagation via backward random walks): Gaussian fields and harmonic functions classifier (Zhu et al. 2003); weighted-vote relational neighbor classifier (wvRN; Macskassy & Provost 2007); weakly-supervised classification via random walks (Talukdar et al. 2008); Adsorption (Baluja et al. 2008); learning on diffusion maps (Lafon & Lee 2006); and others. MRW's cousins (propagation via forward random walks, with restart): partially labeled classification using Markov random walks (Szummer & Jaakkola 2001); learning with local and global consistency (Zhou et al. 2004); graph-based SSL as a generative model (He et al. 2007); ghost edges for classification (Gallagher et al. 2008); and others.

Two GSSL Methods: the harmonic functions method (HF) and MultiRankWalk (MRW). In both of these iterative implementations, the core computation is matrix-vector multiplication! Let's look at their implicit-manifold qualification; a sketch of both update steps follows below.
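The two update rules can be sketched in a few lines. This is only my reading of the general iterative style – the variable names, clamping, and restart details are illustrative, not the paper's exact implementation:

```python
import numpy as np
from scipy import sparse

def hf_step(W, F, seeds):
    """One harmonic-functions-style step: F <- D^-1 W F, then re-clamp seed rows.
    W: sparse n x n similarity/adjacency; F: n x k class-score matrix;
    seeds: dict {node index: class index}."""
    d = np.maximum(np.asarray(W.sum(axis=1)).ravel(), 1e-12)
    F = (W @ F) / d[:, None]            # the core work: one sparse mat-vec per class
    for node, c in seeds.items():       # labeled nodes keep their labels
        F[node, :] = 0.0
        F[node, c] = 1.0
    return F

def mrw_step(P, R, U, damping=0.85):
    """One MultiRankWalk-style step: random walk with restart, one ranking per class.
    P: row-stochastic transition matrix; R: n x k rankings; U: n x k restart vectors."""
    return (1.0 - damping) * U + damping * (P.T @ R)   # again just sparse mat-vecs
```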

Three Similarity Functions: (1) Inner product similarity – simple, good for binary features. (2) Cosine similarity – often used for document categorization. (3) Bipartite graph walk – can be viewed as a walk on the bipartite document-feature graph, or as feature reweighting by relative frequency; related to TF-IDF term weighting. Side note: perhaps all the cool work on matrix factorization could be useful here too…

Putting Them Together – Example: HF + inner product similarity. The diagonal matrix D can be calculated without building the similarity matrix: D(i,i) = d(i), where d = F F^T 1, evaluated right-to-left. The iteration update becomes v <- D^-1 (F (F^T v)) – the parentheses are important! Construction: O(n). Storage: O(n). Operation: O(n). How about a similarity function we actually use for text data?
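A hedged sketch of this slide's trick (the matrix F and the sizes below are invented for illustration): compute d = F F^T 1 right-to-left, and apply the update with the same grouping, so only sparse matrix-vector products are ever formed.

```python
import numpy as np
from scipy import sparse

F = sparse.random(2000, 10000, density=0.005, format="csr", random_state=1)  # docs x features
n = F.shape[0]

# d = F F^T 1, evaluated right-to-left -- the n x n matrix F F^T is never built.
d = F @ (F.T @ np.ones(n))
Dinv = 1.0 / np.maximum(d, 1e-12)

# One update on a class-score vector v: v <- D^-1 F (F^T v).
v = np.random.rand(n)
v = Dinv * (F @ (F.T @ v))   # the parentheses are exactly why this stays cheap
```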

Putting Them Together – Example: HF + cosine similarity. Here the diagonal cosine normalizing matrix N_c enters: D can be calculated as D(i,i) = d(i), where d = N_c F F^T N_c 1, and the iteration update becomes v <- D^-1 (N_c (F (F^T (N_c v)))). Compact storage: we don't need a cosine-normalized version of the feature vectors. Construction: O(n). Storage: O(n). Operation: O(n).
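And a matching sketch for the cosine-similarity case (again an illustrative toy, not the authors' code): the diagonal normalizer N_c holds one over each document's vector norm and is kept as a vector, so the cosine-normalized feature vectors never need to be stored.

```python
import numpy as np
from scipy import sparse

F = sparse.random(2000, 10000, density=0.005, format="csr", random_state=2)
n = F.shape[0]

# N_c: diagonal cosine normalizer, 1 / L2 norm of each row of F (kept as a vector).
nc = 1.0 / np.maximum(np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel()), 1e-12)

# d = N_c F F^T N_c 1, evaluated right-to-left (N_c 1 is just the vector nc).
d = nc * (F @ (F.T @ nc))
Dinv = 1.0 / np.maximum(d, 1e-12)

# One update with cosine similarity: v <- D^-1 N_c F (F^T (N_c v)).
v = np.random.rand(n)
v = Dinv * (nc * (F @ (F.T @ (nc * v))))
```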

A Framework – so what's the point? 1. Towards a principled way for researchers to apply GSSL methods to text data, and the conditions under which they can do this efficiently. 2. Towards a framework on which researchers can develop and discover new methods (and recognize old ones). 3. Building an SSL tool set – pick one that works for you, drawing on all the great work that has been done.

A Framework – choose your SSL method: Harmonic Functions or MultiRankWalk. "Hmm… I have a large dataset with very few training labels, what should I try?" "How about MultiRankWalk with a low restart probability?"

A Framework – … and pick your similarity function: Inner Product, Cosine Similarity, or Bipartite Graph Walk. "But the documents in my dataset are kinda long…" "Can't go wrong with cosine similarity!"

Results: on 2 commonly used document categorization datasets and on 2 less common NP classification datasets. Goal: to show that these methods do work on large text datasets, with results consistent with what we know about these SSL methods and similarity functions.

Results: document categorization.

Results: noun phrase classification datasets.

Questions?

Additional Information

MRW: RWR for Classification. We refer to this method as MultiRankWalk: it classifies data with multiple rankings – one random walk with restart per class – as sketched below.
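One plausible end-to-end reading of MRW as a classifier (a sketch under my own assumptions about restart handling and normalization, not the exact published procedure): run one random walk with restart per class, restarting at that class's seeds, then label each node by its highest-scoring walk.

```python
import numpy as np
from scipy import sparse

def multirankwalk(W, seeds, n_classes, damping=0.85, n_iter=100):
    """Classify nodes of a sparse graph W via per-class random walks with restart.
    seeds: dict {node index: class index}."""
    n = W.shape[0]
    dinv = 1.0 / np.maximum(np.asarray(W.sum(axis=1)).ravel(), 1e-12)
    P = sparse.diags(dinv) @ W                       # row-stochastic transition matrix

    U = np.zeros((n, n_classes))                     # one restart distribution per class
    for node, c in seeds.items():
        U[node, c] = 1.0
    U /= np.maximum(U.sum(axis=0), 1e-12)            # normalize each class's seed mass

    R = U.copy()
    for _ in range(n_iter):
        R = (1.0 - damping) * U + damping * (P.T @ R)
    return R.argmax(axis=1)                          # label = class with highest ranking
```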

MRW: Seed Preference. Obtaining labels for data points is expensive, and we want to minimize the cost of obtaining them. Observations: some labels are inherently more useful than others, and some labels are easier to obtain than others. Question: "authoritative" or "popular" nodes in a network are typically easier to obtain labels for – but are these labels also more useful than others?

Seed Preference: Consider the task of giving a human expert (or posting jobs on Amazon Mechanical Turk) a list of data points to label. The list (the seeds) can be generated uniformly at random, or we can have a seed preference based on simple properties of the unlabeled data. We consider 3 preferences: Random; Link Count (nodes with the highest link counts make the list); PageRank (nodes with the highest PageRank scores make the list). A small sketch of these preferences follows below.
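A small sketch of the three seed preferences just listed (function and parameter names are mine, and the PageRank loop is a plain power iteration, not necessarily the implementation used in the experiments):

```python
import numpy as np
from scipy import sparse

def propose_seeds(W, k, preference="link_count", damping=0.85, n_iter=50):
    """Return k node indices to send to the human labeler, under a seed preference."""
    n = W.shape[0]
    degree = np.asarray(W.sum(axis=1)).ravel()
    if preference == "random":
        return np.random.permutation(n)[:k]
    if preference == "link_count":
        return np.argsort(-degree)[:k]               # highest link counts make the list
    if preference == "pagerank":
        P = sparse.diags(1.0 / np.maximum(degree, 1e-12)) @ W
        r = np.full(n, 1.0 / n)
        for _ in range(n_iter):
            r = (1.0 - damping) / n + damping * (P.T @ r)
        return np.argsort(-r)[:k]                    # highest PageRank scores make the list
    raise ValueError(f"unknown preference: {preference}")
```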

MRW: The Question. What really makes MRW and wvRN different? Network-based SSL methods often boil down to label propagation, and MRW and wvRN represent two general propagation methods – note that each is called by many names. MRW: random walk with restart; regularized random walk; personalized PageRank; local & global consistency. wvRN: reverse random walk; random walk with sink nodes; hitting time; harmonic functions on graphs; iterative averaging of neighbors. Great… but we still don't know what explains the differences in their behavior on these network datasets!

MRW: The Question. It's difficult to answer exactly why MRW does better with a smaller number of seeds, but we can gather probable factors from their propagation models: (1) MRW is centrality-sensitive, wvRN is centrality-insensitive; (2) MRW has an exponential drop-off (damping factor), wvRN has no drop-off / damping; (3) in MRW, propagation of different classes is done independently, while in wvRN, propagation of different classes interacts.

MRW: The Question. An example from a political blog dataset – MRW vs. wvRN scores for how much a blog is politically conservative (seed labels were underlined on the original slide).
First column of scores: 1.000 neoconservatives.blogspot.com, 1.000 strangedoctrines.typepad.com, 1.000 jmbzine.com, 0.593 presidentboxer.blogspot.com, 0.585 rooksrant.com, 0.568 purplestates.blogspot.com, 0.553 ikilledcheguevara.blogspot.com, 0.540 restoreamerica.blogspot.com, 0.539 billrice.org, 0.529 kalblog.com, 0.517 right-thinking.com, 0.517 tom-hanna.org, 0.514 crankylittleblog.blogspot.com, 0.510 hasidicgentile.org, 0.509 stealthebandwagon.blogspot.com, 0.509 carpetblogger.com, 0.497 politicalvicesquad.blogspot.com, 0.496 nerepublican.blogspot.com, 0.494 centinel.blogspot.com, 0.494 scrawlville.com, 0.493 allspinzone.blogspot.com, 0.492 littlegreenfootballs.com, 0.492 wehavesomeplanes.blogspot.com, 0.491 rittenhouse.blogspot.com, 0.490 secureliberty.org, 0.488 decision08.blogspot.com, 0.488 larsonreport.com.
Second column of scores: 0.020 firstdownpolitics.com, 0.019 neoconservatives.blogspot.com, 0.017 jmbzine.com, 0.017 strangedoctrines.typepad.com, 0.013 millers_time.typepad.com, 0.011 decision08.blogspot.com, 0.010 gopandcollege.blogspot.com, 0.010 charlineandjamie.com, 0.008 marksteyn.com, 0.007 blackmanforbush.blogspot.com, 0.007 reggiescorner.blogspot.com, 0.007 fearfulsymmetry.blogspot.com, 0.006 quibbles-n-bits.com, 0.006 undercaffeinated.com, 0.005 samizdata.net, 0.005 pennywit.com, 0.005 pajamahadin.com, 0.005 mixtersmix.blogspot.com, 0.005 stillfighting.blogspot.com, 0.005 shakespearessister.blogspot.com, 0.005 jadbury.com, 0.005 thefulcrum.blogspot.com, 0.005 watchandwait.blogspot.com, 0.005 gindy.blogspot.com, 0.005 cecile.squarespace.com, 0.005 usliberals.about.com, 0.005 twentyfirstcenturyrepublican.blogspot.com.
Observations: 1. Centrality-sensitive: seeds have different scores and are not necessarily the highest. 2. Exponential drop-off: much less sure about nodes further away from seeds. 3. Classes propagate independently: charlineandjamie.com is both very likely a conservative and a liberal blog (good or bad?). We still don't really understand it yet.

MRW: Related Work. MRW is very much related to: "Learning with local and global consistency" (Zhou et al. 2004) – a similar formulation, viewed differently; "Web content categorization using link information" (Gyongyi et al. 2006) – RWR ranking used as features to an SVM; "Graph-based semi-supervised learning as a generative model" (He et al. 2007) – random walk without restart, with heuristic stopping. Seed preference is related to the field of active learning: active learning chooses which data point to label next based on previous labels (the labeling is interactive), whereas seed preference is a batch labeling method. Authoritative seed preference is a good baseline for active learning on network data!

Results: How much better is MRW when using an authoritative seed preference? (y-axis: MRW F1 score minus wvRN F1 score; x-axis: number of seed labels per class.) The gap between MRW and wvRN narrows with authoritative seeds, but it is still prominent on some datasets with a small number of seed labels.

A Web-Scale Knowledge Base. The Read the Web (RtW) project: build a never-ending system that learns to extract information from unstructured web pages, resulting in a knowledge base of structured information.

Noun Phrase and Context Data. As part of the RtW project, two kinds of noun phrase (NP) and context co-occurrence data were produced: NP-context co-occurrence data and NP-NP co-occurrence data. These datasets can be treated as graphs.

Noun Phrase and Context Data – NP-context data. Example sentence: "… know that drinking pomegranate juice may not be a bad …". The NP "pomegranate juice" co-occurs with the before-context "know that drinking _" (count 3) and the after-context "_ may not be a bad" (count 2); other contexts such as "_ is made from" and "_ promotes responsible" link to NPs such as "JELL-O" and "Jagermeister" in the NP-context graph.
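For concreteness, here is a hedged sketch of how co-occurrence triples like the ones on this slide could be loaded into a sparse NP-by-context count matrix, which can then play the role of the feature matrix F in the implicit-manifold computations earlier (the triples and the index-building code are illustrative only):

```python
from scipy import sparse

# Illustrative (noun phrase, context, count) triples in the spirit of the slide.
triples = [
    ("pomegranate juice", "know that drinking _", 3),
    ("pomegranate juice", "_ may not be a bad", 2),
]

np_index = {p: i for i, p in enumerate(sorted({p for p, _, _ in triples}))}
ctx_index = {c: j for j, c in enumerate(sorted({c for _, c, _ in triples}))}

rows = [np_index[p] for p, _, _ in triples]
cols = [ctx_index[c] for _, c, _ in triples]
vals = [w for _, _, w in triples]

# Sparse NP-by-context count matrix: the bipartite NP-context graph.
F = sparse.csr_matrix((vals, (rows, cols)), shape=(len(np_index), len(ctx_index)))
```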

Noun Phrase and Context Data – NP-NP data. Example sentence: "… French Constitution Council validates burqa ban …". NPs that co-occur through a context (here, "French Constitution Council" and "burqa ban") are connected in the NP-NP graph, alongside NPs such as "French Court", "hot pants", "veil", "JELL-O", and "Jagermeister". Context can be used for weighting edges or for making a more complex graph.

Noun Phrase Categorization. We propose using MRW (with path folding) on the NP-context data to categorize NPs, given a handful of seed NPs. Challenges: – Large, noisy dataset (10m NPs, 8.6m contexts from 500m web pages). – What’s the right function for NP-NP categorical similarity? – Which learned category assignment should we “promote” to the knowledge base? – How to evaluate it?

Noun Phrase Categorization. Preliminary experiment: – A small subset of the NP-context data (88k NPs, 99k contexts). – Find the category "city", starting with a handful of seeds. – A ground truth set of 2,404 city NPs created using …