Applying Link-based Classification to Label Blogs
Smriti Bhagat, Irina Rozenbaum, Graham Cormode

Blogs as Multigraphs
Many interesting new data sources are best modelled as multigraphs, with multiple attributes and link types.
“Blogs” are an important emerging example of such data:
– Intersect with web, chat data, social networks
– React rapidly to major news, defining opinion and identifying articles of interest
– Raise problems of trustworthiness, finding leaders, classifying for expertise and bias
We study labeling problems on these large multigraphs.

[Figure: annotated blog page. Callouts: author; tags; timestamp; links; headline; static links (“blogroll”); reader comments with commenter id and timestamp; text; profile data]

[Figure: annotated profile page. Callouts: personal info; A/S/L (Age, Sex, Location); links to friends on same host; free-text info; instant messenger and ids]

Learning Labels on Multigraphs
[Figure: multigraph with blog, blog entry, and webpage nodes; one node has an unknown label (?)]
– Blogs, blog links, web links, comments etc. implicitly define a (massive) multigraph
– We focus on problems of learning labels
– Our focus is on properties of the blog author such as age
– As with all supervised learning, we cannot always trust the training data… apparently some people lie about their age

Prior Work on (Multi)graph Learning
– Relational learning: classify objects represented by a relational database (see work by Getoor et al.)
– Typically builds complex models, e.g. Relational Markov Networks, on relatively small examples (a few thousand nodes)
– Our problem is also an instance of semi-supervised learning (the input is a mix of labeled and unlabeled examples)
– Several works apply matrix decomposition, which does not scale well to massive (multi)graphs
– Some work addresses similar labeling problems on the web graph, using links in addition to text (Chakrabarti et al., 1998)

Simple Learning on Graphs
Local: Iterative
– Hypothesis: Nodes point to other nodes with similar labels (homophily)
– Label is computed from the votes by its neighbors
– Labels are computed iteratively using weighted voting by neighbors
Global: Nearest Neighbor
– Hypothesis: Nodes with similar neighborhoods have similar labels (co-citation regularity)
– Label is inferred by searching for similar neighborhoods of labeled nodes
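The local, iterative idea can be illustrated with a minimal sketch. This is not the authors' implementation: the edge format, the choice to treat every link as undirected, the synchronous update, the fixed number of rounds and the tie-breaking rule are all assumptions made for the example.

```python
from collections import defaultdict

def iterative_vote(edges, labels, rounds=5):
    """Label unlabeled nodes by weighted majority vote of their neighbors.

    edges:  iterable of (src, dst, weight) links (hypothetical input format)
    labels: dict node -> known label; these training labels are never changed
    """
    # Assumption of this sketch: a link is evidence in both directions.
    nbrs = defaultdict(list)
    for u, v, w in edges:
        nbrs[u].append((v, w))
        nbrs[v].append((u, w))

    current = dict(labels)
    for _ in range(rounds):
        updated = dict(current)
        for node in nbrs:
            if node in labels:              # keep the given labels fixed
                continue
            votes = defaultdict(float)
            for other, w in nbrs[node]:
                if other in current:        # only labeled neighbors vote
                    votes[current[other]] += w
            if votes:
                # Highest total weight wins; ties go to the smaller label.
                updated[node] = min(votes, key=lambda lab: (-votes[lab], lab))
        if updated == current:              # nothing changed: converged
            break
        current = updated
    return current

edges = [("a", "b", 1.0), ("b", "c", 1.0), ("c", "d", 1.0), ("e", "b", 2.0)]
print(iterative_vote(edges, {"a": 20, "e": 30}))
```

In this toy graph, node b hears a weight-1 vote for 20 and a weight-2 vote for 30, so it takes 30, and that label then spreads to c and d in later rounds.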

Extend Learning to Multigraphs
Iterative: Pseudo Labels
– Hypothesis: Web pages link similar communities of bloggers
– Web pages are assigned a pseudo label, based on votes by their neighbors
Nearest Neighbor: Set Similarity
– Hypothesis: Distance computation is improved with additional features
– Augment distance with similarity between sets of neighboring web nodes
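The pseudo-label idea can be sketched as a two-step vote: labeled blogs vote on each web page they link to, then unlabeled blogs take the majority pseudo label of the pages they link to. The function names, the unweighted votes and the input format below are hypothetical, chosen only for illustration.

```python
from collections import Counter

def _majority(counter):
    # Most frequent label; ties go to the smaller label for determinism.
    return min(counter, key=lambda lab: (-counter[lab], lab))

def pseudo_label_web(blog_to_web, blog_labels):
    """Give each web page the most common label among labeled blogs linking to it."""
    votes = {}
    for blog, page in blog_to_web:
        if blog in blog_labels:
            votes.setdefault(page, Counter())[blog_labels[blog]] += 1
    return {page: _majority(c) for page, c in votes.items()}

def label_via_web(blog_to_web, blog_labels):
    """Label unlabeled blogs from the pseudo labels of the pages they link to."""
    web_labels = pseudo_label_web(blog_to_web, blog_labels)
    votes = {}
    for blog, page in blog_to_web:
        if blog not in blog_labels and page in web_labels:
            votes.setdefault(blog, Counter())[web_labels[page]] += 1
    inferred = {blog: _majority(c) for blog, c in votes.items()}
    return {**inferred, **blog_labels}      # known labels always win

# Toy data (made up): two pages, three labeled blogs, two unlabeled blogs.
links = [("b1", "page_x"), ("b2", "page_x"), ("b3", "page_x"),
         ("b4", "page_y"), ("b5", "page_y")]
print(label_via_web(links, {"b1": 17, "b2": 17, "b4": 28}))
```

Blogs that share no blog-to-blog links with any labeled blog can still receive a label this way, as long as they link to a page that labeled blogs also link to.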

Implementation Issues
Preliminary experiments guided the choice of settings:
– Choice of similarity function for the NN classifier: used the correlation coefficient between vectors of adjacent labels
– Smoothed the feature vector with a triangular kernel because of the continuity of ages
– In the multigraph case with additional features, extended by blending with the Jaccard coefficient of set similarity of features
– Iterative algorithm allocates a label based on majority voting
Experimented with a variety of edge combinations: friends only, blog only, blog+friends, blog+web
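A rough sketch of how these similarity pieces might fit together is given below. The age range, the kernel half-width, the linear blend and its weight `alpha` are assumptions of this example; the slides do not state the values or the exact blending formula used.

```python
import math

def label_histogram(neighbor_ages, min_age=10, max_age=60):
    """Count how many neighbors carry each age label (age range is assumed)."""
    vec = [0.0] * (max_age - min_age + 1)
    for age in neighbor_ages:
        if min_age <= age <= max_age:
            vec[age - min_age] += 1.0
    return vec

def triangular_smooth(vec, half_width=2):
    """Spread each count onto nearby ages with linearly decaying weights."""
    out = [0.0] * len(vec)
    for i, count in enumerate(vec):
        if count == 0.0:
            continue
        for d in range(-half_width, half_width + 1):
            j = i + d
            if 0 <= j < len(vec):
                out[j] += count * (1.0 - abs(d) / (half_width + 1))
    return out

def correlation(x, y):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def blended_similarity(ages_a, ages_b, web_a, web_b, alpha=0.5):
    """Blend label-vector correlation with Jaccard similarity of web neighbors."""
    va = triangular_smooth(label_histogram(ages_a))
    vb = triangular_smooth(label_histogram(ages_b))
    return (1 - alpha) * correlation(va, vb) + alpha * jaccard(web_a, web_b)

raw = correlation(label_histogram([20]), label_histogram([21]))
smooth = correlation(triangular_smooth(label_histogram([20])),
                     triangular_smooth(label_histogram([21])))
print(raw < 0 < smooth)
```

The final comparison shows why smoothing matters for ages: without it, a neighbor aged 20 and a neighbor aged 21 land in different histogram cells and look unrelated, while the smoothed vectors overlap.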

Data Collection Summary
≈50GB of data collected across three data sets:

                     Data set A    Data set B    Data set C
Profiles crawled     300K          780K          400K
Labeled              124K (41%)    500K (64%)    50K (12.5%)
Blog nodes           200K          535K          41K
Blog links           404K          3000K         190K
Web nodes            289K          74K           331K
Web links            1089K         895K          997K
Median blog links    2             5             4
Median web links     4             2             3

Most popular weblinks, one list per data set:
– news.google.com, picasa.google.com, en.wikipedia.org
– maps.google.com, photobucket.com, quizilla.com
– members.msn.com, wwp.icq.com, edit.yahoo.com

Accuracy on Age Label
– Similar results on age for both methods; some data sets are “easier” than others, due to density and connectivity
– The local algorithm takes a few seconds to assign labels; NN takes tens of minutes (due to exhaustive comparisons)

Multigraph Labeling for Age
– Adding web links and using pseudo labels does not significantly change accuracy, but increases coverage
– The assigned age reflects the webpage, e.g. the bands Slipknot (17) vs. Radiohead (28), but also the demographics of the blog network

Learning Location Labels
– Local algorithm predicts country and continent with high (80%+) accuracy over all data sets, validating the hypothesis
– Errors come from over-representing common labels: N. America has high recall and low precision, Africa vice versa
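The recall and precision pattern described for common and rare labels can be reproduced on a small example. The labels and counts below are made up for illustration and are not results from the study; the helper name is hypothetical.

```python
from collections import Counter

def precision_recall(true_labels, predicted_labels):
    """Return {label: (precision, recall)} for every label seen."""
    tp, pred_count, true_count = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, predicted_labels):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            tp[t] += 1
    result = {}
    for lab in set(true_count) | set(pred_count):
        prec = tp[lab] / pred_count[lab] if pred_count[lab] else 0.0
        rec = tp[lab] / true_count[lab] if true_count[lab] else 0.0
        result[lab] = (prec, rec)
    return result

# Toy data: a classifier that over-predicts the common label "NA".
truth = ["NA"] * 6 + ["AF"] * 4
pred = ["NA"] * 6 + ["NA", "NA", "NA", "AF"]
print(precision_recall(truth, pred))
```

Here every true "NA" node is found (recall 1.0) but a third of the "NA" predictions are wrong (precision 2/3), while the single "AF" prediction is correct (precision 1.0) yet covers only a quarter of the true "AF" nodes (recall 0.25).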

Conclusions
Analyzed performance of simple classifiers for blog data using link and label information only
– Richness of the setting leads to many details: choice of distance, smoothing and voting functions, etc.
– Links alone still hold a lot of information: 80% accuracy, better than naïve use of standard classifiers
Simple models are quite limited and do not extend easily
– Work better for some labels, rely on hypotheses
– Open problem: apply and scale richer models (Relational Markov Networks) to blogs
Need to understand the benefit of additional attributes
– In our experiments, extra features did not seem to help