Active Learning for Streaming Networked Data Zhilin Yang, Jie Tang, Yutao Zhang Computer Science Department, Tsinghua University.

Introduction Mining streaming data has become an important topic.  Challenge 1: the high cost of obtaining labeled data Related work: active learning for streaming data [28, 6, 5, 29]  Challenge 2: network correlation between data instances Related work: active learning for networked data [23, 25, 3, 4, 10, 27, 8, 22]  A novel problem: active learning for streaming networked data, which must deal with both challenges 1 & 2.

Problem Formulation Streaming Networked Data When a new instance arrives, new edges are added to connect the new instance to existing instances.

Problem Formulation Active Learning for Streaming Networked Data The input is a data stream. At any time, we maintain a classifier based on the arrived instances and go through the following steps: 1. Predict the label of the new instance with the current classifier 2. Decide whether to query for its true label 3. Update the model accordingly. Our goal is to use a small number of queries to control (minimize) the cumulative error rate.
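The three steps above can be sketched as a generic online active-learning loop. This is a minimal sketch: `MajorityModel` and the always-query rule in the usage example are toy placeholders, not the method proposed on these slides.

```python
class MajorityModel:
    """Toy online classifier: predicts the majority label seen so far."""
    def __init__(self):
        self.counts = {0: 0, 1: 0}

    def predict(self, x):
        return 0 if self.counts[0] >= self.counts[1] else 1

    def update(self, x, y):
        self.counts[y] += 1


def stream_active_learning(stream, model, budget, should_query):
    """Run the predict / query / update loop and track the cumulative error."""
    errors = queries = n = 0
    for x, true_y in stream:
        y_pred = model.predict(x)                    # step 1: predict
        if queries < budget and should_query(model, x):
            model.update(x, true_y)                  # steps 2-3: query and update
            queries += 1
        errors += int(y_pred != true_y)              # cumulative error count
        n += 1
    return errors / n, queries
```

For example, on a stream of ten identically labeled instances with a budget of three queries, the toy model errs only on the first instance before its first update.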

Challenges Concept drift. The distribution of input data and network structure change over time as we are handling streaming data. How to adapt to concept drift? Network correlation. In the networked data, there is correlation among instances. How to model the correlation in the streaming data? Online query. We must decide whether to query an instance at the time of its appearance, which makes it infeasible to optimize a global objective function. How to develop online query algorithms?

Modeling Networked Data Time-Dependent Network At any time, we construct a time-dependent network based on all instances that have arrived so far. It consists of: a feature matrix whose elements describe the features of each instance; the set of all edges between instances; the set of labels of instances that we have already actively queried; and the set of unknown labels of all other instances.

Modeling Networked Data The Basic Model: Markov Random Field Given the graph, we condition on the true labels of the queried instances and write the energy as a sum of two kinds of terms: a unary energy defined for each instance and a pairwise energy associated with each edge.
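A pairwise MRF energy of this form can be sketched in a few lines. The unary table and the disagreement penalty below are generic placeholders; the slides do not show the paper's exact factor definitions.

```python
def mrf_energy(labels, node_energy, edges, edge_energy):
    """E(y) = sum_i node_energy[i][y_i] + sum_{(i,j) in edges} edge_energy(y_i, y_j)."""
    unary = sum(node_energy[i][y] for i, y in enumerate(labels))
    pairwise = sum(edge_energy(labels[i], labels[j]) for i, j in edges)
    return unary + pairwise


# Example: two instances connected by an edge that penalizes disagreement.
node_energy = [[0.0, 1.0], [1.0, 0.0]]           # instance 0 prefers label 0, instance 1 prefers 1
disagree = lambda a, b: 0.0 if a == b else 0.5   # smoothness term on the edge
```

Here `mrf_energy([0, 1], node_energy, [(0, 1)], disagree)` trades the unary preferences against the edge penalty.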

Modeling Networked Data Model Inference We assign labels to the unqueried instances so as to minimize the energy. This problem is usually intractable to solve directly, so we apply dual decomposition [17] to decompose it into a set of tractable local subproblems, coupled through dual variables and a global consistency constraint. The resulting dual objective can be solved with the projected subgradient method [13].
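For intuition, the inference problem can be solved exactly by brute force; its exponential cost in the number of instances is precisely why dual decomposition is needed at scale. This sketch (with hypothetical energy tables, not the paper's solver) is useful as a reference when checking approximate inference on toy graphs.

```python
import itertools


def exact_map(node_energy, edges, edge_energy):
    """Minimize the MRF energy by enumerating all binary configurations.
    Feasible only on toy graphs -- O(2^n) -- but exact."""
    n = len(node_energy)

    def energy(y):
        return (sum(node_energy[i][y[i]] for i in range(n))
                + sum(edge_energy(y[i], y[j]) for i, j in edges))

    return min(itertools.product((0, 1), repeat=n), key=energy)
```

On a three-node chain with strong unary preferences, the minimizer follows the unaries and pays the disagreement penalties on both edges.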

Modeling Networked Data Model Learning Applying the max-margin learning paradigm, the objective for parameter learning requires the energy of the true configuration to beat that of any other configuration by a margin, given by a dissimilarity measure between the two configurations, up to a slack variable.

Modeling Networked Data Model Learning Applying dual decomposition again yields a dual optimization objective over dual variables, which we also solve with the projected subgradient method.

Streaming Active Query Structural Variability Intuition: control the gap between the energy of the inferred configuration and that of any other possible configuration. We define the structural variability accordingly, as a function of the energy of the inferred configuration and the energy of any other configuration.

Streaming Active Query Properties of Structural Variability 1. Monotonicity. Given two sets of queried instance labels, one containing the other, the larger set yields structural variability no greater than the smaller one: the structural variability will not increase as we label more instances in the MRF. 2. Normality. If we label all instances in the graph, the structural variability is zero: we incur no structural variability at all.

Streaming Active Query Properties of Structural Variability 3. Centrality Under certain circumstances, minimizing structural variability leads to querying instances with high network centrality.

Streaming Active Query Decrease Function We define a decrease function for each instance: the structural variability before querying y_i minus the structural variability after querying y_i. The second term is in general intractable, so we estimate it by its expectation, approximating the true label probability with the probability predicted by the current model.

Streaming Active Query Decrease Function The first term of the decrease function, the structural variability before querying y_i, can be computed by dual decomposition on the corresponding dual problem.

Streaming Active Query The Algorithm Given a constant threshold, we query y_i if and only if the decrease function exceeds the threshold.
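The query rule can be sketched as follows. The variability values and the predicted distribution are hypothetical inputs here; in the actual method they come from the dual-decomposition machinery described on the earlier slides.

```python
def decrease_function(var_before, var_after, p_model):
    """Estimated decrease in structural variability for instance i:
    var_before minus the expectation of var_after[y] under the model's
    predicted distribution p_model (standing in for the intractable
    true label probability)."""
    expected_after = sum(p_model[y] * var_after[y] for y in p_model)
    return var_before - expected_after


def should_query(var_before, var_after, p_model, threshold):
    """Query y_i iff the estimated decrease exceeds the constant threshold."""
    return decrease_function(var_before, var_after, p_model) > threshold
```

Raising the threshold trades label cost for accuracy: fewer queries are issued, and only the instances whose labels would shrink the structural variability most are selected.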

Enhancement by Network Sampling Basic Idea Maintain an instance reservoir of a fixed size, and update the reservoir sequentially on the arrival of streaming data. Which instances should be discarded when the size of the reservoir is exceeded? Simply discarding early-arrived instances may deteriorate the network correlation. Instead, we measure the loss of discarding an instance along two dimensions: 1. Spatial dimension: the loss in a snapshot graph, based on the deterioration of network correlation 2. Temporal dimension: integrating the spatial loss over time

Enhancement by Network Sampling Spatial Dimension Use the dual variables as indicators of network correlation: the violation for an instance measures how much the optimization constraint is violated after removing the instance, and the spatial loss is defined from this violation. Intuition: 1. Dual variables can be viewed as messages sent from the edge factors to each instance 2. The more seriously the optimization constraint is violated, the more we need to adjust the dual variables

Enhancement by Network Sampling Temporal Dimension Since the streaming network evolves dynamically, we should not consider only the current spatial loss. We assume that, for a given instance, the dual variables of its neighbors are independent draws from a distribution with a common expectation; this gives an unbiased estimator of the spatial loss. Integrating the spatial loss over time, and supposing edges are added according to preferential attachment [2], we obtain the final loss function.

Enhancement by Network Sampling The Algorithm At each time step, we receive a new instance from the data stream and update the graph. If the number of instances exceeds the reservoir size, we remove the instance with the smallest loss function, together with its associated edges, from the MRF model. Interpretation The first term  Enables us to leverage the spatial loss function in the network.  Instances that are important to the current model are also likely to remain important in the successive time stamps. The second term  Instances with a larger value of this term are retained.  Our sampling procedure implicitly handles concept drift, because later-arrived instances are more relevant to the current concept [28].
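The reservoir update can be sketched as follows. The loss values here are given numbers; in the actual method they come from the spatial/temporal loss function above, and edge bookkeeping in the MRF is omitted.

```python
def update_reservoir(reservoir, capacity):
    """Loss-aware reservoir update: keep at most `capacity` instances,
    evicting the one with the smallest loss -- the instance whose removal
    least deteriorates the network correlation. `reservoir` is a list of
    (instance_id, loss) pairs including the newly arrived instance; note
    that the new instance itself may be evicted if its loss is smallest."""
    if len(reservoir) > capacity:
        k = min(range(len(reservoir)), key=lambda i: reservoir[i][1])
        reservoir.pop(k)
    return reservoir
```

For example, with capacity 2, adding a third instance evicts whichever of the three has the smallest loss, regardless of arrival order.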

The Framework Step 1: MRF-based inference Step 2: Streaming active query Step 3: MRF-based parameter update Step 4: Network sampling

Experiments Datasets  Weibo [26] is the most popular microblogging service in China.  View the retweeting flow as a data stream.  Predict whether a user will retweet a microblog.  3 types of edge factors: friends; sharing the same user; sharing the same tweet  Slashdot is an online social network for sharing technology related news.  Treat each follow relationship as an instance.  Predict “friends” or “foes”.  3 types of edge factors: appearing in the same post; sharing the same follower; sharing the same followee.  IMDB is an online database of information related to movies and TVs.  Each movie is treated as an instance.  Classify movies into categories such as romance and animation.  Edges indicate common-star relationships.  ArnetMiner [19] is an academic social network.  Each publication is treated as an instance.  Classify publications into categories such as machine learning and data mining.  Edges indicate co-author relationships.

Experiments Datasets

Experiments Active Query Performance Disable the network sampling method by setting the reservoir size to infinity. Compare different streaming active query algorithms. (F1 score vs. labeling rate)

Experiments Concept Drift First row: data stream Second row: shuffled data (F1 score vs. data chunk index) 1. We clearly find evidence of the existence of concept drift 2. Our algorithm is robust because it not only better adapts to concept drift (upper row) but also performs well without concept drift (lower row).

Experiments Streaming Network Sampling F1 vs. labeling rate (with varied reservoir size) Speedup Performance (Running time vs. reservoir size) Decreasing the reservoir size leads to a minor decrease in performance but significantly less running time.

Experiments Streaming Network Sampling We fix the labeling rate, and compare different streaming network sampling algorithms with varied reservoir sizes.

Experiments Performance of Hybrid Approach We fix the labeling rate and reservoir size, and compare different combinations of active query algorithms and network sampling algorithms.

Conclusions  Formulate a novel problem of active learning for streaming networked data  Propose a streaming active query algorithm based on structural variability  Design a network sampling algorithm to handle large volumes of streaming data  Empirically evaluate the effectiveness and efficiency of our algorithms

Thanks Zhilin Yang, Jie Tang, Yutao Zhang Computer Science Department, Tsinghua University