Active Learning for Networked Data Based on Non-progressive Diffusion Model Zhilin Yang, Jie Tang, Bin Xu, Chunxiao Xing Dept. of Computer Science and Technology Tsinghua University, China
An Example
Instances Correlation
An Example Instances Correlation ? ? ? ? ? ? Classify each instance into {+1, -1}
An Example Instances Correlation +1 ? +1 ? ?
An Example Instances Correlation +1 ? +1 ? ? Query for label
An Example Instances Correlation +1 ? +1 ?
Problem: Active Learning for Networked Data Instances Correlation +1 ? +1 ? ? Challenge: It is expensive to query for labels! Questions: Which instances should we select to query? How many instances do we need to query for an accurate classifier?
Challenges Active Learning for Networked Data How to leverage network correlation among instances? How to query in a batch mode?
Batch Mode Active Learning for Networked Data. Given a graph consisting of: unlabeled instances, labeled instances, labels of labeled instances, edges, and a features matrix. Our objective is to select a subset of unlabeled instances that maximizes the utility function, subject to the labeling budget.
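The budgeted objective on this slide can be illustrated with a small sketch. The slide does not say how the subset is searched for, so the greedy loop below is only one generic heuristic, swapped in for illustration and not the method of this talk. The names `greedy_batch_select`, `coverage`, and the toy `neighbors` map are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Set


def greedy_batch_select(
    unlabeled: Iterable[str],
    utility: Callable[[FrozenSet[str]], float],
    budget: int,
) -> Set[str]:
    """Greedily pick at most `budget` unlabeled instances to query.

    Assumption: a plain greedy heuristic, chosen only to illustrate
    "maximize a utility function subject to a labeling budget".
    """
    selected: Set[str] = set()
    candidates = set(unlabeled)
    for _ in range(budget):
        if not candidates:
            break
        base = utility(frozenset(selected))
        best, best_gain = None, float("-inf")
        # Sorted iteration keeps tie-breaking deterministic.
        for v in sorted(candidates):
            gain = utility(frozenset(selected | {v})) - base
            if gain > best_gain:
                best, best_gain = v, gain
        selected.add(best)
        candidates.remove(best)
    return selected


# Toy utility (hypothetical): number of instances "covered" by the queried set.
neighbors = {
    "a": {"a", "b", "c", "f"},
    "b": {"b"},
    "c": {"c", "d", "e"},
    "d": {"d"},
    "e": {"e"},
    "f": {"f"},
}


def coverage(subset: FrozenSet[str]) -> float:
    covered: Set[str] = set()
    for v in subset:
        covered |= neighbors[v]
    return float(len(covered))


chosen = greedy_batch_select(neighbors.keys(), coverage, budget=2)
```

With a budget of 2, the toy run first picks `a` (largest coverage) and then `c` (largest marginal gain).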
Factor Graph Model ? ? ? ? ? ? Variable Node Factor Node
Factor Graph Model The joint probability Local factor function Edge factor function Log likelihood of labeled instances
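To make the roles of the local and edge factor functions concrete, here is a minimal sketch that enumerates the joint probability of a tiny factor graph by brute force. The formulas on this slide were not preserved, so the factor forms are assumptions: a log-linear local factor exp(lam * x_i * y_i) with a single scalar feature per instance, and an edge factor exp(beta) when two linked labels agree. The function name and parameters are hypothetical.

```python
import itertools
import math


def joint_probabilities(features, edges, lam, beta):
    """Return {labeling: probability} for a tiny pairwise factor graph.

    Assumed factor forms (not taken from the slides):
      local factor f(x_i, y_i) = exp(lam * x_i * y_i)
      edge factor  g(y_i, y_j) = exp(beta) if y_i == y_j else 1
    """
    n = len(features)
    scores = {}
    for labels in itertools.product((-1, +1), repeat=n):
        # Sum of log local factors over all variable nodes.
        log_score = sum(lam * x * y for x, y in zip(features, labels))
        # Sum of log edge factors over all edges whose endpoints agree.
        log_score += sum(beta for i, j in edges if labels[i] == labels[j])
        scores[labels] = math.exp(log_score)
    z = sum(scores.values())  # partition function
    return {labels: s / z for labels, s in scores.items()}


probs = joint_probabilities(
    features=[1.0, 0.2, -0.5], edges=[(0, 1), (1, 2)], lam=1.0, beta=0.8
)
most_likely = max(probs, key=probs.get)
```

Enumeration is exponential in the number of instances, which is why approximate inference is needed on real networks.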
Factor Graph Model Learning Gradient descent Calculate the expectation: Loopy Belief Propagation (LBP) Message from variable to factor Message from factor to variable
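The two message types named on this slide can be sketched with a small sum-product implementation for binary labels. This is an illustrative version under stated assumptions: one symmetric edge factor per edge, a fixed number of synchronous iterations, and hypothetical names (`loopy_bp`, `unary`, `agree`). It does not include the gradient step.

```python
import math

LABELS = (-1, +1)


def loopy_bp(unary, edges, pairwise, iterations=20):
    """Sum-product LBP; returns approximate marginals as {label: prob} dicts.

    unary[i][y]      : local factor value of variable i for label y
    edges            : list of (i, j) pairs, one edge factor per pair
    pairwise(yi, yj) : edge factor value, assumed symmetric
    """
    # Messages are keyed by (edge index, variable index).
    var_to_fac = {(e, v): {y: 1.0 for y in LABELS}
                  for e, pair in enumerate(edges) for v in pair}
    fac_to_var = {key: {y: 1.0 for y in LABELS} for key in var_to_fac}
    touching = {}  # variable -> indices of the edge factors it touches
    for e, (i, j) in enumerate(edges):
        touching.setdefault(i, []).append(e)
        touching.setdefault(j, []).append(e)

    def normalize(msg):
        total = sum(msg.values())
        return {y: val / total for y, val in msg.items()}

    for _ in range(iterations):
        # Message from variable to factor: local factor times the
        # messages arriving from all *other* factors.
        for (e, v) in var_to_fac:
            msg = {}
            for y in LABELS:
                val = unary[v][y]
                for other in touching[v]:
                    if other != e:
                        val *= fac_to_var[(other, v)][y]
                msg[y] = val
            var_to_fac[(e, v)] = normalize(msg)
        # Message from factor to variable: sum out the other endpoint.
        for e, (i, j) in enumerate(edges):
            for target, source in ((i, j), (j, i)):
                msg = {yt: sum(pairwise(yt, ys) * var_to_fac[(e, source)][ys]
                               for ys in LABELS)
                       for yt in LABELS}
                fac_to_var[(e, target)] = normalize(msg)

    marginals = []
    for v in range(len(unary)):
        belief = {y: unary[v][y] for y in LABELS}
        for e in touching.get(v, []):
            for y in LABELS:
                belief[y] *= fac_to_var[(e, v)][y]
        marginals.append(normalize(belief))
    return marginals


def agree(yi, yj):
    # Hypothetical edge factor that favors equal labels.
    return math.e if yi == yj else 1.0


# Chain 0 - 1 - 2; only variable 0 has an informative local factor.
unary = [{-1: 0.2, +1: 0.8}, {-1: 0.5, +1: 0.5}, {-1: 0.5, +1: 0.5}]
marginals = loopy_bp(unary, [(0, 1), (1, 2)], agree)
```

On a chain (a tree) the result matches exact inference; on graphs with cycles the marginals are only approximations.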
Question: How to select instances from Factor graph for active learning?
Basic principle: Maximize the Ripple Effects ? ? ? ? ? ?
Maximize the Ripple Effects ? ? ? +1 ? ? Labeling information is propagated
Maximize the Ripple Effects ? ? ? +1 ? ? Labeling information is propagated Statistical bias is propagated How to model the propagation process in an unlabeled network?
Diffusion Model Linear Threshold Model Progressive Diffusion Model Non-Progressive Diffusion Model Linear Threshold
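The difference between progressive and non-progressive diffusion can be shown with a short simulation: in the non-progressive case an active node may become inactive again. The update rule below is an assumption, since the slide gives no formula: fixed per-node thresholds on the fraction of active neighbors, synchronous rounds, and seed nodes that stay active in every round. Other formulations, for example with thresholds redrawn at each step, exist. All names are hypothetical.

```python
def simulate_non_progressive(neighbors, thresholds, seeds,
                             initially_active=(), steps=50):
    """Synchronous non-progressive threshold diffusion (illustrative sketch).

    neighbors[v]     : set of v's neighbors
    thresholds[v]    : fraction of neighbors that must be active for v
                       to be active in the next round
    seeds            : nodes kept active in every round (assumption)
    initially_active : extra nodes active only at time 0
    Returns the active set at a fixed point, or after `steps` rounds.
    """
    seeds = set(seeds)
    active = seeds | set(initially_active)
    for _ in range(steps):
        nxt = set(seeds)
        for v, nbrs in neighbors.items():
            if v in nxt or not nbrs:
                continue
            # State is recomputed from scratch each round, so a node that
            # was active can drop back to inactive (non-progressive).
            if len(nbrs & active) / len(nbrs) >= thresholds[v]:
                nxt.add(v)
        if nxt == active:
            break
        active = nxt
    return active


# Path a - b - c - d with a seed at one end: activation spreads along the path.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
half = {v: 0.5 for v in path}
spread = simulate_non_progressive(path, half, seeds={"a"})

# Star with one leaf active at time 0 and no seeds: the activation dies out.
star = {"c": {"l1", "l2", "l3"}, "l1": {"c"}, "l2": {"c"}, "l3": {"c"}}
faded = simulate_non_progressive(star, {v: 0.5 for v in star},
                                 seeds=(), initially_active={"l1"})
```

In the star example the single active leaf cannot sustain itself, which a progressive model would not allow.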
Maximize the Ripple Effects ? ? ? +1 ? ? Labeling information is propagated Statistical bias is propagated Will it be dominated by labeling information (active) or statistical bias (inactive)? Based on non-progressive diffusion model Maximize the number of activated instances in the end We aim to activate the most uncertain instances!
Instantiate the Problem: Active Learning Based on Non-Progressive Diffusion Model. Objective: the number of activated instances. With constraints: initially activate all queried instances; we activate the most uncertain instances; based on the non-progressive diffusion.
Reduce the Problem The original problem The reduced problem Constraints are inherited. Reduction procedure
Algorithm The reduced problem The key idea
Algorithm
Theoretical Analysis Convergence Lemma 1 The algorithm will converge within time. Correctness Approximation Ratio
Experiments. Datasets (#variable nodes / #factor nodes): Coauthor (6,096 / 24,468), Slashdot (370 / 1,686), Mobile, Enron. Comparison methods: Batch Mode Active Learning (BMAL), proposed by Shi et al.; Influence Maximization Selection (IMS), proposed by Zhuang et al.; Maximum Uncertainty (MU); Random (RAN); Max Coverage (MaxCo), our method.
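As a point of reference for the Maximum Uncertainty (MU) baseline, a minimal sketch follows. The slide names the baseline only, so the scoring rule is an assumption: rank unlabeled instances by the entropy of their predicted positive-class marginal and query the top ones. The function names and the toy marginals are hypothetical.

```python
import math


def binary_entropy(p_pos):
    """Entropy (in nats) of a binary label with P(y = +1) = p_pos."""
    if p_pos <= 0.0 or p_pos >= 1.0:
        return 0.0
    return -(p_pos * math.log(p_pos) + (1.0 - p_pos) * math.log(1.0 - p_pos))


def max_uncertainty_select(marginals, budget):
    """Return the `budget` instances whose predicted label is most uncertain.

    marginals: {instance: P(y = +1)}, e.g. produced by an inference routine.
    """
    ranked = sorted(marginals, key=lambda v: binary_entropy(marginals[v]),
                    reverse=True)
    return ranked[:budget]


# Toy marginals (made up for illustration only).
toy_marginals = {"u1": 0.95, "u2": 0.52, "u3": 0.30, "u4": 0.61}
picked = max_uncertainty_select(toy_marginals, budget=2)
```

This baseline scores each instance independently and ignores how a queried label would propagate through the network.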
Experiments Performance
Related Work. Active Learning for Networked Data: "Actively learning to infer social ties", H. Zhuang, J. Tang, W. Tang, T. Lou, A. Chin and X. Wang; "Batch mode active learning for networked data", L. Shi, Y. Zhao and J. Tang; "Towards active learning on graphs: an error bound minimization approach", Q. Gu and J. Han; "Integration of active learning in a collaborative CRF", O. Martinez and G. Tsechpenakis. Diffusion Model: "On the non-progressive spread of influence through social networks", M. Fazli, M. Ghodsi, J. Habibi, P. J. Khalilabadi, V. Mirrokni and S. S. Sadeghabad; "Maximizing the spread of influence through a social network", D. Kempe, J. Kleinberg and E. Tardos.
Conclusion. Connect active learning for networked data to the non-progressive diffusion model, and precisely formulate the problem. Propose an algorithm to solve the problem. Theoretically guarantee the convergence, correctness and approximation ratio of the algorithm. Empirically evaluate the performance of the algorithm on four datasets of different genres.
Future work Consider active learning for networked data in a streaming setting, where data distribution and network structure are changing over time
About Me Zhilin Yang 3rd-year undergraduate at Tsinghua Univ. Applying for PhD programs this year Data Mining & Machine Learning
Thanks!