Sparsification and Sampling of Networks for Collective Classification Tanwistha Saha, Huzefa Rangwala and Carlotta Domeniconi Department of Computer Science George Mason University Fairfax, VA, USA
Outline Introduction Motivation Related Work Proposed Methods Results Conclusion and Future Work
Sparsification and Sampling of Networks for Collective Classification Given: Partially labeled weighted network Node attributes for all the nodes Goal: Predict the labels of the unlabeled nodes in the network Points to consider: Networks with fewer edges can be formed using sparsification algorithms The selection of labeled nodes for training influences the overall accuracy – hence the research on sampling algorithms for collective classification
Sample Input Network (partially labeled)
Relational Network Sparsification Study of networks involves Relational Learning A relational network consists of nodes representing entities and edges representing pairwise interactions Edges can be weighted / unweighted Weights represent similarity between pairs of nodes Edges with low weights don’t carry much information – we can remove them based on some criteria! Sparsify the network without losing much information
Example: Network with noisy edges
Example: Noise edges removed!
Importance of Sparsification in Network Problems: Data analysis on large networks is time consuming Noisy edges do not convey useful information in relational data Solutions: Identify and remove the noisy edges Make sure to remove noisy edges only, and not the others! Classify the unlabeled nodes in the sparsified network using Collective Classification and compare results with the unsparsified network
Graph sparsification methods for clustering (GS) Global Graph Sparsification (Satuluri et al. SIGMOD 2011) (LS) Local Graph Sparsification (Satuluri et al. SIGMOD 2011) Drawbacks: Methods designed for fast clustering, not suitable for classification All edges treated equally Sparsified network becomes more disconnected
Global Graph Sparsification (Satuluri et al. SIGMOD 2011) [Figure annotations: disconnected component; singleton nodes]
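The side effects annotated on this slide can be reproduced with a minimal sketch of a global sparsifier. It assumes that edges arrive as (u, v, similarity) tuples with the similarity score already computed, and that "global" means keeping the top fraction of edges ranked over the whole network. The function names and the toy edge list are hypothetical, not taken from Satuluri et al.

```python
def global_sparsify(edges, ratio):
    """Keep the top `ratio` fraction of edges, ranked by weight over the whole network.

    edges: list of (u, v, weight) tuples; ratio: fraction of edges to keep (0..1).
    """
    keep = round(len(edges) * ratio)
    ranked = sorted(edges, key=lambda e: e[2], reverse=True)
    return ranked[:keep]


def singleton_nodes(nodes, edges):
    """Return the nodes that no longer touch any edge."""
    touched = {u for u, _, _ in edges} | {v for _, v, _ in edges}
    return sorted(set(nodes) - touched)


# Toy network (hypothetical weights): node "d" hangs on by a single weak edge.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.1), ("a", "c", 0.7)]

sparse = global_sparsify(edges, ratio=0.75)
print(sparse)                          # the three strongest edges
print(singleton_nodes(nodes, sparse))  # ['d']
```

Because the ranking ignores connectivity, the weak edge c-d is dropped even though it is the only edge that attaches d to the rest of the network.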
Local Graph Sparsification (Satuluri et al. SIGMOD 2011) Removal of this edge disconnects the graph In addition to edges marked red, some more edges marked blue were removed! The edges removed with this method might not be a superset of the edges removed by global sparsification method.
Adaptive Global Sparsifier (Saha et al. SBP 2013) Aims to address the drawbacks of LS and GS Doesn’t remove an edge if the removal is going to make the graph more disconnected Note: This method is less aggressive in removing edges compared to local and global sparsification algorithms by Satuluri et al.
Adaptive Global Sparsifier Keep the edges with top similarity scores (here, score >= 0.3)
Adaptive Global Sparsifier (contd.) Removing red edges doesn’t increase the number of connected components Mauve-colored edges have low similarity scores but we put them back to avoid disconnected components
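One possible reading of the two steps above (keep high-scoring edges, then put back low-scoring edges that hold components together) can be sketched with a union-find structure. This is an illustrative sketch, not the implementation from Saha et al.: the function names, the order in which low-scoring edges are reconsidered (strongest first), and the toy weights are all assumptions.

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def adaptive_global_sparsify(nodes, edges, threshold):
    """Keep edges with weight >= threshold, then restore any low-weight
    edge whose absence would leave its endpoints in different components."""
    parent = {n: n for n in nodes}
    kept, low = [], []
    for e in sorted(edges, key=lambda e: e[2], reverse=True):
        (kept if e[2] >= threshold else low).append(e)
    for u, v, _ in kept:
        parent[find(parent, u)] = find(parent, v)
    for u, v, w in low:  # assumption: reconsider the strongest low edges first
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:     # this edge is needed to keep the graph connected
            parent[ru] = rv
            kept.append((u, v, w))
    return kept


def count_components(nodes, edges):
    """Number of connected components, isolated nodes included."""
    parent = {n: n for n in nodes}
    for u, v, _ in edges:
        parent[find(parent, u)] = find(parent, v)
    return len({find(parent, n) for n in nodes})


nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 0.9), ("b", "c", 0.5), ("a", "c", 0.2), ("c", "d", 0.1)]
sparse = adaptive_global_sparsify(nodes, edges, threshold=0.3)
print(sparse)  # a-c (0.2) is dropped; c-d (0.1) is put back
print(count_components(nodes, edges), count_components(nodes, sparse))  # 1 1
```

In this toy run the edge a-c is redundant and stays removed, while c-d is restored despite its low score, so the sparsified network has the same number of components as the original.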
Collective Classification in Networks Input: A graph G = (V,E) with a given percentage of labeled nodes for training, node features for all the nodes Output: Predicted labels of the test nodes Model: Relational features and node features are used for training a local classifier using labeled nodes Test node labels are initialized with labels predicted by the local classifier using node attributes Inference through iterative classification of test nodes until a convergence criterion is reached [Figure: network of researchers with areas SW, DM, AI, Bio, ML and one unlabeled node (?)]
Datasets & Experiments
Cora citation network: a directed graph of 2708 research papers, each belonging to one of 7 research areas (classes) in Computer Science (data downloaded from http://www.cs.umd.edu/projects/linqs/projects/lbc/index.html)
DBLP co-authorship network among 5602 researchers in 6 different areas of computer science (raw data downloaded from http://arnetminer.org and processed)
Number of edges retained by each sparsification algorithm at sparsification ratio s = 70%:

Dataset   Total edges in network   Adaptive Global Sparsifier   Global Sparsifier   Local Sparsifier
Cora      5429                     3850                         3800                2429
DBLP      17265                    12251                        12086               6859
Experiments (contd.) Weighted Vote Relational Neighbor (wvRN) is used as the base collective classification algorithm (Macskassy et al. JMLR 2007) Baseline methods: Global Sparsification Algorithm (GS) and Local Sparsification Algorithm (LS) (Satuluri et al. SIGMOD 2011) Performance metric: Accuracy of Classification
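The base classifier can be illustrated with a minimal sketch of weighted-vote relational neighbor inference: each unlabeled node's class distribution is the weight-averaged distribution of its neighbors, recomputed over several rounds. The sketch makes several assumptions that the slides do not state: unlabeled nodes start from a uniform distribution rather than from a local classifier, a fixed number of rounds stands in for a convergence test, and the function name and toy graph are hypothetical.

```python
def wvrn(adj, labels, classes, iters=50):
    """Weighted-vote relational neighbor with synchronous iterative updates.

    adj: {node: {neighbor: weight}}; labels: {node: class} for training nodes.
    Returns the predicted class for every unlabeled node.
    """
    probs = {}
    for n in adj:
        if n in labels:  # training nodes keep a fixed one-hot distribution
            probs[n] = {c: 1.0 if c == labels[n] else 0.0 for c in classes}
        else:            # assumption: unlabeled nodes start uniform
            probs[n] = {c: 1.0 / len(classes) for c in classes}
    for _ in range(iters):
        new = dict(probs)
        for n in adj:
            total = sum(adj[n].values())
            if n in labels or total == 0:
                continue
            new[n] = {
                c: sum(w * probs[m][c] for m, w in adj[n].items()) / total
                for c in classes
            }
        probs = new
    return {n: max(classes, key=lambda c: probs[n][c]) for n in adj if n not in labels}


# Toy co-authorship graph: x is strongly tied to an ML author, y only to x.
adj = {
    "a": {"x": 0.9},
    "b": {"x": 0.1},
    "x": {"a": 0.9, "b": 0.1, "y": 1.0},
    "y": {"x": 1.0},
}
print(wvrn(adj, labels={"a": "ML", "b": "DM"}, classes=["DM", "ML"]))
# {'x': 'ML', 'y': 'ML'}
```

Node y has no labeled neighbor, so its prediction depends entirely on the estimate for x. This is the situation the sampling part of the talk tries to avoid.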
Results Cora DBLP
Sampling for Collective Classification A good sample of a dataset should inherit the characteristics of the full data Existing methods: forest fire sampling, node sampling, edge sampling with induction (Ahmed et al. ICWSM 2012) We argue: the “goodness” of a sample depends on the problem we want to solve Rationale: The choice of training samples should make sure that each test node is connected to at least one training node Why? To facilitate collective classification by ensuring test nodes can have useful relational features computed from training nodes!
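The rationale above can be turned into a simple check on any train/test split. The sketch below is only a diagnostic under assumed names: `uncovered_test_nodes` and the adjacency-list format are hypothetical, and the function reports violations rather than repairing them.

```python
def uncovered_test_nodes(adj, train_nodes):
    """Return test nodes that have no training node among their neighbors.

    adj: {node: iterable of neighbors}; every node not in train_nodes is a test node.
    """
    train = set(train_nodes)
    return sorted(
        node for node, nbrs in adj.items()
        if node not in train and not train.intersection(nbrs)
    )


# Path graph a - b - c - d
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(uncovered_test_nodes(adj, ["a"]))       # ['c', 'd']
print(uncovered_test_nodes(adj, ["b", "c"]))  # []
```

With only "a" in the training set, nodes c and d have no labeled neighbor to compute relational features from; choosing "b" and "c" instead covers every test node.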
Adaptive Forest Fire Sampling Modified version of Forest Fire Sampling (Leskovec et al. KDD 2005) Selects a random node as the “seed node” to start and marks it as “visited” “Adaptive” because it randomly selects only a certain percentage of the edges incident on a visited node, propagates along them through the network and marks the nodes at the other end of those edges as “visited” Maintains a queue of unvisited nodes as propagation occurs in the network Ensures that each test node is connected to at least one training node
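The propagation step described above can be sketched as follows. This is a rough illustration under several assumptions: `burn_fraction` is a hypothetical name for the percentage of incident edges followed, the queue here holds visited nodes whose edges have not yet been expanded, and the fire restarts from a new random seed if it dies out. The slides do not say how the coverage guarantee for test nodes is enforced, so that step is left out.

```python
import random
from collections import deque


def adaptive_forest_fire(adj, n_train, burn_fraction=0.5, seed=0):
    """Pick n_train training nodes by spreading along a random share of edges.

    adj: {node: list of neighbors}. Returns visited nodes in visiting order.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    n_train = min(n_train, len(nodes))
    visited, seen, queue = [], set(), deque()
    while len(visited) < n_train:
        if not queue:
            # start, or restart after the fire dies out, from a random unvisited node
            start = rng.choice([n for n in nodes if n not in seen])
            seen.add(start)
            visited.append(start)
            queue.append(start)
            continue
        node = queue.popleft()
        nbrs = [m for m in adj[node] if m not in seen]
        if not nbrs:
            continue
        # follow only a share of the edges to unvisited neighbors, at least one
        k = min(len(nbrs), max(1, round(len(nbrs) * burn_fraction)))
        for m in rng.sample(nbrs, k):
            if len(visited) >= n_train:
                break
            seen.add(m)
            visited.append(m)
            queue.append(m)
    return visited


# Ring of 10 nodes; sample 4 of them as training nodes.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
print(adaptive_forest_fire(ring, n_train=4, seed=1))
```

The remaining nodes form the test set; a separate check is still needed to confirm that every test node has at least one training neighbor.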
Adaptive Forest Fire Sampling of a network with 19 nodes [Figure: sampled network with the test nodes marked]
Experiments Baseline classifiers used for comparing Random Sampling with Adaptive Forest Fire Sampling: wvRN (Macskassy et al. JMLR 2007) Multi-class SVM (Crammer and Singer JMLR 2001, Tsochantaridis et al. ICML 2004) RankNN for single labeled data (Saha et al. ICMLA 2012)
Results (Cora citation network) Random Sampling Adaptive Forest Fire Sampling
Conclusions Introduced a sparsification method for collective classification of network datasets that removes edges without losing much information while achieving comparable accuracies Introduced a network sampling algorithm for facilitating collective classification These algorithms work on single-labeled networks; in the future we would extend these approaches to treat multi-labeled networks as well These algorithms are designed for static networks; an interesting direction would be to formulate sampling methods for networks that change over time
Thank You!