Published by Pamela Patterson. Modified over 8 years ago.
1
Unsupervised Streaming Feature Selection in Social Media
Jundong Li (1), Xia Hu (2), Jiliang Tang (3) and Huan Liu (1)
(1) Arizona State University, (2) Texas A&M University, (3) Yahoo! Labs
2
Outline
Background and Motivation
Problem Statement
Proposed USFS Framework
Experimental Results
Conclusions and Future Work
3
Social Media
The rapid growth of social media provides a platform for people to perform online social activities.
Massive amounts of high-dimensional data are user-generated and quickly disseminated.
Reducing the dimensionality of social media data is desirable because of the curse of dimensionality.
4
Feature Selection
Feature selection is effective in preparing high-dimensional data: it selects a subset of relevant features for a compact and accurate representation.
5
Feature Selection in Social Media
Traditional feature selection assumes that all features are static and known in advance.
Features in social media are usually generated dynamically, in a streaming fashion.
Twitter produces more than 500 million tweets every day, and large numbers of slang words (features) are continuously generated by users.
In disaster relief, topics (features) such as "Chile Earthquake" quickly become hot.
6
Streaming Feature Selection
It is more appealing to perform streaming feature selection, so that relevant features are captured in a timely manner.
7
Challenges, Opportunities and Target
Challenges: label information is costly; data are not i.i.d.
Opportunities: link information is abundant and may be helpful.
Target: propose an unsupervised streaming feature selection algorithm for social media data. No unsupervised streaming feature selection algorithm exists so far.
8
Outline
Background and Motivation
Problem Statement
Proposed USFS Framework
Experimental Results
Conclusions and Future Work
9
Problem Statement
Given n linked instances, let the adjacency matrix M denote their link information.
Assume that features arrive dynamically, one at a time: at time step t, each instance is associated with a set of streaming features X(t) = {f1, f2, …, ft}.
We want to select a subset of relevant features at each time step, effectively and efficiently, by using the link information M and the content information X(t).
10
Illustration
[Figure: features arrive over time steps t, t+1, …, t+i. For each arriving feature the framework asks "Accept the new feature?" and then "Reject existing feature?", updating the Selected Feature Set.]
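The two questions in the illustration can be read as a simple loop over arriving features. The sketch below is only a skeleton under that reading: `stream_select`, `accept_new` and `prune_existing` are hypothetical names, and the toy rules in the usage lines stand in for the actual tests described on later slides.

```python
from typing import Callable, Hashable, Iterable, List


def stream_select(
    features: Iterable[Hashable],
    accept_new: Callable[[List[Hashable], Hashable], bool],
    prune_existing: Callable[[List[Hashable]], List[Hashable]],
) -> List[Hashable]:
    """Process features one at a time, keeping a running selected set."""
    selected: List[Hashable] = []
    for f in features:                            # feature f arrives at its time step
        if accept_new(selected, f):               # "Accept the new feature?"
            selected.append(f)
            selected = prune_existing(selected)   # "Reject existing feature?"
    return selected


# Toy usage with placeholder rules: accept even numbers, keep the two newest.
result = stream_select(
    range(1, 9),
    accept_new=lambda sel, f: f % 2 == 0,
    prune_existing=lambda sel: sel[-2:],
)
print(result)
```

Existing features are re-examined only after an acceptance, which mirrors the slide on testing existing features.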
11
Outline
Background and Motivation
Problem Statement
Proposed USFS Framework
Experimental Results
Conclusions and Future Work
12
Modeling Link Information
Social media users connect for a variety of reasons: they may be movie fans, sports enthusiasts, colleagues, etc.
Users who share hidden factors tend to be similar.
Hidden factors help steer unsupervised streaming feature selection.
We use the mixed membership stochastic blockmodel (MMSB) [Blei+NIPS2009] to extract hidden social factors from link information.
13
Modeling Link Information (cont'd)
At time step t, the hidden social factors serve as regression targets.
An L1-norm penalty can be used for feature selection.
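The formula on this slide did not survive in the transcript. One form consistent with the two statements above is an L1-regularized regression onto the hidden factors. The symbols here are assumptions introduced for illustration: X^{(t)} is the n-by-t feature matrix at time t, \Pi is the n-by-k matrix of hidden social factors, W^{(t)} is the t-by-k coefficient matrix with columns (w^{(t)})^i, and \alpha > 0 is the sparsity weight.

```latex
\min_{\mathbf{W}^{(t)}} \;
  \frac{1}{2}\,\bigl\| \mathbf{X}^{(t)} \mathbf{W}^{(t)} - \boldsymbol{\Pi} \bigr\|_F^2
  \;+\; \alpha \sum_{i=1}^{k} \bigl\| (\mathbf{w}^{(t)})^{i} \bigr\|_1
```

Under this reading, a feature whose row of W^{(t)} is entirely zero contributes nothing to predicting any hidden factor and can be dropped.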
14
Modeling Content Information
If two users are similar in the original feature space, they should also be similar in the selected feature space.
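A common way to encode this kind of requirement is a graph-Laplacian smoothness penalty; whether the slide uses exactly this form is an assumption, since its formula is not in the transcript. In the sketch, `A` is a hypothetical symmetric user-similarity matrix and `y` holds one projected value per user; the penalty `y^T L y` equals the sum over user pairs of `A[i][j] * (y[i] - y[j])**2`, so it is small when similar users stay close.

```python
from typing import List

Matrix = List[List[float]]


def laplacian(A: Matrix) -> Matrix:
    """Unnormalized graph Laplacian L = D - A for a symmetric affinity matrix A."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]


def smoothness(y: List[float], A: Matrix) -> float:
    """Quadratic form y^T L y; small when similar users receive close values."""
    L = laplacian(A)
    n = len(y)
    return sum(y[i] * L[i][j] * y[j] for i in range(n) for j in range(n))


# Users 0 and 1 are strongly similar; users 1 and 2 only weakly.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.2],
     [0.0, 0.2, 0.0]]
smooth = smoothness([1.0, 1.1, 5.0], A)  # similar users 0 and 1 stay close
rough = smoothness([1.0, 5.0, 1.1], A)   # similar users 0 and 1 pulled apart
print(smooth < rough)
```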
15
Optimization Formulation at Time t
Combine the network (link) information and the content information into a single objective.
Decompose the objective into a set of sub-problems.
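The objective itself is missing from the transcript. A plausible per-factor sub-problem, written so that the L1 penalty is the second of four terms (the ordering that the "Testing New Feature" slide refers to), is sketched below. All symbols are assumptions: X^{(t)} is the n-by-t feature matrix, \pi^i is the i-th hidden social factor, w is the corresponding coefficient vector, L^{(t)} is a graph Laplacian built from content similarity, and \alpha, \beta, \gamma are nonnegative weights.

```latex
\min_{\mathbf{w}} \; J(\mathbf{w}) =
  \underbrace{\tfrac{1}{2}\bigl\| \mathbf{X}^{(t)} \mathbf{w} - \boldsymbol{\pi}^{i} \bigr\|_2^2}_{\text{1st: fit to hidden factor}}
  + \underbrace{\alpha \,\| \mathbf{w} \|_1}_{\text{2nd: sparsity}}
  + \underbrace{\tfrac{\beta}{2}\,\| \mathbf{w} \|_2^2}_{\text{3rd: ridge}}
  + \underbrace{\tfrac{\gamma}{2}\,(\mathbf{X}^{(t)}\mathbf{w})^{\top} \mathbf{L}^{(t)} (\mathbf{X}^{(t)}\mathbf{w})}_{\text{4th: content smoothness}},
  \qquad i = 1, \dots, k
```

Because the Frobenius-norm fit and the penalties separate across the k columns of the coefficient matrix, each hidden factor can be handled as its own sub-problem.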
16
Testing New Feature
At time step t+1, when the new feature arrives:
The objective function is reduced if the reduction in the 1st, 3rd and 4th terms outweighs the increase in the 2nd term.
Therefore, the new feature is accepted only when this condition holds.
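The acceptance condition is not preserved in the transcript. For an objective whose 2nd term is an L1 penalty with weight alpha and whose other terms are smooth (squared fit error, ridge, Laplacian smoothness), the standard optimality argument says a new coefficient can leave zero only if the gradient of the smooth terms at zero exceeds alpha in magnitude. The sketch below assumes that form; `accept_new_feature` and its arguments are hypothetical names, not the authors' notation.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def accept_new_feature(x_new: Vector, residual: Vector,
                       laplacian_fit: Vector, alpha: float, gamma: float) -> bool:
    """Accept when the smooth terms' gradient at w_new = 0 exceeds alpha.

    residual      = X w - pi  (current fit error against one hidden factor)
    laplacian_fit = L X w     (Laplacian applied to the current projection)
    The ridge term contributes beta * w_new, which is zero at w_new = 0.
    """
    grad = dot(x_new, residual) + gamma * dot(x_new, laplacian_fit)
    return abs(grad) > alpha


# Toy numbers: the new column correlates with the remaining fit error.
x_new = [1.0, 0.0, 1.0]
residual = [-0.5, 0.2, -0.4]
no_smoothing = [0.0, 0.0, 0.0]
print(accept_new_feature(x_new, residual, no_smoothing, alpha=0.5, gamma=0.1))
print(accept_new_feature(x_new, residual, no_smoothing, alpha=1.0, gamma=0.1))
```

A larger alpha makes the test stricter, so fewer arriving features enter the model.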
17
Testing Existing Features
Test the existing features whenever a new feature is added.
When a new feature is accepted, we re-optimize the objective with respect to the current variables, which forces some feature coefficients to zero.
This is a convex optimization problem; we solve it with Broyden-Fletcher-Goldfarb-Shanno (BFGS).
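To show how re-optimizing under an L1 penalty drives coefficients to exactly zero, here is a small proximal-gradient (ISTA) sketch. This is a deliberately swapped-in technique, not the BFGS solver named on the slide, and it omits the content-smoothness term for brevity; `ista`, `soft_threshold` and the toy data are illustrative assumptions.

```python
from typing import List


def soft_threshold(z: float, thresh: float) -> float:
    """Proximal operator of thresh * |z|: shrink toward zero, clip at zero."""
    if z > thresh:
        return z - thresh
    if z < -thresh:
        return z + thresh
    return 0.0


def ista(X: List[List[float]], y: List[float], alpha: float, beta: float,
         step: float, iters: int = 500) -> List[float]:
    """Minimize 0.5*||Xw - y||^2 + alpha*||w||_1 + 0.5*beta*||w||^2."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        resid = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) + beta * w[j]
                for j in range(d)]
        w = [soft_threshold(w[j] - step * grad[j], step * alpha) for j in range(d)]
    return w


# Feature 0 explains y; feature 1 is a weak, noisy column.
X = [[1.0, 0.1], [2.0, -0.1], [3.0, 0.1]]
y = [1.0, 2.0, 3.0]
w = ista(X, y, alpha=0.5, beta=0.1, step=0.05)
print(w)
```

The step size must stay below the reciprocal of the largest eigenvalue of X^T X + beta*I for the iteration to converge; 0.05 satisfies that for this toy matrix. The weak feature ends with a coefficient of exactly zero, which is the situation in which an existing feature would be rejected.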
18
Feature Selection by USFS
If the new feature is accepted, we obtain a sparse coefficient matrix by solving all sub-problems.
For each feature j, if any of its k corresponding feature weights is nonzero, the feature is included in the final model.
Each included feature receives a feature score, and features are ranked in descending order of their scores.
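The score definition is missing from the transcript. The sketch below assumes one natural choice, the largest absolute weight among a feature's k coefficients; treat that as an assumption rather than the slide's formula. `feature_scores`, `rank_features` and the toy weights are hypothetical.

```python
from typing import Dict, List


def feature_scores(W: Dict[str, List[float]]) -> Dict[str, float]:
    """Score each feature that has a nonzero weight by its largest |weight|."""
    return {name: max(abs(w) for w in weights)
            for name, weights in W.items()
            if any(w != 0.0 for w in weights)}


def rank_features(W: Dict[str, List[float]]) -> List[str]:
    """Included features, sorted by score in descending order."""
    scores = feature_scores(W)
    return sorted(scores, key=scores.get, reverse=True)


# One row of k = 2 weights per feature; f1 is all-zero and is excluded.
W = {"f1": [0.0, 0.0], "f2": [0.3, -0.9], "f3": [0.0, 0.4]}
ranking = rank_features(W)
print(ranking)
```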
19
Outline
Background and Motivation
Problem Statement
Proposed USFS Framework
Experimental Results
Conclusions and Future Work
20
Questions to Investigate
Q1: How good are the features selected by the USFS framework?
Q2: How efficient is the proposed USFS framework?
21
Datasets
BlogCatalog (social blog directory)
Flickr (image sharing website)
Features are assumed to arrive in a random order; we evaluate on the first {20%, 30%, …, 90%, 100%} of all features.
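The arrival protocol can be simulated with a seeded shuffle and growing prefixes. This is a minimal sketch of that setup; `streaming_prefixes`, the seed and the 50-feature toy size are assumptions, and rounding down at each cutoff is one of several reasonable choices.

```python
import random
from typing import List, Sequence


def streaming_prefixes(feature_ids: Sequence[int], seed: int = 0) -> List[List[int]]:
    """Shuffle features into an arrival order, then cut it at 20%, 30%, ..., 100%."""
    order = list(feature_ids)
    random.Random(seed).shuffle(order)   # fixed seed keeps the order reproducible
    prefixes = []
    for pct in range(20, 101, 10):
        cutoff = len(order) * pct // 100
        prefixes.append(order[:cutoff])
    return prefixes


prefixes = streaming_prefixes(range(50))
print([len(p) for p in prefixes])
```

Each prefix contains the previous one, so results at 30% are obtained after seeing everything that was available at 20%.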
22
Experimental Settings
Evaluation: clustering with K-means; metrics are accuracy and NMI (normalized mutual information).
Baseline batch-mode methods:
Laplacian Score [He et al., NIPS 2005]
SPEC [Zhao and Liu, ICML 2007]
NDFS [Li et al., AAAI 2012]
LUFS [Tang and Liu, KDD 2012]
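For reference, NMI between ground-truth classes and cluster assignments can be computed from label counts alone. Several normalizations are in use; the sketch below assumes the geometric-mean form, I(U;V) / sqrt(H(U) H(V)), which may differ from the one used in the experiments. The function name `nmi` is illustrative.

```python
import math
from collections import Counter
from typing import Sequence


def nmi(labels_true: Sequence[int], labels_pred: Sequence[int]) -> float:
    """NMI = I(U;V) / sqrt(H(U) * H(V)), using natural logarithms."""
    n = len(labels_true)
    count_u = Counter(labels_true)
    count_v = Counter(labels_pred)
    count_uv = Counter(zip(labels_true, labels_pred))
    mutual = sum((c / n) * math.log(c * n / (count_u[u] * count_v[v]))
                 for (u, v), c in count_uv.items())
    h_u = -sum((c / n) * math.log(c / n) for c in count_u.values())
    h_v = -sum((c / n) * math.log(c / n) for c in count_v.values())
    if h_u == 0.0 or h_v == 0.0:
        return 0.0  # degenerate case: one side has a single cluster
    return mutual / math.sqrt(h_u * h_v)


same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # clusters match up to renaming
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])      # clusters carry no information
print(same_partition, independent)
```

NMI ignores how cluster ids are named, which is why it suits K-means output, where ids are arbitrary.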
23
Performance on Flickr
24
Performance on BlogCatalog
25
Cumulative Running Time
On BlogCatalog, USFS is 7x, 20x, 29x and 76x faster than the batch-mode baselines.
On Flickr, USFS is 5x, 11x, 20x and 75x faster than the batch-mode baselines.
26
Outline
Background and Motivation
Problem Statement
Proposed USFS Framework
Experimental Results
Conclusions and Future Work
27
Conclusion
Goals: perform unsupervised streaming feature selection for social media data.
Solutions: leverage link information as constraints; use a stagewise algorithm for streaming features.
Results: better feature selection performance in terms of clustering; reduced running time compared with batch-mode methods.
28
Future Work
In this work we assume that link information is relatively stable compared with dynamic content information; we will investigate streaming feature selection in dynamic networks.
Streaming features come from different sources; we will investigate how to fuse heterogeneous feature sources for streaming feature selection.
29
Questions
Acknowledgement: This material is, in part, supported by the National Science Foundation (NSF) under grant number IIS
Comments and suggestions from DMML members and reviewers are greatly appreciated.