Unsupervised Streaming Feature Selection in Social Media


1 Unsupervised Streaming Feature Selection in Social Media
Jundong Li (1), Xia Hu (2), Jiliang Tang (3) and Huan Liu (1)
(1) Arizona State University, (2) Texas A&M University, (3) Yahoo! Labs

2 Outline
Background and Motivation
Problem Statement
Proposed USFS Framework
Experimental Results
Conclusions and Future Work

3 Social Media
The rapid growth of social media provides a platform for people to perform online social activities.
Massive amounts of high-dimensional data are user generated and quickly disseminated.
It is desirable to reduce the dimensionality of social media data because of the curse of dimensionality.

4 Feature Selection
Feature selection is effective for preparing high-dimensional data: it selects a subset of relevant features for a compact and accurate representation.

5 Feature Selection in Social Media
Traditional feature selection assumes that all features are static and known in advance.
Features in social media are usually generated dynamically, in a streaming fashion.
Twitter produces more than 500 million tweets every day, and a large number of slang words (features) are continuously generated by users.
In disaster relief, topics (features) like "Chile Earthquake" quickly become hot.

6 Streaming Feature Selection
It is more appealing to perform streaming feature selection, so that relevant features are captured in a timely manner.

7 Challenges, Opportunities and Target
Challenges: Label information is costly. Data are not i.i.d.
Opportunities: Link information is abundant and may be helpful.
Target: Propose an unsupervised streaming feature selection algorithm for social media data. There are no existing unsupervised streaming feature selection algorithms.

9 Problem Statement
Given n linked instances, let the adjacency matrix M denote their link information. Assume that features arrive dynamically, one at a time: at time step t, each instance is associated with a set of streaming features X(t) = {f1, f2, …, ft}. We want to select a subset of relevant features at each time step, effectively and efficiently, by using the link information M and the content information X(t).
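The problem statement above can be read as a processing loop over arriving features. The following is a minimal sketch of that loop, not the authors' implementation: `stream_select`, `accept_new` and `prune` are hypothetical names, and the two callables stand in for whatever tests use the link information M and the content information X(t).

```python
from typing import Callable, Iterable, List, Sequence

# One feature is a column: one value for each of the n instances.
Feature = Sequence[float]


def stream_select(
    features: Iterable[Feature],
    accept_new: Callable[[List[Feature], Feature], bool],
    prune: Callable[[List[Feature]], List[Feature]],
) -> List[List[Feature]]:
    """Process features one at a time; record the selected set after each step."""
    selected: List[Feature] = []
    history: List[List[Feature]] = []
    for f in features:
        # Hypothetical test: should the newly arrived feature be accepted?
        if accept_new(selected, f):
            # Hypothetical re-test: which features (old and new) are kept?
            selected = prune(selected + [f])
        history.append(list(selected))
    return history
```

With a toy rule that accepts any feature that is not all zeros and a `prune` that keeps everything, the history simply grows by one feature whenever a non-zero column arrives.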

10 Illustration
[Figure: features arrive at time steps t, t+1, …, t+i. At each step two questions are asked: Accept the new feature? Reject existing features? The answers determine the selected feature set.]

12 Modeling Link Information
Social media users connect for a variety of reasons, e.g., as movie fans, sports enthusiasts or colleagues.
Users with similar hidden factors are similar.
Hidden factors are helpful for steering unsupervised streaming feature selection.
We use the mixed membership stochastic blockmodel (MMSB) [Blei+NIPS2009] to extract hidden social factors from link information.

13 Modeling Link Information (cont'd)
At time step t: the hidden social factors serve as regression targets. The L1-norm can be used for feature selection.

14 Modeling Content Information
If two users are similar in the original feature space, the two users are also similar in the selected feature space.
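A common way to encode this kind of similarity-preservation requirement is a graph-Laplacian smoothness penalty. The sketch below assumes that form; the slide itself does not show the formula, and `laplacian_smoothness`, `S` and `z` are hypothetical names. `S` is a symmetric user-user similarity matrix computed in the original feature space and `z` holds one predicted value per user in the selected feature space.

```python
from typing import Sequence


def laplacian_smoothness(S: Sequence[Sequence[float]], z: Sequence[float]) -> float:
    """Return z^T L z for L = D - S, via 0.5 * sum_ij S_ij * (z_i - z_j)^2.

    The identity holds for a symmetric S. The value is small when users that
    are similar under S also receive similar values in z.
    """
    n = len(z)
    return 0.5 * sum(
        S[i][j] * (z[i] - z[j]) ** 2 for i in range(n) for j in range(n)
    )
```

For two users with similarity 1, the penalty is 0 when they get identical values and grows with the squared gap between them.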

15 Optimization Formulation at Time t
The objective combines network information and content information. It decomposes into a set of sub-problems.
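The formulas on this slide did not survive in the transcript. As a rough sketch only, a four-term objective consistent with the slides' description (a regression loss on a hidden factor, an L1 penalty, and a content-based smoothness term) might look as follows for a single hidden factor. The symbols \pi, w, \alpha, \beta, \gamma and L, and the use of an L2 penalty as the third term, are assumptions and are not taken from the slides.

```latex
\min_{\mathbf{w}^{(t)}} \;
\frac{1}{2}\,\bigl\| X^{(t)}\mathbf{w}^{(t)} - \boldsymbol{\pi} \bigr\|_2^2
\;+\; \alpha\,\bigl\| \mathbf{w}^{(t)} \bigr\|_1
\;+\; \frac{\beta}{2}\,\bigl\| \mathbf{w}^{(t)} \bigr\|_2^2
\;+\; \frac{\gamma}{2}\,\bigl( X^{(t)}\mathbf{w}^{(t)} \bigr)^{\top} L^{(t)} \bigl( X^{(t)}\mathbf{w}^{(t)} \bigr)
```

Under these assumptions, X^{(t)} is the n-by-t content matrix at time step t, \pi is one column of the hidden social factor matrix, and L^{(t)} is a graph Laplacian built from content similarity. Solving one such problem per hidden factor would give the set of sub-problems.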

16 Testing New Feature
At time step t+1, when the new feature arrives:
The objective function is reduced if the reduction in the 1st, 3rd and 4th terms outweighs the increase in the 2nd term. This yields the condition for accepting the new feature.
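The exact acceptance condition is not preserved in the transcript. One way such a test is often realized is a grafting-style gradient check: the new feature is accepted when the gradient of the smooth terms with respect to its (initially zero) coefficient is larger in magnitude than the weight of the L1 penalty. The sketch below assumes that form; `accept_new_feature`, `residual`, `smooth_grad`, `alpha` and `gamma` are hypothetical names.

```python
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def accept_new_feature(
    x_new: Sequence[float],
    residual: Sequence[float],
    smooth_grad: Sequence[float],
    alpha: float,
    gamma: float = 0.0,
) -> bool:
    """Assumed grafting-style test, evaluated at new coefficient = 0.

    x_new:       values of the new feature, one per instance
    residual:    current prediction minus regression target, one per instance
    smooth_grad: Laplacian applied to the current prediction (assumption)
    alpha:       weight of the L1 penalty
    gamma:       weight of the smoothness term
    """
    # An L2 penalty contributes nothing to the gradient at coefficient 0.
    grad = dot(x_new, residual) + gamma * dot(x_new, smooth_grad)
    return abs(grad) > alpha
```

A feature that is strongly correlated with the current residual passes the test, while a weakly correlated one is discarded without re-solving the full problem.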

17 Testing Existing Features
Existing features are tested when a new feature is added.
When the new feature is accepted, we re-optimize the objective w.r.t. the current variables, which forces some feature coefficients to be zero.
This is a convex optimization problem; we solve it with Broyden-Fletcher-Goldfarb-Shanno (BFGS).

18 Feature Selection by USFS
If the new feature is accepted, we obtain a sparse coefficient matrix by solving all sub-problems.
For each feature j, if any of its k corresponding feature weights is nonzero, the feature is included in the final model.
A feature score is defined for each included feature, and features are ranked in descending order of their feature scores.
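The selection and ranking step above can be sketched as follows. The transcript does not preserve the definition of the feature score, so this example assumes the score of a feature is the largest absolute value among its k weights; `rank_features` and `tol` are hypothetical names.

```python
from typing import List, Sequence


def rank_features(W: Sequence[Sequence[float]], tol: float = 0.0) -> List[int]:
    """Return indices of kept features, best first.

    W has one row per feature and k columns, one per sub-problem.
    A feature is kept if any of its k weights is nonzero (magnitude > tol).
    Assumed score: the maximum absolute weight in the feature's row.
    """
    scores = {j: max(abs(w) for w in row) for j, row in enumerate(W)}
    kept = [j for j, s in scores.items() if s > tol]
    return sorted(kept, key=lambda j: scores[j], reverse=True)
```

For a 3-by-2 coefficient matrix whose first row is all zeros, the first feature is dropped and the remaining two are ordered by their largest absolute weight.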

20 Questions to Investigate
Q1: What is the quality of the features selected by the USFS framework?
Q2: How efficient is the proposed USFS framework?

21 Datasets
BlogCatalog (social blog directory)
Flickr (image sharing website)
Features are assumed to arrive in a random order; we take {20%, 30%, …, 90%, 100%} of all features.
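The streaming setup described above (a random arrival order, evaluated after 20%, 30%, …, 100% of the features have arrived) can be simulated as in the sketch below. `feature_prefixes`, `percents` and `seed` are hypothetical names, and rounding the cut-off down is an assumption.

```python
import random
from typing import Dict, Iterable, List


def feature_prefixes(
    n_features: int, percents: Iterable[int], seed: int = 0
) -> Dict[int, List[int]]:
    """Shuffle feature indices once, then return the first p% for each p."""
    order = list(range(n_features))
    random.Random(seed).shuffle(order)  # fixed seed for a repeatable order
    # Integer arithmetic avoids floating-point surprises at the cut-off.
    return {p: order[: n_features * p // 100] for p in percents}


prefixes = feature_prefixes(10, range(20, 101, 10))
```

Because every prefix is cut from the same shuffled order, the 20% set is always contained in the 30% set, and so on, which mirrors features accumulating over time.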

22 Experimental Settings
Evaluation: clustering with K-means; metrics are accuracy and NMI.
Baseline batch-mode methods:
Laplacian Score [He et al., NIPS 2005]
SPEC [Zhao and Liu, ICML 2007]
NDFS [Li et al., AAAI 2012]
LUFS [Tang and Liu, KDD 2012]
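For readers unfamiliar with NMI (normalized mutual information), the sketch below computes it from two label lists. NMI has several normalization variants; this example assumes normalization by the geometric mean of the two entropies, which may differ from the variant used in the experiments.

```python
from collections import Counter
from math import log, sqrt
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]) -> float:
    """Mutual information divided by sqrt(H(true) * H(pred))."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mutual_info = sum(
        c / n * log(c * n / (count_true[a] * count_pred[b]))
        for (a, b), c in joint.items()
    )
    h_true, h_pred = entropy(labels_true), entropy(labels_pred)
    if h_true == 0.0 or h_pred == 0.0:
        # Degenerate case: at least one side has a single cluster.
        return 1.0 if h_true == h_pred else 0.0
    return mutual_info / sqrt(h_true * h_pred)
```

NMI ignores how clusters are named: a clustering that matches the ground truth up to a relabeling scores 1, and one that is statistically independent of it scores 0.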

23 Performance on Flickr

24 Performance on BlogCatalog

25 Cumulative Running Time
In BlogCatalog, USFS is 7x, 20x, 29x and 76x faster than the baseline methods.
In Flickr, USFS is 5x, 11x, 20x and 75x faster than the baseline methods.

27 Conclusion
Goals: Perform unsupervised streaming feature selection for social media data.
Solutions: Leverage link information as constraints. Use a stagewise algorithm for streaming features.
Results: Achieve better feature selection performance in terms of clustering. Reduce running time compared with batch-mode methods.

28 Future Work
In this work, we assume that the link information is relatively stable compared with the dynamic content information. We will investigate streaming feature selection in dynamic networks.
Streaming features come from different sources. We will investigate how to fuse heterogeneous feature sources for streaming feature selection.

29 Questions
Acknowledgement: This material is, in part, supported by the National Science Foundation (NSF) under grant number IIS. Comments and suggestions from DMML members and reviewers are greatly appreciated.

