Data Mining and Machine Learning Lab Unsupervised Feature Selection for Linked Social Media Data Jiliang Tang and Huan Liu Computer Science and Engineering.

Presentation transcript:

Data Mining and Machine Learning Lab
Unsupervised Feature Selection for Linked Social Media Data
Jiliang Tang and Huan Liu
Computer Science and Engineering, Arizona State University
KDD 2012, August 12-16, 2012

Social Media
The expansive use of social media generates massive data at an unprecedented rate
- … million tweets per day
- 3,000 photos on Flickr per minute
- 153 million blogs posted per year

High-dimensional Social Media Data
Social media data can be high-dimensional
– Photos
– Video stream
– Tweets
Presenting new challenges
– Massive and noisy data
– Curse of dimensionality

Feature Selection
Feature selection is an effective way of preparing high-dimensional data for efficient data mining.
What is new for feature selection of social media data?

Representation of Linked Data …

Challenges for Feature Selection
Unlabeled data
- No explicit definition of feature relevancy
- Without additional constraints, many subsets of features could be equally good
Linked data
- Not independent and identically distributed

Opportunities for Feature Selection
Social media data provides link information
- Correlation between data instances
Social media data provides extra constraints
- Enabling us to explore the use of social theories

Problem Statement
Given n linked data instances, their attribute-value representation X, and their link representation R, we want to select a subset of features by exploiting both X and R in an unsupervised scenario.
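
To make the two inputs concrete, here is a minimal sketch of what X and R could look like for a toy dataset. The data, the helper name `link_matrix`, the rows-as-instances orientation of X, and the choice of a symmetric 0/1 matrix for R are all assumptions made for illustration; the slides do not specify them.

```python
# Hypothetical toy data: 4 linked instances (e.g. posts), 3 features.
# Rows of X are instances; this orientation is an assumption.
X = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [3.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
]

# Undirected links between instances, given as index pairs (made up).
links = [(0, 2), (1, 3), (2, 3)]


def link_matrix(n, links):
    """Build a symmetric n x n 0/1 link representation R."""
    R = [[0] * n for _ in range(n)]
    for i, j in links:
        R[i][j] = 1
        R[j][i] = 1
    return R


R = link_matrix(len(X), links)
```

An unsupervised selector in this setting would take both `X` and `R` as input and return a subset of the column indices of `X`.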

Supervised and Unsupervised Feature Selection
A unified view
– Selecting features that are consistent with some constraints, for either supervised or unsupervised feature selection
– Class labels act as targets that serve as a constraint
Two problems for unsupervised feature selection
- What are the targets?
- Where can we find constraints?

Our Framework: LUFS

The Target for LUFS

The Constraints for LUFS

Pseudo-class Label

Social Dimension for Link Information
Social dimensions capture group behaviors of linked instances
– Instances in different social dimensions are dissimilar
– Instances within a social dimension are similar

Social Dimension Regularization
Within, between, and total social dimension scatter matrices
Instances are similar within social dimensions and dissimilar between social dimensions.
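
For reference, the textbook definitions of the three scatter matrices are sketched below. The notation is assumed here and may differ from the slide's: y_i is the representation of instance i, C_k is the k-th social dimension with n_k members and mean \mu_k, and \mu is the overall mean.

```latex
S_w = \sum_{k=1}^{K} \sum_{i \in C_k} (y_i - \mu_k)(y_i - \mu_k)^\top, \qquad
S_b = \sum_{k=1}^{K} n_k \, (\mu_k - \mu)(\mu_k - \mu)^\top, \qquad
S_t = \sum_{i=1}^{n} (y_i - \mu)(y_i - \mu)^\top = S_w + S_b
```

Under these definitions, "similar within, dissimilar between" corresponds to a small S_w and a large S_b.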

Constraint from Attribute-Value Data
Instances with similar content are more likely to share similar topics.
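
One common way to encode this kind of constraint is a graph-Laplacian smoothness term; whether the slide used exactly this form is an assumption. Here S is a symmetric content-similarity matrix between instances, D is its diagonal degree matrix, and the rows of Y are the per-instance representations y_i (all hypothetical notation).

```latex
\frac{1}{2} \sum_{i,j=1}^{n} S_{ij} \, \lVert y_i - y_j \rVert_2^2
  = \operatorname{Tr}\!\left( Y^\top L \, Y \right),
\qquad L = D - S, \quad D_{ii} = \sum_{j=1}^{n} S_{ij}
```

Minimizing this term pushes instances with high content similarity S_{ij} toward similar y_i and y_j.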

An Optimization Problem for LUFS

The Optimization Problem for LUFS

LUFS after Two Relaxations Spectral Relaxation on Y - Social Dimension Regularization: W = diag(s)W, and adding 2,1-norm on W
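
The 2,1-norm of W is the sum of the Euclidean norms of its rows, so penalizing it tends to drive whole rows to zero. A minimal sketch of the norm, and of one common way to read off features from a row-sparse W (ranking by row norm), is given below. It assumes one row of W per feature; the function names and the toy matrix are hypothetical, and the ranking rule is a standard convention rather than something stated on the slide.

```python
import math


def row_norms(W):
    """Euclidean (l2) norm of each row of W; one row per feature."""
    return [math.sqrt(sum(w * w for w in row)) for row in W]


def l21_norm(W):
    """l2,1-norm: the sum of the l2 norms of the rows of W."""
    return sum(row_norms(W))


def top_k_features(W, k):
    """Indices of the k features with the largest row norms."""
    norms = row_norms(W)
    order = sorted(range(len(W)), key=lambda i: norms[i], reverse=True)
    return order[:k]


# Hypothetical weight matrix: 4 features (rows), 2 columns.
W = [
    [3.0, 4.0],  # row norm 5
    [0.0, 0.0],  # row norm 0: this feature is effectively dropped
    [0.6, 0.8],  # row norm 1
    [0.0, 2.0],  # row norm 2
]

selected = top_k_features(W, 2)
```

With this toy matrix, the all-zero row contributes nothing to the norm and its feature is never selected.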

Evaluating LUFS
- Datasets and experiment settings
- How does LUFS perform compared to state-of-the-art baseline methods?
- Why does LUFS work?

Data and Characteristics BlogCatalog Flickr

asets.html

Experiment Settings
Metrics
- Clustering: Accuracy and NMI
- K-Means
Baseline methods
- UDFS
- SPEC
- Laplacian Score
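
The two clustering metrics can be sketched as follows, given ground-truth labels and cluster assignments (for example from K-Means run on the selected features). This is an illustrative stdlib-only version, not the evaluation code behind the slides: the accuracy routine brute-forces the cluster-to-class mapping, which is only practical for a handful of clusters, and the NMI normalization by sqrt(H(a) * H(b)) is one of several conventions and is an assumption here.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Best accuracy over all one-to-one maps from clusters to classes.

    Brute force over permutations; fine for a handful of clusters only.
    Assumes there are at least as many classes as clusters.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best = max(best, hits)
    return best / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """Normalized mutual information, normalized by sqrt(H(a) * H(b))."""
    n = len(a)
    ca, cb, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(
        (nab / n) * math.log(n * nab / (ca[x] * cb[y]))
        for (x, y), nab in joint.items()
    )
    denom = math.sqrt(entropy(a) * entropy(b))
    return mi / denom if denom > 0 else 0.0
```

Both metrics ignore the arbitrary numbering of clusters: a clustering that matches the classes up to relabeling scores 1 on each.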

Results on Flickr

Results on BlogCatalog

Probing Further: Why Social Dimensions Work Social Dimensions Random Groups ……. Link Information Social Dimension Extraction Random Assignment
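
The random-group control can be sketched as below: instances are reassigned to groups at random, so any gain from the extracted social dimensions cannot be attributed to grouping alone. The sketch assumes the social dimensions form a partition of the instances and that the random groups keep the same sizes; neither detail is stated on the slide, and the function name and toy groups are hypothetical.

```python
import random


def random_groups(groups, n, seed=0):
    """Randomly reassign n instances to groups of the same sizes.

    `groups` is a list of lists of instance indices that partition
    range(n), e.g. the output of some social dimension extraction step.
    """
    rng = random.Random(seed)
    shuffled = list(range(n))
    rng.shuffle(shuffled)
    out, start = [], 0
    for g in groups:
        out.append(sorted(shuffled[start:start + len(g)]))
        start += len(g)
    return out


# Hypothetical extracted social dimensions over 9 instances.
social_dims = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]
rand_dims = random_groups(social_dims, n=9)
```

Running the same feature selection pipeline once with `social_dims` and once with `rand_dims` gives the two arms of the comparison.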

Results on Flickr

Future Work
- Further exploration of link information
- Noisy and incomplete social media data
- Other sources: multi-view sources
- The strength of social ties (strong and weak ties mixed)

/projects/NSF12/ More Information?

Questions
Acknowledgments: This work is, in part, sponsored by the National Science Foundation via a grant (# ). Comments and suggestions from DMML members and reviewers are greatly appreciated.