Big Data + Deep Learning = A Universal Solution?

Huan Liu
This photo shows visitors and former and current students in the lab; some have graduated. This talk is a discussion of various topics: AI, social media, big data, and social media intelligence. I am very fortunate to have had many diligent and intelligent students working with me since I moved to ASU, and I am truly grateful to them. My lab and future students benefit immensely from their original work, fame, and success.

Big Data and Deep Learning
Big data is ubiquitous as computing technology advances
The success of KDD shows that ‘raw oil’ can be turned into valuable ‘products’
Numerous impressive results of deep learning have revived neural networks and raised them to a new height in machine learning
Are we close to having a universal solution?
Data is the new oil
Gordon Moore’s law

Big Data + Deep Learning = A Universal Solution?
Yes, if we had sufficiently big data
Do we often have enough data?
Social media data is obviously big
We use social media data in this discussion

Social Media Data
Twitter: 300 million users; 500 million tweets / day; 1% (5 million) released for research
Facebook: 2 billion users; 422 million updates / day; 196 million photos / day
Instagram: 700 million users; 80 million photos / day
Figures: Facebook Degree Distribution; Instagram Users over Time
Thanks to Dr. Fred Morstatter
Social media data is mainly user-generated, with geo-spatial, pictorial, temporal, and social information. Social media contains a lot of newly available information.

A Key to Success in Search of a Universal Solution
Make “Big” Data Bigger

Making “Big” Data Bigger
What is big data?
A conventional answer is the 4Vs: Volume, Velocity, Variety, Veracity (and also Value, Vulnerability)
A practitioner’s answer is more nuanced
‘Big’ data can actually be little or thin
For machine learning or data mining to work, the more data, the better
Make little data bigger
Make thin data thicker
A correct answer, or a philosopher’s answer, is that it depends, but depending on what?
Use the curse of dimensionality in reverse: reducing dimensionality increases the effective amount of data

Curse of Dimensionality: Required Samples
Data sparsity becomes exponentially worse as feature dimensionality increases
Conventional distance metrics become ineffective as far and near neighbors have similar distances
Figure labels: 3 samples per unit region; 1 sample per region; 1/3 sample per region
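The two effects above can be checked numerically. The following is a minimal sketch, not taken from the slides: it assumes 9 samples and 3 intervals per dimension (values chosen because they reproduce the figure labels 3, 1, and 1/3), and it uses a hypothetical helper, relative_contrast, that compares the farthest and nearest neighbor of a random query point drawn uniformly from the unit hypercube.

```python
import math
import random
from fractions import Fraction


def samples_per_region(n_samples, bins_per_dim, n_dims):
    """Average number of samples per cell of a regular grid."""
    return Fraction(n_samples, bins_per_dim ** n_dims)


def relative_contrast(n_dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point.

    Points are uniform in the unit hypercube (an assumption made for this
    sketch). Small values mean near and far neighbors look alike.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# Assumed setup: 9 samples, 3 intervals per dimension, 1 to 3 dimensions.
densities = [samples_per_region(9, 3, d) for d in (1, 2, 3)]
print(densities)

# The contrast between far and near neighbors shrinks as dimensionality grows.
for d in (2, 20, 500):
    print(d, round(relative_contrast(d), 2))
```

With the sample count held fixed, the density per region drops by a factor of 3 for every added dimension, which is the exponential sparsity the slide refers to.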

Relevant, Redundant and Irrelevant Features
Feature selection retains relevant features for learning and removes redundant or irrelevant ones
For the binary classification task below, f1 is relevant, f2 is redundant given f1, and f3 is irrelevant
Colors show the two classes
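One way to make the three categories concrete is with mutual information. The sketch below is an illustration under stated assumptions, not the method used in the talk: the 8-sample dataset is hypothetical, f1 determines the class, f2 is a recoding of f1, and f3 is independent of the class. The select function is a hypothetical greedy filter written for this example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c * n) / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


# Hypothetical toy data: 8 samples with a binary class label.
label = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = [0, 0, 0, 0, 1, 1, 1, 1]  # relevant: determines the class
f2 = [1, 1, 1, 1, 0, 0, 0, 0]  # redundant given f1: a recoding of f1
f3 = [0, 1, 0, 1, 0, 1, 0, 1]  # irrelevant: independent of the class
FEATURES = [("f1", f1), ("f2", f2), ("f3", f3)]


def select(features, label):
    """Greedy filter: keep relevant features that no kept feature determines."""
    chosen = []
    for name, values in features:
        if mutual_information(values, label) <= 1e-12:
            continue  # irrelevant: carries no information about the class
        entropy = mutual_information(values, values)
        if any(math.isclose(mutual_information(values, kept), entropy)
               for _, kept in chosen):
            continue  # redundant: fully determined by a kept feature
        chosen.append((name, values))
    return [name for name, _ in chosen]


relevance = {name: mutual_information(values, label) for name, values in FEATURES}
print(relevance)
print(select(FEATURES, label))
```

Both f1 and f2 are fully informative about the class on their own, so relevance alone cannot separate them; the redundancy check against already-kept features is what removes f2.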

Feature Selection
Feature selection selects an ‘optimal’ subset of relevant features from the original high-dimensional data given a certain criterion

Feature Selection and scikit-feature
Feature selection can make data ‘bigger’
Assume all attribute values are binary in our toy example
Before FS, 5/2^10 = 5/1024 (< 0.5%); after FS, 5/2^3 = 5/8 (> 50%)
Does FS always work? Yes, for most high-dimensional data
Where can we find it? scikit-feature, an open-source repository in Python
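The arithmetic behind the toy example, written out as a worked equation. It assumes the 5 samples are distinct, so each one occupies a different cell among the 2^d possible combinations of d binary attributes:

```latex
\text{coverage before FS} = \frac{5}{2^{10}} = \frac{5}{1024} \approx 0.49\% < 0.5\%,
\qquad
\text{coverage after FS} = \frac{5}{2^{3}} = \frac{5}{8} = 62.5\% > 50\%
```

The same 5 samples cover more than a hundred times as large a share of the space once 7 of the 10 attributes are removed.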

Making Thin Data Thicker
Most people, like many of us, are in the long tail
Our data is often thin or sparse
With little data, machine learning is powerless
Social media data offers new opportunities
  Multiple facets: posts, profile, linked information
  Multiple platforms that offer different functions
Two case studies
  Feature selection using social network information
  Connecting users across more than one social media site
For the first case, we use an additional facet of data (or data variety) to help enrich data; for the second case, we use information from multiple sites to make data thicker.

Use Link Information for Data Thickening
Where can we find additional information for feature selection?
  Social media data contains various types of data
  Link information is additional
  Other sources such as sentiment, likes, etc.
Are there theories to guide us in using link information?
  Social influence
  Homophily
Extracting distinctive relations from linked data for feature selection
Having additional information does not mean that we have to use it

Representation for Social Media Data
So far we have discussed the difference between attribute-value data and social media data, which also includes the following relations between users. Traditional feature selection is not equipped to handle social media data, and that motivates our current work, LinkedFS. The problem of feature selection for linked social media data is relatively novel, so before going through the details of LinkedFS we would like to define the problem formally.
Figure label: Social Context

Relation Extraction
CoPost: posts belong to the same user
CoFollowing: users follow the same person
CoFollowed: users are followed by the same people
Following: a user follows another user
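The four relations can be read off a post-ownership table and a follow graph. The sketch below works at the user level as worded on the slide; the toy data, the (follower, followee) edge convention, and all function names are assumptions made for illustration, not the LinkedFS implementation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data.
posts_by_user = {"u1": ["p1", "p2"], "u2": ["p3"], "u3": ["p4", "p5"]}
# Each edge is (follower, followee); this doubles as the Following relation.
follows = {("u1", "u3"), ("u2", "u3"), ("u3", "u1"), ("u3", "u2")}


def co_post(posts_by_user):
    """CoPost: pairs of posts that belong to the same user."""
    return {
        pair
        for posts in posts_by_user.values()
        for pair in combinations(sorted(posts), 2)
    }


def _pairs_sharing(groups):
    """All sorted pairs of members that appear together in some group."""
    return {
        pair
        for members in groups.values()
        for pair in combinations(sorted(members), 2)
    }


def co_following(follows):
    """CoFollowing: pairs of users who follow the same person."""
    followers_of = defaultdict(set)
    for follower, followee in follows:
        followers_of[followee].add(follower)
    return _pairs_sharing(followers_of)


def co_followed(follows):
    """CoFollowed: pairs of users who are followed by the same person."""
    followees_of = defaultdict(set)
    for follower, followee in follows:
        followees_of[follower].add(followee)
    return _pairs_sharing(followees_of)


print(co_post(posts_by_user))
print(co_following(follows))
print(co_followed(follows))
```

In this toy graph u1 and u2 both follow u3 and are both followed by u3, so the pair (u1, u2) appears in both CoFollowing and CoFollowed.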

Evaluation Results on Digg Data

Summary
LinkedFS is evaluated under varied circumstances to understand how it works
Link information can help feature selection for social media data
Unlabeled data is more common in social media, so unsupervised learning is more sensible, but also more challenging
An unsupervised method is showcased in our KDD12 paper, following social correlation theories
Jiliang Tang and Huan Liu. "Unsupervised Feature Selection for Linked Social Media Data", the Eighteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012.
Jiliang Tang and Huan Liu. "Feature Selection with Linked Data in Social Media", SIAM International Conference on Data Mining, 2012.

Gather More Data with Little Data
Collectively, social media data is indeed big
For an individual, however, the data is small
  How much activity data do we generate daily?
  How many posts did we post this week?
  How many friends do we have?
When “big” social media data isn’t big: searching for more data with little data
We use different social media services for varied purposes
  LinkedIn, Facebook, Twitter, Instagram, YouTube, …
Activity data: comment, like, share, retweet; 4Vs

An Example
Can we connect individuals across sites?
- Little data about an individual + Many social media sites
- Partial Information + Complementary Information > Better User Profiles
            Age    Location       Education
  LinkedIn  N/A    Phoenix Area   ASU (2014)
  Twitter   N/A    Tempe, AZ      ASU
- How hard can it be to identify an individual across sites?
- Privacy Experts Claim Advertisers Know a lot about People
- Can they stop showing you the same repetitive ads across sites?
Connectivity is not available
Consistency in Information Availability
Reza Zafarani and Huan Liu. "Connecting Users across Social Media Sites: A Behavioral-Modeling Approach", the Nineteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'2013), August, Chicago, Illinois.

Searching for More Data with Little Data
Each social media site can have a varied amount of user information
Which information definitely exists for all sites? Usernames
But a user’s usernames on different sites can be different
Our work is to connect the information of the same user provided across sites
Why is it not an entity resolution problem?
We define our problem as an identification problem and solve it via machine learning

Our Behavior Generates Information Redundancy
Information shared across sites provides a behavioral fingerprint
How to capture and use differentiable attributes
MOBIUS: MOdeling Behavior for Identifying Users across Sites
Behavioral Modeling + Machine Learning

Behaviors
Human Limitation: Time & Memory Limitation; Knowledge Limitation
Exogenous Factors: Typing Patterns; Language Patterns
Endogenous Factors: Personal Attributes & Traits; Habits

Behavioral Modeling Approach with Learning
Figure: Behavior 1, Behavior 2, …, Behavior n generate information redundancy, which is captured via Feature Set 1, Feature Set 2, …, Feature Set n. The feature sets and data feed a learning framework that outputs an identification function.
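The pipeline in the figure (behaviors, feature sets, learning framework, identification function) can be sketched end to end on usernames. Everything below is an assumption made for illustration: the two similarity features are simple stand-ins and are not the feature sets from the MOBIUS paper, the labeled examples are invented, and the learner is a small logistic regression chosen only because it fits in a few lines.

```python
import math
from difflib import SequenceMatcher


def features(candidate, priors):
    """Two stand-in similarity features between a candidate and known usernames."""
    best_ratio = max(SequenceMatcher(None, candidate, p).ratio() for p in priors)
    cand_chars, prior_chars = set(candidate), set("".join(priors))
    jaccard = len(cand_chars & prior_chars) / len(cand_chars | prior_chars)
    return [best_ratio, jaccard]


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=500, lr=0.5):
    """Fit logistic regression by batch gradient ascent; returns (weights, bias)."""
    rows = [(features(c, p), y) for c, p, y in examples]
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in rows:
            p = _sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            for j, v in enumerate(x):
                grad_w[j] += (y - p) * v
            grad_b += y - p
        weights = [w + lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias += lr * grad_b / len(rows)
    return weights, bias


def identify(candidate, priors, model):
    """Identification function: estimated probability of the same individual."""
    weights, bias = model
    x = features(candidate, priors)
    return _sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Invented labeled examples: (candidate, known usernames, 1 if same individual).
EXAMPLES = [
    ("huanliu", ["huan.liu", "hliu"], 1),
    ("reza_z", ["rezaz", "reza.zafarani"], 1),
    ("fredm88", ["fredm", "fred_m"], 1),
    ("dragonfly", ["huan.liu", "hliu"], 0),
    ("bookworm42", ["rezaz", "reza.zafarani"], 0),
    ("skyline", ["fredm", "fred_m"], 0),
]

model = train(EXAMPLES)
known = ["jsmith88", "j.smith"]
print(identify("jsmith_88", known, model), identify("qwxzv", known, model))
```

The design point carried over from the slide is the separation of concerns: behaviors are encoded only in the feature function, so richer feature sets can be swapped in without changing the learner or the identification function.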

Summary – Making Data Bigger
Gathering more data is often necessary for effective data mining
Reducing dimensionality can make data bigger
Social media data provides unique opportunities to do so by using different sites and abundant user-generated content
Traditionally available data can also be tapped to make thin data “thicker”
Jundong Li, et al. "Feature Selection: A Data Perspective".
Reza Zafarani and Huan Liu. "Connecting Users across Social Media Sites: A Behavioral-Modeling Approach", SIGKDD, 2013.

Big Data Is Often Not Sufficiently Big
Can we always make our data bigger?
  Diminishing returns
What do we sacrifice when making data bigger?
  Narrowing domains
Deep learning is confined by available data
  Unexpected effectiveness of big data
To accomplish general AI, we need more and bigger data!

Repositories and Recent Books
scikit-feature: an open-source feature selection repository in Python
Social Computing Repository
Books: SMM and TDA (in Python), free download

Discovering Social Media Intelligence
Graph Theories
Network Measures and Models
Data Mining, NLP, and Visual Analytics
Community Detection and Analysis
Information Diffusion
Influence and Homophily
Recommender Systems
Behavior Analytics
Sentiment Analysis
We use social media data to acquire intelligence about user behavior, user needs, sentiment, opinions, and trends.

THANK YOU ALL & BigMine17 Organizers
for this opportunity to share our research
Acknowledgments
Grants from NSF, ONR, ARO, among others
DMML members and project leaders
Collaborators: CMU (Minerva), CRA (IARPA-CAUSE)
More information by searching for “Huan Liu” or at CRA Charles River Analytics

More Challenges in Acquiring SM Intelligence
Social media data is obviously big, but why are we often still short of data?
  How can we make data ‘bigger’?
Data is power, so it can produce any result
  Can we algorithmically evaluate the results from big data?
We don’t know what we don’t know
  How can we know if our result of social media analysis is of any value?
Make thin data thicker?
  Can machine learning help? How?

Evaluation without Ground Truth
… is in both English and Chinese
The CACM article can be found at dl.acm.org

Further Readings
Jundong Li and Huan Liu. "Challenges of Feature Selection for Big Data Analytics", Special Issue on Big Data, IEEE Intelligent Systems, 32 (2).
Fred Morstatter and Huan Liu. "A Novel Measure for Coherence in Statistical Topic Models", Association of Computational Linguistics (ACL), August, Berlin, Germany.
Reza Zafarani and Huan Liu. "Evaluation without Ground Truth in Social Media Research", Communications of ACM, Volume 58, Issue 6, June 2015.
Lei Tang and Huan Liu. "Community Detection and Mining in Social Media", Morgan & Claypool Publishers, September 2010.
