Big Data + Deep Learning = A Universal Solution?

Presentation transcript:

Big Data + Deep Learning = A Universal Solution? Huan Liu This photo shows visitors, and former and current students in the lab. Some have graduated. This talk is a discussion of various topics: AI, Social Media, Big Data, Social Media Intelligence. I am very fortunate to have had many diligent and intelligent students working with me since I moved to ASU. I am truly grateful to them. My lab and future students benefit immensely from their original work, fame and success.

Big Data and Deep Learning Big data is ubiquitous as computing technology advances The success of KDD shows that ‘raw oil’ can be turned into valuable ‘products’ Numerous impressive results of deep learning revive neural networks to a new height for machine learning Are we close to having a universal solution? Data is the new oil Gordon Moore’s law

Big Data + Deep Learning = A Universal Solution? Yes, if we had sufficiently big data Do we often have enough data? Social media data is obviously big We use social media data in this discussion

Social Media Data Twitter: 300 million users, 500 million tweets / day, 1% (5 million) released for research. Facebook: 2 billion users, 422 million updates / day, 196 million photos / day. Instagram: 700 million users, 80 million photos / day. Figures: Facebook Degree Distribution; Instagram Users over Time. Thanks to Dr. Fred Morstatter. Social media data is mainly user-generated with geo-spatial, pictorial, temporal, and social information. SM contains a lot of newly available information.

A Key to Success in Search of a Universal Solution Make “Big” Data Bigger

Making “Big” Data Bigger What is big data? A conventional answer is 4Vs A practitioner’s answer is more nuanced ‘Big’ data can be actually little or thin For machine learning or data mining to work, the more data, the better Make little data bigger Make thin data thicker 4Vs: Volume, Velocity, Variety, Veracity; and Value, Vulnerability A correct answer or philosopher’s answer is it depends, but depending on what? Use curse of dimensionality to increase data amount

Curse of Dimensionality: Required Samples Data sparsity becomes exponentially worse as feature dimensionality increases Conventional distance metric becomes ineffective as far and near neighbors have similar distances 3 samples per unit region 1 sample per region 1/3 sample per region http://nikhilbuduma.com/2015/03/10/the-curse-of-dimensionality/
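The three densities quoted on this slide can be reproduced with a short calculation. The sketch below assumes 9 samples and 3 unit regions per axis; both numbers are assumptions chosen so that the results match the slide, not values stated in the talk, and `samples_per_region` is a hypothetical helper name.

```python
from fractions import Fraction


def samples_per_region(n_samples: int, bins_per_axis: int, n_dims: int) -> Fraction:
    """Average number of samples per unit region when every axis is cut into bins."""
    n_regions = bins_per_axis ** n_dims  # the region count grows exponentially with n_dims
    return Fraction(n_samples, n_regions)


# Assumed setup: 9 samples, 3 regions per axis (illustrative values only).
for d in (1, 2, 3, 10):
    print(f"{d} dimension(s): {samples_per_region(9, 3, d)} samples per region")
```

With a fixed sample budget, each added dimension divides the average density by the number of bins per axis, which is the exponential sparsity the slide describes.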

Relevant, Redundant and Irrelevant Features Feature selection retains relevant features for learning and removes redundant or irrelevant ones For a binary classification task below, f1 is relevant, f2 is redundant given f1, and f3 is irrelevant Colors show two classes
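One way to make the three categories concrete is to score features with mutual information. The sketch below uses hypothetical toy data (not the data behind the slide's figure) and a hand-written `mutual_information` helper; mutual information is only one possible criterion and is an assumption here, not necessarily the one used in the talk.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Hypothetical toy data for a binary classification task.
label = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = [0, 0, 0, 0, 1, 1, 1, 1]  # relevant: separates the two classes
f2 = [5, 5, 5, 5, 9, 9, 9, 9]  # redundant: a re-coding of f1
f3 = [0, 1, 0, 1, 0, 1, 0, 1]  # irrelevant: independent of the label

features = {"f1": f1, "f2": f2, "f3": f3}
relevance = {name: mutual_information(f, label) for name, f in features.items()}
redundancy_f2_f1 = mutual_information(f2, f1)

print(relevance)
print("I(f2; f1) =", redundancy_f2_f1)
```

In this toy setup f1 and f2 are equally informative about the label, but f2 adds nothing once f1 is known, while f3 carries no information about the label at all.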

Feature Selection Feature selection selects an ‘optimal’ subset of relevant features from the original high-dimensional data given a certain criterion feature selection

Feature Selection and scikit-feature Feature selection can make data ‘bigger’ Assuming all binary attribute values in our toy example Before FS, 5/2^10 = 5/1024, after FS, 5/2^3 = 5/8 Does FS always work? Yes, for most high-d data Where can we find it? scikit-feature, an open-source repository in Python 5/1024 < 0.5%, 5/8 > 50%
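The arithmetic on this slide can be checked directly. The sketch below assumes, as the slide does, 5 samples and binary features, with 10 features before selection and 3 after; `coverage` is a hypothetical helper name, and the result is an upper bound because samples may repeat.

```python
from fractions import Fraction


def coverage(n_samples: int, n_binary_features: int) -> Fraction:
    """Upper bound on the share of the 2**d binary feature vectors that n samples can cover."""
    return Fraction(n_samples, 2 ** n_binary_features)


before = coverage(5, 10)  # 5 samples, 10 binary features
after = coverage(5, 3)    # the same 5 samples, 3 selected features

print(f"before FS: {before} = {float(before):.2%}")
print(f"after FS:  {after} = {float(after):.2%}")
```

The samples are unchanged; only the size of the space they must cover shrinks, which is the sense in which feature selection makes data ‘bigger’.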

Making Thin Data Thicker Most people like many of us are in the long tail Our data is often thin or sparse With little data, machine learning is powerless Social media data offers new opportunities Multiple facets: posts, profile, linked information Multiple platforms that offer different functions Two case studies Feature selection using social network information Connecting users across more than one social media site For the first case, we use an additional facet of data (or data variety) to help enrich data; for the second case, we use information from multiple sites to make data thicker

Use Link Information for Data Thickening Where can we find additional information for feature selection Social media data contains various types of data Link information is additional Other sources such as sentiment, like, etc. Are there theories to guide us in using link info? Social influence Homophily Extracting distinctive relations from linked data for feature selection Having additional information does not mean that we have to use it

Representation for Social Media Data Social Context And the following relations between users. So far we have discussed the difference between attribute-value data and social media data. Traditional feature selection is not equipped to handle social media data, which motivates our work on LinkedFS. The problem of feature selection for linked social media data is relatively novel, so before going through the details of LinkedFS, we formally define the problem first.

Relation Extraction Four relation types: CoPost (posts that belong to the same user), CoFollowing (users who follow the same person), CoFollowed (users who are followed by the same people), Following (a user follows another user)
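The four relation types can be sketched on a small graph. The data below (`post_author`, `follows`) and the helper names are hypothetical and only illustrate the definitions on this slide; this is not the LinkedFS implementation.

```python
from itertools import combinations

# Hypothetical toy data: post -> author, and directed edges (follower, followee).
post_author = {"p1": "u1", "p2": "u1", "p3": "u2", "p4": "u3"}
follows = {("u1", "u3"), ("u2", "u3"), ("u4", "u1"), ("u4", "u2")}


def neighbours(edges, outgoing=True):
    """Map each user to the set of users they follow (outgoing) or are followed by."""
    out = {}
    for follower, followee in edges:
        key, value = (follower, followee) if outgoing else (followee, follower)
        out.setdefault(key, set()).add(value)
    return out


def pairs_sharing(neigh):
    """Unordered pairs of keys whose neighbour sets overlap."""
    return {(a, b) for a, b in combinations(sorted(neigh), 2) if neigh[a] & neigh[b]}


# CoPost: pairs of posts written by the same user.
co_post = {(a, b) for a, b in combinations(sorted(post_author), 2)
           if post_author[a] == post_author[b]}
# CoFollowing: pairs of users who follow at least one common user.
co_following = pairs_sharing(neighbours(follows, outgoing=True))
# CoFollowed: pairs of users who share at least one follower.
co_followed = pairs_sharing(neighbours(follows, outgoing=False))
# Following: the directed edges themselves.
following = set(follows)

print(co_post, co_following, co_followed, sorted(following))
```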

Evaluation Results on Digg Data

Summary LinkedFS is evaluated under varied circumstances to understand how it works Link information can help feature selection for social media data Unlabeled data is more common in social media, so unsupervised learning is more sensible, but also more challenging An unsupervised method is showcased in our KDD12 paper following social correlation theories Jiliang Tang and Huan Liu. "Unsupervised Feature Selection for Linked Social Media Data", the Eighteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012. Jiliang Tang and Huan Liu. "Feature Selection with Linked Data in Social Media", SIAM International Conference on Data Mining, 2012.

Gather More Data with Little Data Collectively, social media data is indeed big For an individual, however, the data is small How much activity data do we generate daily? How many posts did we post this week? How many friends do we have? When “big” social media data isn’t big, Searching for more data with little data We use different social media services for varied purposes LinkedIn, Facebook, Twitter, Instagram, YouTube, … Activity data: comment, like, share, retweet; 4Vs

An Example Can we connect individuals across sites? Reza Zafarani - Little data about an individual + Many social media sites - Partial Information + Complementary Information > Better User Profiles LinkedIn: Age N/A, Location Phoenix Area, Education ASU (2014). Twitter: Age N/A, Location Tempe, AZ, Education ASU. - How hard can it be to identify an individual across sites? - Privacy Experts Claim Advertisers Know a lot about People - Can they stop showing you the same repetitive ads across sites? Connectivity is not available Consistency in Information Availability Reza Zafarani and Huan Liu. "Connecting Users across Social Media Sites: A Behavioral-Modeling Approach", the Nineteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'2013), August 11 - 14, 2013. Chicago, Illinois.

Searching for More Data with Little Data Each social media site can have a different amount of user information Which information definitely exists on all sites? Usernames But a user’s usernames on different sites can be different Our work is to connect the information that the same user provides across sites Why is it not an entity resolution problem? We define our problem as an identification problem and we solve it via machine learning

Our Behavior Generates Information Redundancy Information shared across sites provides a behavioral fingerprint How to capture and use differentiable attributes MOdeling Behavior for Identifying Users across Sites Behavioral Modeling Machine Learning MOBIUS

Behaviors Human Limitation: Time & Memory Limitation, Knowledge Limitation. Exogenous Factors: Typing Patterns, Language Patterns. Endogenous Factors: Personal Attributes & Traits, Habits.

Behavioral Modeling Approach with Learning Behavior 1, Behavior 2, ..., Behavior n generate Information Redundancy, which is captured via Feature Set 1, Feature Set 2, ..., Feature Set n. Feature sets and Data go into the Learning Framework, which produces the Identification Function.
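The pipeline above (behaviors, feature sets, identification function) can be sketched for usernames. Everything in this block is illustrative: the three features, the function names, and the fixed threshold are assumptions standing in for the much richer feature sets and the trained classifier of the MOBIUS paper.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def username_features(candidate: str, known: List[str]) -> Dict[str, float]:
    """Illustrative features comparing a candidate username with a user's known usernames."""
    if not candidate or not known:
        raise ValueError("candidate and known usernames must be non-empty")
    cand = candidate.lower()
    prior = [k.lower() for k in known]
    # Habit of reusing or slightly modifying a username: best string similarity.
    best_ratio = max(SequenceMatcher(None, cand, k).ratio() for k in prior)
    # Time and memory limitation: people tend to keep username lengths similar.
    min_len_diff = min(abs(len(cand) - len(k)) for k in prior)
    # Typing and language patterns: share of candidate characters seen in known names.
    alphabet = set("".join(prior))
    alphabet_overlap = len(set(cand) & alphabet) / len(set(cand))
    return {
        "best_ratio": best_ratio,
        "min_len_diff": float(min_len_diff),
        "alphabet_overlap": alphabet_overlap,
    }


def identify(candidate: str, known: List[str], threshold: float = 0.6) -> bool:
    """Stand-in identification function: a fixed threshold, not a learned model."""
    return username_features(candidate, known)["best_ratio"] >= threshold


print(username_features("huan.liu", ["huanliu", "hliu"]))
print(identify("huan.liu", ["huanliu", "hliu"]))
```

In a learning setting, feature vectors like these would be computed for labeled pairs of candidate and known usernames and passed to a classifier, which then plays the role of the identification function.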

Summary – Making Data Bigger Gathering more data is often necessary for effective data mining Reducing dimensionality can make data bigger Social media data provides unique opportunities to do so by using different sites and abundant user-generated content Traditionally available data can also be tapped to make thin data “thicker” Jundong Li, et al. "Feature Selection: A Data Perspective", 2016. http://arxiv.org/abs/1601.07996 Reza Zafarani and Huan Liu. "Connecting Users across Social Media Sites: A Behavioral-Modeling Approach", SIGKDD, 2013.

Big Data Is Often Not Sufficiently Big Can we always make our data bigger? Diminishing returns What do we sacrifice when making data bigger? Narrowing domains Deep learning is confined by available data Unexpected effectiveness of big data To accomplish general AI, we need more and bigger data!

Repositories and Recent Books scikit-feature – an open source feature selection repository in Python Social Computing Repository Books: SMM and TDA (in Python) free download

http://dmml.asu.edu/smm/

Discovering Social Media Intelligence Graph Theories Network Measures and Models Data Mining, NLP, and Visual Analytics Community Detection and Analysis Information Diffusion Influence and Homophily Recommender Systems Behavior Analytics Sentiment analysis We use social media data to acquire intelligence about user behavior, user needs, sentiment, opinions, and trends. http://dmml.asu.edu/smm/

THANK YOU ALL & BigMine17 Organizers for this opportunity to share our research Acknowledgments Grants from NSF, ONR, ARO, among others DMML members and project leaders Collaborators: CMU (Minerva), CRA (IARPA-CAUSE) More information by searching for “Huan Liu” or at http://www.public.asu.edu/~huanliu CRA: Charles River Analytics

More Challenges in Acquiring SM Intelligence Social media data is obviously big, but why are we often still short of data? How can we make data ‘bigger’? Data is power, so it can produce any result Can we algorithmically evaluate the results from big data? We don’t know what we don’t know How can we know if our result of social media analysis is of any value? Make thin data thicker? Can machine learning help? How?

Evaluation without Ground Truth http://dl.acm.org/citation.cfm?doid=2783419.2666680 is in both English and Chinese. The CACM article can be found at dl.acm.org.

Further Readings Jundong Li and Huan Liu. "Challenges of Feature Selection for Big Data Analytics", Special Issue on Big Data, IEEE Intelligent Systems, 32 (2), 9-15, 2017. Fred Morstatter and Huan Liu. "A Novel Measure for Coherence in Statistical Topic Models", Association for Computational Linguistics (ACL), August 2016, Berlin, Germany. Reza Zafarani and Huan Liu. "Evaluation without Ground Truth in Social Media Research", Communications of the ACM, Volume 58, Issue 6, June 2015, Pages 54-60. Lei Tang and Huan Liu. "Community Detection and Mining in Social Media", Morgan & Claypool Publishers, September 2010.