Big Data + Deep Learn = A Universal Solution?

Slides:

Advertisements

Similar presentations

Community Detection and Graph-based Clustering

Advertisements

Data Mining and Machine Learning Lab eTrust: Understanding Trust Evolution in an Online World Jiliang Tang, Huiji Gao and Huan Liu Computer Science and.

Multi-label Relational Neighbor Classification using Social Context Features Xi Wang and Gita Sukthankar Department of EECS University of Central Florida.

1 Working with Social Media in Research Settings Victoria Wade Careers Consultant.

Social Media Mining Chapter 5 1 Chapter 5, Community Detection and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool, September, 2010.

2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, Introduction to IR Research ChengXiang Zhai Department of Computer.

Mining Social Media: Looking Ahead Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning.

Discovering Overlapping Groups in Social Media Xufei Wang, Lei Tang, Huiji Gao, and Huan Liu Arizona State University.

Connecting Users across Social Media Sites: A Behavioral-Modeling Approach Jingchi Zhang.

Feature Selection and Its Application in Genomic Data Analysis March 9, 2004 Lei Yu Arizona State University.

Introduction Social Media Mining. 2 Measures and Metrics 2 Social Media Mining Introduction Facebook How does Facebook use your data? Where do you think.

Top Objectives: 1.Increase web traffic and exposure 2.Become definitive authority on Coffee 3.Increase sales to coffee centric Food Service Operators 4.Engage.

COMMUNITIES IN MULTI-MODE NETWORKS 1. Heterogeneous Network Heterogeneous kinds of objects in social media – YouTube Users, tags, videos, ads – Del.icio.us.

PRESENTATION BY BLESSING VAVA ONLINE JOURNALISM COURSE 10 JUNE UNIVERSITY OF WITWATERSRAND.

REZA ZAFARANI AND HUAN LIU DATA MINING AND MACHINE LEARNING LABORATORY (DMML) ARIZONA STATE UNIVERSITY KDD 2013 – CHICAGO, ILLINOIS.

Knowing Your Facebook From Your Flickr Dan O’ Neill – -

Data Mining and Machine Learning Lab Network Denoising in Social Media Huiji Gao, Xufei Wang, Jiliang Tang, and Huan Liu Data Mining and Machine Learning.

Internet Skills The World Wide Web (Web) consists of billions of interconnected pages of information from a wide variety of sources. In this section: Web.

Data Mining and Machine Learning Lab Unsupervised Feature Selection for Linked Social Media Data Jiliang Tang and Huan Liu Computer Science and Engineering.

Social Media Getting Social in a Digital World. (And, why it matters to your business!)

The Database and Info. Systems Lab. University of Illinois at Urbana-Champaign User Profiling in Ego-network: Co-profiling Attributes and Relationships.

Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence Wednesday, March 29, 2000.

OCLC Online Computer Library Center 1 Social Media and Advocacy.

Most of contents are provided by the website Introduction TJTSD66: Advanced Topics in Social Media Dr.

MOTIVATION AND CHALLENGE Big data Volume Velocity Variety Veracity Contributor Content Context Value 5 Vs of Big Data 3 Cs of Veracity.

Mining Social Media Data Arizona State University Data Mining and Machine Learning Lab Arizona State University Data Mining and Machine Learning Lab Nov.

What Is Text Mining? Also known as Text Data Mining Process of examining large collections of unstructured textual resources in order to generate new.

Using Social Media at an Advanced Level John Dawe, CFRE Dawe Consulting, LLC facebook.com/johndawe.

ARCHIVES AND RECORDS MANAGEMENT PROFESSIONAL ASSOCIATION AND JOURNAL ANALYSIS Kim Edwards MARA September 2015.

Mining information from social media

Achieving Semantic Interoperability at the World Bank Designing the Information Architecture and Programmatically Processing Information Denise Bedford.

Unsupervised Streaming Feature Selection in Social Media

Social Information Processing March 26-28, 2008 AAAI Spring Symposium Stanford University

Measuring User Influence in Twitter: The Million Follower Fallacy Meeyoung Cha Hamed Haddadi Fabricio Benevenuto Krishna P. Gummadi.

Sparse Coding: A Deep Learning using Unlabeled Data for High - Level Representation Dr.G.M.Nasira R. Vidya R. P. Jaia Priyankka.

Topic Modeling for Short Texts with Auxiliary Word Embeddings

Thotwaves Innovations Welcome To SMM & SMO Activity Plan

Queensland University of Technology

MINING DEEP KNOWLEDGE FROM SCIENTIFIC NETWORKS

Introduction to Information and Communication Technologies

Top 20 Ways To Achieve Your Purpose With LinkedIn

Market Intelligence Analysis

Eick: Introduction Machine Learning

School of Computer Science & Engineering

Introduction to IR Research

The important use of Twitter in the Educators’ World

Social Media Measurement Tools

Objectives of the Course and Preliminaries

Summary Presented by : Aishwarya Deep Shukla

Nicole Steen-Dutton, ClickDimensions

E-Commerce Theories & Practices

Power of Social Media Analytics

Mining Spatio-Temporal Reachable Regions over Massive Trajectory Data

Mining the Data Charu C. Aggarwal, ChengXiang Zhai

SOCIAL Networks Introduction Modified from

Goal, Question, and Metrics

An Efficient method to recommend research papers and highly influential authors. VIRAJITHA KARNATAPU.

#VisualHashtags Visual Summarization of Social Media Events using Mid-Level Visual Elements Sonal Goel (IIIT-Delhi), Sarthak Ahuja (IBM Research, India),

Data Warehousing and Data Mining

TDM=Text Mining “automated processing of large amounts of structured digital textual content for purposes of information retrieval, extraction, interpretation.

Fake News Detection - Social Article Fusion

Media Trends 2017 Edition.

Sr. Manager, Global Talent Acquisition

Course Summary ChengXiang “Cheng” Zhai Department of Computer Science

How Can Brands Best Understand Their Consumers: Social Listening

Social Media Management

A Classification-based Approach to Question Routing in Community Question Answering Tom Chao Zhou 22, Feb, 2010 Department of Computer.

Christoph F. Eick: A Gentle Introduction to Machine Learning

COIT 20253: Business Intelligence Using Big Data

CSE591: Data Mining by H. Liu

Presentation transcript:

Big Data + Deep Learn = A Universal Solution? Huan Liu This photo shows visitors, and former and current students in the lab. Some graduated. This talk is a discussion of various topics: AI, Social Media, Big Data, Social Media Intelligence I am very fortunate to have many diligent and intelligent students to work with me since I moved to ASU. I am truly grateful to them. My lab and future students benefit immensely from their original work, fame and success.

Big Data and Deep Learning Big data is ubiquitous as computing technology advances The success of KDD shows that ‘raw oil’ can be turned into valuable ‘products’ Numerous impressive results of deep learning revive neural networks to a new height for machine learning Are we close to having a universal solution? Data is the new oil Gordon Moore’s law

Big Data + Deep Learning = A Universal Solution? Yes, if we had sufficiently big data Do we often have enough data? Social media data is obviously big We use social media data in this discussion

Social Media Data Twitter Facebook Instagram 300 million users 500 million tweets / day 1% (5 million) released for research Facebook 2 billion users 422 million updates / day 196 million photos / day Instagram 700 million users 80 million photos / day Facebook Degree Distribution Thanks to Dr. Fred Morstatter Social media data is mainly user-generated with geo-spatial, pictorial, temporal, and social information. SM contains a lot of newly available information Instagram Users over Time

A Key to Success in Search of a Universal Solution Make “Big” Data Bigger

Making “Big” Data Bigger What is big data? A conventional answer is 4Vs A practitioner’s answer is more nuanced ‘Big’ data can be actually little or thin For machine learning or data mining to work, the more data, the better Make little data bigger Make thin data thicker 4Vs: Volume, Velocity, Variety, Veracity; and Value, Vulnerability A correct answer or philosopher’s answer is it depends, but depending on what? Use curse of dimensionality to increase data amount

Curse of Dimensionality: Required Samples Data sparsity becomes exponentially worse as feature dimensionality increases Conventional distance metric becomes ineffective as far and near neighbors have similar distances 3 samples per unit region 1 sample per region 1/3 sample per region http://nikhilbuduma.com/2015/03/10/the-curse-of-dimensionality/

Relevant, Redundant and Irrelevant Features Feature selection retains relevant features for learning and removes redundant or irrelevant ones For a binary classification task below, f1 is relevant, f2 is redundant given f1, and f3 is irrelevant Colors show two classes

Feature Selection Feature selection selects an ‘optimal’ subset of relevant features from the original high-dimensional data given a certain criterion feature selection

Feature Selection and scikit-feature Feature selection can make data `bigger’ Assuming all binary attribute values in our toy example Before FS, 5/210 = 5/1024, after FS, 5/23 = 5/8 Does FS always work? Yes, for most high-d data Where can we find it? scikit-feature, an open- source repository in Python 5/1024 < 0.5%, 5/8 > 50%

Making Thin Data Thicker Most people like many of us are in the long tail Our data is often thin or sparse With little data, machine learning is powerless Social media data offers new opportunities Multiple facets: posts, profile, linked information Multiple platforms that offer different functions Two case studies Feature selection using social network information Connecting users across more than one social media site For the first case, we use addition facet of data (or data variety) to help enrich data; for the second case, we use informtation from multiple sites to make data thicker

Use Link Information for Data Thickening Where can we find additional information for feature selection Social media data contains various types of data Link information is additional Other sources such as sentiment, like, etc. Are there theories to guide us in using link info? Social influence Homophily Extracting distinctive relations from linked data for feature selection Having additional information does not mean that we have to use it

Representation for Social Media Data And the following relations between users. Until now, we discuss the different between attribute-value data and social media data, we can see traditional feature selection is unequipped to social media data, that motivates our current work linkedFS. The problem of feature selection for linked social media data is relatively novel. Thus before going through the details of linkedFS, we would like to formally define the problem first. Social Context

Relation Extraction CoPost CoFollowing CoFollowed Following CoPost – posts belong to a user, CoFollowing - users follow the same person, CoFollowed - followed by the same people, Following – a user follows another user CoPost CoFollowing CoFollowed Following

Evaluation Results on Digg Data And the following relations between users. Until now, we discuss the different between attribute-value data and social media data, we can see traditional feature selection is unequipped to social media data, that motivates our current work linkedFS. The problem of feature selection for linked social media data is relatively novel. Thus before going through the details of linkedFS, we would like to formally define the problem first.

Summary LinkedFS is evaluated under varied circumstances to understand how it works Link information can help feature selection for social media data Unlabeled data is more often in social media, unsupervised learning is more sensible, but also more challenging An unsupervised method is showcased in our KDD12 paper following social correlation theories Jiliang Tang and Huan Liu. `` Unsupervised Feature Selection for Linked Social Media Data'', the Eighteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , 2012. Jiliang Tang, Huan Liu. ``Feature Selection with Linked Data in Social Media'', SIAM International Conference on Data Mining, 2012.

Gather More Data with Little Data Collectively, social media data is indeed big For an individual, however, the data is small How much activity data do we generate daily? How many posts did we post this week? How many friends do we have? When “big” social media data isn’t big, Searching for more data with little data We use different social media services for varied purposes LinkedIn, Facebook, Twitter, Instagram, YouTube, … Activity data: comment, like, share, retweet; 4Vs

An Example Can we connect individuals across sites? Reza Zafarani - Little data about an individual + Many social media sites - Partial Information + Complementary Information > Better User Profiles LinkedIn Twitter Age Location Education N/A Phoenix Area ASU (2014) N/A Tempe, AZ ASU - How hard can it be to identify an individual across sites? - Privacy Experts Claim Advertisers Know a lot about People - Can they stop showing you the same repetitive ads across sites? Connectivity is not available Consistency in Information Availability Can we connect individuals across sites? Reza Zafarani and Huan Liu. ``Connecting Users across Social Media Sites: A Behavioral-Modeling Approach", the Nineteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'2013), August 11 - 14, 2013. Chicago, Illinois.

Searching for More Data with Little Data Each social media site can have varied amount of user information Which information definitely exists for all sites? Usernames But, a user’s usernames on different sites can be different Our work is to connect the information of the same user provided across sites Why is it not an entity resolution problem? We define our problem as an identification problem and we solve it via machine learning

Our Behavior Generates Information Redundancy Information shared across sites provides a behavioral fingerprint How to capture and use differentiable attributes MOdeling Behavior for Identifying Users across Sites Behavioral Modeling Machine Learning MOBIUS

Time & Memory Limitation Personal Attributes & Traits Behaviors Human Limitation Time & Memory Limitation Knowledge Limitation Exogenous Factors Typing Patterns Language Patterns Endogenous Factors Personal Attributes & Traits Habits

Behavioral Modeling Approach with Learning Generates Captured Via Behavior 1 Behavior 2 Behavior n Information Redundancy Feature Set 1 Feature Set 2 Feature Set n Learning Framework Identification Function Data

Summary – Making Data Bigger Gathering more data is often necessary for effective data mining Reducing dimensionality can make data bigger Social media data provides unique opportunities to do so by using different sites and abundant user-generated content Traditionally available data can also be tapped to make thin data “thicker” Jundon Li, et al. ``Feature Selection: A Data Perspective", 2016. http://arxiv.org/abs/1601.07996 Reza Zafarani and Huan Liu. ``Connecting Users across Social Media Sites: A Behavioral-Modeling Approach", SIGKDD, 2013.

Big Data Is Often Not Sufficiently Big Can we always make our data bigger? Diminishing returns What do we sacrifice when making data bigger? Narrowing domains Deep learning is confined by available data Unexpected effectiveness of big data To accomplishing general AI, we need more and bigger data!

Repositories and Recent Books scikit-feature – an open source feature selection repository in Python Social Computing Repository Books: SMM and TDA (in Python) free download

http://dmml.asu.edu/smm/

Discovering Social Media Intelligence Graph Theories Network Measures and Models Data Mining, NLP, and Visual Analytics Community Detection and Analysis Information Diffusion Influence and Homophily Recommender Systems Behavior Analytics Sentiment analysis We use social media data to acquire intelligence about user behavior, user needs, sentiment, opinions, and trends. http://dmml.asu.edu/smm/

THANK YOU ALL & BigMine17 Organizers for this opportunity to share our research Acknowledgments Grants from NSF, ONR, ARO, among others DMML members and project leaders Collaborators: CMU (Minerva), CRA (IARPA-CAUSE) More information by searching for “Huan Liu” or at http://www.public.asu.edu/~huanliu CRA Charles River Aanlytics

More Challenges in Acquiring SM Intelligence Social media data is obviously big, but why are we often still short of data? How can we make data `bigger’? Data is power, so it can produce any result Can we algorithmically evaluate the results from big data? We don’t know what we don’t know How can we know if our result of social media analysis is of any value? Make thin data thicker? Can machine learning help? How?

Evaluation without Ground Truth http://dl.acm.org/citation.cfm?doid=2783419.2666680 is in both English and Chinese The CACM article can be found at dl.acm.org

Further Readings Jundong Li and Huan Liu. ``Challenges of Feature Selection for Big Data Analytics", Special Issue on Big Data, IEEE Intelligent Systems. 32 (2), 9-15. 2017 Fred Morstatter and Huan Liu. ``A Novel Measure for Coherence in Statistical Topic Models", Association of Computational Linguistics (ACL), August 2016. Berlin, Germany Reza Zafarani and Huan Liu. ``Evaluation without Ground Truth in Social Media Research", Communications of ACM, Volume 58 Issue 6, June 2015 Pages 54-60. Lei Tang and Huan Liu. "Community Detection and Mining in Social Media", Morgan & Claypool Publishers, September 2010.