Towards Reliable Hypothesis Validation in Social Sensing Applications


Towards Reliable Hypothesis Validation in Social Sensing Applications
Dong Wang, Daniel Zhang, Chao Huang
Department of Computer Science and Engineering, University of Notre Dame
SECON 18, Hong Kong, China
Speaker notes: In this presentation, I am going to talk about the truth estimation, or truth discovery, problem in crowdsourcing applications. Let me first give you a brief idea of what we mean by truth estimation in crowdsourcing applications. One important feature of crowdsourcing applications is that the data collection paradigm is open to all, so anybody can contribute data. How to ascertain the correctness of the observations contributed by the crowd therefore becomes a critical challenge. We call this challenge truth estimation. The problem is made difficult by the fact that there is usually no prior knowledge of source reliability: any concerned citizen can upload measurements, and it is impossible to screen all sources beforehand. In addition, the time scale of crowdsourcing applications can be as small as a few hours or days, which does not provide enough history for a normal reputation system to converge. In this talk, we will introduce a new recursive fact-finding approach we developed to solve the truth estimation problem with explicit consideration of streaming data in crowdsourcing applications. The problem becomes even harder when we consider real-time streaming data, because data arrive and disappear quickly, which leaves little time and history for analyzing source and information credibility. In this paper we take a first step towards addressing this challenging problem.

Sensing is Evolving
Platform: Smart Phone. Sensors are increasingly used by everyday people.
Application: Social (Human-Centric) Sensing is Emerging! Humans are getting into the loop of sensing.
Examples: Health Monitoring, Geotagging, Target Tracking, Environment Monitoring, Social Sensing, Smart House

Social Sensing: Human + Cyber + Physical
A set of applications where data are collected from human sources or devices on their behalf.
An Emerging Paradigm of Cyber-Physical Systems with Human-in-the-loop
Examples: Twitter Mood Predicts Stock Market, 2011; Help Pilgrims utilize schedule in Hajj, 2012; FourSquare helps blind people navigate, 2012; Japan Tsunami and Nuclear Event, 2011

Why Social Sensing? A Confluence of Three Trends
Sensors (Smart Phone, GPS, Smart Meter); Connectivity (Cell-phones, Cars on Internet); Mass Dissemination Media
Sensors in platforms: http://mobiledeviceinsight.com/2011/12/sensors-in-smartphones/
Speaker notes: Social sensing has become an emerging area of research because of three recent technical trends. First, with the appearance of smartphones and other digital devices, various kinds of sensors are integrated on a common platform that everyday people can use. For example, our smartphones carry proximity sensors, GPS, an accelerometer, a compass, a microphone and a camera. The second trend is connectivity: with the rapid development of wireless and mobile communication technology such as 4G and WiMAX, we are now able to upload our data almost anywhere at any time. The last trend is the advent of mass dissemination media such as Twitter, Facebook and Flickr, which let people share what they have observed in a timely fashion with a much larger population than ever before.

Truth Discovery in Social Sensing
Who to believe? What to believe?
Sources: People, Smart Devices
Measurements (Claims): Text, Numeric data, Images
Goal: Reliable Information for Decision Support!
Speaker notes: In this paper, we explicitly look at the problem of how to exploit physical constraints for reliable social sensing. Big question: why are the challenges unique to social sensing, and why can previous techniques not be used to solve them? My idea: sources are in general unvetted and their reliability is unknown, and there is no independent way to verify the correctness of their claims. We jointly estimate both source reliability and information correctness, and at the same time we provide the confidence of our estimation.

Our Problem: Reliable Hypothesis Validation

Related Work
Truth Discovery: Basic Model [1] (IPSN 12); Recursive Model [2]; Source Dependency [3, 4] (IPSN 14, 16); Dynamic and Scalable Model [5] (ICDCS 17); Reliable Hypothesis Validation [6] (SECON 18)
1. Dong Wang, Lance Kaplan, Hieu Le, and Tarek Abdelzaher. "On Truth Discovery in Social Sensing: A Maximum Likelihood Estimation Approach." IPSN 12, Beijing, China, April 2012.
2. Dong Wang, Tarek Abdelzaher, Lance Kaplan and Charu C. Aggarwal. "Recursive Fact-finding: A Streaming Approach to Truth Estimation in Crowdsourcing Applications." ICDCS 13, Philadelphia, PA, July 2013.
3. Dong Wang, Tarek Abdelzaher and Lance Kaplan. "Humans as Sensors: An Estimation Theoretic Perspective." IPSN 14, Berlin, Germany, April 2014.
4. Chao Huang, Dong Wang. "Topic-Aware Social Sensing with Arbitrary Source Dependency Graphs." IPSN 16, Vienna, Austria, April 2016.
5. Daniel Zhang, Chao Zhang, Dong Wang, Doug Thain, Xin Mu, Greg Madey and Chao Huang. "Towards Scalable and Dynamic Social Sensing Using A Distributed Computing Framework." ICDCS 17, Atlanta, GA, USA.
6. Dong Wang, Daniel Zhang, Chao Huang*. "Towards Reliable Hypothesis Validation in Social Sensing Applications." SECON 18, Hong Kong, June 2018.
Speaker notes: If you are interested in working on the reliable sensing problem, here is some related work for your reference.

Technical Challenges
Challenge 1: Hypothesis-Claim Matching. How to match the high-level hypotheses generated by end users to the relevant low-level claims generated by social sensors?
Challenge 2: Hypothesis Validation. How to reliably validate the truthfulness of the hypotheses from the estimated truthfulness of the claims?

Basic Definitions Sources: Claims: Hypotheses: Claim Truthfulness Vector: Hypothesis Truthfulness Vector:

Basic Definitions
Source Claim Matrix: SC (M by N), where M is the number of sources and N is the number of claims.
SC(i, j) = 1 if source Si reports claim Cj
SC(i, j) = 0 if source Si does not report claim Cj
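
As an illustration of the source-claim matrix defined above, here is a minimal Python sketch that builds SC from a list of (source, claim) report pairs. The function name `build_source_claim_matrix` and the toy reports are assumptions made for this example and do not come from the slides.

```python
def build_source_claim_matrix(reports, num_sources, num_claims):
    """Build the M-by-N matrix SC where SC[i][j] = 1 if source i reports claim j.

    `reports` is an iterable of (source_index, claim_index) pairs (hypothetical input format).
    """
    sc = [[0] * num_claims for _ in range(num_sources)]
    for source_index, claim_index in reports:
        sc[source_index][claim_index] = 1
    return sc


# Toy data: 3 sources, 3 claims.
reports = [(0, 0), (0, 2), (1, 0), (2, 1), (2, 2)]
SC = build_source_claim_matrix(reports, num_sources=3, num_claims=3)
for row in SC:
    print(row)
```

A dense list of lists is enough for a toy case; a real trace with many sources and claims would probably call for a sparse representation.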

Basic Definitions
Claim Hypothesis Matrix: CH (N by K), where N is the number of claims and K is the number of hypotheses.
Each entry CH(j, k) is the degree of correlation between claim Cj and hypothesis Hk (for example, 0.7).

Our Goal Output: Hypothesis Truthfulness Estimated Claim Truthfulness

Solution: Reliable Hypothesis Validation (RHV)
1. Topic Identification from Claims
2. Hypothesis Claim Matching
3. Optimal Hypothesis Validation

RHV: Topic Identification from Claims
Objective: Identify important topics that provide clues to help end users generate relevant hypotheses.
Approach: Topic Modeling and Gibbs Sampling Algorithm.
Output: T topics, each associated with a list of words that are strongly correlated with it.
Claim distribution over topics: we assume each claim is associated with a distribution over all topics, denoted as \theta.
Topic distribution over words: each topic is associated with a distribution over all the words in the vocabulary, denoted as \phi.
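
The slide names topic modeling with Gibbs sampling but does not give the sampler. The sketch below is a generic collapsed Gibbs sampler for an LDA-style topic model, written with the standard library only. It is an assumption about what such a step could look like, not the authors' implementation. The hyperparameters `alpha` and `beta`, the iteration count, and the toy claims are all illustrative.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (theta, phi, vocab).

    theta[d][t]: distribution of claim d over topics.
    phi[t][v]:   distribution of topic t over vocabulary words.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]       # counts per (claim, topic)
    topic_word = [[0] * V for _ in range(num_topics)]  # counts per (topic, word)
    topic_total = [0] * num_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        assign.append([])
        for w in doc:
            t = rng.randrange(num_topics)
            assign[d].append(t)
            doc_topic[d][t] += 1
            topic_word[t][word_id[w]] += 1
            topic_total[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v, t = word_id[w], assign[d][n]
                # Remove the token, then resample its topic from the conditional.
                doc_topic[d][t] -= 1
                topic_word[t][v] -= 1
                topic_total[t] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][n] = t
                doc_topic[d][t] += 1
                topic_word[t][v] += 1
                topic_total[t] += 1
    theta = [[(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
              for k in range(num_topics)] for d, doc in enumerate(docs)]
    phi = [[(topic_word[k][v] + beta) / (topic_total[k] + V * beta)
            for v in range(V)] for k in range(num_topics)]
    return theta, phi, vocab


# Hypothetical tokenized claims (not taken from the evaluation data).
claims = [
    "police close street after shooting".split(),
    "shooting reported near campus police respond".split(),
    "protest march blocks downtown traffic".split(),
    "traffic delays as protest continues downtown".split(),
]
theta, phi, vocab = lda_gibbs(claims, num_topics=2)
for t, row in enumerate(phi):
    top = sorted(range(len(vocab)), key=lambda v: -row[v])[:3]
    print("topic", t, [vocab[v] for v in top])
```

Each row of `theta` and `phi` is a probability distribution, matching the roles of \theta and \phi on the slide. Which words end up in which topic depends on the random seed and the data.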

RHV: Hypothesis Claim Matching
Objective: Match the hypothesis from end users to the most relevant claims that can be used to validate its correctness.
Approach: Compute the similarity between a hypothesis and each claim.
Semantic Similarity (words), computed from a semantic vector
Syntactic Similarity (order of words), computed from a syntactic vector
Overall Claim Hypothesis Similarity, combining the two
WordNet: https://nlpforhackers.io/wordnet-sentence-similarity/
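
To make the combination of semantic and syntactic similarity concrete, here is a simplified sketch. It makes several assumptions that are not in the slides: word matching is exact string equality instead of a WordNet lookup, the word-order score is a normalized difference of position vectors, and the overall score is a linear mix controlled by a hypothetical weight `delta`.

```python
import math


def semantic_vector(words, joint_words):
    """1.0 where a joint-set word occurs in the sentence, else 0.0 (exact match only)."""
    present = set(words)
    return [1.0 if w in present else 0.0 for w in joint_words]


def order_vector(words, joint_words):
    """1-based position of each joint-set word in the sentence, 0.0 if absent."""
    first_position = {}
    for index, w in enumerate(words, start=1):
        first_position.setdefault(w, index)
    return [float(first_position.get(w, 0)) for w in joint_words]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def order_similarity(r1, r2):
    """1 - ||r1 - r2|| / ||r1 + r2||: equals 1 when the word order is identical."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / total if total else 0.0


def overall_similarity(claim, hypothesis, delta=0.85):
    """Linear mix of semantic and syntactic similarity; delta is an illustrative weight."""
    c_words, h_words = claim.lower().split(), hypothesis.lower().split()
    joint_words = sorted(set(c_words) | set(h_words))
    semantic = cosine(semantic_vector(c_words, joint_words),
                      semantic_vector(h_words, joint_words))
    syntactic = order_similarity(order_vector(c_words, joint_words),
                                 order_vector(h_words, joint_words))
    return delta * semantic + (1.0 - delta) * syntactic


print(overall_similarity("police closed the bridge", "the bridge was closed by police"))
```

Replacing the exact-match test in `semantic_vector` with a WordNet-based word similarity would bring the sketch closer to the approach named on the slide.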

RHV: Hypothesis Claim Matching
Approach: Critical Claim Selection, a multi-objective optimization with constraints.
Objective 1: Maximize the relevance between claims and hypotheses.
Objective 2: Minimize the dependency between claims.
Solve the multi-objective optimization problem using a linear combination of the two objectives.
With the above definitions, the goal of critical claim selection is to identify a set of critical claims (denoted by C*) that are relevant to a hypothesis set H by maximizing their relevance scores and minimizing their dependency scores.
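
The slides do not show how the combined objective is solved. As one possible reading, the sketch below scalarizes the two objectives with a hypothetical weight `lam` and picks claims greedily under a size budget. The greedy strategy, the budget constraint, and all numbers are assumptions for illustration, and a greedy pass does not guarantee an optimal set.

```python
def select_critical_claims(relevance, dependency, budget, lam=0.5):
    """Greedy selection that trades relevance against dependency on already chosen claims.

    relevance[j]:     relevance score of claim j to the hypothesis set.
    dependency[j][l]: dependency score between claims j and l.
    lam:              weight of the linear combination (illustrative value).
    """
    selected = []
    candidates = set(range(len(relevance)))

    def gain(j):
        overlap = sum(dependency[j][s] for s in selected)
        return lam * relevance[j] - (1.0 - lam) * overlap

    while candidates and len(selected) < budget:
        best = max(sorted(candidates), key=gain)
        if gain(best) <= 0:
            break  # adding any remaining claim would hurt the combined objective
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy scores: claims 0 and 1 are both relevant but almost duplicates of each other.
relevance = [0.9, 0.85, 0.4, 0.1]
dependency = [
    [0.0, 0.95, 0.1, 0.0],
    [0.95, 0.0, 0.1, 0.0],
    [0.1, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
critical = select_critical_claims(relevance, dependency, budget=2)
print(critical)
```

In this toy case the near-duplicate claim 1 is skipped in favor of the less relevant but independent claim 2, which is the behavior the two objectives are meant to produce.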

RHV: Optimal Hypothesis Validation
Objective: Validate the truthfulness of hypotheses from the estimated truthfulness of the identified critical and relevant claims.
Approach: Reliable Hypothesis Validation Scheme
Claim Truthfulness Estimation, using Truth Discovery Solutions
Optimal Hypothesis Validation

An Example of Truth Discovery Solutions: Expectation Maximization
Observed data X: Sensing Observations
Hidden variable Z = {z1, z2, ..., zN}: Correctness
Estimation parameter
Apply EM:
Expectation Step (E-step)
Maximization Step (M-step)
Find the MLE of the estimation parameter and the values of the hidden variables.
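
The E-step and M-step are not spelled out on the slide. The sketch below follows a common maximum-likelihood formulation for binary claims, with the parameterization assumed here rather than taken from the slide: `a[i]` is the probability that source i reports a claim given the claim is true, `b[i]` the same given the claim is false, and `d` is the prior that a claim is true. The initialization and the toy matrix are illustrative.

```python
def em_truth_discovery(sc, iters=50, eps=1e-6):
    """EM over a 0/1 source-claim matrix; returns (z, a, b, d).

    z[j] is the posterior probability that claim j is true (the hidden variable).
    """
    m, n = len(sc), len(sc[0])

    def clamp(p):
        return min(1.0 - eps, max(eps, p))

    rate = [sum(row) / n for row in sc]
    a = [clamp(r) for r in rate]        # start with a[i] > b[i] to break symmetry
    b = [clamp(0.5 * r) for r in rate]
    d = 0.5
    z = [0.5] * n
    for _ in range(iters):
        # E-step: posterior of each claim being true given current parameters.
        for j in range(n):
            like_true, like_false = d, 1.0 - d
            for i in range(m):
                like_true *= a[i] if sc[i][j] else 1.0 - a[i]
                like_false *= b[i] if sc[i][j] else 1.0 - b[i]
            z[j] = like_true / (like_true + like_false)
        # M-step: re-estimate the parameters from the posteriors.
        z_sum = sum(z)
        for i in range(m):
            a[i] = clamp(sum(z[j] for j in range(n) if sc[i][j]) / max(z_sum, eps))
            b[i] = clamp(sum(1.0 - z[j] for j in range(n) if sc[i][j]) / max(n - z_sum, eps))
        d = clamp(z_sum / n)
    return z, a, b, d


# Toy matrix: sources 0-2 corroborate claims 0-2; source 3 alone reports claims 3-4.
sc_example = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
]
z_hat, a_hat, b_hat, d_hat = em_truth_discovery(sc_example)
print([round(p, 3) for p in z_hat])
```

The products of probabilities are fine for a handful of sources; with many sources the E-step would be better computed in log space to avoid underflow.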

RHV: Optimal Hypothesis Validation
Optimization Formulation
Approach: Weighted Mean Algorithm
CH Matrix
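
The optimization formulation itself is not recoverable from the slide text. As a rough sketch of what a weighted mean over the CH matrix could look like, the code below scores each hypothesis as the CH-weighted average of the estimated claim truthfulness values. The exact weighting, the 0.5 fallback for hypotheses with no matching claims, and the toy numbers are assumptions.

```python
def validate_hypotheses(claim_truth, ch):
    """Score hypothesis k as sum_j CH[j][k] * claim_truth[j] / sum_j CH[j][k].

    claim_truth[j]: estimated truthfulness of claim j in [0, 1].
    ch[j][k]:       degree of correlation between claim j and hypothesis k.
    """
    num_claims, num_hypotheses = len(ch), len(ch[0])
    scores = []
    for k in range(num_hypotheses):
        weight = sum(ch[j][k] for j in range(num_claims))
        weighted = sum(ch[j][k] * claim_truth[j] for j in range(num_claims))
        # With no correlated claim the hypothesis stays undetermined (assumed fallback).
        scores.append(weighted / weight if weight > 0 else 0.5)
    return scores


claim_truth = [1.0, 0.0, 1.0]
CH = [
    [0.7, 0.0],
    [0.3, 0.2],
    [0.0, 0.8],
]
hypothesis_scores = validate_hypotheses(claim_truth, CH)
print(hypothesis_scores)
```

A hypothesis could then be labeled true when its score exceeds a chosen threshold, although the slides do not say which decision rule RHV uses.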

Evaluation: A Real World Application
Unreliable and Noisy Tweets
Unreliable and Hypothesis-ignorant Users
Data traces: Paris Charlie Hebdo Attack, Nov. 2015; Oregon Shooting, Oct. 2015; Baltimore Riots, April, 2015

Evaluation: Data Collection
http://apollo.cse.nd.edu/index.html
Keywords/Location
RHV is integrated as an option for data analysis

Evaluation: Real-World Application
Data Trace Statistics:
Hypothesis Set Generation:
5 independent individuals serve as end users.
Each individual generated 30 hypotheses for each dataset.
The hypothesis set was cleaned up by removing redundant and non-conclusive hypotheses.
Ground truth labels were collected manually for evaluation purposes.

Evaluation: Performance Comparison (1/3)
Paris Attack Data Trace (2015)
The performance gains of the RHV scheme are mainly achieved by i) the critical claim selection scheme, which identifies the most relevant claims for a given hypothesis, and ii) the optimality of the hypothesis validation algorithm, which explicitly considers the complex relationship between the hypotheses and the claims, as discussed in Section IV.
Similar results are observed in the other two datasets.

Evaluation: Performance Comparison (2/3)

Evaluation: Performance Comparison (3/3)
Execution Time Comparison
RHV (our approach) is among the fastest of the compared schemes across different datasets.

Future Work
Explore more comprehensive claim and hypothesis matching approaches.
Consider a hierarchical structure from claims to hypotheses.
Explore logical relationships between hypotheses.
Validate the developed models on applications beyond Twitter.

Conclusion
This paper formulates a new hypothesis validation problem in social sensing.
A reliable hypothesis validation (RHV) framework is developed to address two technical challenges: claim-hypothesis matching and hypothesis correctness validation.
The framework is evaluated using real-world social sensing data collected from Twitter feeds.

Thank You!
Social Sensing Lab, University of Notre Dame
http://www3.nd.edu/~sslab/
dwang5@nd.edu