Towards Reliable Hypothesis Validation in Social Sensing Applications. Dong Wang, Daniel Zhang, Chao Huang. Department of Computer Science and Engineering, University of Notre Dame. SECON 18, Hong Kong, China
In this presentation, I am going to talk about the truth estimation, or truth discovery, problem in crowdsourcing applications. Let me first give you a brief idea of what we mean by truth estimation in crowdsourcing applications. One important feature of crowdsourcing applications is that the data collection paradigm is open to all, so anybody can contribute data. How to ascertain the correctness of the observations contributed by the crowd therefore becomes a critical challenge. We call this challenge truth estimation. The problem is made difficult by the fact that there is usually no prior knowledge of source reliability: any concerned citizen can upload measurements, and it is impossible to screen all sources beforehand. In addition, the time scale of crowdsourcing applications can be as short as a few hours or days, which does not provide enough history for a normal reputation system to converge. In this talk, we will introduce a new recursive fact-finding approach we developed to solve the truth estimation problem with explicit consideration of streaming data in crowdsourcing applications. The problem becomes even harder when we consider real-time streaming data, because data arrive and expire quickly, which leaves little time and history for analyzing source and information credibility. In this paper we take a first step towards addressing this challenging problem.
Sensing is Evolving
Sensing is Evolving. Platform: Smart Phone. Sensors are increasingly used by everyday people.
Sensing is Evolving. Platform: Smart Phone. Sensors are increasingly used by everyday people. Application: Social (Human-Centric) Sensing is Emerging! Humans are getting into the Loop of Sensing: Health Monitoring, Geotagging, Target Tracking, Environment Monitoring, Social Sensing, Smart House.
Social Sensing: Human + Cyber + Physical. A set of applications where data are collected from human sources or devices on their behalf. An Emerging Paradigm of Cyber-Physical Systems with Human-in-the-loop. Examples: Twitter Mood Predicts Stock Market, 2011; Help Pilgrims utilize schedule in Hajj, 2012; FourSquare helps blind people navigate, 2012; Japan Tsunami and Nuclear Event, 2011.
Why Social Sensing? A Confluence of Three Trends: Sensors, Connectivity, Mass Dissemination Media. Pictured examples: Smart Phone, Cars on Internet, Smart Meter, GPS, Cell-phones. Sensors in platforms: http://mobiledeviceinsight.com/2011/12/sensors-in-smartphones/
Social sensing is becoming an emerging area of research because of three recent technical trends. First, with the appearance of smartphones and other digital devices, various kinds of sensors are integrated on a common platform that everyday people can use. For example, our smartphones contain proximity sensors, GPS, an accelerometer, a compass, a microphone and a camera. The second trend is connectivity: with the rapid development of wireless and mobile communication technologies such as 4G and WiMAX, we are now able to upload our data almost anywhere at any time. The last trend is the advent of mass dissemination media such as Twitter, Facebook and Flickr, which let people share what they have observed in a timely fashion with a much larger population than ever before.
Truth Discovery in Social Sensing. Who to believe? What to believe? Sources: People, Smart Devices. Measurements (Claims): Text, Numeric data, Images. Reliable Information for Decision Support!
In this paper, we explicitly look at the problem of how to explore the physical constraints for reliable social sensing. Big Question: Why are the challenges unique in social sensing? Why can previous techniques not be used to solve this challenge? My idea: Sources are in general unvetted by applications and their reliability is unknown. There is no independent way to verify the correctness of their claims. We jointly estimate both source reliability and information correctness and, at the same time, provide the confidence of our estimation.
Our Problem: Reliable Hypothesis Validation
Related Work
Truth Discovery: Basic Model 1 (IPSN 12); Recursive Model 2; Source Dependency 3,4 (IPSN 14, 16); Dynamic and Scalable Model 5 (ICDCS 17); Reliable Hypothesis Validation 6 (SECON 18).
1. Dong Wang, Lance Kaplan, Hieu Le, and Tarek Abdelzaher. "On Truth Discovery in Social Sensing: A Maximum Likelihood Estimation Approach." IPSN 12, Beijing, China, April 2012.
2. Dong Wang, Tarek Abdelzaher, Lance Kaplan, and Charu C. Aggarwal. "Recursive Fact-finding: A Streaming Approach to Truth Estimation in Crowdsourcing Applications." ICDCS 13, Philadelphia, PA, July 2013.
3. Dong Wang, Tarek Abdelzaher, and Lance Kaplan. "Humans as Sensors: An Estimation Theoretic Perspective." IPSN 14, Berlin, Germany, April 2014.
4. Chao Huang and Dong Wang. "Topic-Aware Social Sensing with Arbitrary Source Dependency Graphs." IPSN 16, Vienna, Austria, April 2016.
5. Daniel Zhang, Chao Zhang, Dong Wang, Doug Thain, Xin Mu, Greg Madey, and Chao Huang. "Towards Scalable and Dynamic Social Sensing Using A Distributed Computing Framework." ICDCS 17, Atlanta, GA, USA.
6. Dong Wang, Daniel Zhang, and Chao Huang*. "Towards Reliable Hypothesis Validation in Social Sensing Applications." SECON 18, Hong Kong, June 2018.
If you are interested in working on the reliable sensing problem, here is some related work for your reference.
Technical Challenges. Challenge 1: Hypothesis-Claim Matching. How to match the high-level hypotheses generated by end users to the relevant low-level claims generated by social sensors? Challenge 2: Hypothesis Validation. How to reliably validate the truthfulness of the hypotheses from the estimated truthfulness of the claims?
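Basic Definitions. Sources: S1, ..., SM. Claims: C1, ..., CN. Hypotheses: H1, ..., HK. Claim Truthfulness Vector: one entry per claim (length N). Hypothesis Truthfulness Vector: one entry per hypothesis (length K).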
Basic Definitions. Source Claim Matrix: SC (M by N). M: Number of sources; N: Number of claims. SC_{i,j} = 1 if source Si reports claim Cj, and SC_{i,j} = 0 if source Si does not report claim Cj.
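The Source Claim Matrix can be illustrated with a minimal Python sketch. The function name `build_source_claim_matrix` and the toy `reports` list are hypothetical and used only for illustration; the slides do not prescribe a data format.

```python
def build_source_claim_matrix(reports, num_sources, num_claims):
    """Return an M x N matrix with SC[i][j] = 1 iff source i reported claim j."""
    sc = [[0] * num_claims for _ in range(num_sources)]
    for source, claim in reports:
        sc[source][claim] = 1
    return sc


# Hypothetical toy input: (source index, claim index) pairs.
reports = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 1)]
SC = build_source_claim_matrix(reports, num_sources=3, num_claims=3)

for row in SC:
    print(row)
```

Each row corresponds to one source and each column to one claim, so a row sum is the number of claims a source made and a column sum is the number of sources supporting a claim.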
Basic Definitions. Claim Hypothesis Matrix: CH (N by K). N: Number of claims; K: Number of hypotheses. CH_{j,k}: degree of correlation between claim Cj and hypothesis Hk (for example, 0.7).
Our Goal Output: Hypothesis Truthfulness Estimated Claim Truthfulness
Solution: Reliable Hypothesis Validation (RHV) 1. Topic Identification from Claims 2. Hypothesis Claim Matching 3. Optimal Hypothesis Validation
RHV: Topic Identification from Claims. Objective: Identify important topics that provide clues to help end users generate relevant hypotheses. Approach: Topic Modeling and Gibbs Sampling Algorithm. Output: T topics, each associated with a list of words that are strongly correlated with that topic. We assume each claim is associated with a distribution over all topics, denoted \theta (claim distribution over topics). Each topic is associated with a distribution over all the words in the vocabulary, denoted \phi (topic distribution over words).
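The slide names topic modeling with Gibbs sampling but does not give the model details. As a rough illustration only, here is a collapsed Gibbs sampler for an LDA-style topic model, which is an assumption about the model family. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy vocabulary are all hypothetical.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]               # counts per claim/topic
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts per topic/word
    topic_total = [0] * num_topics
    assign = []
    # Random initial topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(num_topics)
            z_d.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assign.append(z_d)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment, then resample it.
                t = assign[d][n]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][n] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    # theta: claim distribution over topics; phi: topic distribution over words.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(c + beta) / (topic_total[t] + vocab_size * beta) for c in topic_word[t]]
        for t in range(num_topics)
    ]
    return theta, phi


# Hypothetical toy vocabulary and claims (as word ids).
vocab = ["shooting", "police", "suspect", "riot", "protest", "curfew"]
docs = [[0, 1, 2], [0, 2, 1, 1], [3, 4, 5], [4, 5, 3, 3]]
theta, phi = lda_gibbs(docs, num_topics=2, vocab_size=len(vocab))
```

Sorting each row of `phi` in descending order gives the word list per topic that the slide describes as the output.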
RHV: Hypothesis Claim Matching. Objective: Match the hypothesis from end users to the most relevant claims that can be used to validate its correctness. Approach: Compute the similarity between the hypothesis and each claim. Semantic Similarity (words), based on the semantic vector. Syntactic Similarity (order of words), based on the syntactic vector. Overall Claim Hypothesis Similarity. WordNet: https://nlpforhackers.io/wordnet-sentence-similarity/
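The slide does not spell out how the two similarities are combined. One common formulation in the sentence-similarity literature is a weighted sum of a word-based score and a word-order score, sketched below. Exact word matching stands in for a WordNet-based word similarity, and the weight `delta` and all function names are assumptions of this sketch.

```python
import math


def _joint_words(a, b):
    """Ordered list of distinct words appearing in either token list."""
    seen = []
    for w in a + b:
        if w not in seen:
            seen.append(w)
    return seen


def semantic_similarity(a, b):
    """Cosine similarity of binary word vectors (stand-in for WordNet scores)."""
    joint = _joint_words(a, b)
    va = [1.0 if w in a else 0.0 for w in joint]
    vb = [1.0 if w in b else 0.0 for w in joint]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


def word_order_similarity(a, b):
    """1 - ||ra - rb|| / ||ra + rb||, with r holding 1-based word positions."""
    joint = _joint_words(a, b)
    ra = [a.index(w) + 1 if w in a else 0 for w in joint]
    rb = [b.index(w) + 1 if w in b else 0 for w in joint]
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(ra, rb)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(ra, rb)))
    return 1.0 - diff / total if total else 0.0


def claim_hypothesis_similarity(claim, hypothesis, delta=0.85):
    """Weighted sum of semantic and word-order similarity (delta is assumed)."""
    a, b = claim.lower().split(), hypothesis.lower().split()
    return delta * semantic_similarity(a, b) + (1 - delta) * word_order_similarity(a, b)


same = claim_hypothesis_similarity("police shot suspect", "police shot suspect")
swapped = claim_hypothesis_similarity("suspect shot police", "police shot suspect")
unrelated = claim_hypothesis_similarity("police shot", "curfew lifted")
```

The swapped example shows why the order term matters: both sentences contain the same words, so only the word-order score separates them.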
RHV: Hypothesis Claim Matching. Approach: Critical Claim Selection. Maximize the relevance between claims and hypothesis. Minimize the dependency between claims. Solve the multi-objective optimization problem (with constraints) using a linear combination of the objectives. With the above definitions, our goal of critical claim selection is to identify a set of critical claims (denoted by C*) which are relevant to a hypothesis set H by maximizing their relevance scores and minimizing their dependency scores.
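To make the linear combination concrete, the sketch below scores a candidate set as total relevance minus `lam` times total pairwise dependency, and builds the set greedily. The greedy heuristic, the trade-off weight `lam`, the `budget` constraint and the toy scores are assumptions for illustration; the slides do not state which solver is used.

```python
def select_critical_claims(relevance, dependency, budget, lam=0.5):
    """Greedily pick claims with high relevance and low dependency on picked ones.

    relevance[j]: relevance of claim j to the hypothesis.
    dependency[j][s]: dependency score between claims j and s.
    """
    selected = []
    remaining = set(range(len(relevance)))
    while remaining and len(selected) < budget:
        def gain(j):
            # Marginal change of the combined objective if claim j is added.
            return relevance[j] - lam * sum(dependency[j][s] for s in selected)

        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0:
            break  # No remaining claim improves the objective.
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical scores: claims 0 and 1 are near-duplicates (for example, retweets).
relevance = [0.9, 0.85, 0.4, 0.1]
dependency = [
    [0.0, 0.95, 0.1, 0.0],
    [0.95, 0.0, 0.1, 0.0],
    [0.1, 0.1, 0.0, 0.2],
    [0.0, 0.0, 0.2, 0.0],
]
critical = select_critical_claims(relevance, dependency, budget=2, lam=1.0)
print(critical)
```

In this toy case the second most relevant claim is skipped because it depends heavily on the first, so a less relevant but independent claim is selected instead.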
RHV: Optimal Hypothesis Validation. Objective: Validate the truthfulness of hypotheses from the estimated truthfulness of the identified critical and relevant claims. Approach: Claim Truthfulness Estimation (Truth Discovery Solutions), followed by Optimal Hypothesis Validation (Reliable Hypothesis Validation Scheme).
An Example of Truth Discovery Solutions: Expectation Maximization. Observed data X: Sensing Observations. Hidden Variable Z = {z1, z2, ..., zN}: Correctness. Estimation parameter. Apply EM: Expectation Step (E-step), Maximization Step (M-step). Find the MLE of the estimation parameter and the values of the hidden variables.
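A minimal sketch of EM-based truth discovery on a binary source-claim matrix follows, in the spirit of the maximum likelihood approach cited as reference 1 in Related Work. The parameter names are assumptions of this sketch: `a[i]` is the probability that source i reports a claim given the claim is true, `b[i]` the same given the claim is false, and `d` the prior that a claim is true. Initialising by simple voting and clamping with `eps` are also choices made here.

```python
import math


def em_truth_discovery(sc, iters=20, eps=1e-6):
    """Return (z, a, b): claim truth probabilities and per-source parameters."""
    m, n = len(sc), len(sc[0])

    def clamp(p):
        return min(max(p, eps), 1.0 - eps)

    # Initialise the hidden variables with the fraction of sources per claim.
    z = [sum(sc[i][j] for i in range(m)) / m for j in range(n)]
    a = b = None
    for _ in range(iters):
        # M-step: re-estimate source parameters from current beliefs z.
        total = sum(z)
        a, b = [], []
        for i in range(m):
            reported = sum(z[j] for j in range(n) if sc[i][j])
            k_i = sum(sc[i])  # number of claims source i reported
            a.append(clamp(reported / max(total, eps)))
            b.append(clamp((k_i - reported) / max(n - total, eps)))
        d = clamp(total / n)
        # E-step: posterior probability that each claim is true.
        for j in range(n):
            log_true, log_false = math.log(d), math.log(1.0 - d)
            for i in range(m):
                if sc[i][j]:
                    log_true += math.log(a[i])
                    log_false += math.log(b[i])
                else:
                    log_true += math.log(1.0 - a[i])
                    log_false += math.log(1.0 - b[i])
            diff = min(log_false - log_true, 700.0)  # avoid overflow in exp
            z[j] = 1.0 / (1.0 + math.exp(diff))
    return z, a, b


# Hypothetical toy data: sources 0-2 mostly agree; source 3 reports alone.
sc_matrix = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
z, a, b = em_truth_discovery(sc_matrix)
```

Note that source reliability and claim correctness are estimated jointly: neither is known in advance, and each step refines one given the other.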
RHV: Optimal Hypothesis Validation. Optimization Formulation. Approach: Weighted Mean Algorithm (CH Matrix).
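As a rough illustration of a weighted mean over the CH matrix, the sketch below scores each hypothesis by averaging the estimated claim truthfulness, weighted by the claim-hypothesis correlation. This is a plain weighted mean under assumed inputs, not the optimization formulation from the paper; the function name and the toy numbers are hypothetical.

```python
def validate_hypotheses(claim_truth, ch):
    """Weighted mean of claim truthfulness per hypothesis.

    claim_truth[j]: estimated truthfulness of claim j (length N).
    ch[j][k]: degree of correlation between claim j and hypothesis k (N by K).
    Returns one score per hypothesis, or None when no claim is matched to it.
    """
    num_hypotheses = len(ch[0])
    scores = []
    for k in range(num_hypotheses):
        weight = sum(row[k] for row in ch)
        if weight == 0:
            scores.append(None)  # no evidence for this hypothesis
            continue
        weighted = sum(row[k] * t for row, t in zip(ch, claim_truth))
        scores.append(weighted / weight)
    return scores


# Hypothetical inputs: three claims, three hypotheses.
claim_truth = [1.0, 0.0, 0.5]
CH = [
    [0.7, 0.0, 0.0],
    [0.3, 0.5, 0.0],
    [0.0, 0.5, 0.0],
]
hypothesis_scores = validate_hypotheses(claim_truth, CH)
```

A threshold on each score (for example 0.5, which is an assumption) could then turn the scores into true/false decisions.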
Evaluation: A Real World Application. Unreliable and Noisy Tweets. Unreliable and Hypothesis-ignorant Users. Paris Charlie Hebdo Attack, Nov. 2015; Oregon Shooting, Oct. 2015; Baltimore Riots, April 2015.
Evaluation: Data Collection. http://apollo.cse.nd.edu/index.html Keywords/Location. RHV is integrated as an option for data analysis.
Evaluation: Real-World Application. Data Trace Statistics: Hypothesis Set Generation: 5 independent individuals serve as end users. Each individual generated 30 hypotheses for each dataset. The hypothesis set is cleaned up by removing redundant and non-conclusive hypotheses. Ground truth labels are collected manually for evaluation purposes.
Evaluation: Performance Comparison (1/3). Paris Attack Data Trace (2015). The performance gains of the RHV scheme are mainly achieved by i) the critical claim selection scheme, which identifies the most relevant claims for a given hypothesis, and ii) the optimality of the hypothesis validation algorithm, which explicitly considers the complex relationship between the hypotheses and the claims, as we discussed in Section IV. Similar results are observed in the other two datasets.
Evaluation: Performance Comparison (2/3)
Evaluation: Performance Comparison (3/3). Execution Time Comparison. RHV is among the fastest of the compared schemes across different datasets.
Future Work. Explore more comprehensive claim and hypothesis matching approaches. Consider a hierarchical structure from claims to hypotheses. Explore logical relationships between hypotheses. Validate the developed models in applications beyond Twitter.
Conclusion. This paper formulates a new hypothesis validation problem in social sensing. It develops a reliable hypothesis validation (RHV) framework to address two technical challenges (claim-hypothesis matching and hypothesis correctness validation). The framework is evaluated using real-world social sensing data collected from Twitter feeds.
Thank You! Social Sensing Lab, University of Notre Dame http://www3.nd.edu/~sslab/ dwang5@nd.edu