Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter Eiji ARAMAKI * Sachiko MASKAWA * Mizuki MORITA ** * The University of Tokyo ** National.

Presentation transcript:

Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter Eiji ARAMAKI * Sachiko MASKAWA * Mizuki MORITA ** * The University of Tokyo ** National Institute of Biomedical Innovation EMNLP2011

Why did we develop this system? Let me show you several existing systems.

Centers for Disease Control and Prevention (CDC)

Infectious Disease Surveillance Center (IDSC)

European Influenza Surveillance Network (EISN)

Why does each country have its own surveillance system? Influenza epidemics are a major public health concern because they cause tens of millions of illnesses each year. To reduce the number of victims, early detection of influenza epidemics is a national mission in every country. BUT: these surveillance systems basically rely on hospital reports (written manually).

Two Problems & Recent Approach
(1) Small scale – For example, IDSC gathers influenza patient data from 5,000 clinics, but it does not cover all cities (especially local cities).
(2) Time delay (time lag) – For example, the data gathering process typically has a 1–2 week reporting lag.
To deal with these problems, various approaches that directly capture people's behavior have recently been proposed.

Recent Approach
Using phone call data – Espino et al. (2003) used data from a telephone triage service, a public service that gives advice to users by telephone. They reported that the number of telephone calls correlates with influenza epidemics.
Using drug sales data – Magruder (2003) used the volume of drug sales.
Among various approaches…

The State-of-the-Art Web-based Approach Ginsberg et al. (Nature 2009) used Google web search queries that correlate with influenza epidemics, such as “flu” and “fever”. Polgreen et al. (2008) used a Yahoo! query log. Hulth et al. (2009) used a query log from a Swedish medical website.

This Study Web search query data is an extremely large-scale, real-time data resource. BUT: the query data is closed (not freely available); it is available only to a few companies, such as Google, Yahoo, or Microsoft. → This study examines Twitter data, which is widely available.

OUTLINE Background Objective Method Experiment Discussion Conclusion Detailed Task Definition

Simple Word Frequency in Twitter [Figure: frequency of “cold”, “fever” and “influenza” in tweets over time, from winter to summer] Simple word frequency contains various kinds of noise, because… The actual influenza curve is smoother.
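The simple word-frequency signal can be sketched as a per-day count of tweets that mention a keyword. This is only an illustration: the function name, the (date, text) input format, and the sample tweets are assumptions, not details from the talk.

```python
from collections import Counter


def daily_keyword_counts(tweets, keyword="influenza"):
    """Count, per day, the tweets whose text contains the keyword.

    `tweets` is an iterable of (day, text) pairs (a hypothetical format).
    Matching is a case-insensitive substring test, so negative tweets
    (prevention, questions, news) are counted as well.
    """
    counts = Counter()
    needle = keyword.lower()
    for day, text in tweets:
        if needle in text.lower():
            counts[day] += 1
    return dict(counts)


# Hypothetical sample data, for illustration only.
sample = [
    ("day-1", "I caught the influenza"),
    ("day-1", "You need to get an influenza shot soon"),
    ("day-2", "Nice weather today"),
]
print(daily_keyword_counts(sample))
```

Both tweets on the first day are counted, although only one of them reports an actual patient, which is the kind of noise the slide refers to.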

A word “influenza” does not always indicate an influenza patient. [Figure: examples of a negative influenza tweet and a positive influenza tweet]

Two Types of Influenza Tweets
Positive influenza tweet: indicates an influenza patient.
Negative influenza tweet: includes a mention of “influenza” but does not indicate that an influenza patient is present.
Not only general news but also various other phenomena generate negative influenza tweets…

Various Negative Influenza Tweets (1/2)
Prevention – You need to get an influenza shot sometime soon.
Modality (just suspicion) – might be suffering from influenza
Question – Did you catch the influenza?

Various Negative Influenza Tweets (2/2)
Influenza of a cat or dog – Today, I couldn't go home late. My cat caught the influenza...
Influenza of a TV character – In the last episode of that TV series, Ritsu-chan caught the flu

Research Questions In total, half of influenza-related tweets are negative, which motivates automatic filtering. RQ1: Could an NLP system filter out negative influenza tweets? RQ2: Could this filtering contribute to surveillance accuracy?

OUTLINE Background Method Experiment Discussion Conclusion

Basic Idea: Binary Classification We regard this task as a binary classification task, similar to spam mail filtering: an input tweet is classified as positive or negative using a training corpus. (1) What kind of corpus? (2) What kind of features? (3) What kind of machine learning method?

Corpus (5k sentences with labels) Average annotator agreement ratio = 0.85. See the proceedings for details.

What kind of Feature? Example: “I think the influenza is going around”, with the words around “influenza” labeled L1–L3 (left) and R1–R3 (right). Surrounding words (BOW, no stemming, no POS). Twitter contains many ungrammatical expressions. Among various settings, window size = 6 achieved the highest accuracy.
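The surrounding-word features can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the "L:"/"R:" feature prefixes are hypothetical, and the sketch assumes that a window size of 6 means three words on each side, to match the L1–L3 and R1–R3 labels on the slide.

```python
def window_features(tokens, keyword="influenza", per_side=3):
    """Bag-of-words features from the words surrounding the keyword.

    Takes up to `per_side` tokens on each side of every occurrence of
    `keyword`. No stemming or POS tagging is applied; tokens are only
    lowercased. The left/right prefixes are an assumption of this sketch.
    """
    features = set()
    for i, token in enumerate(tokens):
        if token.lower() != keyword:
            continue
        left = tokens[max(0, i - per_side):i]
        right = tokens[i + 1:i + 1 + per_side]
        features.update("L:" + w.lower() for w in left)
        features.update("R:" + w.lower() for w in right)
    return features


example = "I think the influenza is going around".split()
print(sorted(window_features(example)))
```

Using only raw surrounding words keeps the feature extraction independent of a parser or tagger, which fits the slide's remark that Twitter contains many ungrammatical expressions.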

What kind of Machine Learning Method? Classifiers compared by F-measure and time: AdaBoost, Bagging, Decision Tree, Logistic Regression, Nearest Neighbor, Random Forest, SVM (polynomial; d=2). Among the various settings, SVM achieved feasible accuracy.

OUTLINE Background Method Experiment Discussion Objective

Twitter Data ( ) The first month is used as the training corpus. We divide the remaining data into 4 seasons (Season I, Season II, Season III, Season IV). – The Twitter API specification sometimes changes, leading to dropout periods.

Method Comparison & Evaluation
(1) TWEET-SVM (the proposed method)
(2) TWEET-RAW – Based on the simple word frequency of “influenza”
(3) GOOGLE [Ginsberg 2009] – Based on Google web search queries; the previous estimation data is available at the Google Flu Trends website.
(4) DRUG-SALE [Magruder 2003]
Evaluation is based on the average correlation with the GOLD_STANDARD data, i.e., the actual number of influenza patients reported by the Infectious Disease Surveillance Center (IDSC).
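The evaluation can be sketched as a correlation between an estimated time series and the gold-standard patient counts. The slides do not define the correlation measure, so this sketch assumes the Pearson correlation coefficient; the function name and the numbers in the usage example are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly values, for illustration only.
estimated = [1.0, 2.0, 3.0, 4.0]
gold = [10.0, 20.0, 30.0, 40.0]
print(pearson(estimated, gold))
```

Because the coefficient is invariant to scaling, a method that only tracks the relative number of patients can still be compared against absolute patient counts.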

Result: Correlation Ratio [Table: correlation of TWEET-RAW, TWEET-SVM, GOOGLE and DRUG with the gold standard for Seasons I–IV] Bold indicates a correlation above the statistical significance level. In most seasons, the proposed method achieved a higher correlation than the simple word frequency-based method, demonstrating the advantage of SVM-based filtering.

Result: Correlation Ratio [Table: correlation of TWEET-RAW, TWEET-SVM, GOOGLE and DRUG with the gold standard for Seasons I–IV] Bold indicates a correlation above the statistical significance level. Except for Season II, the proposed method achieved almost the same accuracy as GOOGLE.

Why does Twitter suffer in Season II? Because it includes the pandemic! WHO declared a pandemic in June 2009 (Season II). This suggests that Twitter might be biased by the news media. [Table: correlation of TWEET-RAW, TWEET-SVM, GOOGLE and DRUG in the normal season and the pandemic season]

Season I: TWEET-SVM ≒ GOOGLE [Figure: relative number over Season I]

Season II: TWEET-SVM << GOOGLE [Figure: relative number over Season II]

OUTLINE Background Method Experiment Discussion Conclusion Extra Experiment

Frequent Question Would an influenza patient REALLY use Twitter or Google Search? That seems an unnatural situation! (“I’d like to sleep...”) Because of that, we modified the system under the following assumption: people use Twitter or Google at the first sign of influenza.

Implemented using an infectious disease model [Kermack 1927] (≒ Markov model) States: S (Susceptible, before flu) → I (Infectious, under flu) → R (Recovered, after flu). The S-to-I transition (catching the flu) is what is observed on Twitter / Google. For the I-to-R transition (recovering), 38% of influenza patients recover per day.
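One way to read this model is as a daily update in which the observed S-to-I transitions add to the infectious population and 38% of the infectious population recovers each day. The sketch below works under that assumption; the update rule, the function name, and the input numbers are illustrative and are not taken from the paper.

```python
def infectious_counts(new_cases, recovery_rate=0.38):
    """Estimate the infectious population per day.

    `new_cases[t]` is the number of S-to-I transitions observed on day t
    (e.g., positive tweets). Each day, a fraction `recovery_rate` of the
    previous infectious population moves to the recovered state.
    """
    infectious = 0.0
    series = []
    for observed in new_cases:
        infectious = infectious * (1.0 - recovery_rate) + observed
        series.append(infectious)
    return series


# Hypothetical input: 100 new cases on the first day, none afterwards.
print(infectious_counts([100, 0, 0]))
```

With this update, a burst of first-sign tweets decays gradually instead of vanishing the next day, so the estimated curve reflects people who are currently ill rather than only those who just fell ill.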

BUT: It ALSO improves the Google-based approach. This model improves the correlation of BOTH Twitter & GOOGLE. This result suggests that there is room for collaboration between medical studies and web/NLP studies.

OUTLINE Background Method Experiment Discussion Conclusion

Answer to Research Questions This study proposed a new influenza surveillance system using Twitter. RQ1: Could a system filter out negative influenza tweets? – Yes, but NOT perfectly. RQ2: Could this filtering contribute to surveillance performance? – YES. It increases the correlation (except for the pandemic period). We could achieve almost the same accuracy as GOOGLE using freely available data.

Conclusion Even now, more than 100 (sometimes over 1,000) people die from influenza in Japan. We hope that this study might help people.

Thank you NLP could save a life! Eiji ARAMAKI Ph.D. University of Tokyo