Analyzing social media data to monitor public health trends

Dominic Trulli
Mentored by Dr. Feiming Chen

Introduction

Web-based data streams, including search engine statistics and social media messages, have emerged as complementary sources of data for public health surveillance. Tweets (status updates from the microblog Twitter) are a particularly promising data source because of the platform's large volume and openness (Paul, Dredze, Broniatowski, & Generous, 2015). Several Twitter studies have demonstrated that aggregating millions of messages can provide valuable insights into a population (Paul & Dredze, 2011). This project aimed to identify health trends that can be detected from social media sites and to propose social media as an additional source for tracking disease spread in a population. Twitter data can be accessed almost instantly and at minimal cost compared with traditional health surveillance methods, and it yields millions of data points daily that can be analyzed for health-related content. Such data could act as a leading indicator of disease pandemics and give the public access to information on disease spread before it is released by public health officials. The purpose of this project was to determine whether social media sources such as Twitter could be used to predict public health trends in the United States as accurately as traditional health surveillance methods.

Results

The collected data were tested for statistical significance by calculating a Pearson correlation coefficient for each comparison between data sets. Graphs 1 and 2 show statistically significant correlations (r ≈ 0.95 and r ≈ 0.71, respectively, both with p-values less than 0.01) between Twitter incidence rates and CDC (flu) or Gallup (allergy) incidence rates. Figures 1 and 2 show word clouds of the top twenty words found in tweets related to influenza and allergies, respectively.
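As a rough illustration of how a top-twenty word list like the one behind these word clouds might be produced, the sketch below counts, for each word, the number of tweets that contain it. The tokenizer, the stopword list, and the sample tweets are assumptions made for illustration; they are not the project's actual preprocessing.

```python
import re
from collections import Counter

# Illustrative stopword list (an assumption; not the list used in the project).
STOPWORDS = {"the", "a", "an", "i", "my", "to", "and", "is", "of", "in",
             "it", "this", "have", "with", "for", "so", "me", "got"}


def top_words(tweets, n=20, stopwords=STOPWORDS):
    """Return the n words that appear in the most tweets, with their counts.

    Each word is counted at most once per tweet, so the count is the
    number of tweets containing the word rather than total occurrences.
    """
    counts = Counter()
    for text in tweets:
        # Lowercase and keep alphabetic runs only (a deliberately crude tokenizer).
        words = set(re.findall(r"[a-z]+", text.lower()))
        counts.update(words - stopwords)
    return counts.most_common(n)


# Hypothetical sample tweets, invented for this example.
sample_tweets = [
    "Got the flu, fever and chills all night",
    "Flu shot today, hoping to avoid the fever this year",
    "Allergy season is here, sneezing nonstop",
]
for word, tweet_count in top_words(sample_tweets, n=5):
    print(word, tweet_count)
```

A word cloud library could then take these counts as its weights; that step is left out here to keep the sketch dependency-free.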
The size of each word in a word cloud is proportional to the frequency of tweets containing that word. These word clouds show researchers the common themes of tweets related to allergies and influenza.

Conclusion

These findings suggest that Twitter may serve as a quality leading indicator of common ailments in the United States, which could inform healthcare workers and the public of when these ailments are most likely to appear in the population. Although high correlations were found between Twitter incidence rates and both the CDC and Gallup survey incidence rates, further research over a longer time series with more data points is needed to verify the accuracy of Twitter as a leading indicator of ailment trends in the United States.

Graph 1: Time series comparison of the Twitter-derived influenza incidence rates with the published CDC data. The Pearson correlation coefficient between the two time series is r = 0.9505 (p-value < 0.001).

Materials and Methods

A Twitter database containing millions of tweets from 2011 to 2012 was analyzed. These tweets had been processed by a statistical algorithm called the Ailment Topic Aspect Model (ATAM). The algorithm first sorted the tweets into health-related and non-health-related tweets, then classified the health-related tweets into 20 ailment groups. Due to time constraints, only 3 million tweets from two of the twenty ailment groups, influenza and allergies, were analyzed. The most frequent words in the influenza and allergy groups were also analyzed to learn the nature of those tweets. The tweets in these two groups were then retrieved, and the monthly frequency of tweets in each group was plotted against the officially reported incidence rates of influenza and allergies from the Centers for Disease Control and Prevention (CDC) and Gallup survey data, respectively.
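The monthly aggregation step described above can be sketched as follows. The record layout (a timestamp paired with an ailment group label) and the sample records are hypothetical; the source does not describe how the classified tweets were stored.

```python
from collections import Counter
from datetime import datetime


def monthly_counts(records, group):
    """Count tweets per calendar month for one ailment group.

    `records` is an iterable of (timestamp, ailment_group) pairs, an
    assumed layout for tweets that have already been classified.
    Returns a dict mapping "YYYY-MM" to the tweet count, sorted by month.
    """
    counts = Counter(
        ts.strftime("%Y-%m") for ts, label in records if label == group
    )
    return dict(sorted(counts.items()))


# Hypothetical classified tweets, invented for this example.
sample_records = [
    (datetime(2011, 12, 3), "influenza"),
    (datetime(2011, 12, 19), "influenza"),
    (datetime(2012, 1, 7), "influenza"),
    (datetime(2012, 4, 11), "allergies"),
]
print(monthly_counts(sample_records, "influenza"))
print(monthly_counts(sample_records, "allergies"))
```

Keying on "YYYY-MM" strings keeps the months in chronological order after sorting, which makes it straightforward to line the series up against monthly figures from another source.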
The data were then normalized so that the incidence frequencies from the two data sources could be easily visualized and compared. Pearson correlations were calculated to compare the flu and allergy trends indicated by the Twitter data with those in the CDC and Gallup survey data.

Graph 2: Time series comparison of the Twitter-derived allergy incidence rates with the published Gallup survey data. The Pearson correlation coefficient between the two time series is r = 0.7064 (p-value = 0.003).

Figure 1 (top left): Word cloud of the top 20 words found in influenza-related tweets.
Figure 2 (top right): Word cloud of the top 20 words found in allergy-related tweets.

References

Paul, M. J., & Dredze, M. (2011). You are what you Tweet: Analyzing Twitter for public health. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media.
Paul, M. J., Dredze, M., Broniatowski, D. A., & Generous, N. (2015). Worldwide Influenza Surveillance through Twitter. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence.

Acknowledgments

I would like to thank my mentor, Dr. Feiming Chen from Becton Dickinson, and my faculty advisor, Mrs. McDonough, for all of their help and guidance throughout the completion of this research project.

The data displayed above show statistically significant correlations between Twitter-derived influenza and allergy posts and the official data from the CDC and Gallup survey.
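The normalization and correlation steps behind these comparisons can be sketched with the Python standard library alone. Several things here are assumptions: min-max scaling stands in for the unspecified normalization, the p-value comes from a permutation test rather than whatever test the project used, and the monthly values are invented for illustration. Note that Pearson's r does not change under min-max scaling, so the normalization affects only how the series look when plotted together.

```python
import math
import random


def min_max_normalize(values):
    """Rescale values to the range [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value for r, estimated by shuffling y (a swapped-in test)."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical monthly series, invented for this example.
twitter_counts = [120, 340, 560, 410, 220, 90, 60, 50, 70, 110, 180, 260]
official_rates = [1.1, 2.9, 4.8, 3.6, 2.0, 1.0, 0.8, 0.7, 0.9, 1.2, 1.7, 2.4]

twitter_norm = min_max_normalize(twitter_counts)
official_norm = min_max_normalize(official_rates)
print("r =", round(pearson_r(twitter_norm, official_norm), 4))
print("p =", permutation_p_value(twitter_norm, official_norm))
```

With only a dozen or so monthly points per series, a permutation test is a reasonable choice because it does not rely on distributional assumptions, though its smallest reportable p-value is limited by the number of shuffles.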