Analyzing social media data to monitor public health trends

Dominic Trulli
Mentored by Dr. Feiming Chen

Introduction

Web-based data streams, including search engine statistics and social media messages, have emerged as complementary sources of data for public health surveillance. Tweets (status updates from the microblog Twitter) are a particularly promising data source because of the platform's large volume and openness (Paul, Dredze, Broniatowski, & Generous, 2015). Several Twitter studies have demonstrated that aggregating millions of messages can provide valuable insights into a population (Paul & Dredze, 2011). This project set out to identify health trends that can be detected from social media sites and to propose social media as an additional source for tracking the spread of disease in a population. Twitter data can be accessed instantly and at minimal cost compared with traditional health surveillance methods, and it yields millions of data points daily that can be analyzed for health-related content. Such data could act as a leading indicator of disease pandemics and give the public access to data on disease spread before public health officials release it. The purpose of this project was to determine whether social media sources such as Twitter can be used to predict public health trends in the United States as accurately as traditional health surveillance methods.

Results

The collected data were tested for statistical significance by calculating a Pearson correlation for each data set comparison. Graphs 1 and 2 show statistically significant correlations (95% and 70%, respectively, with p-values less than 0.01) between Twitter incidence rates and the CDC (flu) or Gallup (allergy) incidence rates. Figures 1 and 2 show word clouds of the top twenty words found in tweets related to influenza and allergies. The size of each word in a word cloud is proportional to the number of tweets that contained that word.
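The word counts behind a word cloud like those in Figures 1 and 2 can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `top_words`, the stop-word list, and the sample tweets are all assumptions made for the example. Each word is counted once per tweet, matching the description that word size reflects how many tweets contained the word.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the poster does not say which words were excluded.
STOPWORDS = {"i", "the", "is", "a", "and", "to", "of", "my", "have"}

def top_words(tweets, n=20, stopwords=STOPWORDS):
    """Return the n words that appear in the most tweets, with their tweet counts."""
    counts = Counter()
    for tweet in tweets:
        # Use a set so a word repeated inside one tweet is counted only once.
        words = set(re.findall(r"[a-z']+", tweet.lower()))
        counts.update(words - stopwords)
    return counts.most_common(n)

# Made-up sample tweets, for illustration only.
sample = [
    "I have the flu",
    "Flu shot today",
    "flu flu flu is awful",
    "allergies again",
]
print(top_words(sample, n=3))
```

The resulting (word, count) pairs are what a word cloud renderer would scale the font sizes from.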
These word clouds show researchers the typical content of tweets related to allergies and influenza.

Conclusion

These findings suggest that Twitter may serve as a useful leading indicator of common ailments in the United States, informing healthcare workers and the public of when those ailments are most likely to appear in the population. Although high correlations were found between Twitter incidence rates and both the CDC and Gallup survey incidence rates, further research over a longer time series with more data points is needed to verify the accuracy of Twitter as a leading indicator of ailment trends in the United States.

Graph 1: Time series comparison of the Twitter-derived influenza incidence rates versus the published CDC data. The Pearson correlation between the two time series is statistically significant (p-value < 0.001).

Materials and Methods

A Twitter database containing millions of tweets, beginning in 2011, was analyzed. These tweets had been processed by a statistical model called the Ailment Topic Aspect Model (ATAM). The model first sorted the tweets into health-related and non-health-related tweets, then classified the health-related tweets into 20 ailment groups. Due to time constraints, only 3 million tweets were analyzed, drawn from two of the twenty ailment groups: influenza and allergies. The most frequent words in the influenza and allergy groups were also analyzed to characterize those tweets. The tweets in the two groups were then retrieved, and the monthly frequency of tweets in each group was plotted against the officially reported incidence rates: influenza from the Centers for Disease Control and Prevention (CDC) and allergies from Gallup survey data. The data were then normalized so that the incidence frequencies from the two sources could be easily visualized and compared.
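The monthly aggregation and normalization steps can be sketched as below. The poster does not state which normalization was used, so min-max scaling to the range 0 to 1 is an assumption here, as are the function names and the sample dates.

```python
from collections import Counter
from datetime import date

def monthly_counts(tweet_dates):
    """Count tweets per (year, month) from a list of posting dates."""
    return Counter((d.year, d.month) for d in tweet_dates)

def min_max_normalize(values):
    """Rescale values linearly to [0, 1] (assumed method, not confirmed by the poster)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no trend; map it to zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up posting dates, for illustration only.
dates = [date(2012, 1, 5), date(2012, 1, 20), date(2012, 2, 1)]
counts = monthly_counts(dates)
months = sorted(counts)
print(months, min_max_normalize([counts[m] for m in months]))
```

Because min-max scaling is a linear rescaling with a positive factor, it changes how the two series look on a shared axis but leaves their Pearson correlation unchanged.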
Pearson correlations were calculated to compare the flu and allergy trends indicated by the Twitter data with those in the CDC and Gallup survey data.

Graph 2: Time series comparison of the Twitter-derived allergy incidence rates versus the published Gallup survey data. The Pearson correlation between the two time series is statistically significant (p-value = 0.003).

References

Paul, M. J., & Dredze, M. (2011). You are what you Tweet: Analyzing Twitter for public health. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media.
Paul, M. J., Dredze, M., Broniatowski, D. A., & Generous, N. (2015). Worldwide influenza surveillance through Twitter. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence.

Acknowledgments

I would like to thank my mentor, Dr. Feiming Chen from Becton Dickinson, and my faculty advisor, Mrs. McDonough, for all of their help and guidance throughout the completion of this research project.

Figure 1 (top left): Word cloud of the top 20 words found in influenza-related tweets.
Figure 2 (top right): Word cloud of the top 20 words found in allergy-related tweets.

The data displayed above show statistically significant correlations between Twitter-derived influenza and allergy posts and the official data from the CDC and the Gallup survey.
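The Pearson correlation used for these comparisons can be computed as in the sketch below. The function name `pearson_r` and the two sample series are illustrative assumptions, not the study's code or data. The sketch returns only the coefficient; a p-value like those reported in the graph captions would need a statistics library (for example, `scipy.stats.pearsonr` returns both the coefficient and a p-value).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products and squares of the deviations from each mean.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)

# Made-up monthly rates, for illustration only.
twitter_rate = [0.10, 0.40, 0.90, 0.50, 0.20]
official_rate = [0.15, 0.35, 1.00, 0.60, 0.10]
print(round(pearson_r(twitter_rate, official_rate), 3))
```

A value near 1 means the two monthly series rise and fall together, which is the pattern the poster reports for the Twitter-derived and official rates.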

