Instances of Influenza in the United States Visualized
Dr. Johann Thiel, Parth Patel - New York City College of Technology, CUNY – Fall 2018

Introduction
Project Tycho collects large data sets related to healthcare, in particular the instance counts and geographical locations of diseases. We look at the instance counts and locations of influenza from 1919 to 1951 across the United States. We hope to gain seasonal and geographical insight into the spread of the disease.

Figure 4 - Sample of Data Collected

Research Questions
Is there seasonal behavior in the historical instance counts of influenza?
Is there any pattern of the disease spreading from one geographical area to another?
Are there any sharp drops or increases in instance counts that could be explained by the introduction of vaccination or other preventative health measures?

Methodology
We use Pandas to parse and analyze the data, and present the results as a Jupyter notebook. The analysis and results are available at https://github.com/parthpatel1001/tycho_influenza/blob/master/influenza_visualization.ipynb. Along the way we encountered an interesting data deduplication problem and created helper functions to sanitize the data.

Figure 2 - Fatal and Non-Fatal Instances of Influenza

Figure 3 - Heat Map Distribution of Instances of Influenza by State with Normalization

Discussion
We needed to perform a data sanitization procedure to obtain a normalized and de-duplicated dataset (see Figure 4). This was accomplished efficiently using Pandas' sorting capabilities and custom Python iterators that perform the validation. Data sanitization is an important component of the analysis process and is critical to ensuring valid and verifiable results. In future analysis, we can use predictive seasonal time series tools such as Facebook’s Prophet to explore the possibility of creating forecast ranges with confidence intervals.
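As a rough illustration of the sort-then-iterate sanitization described in the Discussion, the following minimal sketch sorts records with Pandas and then walks them with a custom Python generator. It is not the notebook's actual helper code. The column names (`state`, `period_start`, `period_end`, `cases`), the function names, and the rule of keeping only the earliest, shortest reporting period per state are all assumptions made for illustration.

```python
import pandas as pd


def iter_kept_labels(sorted_rows):
    """Yield index labels of rows to keep.

    Expects rows sorted by state, then period_start, then period_end.
    A row is kept when it starts a new state or begins strictly after
    the end of the last kept row for that state (hypothetical rule).
    """
    last_state, last_end = None, None
    for row in sorted_rows:
        if row.state != last_state or row.period_start > last_end:
            last_state, last_end = row.state, row.period_end
            yield row.Index


def sanitize(df):
    # Sort once with Pandas so duplicates and overlaps become adjacent.
    ordered = df.sort_values(["state", "period_start", "period_end"])
    kept = list(iter_kept_labels(ordered.itertuples()))
    return df.loc[kept].reset_index(drop=True)


# Made-up sample rows (not real Tycho data): one exact duplicate and one
# longer period that overlaps two weekly records.
raw = pd.DataFrame({
    "state": ["NY", "NY", "NY", "NY", "NJ"],
    "period_start": pd.to_datetime(
        ["1920-01-04", "1920-01-04", "1920-01-04", "1920-01-11", "1920-01-04"]),
    "period_end": pd.to_datetime(
        ["1920-01-10", "1920-01-10", "1920-01-17", "1920-01-17", "1920-01-10"]),
    "cases": [12, 12, 42, 30, 5],
})

clean = sanitize(raw)
print(clean)
```

Sorting first means the generator only has to remember the last kept record per state, so the validation runs in a single pass over the data.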
Additional research is needed to compute the number of flu cases occurring in the US after 1951. The Centers for Disease Control and Prevention (CDC) publishes such numbers periodically, and they should be incorporated into the current project. Vaccine data can also be incorporated, particularly the introduction of vaccination programs in specific geographic locations and their impact on instance counts, both in those areas and in surrounding ones.

Results
The following figures represent some of the graphical results obtained using various scientific computing modules in Python.

Figure 1 - Fatal Instances of Influenza by State

Conclusion
As expected, the data shows a seasonal trend in flu cases (see Figures 1 and 2). Furthermore, the heat map (Figure 3) reveals particular years in which there was a national flu epidemic, as well as geographically close states that appear to have had localized outbreaks.

References
https://www.tycho.pitt.edu/