Google Flu Trends
Terminology
– Influenza = flu
– ILI = influenza-like illness
CDC ILI time series
– Weekly
– 1-2 week publication lag
Predicting it using frequency of search queries
– Machine learning problem
– Logit regression
Flu a big problem
Tens of millions of cases every year, worldwide
k deaths every year, worldwide
Swine flu pandemic is worse
Surveillance
– CDC
– European Influenza Surveillance Scheme (EISS)
– Also, monitoring volume of calls to help-lines, and
– Volume of over-the-counter sales
Regression
I(t) = fraction of doctor's visits due to flu
Q(t) = fraction of search queries related to flu
logit(·) used to map I(t), Q(t) from (0,1) to R
– logit(p) = log [p/(1-p)]
Regression: logit(I(t)) = α · logit(Q(t)) + β + ε(t)
– ε = error
– Correlation is the performance measure
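The regression on this slide can be sketched in a few lines: map both fractions through logit, fit a straight line by least squares, then map predictions back to a fraction. This is only an illustration under assumptions. The helper names (`fit_line`, `predict_ili`) and the weekly numbers are made up for the example and are not the original model or data.

```python
import math

def logit(p):
    """Log-odds of a fraction p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse of logit: maps a real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical weekly fractions, for illustration only (not real data).
Q = [0.0010, 0.0015, 0.0030, 0.0045, 0.0025]  # flu-related query fraction
I = [0.010, 0.014, 0.026, 0.037, 0.022]       # ILI fraction of doctor visits

# Fit logit(I) against logit(Q).
slope, intercept = fit_line([logit(q) for q in Q], [logit(i) for i in I])

def predict_ili(q):
    """Predicted ILI fraction for query fraction q, using the fitted line."""
    return inv_logit(slope * logit(q) + intercept)
```

Because the fit happens on the log-odds scale, `predict_ili` always returns a value strictly between 0 and 1, which is the point of using logit here.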
Which Queries?
Training data: I_1(t),…,I_9(t)
– 9 regions
– Some subset of 9/28/2003 to 3/11/2007 for which I_j(t) > 0
Candidate queries
– Database of 50m most common search queries in US
– Q_j(t) = volume fraction of candidate query in region j
– Calculate correlations ρ_j = corr( logit(Q_j(·)), logit(I_j(·)) )
– Fisher Z-transform the correlations to make them approximately normally distributed
– Average the Z-transformed correlations across the 9 regions
– Rank the candidate queries by this average
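The scoring steps on this slide (per-region correlation, Z-transform, average, rank) might look roughly like the sketch below. It assumes the Z-transform is the Fisher transform, atanh(r), and clamps r slightly inside (-1, 1) so the transform stays finite. The function names, the two example queries and all numbers are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fisher_z(r):
    """Fisher Z-transform; r is clamped (an assumption) to keep atanh finite."""
    r = max(-0.999999, min(0.999999, r))
    return math.atanh(r)

def score_query(q_by_region, ili_by_region):
    """Mean Z-transformed correlation of one candidate query over all regions."""
    zs = [
        fisher_z(pearson([logit(v) for v in q], [logit(v) for v in ili]))
        for q, ili in zip(q_by_region, ili_by_region)
    ]
    return sum(zs) / len(zs)

def rank_queries(candidates, ili_by_region):
    """Candidate query names sorted from best to worst score."""
    scores = {name: score_query(series, ili_by_region)
              for name, series in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical data: 2 regions, 4 weeks each (the slide uses 9 regions).
ili_by_region = [[0.010, 0.020, 0.035, 0.015],
                 [0.012, 0.018, 0.030, 0.020]]
candidates = {
    "flu symptoms": [[0.0010, 0.0022, 0.0033, 0.0016],
                     [0.0011, 0.0019, 0.0029, 0.0021]],
    "basketball scores": [[0.0030, 0.0010, 0.0020, 0.0040],
                          [0.0025, 0.0030, 0.0012, 0.0020]],
}
ranking = rank_queries(candidates, ili_by_region)
```

Averaging in Z-space rather than averaging raw correlations keeps a single region with a very high correlation from being compressed near 1.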
Which Queries?
Q_j(t) = summed volume of the top n = 45 queries
– Out-of-sample validation to choose n
Avg. correlation of 0.9 over 9/28/2003 to 3/11/2007 (training period)
Avg. correlation of 0.97 over 3/18/2007 to 5/11/2008 (later, held-out weeks)

Query type                        # among top 45
Influenza complications           11
Cold/flu remedy                   8
General flu symptoms              5
Term for influenza                4
Specific flu symptoms             4
Symptoms of a flu complication    4
Antibiotics                       3
General flu remedies              2
Symptoms of a related disease     2
Antivirals                        1
Related disease                   1
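One way to picture "out-of-sample validation to choose n": for each n, sum the top-n ranked queries, fit on the early weeks, and score by correlation on the later weeks. The sketch below assumes a simple chronological split and a single region. The names (`combined_fraction`, `holdout_correlation`, `choose_n`) and the data are hypothetical, and the actual validation procedure may have differed.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def combined_fraction(ranked_series, n):
    """Plain (unweighted) week-by-week sum of the top-n query series."""
    return [sum(week) for week in zip(*ranked_series[:n])]

def holdout_correlation(ranked_series, ili, n, split):
    """Fit on weeks [0, split), return correlation on weeks [split, end)."""
    q = [logit(v) for v in combined_fraction(ranked_series, n)]
    y = [logit(v) for v in ili]
    slope, intercept = fit_line(q[:split], y[:split])
    predicted = [slope * x + intercept for x in q[split:]]
    return pearson(predicted, y[split:])

def choose_n(ranked_series, ili, split):
    """Return the n with the best held-out correlation, plus all scores."""
    scores = {n: holdout_correlation(ranked_series, ili, n, split)
              for n in range(1, len(ranked_series) + 1)}
    return max(scores, key=scores.get), scores

# Hypothetical data: 8 weeks, 3 queries already ranked best-first.
ili = [0.010, 0.015, 0.025, 0.040, 0.030, 0.018, 0.012, 0.022]
ranked = [
    [0.0010, 0.0016, 0.0024, 0.0041, 0.0029, 0.0019, 0.0011, 0.0023],
    [0.0008, 0.0011, 0.0021, 0.0030, 0.0026, 0.0013, 0.0010, 0.0017],
    [0.0050, 0.0010, 0.0040, 0.0005, 0.0045, 0.0008, 0.0052, 0.0009],
]
best_n, scores = choose_n(ranked, ili, split=5)
```

In this toy data the third query is unrelated to ILI, so including it hurts the held-out correlation; the same effect is what caps n at some finite value in practice.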
Extensions
Geographic granularity
– Example: predictions specific to state of Utah
Not including query variants, misspellings, etc.
Not using a weighted sum of query volumes
– Just a plain sum
– Query volumes are highly correlated
Discussion
Use of early warning
– New strain? Extra capacity needed? Public awareness?
Searches may not indicate infection
– Could be driven by flu-related news instead
– Keeping the actual queries in the regression secret
Not a substitute for actual surveillance
– Demographics, genotype, …