Leveraging Google search data to track influenza outbreaks in Africa

Karla Mejía; Cecile Viboud; Mauricio Santillana

doi:10.12688/gatesopenres.13072.1

Research Article

[version 1; peer review: 1 approved, 1 not approved]

Karla Mejía¹, Cecile Viboud², Mauricio Santillana³,⁴

PUBLISHED 31 Oct 2019

¹ Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, 02115, USA
² Fogarty International Center, National Institutes of Health, Bethesda, Maryland, 20892, USA
³ Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, 02115, USA
⁴ Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, 02115, USA

Karla Mejía
Roles: Data Curation, Formal Analysis, Software, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Cecile Viboud
Roles: Funding Acquisition, Writing – Review & Editing

Mauricio Santillana
Roles: Conceptualization, Funding Acquisition, Methodology, Project Administration, Supervision, Writing – Review & Editing

Abstract

Background: Traditionally, public health agencies track seasonal influenza activity by collecting information from clinics, hospitals, and laboratories. The inherent slowness of the processes used to collect influenza activity data limits the ability of public health agencies to adapt to unexpected changes in influenza activity in near real-time. In recent years, new influenza surveillance methods that use nontraditional data sources, such as Google searches, have been proposed to successfully estimate influenza activity in near real-time. However, most of these methods have been designed for and implemented in high-income countries even though influenza disease burden remains high in low- to middle-income countries. Here, we seek to predict influenza activity in near real-time in Africa using machine learning models that combine Google searches with traditional epidemiological data.
Methods: We extend the AutoRegression with Google search data (ARGO) model to track influenza activity in near-real-time in Africa. The ARGO model, which was originally designed to predict influenza activity in the United States, combines influenza-related Google searches with historical laboratory-confirmed influenza trends. We evaluate the predictive performance of the ARGO model and compare it with several benchmark models in Algeria, Ghana, Morocco, and South Africa. We also explore the advantages and limitations of using Google search data to monitor influenza activity.
Results: In South Africa, Algeria, and Morocco, the ARGO model outperforms all benchmark models, suggesting that incorporating influenza-related Google search information in predictive models in these countries leads to improved predictions. In Ghana, however, the ARGO model and the autoregressive model of historical influenza activity have comparable performances.
Conclusions: These results demonstrate that the quality of the ARGO predictions is higher in regions where influenza activity is seasonal, historical influenza activity is recorded consistently, and the volume of influenza-related Google search queries is enough to appear as non-zero in the Google Trends tool.

Keywords

Real-time disease surveillance, Digital epidemiology, Google Flu Trends, Influenza monitoring, Seasonal influenza

Corresponding author: Mauricio Santillana

Competing interests: No competing interests were disclosed.

Grant information: The study was funded in part by the Bill and Melinda Gates Foundation (OPP 1195154).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2019 Mejía K et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The author(s) is/are employees of the US Government and therefore domestic copyright protection in USA does not apply to this work. The work may be protected under the copyright laws of other jurisdictions when used in those jurisdictions.

How to cite: Mejía K, Viboud C and Santillana M. Leveraging Google search data to track influenza outbreaks in Africa [version 1; peer review: 1 approved, 1 not approved]. Gates Open Res 2019, 3:1653 (https://doi.org/10.12688/gatesopenres.13072.1) First published: 31 Oct 2019, 3:1653 (https://doi.org/10.12688/gatesopenres.13072.1) Latest published: 31 Oct 2019, 3:1653 (https://doi.org/10.12688/gatesopenres.13072.1)

Introduction

According to estimates from the U.S. Centers for Disease Control and Prevention (CDC), up to 650,000 people around the world die each year from complications related to seasonal influenza¹. The primary methods of influenza prevention include vaccination and good health habits, such as hand washing and covering one’s mouth when sneezing. Because influenza virus strains evolve rapidly, populations must be revaccinated annually. The influenza vaccine is used routinely in high-income countries but less so in low- to middle-income countries². The constant threat of an influenza pandemic highlights the need for accurate influenza surveillance methods to guide local and global policies for preventing and mitigating the spread of influenza.

Traditionally, public health agencies, such as the U.S. CDC, track influenza activity by collecting respiratory specimen information and syndromic data from hospitals, clinics, and laboratories and analyzing them for the presence of the influenza virus. Unfortunately, these important and necessary healthcare-based approaches lag behind real-time influenza activity by at least one to two weeks because of the inherent processing times of lab testing, data collection, data aggregation, and quality control. First, an infected patient visits a clinic or hospital, where health providers obtain a specimen to send to a laboratory for testing. Then, the laboratory tests the specimen for influenza and reports the results to public health agencies, which aggregate the results to the regional and national levels. This time delay limits the ability of public health agencies to respond to sudden and unexpected changes in influenza activity in real-time. To address this problem, researchers in the past decade have introduced influenza surveillance methods that combine information from nontraditional Internet-based data sources to estimate influenza activity in near real-time³–¹¹.

In 2008, Google launched Google Flu Trends (GFT), an Internet-based tool that leveraged Google search queries of influenza-related terms to estimate influenza activity in near real-time in the United States⁵. In subsequent years, GFT expanded to provide influenza activity estimates for more than 25 countries. Despite early interest in GFT, its failure to capture the influenza pandemic of 2009 and its tendency to overestimate peak influenza activity in the 2013-2014 influenza season led to public skepticism about the value of nontraditional influenza surveillance methods¹²–¹⁵. As described by Santillana¹⁶, Google discontinued GFT in 2015 after multiple revisions to its original approach.

In response to the shortcomings of GFT, researchers introduced promising methodological improvements to Google’s original approach¹⁵–²². However, most of the proposed methods have been implemented in countries in the Americas and Europe even though sub-Saharan Africa, the western Pacific, and Southeast Asia have high influenza-related mortality rates¹. Surveillance methods that can help estimate influenza activity in near real-time in countries where traditional viral surveillance is sparse or delayed could be useful to local public health officials. In addition, although the use of the Google Search tool and social media preferences vary geographically, we expect the reliability of such methods to increase over time as Internet coverage continues to increase rapidly in these regions²³.

Our contribution

We extend the AutoRegression with GOogle search data (ARGO) model introduced by Yang et al. to track influenza viral activity in near real-time in four African countries: Algeria, Ghana, Morocco, and South Africa¹⁷. The predictions of the ARGO model rely on a combination of influenza-related Google searches and historical influenza trends²⁴. In order to understand the value of Google search information in monitoring influenza activity, we evaluate the out-of-sample prediction performance of the ARGO model during the study period and compare it with benchmark predictive models that only use either historical influenza activity or Google search information. Our findings suggest that incorporating influenza-related Google search information in predictive models leads to improved predictions in three out of the four African countries we studied.

Methods

Influenza surveillance data

We measured influenza activity using data from FluNet, a publicly available influenza surveillance database maintained by the World Health Organization (WHO) that tracks the weekly number of processed respiratory specimens (NPS) and confirmed influenza cases (NCC) in countries around the world²⁵. The NPS is the number of specimens from patients with influenza-like symptoms processed at national laboratories by type and subtype while the NCC is the subset of the NPS that tested positive for influenza. We obtained all available weekly FluNet reports from January 8, 2012 to October 7, 2018 for Algeria, Ghana, Morocco, and South Africa. The remaining African countries were excluded because they either had a large proportion of missing data (see Extended data²⁶, flunet heatmap and Supplementary Figure 1) during the study period or lacked influenza-related Google search data.

We used the complete case ratio (CCR) to measure influenza activity in each country. The CCR is defined as the NCC normalized by the NPS from the entire influenza season. Specifically, the CCR in week j of influenza season i is equal to the NCC in week j of influenza season i divided by the NPS in influenza season i as shown below:

\mathrm{CCR}_{ij} = \frac{\mathrm{NCC}_{ij}}{\mathrm{NPS}_{i}} (1)

Hospitals in resource-limited regions frequently have limited numbers of influenza testing kits during each influenza season. Consequently, changes in the NPS and NCC over time may track the available number of testing kits and not actual changes in influenza activity. By using the CCR, we monitor the changes in the total confirmed influenza cases relative to the number of suspected cases or available testing kits.
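Equation (1) can be illustrated with a minimal sketch. It assumes that the seasonal NPS is the sum of the weekly NPS values within a season; the season labels and counts below are hypothetical toy values, not FluNet data.

```python
from collections import defaultdict


def complete_case_ratio(records):
    """Compute the CCR of Eq. (1) for each (season, week).

    records: iterable of (season, week, ncc, nps) tuples, where ncc and nps
    are the weekly confirmed cases and processed specimens.
    Assumption: NPS_i is the sum of the weekly NPS values in season i.
    """
    records = list(records)
    season_nps = defaultdict(int)
    for season, _week, _ncc, nps in records:
        season_nps[season] += nps

    ccr = {}
    for season, week, ncc, _nps in records:
        total = season_nps[season]
        # A season with no processed specimens has an undefined ratio.
        ccr[(season, week)] = ncc / total if total > 0 else float("nan")
    return ccr


# Hypothetical toy data: (season, week, NCC, NPS)
toy_records = [
    ("2016", 1, 2, 10),
    ("2016", 2, 6, 30),
    ("2017", 1, 1, 20),
]
ccr = complete_case_ratio(toy_records)
```

Because every week in a season shares the same denominator, the weekly CCR values of a season sum to that season's overall positivity rate.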

Google search data

We used Google Correlate (GC) or Google Trends (GT) to identify influenza-related Google search terms in Algeria, Ghana, Morocco, and South Africa. GC and GT are publicly available tools developed by Google for analyzing Google Search data. GC allows users to obtain the most correlated search terms to any user-provided time series while GT allows users to input terms and obtain the most popular search queries made by users who also searched for the inputted terms. Since Morocco is the only African country supported by GC, we used the time series of the NPS reported in Morocco from January 6th, 2008 to January 6th, 2013 to obtain the most correlated search terms in Morocco. For the remaining countries, we used GT to obtain the most popular search queries made by users who also searched for influenza, influenza symptoms, and influenza-like illnesses in each country’s top official languages from January 6th, 2008 to January 6th, 2013. We manually excluded spurious terms such as "yuppie flu" and "rsvp" as well as Arabic terms since we are not literate in Arabic. See Extended data (Search terms and Supplementary Tables 1–4) for a list of the Google search terms identified for Algeria, Ghana, Morocco, and South Africa²⁶.

We accessed the Google Health Trends (GHT) API on November 12, 2018 to obtain the search frequencies from January 8, 2012 to December 31, 2017 of the search terms identified for Algeria, Ghana, Morocco, and South Africa. GHT is a private application programming interface (API) developed by Google that allows academic researchers to obtain the relative search frequencies of any search term. The search frequencies from GHT are relative because they represent the proportions of searches from a sample of all Google searches. We used the search frequencies as a proxy for public concern for influenza under the assumptions that search activities reflect public concern and that public concern for influenza is positively associated with influenza activity.

The search frequencies from GHT are left-censored because the API sets the search frequencies to zero when the number of searches does not exceed the privacy threshold set by Google to prevent researchers from dissecting the searches of individual users. As a result, models based on search terms with search frequencies that include zeros may underestimate the importance of those search terms because zero values do not always represent zero searches. In addition, since the results from the API only represent a sample of all Google searches, results for search terms with low search volumes will have higher variability and thus introduce more error than results for search terms with high search volumes. To minimize error due to sampling variability, we calculated the proportion of zeros for each search term over the initial training period from January 8, 2012 to January 5, 2014 and manually excluded search terms with proportions of zeros exceeding 0.25.
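The exclusion rule above can be sketched as a simple filter. The function name, the dictionary layout, and the toy frequencies are illustrative assumptions; only the 0.25 threshold comes from the text.

```python
def filter_sparse_terms(frequencies, max_zero_fraction=0.25):
    """Keep search terms whose share of zero-valued weeks is at most the threshold.

    frequencies: dict mapping a search term to its list of weekly relative
    search frequencies over the initial training period.
    """
    kept = {}
    for term, series in frequencies.items():
        if not series:
            continue  # no data for this term, nothing to evaluate
        zero_fraction = sum(1 for value in series if value == 0) / len(series)
        if zero_fraction <= max_zero_fraction:
            kept[term] = series
    return kept


# Hypothetical toy frequencies (not GHT output)
toy_frequencies = {
    "flu": [0.4, 0.0, 0.6, 0.5],        # 25% zeros: kept
    "rare term": [0.0, 0.0, 0.1, 0.0],  # 75% zeros: excluded
}
kept_terms = filter_sparse_terms(toy_frequencies)
```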

Models

We implemented the ARGO model and several benchmark models to obtain retrospective out-of-sample estimates of the weekly CCR in Algeria, Ghana, Morocco, and South Africa under the assumption that we only had access to data available at the time of estimation. More specifically, we assumed that in order to produce predictions for time t, we had access to epidemiological data for the prior week of estimation (i.e., week t – 1) and influenza-related Google search activity up to the week of estimation (i.e., week t). For each country, the training and test periods were based on the availability of the historical influenza activity data reported by FluNet. We used moving training windows of 52 weeks for Morocco and moving training windows of 104 weeks for Algeria, Ghana, and South Africa. All analyses were implemented in Python 3.6.2 using scikit-learn version 0.20.0²⁷. Underlying data shows the data used in this analysis²⁴.
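A minimal sketch of the moving training window follows, assuming weeks are indexed from zero and that the model for week t is trained on the `window` weeks immediately preceding it. The helper name is hypothetical.

```python
def moving_windows(n_weeks, window):
    """Yield (training_weeks, target_week) pairs for one-step-ahead nowcasts.

    The training weeks for target week t are t - window, ..., t - 1, so only
    surveillance data available before week t are used for fitting.
    """
    for t in range(window, n_weeks):
        yield range(t - window, t), t


# With 6 weeks of data and a 4-week window there are two target weeks.
splits = [(list(train), target) for train, target in moving_windows(6, 4)]
```

In the setting described above, `window` would be 52 for Morocco and 104 for the other three countries.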

ARGO model

The ARGO model introduced by Yang et al. (2015) is a dynamic penalized multivariate linear regression that combines historical influenza activity data with influenza-related Google search frequencies to predict influenza activity in the United States¹⁷. We adapted the ARGO framework to predict influenza activity in Algeria, Ghana, Morocco, and South Africa by combining historical influenza activity data reported by FluNet with influenza-related Google search frequencies. The ARGO model is motivated by two assumptions: (a) unobserved influenza activity at time t depends on previously observed influenza activity, and (b) changes in influenza activity at time t lead to changes in influenza-related Google search activity at time t. In other words, we assume that if more people are affected by influenza, then more people will search for influenza-related information on the Internet. These assumptions can be formalized as a hidden Markov model

\begin{array}{ccccccc} y_{i, 1 : N} & \to & y_{i, 2 : N + 1} & \to & \dots & \to & y_{i, (T - N + 1) : T} \\ \downarrow & & \downarrow & & & & \downarrow \\ X_{i, N} & \to & X_{i, N + 1} & \to & \dots & \to & X_{i, T} \end{array}

which results in the following mathematical representation of the ARGO model

y_{t} = μ_{y} + \sum_{j \in J} α_{j} y_{t - j} + \sum_{k \in K} β_{k} X_{k, t} + ε_{t}, ε_{t} \sim N (0, σ^{2}) (2)

where y_t represents the CCR, X_k,t represents the standardized Google search frequencies of term k at week t, and ε_t represents the noise which captures the complexity of the observed data that cannot be explained by the model. X_k,t is standardized to have mean zero and variance one. ε_t is assumed to be a Gaussian white noise process with mean zero and constant variance.

Since the number of predictors may exceed the number of observations, using least squares to estimate the model parameters will result in overfitting due to the curse of dimensionality. For that reason, the ARGO model imposes an L₁ penalty to estimate the parameters µ_y, α = (α₁, …, α_N), and β = (β₁, …, β_K) that minimize the objective function

\sum_{t = 1}^{T} {(y_{t} - μ_{y} - \sum_{j \in J} α_{j} y_{t - j} - \sum_{k \in K} β_{k} X_{k, t})}^{2} + λ_{α} \sum_{j \in J} | α_{j} | + λ_{β} \sum_{k \in K} | β_{k} | (3)

with a moving training window and hyperparameters λ_α and λ_β. For simplicity, we restricted the hyperparameters to λ_α =λ_β and used 10-fold cross validation to choose the hyperparameters that produce the minimum mean squared error on the training set.
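To make the penalized regression concrete, here is a minimal pure-Python sketch that builds the predictors of Eq. (2) and fits them with an L₁ penalty by coordinate descent. It is not the authors' implementation, which used scikit-learn. It assumes a single fixed penalty (λ_α = λ_β = `lam`) rather than one chosen by 10-fold cross validation, it assumes the search series are already standardized, and the function names and toy numbers are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def argo_design(y, search, lags):
    """Build rows [y_{t-j} for j in lags] + [X_{k,t} for each term k]."""
    rows, targets = [], []
    for t in range(max(lags), len(y)):
        rows.append([y[t - j] for j in lags] + [series[t] for series in search])
        targets.append(y[t])
    return rows, targets


def lasso_fit(X, y, lam, n_iter=200):
    """Minimize sum of squared errors + lam * sum(|w|) by coordinate descent.

    The intercept is not penalized, matching mu_y in Eq. (3).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b0 = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                # Prediction that leaves out feature j.
                partial = b0 + sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
        b0 = sum(y[i] - sum(X[i][k] * w[k] for k in range(p)) for i in range(n)) / n
    return b0, w


# Hypothetical toy series: weekly CCR and two standardized search terms.
y = [0.010, 0.020, 0.040, 0.030, 0.020, 0.010, 0.020, 0.050]
search = [
    [-1.0, -0.5, 1.0, 0.5, -0.5, -1.0, -0.5, 1.5],
    [0.3, -0.2, 0.1, -0.1, 0.2, -0.3, 0.1, 0.0],
]
rows, targets = argo_design(y, search, lags=[1])
intercept, coefs = lasso_fit(rows, targets, lam=1e-4)
_, coefs_heavy = lasso_fit(rows, targets, lam=1e6)  # a very large penalty zeroes every coefficient
```

Coefficients driven exactly to zero by the penalty correspond to predictors that the model ignores in that training window.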

Performance metrics

We assessed model performance with the root mean squared error (RMSE), mean absolute error (MAE), and Pearson correlation (CORR).

RMSE= \sqrt{\frac{1}{n} \sum_{t = 1}^{n} {({\hat{y}}_{t} - y_{t})}^{2}} (4)

MAE = \frac{1}{n} \sum_{t = 1}^{n} | {\hat{y}}_{t} - y_{t} | (5)

CORR = Corr ({\hat{y}}_{t}, y_{t}) = \frac{\sum_{t = 1}^{n} ({\hat{y}}_{t} - \bar{\hat{y}}) (y_{t} - \bar{y})}{\sqrt{\sum_{t = 1}^{n} {({\hat{y}}_{t} - \bar{\hat{y}})}^{2} \sum_{t = 1}^{n} {(y_{t} - \bar{y})}^{2}}} (6)

The RMSE and MAE measure how well a model describes the historical influenza activity data while the CORR measures how well a model tracks the trends in the influenza activity data. We used the RMSE as the primary metric for assessing model performance because the objective function of each model minimizes the RMSE. A model that describes historical influenza activity well will have RMSE and MAE values closer to zero, while a model that tracks historical influenza activity trends well will have a CORR value closer to one. For each model, we calculated the performance metrics between predicted and official influenza activity over the test period from January 6th, 2014 to January 5th, 2018.
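The three metrics in Eqs. (4) to (6) can be computed directly, as in the sketch below. The function names and the toy predictions are illustrative, not values from the study.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error, Eq. (4)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def mae(predicted, observed):
    """Mean absolute error, Eq. (5)."""
    n = len(observed)
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / n


def corr(predicted, observed):
    """Pearson correlation, Eq. (6); returns NaN if either series is constant."""
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        return float("nan")
    return cov / math.sqrt(var_p * var_o)


# Hypothetical toy values
predicted = [1.0, 2.0, 3.0]
observed = [1.0, 2.0, 5.0]
scores = (rmse(predicted, observed), mae(predicted, observed), corr(predicted, observed))
```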

Benchmark models

In addition to computing performance metrics, we compared the performance of ARGO to three benchmark models. We implemented an autoregressive (AR) model based on historical influenza surveillance data, a penalized multivariate regression model that only relies on GT data, and the historical predictions from GFT which are only available for South Africa²⁸.

AR model. We constructed the AR model using historical influenza activity data from FluNet to nowcast weekly influenza activity. The AR model assumes that unobserved influenza activity in the current week depends on previously observed influenza activity by representing unobserved influenza activity at week t as a linear combination of previously observed influenza activity plus some noise. The AR model of order p can be represented mathematically as

y_{t} = α_{1} y_{t - 1} + α_{2} y_{t - 2} + \dots + α_{p} y_{t - p} + w_{t}, w_{t} \sim N (0, σ_{w}^{2}) (7)

where p ∈ ℕ and α₁, …, α_p ∈ ℝ such that α_p ≠ 0 are the autoregressive coefficients. As in the ARGO model, y_t in the AR model represents the historical influenza activity and the noise w_t is a Gaussian white noise process with mean zero and constant variance.

We used partial autocorrelograms (given in Extended data²⁶, Supplementary Figure 2) to choose the number of autoregressive terms p to include in the AR model for each country. Partial autocorrelograms plot the empirical partial autocorrelation $\hat{p} (h)$ for different autoregressive terms h where the partial autocorrelation refers to the correlation between a variable and its h-th autoregressive term after removing the linear dependence of the shorter autoregressive terms.

\hat{p} (h) = Corr (y_{t}, y_{t - h} | y_{t - 1}, \dots, y_{t - h + 1}) (8)

In other words, the partial autocorrelation between a variable and its h-th autoregressive term quantifies the degree to which they are linearly related after removing the impact of the shorter autoregressive terms. We modeled influenza activity in Algeria, Ghana, Morocco, and South Africa with an AR(1) model because only the partial autocorrelation for the first autoregressive term was significantly different from zero for each country. We trained the AR model using linear regression with a moving time window to compare the AR model to the ARGO model. Since the ARGO model is an autoregressive model that leverages exogenous GT data, comparing the ARGO model to the AR model will demonstrate the additional predictive value of including the GT data.
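An AR(1) nowcast with a moving window can be sketched as follows. The sketch assumes the intercept-free form of Eq. (7), so the coefficient is the least-squares slope through the origin; whether the original regression included an intercept is not stated. The function names and the toy series are hypothetical.

```python
def fit_ar1(series):
    """Least-squares estimate of alpha_1 in y_t = alpha_1 * y_{t-1} + w_t."""
    numerator = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    denominator = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return numerator / denominator if denominator > 0 else 0.0


def ar1_nowcasts(series, window):
    """Nowcast week t from week t - 1 with a coefficient refit on each window."""
    predictions = {}
    for t in range(window, len(series)):
        alpha = fit_ar1(series[t - window:t])  # uses only weeks before t
        predictions[t] = alpha * series[t - 1]
    return predictions


# Hypothetical toy series that doubles every week, so alpha_1 is exactly 2.
toy_series = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
nowcasts = ar1_nowcasts(toy_series, window=3)
```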

GT model. We implemented a GT model that uses influenza-related Google search frequencies to predict influenza activity without taking historical influenza activity data into account. The GT model assumes that the observed frequencies of influenza-related Google search terms at week t can be used as a proxy for influenza activity at week t by representing unobserved influenza activity at week t as a linear combination of the search frequencies of the Google search terms at week t plus some noise. The GT model can be represented mathematically as

y_{t} = β_{1} X_{1, t} + β_{2} X_{2, t} + \dots + β_{p} X_{p, t} + v_{t}, v_{t} \sim N (0, σ_{v}^{2}) (9)

where y_t represents historical influenza activity at time t, X_i,t represents the Google search frequency of term i at week t, and the noise v_t is a Gaussian white noise process with mean zero and constant variance. We trained the GT model using LASSO regression with a moving time window and dynamic variable transformations. Comparing the GT model to the ARGO model will demonstrate the predictive value of the GT data in the absence of autoregressive information.

GFT. GFT for South Africa, which launched in 2010 as a result of a collaboration between Google and the National Institute for Communicable Diseases (NICD), combined Google search data with influenza surveillance data from the NICD to track influenza activity in South Africa at the national level as well as the provincial level for the provinces of Gauteng, KwaZulu-Natal, and the Western Cape²⁸. Although GFT for South Africa was discontinued in 2015, the historical GFT estimates for South Africa from January 1st, 2006 to August 9th, 2015 remain publicly available²⁸. Since the estimates reported by GFT are not on the same scale as those reported by ARGO, we dynamically rescaled the GFT estimates to fit the historical influenza activity data from FluNet using the same moving time windows applied to the other models. Comparing historical GFT estimates to ARGO estimates will demonstrate whether ARGO avoids the pitfalls of GFT.
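The text does not specify how the GFT estimates were rescaled. One plausible reading, shown here purely as an assumption, is an affine map (intercept plus slope) fit by least squares on each moving window and then applied to the current GFT value. The function name and toy numbers are hypothetical.

```python
def rescale_gft(gft, ccr, window):
    """Map GFT estimates onto the CCR scale with a window-wise affine fit.

    Assumption: for each target week t, fit ccr ~ intercept + slope * gft on
    weeks t - window, ..., t - 1, then apply the fit to gft[t].
    """
    rescaled = {}
    for t in range(window, len(gft)):
        xs, ys = gft[t - window:t], ccr[t - window:t]
        mean_x = sum(xs) / window
        mean_y = sum(ys) / window
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        intercept = mean_y - slope * mean_x
        rescaled[t] = intercept + slope * gft[t]
    return rescaled


# Hypothetical toy values in which the CCR is exactly 1e-5 times the GFT estimate.
toy_gft = [100.0, 200.0, 300.0, 400.0, 500.0]
toy_ccr = [0.001, 0.002, 0.003, 0.004, 0.005]
rescaled = rescale_gft(toy_gft, toy_ccr, window=3)
```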

Results

Algeria

As shown in Table 1, the ARGO model outperforms the AR and GT models across every metric (RMSE, MAE, and CORR) from January 5, 2014 to January 3, 2016 in Algeria. The ARGO model yields a 7% lower RMSE than the AR model and a 35% lower RMSE than the GT model. As shown in Figure 1, the prediction error of each model increases during the periods leading up to and following the peak observed in the official CCR around May 2016 because the peak estimated by the GT model anticipates the peak observed in the official CCR while the peak estimated by the AR and ARGO models lags behind the peak observed in the official CCR. The GT model tends to overestimate the official CCR during periods of low volumes and underestimate the official CCR during periods of high volumes. The fact that the Google search data anticipate the peak in the official CCR reported by FluNet, unlike the AR and ARGO models, suggests that influenza-related Google searches could potentially provide valuable information but that the observed volume of influenza-related Google searches in Algeria may be too low to be informative. The heatmap of the ARGO coefficients for each predictor over time in Figure 1 shows that although the ARGO estimates consistently rely on the search frequency of ’grippe’ and the first autoregressive term, the weight of the first autoregressive term is always approximately twice that of the search frequency of ’grippe’. Furthermore, some influenza-related search terms such as ’vaccin’ and ’la grippe’ have negative coefficient values before December 2012. These results further confirm that the observed influenza-related Google searches do not provide valuable information for estimating the CCR in Algeria even though the ARGO model yields a slightly lower RMSE than the AR model.

Table 1. Comparison of performance metrics of the AutoRegression with Google search data (ARGO), Google Trends (GT), and autoregressive (AR) models in South Africa, Algeria, Morocco, and Ghana.

For each metric, the bold values denote the best performance among all models. Google Flu Trends (GFT) estimates were only available for South Africa.

	Full Period			GFT Period
	RMSE	MAE	CORR	RMSE	MAE	CORR
South Africa
GFT	-	-	-	0.0046	0.0041	0.8571
AR	0.0017	0.001	0.905	0.002	0.0012	0.8854
GT	0.0022	0.0016	0.8857	0.0026	0.0021	0.8167
ARGO	0.0016	0.0011	0.911	0.0019	0.0012	0.8999
Algeria
AR	0.0061	0.0033	0.8541	-	-	-
GT	0.0088	0.006	0.6537	-	-	-
ARGO	0.0057	0.0032	0.8736	-	-	-
Ghana
AR	0.0016	0.0012	0.7218	-	-	-
GT	0.0023	0.0018	0.0542	-	-	-
ARGO	0.0016	0.0012	0.7012	-	-	-
Morocco
AR	0.0032	0.002	0.7796	-	-	-
GT	0.004	0.0031	0.6803	-	-	-
ARGO	0.0029	0.0021	0.8384	-	-	-

Figure 1. The estimated complete case ratio in Algeria from the AutoRegression with Google search data (ARGO), Google Trends (GT), and autoregressive (AR) models.

The complete case ratio from FluNet, the prediction errors associated with each model, and the heatmap of the ARGO coefficients over time are shown for reference.

Ghana

In Ghana, the ARGO and AR models outperform the GT model across all performance metrics from January 3, 2014 to December 31, 2017. The ARGO and AR models perform equally well according to the RMSE and MAE, but the AR model outperforms the ARGO model according to the CORR. The AR and ARGO models yield a 30% lower RMSE than the GT model which suggests that the Google search data do not provide valuable information for estimating the CCR in Ghana. As shown in Figure 2, the estimated CCR from the GT model over time does not correspond to the observed CCR and the ARGO estimates rely almost entirely on the autoregressive data which confirms the relative unimportance of the Google search data compared to the autoregressive data. These results are not surprising considering that most of the Google search terms for Ghana are related to non-influenza infections generating influenza-like symptoms, such as yellow fever and meningitis.

Figure 2. The estimated complete case ratio in Ghana from the AutoRegression with Google search data (ARGO), Google Trends (GT), and autoregressive (AR) models.

The complete case ratio from FluNet, the prediction errors associated with each model, and the heatmap of the ARGO coefficients over time are shown for reference.

Morocco

In Morocco, the ARGO model outperforms every benchmark model according to the RMSE and CORR while the AR model outperforms the ARGO and GT models according to the MAE from September 4, 2016 to December 31, 2017. The ARGO model yields a 9% lower RMSE than the AR model and a 28% lower RMSE than the GT model. Even though the heatmap of the ARGO coefficients over time in Figure 3 shows that ’vaccin’ and ’grippe’ have a non-zero weight over time, the weight of the first autoregressive term is consistently approximately twice that of ’vaccin’ and ’grippe’. In addition, the GT and ARGO models tend to overestimate the CCR during periods of low volumes, which may be a symptom of having small sample sizes in both the observed CCR values and the influenza-related Google searches in Morocco.

Figure 3. The estimated complete case ratio in Morocco from the AutoRegression with Google search data (ARGO), Google Trends (GT), and autoregressive (AR) models.

The complete case ratio from FluNet, the prediction errors associated with each model, and the heatmap of the ARGO coefficients over time are shown for reference.

South Africa

In South Africa, the ARGO model outperforms every benchmark model according to the RMSE and CORR while the AR model outperforms the ARGO and GT models according to the MAE from January 5, 2014 to December 3, 2017. In particular, the ARGO model outperforms GFT in South Africa during the period from January 5, 2014 to August 9, 2015 when GFT published influenza activity estimates for South Africa. The ARGO model yields a 6% lower RMSE than the AR model and a 27% lower RMSE than the GT model during the full period. During the GFT period, the ARGO model yields a 5% lower RMSE than the AR model, a 27% lower RMSE than the GT model, and a 59% lower RMSE than GFT. As shown in Figure 4, the GT model tends to underestimate the official CCR during periods of peak influenza activity and occasionally under- or overestimate the official CCR during periods of low influenza activity. The heatmap of the ARGO coefficients over time in Figure 4 shows that the relationship between the official CCR reported by FluNet and the Google search data changes over time. For example, the search frequency of ’swine flu’ was as important as the autoregressive terms for predicting the CCR from August 2015 to May 2017 but not as important afterwards. The heatmap of the ARGO coefficients also visualizes how the ARGO model reduces dimensionality through L₁ regularization.
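The relative RMSE reductions quoted above can be reproduced from the rounded values in Table 1, as in this short sketch. The helper name is hypothetical, and the original percentages may have been computed from unrounded errors.

```python
def rmse_reduction(model_rmse, baseline_rmse):
    """Percentage by which model_rmse is lower than baseline_rmse."""
    return 100.0 * (1.0 - model_rmse / baseline_rmse)


# RMSE values for South Africa as printed in Table 1.
full_period = {"AR": 0.0017, "GT": 0.0022, "ARGO": 0.0016}
gft_period = {"GFT": 0.0046, "AR": 0.002, "GT": 0.0026, "ARGO": 0.0019}

full_reductions = {
    name: round(rmse_reduction(full_period["ARGO"], value))
    for name, value in full_period.items() if name != "ARGO"
}
gft_reductions = {
    name: round(rmse_reduction(gft_period["ARGO"], value))
    for name, value in gft_period.items() if name != "ARGO"
}
```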

Figure 4. The estimated complete case ratio in South Africa from the AutoRegression with Google search data (ARGO), Google Trends (GT), and autoregressive (AR) models.

The complete case ratio from FluNet, the estimates from Google Flu Trends (GFT), the prediction errors associated with each model, and the heatmap of the ARGO coefficients over time are shown for reference.

Discussion

Influenza surveillance methods that provide accurate influenza activity estimates in near real-time can help public health agencies adapt quickly to unexpected changes in influenza activity to minimize the ever-present risk of an influenza pandemic. We demonstrated that ARGO, which was originally conceived by Yang et al.¹⁷ to estimate influenza activity in near real-time in the United States, can be extended to monitor influenza activity in African countries. Specifically, our results show that ARGO outperforms every benchmark model according to the RMSE in South Africa, Algeria, and Morocco, which are countries with clear seasonal influenza activity. However, in countries where reported influenza activity seemed erratic (i.e., with sharp changes in influenza case counts from week to week), ARGO predictions did not properly track the observed case counts. Another important observation from our modeling efforts is that the GT model, which only relies on synchronous Google search activity, can track the salient features of influenza activity in Algeria, Morocco, and South Africa. This finding suggests that, in the absence of timely healthcare-based influenza estimates, public health officials could use near real-time estimates derived from Google search activity to get a sense of the unfolding trends in an ongoing outbreak.

The performance of the ARGO model is limited by the low sample size in both the observed historical influenza data and the influenza-related Google searches. As shown in Table 2, the median yearly NCC of South Africa, Algeria, Morocco, and Ghana ranges from 632 in Algeria to 7,388 in South Africa. Also, the number of influenza-related search terms identified in these countries ranges from 8 search terms in Morocco to 102 search terms in South Africa. To put that into perspective, the number of influenza-related search terms identified in the United States was 100¹⁷ and the median yearly NCC in the United States from 2014 to 2018 is at least twice that of South Africa. It is also worth noting that the study period of the historical influenza surveillance data and Google search data varied substantially between countries, from one year in Morocco to five years in South Africa, which affects the robustness of model predictions.

Table 2. Comparison of AutoRegression with Google search data (ARGO) performance measured by the Pearson correlation, median yearly number of confirmed cases of influenza, the presence of seasonal influenza activity, Internet penetration rates, Google market share, literacy rate, average population, country size, population density, GDP per capita, poverty headcount ratio, and conflict zone status in South Africa, Algeria, Morocco, and Ghana.

Characteristics	South Africa	Algeria	Morocco	Ghana
ARGO correlation	0.91	0.87	0.84	0.70
Median yearly case count	7,388	632	1,337	3,207
Seasonality	Yes	Yes	Yes	No
Internet penetration	54%	43%	58%	35%
Google market share	95%	97%	97%	89%
Literacy rate	99%	80%	69%	77%
Average population (millions)	57.5	42.2	35.7	28.3
Country size (10³ mi²)	471	919.6	274.5	92.5
Population density (people / mi²)	47	15.9	50	262.9
GDP per capita	$13,840	$4,669	$8,959	$2,081
Poverty headcount ratio	56%	6%	15%	24%
Conflict zone	No	Yes	No	No

ARGO performed best in countries with seasonal influenza activity, higher Internet access, and higher sampling of both historical influenza and Google search data, as shown in Table 2. However, it appears that the effect of Internet access on ARGO performance is modified by the literacy rate, because ARGO performance is not as impressive in Morocco as in South Africa even though both countries have comparable Internet penetration rates and strong seasonality in influenza activity. Interestingly, although the ARGO model in Ghana did not perform particularly well, it was still able to capture two closely spaced peaks of influenza activity in the 2016-2017 influenza season in Ghana, which suggests that this approach could work in countries with less defined seasonality. In Algeria, the historical influenza trends were unusual, with several consecutive recent years without significant influenza activity, which could be a symptom of sampling issues. We hypothesize that ARGO performance will improve when national disease surveillance systems adopt accurate and consistent reporting standards. Moreover, the ongoing increase of Internet penetration and literacy rates may also lead to better predictions in the near future.

Future work should explore expanding the Google search data to include other languages commonly spoken in Africa, such as Arabic in Northern Africa or indigenous languages of Sub-Saharan Africa. It would also be interesting to experiment with assigning different L₁ penalties to the historical influenza data and the influenza-related Google search data, given that the data originate from different sources. Specifically, the performance of the ARGO model may be improved by penalizing the exogenous Google search data while not necessarily penalizing the autoregressive terms that were identified as having a significant effect on the estimation of future influenza activity, as done by Yang et al.²⁹. Alternatively, future work should consider restricting the L₁ penalties so that the variables that are consistently identified as having a significant effect on the estimation of influenza activity receive only a fraction of the penalty applied to the rest of the terms, as done by Lu et al.²⁰,²¹.
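The idea of source-specific L₁ penalties can be sketched with scikit-learn's `Lasso`. This is only an illustration under stated assumptions, not the implementation used in this study or by Yang et al. or Lu et al.: the data are synthetic, the helper `fit_weighted_lasso` and the weight values are hypothetical, and the per-coefficient penalty is obtained through a standard column-rescaling identity (dividing column j by a weight w_j and fitting an ordinary lasso is equivalent to multiplying the penalty on the original coefficient j by w_j).

```python
import numpy as np
from sklearn.linear_model import Lasso


def fit_weighted_lasso(X, y, penalty_weights, alpha=0.1):
    """Lasso in which coefficient j is penalized by alpha * penalty_weights[j].

    Hypothetical helper. Weights must be strictly positive; a small weight
    approximates an (almost) unpenalized term.
    """
    w = np.asarray(penalty_weights, dtype=float)
    if np.any(w <= 0):
        raise ValueError("penalty weights must be strictly positive")
    model = Lasso(alpha=alpha, max_iter=50_000)
    model.fit(X / w, y)          # rescale columns, then fit an ordinary lasso
    coef = model.coef_ / w       # map coefficients back to the original scale
    return coef, model.intercept_


# Synthetic stand-ins: 3 autoregressive lags and 10 search-term series.
rng = np.random.default_rng(0)
n_weeks, n_lags, n_terms = 120, 3, 10
ar_lags = rng.normal(size=(n_weeks, n_lags))
search = rng.normal(size=(n_weeks, n_terms))
X = np.hstack([ar_lags, search])
y = 2.0 * ar_lags[:, 0] + 0.5 * search[:, 0] + rng.normal(scale=0.1, size=n_weeks)

# Assumed weights: autoregressive terms get one tenth of the search-term penalty.
weights = np.concatenate([np.full(n_lags, 0.1), np.ones(n_terms)])
coef_weighted, _ = fit_weighted_lasso(X, y, weights, alpha=0.1)
coef_uniform, _ = fit_weighted_lasso(X, y, np.ones(n_lags + n_terms), alpha=0.1)

print("first AR coefficient, reduced penalty:", coef_weighted[0])
print("first AR coefficient, uniform penalty:", coef_uniform[0])
```

In this toy setting the lightly penalized autoregressive coefficient is shrunk less than under a uniform penalty. Whether such a scheme improves out-of-sample estimates for real surveillance data would still need to be checked empirically.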

Furthermore, future work should assess the robustness of the ARGO model to variability in the Google search data by implementing the ARGO model using multiple samples of the Google search data. As shown in Yang et al.²⁹, different samples of the Google search data can be obtained by downloading the Google search data from GHT at different times of the week, since the search activity data from GHT are based on a sample of all Google searches. Also, since influenza activity data aggregated at the national level are not necessarily representative of actual influenza activity at the local level, future work could investigate the feasibility of implementing ARGO at smaller geographic scales, as shown by Lu et al.²⁰,²¹ for state-level predictions in the USA and the Boston metropolis.
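One way to frame such a robustness check is to refit the same model on each sample of the search data and look at how much the resulting estimates disagree. The sketch below is a simplified illustration only: the repeated GHT downloads are simulated by adding noise to one synthetic search matrix, the helper `prediction_spread` is hypothetical, and a single train/test split with a plain lasso stands in for the full ARGO procedure.

```python
import numpy as np
from sklearn.linear_model import Lasso


def prediction_spread(search_samples, y, n_train, alpha=0.05):
    """Fit one lasso per sample of the search data and compare test predictions.

    search_samples: list of (n_weeks, n_terms) arrays, one per simulated download.
    Returns the held-out RMSE of each fit and the week-by-week standard
    deviation of the predictions across samples.
    """
    preds = []
    for X in search_samples:
        model = Lasso(alpha=alpha, max_iter=50_000).fit(X[:n_train], y[:n_train])
        preds.append(model.predict(X[n_train:]))
    preds = np.vstack(preds)
    rmse = np.sqrt(np.mean((preds - y[n_train:]) ** 2, axis=1))
    return rmse, preds.std(axis=0)


# Synthetic stand-ins for two years of weekly data and five search terms.
rng = np.random.default_rng(1)
n_weeks, n_terms, n_train = 104, 5, 78
true_search = rng.normal(size=(n_weeks, n_terms))
cases = true_search @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=n_weeks)

# Each "download" is the underlying signal plus independent sampling noise (assumed).
samples = [true_search + rng.normal(scale=0.2, size=true_search.shape) for _ in range(5)]

rmse_per_sample, weekly_spread = prediction_spread(samples, cases, n_train)
print("RMSE per sample:", np.round(rmse_per_sample, 3))
print("mean spread across samples:", round(float(weekly_spread.mean()), 3))
```

A large spread relative to the prediction error would suggest that the estimates are sensitive to which sample of the search data happened to be downloaded.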

In conclusion, we find that the ARGO algorithm provided modest but significant improvement in monitoring influenza activity in four African countries compared to models that did not include Google search queries. It will be interesting to see how the performances of digital or social media-based surveillance approaches evolve over time as the quality of influenza surveillance data improves and Internet coverage expands. Further research could test whether these algorithms may be useful in monitoring emerging infections that may be perceived as more urgent health threats than influenza in Africa and other low-income settings.

Data availability

Underlying data

Harvard Dataverse: Historical influenza activity and influenza-related Google searches in Algeria, Ghana, Morocco, and South Africa, 2012-2017. https://doi.org/10.7910/DVN/9GPUWH²⁴.

This project contains the following underlying data:

Algeria_data (data for Algeria; please see the notes field of the record for a description of each heading)
Ghana_data (data for Ghana)
Morocco_data (data for Morocco)
South_africa_data (data for South Africa)

Extended data

Harvard Dataverse: Supplementary Data for: Leveraging Google Search Data to Track Influenza Outbreaks in Africa Mejia K, Viboud C and Santillana M. https://doi.org/10.7910/DVN/UOVT7E²⁶.

This project contains the following extended data:

Algeria_pacf (partial autocorrelogram for Algeria)
Flunet_heatmap (number of processed specimens reported from January 8, 2012 to October 7, 2018 for each country)
Ghana_pacf (partial autocorrelogram for Ghana)
Morocco_pacf (partial autocorrelogram for Morocco)
Search_terms (Google search terms for Algeria, Ghana, Morocco and South Africa)
South_africa_pacf (partial autocorrelogram for South Africa)
Supplementary_Data (containing Supplementary Figures 1 and 2, and Supplementary Tables 1–4)
Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).

Acknowledgements

We are grateful to the Google Health Trends team for granting us access to the API.

Disclaimer

This study does not necessarily represent the views of the NIH or the US government.

References

1. Iuliano AD, Roguski KM, Chang HH, et al.: Estimates of global seasonal influenza-associated respiratory mortality: a modelling study. Lancet. 2018; 391(10127): 1285–1300.
2. Ortiz JR, Perut M, Dumolard L, et al.: A global review of national influenza immunization policies: Analysis of the 2014 WHO/UNICEF Joint Reporting Form on immunization. Vaccine. 2016; 34(45): 5400–5405.
3. Aramaki E, Maskawa S, Morita M: Twitter catches the flu: detecting influenza epidemics using twitter. In Proceedings of the conference on empirical methods in natural language processing. (Association for Computational Linguistics). 2011; 1568–1576.
4. Paul MJ, Dredze M, Broniatowski D: Twitter improves influenza forecasting. PLoS Curr. 2014; 6: pii: ecurrents.outbreaks.90b9ed0f59bae4ccaa683a39865d9117.
5. Ginsberg J, Mohebbi MH, Patel RS, et al.: Detecting influenza epidemics using search engine query data. Nature. 2009; 457(7232): 1012.
6. Signorini A, Segre AM, Polgreen PM: The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS One. 2011; 6(5): e19467.
7. Achrekar H, Gandhe A, Lazarus R, et al.: Predicting flu trends using twitter data. In 2011 IEEE conference on computer communications workshops (INFOCOM WKSHPS). 2011; 702–707.
8. Polgreen P, Chen Y, Pennock DM, et al.: Using internet searches for influenza surveillance. Clin Infect Dis. 2008; 47(11): 1443–8.
9. Yang S, Santillana M, Brownstein JS, et al.: Using electronic health records and Internet search information for accurate influenza forecasting. BMC Infect Dis. 2017; 17(1): 332.
10. Santillana M, Nguyen AT, Louie T, et al.: Cloud-based Electronic Health Records for Real-time, Region-specific Influenza Surveillance. Sci Rep. 2016; 6: 25732.
11. McIver DJ, Brownstein JS: Wikipedia usage estimates prevalence of influenza-like illness in the United States in near real-time. PLoS Comput Biol. 2014; 10(4): e1003581.
12. Cook S, Conrad C, Fowlkes AL, et al.: Assessing Google flu trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic. PLoS One. 2011; 6(8): e23610.
13. Butler D: When Google got flu wrong. Nature. 2013; 494(7436): 155.
14. Lazer D, Kennedy R, King G, et al.: Big data. The parable of Google Flu: traps in big data analysis. Science. 2014; 343(6176): 1203–1205.
15. Santillana M, Zhang DW, Althouse BM, et al.: What can digital disease detection learn from (an external revision to) Google Flu Trends? Am J Prev Med. 2014; 47(3): 341–347.
16. Santillana M: Editorial Commentary: Perspectives on the Future of Internet Search Engines and Biosurveillance Systems. Clin Infect Dis. 2017; 64(1): 42–43.
17. Yang S, Santillana M, Kou SC: Accurate estimation of influenza epidemics using Google search data via ARGO. Proc Natl Acad Sci U S A. 2015; 112(47): 14473–14478.
18. Lampos V, Miller AC, Crossan S, et al.: Advances in nowcasting influenza-like illness rates using search query logs. Sci Rep. 2015; 5: 12760.
19. Xu Q, Gel YR, Ramirez Ramirez LL, et al.: Forecasting influenza in Hong Kong with Google search queries and statistical model fusion. PLoS One. 2017; 12(5): e0176690.
20. Lu FS, Hou S, Baltrusaitis K, et al.: Accurate Influenza Monitoring and Forecasting Using Novel Internet Data Streams: A Case Study in the Boston Metropolis. JMIR Public Health Surveill. 2018; 4(1): e4.
21. Lu FS, Hattab MW, Clemente CL, et al.: Improved state-level influenza nowcasting in the United States leveraging Internet-based data and network approaches. Nat Commun. 2019; 10(1): 147.
22. Clemente L, Lu F, Santillana M: Improved real-time influenza surveillance using internet search data in eight Latin American countries. bioRxiv. 2018; 418475.
23. International Telecommunications Union: Measuring the Information Society Report. 2018; 1. Accessed June 24, 2019.
24. Mejia K: Historical influenza activity and influenza-related Google searches in Algeria, Ghana, Morocco, and South Africa, 2012-2017. 2019. http://www.doi.org/10.7910/DVN/9GPUWH
25. World Health Organization: FluNet Database. 2018; Accessed October 15, 2018.
26. Mejia K: Supplementary Data for: Leveraging Google Search Data to Track Influenza Outbreaks in Africa Mejia K, Viboud C and Santillana M. 2019. http://www.doi.org/10.7910/DVN/UOVT7E
27. Pedregosa F, Varoquaux G, Gramfort A, et al.: Scikit-learn: Machine learning in python. J Mach Learn Res. 2011; 12: 2825–2830.
28. Google Inc: Google Flu Trends - South Africa (Experimental). 2015; Retrieved November 7, 2018.
29. Yang S, Kou SC, Lu F, et al.: Advances in using Internet searches to track dengue. PLoS Comput Biol. 2017; 13(7): e1005607.

Competing interests

No competing interests were disclosed.

Grant information

The study was funded in part by the Bill and Melinda Gates Foundation (OPP 1195154).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Published: 31 Oct 2019, 3:1653

https://doi.org/10.12688/gatesopenres.13072.1

Copyright

© 2019 Mejía K et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The author(s) is/are employees of the US Government and therefore domestic copyright protection in USA does not apply to this work. The work may be protected under the copyright laws of other jurisdictions when used in those jurisdictions.

how to cite this article

Mejía K, Viboud C and Santillana M. Leveraging Google search data to track influenza outbreaks in Africa [version 1; peer review: 1 approved, 1 not approved] Gates Open Res 2019, 3:1653 (https://doi.org/10.12688/gatesopenres.13072.1)

NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 19 Feb 2020

Simon Moura, Department of Computer Science, University College London, London, UK

Vasileios Lampos, Department of Computer Science, University College London, London, UK

Not Approved

https://doi.org/10.21956/gatesopenres.14208.r28387

Summary of the article
The paper presents an analysis of Google-based autoregressive techniques for nowcasting influenza cases in 4 African countries. There is a focus on using ARGO¹ for this purpose. The authors claim that Google-search data improve accuracy in all but one country.

Specific comments
This is a well-written article focusing on a specific applied task that previously published papers have not covered (online search based models for flu for African countries). To this end, this is a very interesting paper and a very useful undertaking. However, in its current version, the paper makes claims that the experimental results do not support and, from our point of view, further work is required for it to become scientifically sound. More details justifying this are provided below.

MAJOR ISSUES
* Are sufficient details of methods and analysis provided to allow replication by others?
* Is the study design appropriate and is the work technically sound?
* Is the statistical analysis and its interpretation appropriate?
* Are the conclusions drawn adequately supported by the results?

(1) RMSE vs. MAE and accuracy estimates:
From Simon Moura: "The authors report 3 different metrics (RMSE, MAE and Pearson correlation). When comparing results, the authors are mainly referring to RMSE to say which model performs the best and conclude saying that the ARGO model (the proposed model using both autoregression and Google searches) outperforms the other models in 3 cases out of 4. However, if we would take MAE as the main performance metric, the AR model (without using Google searches) is outperforming ARGO in 2 cases out of 4 (South Africa +10% and Morocco +5%) and performs equally good for another country (Ghana) while under-performing only slightly for the last country (Algeria -3.1%). So, if we would pick MAE as the main metric to describe the results, the conclusion that using both components, Google searches and an autoregressive part, wouldn't hold anymore."

In addition, RMSE penalises larger errors more, whereas MAE treats all errors uniformly. Both metrics are useful for evaluating flu models, and I believe that improved methods should be doing better under both MAE and RMSE. Another thing that is missing here is an estimation of the statistical significance of the difference in performance between the AR (being the most competitive) and the ARGO model.

(2) Is one week of delay for obtaining flu estimates from healthcare endpoints realistic for these countries? The authors assume this to be the case, but provide no proof for that. As this is an applied paper with a specific purpose, I believe that the authors should make an effort to reproduce a realistic setting and adequately justify such choices. Also: delaying past influenza rates by 2 weeks might signify the contribution of Google data and make the claims of the authors more relevant.

(3) Detail is missing about the training of the models. My preference would be to identify specific periods for testing (say 2-3 52-week periods based on data availability in each country), train a model on past data and evaluate on each period separately. In addition, Simon Moura wrote: "The authors present clearly the ARGO model. However, I would like to have more precision about the 10-fold cross validation method used. How are the folds created?"

(4) I think it is necessary to provide more details in the main paper about the number of queries identified for each country, as well as the number of queries that remained active (nonzero weight), given the fact that there is an L1-norm regularisation in ARGO (and lasso).

(5) I believe that the baselines used are not very strong. First, lasso will underperform when collinear predictors are present (and frequency time series will be collinear for queries about flu). I would recommend the use of elastic net² instead or if the query space is quite small (<= 10 queries) even the use of ridge regression. Second, I would also like to see a comparison between the AR model and a naive persistence model to further understand the value of this baseline. Third, there are different models proposed for this task in the literature, e.g. based on Gaussian Process regression³, that tend to perform even better, but I understand this might be out-of-scope.

(6) From the figures, it becomes evident that some of the models yielded negative flu rate estimates. That is not realistic, and hence, model outputs should have been thresholded to 0, prior to plotting, but most importantly, prior to estimating accuracy. An alternative solution for this is to convert influenza rates from 0 to 1 and apply a logit transformation.
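
As a concrete illustration of comments (1) and (6), the following minimal Python sketch shows how two models can tie on MAE while differing on RMSE, and how clipping a negative estimate at 0 changes the computed error. The numbers are invented for illustration and are not taken from the paper; the helper names are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: squares residuals, so large misses dominate."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))


def clip_at_zero(y_pred):
    """Replace negative case-count estimates with 0 before scoring or plotting."""
    return np.maximum(np.asarray(y_pred, dtype=float), 0.0)


# Invented weekly case counts and two hypothetical models.
observed = [10, 10, 10, 10]
steady_misses = [12, 12, 12, 12]   # off by 2 every week
one_big_miss = [10, 10, 10, 18]    # exact except for one week off by 8

for name, pred in [("steady", steady_misses), ("one big miss", one_big_miss)]:
    print(name, "MAE:", mae(observed, pred), "RMSE:", rmse(observed, pred))

# A negative estimate is clipped before the error is computed.
raw = [-3.0, 1.0, 5.0]
truth = [0.0, 1.0, 4.0]
print("RMSE raw:", rmse(truth, raw), "RMSE clipped:", rmse(truth, clip_at_zero(raw)))
```

Both hypothetical models have an MAE of 2, but the RMSE is 2 for the first and 4 for the second, so which model looks better depends on the metric chosen.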

MINOR ISSUES
* Is the work clearly and accurately presented and does it cite the current literature?

(1) The work is accurately presented, but we noticed that some citations do not follow a chronological merit function. In particular, the works by Culotta⁴ and Lampos & Cristianini⁵,⁶ on modelling flu from Twitter data were accepted and published earlier (in 2010) than the ones referenced in the paper (references 3, 4, 6, and 7 in the main paper were published from 2011 onwards).

OTHER COMMENTS
(1) The authors might want to inform the readers that Google Correlate has been decommissioned by Google as of December 2019.

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Partly
Are sufficient details of methods and analysis provided to allow replication by others? Partly
If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Partly

References

1. Yang S, Santillana M, Kou SC: Accurate estimation of influenza epidemics using Google search data via ARGO. Proc Natl Acad Sci U S A. 2015; 112(47): 14473–14478.
2. Zou H, Hastie T: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005; 67(2): 301–320.
3. Lampos V, Zou B, Cox IJ: Enhancing Feature Selection Using Word Embeddings: The Case of Flu Surveillance. ACM. 2017.
4. Culotta A: Towards detecting influenza epidemics by analyzing Twitter messages. ACM. 2010.
5. Lampos V, Cristianini N: Tracking the flu pandemic by monitoring the social web. IEEE. 2010.
6. Lampos V, De Bie T, Cristianini N: Flu Detector - Tracking Epidemics on Twitter. 2010; 6323: 599–602.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Computer Science (natural language processing, machine learning), Computational health (focus on non-traditional disease surveillance methods)

We confirm that we have read this submission and believe that we have an appropriate level of expertise to state that we do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Reviewer Report 06 Feb 2020

Marcelo F.C. Gomes, Scientific Computing Program, Oswaldo Cruz Foundation (Fiocruz), Rio de Janeiro, Brazil

Approved

https://doi.org/10.21956/gatesopenres.14208.r28388

The present article outlines a straightforward method to make use of Google searches (GS) to enhance the accuracy of autoregressive (AR) models for Influenza activity forecast in a few African Countries. The proposed model can be easily translated to other localities as long as there is enough search volume. It also avoids Google Trends (GT) pitfalls by combining the online data with actual case counts and manually filtering out spurious terms that would otherwise be included in an unsupervised trend correlation algorithm.
This is clearly of great interest to public health authorities since it offers a relatively simple to implement tool of great importance for disease surveillance, particularly so for endemic ones, since it enables contingency plan actions in a timely manner.

On the other hand, as proposed it requires that the local data on case counts be collected and reported almost at time of occurrence, at most being made available by the end of each epidemiological week. In many countries, and for many diseases, this is not the case since there are several factors affecting it: characteristic time for infected individuals to seek a health care unit (HCU) after first symptoms, dedicated staff at HCU for filling the report and uploading it to central database (although digital notification systems might be commonplace in most developed countries, manual notifications are still a reality in several others), shipping to and processing samples at laboratories, and so on.

Nonetheless, it is transparent that the method can be adapted to incorporate the output of nowcasting methods that offer corrections for this particular delay process (backfill of incomplete weekly counts, see references [1-6] for examples), so that it does not lose validity and relevance, although it should affect its accuracy and confidence intervals.

As for GT in particular, the results shown highlight how using this data stream alone can lead to errors in the long run, as reported in previous studies and mentioned by the authors. The heatmaps reporting the effect of each term over time make it clear how volatile it is and how terms that eventually are correlated only at a given time window end up being part of the algorithm. For example, we see that malaria, a disease that only shares fever as common symptom with the Flu and which can have seasonality not at all correlated with Influenza depending on local climate characteristics, is inserted as relevant term in Ghana, having a relatively strong effect from mid-2016 to mid-2017 only. The same can be said of similar terms having opposite effects in the same time window (e.g., “grippe” and “la grippe”, both meaning influenza-like illness or cold for the general population, present in Morocco and Algeria’s models). As pointed out in the text, the strategy used by the authors of incorporating an AR component greatly diminishes those effects since the lag term dominates the parameter space, dramatically outperforming GT prediction, while the frequent recalibration and short-term prediction window used offer smaller but perceptible accuracy increase with respect to a pure AR model, so that parsimoniously incorporating GS data is a benefit.

Minor issues:

The bottom panels of Figures 1-4 present the time series for "Prediction error", but the text does not state which specific metric is used. Readers might guess that it is simply y_t-y_t^{hat} but it would be better to have it explicitly mentioned in the text or figure caption.
On the same topic, it appears that the prediction error calculated for Ghana (Fig.2) has been repeated for Algeria (Fig.1).

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes

References

1. Salmon M, Schumacher D, Stark K, Höhle M: Bayesian outbreak detection in the presence of reporting delays. Biom J. 2015; 57(6): 1051–67.
2. Barbosa MT, Struchiner CJ: The estimated magnitude of AIDS in Brazil: a delay correction applied to cases with lost dates. Cad Saude Publica. 18(1): 279–85.
3. Noufaily A, Farrington P, Garthwaite P, Enki D, et al.: Detection of Infectious Disease Outbreaks From Laboratory Data With Reporting Delays. Journal of the American Statistical Association. 2016; 111(514): 488–499.
4. Bastos LS, Economou T, Gomes MFC, Villela DAM, et al.: A modelling approach for correcting reporting delays in disease surveillance data. Stat Med. 2019; 38(22): 4363–4377.
5. Höhle M, an der Heiden M: Bayesian nowcasting during the STEC O104:H4 outbreak in Germany, 2011. Biometrics. 2014; 70(4): 993–1002.
6. Farrington C, Andrews N, Beale A, Catchpole M: A Statistical Algorithm for the Early Detection of Outbreaks of Infectious Disease. Journal of the Royal Statistical Society. Series A (Statistics in Society). 1996; 159(3).

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: disease surveillance, mathematical and computational models for epidemiology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

20 Views

19 Feb 2020 | for Version 1

Simon Moura, Department of Computer Science, University College London, London, UK

Vasileios Lampos, Department of Computer Science, University College London, London, UK

20 Views Cite this report Responses(0)

Not Approved

Summary of the article
The paper presents an analysis of Google-based autoregressive techniques for nowcasting influenza cases in 4 African countries. There is a focus on using ARGO¹ for this purpose. The authors claim that Google-search data improve accuracy in all but one country.

Specific comments
This is a well-written article focusing on a specific applied task that previously published papers have not covered (online search based models for flu for African countries). To this end, this is a very interesting paper and a very useful undertaking. However, at its current version, the paper makes claims that the experimental results are not supporting and, from our point of view, further work is required for it to become scientifically sound. More details justifying this are provided below.

MAJOR ISSUES
* Are sufficient details of methods and analysis provided to allow replication by others?
* Is the study design appropriate and is the work technically sound?
* Is the statistical analysis and its interpretation appropriate?
* Are the conclusions drawn adequately supported by the results?

(1) RMSE vs. MAE and accuracy estimates:
From Simon Moura: "The authors report 3 different metrics (RMSE, MAE and Pearson correlation). When comparing results, the authors are mainly referring to RMSE to say which model performs the best and conclude saying that the ARGO model (the proposed model using both autoregression and Google searches) outperforms the other models in 3 cases out of 4. However, if we would take MAE as the main performance metric, the AR model (without using Google searches) is outperforming ARGO in 2 cases out of 4 (South Africa +10% and Morocco +5%) and performs equally good for another country (Ghana) while under-performing only slightly for the last country (Algeria -3.1%). So, if we would pick MAE as the main metric to describe the results, the conclusion that using both components, Google searches and an autoregressive part, wouldn't hold anymore."

In addition, RMSE penalises larger errors more, whereas MAE treats all errors uniformly. Both metrics are useful for evaluating flu models, and I believe that improved methods should be doing better under both MAE and RMSE. Another thing that is missing here is an estimation of the statistical significance of the difference in performance between the AR (being the most competitive) and the ARGO model.
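
To illustrate how the two metrics can rank models differently, and how the significance of a difference might be estimated, here is a minimal sketch. The toy numbers and the function names (`rmse`, `mae`, `paired_bootstrap_diff`) are assumptions for illustration and are not taken from the paper. The bootstrap resamples weeks independently, which ignores autocorrelation in the errors; a block bootstrap would likely be more appropriate for weekly flu series.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error: squares errors, so large misses dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def paired_bootstrap_diff(y_true, pred_a, pred_b, metric, n_boot=2000, seed=0):
    """Return (observed, lo, hi) for metric(A) - metric(B).

    Weeks are resampled with replacement, using the same resampled weeks for
    both models (paired). lo and hi are approximate 95% percentile bounds.
    """
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(metric(yt, a) - metric(yt, b))
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    observed = metric(y_true, pred_a) - metric(y_true, pred_b)
    return observed, lo, hi

# Toy data (hypothetical): model A is off by 1 every week,
# model B is exact except for one large miss.
y = [0.0, 0.0, 0.0, 0.0]
model_a = [1.0, 1.0, 1.0, 1.0]
model_b = [0.0, 0.0, 0.0, 3.0]

print("MAE  A, B:", mae(y, model_a), mae(y, model_b))    # B is better on MAE
print("RMSE A, B:", rmse(y, model_a), rmse(y, model_b))  # A is better on RMSE
print(paired_bootstrap_diff(y, model_a, model_b, rmse))
```

If the interval returned by `paired_bootstrap_diff` contains zero, the data do not clearly separate the two models under that metric.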

(2) Is one week of delay for obtaining flu estimates from healthcare endpoints realistic for these countries? The authors assume this to be the case, but provide no proof for that. As this is an applied paper with a specific purpose, I believe that the authors should make an effort to reproduce a realistic setting and adequately justify such choices. Also, delaying past influenza rates by 2 weeks might magnify the contribution of Google data and make the claims of the authors more relevant.

(3) Detail is missing about the training of the models. My preference would be to identify specific periods for testing (say 2-3 52-week periods based on data availability in each country), train a model on past data and evaluate on each period separately. In addition, Simon Moura wrote: "The authors present clearly the ARGO model. However, I would like to have more precision about the 10-fold cross validation method used. How are the folds created?"
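
The evaluation scheme suggested above could be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the helper `expanding_window_splits` and the 260-week series length are hypothetical. The point is that each test period is predicted only from weeks that precede it, which randomly shuffled folds would not guarantee.

```python
def expanding_window_splits(n_weeks, n_test_periods, period_len=52):
    """Return a list of (train_weeks, test_weeks) pairs of index ranges.

    The last n_test_periods * period_len weeks are held out as consecutive
    test periods. Each model is trained on all weeks before its test period.
    """
    first_test_start = n_weeks - n_test_periods * period_len
    if first_test_start <= 0:
        raise ValueError("not enough weeks to leave a training window")
    splits = []
    for k in range(n_test_periods):
        start = first_test_start + k * period_len
        splits.append((range(0, start), range(start, start + period_len)))
    return splits

# Hypothetical example: 5 years of weekly data, last 3 years used for testing.
splits = expanding_window_splits(n_weeks=260, n_test_periods=3)
for train_weeks, test_weeks in splits:
    print("train on weeks 0..%d, test on weeks %d..%d"
          % (train_weeks[-1], test_weeks[0], test_weeks[-1]))
```

Reporting the metrics separately for each test period would also show whether performance is stable across seasons.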

(4) I think it is necessary to provide more details in the main paper about the number of queries identified for each country, as well as the number of queries that remained active (nonzero weight), given the fact that there is an L1-norm regularisation in ARGO (and lasso).

(5) I believe that the baselines used are not very strong. First, lasso will underperform when collinear predictors are present (and frequency time series will be collinear for queries about flu). I would recommend the use of elastic net² instead or if the query space is quite small (<= 10 queries) even the use of ridge regression. Second, I would also like to see a comparison between the AR model and a naive persistence model to further understand the value of this baseline. Third, there are different models proposed for this task in the literature, e.g. based on Gaussian Process regression³, that tend to perform even better, but I understand this might be out-of-scope.
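
A naive persistence model of the kind mentioned above simply repeats the most recently available observation. A minimal sketch follows; the function name `persistence_forecast` and the toy series are assumptions for illustration. The `lag` argument stands for the reporting delay in weeks, so `lag=2` would mimic the two-week delay scenario raised in comment (2).

```python
def persistence_forecast(series, lag=1):
    """Return (targets, predictions) where each prediction is the value
    observed `lag` weeks before the corresponding target."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be at least 1 and shorter than the series")
    targets = list(series[lag:])
    predictions = list(series[:-lag])
    return targets, predictions

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy weekly flu rates (hypothetical values).
weekly_rates = [2.0, 4.0, 3.0, 5.0]

targets_1, preds_1 = persistence_forecast(weekly_rates, lag=1)
targets_2, preds_2 = persistence_forecast(weekly_rates, lag=2)
print("one-week delay MAE:", mae(targets_1, preds_1))
print("two-week delay MAE:", mae(targets_2, preds_2))
```

If the AR model does not clearly beat this baseline, the gain attributed to the autoregressive terms would need to be interpreted with more caution.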

(6) From the figures, it becomes evident that some of the models yielded negative flu rate estimates. That is not realistic, and hence, model outputs should have been thresholded to 0, prior to plotting, but most importantly, prior to estimating accuracy. An alternative solution for this is to convert influenza rates from 0 to 1 and apply a logit transformation.
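
Both remedies can be sketched in a few lines. The sketch assumes that flu rates are expressed as proportions between 0 and 1; the helper names and the `eps` guard are illustrative choices and do not come from the paper.

```python
import math

def clip_nonnegative(predictions):
    """Threshold model outputs at 0 before plotting or scoring."""
    return [max(0.0, p) for p in predictions]

def logit(p, eps=1e-6):
    """Map a proportion in [0, 1] to the real line.

    eps keeps the result finite at exactly 0 or 1 (an assumed guard value).
    """
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Map a real-valued model output back to (0, 1), computed stably."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Option 1: clip raw predictions (hypothetical values).
print(clip_nonnegative([-0.2, 0.0, 0.3]))

# Option 2: fit the model on logit(rate), then back-transform predictions.
z = logit(0.25)
print(z, inv_logit(z))
```

With the second option, a linear model is fitted to the transformed rates, so any real-valued prediction maps back to a valid proportion without thresholding.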

MINOR ISSUES
* Is the work clearly and accurately presented and does it cite the current literature?

(1) The work is accurately presented, but we noticed that some citations do not respect chronological precedence. In particular, the works by Culotta⁴ and Lampos & Cristianini (⁵ and ⁶) on modelling flu from Twitter data were accepted and published (in 2010) before the ones referenced in the paper (references 3, 4, 6, and 7 in the main paper were published from 2011 onwards).

OTHER COMMENTS
(1) The authors might want to inform the readers that Google Correlate has been decommissioned by Google as of December 2019.

Is the work clearly and accurately presented and does it cite the current literature?
Partly

Is the study design appropriate and is the work technically sound?
Partly

Are sufficient details of methods and analysis provided to allow replication by others?
Partly

If applicable, is the statistical analysis and its interpretation appropriate?
Partly

Are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions drawn adequately supported by the results?
Partly

References

1. Yang S, Santillana M, Kou SC: Accurate estimation of influenza epidemics using Google search data via ARGO. Proc Natl Acad Sci U S A. 2015; 112 (47): 14473-8
2. Zou H, Hastie T: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005; 67 (2): 301-320
3. Lampos V, Zou B, Cox IJ: Enhancing Feature Selection Using Word Embeddings: The Case of Flu Surveillance. ACM. 2017.
4. Culotta A: Towards detecting influenza epidemics by analyzing Twitter messages. ACM. 2010.
5. Lampos V, Cristianini N: Tracking the flu pandemic by monitoring the social web. IEEE. 2010.
6. Lampos V, De Bie T, Cristianini N: Flu Detector - Tracking Epidemics on Twitter. 2010; 6323: 599-602

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Computer Science (natural language processing, machine learning), Computational health (focus on non-traditional disease surveillance methods)

We confirm that we have read this submission and believe that we have an appropriate level of expertise to state that we do not consider it to be of an acceptable scientific standard, for reasons outlined above.


Reviewer Report

06 Feb 2020 | for Version 1

Marcelo F.C. Gomes, Scientific Computing Program, Oswaldo Cruz Foundation (Fiocruz), Rio de Janeiro, Brazil

Approved

The present article outlines a straightforward method for using Google searches (GS) to enhance the accuracy of autoregressive (AR) models for forecasting influenza activity in a few African countries. The proposed model can be easily transferred to other localities as long as there is enough search volume. It also avoids the pitfalls of Google Trends (GT) by combining the online data with actual case counts and by manually filtering out spurious terms that would otherwise be included by an unsupervised trend correlation algorithm.
This is clearly of great interest to public health authorities, since it offers a relatively simple-to-implement tool of great importance for disease surveillance, particularly for endemic diseases, as it enables contingency plans to be enacted in a timely manner.

On the other hand, as proposed, it requires that local case count data be collected and reported almost at the time of occurrence, being made available by the end of each epidemiological week at the latest. In many countries, and for many diseases, this is not the case, since several factors affect reporting: the characteristic time for infected individuals to seek a health care unit (HCU) after first symptoms; the availability of dedicated staff at the HCU to fill in the report and upload it to a central database (although digital notification systems might be commonplace in most developed countries, manual notifications are still a reality in several others); the shipping of samples to laboratories and their processing; and so on.

Nonetheless, it is clear that the method can be adapted to incorporate the output of nowcasting methods that correct for this particular delay process (backfill of incomplete weekly counts; see references [1-6] for examples), so that it does not lose validity and relevance, although this should affect its accuracy and confidence intervals.

As for GT in particular, the results shown highlight how using this data stream alone can lead to errors in the long run, as reported in previous studies and mentioned by the authors. The heatmaps reporting the effect of each term over time make clear how volatile these effects are and how terms that happen to be correlated only within a given time window end up being part of the algorithm. For example, we see that malaria, a disease that shares only fever as a common symptom with the flu and whose seasonality, depending on local climate characteristics, may not be correlated with influenza at all, is included as a relevant term in Ghana, having a relatively strong effect from mid-2016 to mid-2017 only. The same can be said of similar terms having opposite effects in the same time window (e.g., “grippe” and “la grippe”, both meaning influenza-like illness or cold for the general population, present in the models for Morocco and Algeria). As pointed out in the text, the authors' strategy of incorporating an AR component greatly diminishes those effects, since the lag term dominates the parameter space, dramatically outperforming the GT prediction, while the frequent recalibration and short-term prediction window offer a smaller but perceptible accuracy increase with respect to a pure AR model, so that parsimoniously incorporating GS data is a benefit.

Minor issues:

The bottom panels of Figures 1-4 present the time series for "Prediction error", but the text does not mention which specific metric is used. Readers might guess that it is simply $y_t - \hat{y}_t$, but it would be better to have it explicitly stated in the text or figure caption.
On the same topic, it appears that the prediction error calculated for Ghana (Fig. 2) has been repeated for Algeria (Fig. 1).

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
Yes

Are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions drawn adequately supported by the results?
Yes

References

1. Salmon M, Schumacher D, Stark K, Höhle M: Bayesian outbreak detection in the presence of reporting delays. Biom J. 2015; 57 (6): 1051-67
2. Barbosa MT, Struchiner CJ: The estimated magnitude of AIDS in Brazil: a delay correction applied to cases with lost dates. Cad Saude Publica. 18 (1): 279-85
3. Noufaily A, Farrington P, Garthwaite P, Enki D, et al.: Detection of Infectious Disease Outbreaks From Laboratory Data With Reporting Delays. Journal of the American Statistical Association. 2016; 111 (514): 488-499
4. Bastos LS, Economou T, Gomes MFC, Villela DAM, et al.: A modelling approach for correcting reporting delays in disease surveillance data. Stat Med. 2019; 38 (22): 4363-4377
5. Höhle M, an der Heiden M: Bayesian nowcasting during the STEC O104:H4 outbreak in Germany, 2011. Biometrics. 2014; 70 (4): 993-1002
6. Farrington C, Andrews N, Beale A, Catchpole M: A Statistical Algorithm for the Early Detection of Outbreaks of Infectious Disease. Journal of the Royal Statistical Society. Series A (Statistics in Society). 1996; 159 (3).

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

disease surveillance, mathematical and computational models for epidemiology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

1. Iuliano AD, Roguski KM, Chang HH, et al.: Estimates of global seasonal influenza-associated respiratory mortality: a modelling study. Lancet. 2018; 391(10127): 1285–1300.
2. Ortiz JR, Perut M, Dumolard L, et al.: A global review of national influenza immunization policies: Analysis of the 2014 WHO/UNICEF Joint Reporting Form on immunization. Vaccine. 2016; 34(45): 5400–5405.
3. Aramaki E, Maskawa S, Morita M: Twitter catches the flu: detecting influenza epidemics using twitter. In Proceedings of the conference on empirical methods in natural language processing. (Association for Computational Linguistics). 2011; 1568–1576.
4. Paul MJ, Dredze M, Broniatowski D: Twitter improves influenza forecasting. PLoS Curr. 2014; 6: pii: ecurrents.outbreaks.90b9ed0f59bae4ccaa683a39865d9117.
5. Ginsberg J, Mohebbi MH, Patel RS, et al.: Detecting influenza epidemics using search engine query data. Nature. 2009; 457(7232): 1012.
6. Signorini A, Segre AM, Polgreen PM: The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS One. 2011; 6(5): e19467.
7. Achrekar H, Gandhe A, Lazarus R, et al.: Predicting flu trends using twitter data. In 2011 IEEE conference on computer communications workshops (INFOCOM WKSHPS). 2011; 702–707.
8. Polgreen P, Chen Y, Pennock DM, et al.: Using internet searches for influenza surveillance. Clin Infect Dis. 2008; 47(11): 1443–8.
9. Yang S, Santillana M, Brownstein JS, et al.: Using electronic health records and Internet search information for accurate influenza forecasting. BMC Infect Dis. 2017; 17(1): 332.
10. Santillana M, Nguyen AT, Louie T, et al.: Cloud-based Electronic Health Records for Real-time, Region-specific Influenza Surveillance. Sci Rep. 2016; 6: 25732.
11. McIver DJ, Brownstein JS: Wikipedia usage estimates prevalence of influenza-like illness in the United States in near real-time. PLoS Comput Biol. 2014; 10(4): e1003581.
12. Cook S, Conrad C, Fowlkes AL, et al.: Assessing Google flu trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic. PLoS One. 2011; 6(8): e23610.
13. Butler D: When Google got flu wrong. Nature. 2013; 494(7436): 155.
14. Lazer D, Kennedy R, King G, et al.: Big data. The parable of Google Flu: traps in big data analysis. Science. 2014; 343(6176): 1203–1205.
15. Santillana M, Zhang DW, Althouse BM, et al.: What can digital disease detection learn from (an external revision to) Google Flu Trends? Am J Prev Med. 2014; 47(3): 341–347.
16. Santillana M: Editorial Commentary: Perspectives on the Future of Internet Search Engines and Biosurveillance Systems. Clin Infect Dis. 2017; 64(1): 42–43.
17. Yang S, Santillana M, Kou SC: Accurate estimation of influenza epidemics using Google search data via ARGO. Proc Natl Acad Sci U S A. 2015; 112(47): 14473–14478.
18. Lampos V, Miller AC, Crossan S, et al.: Advances in nowcasting influenza-like illness rates using search query logs. Sci Rep. 2015; 5: 12760.
19. Xu Q, Gel YR, Ramirez Ramirez LL, et al.: Forecasting influenza in Hong Kong with Google search queries and statistical model fusion. PLoS One. 2017; 12(5): e0176690.
20. Lu FS, Hou S, Baltrusaitis K, et al.: Accurate Influenza Monitoring and Forecasting Using Novel Internet Data Streams: A Case Study in the Boston Metropolis. JMIR Public Health Surveill. 2018; 4(1): e4.
21. Lu FS, Hattab MW, Clemente CL, et al.: Improved state-level influenza nowcasting in the United States leveraging Internet-based data and network approaches. Nat Commun. 2019; 10(1): 147.
22. Clemente L, Lu F, Santillana M: Improved real-time influenza surveillance using internet search data in eight latin american countries. bioRxiv. 2018; 418475.
23. International Telecommunications Union: Measuring the Information Society Report. 2018; 1. Accessed June 24, 2019.
24. Mejia K: Historical influenza activity and influenza-related Google searches in Algeria, Ghana, Morocco, and South Africa, 2012-2017. 2019. http://www.doi.org/10.7910/DVN/9GPUWH
25. World Health Organization: FluNet Database. 2018; Accessed October 15, 2018.
26. Mejia K: Supplementary Data for: Leveraging Google Search Data to Track Influenza Outbreaks in Africa Mejia K, Viboud C and Santillana M. 2019. http://www.doi.org/10.7910/DVN/UOVT7E
27. Pedregosa F, Varoquaux G, Gramfort A, et al.: Scikit-learn: Machine learning in python. J Mach Learn Res. 2011; 12: 2825–2830.
28. Google Inc: Google Flu Trends - South Africa (Experimental). 2015; Retrieved November 7, 2018.
29. Yang S, Kou SC, Lu F, et al.: Advances in using Internet searches to track dengue. PLoS Comput Biol. 2017; 13(7): e1005607.

Leveraging Google search data to track influenza outbreaks in Africa

Abstract

Keywords

Introduction

Our contribution

Methods

Influenza surveillance data

Google search data

Models

ARGO model

Performance metrics

Benchmark models

Results

Algeria

Table 1. Comparison of performance metrics of the AutoRegression with Google search data (ARGO), Google Trends (GT), and autoregressive (AR) models in South Africa, Algeria, Morocco, and Ghana.

Figure 1. The estimated complete case ratio in Algeria from the AutoRegression with Google search data (ARGO), Google Trends (GT), and autoregressive (AR) models.

Ghana

Figure 2. The estimated complete case ratio in Ghana from the AutoRegression with Google search data (ARGO), Google Trends (GT), and autoregressive (AR) models.

Morocco

Figure 3. The estimated complete case ratio in Morocco from the AutoRegression with Google search data (ARGO), Google Trends (GT), and autoregressive (AR) models.

South Africa

Figure 4. The estimated complete case ratio in South Africa from the AutoRegression with Google search data (ARGO), Google Trends (GT), and autoregressive (AR) models.

Discussion

Data availability

Underlying data

Extended data

Acknowledgements

Disclaimer

References
