Poirier C, Bouzillé G, Bertaud V, Cuggia M, Santillana M, Lavenu A. Big data to predict gastroenteritis outbreaks. Submitted.

Background: Acute gastroenteritis (AG) is a major public health issue. Traditional surveillance produces estimates that help reduce its impact and organize adapted sanitary responses, but with a 1- to 3-week delay. The main challenge is to produce near real-time and longer-term estimates.

Objective: For the flu, alternative modeling strategies have been proposed to avoid this delay. We assess one of these methods to predict AG up to 3 weeks ahead at different levels.

Methods: We used Web data, hospital data, and historical data in combination with an Elastic Net model and a smoother.

Results: For forecasts up to three weeks ahead, we still obtain PCC between 0.73 and 0.87 and MSE between 0.533 and 0.257, depending on the level of prediction.

Conclusions: We found that external data sources in combination with Elastic Net give accurate estimates. This approach could complement traditional surveillance.
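
As an illustration of the two evaluation metrics reported above, the following is a minimal sketch that assumes PCC denotes the Pearson correlation coefficient and MSE the mean squared error between observed and predicted values (the abstract does not expand either abbreviation). The function names and the sample numbers are hypothetical and are not taken from the study.

```python
import math


def pearson_correlation(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    # Sum of cross-deviations and sums of squared deviations from the means.
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return cov / math.sqrt(var_obs * var_pred)


def mean_squared_error(observed, predicted):
    """Average of the squared differences between observations and predictions."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Made-up weekly values, for illustration only (not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 2.5, 4.5]
pcc = pearson_correlation(observed, predicted)
mse = mean_squared_error(observed, predicted)
```

A higher PCC and a lower MSE both indicate closer agreement, which is why the reported ranges run in opposite directions.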

Druckman J, Ognyanova K, Baum M, Lazer D, Perlis R, Volpe JD, Santillana M, Chwe H, Quintana A, Simonson M. The role of race, religion, and partisanship in misinformation about COVID-19. Group Processes & Intergroup Relations. 2021;24 (4) :638–657.
Perlis R, Santillana M, Ognyanova K, Green J, Druckman J, Lazer D, Baum M. Factors associated with self-reported symptoms of depression among adults with and without a previous COVID-19 diagnosis. JAMA Network Open. 2021;4 (6).
Castro LA, Generous N, Luo W, y Piontti AP, Martinez K, Gomes MFC, Osthus D, Fairchild G, Ziemann A, Vespignani A, et al. Using heterogeneous data to identify signatures of dengue outbreaks at fine spatio-temporal scales across Brazil. PLoS Neglected Tropical Diseases. 2021.
Dengue virus is spread through mosquitoes in many tropical and subtropical parts of
the world, including Brazil. Each year, dengue virus causes seasonal outbreaks that vary
in magnitude and timing across the country. This variation makes tailoring preparation
efforts for fine spatio-temporal resolutions challenging. In this study, we described four
properties of historical dengue time series at the mesoregion level, the Brazilian
subdivision below state, and examined how they varied across the country. We found
that the duration and timing of seasonal outbreaks are largely driven by climate factors,
while relational properties, i.e., the similarity in outbreak timing and magnitude
between two mesoregions, are explained by a mix of mobility patterns and climate
similarities. Surprisingly, we found that remote sensing derived products and movement
inferred through Twitter were adequate proxies for climate and mobility patterns
respectively. Knowledge of how dengue outbreaks differ across the country and the
factors that may influence specific outbreak properties may be important for improving
efforts to build forecasting and prediction models.
de Salazar P, Link N, Lamarca K, Santillana M. High coverage COVID-19 mRNA vaccination rapidly controls SARS-CoV-2 transmission in Long-Term Care Facilities. Nature Communications Medicine. 2021;1 (16).
Residents of Long-Term Care Facilities (LTCFs) represent a major share of COVID-19 deaths worldwide. Information on vaccine effectiveness in these settings is essential to improve mitigation strategies, but evidence remains limited. To evaluate the early effect of the administration of BNT162b2 mRNA vaccines in LTCFs, we monitored subsequent SARS-CoV-2 documented infections and deaths in Catalonia, a region of Spain, and compared them to counterfactual model predictions from February 6th to March 28th, 2021, the subsequent time period after which 70% of residents were fully vaccinated. We calculated the reduction in SARS-CoV-2 documented infections and deaths as well as the detected county-level transmission. We estimated that once more than 70% of the LTCFs population were fully vaccinated, 74% (58%-81%, 90% CI) of COVID-19 deaths and 75% (36%-86%) of all documented infections were prevented. Further, detectable transmission was reduced up to 90% (76-93% 90%CI). Our findings provide evidence that high-coverage vaccination is the most effective intervention to prevent SARS-CoV-2 transmission and death. Widespread vaccination could be a feasible avenue to control the COVID-19 pandemic.
Perlis RH, Ognyanova K, Santillana M, Baum MA, Lazer D, Druckman J, Volpe JD. Association of Acute Symptoms of COVID-19 and Symptoms of Depression in Adults. JAMA Network Open. 2021;4 (3) :e213223.
After acute infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a subset of individuals experience persistent symptoms involving mood, sleep, anxiety, and fatigue, which may contribute to markedly elevated rates of major depressive disorder observed in recent epidemiologic studies. In this study, we investigated whether acute coronavirus disease 2019 (COVID-19) symptoms are associated with the probability of subsequent depressive symptoms.
de Salazar PM, Lu F, Hay JA, Gomez-Barroso D, Fernandez-Navarro P, Martinez EV, Astray-Mochales J, Amillategui R, Garcia-Fulgueiras A, Chirlaque MD, et al. Near real-time surveillance of the SARS-CoV-2 epidemic with incomplete data. medRxiv. 2021.
Designing public health responses to outbreaks requires close monitoring of population-level health indicators in real-time. Thus, an accurate estimation of the epidemic curve is critical. We propose an approach to reconstruct epidemic curves in near real time. We apply this approach to characterize the early SARS-CoV-2 outbreak in two Spanish regions between the months of March and April 2020. We address two data collection problems that affected the reliability of the available real-time epidemiological data, namely, the frequent missing information documenting when a patient first experienced symptoms, and the frequent retrospective revision of historical information (including right censoring). This is done by using a novel back-calculating procedure based on imputing patients' dates of symptom onset from reported cases, according to a dynamically-estimated backward reporting delay conditional distribution, and adjusting for right censoring using an existing package, NobBS, to estimate in real time (nowcast) cases by date of symptom onset. This process allows us to obtain an approximation of the time-varying reproduction number (Rt) in real-time. At each step, we evaluate how different assumptions affect the recovered epidemiological events and compare the proposed approach to the alternative procedure of merely using curves of case counts, by report day, to characterize the time-evolution of the outbreak. Finally, we assess how these real-time estimates compare with subsequently documented epidemiological information that is considered more reliable and complete, and that became available weeks to months later. Our approach may help improve accuracy, quantify uncertainty, and evaluate frequently unstated assumptions when recovering the epidemic curves from limited data obtained from public health surveillance systems in other locations.
Mena G, Martinez PP, Mahmud AS, Marquet PA, Buckee CO, Santillana M. Socioeconomic status determines COVID-19 incidence and related mortality in Santiago, Chile. Science. 2021 :eabg5298.
The current COVID-19 pandemic has impacted cities particularly hard. Here, we provide an in-depth characterization of disease incidence and mortality, and their dependence on demographic and socioeconomic strata in Santiago, a highly segregated city and the capital of Chile. Our analyses show a strong association between socioeconomic status and both COVID-19 outcomes and public health capacity. People living in municipalities with low socioeconomic status did not reduce their mobility during lockdowns as much as those in more affluent municipalities. Testing volumes may have been insufficient early in the pandemic in those places, and both test positivity rates and testing delays were much higher. We find a strong association between socioeconomic status and mortality, measured either by COVID-19 attributed deaths or excess deaths. Finally, we show that infection fatality rates in young people are higher in low-income municipalities. Together, these results highlight the critical consequences of socioeconomic inequalities on health outcomes.
Kiang MV, Santillana M, Chen JT, Onnela J-P, Krieger N, Engø-Monsen K, Ekapirat N, Areechokchai D, Maude R, Buckee CO. Incorporating human mobility data improves forecasts of Dengue fever in Thailand. Scientific Reports. 2021;11 (923).
Over 390 million people worldwide are infected with dengue fever each year. In the absence of an effective vaccine for general use, national control programs must rely on hospital readiness and targeted vector control to prepare for epidemics, so accurate forecasting remains an important goal. Many dengue forecasting approaches have used environmental data linked to mosquito ecology to predict when epidemics will occur, but these have had mixed results. Conversely, human mobility, an important driver in the spatial spread of infection, is often ignored. Here we compare time-series forecasts of dengue fever in Thailand, integrating epidemiological data with mobility models generated from mobile phone data. We show that long-distance connectivity is correlated with dengue incidence at forecasting horizons of up to three months, and that incorporating mobility data improves traditional time-series forecasting approaches. Notably, no single model or class of model always outperformed others. We propose an adaptive, mosaic forecasting approach for early warning systems.
Kogan NE, Clemente L, Liautaud P, Kaashoek J, Link NB, Nguyen AT, Lu FS, Huybers P, Resch B, Havas C, et al. An Early Warning Approach to Monitor COVID-19 Activity with Multiple Digital Traces in Near Real-Time. Science Advances. 2021;7 (10).
Given still-high levels of coronavirus disease 2019 (COVID-19) susceptibility and inconsistent transmission-containing strategies, outbreaks have continued to emerge across the United States. Until effective vaccines are widely deployed, curbing COVID-19 will require carefully timed nonpharmaceutical interventions (NPIs). A COVID-19 early warning system is vital for this. Here, we evaluate digital data streams as early indicators of state-level COVID-19 activity from 1 March to 30 September 2020. We observe that increases in digital data stream activity anticipate increases in confirmed cases and deaths by 2 to 3 weeks. Confirmed cases and deaths also decrease 2 to 4 weeks after NPI implementation, as measured by anonymized, phone-derived human mobility data. We propose a means of harmonizing these data streams to identify future COVID-19 outbreaks. Our results suggest that combining disparate health and behavioral data may help identify disease activity changes weeks before observation using traditional epidemiological monitoring.
Poirier C, Hswen Y, Bouzille G, Cuggia M, Lavenu A, Brownstein JS, Brewer T, Santillana M. Influenza forecasting for the French regions by using EHR, web and climatic data sources with an ensemble approach. PLoS One. 2021.
Effective and timely disease surveillance systems have the potential to help public health officials design interventions to mitigate the effects of disease outbreaks. Currently, healthcare-based disease monitoring systems in France offer influenza activity information that lags real-time by 1 to 3 weeks. This temporal data gap introduces uncertainty that prevents public health officials from having a timely perspective on the population-level disease activity. Here, we present a machine-learning modeling approach that produces real-time estimates and short-term forecasts of influenza activity for the 12 continental regions of France by leveraging multiple disparate data sources that include, Google search activity, real-time and local weather information, flu-related Twitter micro-blogs, electronic health records data, and historical disease activity synchronicities across regions. Our results show that all data sources contribute to improving influenza surveillance and that machine-learning ensembles that combine all data sources lead to accurate and timely predictions.
Lu FS, Nguyen AT, Link N, Molina M, Davis JT, Chinazzi M, Xiong X, Vespignani A, Lipsitch M, Santillana M. Estimating the cumulative incidence of COVID-19 in the United States using influenza surveillance, virologic testing, and mortality data: four complementary approaches. PLoS Computational Biology. 2021.
Effectively designing and evaluating public health responses to the ongoing COVID-19 pandemic requires accurate estimation of the prevalence of COVID-19 across the United States (US). Equipment shortages and varying testing capabilities have however hindered the usefulness of the official reported positive COVID-19 case counts. We introduce four complementary approaches to estimate the cumulative incidence of symptomatic COVID-19 in each state in the US as well as Puerto Rico and the District of Columbia, using a combination of excess influenza-like illness reports, COVID-19 test statistics, COVID-19 mortality reports, and a spatially structured epidemic model. Instead of relying on the estimate from a single data source or method that may be biased, we provide multiple estimates, each relying on different assumptions and data sources. Across our four approaches emerges the consistent conclusion that on April 4, 2020, the estimated case count was 5 to 50 times higher than the official positive test counts across the different states. Nationally, our estimates of COVID-19 symptomatic cases as of April 4 have a likely range of 2.3 to 4.8 million, with possibly as many as 7.6 million cases, up to 25 times greater than the cumulative confirmed cases of about 311,000. Extending our methods to May 16, 2020, we estimate that cumulative symptomatic incidence ranges from 4.9 to 10.1 million, as opposed to 1.5 million positive test counts. The proposed combination of approaches may prove useful in assessing the burden of COVID-19 during resurgences in the US and other countries with comparable surveillance systems.
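
The national multipliers quoted in this abstract can be checked with simple division. The sketch below is only an arithmetic illustration using the rounded figures stated in the abstract; the function and variable names are hypothetical, and it does not reproduce any of the four estimation approaches.

```python
def underascertainment_ratio(estimated_cases, confirmed_cases):
    """How many times larger the estimated count is than the confirmed count."""
    return estimated_cases / confirmed_cases


# Rounded national figures as quoted in the abstract.
CONFIRMED_APRIL_4 = 311_000
ESTIMATES_APRIL_4 = {"likely_low": 2.3e6, "likely_high": 4.8e6, "upper": 7.6e6}

CONFIRMED_MAY_16 = 1.5e6
ESTIMATES_MAY_16 = {"low": 4.9e6, "high": 10.1e6}

april_ratios = {
    label: round(underascertainment_ratio(value, CONFIRMED_APRIL_4), 1)
    for label, value in ESTIMATES_APRIL_4.items()
}
may_ratios = {
    label: round(underascertainment_ratio(value, CONFIRMED_MAY_16), 1)
    for label, value in ESTIMATES_MAY_16.items()
}

print(april_ratios)
print(may_ratios)
```

With these rounded inputs the upper April estimate comes out at roughly 24 times the confirmed count, consistent with the "up to 25 times" statement.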
Aiken EL, Nguyen AT, Viboud C, Santillana M. Towards the Use of Neural Networks for Influenza Prediction at Multiple Spatial Resolutions. Science Advances. 2021;7 (25).
Mitigating the effects of disease outbreaks with timely and effective interventions requires accurate real-time surveillance and forecasting of disease activity, but traditional healthcare-based surveillance systems are limited by inherent reporting delays. Time-series machine learning methods have the potential to fill this temporal “data gap,” but work to date in this area has focused on relatively simple methods and coarse geographic granularities (state-level and above). We evaluate the performance of a recurrent neural network (gated recurrent unit, or GRU) in comparison to baseline machine learning methods for estimating influenza activity in the US on the state- and city-level, and experiment with the inclusion of real-time search data from Google Trends. We find that the GRU improves upon baseline models for long time horizons of prediction but is not improved by real-time Internet search data. We conduct a thorough analysis of feature importance in all considered models for interpretability purposes.
McGough S, Kutz NJ, Clemente LC, Santillana M. A dynamic, ensemble learning approach to forecast dengue fever epidemic years in Brazil using weather and population susceptibility cycles. Journal of the Royal Society Interface. 2021;18 (179).
Transmission of dengue fever depends on a complex interplay of human, climate, and mosquito dynamics, which often change in time and space. It is well known that disease dynamics are highly influenced by a population’s susceptibility to infection and microclimates, small-area climatic conditions which create environments favorable for the breeding and survival of the mosquito vector. Here, we present a novel machine learning dengue forecasting approach, which, dynamically in time and adaptively in space, identifies local patterns in weather and population susceptibility to make epidemic predictions at the city-level in Brazil, months ahead of the occurrence of disease outbreaks. Weather-based predictions are improved when information on population susceptibility is incorporated, indicating that immunity is an important predictor neglected by most dengue forecast models. Given the generalizability of our methodology, it may prove valuable for public-health decision making aimed at mitigating the effects of seasonal dengue outbreaks in locations globally.
Koplewitz G, Lu F, Clemente L, Buckee C, Santillana M. Predicting Dengue Incidence Leveraging Internet-Based Data Sources. A Case Study in 20 cities in Brazil. medRxiv. 2020;2020.10.21.20210948.
The dengue virus affects millions of people every year worldwide, causing large epidemic outbreaks that disrupt people’s lives and severely strain healthcare systems. In the absence of a reliable vaccine against it or an effective treatment to manage the illness in humans, most efforts to combat dengue infections have focused on preventing its vectors, mainly the Aedes aegypti mosquito, from flourishing across the world. These mosquito-control strategies need reliable disease activity surveillance systems to be deployed. Despite significant efforts to estimate dengue incidence using a variety of data sources and methods, little work has been done to understand the relative contribution of the different data sources to improved prediction. Additionally, most work has focused on prediction systems at the national level, rather than at finer spatial resolutions. We develop a methodological framework to assess and compare dengue incidence estimates at the city level and evaluate the performance of a collection of models on 20 different cities in Brazil. The data sources we use towards this end are weekly incidence counts from prior years (seasonal autoregressive terms), weekly-aggregated weather variables, and real-time internet search data. We find that a random forest-based model effectively leverages these multiple data sources and provides robust predictions, while retaining interpretability. For real-time predictions that assume long delays (6-8 weeks) in the availability of epidemiological data, we find that real-time internet search data are the strongest predictors of Dengue incidence, whereas for predictions that assume very short delays (1-2 weeks), short-term and seasonal autocorrelation are dominant as predictors. Despite the difficulties inherent to city-level prediction, our framework achieves meaningful and actionable estimates across cities with different characteristics.
Hanage WP, Testa C, Chen JT, David L, Pechter E, Seminario P, Santillana M, Krieger N. COVID-19: US Federal accountability for entry, spread, and inequities – lessons for the future. European Journal of Epidemiology. 2020.
The United States (US) has been among those nations most severely affected by the first—and
subsequent—phases of the pandemic of COVID-19 disease caused by SARS-CoV-2. With only
4% of the worldwide population, the US has seen about 22% of COVID-19 deaths. Despite
formidable advantages in resources and expertise, presently the per capita mortality rate is over
585/million, respectively 2.4 and 5 times higher compared to Canada and Germany. As we enter
Fall 2020, the US is enduring ongoing outbreaks across large regions of the country. Moreover,
within the US, an early and persistent feature of the pandemic has been the disproportionate
impact on populations already made vulnerable by racism and dangerous jobs, inadequate wages,
and unaffordable housing, and this is true for both the headline public health threat and the
additional disastrous economic impacts. In this article we assess the impact of missteps by the
Federal Government in three specific areas: the introduction of the virus to the US and the
establishment of community transmission; the lack of national COVID-19 workplace standards
and lack of personal protective equipment (PPE) for workplaces as represented by complaints to
the Occupational Safety and Health Administration (OSHA) which we find are correlated with
deaths 17 days later (=0.845); and the total excess deaths in 2020 to date, which already total
more than 230,000 and exhibit severe inequities in race/ethnicity including among younger age
Patel B, Sperotto F, Molina M, Kimura S, Delgado M, Santillana M, Kheir JN. Avoidable serum potassium testing in the cardiac intensive care unit: development and testing of a machine learning model. Pediatric Critical Care Medicine. 2020;22 (4).

Objective: To create a machine learning model identifying potentially avoidable blood draws for serum potassium among pediatric patients following cardiac surgery.

Design: Retrospective cohort study.
Setting: Tertiary-care center.
Patients: All patients admitted to the CICU at Boston Children’s Hospital between January 2010 and December 2018 with a length of stay ≥4 days and ≥2 recorded serum potassium measurements.
Interventions: None.
Measurements and Main Results: We collected variables related to potassium homeostasis, including serum chemistry, hourly potassium intake, diuretics, and urine output. Using established machine learning techniques, including Random Forest classifiers and hyperparameter tuning, we created models predicting whether a patient’s potassium would be normal or abnormal based on the most recent potassium level, medications administered, urine output, and markers of renal function. We developed multiple models based on different age categories and temporal proximity of the most recent potassium measurement. We assessed the predictive performance of the models using an independent test set. Of the 7,269 admissions (6,196 patients) included, 95,674 serum potassium measurements were recorded, an average of 1 (IQR 0-1) time per day. 96% of patients received at least one dose of IV diuretic and 83% received a form of potassium supplementation. Our models predicted a normal potassium value with a median positive predictive value of 0.900. A median of 2.1% of measurements (mean 2.5%, IQR 1.3%-3.7%) were incorrectly predicted as normal when they were abnormal. A median of 0.0% (IQR 0.0%-0.4%) of measurements were critically low or high values incorrectly predicted as normal. A median of 27.2% (IQR 7.8%-32.4%) of samples were correctly predicted to be normal and could potentially have been avoided.
Conclusions: Machine-learning methods can be used to accurately predict avoidable blood tests for serum potassium in critically ill pediatric patients. A median of 27.2% of samples could have been saved, with decreased costs and risk of infection or anemia.
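
To make the positive predictive value (PPV) reported above concrete, here is a minimal sketch. It assumes that the "positive" class is a prediction of a normal potassium value, as the abstract's wording suggests; the function name and the counts are hypothetical and are not data from the study.

```python
def positive_predictive_value(true_positives, false_positives):
    """Fraction of positive predictions that were correct: TP / (TP + FP)."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        raise ValueError("PPV is undefined when no positive predictions were made")
    return true_positives / predicted_positive


# Hypothetical counts: 1,000 measurements predicted as normal,
# of which 900 were actually normal and 100 were abnormal.
ppv = positive_predictive_value(true_positives=900, false_positives=100)
print(f"PPV = {ppv:.3f}")
```

In this reading, a PPV of 0.900 means that 9 out of 10 measurements the model labeled as normal were in fact normal.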

Emma-Pascale Chevalier-Cottin, Hayley Ashbaugh, Nicholas Brooke, Gaetan Gavazzi, Mauricio Santillana, Nansa Burlet, Myint Tin Tin Htar. Communicating Benefits from Vaccines Beyond Preventing Infectious Diseases. Infectious Diseases and Therapy. 2020;9 :467–480.
Despite immunisation being one of the greatest medical success stories of the 20th century and its benefits being widely recognized, there is a growing lack of confidence in some vaccines. Improving communication about the direct benefits of vaccination as well as its benefits beyond preventing infectious diseases may help regain this lost trust. A conference was organised at the Fondation Merieux in France to discuss what benefits could be communicated and how their communication could use innovative digital initiatives. During this meeting, a wide range of poorly known indirect benefits of vaccination were discussed, including benefits for chronic non-communicable diseases (NCDs). For example, persons with underlying chronic NCDs, such as diabetes and cardiovascular diseases, are particularly vulnerable to complications, hospitalisations, and even death from influenza, although the link between NCDs and influenza is frequently underestimated. Influenza vaccination can reduce hospitalizations and deaths in older persons with diabetes by 45% and 38% respectively. The frequency of antimicrobial resistance (AMR) is increasing worldwide. Vaccination can reduce AMR by reducing the incidence of infectious disease (through direct and indirect or herd protection), by reducing the number of circulating AMR strains, and by reducing the need for antimicrobial use. In addition, as the global population ages, disease morbidity and treatment costs in the elderly population are likely to rise substantially. The promotion of healthy ageing and adopting a life-course approach to health can reduce the burden of vaccine-preventable diseases such as seasonal influenza, pneumococcal diseases, meningitis, pertussis, shingles, measles, diphtheria and tetanus, which place a significant burden on individuals and the ageing society, and improve their quality of life.
Novel disease surveillance systems based on information from Internet search engines, mobile phone apps, social media, news reports, cloud-based electronic health records, and crowd-sourced systems contribute to an improved awareness of the burden of disease. Examples of the role of new techniques and tools to process data generated by multiple sources, such as artificial intelligence, advanced data analytics and biostatistics, to support vaccination programmes, such as those for influenza and dengue, were discussed. The conference participants agreed that continual efforts are needed from all stakeholders to ensure effective, transparent communication of the full benefits and risks of vaccines and vaccination, and that this will require continued dialogue and collaboration.
Buckee CO, Balsari S, Chan J, Crosas M, Dominici F, Gasser U, Grad YH, Grenfell B, Halloran ME, Kraemer MUG, et al. Aggregated mobility data could help fight COVID-19. Science. 2020;368 (6487) :145-146.
As the coronavirus disease 2019 (COVID-19) epidemic worsens, understanding the effectiveness of public messaging and large-scale social distancing interventions is critical. The research and public health response communities can and should use population mobility data collected by private companies, with appropriate legal, organizational, and computational safeguards in place. When aggregated, these data can help refine interventions by providing near real-time information about changes in patterns of human movement.
Dai M-Y, Liu D, Liu M, Zhou F-X, .., Mucci LA, Santillana M, Cai H-B. Patients with Cancer Appear More Vulnerable to SARS-CoV-2: A Multi-Center Study During the COVID-19 Outbreak. Cancer Discovery. 2020;DOI: 10.1158/2159-8290.CD-20-0422.

Background: The novel COVID-19 outbreak, caused by the SARS-CoV-2 virus and originally detected in December 2019 in Wuhan, China, has affected more than 140 countries and territories as of March 2020. Given that patients with cancer are generally more vulnerable to infections, systematic analyses of diverse cohorts of patients with cancer affected by COVID-19 are needed.

Methods: Clinical information from 105 hospitalized patients with cancer and 233 hospitalized patients without cancer, all infected by the SARS-CoV-2 virus, was collected from 14 hospitals in Hubei province, China, from January 1, 2020, to February 24, 2020. Standard statistical methodologies were used to compare four different outcomes (death, admission into an intensive care unit (ICU), development of severe/critical symptoms, and utilization of invasive mechanical ventilation) between patients with cancer (of different types, stages, and treatments of cancer) and patients without cancer.

Findings: Compared with COVID-19 patients without cancer, COVID-19 patients with cancer had higher risks in all four severe outcomes. Patients with blood cancers, lung cancers, or with metastatic cancer (stage IV) had the highest frequency of severe events. Non-metastatic cancer (stage I-III) patients experienced similar frequencies of severe conditions to those observed in patients without cancer. Patients who received immunotherapy and surgery had higher risks of having severe events, while patients with only radiotherapy and targeted therapy did not demonstrate significant differences in severe events when compared to patients without cancer.

Interpretations: Patients with blood cancer, lung cancer, and metastatic cancer demonstrated a higher incidence of severe events compared to patients without cancer. In addition, patients who underwent immunotherapy or cancer surgery had higher death rates and higher chances of having critical symptoms.