A dynamic neural network model for real-time prediction of the Zika epidemic in the Americas

Background In 2015 the Zika virus spread from Brazil throughout the Americas, posing an unprecedented challenge to the public health community. During the epidemic, international public health officials lacked reliable predictions of the outbreak’s expected geographic scale and prevalence of cases, and were therefore unable to plan and allocate surveillance resources in a timely and effective manner. Methods In this work we present a dynamic neural network model to predict the geographic spread of outbreaks in real-time. The modeling framework is flexible in three main dimensions i) selection of the chosen risk indicator, i.e., case counts or incidence rate, ii) risk classification scheme, which defines the relative size of the high risk group, and iii) prediction forecast window (one up to 12 weeks). The proposed model can be applied dynamically throughout the course of an outbreak to identify the regions expected to be at greatest risk in the future. Results The model is applied to the recent Zika epidemic in the Americas at a weekly temporal resolution and country spatial resolution, using epidemiological data, passenger air travel volumes, vector habitat suitability, socioeconomic and population data for all affected countries and territories in the Americas. The model performance is quantitatively evaluated based on the predictive accuracy of the model. We show that the model can accurately predict the geographic expansion of Zika in the Americas with the overall average accuracy remaining above 85% even for prediction windows of up to 12 weeks. Conclusions Sensitivity analysis illustrated the model performance to be robust across a range of features. Critically, the model performed consistently well at various stages throughout the course of the outbreak, indicating it’s potential value at the early stages of an epidemic. The predictive capability was superior for shorter forecast windows, and geographically isolated locations that are predominantly connected via air travel. The highly flexible nature of the proposed modeling framework enables policy makers to develop and plan vector control programs and case surveillance strategies which can be tailored to a range of objectives and resource constraints.


Introduction
The Zika virus, which is primarily transmitted through the bite of infected Aedes aegypti mosquitoes [1], was first discovered in Uganda in 1947 [2] from where it spread to Asia in 1960s, where it since has caused small outbreaks. In 2007 ZIKV caused an island wide outbreak in Yap Island, Micronesia [3], followed by outbreaks in French Polynesia [4] and other Pacific islands between 2013 ̶ 2014 where attack rates where up to 70% [5][6][7]. It reached Latin America between late 2013 and early 2014, but was not detected by public health authorities until May 2015 [8] and since affected 48 countries and territories in the Americas [9][10][11]. Since there is no vaccination or treatment available for Zika infections [12,13], the control of Ae. aegypti mosquito populations remains the most important intervention to contain the spread of the virus [14]. In order to optimally allocate resources to suppress vector populations, it is critical to accurately anticipate the occurrence and arrival time of arboviral infections to detect local transmission [15].
The aims of our work are to 1) present recurrent neural networks for time ahead predictive modelling as a highly flexible tool for outbreak prediction, and 2) implement and evaluate the model performance for the Zika epidemic in the Americas. The application of neural networks for epidemic risk forecasting has previously been applied to dengue forecasting and risk classification [45][46][47][48][49][50], detection of mosquito presence [51], temporal modeling of the oviposition of Aedes aegypti mosquito [52], Aedes larva identification [53], and epidemiologic time-series modeling through fusion of neural networks, fuzzy systems and genetic algorithms [54]. Recently, Jian et al [55] performed a comparison of different machine learning models to map the probability of Zika epidemic outbreak using publically available global Zika case data and other known covariates of transmission risk. Their study provides valuable insight into the potential role of machine learning models for understanding Zika transmission; however, it is static in nature, i.e., it does not account for time-series data, and did not account for human mobility, both of which are incorporated in our modelling framework.
Here, we apply a dynamic neural network model for N-week ahead prediction for the 2015-2016 Zika epidemic in the Americas. The model implemented in this work relies on multi-dimensional time-series data at the country (or territory) level, specifically epidemiological data, passenger air travel volumes, vector habitat suitability for the primary spreading vector Ae. aegypti, socioeconomic and population data. The modeling framework is flexible in three main dimensions: 1) the preferred risk indictor can be chosen by the policy maker, e.g., we consider outbreak size and incidence rate as two primary indicators of risk for a region, 2) five risk classification schemes are defined, where each classification scheme varies in the threshold used to determine the set of countries deemed "high risk", and 3) it can be applied for a range of forecast windows (1 -12 weeks). Model performance and robustness is evaluated for various combinations of risk indicator, risk classification level, and forecasting windows. Thus, our work represents the first flexible framework of neural networks for epidemic risk forecasting, that allows policy makers to evaluate and weigh the trade-off in prediction accuracy between forecast window and risk classification schemes. Given the availability of the necessary data, the modelling framework proposed here can be applied in real time to future outbreaks of Zika, and other similar vector-borne outbreaks.

Data
The model relies on socioeconomic, population, epidemiological, travel and mosquito vector suitability data. All data is aggregated to the country level, and provided for all countries and territories in the Americas. Each data set and corresponding processing is described in detail below, and summarized in Table 1. The data is available as a supplementary file.

Epidemiological Data
Weekly Zika infected cases for each country and territory in the Americas were extracted from the Pan American Health Organization (PAHO) [57], as described in previous studies [40,43] (data available: github.com/andersen-lab/Zika-cases-PAHO). Although Zika cases in Brazil were reported as early as May 2015, no case data is available for all of 2015 from PAHO because the Brazil Ministry of Health did not declare the Zika cases and associated neurological and congenital syndrome as notifiable conditions until 17 February of 2016 [57]. The missing numbers of cases from July to December 2015 for Brazil were estimated based on the positive correlation between Ae. aegypti abundance (described below) and reported case counts as has been done previously [42,43]. We used smoothing spline [56] to estimate weekly case counts from the monthly reported counts. The weekly country level case counts ( Figure 1A) were divided by the total population / 100,000, as previously described [43], to compute weekly incidence rates ( Figure 1C).

Travel Data
Calibrated monthly passenger travel volumes for each airport-to-airport route in the world were provided by the International Air Transport Associate (IATA) [67], as previously used in [43,58]. The data includes origin, destination and stopover airport paths for 84% of global air traffic, and includes over 240 airlines and 3,400 airports. The airport level travel was aggregated to a regional level, to compute monthly movements between all countries and territories in the Americas. The incoming and outgoing travel volumes for each country and territory, originally available from IATA at a monthly temporal resolution, were curve fitted, again using smoothing spline method [56] to obtain corresponding weekly volumes to match with the temporal resolution of our model. In this study, data and estimates from 2015 were also used for 2016, as was done previously [43,58,59].

Mosquito Suitability Data
The monthly vector suitability data sets were based on habitat suitability for the principal Zika virus species Ae. aegypti, previously used in [43], and initially estimated using original high resolution maps [60] and then enriched to account for seasonal variation in the geographical distribution of Ae. aegypti by using timevarying covariate such as temperature persistence, relative humidity, and precipitation as well as static covariates such as urban versus rural areas. The monthly data was translated into weekly data using a smoothing spline [56].

Socioeconomic and Human Population Data
For a country, to prevent or manage an outbreak depends on their ability to implement a successful surveillance and vector control programs [68]. Due to a lack of global data to quantify vector control at country level, we utilized alternative economic and health related country indicators which have previously been revealed to be critical risk factors for Zika spread [43]. A country's economic development can be measured by the gross domestic product (GDP) per capita at purchasing power parity (PPP), in international dollars. The figures from World Bank [61] and the U.S. Bureau of Economic Analysis [62] were used to collect GDP data for each country. The number of physicians and the number of hospital beds per 10,000 people were used to indicate the availability of health infrastructure in each country. These figures for U.S. and other regions in the Americas were obtained from the Centre of Disease Control and Prevention (CDC) [63], WHO World Health Statistics report [64], and the PAHO [65]. Finally, the human population densities (people per sq. km of land area) for each region were collected from World Bank [66] and the U.S. Bureau of Economic Analysis [62].

Connectivity-risk Variables
In addition to the raw input variables, novel connectivity-risk variables are defined and computed for inclusion in the model. These variables are intended to capture the risk posed by potentially infected travelers arriving at a given destination at a given point in time, and in doing so, explicitly capture the dynamic and heterogeneity of the air-traffic network in combination with real-time outbreak status. Two variables are chosen, hereafter referred to as case-weighted travel risk and incidence-weighted travel risk, as defined in equations (1.a) and (1.b), respectively.
For each region j at time t, and are computed as the sum of product between passenger volume traveling from origin i into destination j at time t ( , ) and the state of the outbreak at origin i at time t, namely reported cases, , or reported incidence rate, . Each of these two variables is computed for all 53 countries or territories for each of the 78 epidemiological weeks. The dynamic variables are illustrated in Figure  1, below the raw case counts and incidence rates.

Fig 1. Weekly distribution of case-related input variables. (A) Zika cases and (B)
incidence rates in the Americas along with connectivity-risk variables, (C) caseweighted travel risk , and (D) incidence weighted travel risk , for top 10 countries and territories in the Americas.

Neural Network Model
A class of neural architectures based upon Nonlinear AutoRegressive models with eXogenous inputs (NARX) known as NARX neural networks [69][70][71] is employed herein due to its suitability for modeling of a range of nonlinear systems and computational capabilities equivalent to Turning machines [72]. The NARX networks, as compared to other recurrent neural network architectures, require limited feedback (i.e., feedback from the output neuron rather than from hidden states) and converge much faster with a better generalization [72,73]. The NARX model can be formalized as follows [72]: where ( ) and ( ) denote, respectively, the input and output (or target that should be predicted) of the model at discrete time t, while and (with ≥ 1, ≥ 1, and ≤ ) are input and output delays called memory orders (Fig. 2). In this work, a NARX model is implemented to provide N-step ahead prediction of a time series, as defined below: Here, ( + ) is the risk classification predicted for the k th region N weeks ahead (of present time t), which is estimated as a function of ( ) inputs from all = 1, 2, … , regions for previous weeks, and the previous risk classification state, ( ) for region k for previous weeks. The prediction model is applied at time t, to predict for time t+N, and therefore relies on data available up until week t. That is, to predict outbreak risk for epidemiological week X, N-weeks ahead, the model is trained and tested using data available up until week (X -N). For example, 12-week ahead prediction for Epi week 40, is performed using data available up to week 28. The function (•) is an unknown nonlinear mapping function that is approximated by a Multilayer Perceptron (MLP) to form the NARX recurrent neural network [70,71].

Fig 2.
Schematic of NARX network with input and output delays: Each neuron produces a single output based on several real-valued inputs to that neuron by forming a linear combination using its input weights and sometimes passing the output through a nonlinear activation function: = (∑ + ) = ( + = ), where denotes the vector of weights, is the vector of inputs, is the bias and is a linear or nonlinear activation function (e.g., Linear, Sigmoid, and Hyperbolic tangent [75]).
In the context of this work, the desired output, ( + ), is a binary risk classifier, i.e., classifying a region k as high or low risk at time at time t+N, for each region, k, N weeks ahead (of t). The vector of input variables for region at time is ( ), and includes both static and dynamic variables. We consider various thresholds to define the "high risk" group, ranging uniformly between 10% and 50%, where the 10% scheme classifies the 10% of countries reporting the highest number of cases (or highest incidence rate) as high risk, and the other 90% as low risk, similar to [37]. Each thresholds corresponds to a risk classification scheme (i.e., R=10, R=20, etc). Critically, our prediction approach differs from [37], in that our model is trained to predict the risk level directly, rather than predict the number of cases, which are postprocessed into risk categories. The performance of the model is evaluated by Bias Weights Inputs comparing the estimated risk level (high or low) to the actual risk level for all locations at a specified time, The actual risk level is simply defined at each time period t during the outbreak by ranking the regions based on to the number of reported case counts (or incidence rates), and grouping them into high and low risk groups according to the specified threshold.
The static variables used in the model include GDP PPP, population density, number of physicians, and number of hospital beds for each region. The dynamic variables include mosquito vector suitability, outbreak status (both reported case counts and reported incidence rates), total incoming travel volume, total outgoing travel volume, and the two connectivity-risk variables defined as in Equations (1. A major contribution of this work is the flexible nature of the model, which allows policy makers to be more or less risk averse in their planning and decision making. Firstly, the risk indicator (used to rank the regions and identify the high risk group) can be chosen by the modeler; in this work we consider two regional risk indicators, i) the number of reported cases and ii) incidence rate. Second, we consider a range of risk classification schemes, which vary by the relative size of the "high risk" group, i.e., R=10, R=20, R=30, R=40, R=50. Third, the forecast window, N, is defined to range from N = 1, 2, 4, 8 and 12 weeks. Subsequently, any combination of risk indicator, risk classification scheme and forecasting window can be modelled.
In initial settings of the series-parallel NARX neural network, a variety numbers of hidden layer neurons and numbers of tapped delay lines (Eq. (2)) were explored for training and testing of the model. Sensitivity analysis revealed minimal difference in performance of the model under different settings. Therefore, for all experiments presented in this work, the numbers of neural network hidden layer neurons and tapped delay lines are kept constant as two and four, respectively.
To train and test the model, the actual risk classification for each region at each week during the epidemic, ( ), was used. For each model run, e.g., a specified risk indicator, risk classification scheme and forecasting window, the input and target vectors are randomly divided into three sets: 1. 70% for training, to tune model parameters minimizing the mean square error between the outputs and targets, 2. 15% for validation, to measure network generalization and to prevent overfitting, by halting training when generalization stops improving (i.e., mean square error of validation samples starts increasing), and 3. 15% for testing, to provide an independent measure of network performance during and after training.
The performance of the model is measured using two metrics: 1) prediction accuracy (ACC) and 2) receiver operating characteristic (ROC) curves. Prediction accuracy is defined as ACC = (TP + TN) / (TP + FP + TN + FN), where true positive (TP) is the number of high risk locations correctly predicted as high risk, false negative (FN) is the number of high risk locations incorrectly predicted as low risk, true negative (TN) is the number of low risk locations correctly predicted as low risk, and false positive (FP) is the number of low risk locations incorrectly predicted as high risk. The second performance metric, ROC curve, explores the effects on TP and FP as the position of an arbitrary decision threshold is varied, which in the context of this prediction problem distinguished low and high risk locations. ROC curves were originally developed in 1950s as a technique for visualizing, organizing and selecting classifiers based on their performance [76]. The ROC curve can be characterized as a single number using the area under the ROC curve (AUC), with larger areas having an AUC that approaches 1 indicating a more accurate detection method. In addition to quantifying model performance using these two metrics, we evaluate the robustness of the predictions by comparing the ACC across multiple runs that vary in their selection of testing and training sets (resulting from the randomized sampling). Due to computation time, the robustness is only evaluated for the 4-week forecast window.

Results and Discussion
The model outcome reveals the set of locations expected to be at highest risk at a specified date in the future, i.e., N weeks ahead of when the prediction is made. We apply the model for all epidemiological weeks throughout the epidemic, and evaluate performance under each combination of i) risk indicator, ii) classification scheme, and iii) forecast window. For each model run, both ACC and ROC AUC are computed.
Results are presented in this section as follows: 1. Country-level Outbreak Prediction. a. Performance sensitivity to classification scheme is presented at the country level, for a fixed forecast window (N=4). b. Performance sensitivity to forecast window is presented at the country level, for a fixed classification scheme (R=20).  The results reveal that the performance of model, expectedly, deteriorates as the forecast window increases; however, the average accuracy remains above 80% for prediction up to 8-weeks ahead, and well about 90% for up to 4-weeks ahead. The prediction accuracy for the Caribbean slightly lags the average performance in the Americas. Specifically, for R=20, 5 of the 11 Caribbean regions were designated as HIGH risk locations at Epi week 40, i.e., Dominican Republic, Guadeloupe, Jamaica, Martinique, Puerto Rico. For a one-week prediction window, N=1, the model was able to correctly predict 3 of the high risk regions (i.e., Jamaica, Martinique, Puerto Rico), for N=2 it correctly identified two (i.e., Martinique, Puerto Rico), and for N=4, it again correctly identified three (i.e., Guadeloupe, Martinique, Puerto Rico). However, the model did not correctly predict any high risk locations in the Caribbean at N=8 and N=12 window lengths. This error is likely due to the low and sporadic reporting of Zika cases in the region around week 30. Similar prediction capability is illustrated for R=50 (not shown in the figure), in which case out of the 13 Caribbean HIGH risk locations, the model correctly identifies all locations at N=1, 2 and 4, 10 of the 13 locations at N=8, and only 1 of the 13 at N=12.   , the red indicates a correctly predicted high-risk country and green indicates a correctly predicted low risk country. Light grey indicates an incorrectly predicted high risk country, and dark grey indicates an incorrectly predicted low risk country. The risk indicator used is case counts.

Model performance
The remainder of this section demonstrates the model's performance sensitivity to the range of flexible input parameters available. Fig 5 and 6 illustrate the model performance as a function of classification scheme and forecast window, aggregated over space and time. Specifically, Fig 5 shows the model performance based on ACC, averaged over all locations and all EPI weeks for each combination of risk classification scheme (i.e., R = 10, 20, 30, 40 and 50) and forecast window (i.e., N = 1, 2, 4, 8 and 12) (Fig. 5). In general, the performance of the model decreases as the prediction window increases, and as the size of the high risk group increases. When the objective is to identify the top 10% of at-risk regions, the average accuracy of the model remains above 87% for prediction up to 12-weeks in advance. Further, the model is almost 80% accurate for 4-week ahead prediction for all classification schemes, and almost 90% accurate for all 2-week ahead prediction scenarios, i.e., the correct risk category of 9 out of 10 locations can always be predicted. These results reveal the trade-off between desired forecast window and precision of the high risk group. The quantifiable trade-off between the two model inputs (classification scheme, R, and forecast window, N) can be useful for policies which may vary in desired planning objectives.  The aggregated ROC curves (averaged over all locations and all epidemiological weeks) are presented in Fig. 6, and reveal the (expected) increased accuracy of the model as the forecast window is reduced. The ROC AUC results are consistent with ACC results presented in Fig. 5, highlighting the superior performance of the 1 and 2 week ahead prediction capability of the model. The ROC AUC value remains above 0.91 for N=1,2 and above 0.83 for N=4, both indicating high predictive accuracy of the model.   We further explore the model performance at a regional level by dividing the countries and territories in the Americas into three groups, namely Caribbean, South America and Central America, as in [10]. For each group the average performance of the model in terms of ACC was evaluated and compared. The results in Fig 8 reveals a similar trend at the regional level as was seen at the global level, with a decrease in predictive accuracy as the forecast window increases in length, and the and high risk group increases in size. The results reveal the predictive accuracy is best for the Caribbean region, while predictions for Central America were consistently the worst; the discrepancy in performance between these groups increases as the forecast window increases. The difference in performance across regions can be attributed to the high spatial heterogeneity of the outbreak patterns, the relative ability of air travel to accurately capture connectivity between locations, and errors in case reporting that may vary by region. For example, the Caribbean, which consists of more than twice as many locations as any other group, first reported cases around week 25, and remained affected throughout the epidemic. In contrast, Central America experienced a slow start to the outbreak (at least according to case reports) with two exceptions, namely Honduras and El Salvador. The large number of affected region in the Caribbean, with more reported cases distributed over a longer time period contributed to the training of the model, thus improving the predictive capability for these regions. Additionally, the geographically isolated nature of Caribbean islands enables air travel to more accurately capture incoming travel risk, unlike countries in Central and South America, where individuals can also move about using alternative modes, which are not accounted for in this study. These factors combined explain the higher predictive accuracy of the model for the Caribbean region, and importantly, helps to identify the critical features and types of settings under which this model is expected to perform best.

Conclusions
We have introduced a flexible, predictive modelling framework to forecast outbreak risk in real-time. An application of the model was applied to the Zika epidemic in the Americas at a weekly temporal resolution, and country-level spatial resolution, using population, socioeconomic, epidemiological, travel patterns and vector suitability data. The model performance was evaluated for various risk classification schemes, forecast windows and risk indicators, and illustrated to be accurate and robust across a broad range of these features. First, the model is more accurate for shorter prediction windows and restrictive risk classification schemes. Secondly, regional analysis reveals superior predictive accuracy for the Caribbean, suggesting the model to be best suited to geographically isolated locations that are predominantly connected via air travel. Predicting the spread to areas that are relatively isolated has previously been shown to be difficult due to the stochastic nature of infectious disease spread [77]. Thirdly, the model performed consistently well at various stages throughout the course of the outbreak, indicating it's potential value at the early stages of an epidemic. The outcomes from the model can be used to better guide outbreak resource allocation decisions, and can be easily adapted to model other vector-borne epidemics.
There are several limitations of this work. The underlying data on case reporting vary by country and may not represent the true transmission patterns [78]. However, the framework presented was flexible enough to account for these biases and we anticipate will only be improved as data become more robust. Additionally, 2015 travel data was used in place of 2016 data, as has been done previously [43,58,59], which may not be fully representative of travel behaviour. Lastly, due to the lack of spatial resolution of case reports, we were limited to make country to country spread estimates. We do however appreciate that there is considerable spatial variation within countries (i.e., northern vs. southern Brazil) and that this may influence the weekly covariates used in this study. We again hypothesise that models will become better as spatial resolution increases. Data Accessibility. All data used in this study is provided as supplementary material.
Competing Interests. We have no competing interests.

Authors' Contributions.
LG and MA conceived the study, designed the experiments, analyzed the model results, and drafted the original manuscript. MA developed the model and performed the computational analysis. All authors contributed to data curation and editing of the manuscript.
LG supervised the study.