Study data
Malaria seasonality in Madagascar was modelled using monthly case reports submitted by health facilities to the centralised Ministry of Health between 2013 and 2016. This dataset was provided by the National Malaria Control Programme (NMCP) in Madagascar and gave the number of patients who tested positive for malaria by rapid diagnostic tests (RDTs), irrespective of species. After cleaning the data to ensure consistent nomenclature, 2801 health facilities were successfully geo-located and verified using a database of 3274 geo-located health facilities from the Institut Pasteur de Madagascar (IPM). The IPM database itself was further validated against a database of 120 global positioning system (GPS) geo-located health facilities from the President’s Malaria Initiative (PMI).
To account for year-on-year trends and help avoid unwanted bias in the monthly health facility data from, for example, stock-outs of RDTs, median monthly case counts were derived for each site. With a focus on location-specific seasonal trends, monthly proportions were obtained by dividing the monthly case medians by their annual sum. This is illustrated for an example Malagasy health centre in Fig. 1. After excluding sites with incomplete intra-annual patterns, data from 2669 health facilities were used in the later analysis.
By focusing on proportions instead of the case counts, we bypass the need to estimate catchment populations for the health facilities. Deriving incidence measures using catchment population data is commonly used to adjust for magnitude differences in count data, but estimating these catchment population sizes could have introduced bias to the analysis as only an incomplete set of health facility geolocation data was available to inform this analysis. Additional features of this modelling approach include the following: (a) utilising a standardised definition of a transmission period as the months with the proportion of cases exceeding 1/12 of the annual total, and (b) restricting the analysis to dynamic environmental covariates like rainfall and temperature that are likely to impact seasonal malaria transmission patterns.
A suite of environmental variables were assembled as 5 km spatial grids. These were temporally aligned with the health facility data and included the Climate Hazards Group Infrared Precipitation and Station data (CHIRPS), enhanced vegetation index (EVI), daytime land surface temperature (LST_day), diurnal difference in land surface temperature (LST_delta), night-time land surface temperature (LST_night), tasselled cap brightness (TCB), tasselled cap wetness (TCW), and the temperature suitability indices for Plasmodium falciparum and Plasmodium vivax (TSI_Pf and TSI_Pv). Details on the sources of these data are available elsewhere [19, 20]. To relate the observed seasonal patterns to the potential driving factors, monthly medians of these environmental data were derived and standardised. One to 3 month lags were included for each covariate to allow for delayed and accumulated responses to these environmental variables [19].
Spatiotemporal monthly proportion model
Proportions or probabilities are often modelled using multinomial or compositional regressions. However, in this scenario, this would mean computing ratios with respect to a fixed reference month. By using monthly case proportions, it is easier to relate each month’s proportion to the values of its covariates explicitly. To estimate monthly case proportions over the study region, we use the following spatiotemporal model which can be viewed as the linear predictor of a multinomial regression written in a log-linear form:
$$ \log(p_{i,j}) = \boldsymbol{X}_{ij}\boldsymbol{\beta} + \phi_{ij} + \epsilon_{ij}. $$
(1)
Here, pi,j is the average proportion of cases at location j in month i. To avoid applying logarithms on zeros, we added an offset of 0.00001 to pi,j and rescaled the raw monthly proportions at each location to sum to one before modelling. Similarly, rescaling was conducted after modelling with a location-specific normalising constant. This allows locations to be more or less sensitive to the variation in the underlying covariates.
In Eq. (1), \(\boldsymbol {X}_{ij}\in \mathbb {R}^{n\times m}\) is a covariate design matrix including an intercept, \(\boldsymbol {\beta }\in \mathbb {R}^{m}\) is the corresponding parameter vector, and \(\epsilon \sim N(0, \sigma _{e}^{2})\) denotes the independent, identically distributed noise. The spatiotemporal Gaussian field ϕ is constructed such that:
$$ \phi_{i,j} = \left\{\begin{array}{l} \xi_{1j} \text{ for}\, i = 1, \\ a \phi_{i-1,j} + \xi_{i, j} \text{ for} \, i=2, \dots, 12, \end{array}\right. $$
(2)
|a|<1 and ξi,j correspond to zero-mean Gaussian innovations which are temporally independent but spatially coloured with a Matérn covariance:
$$ \text{Cov}(h) = \frac{\sigma^{2}_{f}}{\Gamma(\nu)2^{\nu-1}}(\kappa h)^{\nu} K_{\nu}(\kappa h), $$
(3)
where h is the distance between two locations and κ>0 is a scaling parameter. In practice, it is difficult to identify the order of the modified Bessel function of second kind, denoted by Kν in Eq. (3) [21]. Thus, this was set to 1 as per the convention in the R-INLA package, which was used for model fitting and selection [22–24].
Before fitting the model, the log proportions were examined and outliers were excluded to model prototypical seasonal behaviour. Based on the histogram of log(pi,j) values in Additional file 1: Figure S1, log(pi,j)≤− 11 were deemed as outliers. Since the proportions were previously rescaled to sum up to one at each location, this does not mean that all data zeros were excluded from the analysis. Instead, zeros were only removed if much higher case proportions were observed in other months.
A randomly selected 30% of sites were excluded from the model fitting to validate our results. Working with data from the remaining 70% of the locations, the set of covariates was reduced to facilitate model selection and account for multicollinearity by iteratively computing the variance inflation factors (VIFs) and removing the covariates with the highest VIF value until all the remaining covariates have VIF values less than 10. Since these covariates have low correlations with each other, their estimated regression coefficients are more robust.
To speed up the covariate selection via backwards regression, the 70% training set was randomly split into two smaller sets of equal size and spatial coverage to search for the best model in terms of deviance information criterion (DIC). A map of the test and training locations is shown in Additional file 1: Figure S2. The two resulting candidates from the separate backwards regressions on the two smaller training sets were then refitted to the whole training set to select the final model with the lower DIC. After checking for reasonable results on the test data, the model was refitted to the entire dataset before predictions were made over the gridded surface. An analysis of the robustness of the methodology to the spatial and temporal extents as well as the quality of the data is provided in the Additional file 1.
Seasonality index and monthly case incidence
Seasonality statistics were derived from each posterior sample of the location-specific monthly proportions. To quantify ‘how seasonal’ a location is, a seasonality index was defined [25]. For location j, this index is given by the product of an entropy measure and the normalised amplitude:
$$\begin{array}{*{20}l} S_{j} &= D_{j} \times \frac{R_{j}}{R_{\text{max}}}, \end{array} $$
(4)
$$\begin{array}{*{20}l} \text{where}\, D_{j} &= \sum_{i = 1}^{12} p_{i,j} \log_{2}\left(\frac{p_{i,j}}{q}\right). \end{array} $$
(5)
Since pi,j is the case proportion for month i and q=1/12, Dj corresponds to the Kullback–Leibler divergence between the estimated intra-annual distribution and a uniform distribution. Thus, it quantifies how different the monthly proportions are from a uniform distribution over the year. In the context of malaria, Rj can be represented by the API at location j and Rmax is the maximum API over the region.
One benefit of using Sj is that it separates the timing and amplitude aspects of seasonality. Since high-resolution malaria burden estimates already exist [26–28], the model did not need to estimate the number of cases and could focus exclusively on estimating the monthly proportions at each location. Estimates of the monthly parasite incidence (MPI) for each location could also be obtained by multiplying the estimated monthly proportions with the mean Pf API estimates for 2016. This synthesises the estimated seasonal pattern obtained from the health facility data with the magnitude-level information provided by household prevalence surveys since these were used in the API model.
Deriving seasonality features
Locations were considered as potentially seasonal if their entropy Dj>0. When this criterion was satisfied, a rescaled von Mises (RvM) density was fitted via least squares to the estimated monthly proportions. This is illustrated for an example Malagasy health facility in Additional file 1: Figure S3.
By treating the month in a year as a random variable on a circle, i.e. defining \(\theta = \frac {2\pi i}{12}\) where i=1,…,12, the two-component RvM density function can be written as follows:
$$ \begin{aligned} f(\theta; s, \omega, \mu_{1}, \kappa_{1}, \mu_{2}, \kappa_{2}) &= s\left[\omega f_{1}(\theta; \mu_{1}, \kappa_{1})\right. \\&\quad\left.+ (1-\omega) f_{2}(\theta; \mu_{2}, \kappa_{2})\right] \end{aligned} $$
(6)
$$ \begin{aligned} \text{where}\, f_{k}(\theta; \mu_{k}, \kappa_{k}) &= \frac{1}{2\pi I_{0}(\kappa_{k})} \exp\{\kappa_{k}\cos(\theta - \mu_{k})\} \end{aligned} $$
(7)
is a one-component vM density for k=1,2 with mean and concentration parameters μk and κk. Here, I0 is the modified Bessel function and ω is a probability weight. The scale parameter s>0 modulates between the continuous density function and monthly proportions over discrete months.
Instead of identifying characteristics based on the monthly proportion estimates directly, seasonal features were based on fitted vM curves, which further smoothed out the estimates. The benefit of using a circular distribution was the continuity of the curve between the months of December and January. Using vM densities, in particular, is convenient for identifying the peaks of the distribution since these correspond to the mean parameters [29]. By comparing the values of the fitted curve, the major and the minor peaks of a bimodal distribution can be identified. Although an arbitrary number of von Mises components can be used, one or two were used because areas with seasonal malaria transmission typically have one or two main seasons [2].
To reduce computational burden, a bimodal distribution was only considered if the error from the fit of a unimodal distribution exceeded a set threshold \(\tilde {\epsilon }>0\). For Madagascar, \(\tilde {\epsilon }\) was empirically chosen to be 0.0015. Based on the fitted vM curve, the transmission periods were identified by marking the months where the curve was at or above \(\frac {1}{12}\). In this way, the start, end, and length of each season could also be estimated. Algorithm 1 summarises the procedure used to obtain the seasonality statistics from the monthly proportion realisation curves.
To quantify the uncertainty associated with the derived statistics, the results from 100 posterior samples of the monthly proportions were summarised. A location was deemed as unimodal or bimodal if more than half of the samples supported that interpretation and the degree of certainty was the proportion of such samples. Based on this majority decision and the corresponding samples, the uncertainty was also assessed in the estimated seasonal characteristics. Circular medians and deviations were used for the start, end, and peak of each transmission season [30]. To interpret the deviations in terms of months, the circular deviations was multiplied by \(\frac {12}{2\pi }\).