Performance of InterVA for Assigning Causes of Death to Verbal Autopsies: Multi-Site Validation Study using Clinical Diagnostic Gold Standards

Background: Recently, a new algorithm for automatic computer certification of verbal autopsy data named InSilicoVA was published. The authors presented their algorithm as a statistical method and assessed its performance using a single set of model predictors and one age group.
Methods: We perform a standard procedure for analyzing the predictive accuracy of verbal autopsy classification methods using the same data and the publicly available implementation of the algorithm released by the authors. We extend the original analysis to include children and neonates, instead of only adults, and test accuracy using different sets of predictors, including the set used in the original paper and a set that matches the released software.
Results: The population-level performance (i.e., predictive accuracy) of the algorithm varied from 2.1 to 37.6% when trained on data preprocessed similarly to the original study. When trained on data that matched the software default format, the performance ranged from −11.5 to 17.5%. When using the default training data provided, the performance ranged from −59.4 to −38.5%. Overall, the InSilicoVA predictive accuracy was found to be 8.2 to 11.6 percentage points lower than that of an alternative algorithm. Additionally, the sensitivity for InSilicoVA was consistently lower than that for an alternative diagnostic algorithm (Tariff 2.0), although the specificity was comparable.
Conclusions: The default format and training data provided by the software lead to results that are at best suboptimal, with poor cause-of-death predictive performance. This method is likely to generate erroneous cause-of-death predictions and, even if properly configured, is not as accurate as alternative automated diagnostic methods.
Electronic supplementary material: The online version of this article (10.1186/s12916-018-1039-1) contains supplementary material, which is available to authorized users.


Background
Reliable population-level cause-of-death estimates are critically important for designing effective public health policies [1]. Verbal autopsy (VA) is a key component of enhancing health information systems in many countries that do not have reliable civil registration and vital statistics (CRVS) systems [2,3]. VA consists of a structured interview with family members of the deceased with the purpose of gathering enough information to infer the likely cause of death [4]. In some countries where up to 80-90% of deaths occur without medical attendance, VA provides the only usable information for generating population-level cause-of-death estimates with reasonable and representative coverage [5]. Computer algorithms that can reliably assign a cause of death greatly increase the feasibility of integrating VA routinely into CRVS systems. Computer certification of verbal autopsy (CCVA) allows systems to be scalable, consistent, and sustainable [6].
Numerous algorithms for predicting the cause of death from VAs have been developed over the last decade [7][8][9][10][11]. We previously developed a framework for validating the predictive accuracy of different diagnostic methods that allows for direct comparison of methods using the same standard set of criteria [12]. It provides a way of determining how well an algorithm will perform in different populations when the true distribution of causes of death is not known. This is crucial for generalizing results to new study populations and accurately capturing unknown changes in cause-of-death composition in the same population across time. We have used this procedure to determine the accuracy of a wide range of previously developed methods [13].
Recently, a new algorithm for CCVA called InSilicoVA was developed and published [14]. This method builds on previous research on the InterVA algorithm, and advances the approach by introducing an algorithm that quantifies uncertainty in the individual-level predictions and uses this information to better predict the cause distribution at the population level. This aligns well with the current global interest in using VA to estimate the distribution of causes of death for populations through routine application in vital registration systems. The authors use a range of metrics to determine the performance of their algorithm, including applying our assessment framework. However, the authors only validated the results for adult deaths and not child or neonatal deaths. Moreover, given the potential of such methods for transforming knowledge about cause-of-death patterns in populations for which little is currently known about the leading causes of death, we believe that an independent validation of their results is warranted before the method can be recommended for routine application.
In this study, we assess the diagnostic accuracy of the InSilicoVA algorithm for all ages using the same validation environment as used in the original InSilicoVA paper, namely the Population Health Metrics Research Consortium (PHMRC) gold standard database. We applied the validation procedure developed by Murray et al. [12] and assessed performance at the individual level, using chance-corrected concordance (CCC), and at the population level, using chance-corrected cause-specific mortality fraction (CCCSMF) accuracy.

Algorithm
InSilicoVA [14] is a Bayesian framework that improves upon InterVA [10] by using information about symptoms that are and are not endorsed to estimate probabilities for each cause of death in a way that is comparable across observations, and by estimating the individual-level and population-level predictions simultaneously. The model is estimated using Markov chain Monte Carlo (MCMC) simulations. To produce usable results, the algorithm must run a sufficient number of samples to ensure convergence. The authors have released their algorithm as an R package with a computationally intensive MCMC calculation implemented in Java through the rJava package. The algorithm utilizes a matrix of conditional probabilities between each cause and each symptom. These propensities, which the authors call the probbase, capture the user's initial estimate of the relative likelihood of a symptom being endorsed for a given cause of death. These estimates can be derived from data or from expert judgement. The R package allows users to supply their own probbase file and also provides a default probbase based on the InterVA project. Open-source code (licensed under the GNU General Public License version 2) for the R implementation of InSilicoVA is freely available online.

Data
We used the publicly available PHMRC gold standard database [15,16] to validate the InSilicoVA algorithm. This dataset contains VAs matched to cause-of-death diagnoses from medical records, with variable confidence. Cases in the dataset were initially identified from deaths in hospitals where strict, predetermined diagnostic criteria were satisfied. This ensured that the true cause of death was known with greater certainty than is often the case for deaths recorded in vital registration systems, where diagnostic misclassification is typically estimated to range between 30 and 60% [17,18]. After identifying cases, blinded VAs were collected using a modified version of the World Health Organization (WHO) VA instrument. This resulted in a validation database of 12,530 records for which the true cause of death was known with reasonable certainty, and for which a full VA interview had been conducted.
VAs were collected from six sites in four different countries: Andhra Pradesh, India; Bohol, Philippines; Dar es Salaam, Tanzania; Mexico City, Mexico; Pemba Island, Tanzania; and Uttar Pradesh, India between 2007 and 2010. The database includes deaths for 7841 adults, 2064 children, 1620 neonates, and 1005 stillbirths. Following practice from previous research, we used the most aggregated cause list with 34 adult causes, 21 child causes, and 6 neonate causes (including stillbirth) to assess the accuracy of cause-of-death predictions. These cause-of-death lists are shown in Additional file 1.

Validation framework
In this study, we follow the recommendations of Murray et al. for validating VA diagnostic methods [12]. For this procedure, the validation dataset is randomly divided into a train fold containing 75% of the observations and a test fold containing the remaining 25% of observations. This is repeated 500 times, resulting in 500 test-train sets, each with a different subset of the original observations. For each test-train set, any given record appears in either the train set or the test set, but not both. The test set is then resampled so that its cause composition follows a draw from an uninformative Dirichlet distribution. This ensures that the cause compositions of the train and test sets are uncorrelated, which provides a more robust measure of performance (for example, it prevents a naive prediction algorithm from guessing an accurate population-level distribution without utilizing information at the individual level). Additionally, because the cause composition varies substantially across the 500 test-train splits, it ensures that the algorithm is tested on datasets with a wide variety of cause distributions and that performance estimates are not skewed by overfitting to the most common cause in the training data. To assess performance at the individual level, we use the median CCC across causes [12]. To assess performance at the population level, we use CCCSMF accuracy [19]. CCC for a single cause is calculated as:

$$\mathrm{CCC}_j = \frac{\dfrac{TP_j}{TP_j + FN_j} - \dfrac{1}{N}}{1 - \dfrac{1}{N}}$$

where $TP_j$ is the number of true positives for cause $j$, $FN_j$ is the number of false negatives, and $N$ is the number of causes. Values range between −1.0 and 1.0, where 1.0 indicates perfect ability to detect (i.e., diagnose) a cause, 0.0 indicates random guessing, and −1.0 indicates no ability to detect a cause. To create an overall metric of individual-level prediction accuracy, we use the mean of the cause-specific CCCs.
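The cause-specific CCC can be illustrated with a minimal sketch. It assumes the standard definition in which sensitivity for cause j is the number of true positives divided by all deaths truly due to cause j (true positives plus false negatives). The function name and the toy cause labels are hypothetical and are not taken from any VA software.

```python
from collections import Counter


def chance_corrected_concordance(true_causes, predicted_causes, causes):
    """Return {cause: CCC_j} with CCC_j = (sensitivity_j - 1/N) / (1 - 1/N).

    Hypothetical helper for illustration only.
    """
    n_causes = len(causes)
    true_positives = Counter()  # TP_j
    true_totals = Counter()     # TP_j + FN_j: deaths truly due to cause j
    for truth, pred in zip(true_causes, predicted_causes):
        true_totals[truth] += 1
        if truth == pred:
            true_positives[truth] += 1

    ccc = {}
    for cause in causes:
        if true_totals[cause] == 0:
            continue  # sensitivity is undefined without any true cases
        sensitivity = true_positives[cause] / true_totals[cause]
        ccc[cause] = (sensitivity - 1 / n_causes) / (1 - 1 / n_causes)
    return ccc


# Toy example with three made-up causes.
truth = ["A", "A", "B", "B", "C", "C"]
pred = ["A", "B", "B", "B", "A", "C"]
ccc = chance_corrected_concordance(truth, pred, ["A", "B", "C"])
mean_ccc = sum(ccc.values()) / len(ccc)
```

In this toy case, causes A and C are detected half the time and cause B every time, so the cause-specific values can be summarized across causes by either their mean or their median.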
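Cause-specific mortality fraction (CSMF) accuracy is calculated as:

$$\mathrm{CSMF\ accuracy} = 1 - \frac{\sum_{j=1}^{N} \left| CSMF_j^{true} - CSMF_j^{pred} \right|}{2\left(1 - \min_j CSMF_j^{true}\right)}$$

where $CSMF_j^{true}$ is the true fraction for cause $j$ and $CSMF_j^{pred}$ is the predicted fraction for cause $j$. This statistic can be corrected for chance (see Flaxman et al. [19]); we calculate the CCCSMF accuracy as:

$$\mathrm{CCCSMF\ accuracy} = \frac{\mathrm{CSMF\ accuracy} - (1 - e^{-1})}{1 - (1 - e^{-1})}$$

Similarly to CCC, perfect CCCSMF accuracy is attained at value 1.0, and values near 0.0 indicate that the diagnostic procedure being applied is essentially equivalent to random guessing.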
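As a worked illustration of the population-level metrics, the sketch below computes CSMF accuracy and its chance-corrected version. It assumes the commonly used chance level of 1 − 1/e (about 0.632) for CSMF accuracy; the function names and the three-cause example are hypothetical.

```python
import math


def csmf_accuracy(true_fractions, predicted_fractions):
    """CSMF accuracy: 1 - sum|true - pred| / (2 * (1 - min true fraction))."""
    causes = set(true_fractions) | set(predicted_fractions)
    abs_error = sum(
        abs(true_fractions.get(c, 0.0) - predicted_fractions.get(c, 0.0))
        for c in causes
    )
    # Largest possible total absolute error given the true distribution.
    worst_case = 2 * (1 - min(true_fractions.get(c, 0.0) for c in causes))
    return 1 - abs_error / worst_case


def cccsmf_accuracy(accuracy):
    """Chance-corrected CSMF accuracy, assuming a chance level of 1 - 1/e."""
    chance = 1 - math.exp(-1)
    return (accuracy - chance) / (1 - chance)


# Toy example with three made-up causes.
true_csmf = {"A": 0.5, "B": 0.3, "C": 0.2}
pred_csmf = {"A": 0.4, "B": 0.4, "C": 0.2}
acc = csmf_accuracy(true_csmf, pred_csmf)
corrected = cccsmf_accuracy(acc)
```

Here the total absolute error is 0.2 against a worst case of 1.6, giving a CSMF accuracy of 0.875. A method whose uncorrected accuracy falls below the chance level yields a negative CCCSMF accuracy.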

InSilicoVA validation
The InSilicoVA R package allows for a range of customizations to the inputs used to predict the cause of death. We validate the algorithm using three different configurations of inputs to assess its usability and performance. These configurations are obtained as follows: (1) using the built-in default training data, (2) training the algorithm with inputs that resemble the defaults, and (3) training the algorithm with inputs that do not resemble the defaults. Following the practice established in Murray et al. [12], we also conduct the analysis without predictors derived from questions related to previous contact with the health care system. This produces estimates of diagnostic accuracy that could be more appropriate for generalizing to community deaths where the decedents had no medical contact [16]. For each of the three configurations, we test all three age groups both with and without health care experience questions.

With default probbase
The default configuration assumes the input data matches the InterVA4 format with 245 symptoms. It uses the conditional probabilities from InterVA to predict one of 60 causes. With the default configuration, no ancillary training data is required. To validate the default configuration, we mapped the PHMRC database to the InterVA format, and then we used InSilicoVA to predict the cause of death. We then mapped the predicted causes to the PHMRC gold standard list. We compared these mapped predictions to the known underlying cause as listed in the PHMRC database to calculate performance. Since the algorithm was not trained empirically with this configuration, we used the entire validation dataset to test the predictive performance. However, it is still essential to test the algorithm on datasets with different cause compositions, so we repeated this process on 500 test datasets, each with a cause composition drawn from an uninformative Dirichlet distribution and samples drawn from the complete dataset with replacement according to this cause composition. The predicted causes included 36 adult causes of death, 20 child causes, and 7 neonate causes. Of the 245 symptom predictors used by InSilicoVA, the PHMRC dataset contained data for 123 adult symptoms, 69 child symptoms, and 62 symptoms for neonates.
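The resampling step described above can be sketched as follows. This is a minimal illustration, assuming a Dirichlet distribution with all concentration parameters equal to 1 as the uninformative distribution and drawing records with replacement within each cause; the function name and record layout are hypothetical, and the study's own implementation may differ in detail.

```python
import random


def dirichlet_resample(records, n_samples, rng):
    """Resample (cause, record_id) tuples to a random cause composition.

    Returns the resampled records and the target cause composition.
    """
    by_cause = {}
    for record in records:
        by_cause.setdefault(record[0], []).append(record)
    causes = sorted(by_cause)

    # A Dirichlet(1, ..., 1) draw is a set of normalized Gamma(1, 1) draws.
    draws = [rng.gammavariate(1.0, 1.0) for _ in causes]
    total = sum(draws)
    composition = {c: d / total for c, d in zip(causes, draws)}

    # Pick a cause for each sampled death according to the composition,
    # then pick a record of that cause with replacement.
    sampled_causes = rng.choices(
        causes, weights=[composition[c] for c in causes], k=n_samples
    )
    resampled = [rng.choice(by_cause[c]) for c in sampled_causes]
    return resampled, composition


# Toy data: 30 made-up records spread over three made-up causes.
records = [("A", i) for i in range(10)]
records += [("B", i) for i in range(15)]
records += [("C", i) for i in range(5)]
rng = random.Random(0)
resampled, composition = dirichlet_resample(records, 200, rng)
```

Because the composition is drawn independently of the cause frequencies in the source data, a rare cause can dominate a given test set, which is what decouples the test composition from the training composition.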

With empirical probbase
Next, we assessed how InSilicoVA performed with training data that matched its expected inputs. For this assessment, we mapped the PHMRC database to the InterVA symptoms, and the "gold standard" causes were mapped to the predicted causes. For each of the 500 test-train splits, we used the train split to calculate the empirical probability of an InterVA symptom being endorsed, conditional on the mapped cause. This conditional probability matrix was used as the input probbase for the algorithm. The test split was resampled to a Dirichlet cause distribution, and the algorithm predicted a cause from the default set of causes.
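The empirical conditional probabilities can be illustrated as simple frequencies. The sketch below is an assumption-laden illustration, not the package's own training routine: it computes the fraction of training records for each cause in which a symptom was endorsed, with a hypothetical function name and made-up symptom labels.

```python
def empirical_probbase(train_records, symptoms):
    """Return {cause: {symptom: P(symptom endorsed | cause)}}.

    train_records is a list of (cause, set_of_endorsed_symptoms) pairs.
    Hypothetical helper for illustration only.
    """
    endorsed_counts = {}
    cause_totals = {}
    for cause, endorsed in train_records:
        cause_totals[cause] = cause_totals.get(cause, 0) + 1
        row = endorsed_counts.setdefault(cause, {s: 0 for s in symptoms})
        for symptom in endorsed:
            if symptom in row:  # ignore symptoms outside the chosen list
                row[symptom] += 1
    return {
        cause: {s: row[s] / cause_totals[cause] for s in symptoms}
        for cause, row in endorsed_counts.items()
    }


# Toy training split with made-up causes and symptoms.
train = [
    ("Malaria", {"fever", "chills"}),
    ("Malaria", {"fever"}),
    ("Drowning", set()),
]
probbase = empirical_probbase(train, ["fever", "chills"])
```

A cause that never appears in a training split gets no row at all, so any real pipeline would need a rule for such cases; this sketch leaves that choice open.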
With empirical probbase matching Tariff 2.0
Finally, we assessed how the algorithm performed with training data of a different format than the standard inputs. For this assessment, the PHMRC database was mapped to the set of symptoms used by the Tariff 2.0 algorithm [7]. The data were mapped to 171 adult symptoms, 86 child symptoms, and 110 neonate symptoms. For each of the 500 test-train splits, we used the train split to calculate the empirical probability of a Tariff 2.0 symptom being endorsed conditional on the original PHMRC gold standard cause. We then used this empirical probability matrix in the InSilicoVA algorithm to predict causes of death. As before, we predicted for data in the test split after it had been resampled to a Dirichlet cause distribution. Of the three assessments, this configuration should be the most favorable towards InSilicoVA since it avoids any possible discrepancies between definitions of the PHMRC causes and the default causes, and it provides more symptom predictors for the algorithm to use.
The InSilicoVA R package provides 10 hyperparameters that allow users to tune the estimation procedure. Except where specifically mentioned, we used the default values provided by the InSilicoVA package. The validity of the results depends on the Monte Carlo experiment successfully converging to a stable result. We repeated each experiment using three times the default number of simulations and assessed the number of splits that converged and any differences in the results. Convergence was assessed using the Heidelberger and Welch test included with the R package. We used the extract.prob function provided by the InSilicoVA package in all training exercises.

Results
Tables 1 and 2 show the algorithmic performance of InSilicoVA at the individual level and population level, respectively, using the default probbase, training the algorithm on data with the same causes and symptoms as the default probbase, and training the algorithm on data with different causes and symptoms. At both the individual and population levels, the configuration using the causes published with the dataset and the Tariff 2.0 symptoms performed best across all age groups regardless of whether health care experience (HCE) variables were included. These variables are intended to reflect the impact of the extent of contact with health services prior to death in terms of additional information that might improve diagnostic accuracy.
At the individual level, InSilicoVA performed best for predicting the cause of death for child deaths. Without HCE variables, the median CCC for child VAs was 29.2% (uncertainty interval (UI) 29.0%, 29.4%) using the default probbase, 35.8% (UI 35.5%, 36.3%) when training the algorithm on the default cause list and symptoms, and 38.8% (UI 38.4%, 39.5%) when using the causes and symptoms which best matched the data. For adults and neonates, InSilicoVA performed substantially worse with the default probbase than with the Tariff 2.0 causes and symptoms. The CCC for adults was 16 [...]
At the population level, InSilicoVA performed best in predicting the CSMF for neonates when provided with training data. The algorithm performed substantially worse than chance for all age groups using the default probbase, despite predicting better than chance at the individual level for adults and children. The median CCCSMF was −59.4% (UI −61.7%, −57.7%) for adults, −46.2% (UI −48.4%, −43.6%) for children, and −39.9% (UI −43.8%, −32.1%) for neonates. The median CCCSMF was higher for child and neonate age groups when using the Tariff 2.0 causes and symptoms. For adults, the performance was the same when using the InterVA or Tariff 2.0 training. The CCCSMF was 2.1% (UI 0.5%, 3.9%) for adults, 22.3% (UI 20.7%, 23.9%) for children, and 37.6% (UI 33.7%, 40.8%) for neonates.
At both the individual level and the population level, Tariff 2.0 outperformed InSilicoVA in all age groups. At the individual level without HCE variables, the median CCC across splits was 9.3 percentage points higher for adults, 5.8 percentage points higher for children, and 4.5 percentage points higher for neonates using Tariff 2.0 to diagnose the VAs, compared to InSilicoVA. At the population level, the median CCCSMF for Tariff 2.0 was 21.0 percentage points higher for adults, 8.2 percentage points higher for children, and 11.6 percentage points higher for neonates. Figure 1 shows the individual-level and population-level performance of InSilicoVA using different configurations compared to Tariff 2.0. The cause-specific performance of InSilicoVA tended to follow a similar pattern as the Tariff 2.0 algorithm when trained using the same symptoms as predictors, except that the Tariff 2.0 concordance was generally higher. Across the specific age groups, InSilicoVA had higher concordance only for Drowning, Lung cancer, Maternal, Stomach cancer, and Suicide in adults; AIDS, Drowning, Malaria, Other defined causes of child deaths, Other digestive diseases, Other infectious diseases, and Pneumonia in children; and Birth asphyxia, Meningitis/sepsis, Preterm delivery, and Stillbirth in neonates.
Across all age groups, InSilicoVA had higher sensitivity for 22 of 61 PHMRC causes for at least one of the with HCE/without HCE scenarios. It had higher specificity for 32 of 61 PHMRC causes. Table 6 shows the median sensitivity and specificity across causes for InSilicoVA and Tariff 2.0. Overall, Tariff 2.0 had higher sensitivity for all age groups with and without the health care experience predictors. InSilicoVA had comparable specificity to Tariff 2.0 for adults and children, but slightly lower specificity for neonates. Additional file 5 shows cause-specific comparisons of InSilicoVA and Tariff 2.0 using sensitivity and specificity.
Further, when using training data, the model did not always converge for every test-train split. Across the three modules and different mappings of training data, between 4.7% and 81.5% of the 500 test-train splits failed to converge when using the default number of Monte Carlo simulations. We increased the number of simulations performed during the fitting process to three times the default to see if the model would eventually converge. Even with these extra samples, up to 27.8% of splits still failed to converge for some configurations.

Discussion
As expected, InSilicoVA performed best when using the causes and symptoms that closely matched the data. The differences between using the causes and symptoms from the data versus mapping to the InterVA causes and symptoms were greatest for neonates. The differences in population-level accuracy were generally larger than at the individual level. Even when using the ideal configuration, InSilicoVA always had lower diagnostic accuracy than the Tariff 2.0 method. The difference was greatest for adults where, without health care variables, the predictive accuracy of InSilicoVA was 9.3 percentage points lower at the [...] where, in all cases, the vast majority of deaths occur among the adult population [20].
Table 2 shows the population-level performance as the median value and uncertainty interval (UI) across 500 test-train splits using different probbase matrices for prediction, by age group, with and without health care experience (HCE) questions included. InSilicoVA was run without training using the default probbase, with an empirical probbase derived from training data mapped to the InterVA format, and with an empirical probbase derived from training data mapped to the Tariff [...]
We have reviewed InSilicoVA for two complementary purposes. First, we assessed the performance of the InSilicoVA method as a diagnostic algorithm for verbal autopsy. Second, InSilicoVA is a new piece of software that could potentially be applied routinely in vital statistics systems for deaths without physician certification.
Knowing that this is a potential use for this software, it is important that the method can be easily applied, and with confidence about diagnostic accuracy, in settings with little technical and statistical support. The need for continuous vetting of model input parameters and verification of model convergence is likely to be problematic in many countries, and is likely to result in low-quality cause-of-death statistics in countries where there are insufficient resources to procure these services. Compared with Tariff 2.0, we found that InSilicoVA performs significantly worse in correctly predicting causes of death. We were not able to identify any configuration of input parameters, for any age group, that outperformed published estimates from the Tariff 2.0 algorithm. InSilicoVA shows the most promising results for children and neonates, despite having noticeably fewer symptom predictors for these age groups, but even for these age groups it still has noticeably lower diagnostic accuracy than Tariff 2.0. This result is generally consistent when comparing cause-specific performance between the two algorithms.
Fig. 1 Comparison of InSilicoVA and Tariff 2.0 at the individual and population levels. Note: Individual-level accuracy is assessed using chance-corrected concordance. Population-level accuracy is assessed using chance-corrected cause-specific mortality fraction (CSMF) accuracy. Values of zero in either dimension are equivalent to random guessing and range up to 100% for perfect accuracy. InSilicoVA is tested using the default expert-derived probbase, a probbase empirically trained using InterVA symptoms, and a probbase empirically trained using Tariff 2.0 symptoms. Published accuracies of Tariff 2.0 are shown for comparison.
Table 6 shows the sensitivity and specificity across causes for InSilicoVA using an empirical probbase derived from training data mapped to the Tariff 2.0 format. Previously published Tariff 2.0 results are shown for comparison.
For a few causes, InSilicoVA had higher CCC. However, this increased sensitivity came at the expense of other causes, which had significantly lower concordance; this may indicate that the model overfits to causes that are easier to detect. This is especially evident for neonates, where InSilicoVA achieved higher concordance for four causes but predicted the other two causes at a level equal to chance, as indicated by the uncertainty interval containing zero. This is in contrast to Tariff 2.0, which performed similarly across causes, with the exception of Stillbirth, which had high concordance for both algorithms.
To predict with this algorithm, users must decide what conditional probability matrix to use. The InSilicoVA authors propose that, in practice, ranked conditional probabilities be derived from expert panels that rank the propensities of seeing a symptom given a particular cause of death [14]. They show that the predictive accuracy of the method is heavily dependent on the quality of this input. However, deriving these probabilities may not be straightforward. The required value is the probability of a respondent saying the decedent had a given symptom. This is subtly but importantly different from the probability of the decedent having the symptom. The value needed for this algorithm requires that a decedent had a symptom, the decedent communicates this symptom to someone or someone notices it, the interviewer finds this person who knew about the symptom, and the respondent remembers the symptom months later when the VA interview is being conducted. The respondent may not notice or may forget key symptoms. When medical professionals create these ranked conditional probabilities, they may implicitly estimate the probability of identifying a symptom themselves in their expert, clinical evaluation. This value could mislead the algorithm and result in inaccurate predictions. It is necessary that experts who select these conditional probabilities balance both the presentation of symptoms due to a disease and the ability of non-experts to reliably identify, remember, and report on these symptoms.
We report here, for the first time, the predictive performance of InSilicoVA using the default conditional probabilities (from InterVA). Given resource constraints in the settings where VA is likely to be used, and the logistical difficulties of collecting location-specific probbase information from medical professionals familiar with the area, it is quite likely that the InSilicoVA defaults will be used in practice. We found that the default configuration and conditional probabilities consistently perform worse than chance at all ages at the population level. The authors claim that InSilicoVA is applicable in a wider range of settings because it does not need to rely on "gold standard" data [14]. However, we have demonstrated that using expert-derived training as opposed to empirically derived training data results in unacceptably poor performance.
The results from this study match a previous validation of the InterVA algorithm, which found that, once corrected for chance, the population-level accuracy of predictions using an expert-derived probbase is relatively poor [21]. The InterVA probbase used by InSilicoVA has undergone extensive field testing and review by numerous investigators in multiple countries [22]. Given this, we believe it is extremely unlikely for expert-derived probbases to produce estimates that rival empirically derived training such as that used by Tariff 2.0. Additionally, expert-derived training has the unfortunate effect of often appearing plausible, since it reconfirms the intuition of the experts training and evaluating the method, which can be, and often is, incorrect. The net result is a situation in which diagnostic information being provided by InSilicoVA is likely to be worse than acting on no information whatsoever.
In this study, we used test data with a cause distribution uncorrelated with the training data. This resulted in scenarios in which the training data and test data were sufficiently different that the model could not successfully converge. The R package displays a warning about non-convergence and says the results may be unreliable, but it still yields outputs. This raises two operational considerations with the use of InSilicoVA. First, it is possible to create a conditional probability matrix in which the model does not successfully produce reliable results. Second, the R package produces results even in this circumstance. It is possible that InSilicoVA users may unintentionally overlook the warning that the MCMC process has not converged, leading to adoption of results which are known to be statistically inaccurate.
Installing Java and properly configuring R and Java to work together requires considerable technical expertise and is not standardized across different computer systems. Although InSilicoVA is freely available, it may require expert technical consultation to be usable.

Conclusions
Verbal autopsy as a diagnostic method is now being actively considered by countries for routine widespread use in surveillance and vital statistics systems [23]. It is important to keep improving the science behind estimation and validation of different cause-of-death prediction strategies so that policy makers can be provided with the highest quality estimates based on the best possible measurement methods. It is also important that methods be independently investigated and evaluated for usability