Our general approach was first to establish a rank order of individual question items in the PHMRC VAI in terms of their importance in predicting COD. We did this by using the Tariff 2.0 Method [17] to predict the COD for each VA in the PHMRC Gold Standard database and comparing the predicted COD with the gold standard cause. Second, we reduced the size of the instrument by dropping items in reverse order of their importance. We assessed the predictive performance of the instrument at each stage of item reduction by calculating chance-corrected concordance (CCC) at the level of the individual and cause-specific mortality fraction (CSMF) accuracy at the level of the population (see below). Finally, we determined the optimum size of the shortened instrument using a first derivative analysis of the decline in performance as the size of the VAI progressively decreased. We followed the same approach for adults, children, and neonates.
PHMRC gold standard validation study database
The general methodology of the PHMRC study has been described in detail elsewhere [16]. In summary, VAs were collected from six sites in four countries: Andhra Pradesh and Uttar Pradesh in India, Bohol in the Philippines, Mexico City in Mexico, and Dar es Salaam and Pemba Island in Tanzania. Methods were approved by the Internal Review Boards of the University of Washington, Seattle, WA, USA; School of Public Health, University of Queensland, Australia; George Institute for Global Health, Hyderabad, India; National Institute of Public Health, Mexico; Research Institute for Tropical Medicine, Alabang, Metro Manila, Philippines; Muhimbili University, Tanzania; Public Health Laboratory Ivo de Carneri, Tanzania; and CSM Medical University, India. All data were collected with prior informed consent. Gold standard clinical diagnostic criteria for hospital deaths were specified for an initial list of 53 adult, 27 child, and 13 neonatal causes including stillbirths, chosen on the basis of epidemiological criteria and the likely ability of VA to identify the cause (Additional file 1). This was known as the target cause list. Deaths with hospital records fulfilling the gold standard criteria were identified in each of the sites. The PHMRC VAI was used to interview families about the events leading to each of these deaths [16]. Interviewers were blinded to the COD assigned in the hospital. The PHMRC database contains 12,501 verbal autopsies with gold standard diagnoses (7,846 adults, 2,064 children, 1,586 neonates, and 1,005 stillbirths).
The PHMRC VAI includes both closed-ended questions and an open-ended narrative. Questions covered: 1) symptoms of the terminal illness, 2) diagnoses of chronic illnesses obtained from health service providers, 3) risk behaviors (tobacco and alcohol), 4) details of any interactions with health services, and 5) details about the background of the decedent and about the interview itself. Not all of these questions contributed to prediction of the COD. Questions that were converted to binary variables – the necessary basis for Tariff analysis and the prediction of COD – we refer to as question items. Text items were derived from open-ended narrative using a text mining procedure (Text Mining package in R (version 2.14.0) [18]), which identifies keywords and groups words with the same or similar meanings. Performance in this paper is reported as being 1) with text, 2) without text, and 3) with a checklist. The checklist uses only a selected subset of text items as described later.
Tariff 2.0
The Tariff Method is based on a simple additive algorithm that creates a score, or tariff, for each questionnaire item and uses these scores to assign COD [10, 17]. Ideally, an item would have a high tariff for just one COD and a low tariff for all others; the model would then differentiate readily between causes [10]. For example, the item “Decedent suffered drowning” has a strong association with a few causes of death (accidental drowning, homicide, and suicide) and carries high tariffs for those causes. On the other hand, the item “Decedent had a fever” is associated with many different causes of death and carries low tariffs for each of them. Across causes, the tariffs for drowning therefore have a high standard deviation, while the tariffs for fever have a low standard deviation. Items whose tariffs had high standard deviations were considered more important for diagnosis than items whose tariffs had low standard deviations. To determine their order of importance, items were ranked by the standard deviation of their tariffs. This was done separately for each module (adult, child, and neonate).
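The ranking step can be illustrated with a minimal sketch. The tariff values below are invented for illustration and are not taken from the PHMRC data; the function name `rank_items_by_tariff_sd` is hypothetical, and the use of the population (rather than sample) standard deviation is an assumption, since the text does not specify which was used.

```python
from statistics import pstdev

# Hypothetical tariffs: item -> {cause: tariff}. Values are illustrative only.
tariffs = {
    "drowning": {"drowning": 9.0, "homicide": 4.0, "suicide": 3.0,
                 "pneumonia": 0.0, "malaria": 0.0},
    "fever":    {"drowning": 0.0, "homicide": 0.0, "suicide": 0.0,
                 "pneumonia": 1.0, "malaria": 1.5},
    "cough":    {"drowning": 0.0, "homicide": 0.0, "suicide": 0.0,
                 "pneumonia": 3.0, "malaria": 0.5},
}


def rank_items_by_tariff_sd(tariffs):
    """Return item names ordered from most to least important.

    Importance is the standard deviation of an item's tariffs across causes
    (population SD here, which is an assumption).
    """
    spread = {item: pstdev(list(by_cause.values()))
              for item, by_cause in tariffs.items()}
    return sorted(spread, key=spread.get, reverse=True)


ranking = rank_items_by_tariff_sd(tariffs)
print(ranking)
```

In this toy example the drowning item ranks first because its tariffs differ sharply between causes, while the fever item ranks last because its tariffs are uniformly low.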
Measurement of performance
Simulated populations
The performance of a VA method in assigning a COD is a function of the true cause of death composition in the study population [19]. Therefore, when developing a VA diagnostic method or a new VAI, it is important to validate the method or instrument in as many populations with different cause compositions as possible. This is made practicable by computer simulation: 500 populations with random cause compositions were created from the PHMRC dataset for the development and validation of the original suite of VA methods [16]. In the present study, every test of the performance of instruments of different lengths used the same 500 randomly generated populations. Each of the 500 train-test datasets was generated by holding 75% of the dataset as “training” data and 25% as “test” data. Each test dataset was then resampled using a Dirichlet distribution to obtain a random CSMF composition for each simulated population. Training data were used to generate the model. Analysis of test data was blinded to the gold standard COD. The accuracy of COD predictions was assessed using the performance metrics described below. This process is described more fully in Additional file 2.
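The construction of one simulated population can be sketched as follows. This is an illustration under stated assumptions rather than the procedure in Additional file 2: the function names are hypothetical, the split is a simple unstratified shuffle, the Dirichlet distribution is taken to be symmetric with all parameters equal to one, and the test set is resampled with replacement to its original size.

```python
import random


def dirichlet(k, rng, alpha=1.0):
    """Draw one point from a symmetric Dirichlet by normalizing gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_population(records, rng, train_frac=0.75):
    """Build one train-test dataset from (record_id, gold_standard_cause) pairs.

    Returns the training records, the resampled test records, and the
    random CSMF composition used to resample the test records.
    """
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, test = shuffled[:cut], shuffled[cut:]

    # Group test deaths by gold standard cause.
    by_cause = {}
    for rec in test:
        by_cause.setdefault(rec[1], []).append(rec)
    causes = sorted(by_cause)

    # Draw a random cause composition, then resample deaths (with
    # replacement) so the test set follows that composition in expectation.
    csmf = dirichlet(len(causes), rng)
    drawn = rng.choices(causes, weights=csmf, k=len(test))
    resampled = [rng.choice(by_cause[c]) for c in drawn]
    return train, resampled, dict(zip(causes, csmf))


rng = random.Random(0)
records = [(i, "ABC"[i % 3]) for i in range(120)]  # toy data, three causes
train, test, csmf = simulate_population(records, rng)
```

Repeating the call 500 times with a shared random number generator would give 500 populations that can be reused for every instrument length.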
Performance metrics
For policy, research, and surveillance it is important to be able to quantify the actual performance of a VA method in predicting the COD, correcting for chance at both individual and population levels. We assessed performance of the progressively shortened VAI using Cohen’s Kappa, CCC, sensitivity, specificity and CSMF accuracy.
CCC measures sensitivity adjusted for chance and was used to assess the extent to which Tariff 2.0 correctly predicted an individual cause of death when applied to the shortened VAI. A perfect prediction has CCC equal to one, while a random allocation would have, on average, CCC equal to zero. CCC is calculated as follows:
$$ \mathrm{CCC}_j=\frac{\left(\frac{TP_j}{TP_j+FN_j}\right)-\left(\frac{1}{N}\right)}{1-\left(\frac{1}{N}\right)} $$
where TPj is true positives, or the number of decedents with gold standard cause j assigned correctly to cause j; FNj is false negatives, or the number of decedents with gold standard cause j incorrectly assigned to a different cause; and N is the number of causes analyzed. The sum of TPj and FNj is the total number of deaths due to cause j.
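As a worked illustration, the CCC formula can be written as a short function. The function and argument names are hypothetical, and the counts in the usage lines are invented.

```python
def ccc(tp, fn, n_causes):
    """Chance-corrected concordance for one cause.

    tp: deaths from the cause that were assigned to it
    fn: deaths from the cause that were assigned to another cause
    n_causes: number of causes analyzed (N in the formula)
    """
    sensitivity = tp / (tp + fn)
    chance = 1 / n_causes
    return (sensitivity - chance) / (1 - chance)


perfect = ccc(tp=50, fn=0, n_causes=34)      # every death assigned correctly
at_chance = ccc(tp=1, fn=33, n_causes=34)    # sensitivity equal to 1/N
```

Here `perfect` evaluates to one and `at_chance` to zero, matching the two reference points given for the metric.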
Performance was also measured at the population level using mean CSMF accuracy across the 500 cause compositions, calculated as
$$ \mathrm{CSMF}\ \mathrm{accuracy}=1-\frac{\sum_{j=1}^{k}\left|\mathrm{CSMF}_j^{\mathrm{true}}-\mathrm{CSMF}_j^{\mathrm{pred}}\right|}{2\left(1-\min_j\left(\mathrm{CSMF}_j^{\mathrm{true}}\right)\right)} $$
where the numerator in the calculation is the sum of the absolute error for all k causes between the true CSMF and estimated CSMF, and the denominator is the maximum possible error across all of the causes. CSMF accuracy will be one when the CSMF for every cause is predicted with no error.
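The CSMF accuracy calculation can likewise be sketched in a few lines. The function name is hypothetical, the cause fractions are invented, and both inputs are assumed to cover the same k causes.

```python
def csmf_accuracy(true_csmf, pred_csmf):
    """CSMF accuracy for one population.

    true_csmf, pred_csmf: dicts mapping each cause to its mortality fraction;
    both are assumed to contain the same causes and to sum to one.
    """
    abs_error = sum(abs(true_csmf[c] - pred_csmf[c]) for c in true_csmf)
    max_error = 2 * (1 - min(true_csmf.values()))
    return 1 - abs_error / max_error


true_csmf = {"a": 0.2, "b": 0.3, "c": 0.5}
exact = csmf_accuracy(true_csmf, dict(true_csmf))
# Worst case: all deaths assigned to the cause with the smallest true fraction.
worst = csmf_accuracy(true_csmf, {"a": 1.0, "b": 0.0, "c": 0.0})
```

In this example `exact` is one and `worst` is zero, which shows why the denominator is the maximum possible error. The study value would be the mean of this quantity over the 500 simulated populations.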
Developing a shortened verbal autopsy instrument
To begin, we removed questions about the background of decedents from the full PHMRC VAI. We then turned the remaining questions into binary indicators, or items, as described above. Thus, 183 adult, 127 child, and 149 neonatal questions were converted into 170, 80, and 117 question items, respectively. Next, we ranked these items (1–170, 1–80, and 1–117) according to their importance, as defined by the standard deviation of their tariffs. We then systematically reduced the size of the instrument by 10 question items at a time, removing the least important items first. With each successive reduction in the number of items, we measured both CCC and CSMF accuracy using the 500 simulated populations as described above. We analyzed the performance of question items with and without text to assess the importance of text as the number of question items decreased. We then used a cubic spline to interpolate between these CCC and CSMF accuracy values to derive a continuous performance curve for each metric. From the first derivative of each curve, we identified the point (i.e., the residual number of items) at which the metric (CCC with text, CCC without text, CSMF accuracy with text, CSMF accuracy without text) began to decrease at a significantly negative rate. The optimum size of the shortened VAI for each of the three age groups was the number of items that immediately preceded any significant decrease in at least one of these four metrics. These items, which had been ranked in order of importance, formed the basis for the final shortened VAI.
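The reduction procedure can be sketched as below. This is a simplified illustration, not the analysis used in the study: it replaces the cubic spline and its analytic derivative with finite differences between successive instrument sizes, the threshold for a "significant" decline is a hypothetical parameter, and the performance values are invented.

```python
def reduction_schedule(ranked_items, step=10):
    """Yield progressively shorter instruments.

    ranked_items must be ordered from most to least important, so slicing
    from the front drops the least important items first.
    """
    n = len(ranked_items)
    while n > 0:
        yield ranked_items[:n]
        n -= step


def optimum_size(sizes, scores, threshold):
    """Return the last instrument size before performance starts to fall steeply.

    sizes: instrument sizes in descending order
    scores: the performance metric measured at each size
    threshold: loss in the metric per item dropped that counts as a
               significant decline (hypothetical value chosen by the analyst)
    """
    for i in range(1, len(sizes)):
        loss_per_item = (scores[i - 1] - scores[i]) / (sizes[i - 1] - sizes[i])
        if loss_per_item > threshold:
            return sizes[i - 1]
    return sizes[-1]


# 170 adult items give instruments of 170, 160, ..., 10 items.
adult_sizes = [len(items) for items in reduction_schedule(list(range(170)))]

# Toy performance curve (invented values) for a 50-item instrument.
toy_sizes = [50, 40, 30, 20, 10]
toy_scores = [0.50, 0.50, 0.49, 0.40, 0.20]
best = optimum_size(toy_sizes, toy_scores, threshold=0.005)
```

Applying `optimum_size` to each of the four metrics and taking the largest result would correspond to stopping before a significant decrease in any one of them.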
To complete the VAI, we also added questions that would enable the shortened version to function as a stand-alone instrument in a survey. In particular, we inserted questions to preserve the sense and flow of the instrument: for example, an important question was, “Did [name] cough blood?” but this needed to be preceded by the question, “Did [name] have a cough?” We also retained questions relevant to health service utilization and decedent background.
We then piloted the shortened VAI in three sites in the Philippines, Sri Lanka, and Bangladesh to assess its logic and applicability using Android tablets and the open source software, Open Data Kit (ODK) [20].
Checklist for open narrative
The Tariff Method uses the 40 top-ranked items for each cause prediction, based on the standard deviation of each item’s tariff [10]. In Tariff 2.0, 43% of the items used in the prediction of all 34 causes in adults were text items derived from open narrative that had been translated into English [17]. We therefore concluded that it was critical to include open narrative in the shortened form of the instrument. We found, however, that we had failed to take into account the difficulties that interviewers would experience in entering open narrative directly onto the tablet. These difficulties arose not only from shifting between languages but also from shifting between the Bengali and Sinhala scripts and the Latin script used for English. During the field trial, some field staff had taken notes on paper, which they transcribed in the office to record the open narrative section. This process took more time and effort than any other component of data management and was a potential source of error. Such difficulties were compounded by the limited character sets for non-Latin scripts on the tablets and the much more extensive training required to enter lengthy text data into a tablet. We therefore developed a checklist of keywords to use in the open narrative rather than having interviewers record and transcribe an entire conversation.
This checklist comprised a list of words that were endorsed by the interviewer when mentioned by the respondent in describing the circumstances surrounding the death. These words could be converted directly into English and subjected to text mining.
Using the 500 simulated populations we measured the independent effect on performance of the addition of single text items to the shortened VAI: i.e., on CCC overall, on CCC by cause, and on CSMF accuracy (all question items plus a single text item). This was done separately for the adult, child, and neonate modules. The length of the final checklist for each of the three modules was decided on practical grounds: the checklist needed to fit on a single screen and could not have more words than could easily be remembered by the interviewer during the conversation. It was thus limited to a maximum of 12 text items. The final selection was based both on the items’ contributions to performance and on their significance for the diagnosis of diseases of public health importance.