Our randomized trial compared COD assignment by six current automated algorithms with physician assignment, and avoided the problems inherent in observational comparisons of algorithms, including the fact that they were trained and tested on differing datasets [18, 24]. The trial adopted rigorous quality control in training, data collection, and coding that yielded high-quality data in both trial arms. Physicians were allocated records randomly within their arm, and the physician and automated arms were well balanced in the overall distribution of the key symptoms predicting COD. We randomized about 50% more deaths than originally planned and had sufficient statistical power to detect high concordance for deaths in each age group. The six algorithms varied widely in their concordance with the standard, even for causes that common sense suggests are easy to identify, such as injury, cancers, or suicide. The range of concordance in our trial overlaps with that in observational studies; however, the validity of the randomized results is far greater. No one algorithm consistently performed better than the others, and performance varied across specific diseases. Hence, claims of superiority of any one algorithm [13] carry little scientific credibility.
Physician assignment of COD is the global standard for medical certification of cause of death [4]. Inevitably, the quality of information in VA will be lower than that from medically certified deaths occurring in health facilities. However, VA is quite accurate for deaths in children and among young and middle-aged adults (but less accurate for deaths at older ages) when compared with clinical information in hospitals, death certificates, or cancer registry data [6, 7, 22, 23]. Initial agreement by two physicians on the COD was quite high. Importantly, VAs are valuable precisely in settings lacking facility-based certification. Despite the inherent misclassification, VAs are informative when compared with no evidence (the most common scenario in most countries) and with modeled mortality patterns [1]. Though we used physician assignment as the reference, it would be misleading to claim physician assignment as a “gold” standard, as none exists [1, 7, 12]. Unattended deaths, by definition, cannot be conclusively categorized, and hospital-based deaths cannot adequately reflect home deaths [1].
Additional comparisons enabled in this trial offer reasonable assurance that the use of lay reporting with dual independent physician assignment yields reliable and comparable COD distributions over time and place. First, physician-assigned deaths used as the standard in the trial were distributed similarly to deaths assigned (also by physicians) in the same geographic areas in the most recent data of the MDS (see Additional file 16). Second, an earlier comparison of a 3% random sample of deaths within the MDS showed similarly high reproducibility of 94–92% for adults and children below age 5 years (see Additional file 17) [3, 25]. Third, non-medical VA reporting by field staff provides results comparable to those of the (far less practical) approach of physicians interviewing VA respondents (see Additional file 18) [13, 26]. Finally, physician assignment of deaths in the automated arm (done using only the list of symptoms, without a narrative) yielded concordance of 82–91% with the standard, better than that for algorithms, albeit with some variability for specific conditions such as ischemic heart disease (see Additional files 11 and 12).
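As an illustration of how individual-level concordance with a reference assignment can be tabulated, the following is a minimal sketch. It assumes concordance is the simple proportion of deaths whose assigned cause matches the reference cause; the trial's exact metric may differ. The function name `concordance` and the toy cause labels are hypothetical and are not trial data.

```python
from collections import defaultdict


def concordance(reference, assigned):
    """Return (overall, per_cause) agreement between two lists of COD labels.

    reference: cause labels from the standard (e.g., dual physician coding)
    assigned:  cause labels from the method under comparison, same order
    Assumption: concordance is plain proportion agreement, with per-cause
    values computed among deaths that carry that cause in the reference.
    """
    if len(reference) != len(assigned):
        raise ValueError("both lists must cover the same deaths")
    if not reference:
        raise ValueError("no deaths to compare")

    totals = defaultdict(int)   # deaths per reference cause
    matches = defaultdict(int)  # agreeing deaths per reference cause
    for ref, got in zip(reference, assigned):
        totals[ref] += 1
        if ref == got:
            matches[ref] += 1

    overall = sum(matches.values()) / len(reference)
    per_cause = {cause: matches[cause] / totals[cause] for cause in totals}
    return overall, per_cause


# Hypothetical toy data, not trial results.
reference = ["injury", "cancer", "ihd", "ihd", "injury", "cancer"]
assigned = ["injury", "cancer", "stroke", "ihd", "injury", "ihd"]
overall, per_cause = concordance(reference, assigned)
```

Reporting the per-cause values alongside the overall figure makes cause-specific variability visible, which a single summary proportion would hide.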
The inadequate performance of current automated algorithms is likely a result of several complementary factors [1, 3, 7, 27, 28]: (i) the intrinsic limitations of each algorithm; (ii) the fact that the PHMRC dataset appears to be customized mostly to build SmartVA (indeed, InterVA-4 and InSilicoVA-NT, which do not require training, generally performed better than algorithms that did, and SmartVA yielded a surprisingly high proportion of ill-defined deaths in adults compared with the proportion reported earlier on the PHMRC data [13]); (iii) the fact that the PHMRC hospital-based deaths differ substantially from unattended home-based deaths in education levels, pathogen distribution, and symptom-cause information [11, 16, 24, 29]; and (iv) the inadequate quality and size of training and testing data (particularly for children and neonates), which limit the ability of algorithms to generate adequate symptom-cause information for COD predictions (see Additional files 4 and 14).
Further development of automated methods is desirable, but requires much larger samples of randomly selected unattended deaths, with sufficient sample size to test combinations for different causes [16]. Currently, it is not possible to specify a priori which algorithm to use for which specific COD. Theoretical, but as yet impracticable, combinations of algorithms would perform much better than individual algorithms (see Additional file 10). Research to understand the microbiological status for bacterial and viral infections and the pathophysiological processes (such as cerebral edema for malaria) of childhood deaths is now being supported by the Gates Foundation. This may help improve future verbal autopsy tools (and assignment guidelines) by comparing the sensitivity and specificity of symptoms with biological confirmation, particularly if the sampling includes sufficient numbers of home deaths [28, 30]. Natural language processing on VA narratives has also yielded promising results [3]. Narratives contain valuable information on chronology, care-seeking behavior, and social factors, which are difficult to capture in checklist interviews [7].
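One way to picture why a theoretical combination can outperform every individual algorithm, and why it is impracticable, is sketched below. The sketch assumes per-cause concordance has already been measured for each algorithm against the reference, then picks the best algorithm for each cause. The algorithm names `algo_a` and `algo_b`, all scores, and the death counts are hypothetical and do not come from the trial or its additional files.

```python
def best_per_cause(scores):
    """Map each cause to the (algorithm, concordance) pair with the top score.

    scores: {algorithm_name: {cause: concordance}}
    """
    best = {}
    for algo, by_cause in scores.items():
        for cause, value in by_cause.items():
            if cause not in best or value > best[cause][1]:
                best[cause] = (algo, value)
    return best


def weighted_mean(by_cause, weights):
    """Average per-cause concordance, weighted by deaths per reference cause."""
    total = sum(weights.values())
    return sum(by_cause[cause] * w for cause, w in weights.items()) / total


# Hypothetical per-cause concordance and reference death counts.
scores = {
    "algo_a": {"injury": 0.80, "cancer": 0.40, "ihd": 0.55},
    "algo_b": {"injury": 0.60, "cancer": 0.70, "ihd": 0.35},
}
deaths = {"injury": 100, "cancer": 50, "ihd": 150}

best = best_per_cause(scores)
# Upper bound: each death is handled by the best algorithm for its true cause.
oracle = weighted_mean({c: v for c, (_, v) in best.items()}, deaths)
singles = {algo: weighted_mean(s, deaths) for algo, s in scores.items()}
```

The combined figure is an upper bound only: routing each death to the best algorithm for its cause presupposes knowing the cause in advance, which is exactly what cannot be specified a priori.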
Our results further suggest that programs planning to use automated assignment should retain local-language narratives for dual-physician coding. Our trial requires replication in sub-Saharan Africa, where a much higher prevalence of HIV and malaria would result in mortality patterns different from those seen in India.
Considerations of the financial and opportunity costs of physician coding are secondary to the question of accuracy, but information from this trial suggests that such concerns may be misplaced. The entire cost of field work, data collection, and coding per house was less than US $3 (and US $1 in the MDS) [3, 31]. About two thirds of the costs are for the requisite field interviews. Only about one quarter of costs are for physician assignment [3, 7]. The electronic platform used in this trial (and in the MDS) enables physicians to work part-time, typically during evenings, and therefore does not divert them from other clinical or public health duties. This study and the MDS reinforce the need to have a large, geographically distributed pool of physician coders, so as to help counter the biases of any one physician in coding [7]. Standard panels for physician coding and a central pool of doctors to re-code VAs globally would also boost cross-country comparability [4].
Our trial supports the need to develop simpler, cheaper VA field methods [7]. Paradoxically, the 2016 WHO VA forms have 50% more questions than the 2012 version, reaching 346 questions in the adult form (in part to feed demands made to WHO by algorithm designers). Though the MDS has only 68 questions on the adult form, it yielded a COD distribution comparable to that from the longer trial forms (see Additional files 16 and 17). Shorter forms enable quicker interviews that are more likely to retain respondents’ interest, reduce surveyor time costs, and thus enable larger sample sizes [31]. Simplifying and reducing the questions is a priority, while maintaining the ability to use either physician or automated assignment. Ideally, dual independent physician assignment can improve performance and reduce the biases of single coding [11]. However, dual physician assignment may not be practical in all settings. Further research on combinations of single-physician coding and resampled second coding, or indeed on combining physician and algorithm coding, is required.
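As one possible reading of "single-physician coding and resampled second coding", the sketch below draws a reproducible simple random sample of records to receive a second, independent physician coding. The function name `select_for_second_coding`, the 10% fraction, and the record identifiers are assumptions made for illustration; the source does not specify a resampling design.

```python
import random


def select_for_second_coding(record_ids, fraction, seed=0):
    """Pick a simple random sample of records for a second physician coding.

    record_ids: identifiers of all singly coded records
    fraction:   share of records to re-code, in (0, 1]
    seed:       fixed seed so the selection can be reproduced and audited
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = list(record_ids)
    if not ids:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(ids) * fraction))  # re-code at least one record
    return sorted(rng.sample(ids, k))


# Hypothetical example: 200 singly coded records, 10% sent for re-coding.
records = list(range(1, 201))
resample = select_for_second_coding(records, fraction=0.10, seed=42)
```

Agreement between the first and second coding on the resampled records could then be tabulated to monitor coder reliability without re-coding every death.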