Machine learning for personalized medicine
Defining better clinical endpoints
Many methodological as well as applied articles focus on simple yes/no decision tasks, e.g., disease progression/no disease progression or clinical trial endpoint met/not met. This is surprising, because machine learning research offers a comprehensive arsenal of techniques to address clinical endpoints beyond binary classification, such as real-valued, time-to-event, multi-class, or multivariate outcomes. Models with binary outcomes can be appropriate in specific situations, but in many cases an appropriate clinical outcome is more complex. For instance, the commonly used response criterion for rheumatoid arthritis, a debilitating autoimmune disease of the joints, is based on the DAS28 disease score [26], which ranges on a continuous scale from 0 to 10 and is often discretized into three consecutive levels (low, medium, high disease activity).
The DAS28 score itself combines four components in a nonlinear equation, namely the number of swollen joints, the number of tender joints, plasma levels of C-reactive protein (CRP), and an assessment of the patient's global health as estimated by a physician. These components range from discrete to continuous and from subjective, physician-dependent assessments to more objective biomarker measurements.
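To make the nonlinear combination concrete, the following sketch computes one published variant of the score, DAS28-CRP, and discretizes it into the three levels mentioned above; the coefficients and the cutoffs (3.2 and 5.1) are commonly cited values reproduced here for illustration only and should be checked against [26] before any real use.

```python
import math

def das28_crp(tjc28, sjc28, crp_mg_l, gh_vas_mm):
    """DAS28-CRP composite score (one published variant of the formula):
    tender/swollen joint counts enter via square roots, CRP via a log term,
    and the patient's global health (0-100 mm VAS) linearly."""
    return (0.56 * math.sqrt(tjc28)
            + 0.28 * math.sqrt(sjc28)
            + 0.36 * math.log(crp_mg_l + 1)
            + 0.014 * gh_vas_mm
            + 0.96)

def activity_level(score):
    """Discretization into the three levels mentioned in the text
    (cutoffs are commonly cited values, shown here for illustration)."""
    if score <= 3.2:
        return "low"
    if score <= 5.1:
        return "medium"
    return "high"

score = das28_crp(tjc28=4, sjc28=3, crp_mg_l=12.0, gh_vas_mm=50)
print(round(score, 2), activity_level(score))  # -> 4.19 medium
```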
Another example is the prediction of response to anti-epileptic drug treatment. While at first glance overall seizure frequency reduction after a given number of weeks relative to baseline seems an appropriate endpoint, in agreement with common practice in clinical trials, this choice in fact neglects the existence of different seizure types as well as potential treatment-induced changes in these seizure types over time. Thus, other and more complex (possibly multivariate) clinical endpoints might be necessary. We expect that a more careful choice of clinical endpoints, as well as better technical monitoring capabilities (e.g., via mobile health applications and wearable sensors), will lead to more clinically useful prediction models in the future.
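As an illustration of what such a multivariate endpoint could look like, the sketch below computes a per-seizure-type relative frequency reduction from hypothetical seizure diaries, rather than collapsing all types into a single overall rate; the data and function are invented for illustration.

```python
import numpy as np

def per_type_reduction(baseline, treatment):
    """Vector-valued endpoint: relative reduction in mean weekly seizure
    frequency, computed separately for each seizure type, instead of one
    overall frequency reduction."""
    base_rate = np.asarray(baseline).mean(axis=0)    # mean weekly count per type
    treat_rate = np.asarray(treatment).mean(axis=0)
    return 1.0 - treat_rate / np.maximum(base_rate, 1e-9)

# Hypothetical diaries: rows = weeks, columns = {focal, generalized} seizures.
baseline = [[3, 1], [4, 0], [2, 2]]
treatment = [[1, 1], [0, 2], [1, 1]]
print(per_type_reduction(baseline, treatment))
# -> approx. [0.78, -0.33]: one seizure type improves while another worsens,
#    which a single overall frequency reduction would hide.
```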
Defining appropriate model quality and performance measures
What makes a good model in personalized medicine? First, predictions must be accurate. As pointed out above, prediction accuracy must be assessed via a careful validation approach. Within such a validation procedure, it has to be decided how prediction performance will be measured. It appears that, in many studies, too much focus is given to standard, off-the-shelf metrics (e.g., area under the receiver operating characteristic curve) compared to application-specific performance metrics. For instance, consider the case of predicting response to a first-line therapy and assume that we can formulate this question as a classification task (responder vs. non-responder). Clearly, a perfectly accurate classifier is optimal. However, even a classifier that is mediocre with respect to overall accuracy might reliably identify those patients who will definitely not respond to the drug. The identified patients could immediately move on to a second-line therapy; thus, patient quality of life would improve and healthcare costs could be reduced. This example demonstrates the relevance of carefully defining appropriate prediction performance metrics.
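One way to operationalize such an application-specific metric is sketched below (all labels, names, and thresholds are illustrative, not a standard named metric): among the patients ranked as least likely to respond, we find the largest group that can be called non-responders with a given precision and report which fraction of all true non-responders it captures.

```python
import numpy as np

def non_responder_yield(y_true, p_responder, min_precision=0.95):
    """Among patients flagged as likely non-responders (lowest predicted
    response probability), find the largest set whose non-responder
    precision is >= min_precision and return the fraction of all true
    non-responders it captures. y_true: 1 = responder, 0 = non-responder."""
    y = np.asarray(y_true)[np.argsort(p_responder)]  # most non-responder-like first
    n_nonresp = max((y == 0).sum(), 1)
    best_recall = 0.0
    for k in range(1, len(y) + 1):
        flagged = y[:k]
        if (flagged == 0).mean() >= min_precision:
            best_recall = (flagged == 0).sum() / n_nonresp
    return best_recall

# A classifier that is mediocre overall may still isolate a reliable
# non-responder subgroup that could move straight to second-line therapy.
y_true = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
p_resp = [0.05, 0.1, 0.2, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
print(non_responder_yield(y_true, p_resp, min_precision=1.0))  # -> 0.6
```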
However, prediction performance is only one aspect of judging the overall quality of a model. Another aspect is model stability, which reflects the degree to which a model (including variables selected by that model) remains the same if the training data is slightly changed. Model stability is a particular issue when working with gene expression data, where models trained on very different or even disjoint gene subsets can result in similar prediction performance regarding a given clinical endpoint, since highly correlated features can be substituted for each other [26]. Model stability should be routinely reported in addition to prediction performance.
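Model stability can be quantified, for example, as the average pairwise overlap of the feature sets selected across bootstrap resamples of the training data. A minimal sketch on synthetic data, using an L1-penalized logistic model and the Jaccard index (one of several overlap measures in use):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))                       # synthetic "expression" data
y = (X[:, 0] + X[:, 1] + rng.normal(size=100) > 0).astype(int)

def selected_features(X, y, C=0.1):
    """Indices of features kept by an L1-penalized logistic model."""
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    return set(np.flatnonzero(model.coef_[0]))

# Refit on bootstrap resamples and compare the selected feature sets.
feature_sets = []
for _ in range(20):
    idx = rng.integers(0, len(y), len(y))
    feature_sets.append(selected_features(X[idx], y[idx]))

jaccards = [len(a & b) / max(len(a | b), 1)
            for i, a in enumerate(feature_sets) for b in feature_sets[i + 1:]]
print("stability (mean pairwise Jaccard):", np.mean(jaccards))
```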
Various methods have been developed for increasing the chance of obtaining a stable model during the development phase of a stratification algorithm. For example, inclusion of prior knowledge, such as biological networks and pathways, can enhance the stability and thus reproducibility of gene expression signatures [27,28,29]. Moreover, zero-sum regression [30] can be used to build classifiers that are less dependent on the employed omics platform (e.g., a specific microarray chip) [31], thus easing external validation, translation into clinical practice, and long-term applicability of the model. We think that more frequent use of such methodology, in conjunction with careful evaluation of model stability, would lower the barrier for model transfer from discovery to external validation and finally to clinical application.
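The zero-sum idea can be illustrated numerically: if the model coefficients sum to zero, then any per-sample constant offset on log-scale measurements (e.g., a platform- or normalization-dependent shift) cancels out of the linear predictor. The check below assumes log-transformed data and random coefficients; see [30, 31] for the actual estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
log_x = rng.normal(size=(5, 10))          # log-expression for 5 samples
beta = rng.normal(size=10)
beta -= beta.mean()                        # enforce sum(beta) == 0

shift = rng.normal(size=(5, 1))            # per-sample platform/normalization offset
pred_original = log_x @ beta
pred_shifted = (log_x + shift) @ beta      # constant added to each sample's features
print(np.allclose(pred_original, pred_shifted))  # True: the offsets cancel
```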
Tools for interpreting a machine learning model
As researchers collect and analyze increasingly large sets of data, a greater number of sophisticated algorithms are employed to train predictive models. Some of these computational methods, in particular those based on deep learning techniques, are often criticized for being black boxes. Indeed, as the number of input features becomes large and the computational process more complex, understanding the reasons for obtaining a specific result is difficult, if not impossible. In many instances, however, for example in the identification of disease markers, physicians need and demand an understanding of the computational decision-making process that leads to the selection of specific markers. Using black-box models for medical decision-making is thus often considered problematic, leading to initiatives such as the 'right to an explanation' set out in Article 22 of the General Data Protection Regulation (Regulation (EU) 2016/679), adopted by the European Union in April 2016. Similarly, in the process of drug development in the pharmaceutical industry, regulatory agencies require transparency and supporting evidence of a molecular mechanism for the choice of specific biomarker panels.
While the usefulness of data-driven prediction is increasingly recognized, a key requirement for the credibility of such solutions is thus the ability to interpret them in the context of current biomedical knowledge. It is important to understand that the concept of interpretability covers a spectrum (Fig. 4). At one end of the spectrum lies a detailed understanding of the exact (biochemical) molecular and pathophysiological mechanisms that link a model with a defined clinical endpoint. This level of insight is, however, rarely achievable due to lack of knowledge.
A less detailed level of understanding is that of the total causal effect of a predictor on the clinical endpoint of interest. For example, in a randomized controlled clinical trial, any difference in outcomes between the two treatment groups is known to be caused by the treatment (since the groups are similar in all other respects due to the randomization). Thus, although one may not know exactly how the treatment affects the outcome, one knows that it does. Such statements about total causal effects are more difficult to obtain in settings outside clinical trials, where purely observational data from untreated patients are collected (e.g., cross-sectional gene expression data). Nonetheless, computational approaches have advanced significantly in this field over recent years and, under certain assumptions and conditions, allow causal effects to be estimated directly from observational data [32, 33].
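As a sketch of this kind of reasoning (using inverse probability weighting, a standard textbook estimator, rather than the specific methods of [32, 33]), the code below estimates an average treatment effect from simulated observational data; the estimate is only valid under strong, untestable assumptions such as no unmeasured confounding and positivity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, treated, outcome):
    """Inverse-probability-weighted (Hajek) estimate of the average treatment
    effect from observational data. Only valid assuming no unmeasured
    confounders, positivity, and a correct propensity model."""
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    ps = np.clip(ps, 0.01, 0.99)                 # guard against extreme weights
    w1, w0 = treated / ps, (1 - treated) / (1 - ps)
    return (np.sum(w1 * outcome) / np.sum(w1)
            - np.sum(w0 * outcome) / np.sum(w0))

# Synthetic example with a measured confounder and a true effect of 1.0:
rng = np.random.default_rng(0)
conf = rng.normal(size=2000)                     # confounder drives both
treated = (conf + rng.normal(size=2000) > 0).astype(int)
outcome = 1.0 * treated + conf + rng.normal(size=2000)
print(ipw_ate(conf.reshape(-1, 1), treated, outcome))  # close to 1.0
```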
At a lower level of interpretability, gene set and molecular network analysis methods [34, 35] can help to understand the biological sub-systems in which biomarkers selected by a machine learning algorithm are involved. There also exists a large body of literature on how to directly incorporate biological network information together with gene expression data into machine learning algorithms (see [28] for a review).
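A basic building block of such gene set analyses is the over-representation test: a hypergeometric test asks whether the markers selected by a model overlap a pathway more than expected by chance. The sketch below uses invented gene identifiers and omits the multiple-testing correction that a real analysis over many gene sets would require.

```python
from scipy.stats import hypergeom

def overrepresentation_p(selected, gene_set, universe):
    """One-sided hypergeometric test: P(X >= k) for k selected markers
    falling into the gene set, given the sizes of the selection, the set,
    and the gene universe."""
    k = len(selected & gene_set)                 # selected markers in the set
    return hypergeom.sf(k - 1, len(universe), len(gene_set), len(selected))

# Invented identifiers: 11 selected markers, 10 of which hit a 100-gene set.
universe = {f"g{i}" for i in range(20000)}
pathway = {f"g{i}" for i in range(100)}
markers = {f"g{i}" for i in range(10)} | {"g5000"}
print(overrepresentation_p(markers, pathway, universe))  # very small p-value
```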
Recently, the concept of 'disease maps' has been developed as a community tool for bridging the gap between experimental biological and computational research [36]. A disease map is a visual, computer-tractable and standardized representation of literature-derived, disease-specific cause–effect relationships between genetic variants, genes, biological processes, clinical outcomes, or other entities of interest. Disease maps can be used to visualize prior knowledge and provide a platform that could help to understand predictors in a machine learning model in the context of disease pathogenesis, disease comorbidities, and potential drug responses. A number of visual pathway editors, such as CellDesigner [37] and PathVisio [38], are used to display the content of a disease map and to offer tools for regular updating and deep annotation of knowledge repositories. In addition, dedicated tools such as MINERVA [39] and NaviCell [40] have been developed by the Disease Map community. At this point in time, disease maps are knowledge-management rather than simulation or modeling tools, although intensive efforts are underway to develop the next generation of disease maps that are useful for mathematical modeling and simulation and can become an integral part of data interpretation pipelines.
The least detailed level of understanding of a complex machine learning algorithm is provided by the analysis of the relative importance of variables with respect to model predictions. Relative variable importance can be calculated for a range of modern machine learning models (including deep learning techniques), but the level of insight depends on whether only a few of all variables have outstanding relevance and on whether these variables can be contextualized with supporting evidence from the literature. It is also not clear a priori whether such variables are only correlated with, or perhaps also causal for, the outcome of interest. Finally, inspecting the most important variables may be less informative in the case of highly collinear dependencies among predictor variables, as found, for example, in gene expression data.
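A common model-agnostic way to compute such relative importances is permutation importance: shuffle one variable at a time and measure the resulting drop in prediction score. A minimal sketch, assuming any fitted model with a predict method and any score function; note that the collinearity caveat above applies, since importance can spread across correlated variables.

```python
import numpy as np

def permutation_importance(model, X, y, score_fn, rng=None):
    """Model-agnostic variable importance: the drop in prediction score when
    a single column is shuffled. Measures association with the prediction,
    not causality; with highly collinear predictors the importance of each
    variable can be understated because correlated columns substitute for it."""
    rng = rng or np.random.default_rng(0)
    X = np.array(X, copy=True)
    baseline = score_fn(y, model.predict(X))
    drops = []
    for j in range(X.shape[1]):
        saved = X[:, j].copy()
        rng.shuffle(X[:, j])                  # destroy this variable's signal
        drops.append(baseline - score_fn(y, model.predict(X)))
        X[:, j] = saved                       # restore before the next column
    return np.asarray(drops)

# Hypothetical usage with a held-out set and any score function:
# importances = permutation_importance(model, X_val, y_val, accuracy_score)
```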
In addition to the interpretation of predictors, there is a need, from a physician's perspective, to better understand model predictions and outputs for a given patient. One obvious way might be to display patients with similar characteristics. However, the result will depend on the exact mathematical definition of similarity. Moreover, the clinical outcomes of the most similar patients will, in general, not always coincide with the predictions made by complex machine learning models, which could result in misinterpretations. The same general concern applies to approaches in which a complex machine learning model is approximated by a simpler one to enhance interpretability, for example, using a decision tree [41, 42].
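The surrogate-model idea, and its pitfall, can be made concrete: fit a shallow decision tree to the predictions of a black-box model and measure its fidelity, i.e., how often the two agree. Low fidelity signals that tree-based 'explanations' of the complex model may mislead. A sketch on synthetic data (all names and sizes invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # an interaction the tree may miss

black_box = RandomForestClassifier(n_estimators=200).fit(X, y)

# Global surrogate: train a shallow, interpretable tree on the *black box's*
# predictions, then check how faithfully it mimics them.
surrogate = DecisionTreeClassifier(max_depth=3).fit(X, black_box.predict(X))
fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
print(f"surrogate fidelity: {fidelity:.2f}")  # < 1.0 means explanations can mislead
```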
Data type-specific challenges and solutions
Real-world longitudinal data
Longitudinal EMR and claims data have received increasing interest in recent years within the field of personalized medicine [43, 44], since they provide a less biased view of patient trajectories than data from classical clinical trials, which are always subject to certain inclusion and exclusion criteria [45]. In the United States specifically, a whole industry has grown up around collecting, annotating, and mining real-world longitudinal data (https://cancerlinq.org/about, https://truvenhealth.com/). The recent US$1.9 billion acquisition of Flatiron Health by the pharma company Roche (https://www.roche.com/media/store/releases/med-cor-2018-02-15.htm) underscores the potential that industrial decision-makers see in the context of drug development, pharmacovigilance, label expansion, and post-marketing analysis [45, 46].
Longitudinal real-world data pose specific challenges for the training and validation of predictive models. In the analysis of clinical real-world databases (e.g., Clinical Practice Research Datalink; https://www.cprd.com/home/), patients for a study cohort are typically selected based on a specified index date or event, which is often difficult to define and thus leaves room for different choices. Since the maximal observation horizon in real-world databases is often limited to a certain number of years (e.g., due to budget restrictions), some patients are observed for longer than others. In particular, claims data may contain gaps (e.g., due to periods of unemployment of patients), and the exact date of a diagnosis, prescription, or medical procedure cannot be uniquely determined. It is not always clear to the treating physician which ICD diagnosis codes to choose, and this leaves room for optimization with respect to financial outcomes. In addition, EMRs require natural language preprocessing via text mining, which is a difficult and potentially error-prone procedure in itself. The development of a predictive model for personalized medicine based on real-world clinical data thus remains a non-trivial challenge.
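Two of these issues, index-date definition and unequal, right-censored follow-up, can be made concrete with a toy claims table; the patient identifiers, the diagnosis code, and the database horizon below are all invented for illustration.

```python
import pandas as pd

# Toy claims table: one row per billed event (all values invented).
claims = pd.DataFrame({
    "patient": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2015-03-01", "2016-07-15", "2018-01-10",
                            "2017-05-20", "2017-11-02"]),
    "code": ["E11", "RX123", "E11", "E11", "RX123"],
})

# One defensible (but not unique) index-date definition: the first
# occurrence of the qualifying diagnosis code.
index_date = (claims[claims["code"] == "E11"]
              .groupby("patient")["date"].min()
              .rename("index_date"))

# A fixed database horizon yields unequal, right-censored follow-up.
horizon = pd.Timestamp("2018-12-31")
followup_days = (horizon - index_date).dt.days
print(followup_days)   # patient A is observed much longer than patient B
```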
Classically, validation of a predictive model relies on an appropriate experimental design and randomization. Real-world data often limit the options available for rigorous validation. Classical strategies, such as carefully crafted cross-validation schemes, can offer reliable validation, but they may be tricky to design, and the limits of such retrospective validation must be properly understood. Another option is the use of different time windows, in which only retrospective data up to a given date are used to develop a model, which is then evaluated on the data that become available after this date. Such a setup can come close to an actual prospective evaluation, although the risk of biases is larger. A further option is to consider such analyses as only generating hypotheses, which are then followed up in a more classical fashion by setting up a carefully designed observational study that constitutes the final validation. A more speculative possibility is the adaptation of so-called A/B testing techniques, which are common in web development and software engineering [47]. This would entail randomizing patients to therapeutic options directly in the real-world environment. While such a setting is probably not feasible for drug development, it may be applicable for determining the efficacy of interventions in a real-world setting or for determining the right patient population for a given intervention.
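The time-window strategy amounts to a simple temporal split, sketched below: the model is developed only on records up to a cutoff date and evaluated only on later records, mimicking prospective use (column and variable names are placeholders).

```python
import pandas as pd

def temporal_split(df, date_col, cutoff):
    """Pseudo-prospective validation: develop on records up to `cutoff`,
    evaluate only on records after it. Biases (e.g., coding practice
    drifting over time) remain possible, as noted above."""
    cutoff = pd.Timestamp(cutoff)
    return df[df[date_col] <= cutoff], df[df[date_col] > cutoff]

# Hypothetical usage:
# train, test = temporal_split(cohort, "index_date", "2016-12-31")
# model.fit(train[features], train[target]); evaluate on `test` only.
```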
Multi-modal patient data
There is an increasing availability of multi-scale, multi-modal longitudinal patient data. Examples include the Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu/) (omics, neuro-imaging, longitudinal clinical data), the Parkinson's Progression Markers Initiative (http://www.ppmi-info.org/) (omics, neuro-imaging, longitudinal clinical data), the All-of-Us Cohort (https://allofus.nih.gov/) (omics, behavioral, EMR, environmental data), the GENIE project (http://www.aacr.org/Research/Research/Pages/aacr-project-genie.aspx#.WvqxOPmLTmE) (genomic and longitudinal real-world clinical data) and, specifically for multi-omics, the NCI's Genomic Data Commons [48]. Multi-modal data provide unique opportunities for personalized medicine because they allow different dimensions of a patient to be captured and understood. This aspect, in turn, is widely believed to be key for enhancing the prediction performance of stratification algorithms to a level that is useful for clinical practice. Accordingly, much work has gone into methods that combine data from different (omics) modalities; see [49] for a review.
A major bottleneck in current studies collecting multiple data modalities from clinical cohorts is that different studies are often performed on cohorts of different patients and with different experimental approaches across studies (see Fig. 5 for an example). As a consequence, data from different studies become difficult or even impossible to integrate into a joint machine learning model. Several strategies could reduce this problem in the future. A first strategy is to perform systematic multi-modal data assessment of each individual in a clinically rigorously characterized cohort, including longitudinal clinical and omics follow-up. In the more classical clinical setting, the success of the Framingham Heart Study (https://www.framinghamheartstudy.org/) comes to mind, a long-term study of risk factors for cardiovascular disease that has been running since 1948. While, in the future, we will analyze larger and larger volumes of real-world data, we should be aware of the limitations of such data (interoperability of data from different sources, non-systematically collected data, measurement quality, inconsistencies and errors, etc.). Rigorous multi-modal observational studies are essential for establishing reliable baselines for the development of real-world models. Ideally, multi-modal data would be collected longitudinally at regular intervals for all subjects. While this has been achieved for individual studies [50], for practical and economic reasons it is likely to be limited to a small number of cohorts. A second approach is to have some overlap among patients across different cohorts; statistical and machine learning methods can then be used to 'tie' the different datasets together, as sketched below. A third approach is to collect a joint modality (such as standardized clinical data or biomarkers) across different studies; this joint modality again makes it possible to tie different datasets together. It must be stressed that this problem of disconnected cohorts is currently a major obstacle for leveraging multi-omics data.
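The second strategy, using overlapping patients to tie datasets together, can be sketched as follows: learn a mapping from one modality to another on the shared patients, then impute the missing modality for the remaining patients. The simulation below is purely illustrative (a ridge-regression bridge on synthetic data); real applications would require far more careful handling of noise and batch effects.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Cohort 1 measured modality A, cohort 2 modality B; the first 30 patients
# (the overlap) have both. All sizes and names are invented.
A = rng.normal(size=(130, 40))                         # modality A features
B_shared = A[:30] @ rng.normal(size=(40, 15)) \
           + 0.1 * rng.normal(size=(30, 15))           # modality B, overlap only

# 'Tie' the datasets: learn A -> B on the overlapping patients, then
# predict modality B for A-only patients so both cohorts share features.
bridge = Ridge(alpha=1.0).fit(A[:30], B_shared)
B_imputed = bridge.predict(A[30:])
print(B_imputed.shape)   # (100, 15): pseudo-B for the A-only patients
```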
It should be emphasized that, ideally, multi-modal, multi-omics data should be considered in conjunction with longitudinal clinical data. Despite the examples mentioned above (Alzheimer's Disease Neuroimaging Initiative, Parkinson's Progression Markers Initiative, All-of-Us Cohort), we are only just beginning to perform such studies more systematically. The combination of multi-omics with real-world longitudinal data from clinical practice (e.g., EMRs) and mobile health applications holds further potential for personalized medicine in the future. The GENIE project is an important step in this direction.
Translating stratification algorithms into clinical practice
The ability to accelerate innovation in patient treatment is linked to our ability to translate increasingly complex and multi-modal stratification algorithms from discovery to validation. Stratification in clinical application means assigning treatment specifications to a particular patient, which may include type, dosage, time point, access to the treatment, and other pharmacological aspects. The validation of such algorithms is usually performed via internal validation (cross-validation), external validation (using a separate patient cohort), and prospective clinical trials compared against the standard of care [10] (http://www.agendia.com/healthcare-professionals/the-mindact-trial/). Proper validation constitutes a requirement for translating these methods to settings in which they can generate impact on patient outcomes. In addition to classical healthcare providers, such as hospitals and general practitioners, mobile health applications and wearable sensors might play an increasing role in the future. As described earlier, integrating multi-modal data is key for gaining new insights and lies at the heart of stratifying patients for diagnostic, predictive, or prognostic purposes. However, considerable barriers exist regarding the integration of similar data from different cohorts and the normalization of data across measurement platforms, and the ability to process very large volumes of data in appropriate systems close to or within the clinical infrastructure remains limited. Strictly controlled cloud services, which appropriately protect patient data, could be an approach to alleviating this limitation [51]. At this point, it might be possible to learn from organizations that already handle large-scale real-world clinical data (mostly in the US), although their approaches may have to be adapted to the legal environment of each specific country.
At present, the translation of algorithms for patient stratification into clinical practice is also difficult due to regulatory aspects. The prospective clinical trials required for approval of diagnostic tools by regulatory agencies are very costly, and finding sponsors is challenging. One possibility for lowering the associated barriers might be a stepwise approach, with initial pilot studies to exemplify the value that can be gained for patients, healthcare sustainability, translational science, and economic efficiency. Such projects would need to showcase the principal value of patient stratification. Moreover, they could provide meaningful insights into disease biology (via biomarkers). These outcomes should ideally be measured longitudinally after machine learning-based stratification, thereby providing a feedback loop that helps to improve the stratification algorithm.
A commonly stated myth is that health innovation is based on a build-and-freeze paradigm (https://www.theatlantic.com/technology/archive/2017/10/algorithms-future-of-health-care/543825/), meaning that software is built, frozen, and then tested in unchanged form for its lifetime. However, the development of better stratification algorithms will require a more seamless updating scheme. There have been interesting developments in recent years in terms of regulation and risk management for continuous learning systems. An example of such a development is the Digital Health Software Precertification (Pre-Cert) Program (https://www.fda.gov/MedicalDevices/DigitalHealth/DigitalHealthPreCertProgram/Default.htm) launched recently by the FDA. Pre-Cert aims to learn and adapt its key elements based on the effectiveness of the program. In addition, Clinical Laboratory Improvement Amendments (CLIA; https://www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/IVDRegulatoryAssistance/ucm124105.htm) labs provide a template for how health-related software tools developed to inform precision medicine can be validated in a clear and transparent manner as the tool is continually updated. CLIA labs are certified labs that go through a process of regular certification monitored by the FDA and other regulatory agencies in the US. These labs are required to follow approved and documented Standard Operating Procedures. They can use medical devices, which can include software for diagnostics, provided that they employ such Standard Operating Procedures and waive the certification process (https://wwwn.cdc.gov/clia/Resources/WaivedTests/default.aspx). Most importantly, the developer of the tool can update the software. The CLIA labs are independent in deciding whether to re-validate the software and can adopt the strategy that best serves the technological pace of the software and their clinical needs with respect to increased capabilities or better performance. For instance, a lab may decide to validate only major version releases, such as going from version 1.x to 2.0, and adopt minor version releases on the fly.
The vision of precision medicine is to provide the right intervention to the right patient, at the right time and dose. The described approaches, based on iterative feedback between developers and clinical end users, could increase our ability to adapt stratification algorithms to new insights into disease biology, access to new molecular data, and changes in clinical settings. This has been a challenge, with promising predictive models often failing validation in independent studies. Real-world longitudinal data from clinical practice and data collected through wearables or other means of participatory data collection can not only widen the spectrum of possible data sources for building new stratification algorithms [52, 53], but they may also be partially included in clinical trials for the validation of stratification algorithms.