Fig. 5 | BMC Medicine

Fig. 5

From: Dense phenotyping from electronic health records enables machine learning-based prediction of preterm birth

Machine learning-based clustering of deliveries identifies subgroups with distinct PTB prevalence, clinical features, and prediction accuracy. (A) For the model predicting preterm birth at 28 weeks of gestation using billing codes (ICD-9 and CPT, Fig. 4A), we assigned deliveries from the held-out test set (n = 2246) to one of six clusters (colors) using density-based clustering (HDBSCAN) on the SHAP feature importance matrix. For visualization of the clusters, we used UMAP to embed the deliveries into a low-dimensional space based on the matrix of feature importance values. The inset pie chart displays the count of individuals in each cluster. (B) The preterm birth prevalence (color bar) in each cluster. The algorithm discovered four clusters with high PTB prevalence (enclosed by a dashed line). (C) Precision and (D) recall for preterm birth classification within each cluster. (E) The enrichment (odds ratios, color bar in log10 scale) of race as derived from EHRs for each cluster (Additional file 1: Table S1). (F) The enrichment (log10 odds ratio) of relevant clinical risk factors in each cluster (Additional file 1: Table S2). Risk factors include age at delivery (> 34 or < 18 years old), pre-pregnancy BMI (prepreg BMI), pre-pregnancy hypertension (prepreg hypertension), gestational hypertension (gest hypertension), and fetal abnormalities. We report the total number of women in the delivery cohort at high risk for each clinical risk factor (n). Enrichments for additional risk factors are given in Additional file 1: Fig. S7.
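
The per-cluster quantities in panels B to F (prevalence, precision, recall, and log10 odds ratios) can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the input lists, and the optional 0.5 continuity correction are all assumptions made for illustration, and the cluster labels are taken as given (for example, the output of a density-based clustering step).

```python
import math
from collections import defaultdict


def cluster_metrics(clusters, y_true, y_pred):
    """Per-cluster preterm birth prevalence, precision, and recall.

    All three arguments are hypothetical, equal-length sequences:
    clusters: cluster label assigned to each delivery
    y_true:   1 if the delivery was preterm, else 0
    y_pred:   1 if the model predicted preterm, else 0
    """
    counts = defaultdict(lambda: {"n": 0, "pos": 0, "tp": 0, "fp": 0})
    for label, truth, pred in zip(clusters, y_true, y_pred):
        stats = counts[label]
        stats["n"] += 1
        stats["pos"] += truth
        if pred and truth:
            stats["tp"] += 1
        elif pred:
            stats["fp"] += 1

    result = {}
    for label, s in counts.items():
        predicted_pos = s["tp"] + s["fp"]
        result[label] = {
            "prevalence": s["pos"] / s["n"],
            # Undefined ratios are reported as NaN rather than 0.
            "precision": s["tp"] / predicted_pos if predicted_pos else float("nan"),
            "recall": s["tp"] / s["pos"] if s["pos"] else float("nan"),
        }
    return result


def log10_odds_ratio(clusters, has_factor, target, correction=0.0):
    """log10 odds ratio of a binary factor inside vs. outside one cluster.

    `correction` is added to every cell of the 2x2 table; 0.5 gives the
    Haldane-Anscombe correction (an assumption, not taken from the caption).
    With correction=0, an empty cell raises an error.
    """
    a = b = c = d = 0
    for label, factor in zip(clusters, has_factor):
        if label == target:
            if factor:
                a += 1  # in cluster, has factor
            else:
                b += 1  # in cluster, lacks factor
        else:
            if factor:
                c += 1  # outside cluster, has factor
            else:
                d += 1  # outside cluster, lacks factor
    a, b, c, d = (x + correction for x in (a, b, c, d))
    return math.log10((a * d) / (b * c))
```

A positive value from `log10_odds_ratio` means the factor is over-represented in the target cluster relative to all other deliveries, and a negative value means it is under-represented, which matches how a diverging color bar on a log10 scale would be read.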
