Study population and design
The study population was drawn from the membership of Kaiser Permanente Northern California (KPNC), an integrated health care delivery system serving 4.5 million members. The KPNC membership accounts for approximately 30% of the underlying population and is socio-demographically representative of the population residing in the geographic areas served [11, 12]. The integrated information system permits quantifying predictors and outcomes across the continuum of pregnancy. Individuals with GDM are identified through the KPNC Pregnancy Glucose Tolerance and GDM Registry, an active surveillance registry that retrieves laboratory data to determine GDM screening and diagnosis and automatically excludes preexisting type 1 or type 2 diabetes. Specifically, pregnant individuals at KPNC receive universal screening (98% screened) for GDM with the 50-g, 1-h glucose challenge test (GCT) at 24–28 weeks’ gestation [1]. If the screening test is abnormal, a diagnostic 100-g, 3-h oral glucose tolerance test (OGTT) is performed after an 8–12-h fast. GDM is ascertained by meeting either of the following criteria: (1) ≥ 2 OGTT plasma glucose values meeting or exceeding the Carpenter-Coustan thresholds (fasting 95 mg/dL, 1-h 180 mg/dL, 2-h 155 mg/dL, and 3-h 140 mg/dL); or (2) a 1-h GCT ≥ 180 mg/dL and a fasting glucose ≥ 95 mg/dL measured alone or during the OGTT [13, 14]. Plasma glucose was measured using the hexokinase method at the KPNC regional laboratory, which participates in the College of American Pathologists’ accreditation and monitoring program [15]. This data-only project was approved by the KPNC Institutional Review Board, which waived the requirement for informed consent from participants.
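To make the two-part ascertainment rule concrete, the following is a minimal illustrative sketch in Python; the function and variable names are ours, not the registry's actual implementation.

```python
# Illustrative sketch of the GDM ascertainment rule described above;
# names are ours, not the registry's. All values are plasma glucose in mg/dL.
CARPENTER_COUSTAN = {"fasting": 95, "1h": 180, "2h": 155, "3h": 140}

def meets_gdm_criteria(gct_1h, ogtt):
    """gct_1h: 50-g, 1-h GCT result; ogtt: dict of 100-g, 3-h OGTT values
    keyed by 'fasting', '1h', '2h', '3h' (missing measurements omitted)."""
    # Criterion 1: >= 2 OGTT values meeting or exceeding the thresholds
    n_abnormal = sum(ogtt[k] >= v for k, v in CARPENTER_COUSTAN.items() if k in ogtt)
    if n_abnormal >= 2:
        return True
    # Criterion 2: 1-h GCT >= 180 mg/dL plus fasting glucose >= 95 mg/dL
    return gct_1h >= 180 and ogtt.get("fasting", float("-inf")) >= 95

# Example: abnormal 1-h and 2-h OGTT values satisfy criterion 1
print(meets_gdm_criteria(155, {"fasting": 90, "1h": 185, "2h": 160, "3h": 120}))  # True
```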
Among 405,557 pregnancies with a gestational age at delivery ≥ 24 weeks’ gestation delivered at 21 KPNC hospitals from January 1, 2007, to December 31, 2017, we excluded 375,041 (92.5%) pregnancies without GDM. Among the remaining 30,516 GDM pregnancies, we further excluded those with GDM diagnosed before universal GDM screening (n = 42), yielding an analytic sample of 30,474 GDM-complicated pregnancies. We then split this sample into a discovery set of 27,240 GDM-complicated pregnancies from 2007 to 2016 and a temporal/future validation set of 3,234 GDM-complicated pregnancies from 2017 (Fig. 1).
Outcome ascertainment
Individuals diagnosed with GDM received universal referral to the KPNC Regional Perinatal Service Center for a supplemental care program beyond their standard prenatal care. MNT was the first-line therapy. If glycemic control targets were not achieved with MNT alone, pharmacologic treatment was initiated. After counseling on the risks and benefits of oral antidiabetic agents versus insulin, pharmacologic treatment was chosen via a patient-physician shared decision-making model: (1) oral antidiabetic agents such as glyburide or metformin were added to MNT and, if optimal glycemic control was still not achieved, escalated to insulin therapy; or (2) insulin therapy was initiated directly after MNT (an additional table shows this in more detail [see Additional file 1]). We searched the pharmacy information management database for prescriptions of oral agents (97.9% glyburide; the remainder metformin or other) and insulin after GDM diagnosis. Treatment modality was grouped as MNT only versus pharmacologic treatment (oral agents and/or insulin) beyond MNT, as sketched below. Notably, despite the overall large sample size, we combined oral agents (32.6% of the entire population) and insulin (6.2%) into a single pharmacologic treatment category because there was insufficient power to predict insulin use separately as an outcome.
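A hypothetical sketch of this outcome derivation follows; the drug names reflect the paper, but the function and field names are illustrative, not KPNC's actual schema.

```python
# Hypothetical sketch of deriving treatment modality from prescription
# records; variable names are illustrative, not KPNC's schema.
def treatment_modality(prescriptions_after_dx):
    """prescriptions_after_dx: iterable of drug names dispensed after
    GDM diagnosis, e.g. ["glyburide"] or []."""
    pharmacologic = {"glyburide", "metformin", "insulin"}
    if any(drug in pharmacologic for drug in prescriptions_after_dx):
        return "pharmacologic treatment beyond MNT"
    return "MNT only"

print(treatment_modality(["glyburide"]))  # pharmacologic treatment beyond MNT
print(treatment_modality([]))             # MNT only
```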
Candidate predictors
Based on risk factors associated with GDM treatment modality and input from clinicians, we selected 176 (64 continuous and 112 categorical) sociodemographic, behavioral, and clinical candidate predictors from electronic health records for model development. Candidate predictors were divided into four levels according to their availability at different stages of pregnancy (an additional table shows this in more detail [see Additional file 2]; a sketch of the level structure follows this paragraph): level 1 predictors (n = 68) were available at the initiation of pregnancy, dating back to 1 year before the index pregnancy; level 2 predictors (n = 26) were measured from the last menstrual period to before GDM diagnosis; level 3 predictors (n = 12) were available at the time of GDM diagnosis; and level 4 predictors (n = 70) comprised self-monitoring of blood glucose (SMBG) levels, the primary measure of glycemic control during pregnancy recommended by the American Diabetes Association [5], measured during the first week after GDM diagnosis. All predictors, levels 1–4, were measured before the outcome of interest (i.e., the final line of GDM treatment). Pregnant individuals with GDM in our study population had, on average, 11.8 weeks (standard deviation 6.6 weeks) of SMBG measurements between GDM diagnosis and delivery. We restricted SMBG data to the first week after GDM diagnosis to allow earlier prediction, since the optimal treatment was offered, on average, 5.6 weeks after GDM diagnosis. Of note, individuals with GDM were universally offered enrollment in a supplemental GDM care program managed by nurses and dietitians via telemedicine from the KPNC Regional Perinatal Service Center [16]. All individuals with GDM were instructed to self-monitor and record glucose measurements four times per day: fasting before breakfast and 1 h after the start of each meal. SMBG measurements were then reported to the nurses or registered dietitians during weekly telephone counseling calls from enrollment until delivery, and the data were recorded in the Patient Reported Capillary Glucose Clinical Database.
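The sketch below illustrates how the four levels accumulate into the nested predictor sets used for modeling; the variable names are placeholders for the 176 predictors listed in Additional file 2, not the actual variables.

```python
# Sketch of assembling cumulative predictor sets by availability level;
# `predictors_by_level` uses placeholder names, not the real variables.
predictors_by_level = {
    1: [f"baseline_{i}" for i in range(68)],    # initiation of pregnancy / prior year
    2: [f"antepartum_{i}" for i in range(26)],  # LMP to before GDM diagnosis
    3: [f"diagnosis_{i}" for i in range(12)],   # at GDM diagnosis
    4: [f"smbg_{i}" for i in range(70)],        # first week of SMBG after diagnosis
}

feature_sets, cumulative = {}, []
for level in (1, 2, 3, 4):
    cumulative = cumulative + predictors_by_level[level]
    feature_sets[f"levels 1-{level}"] = list(cumulative)

print({k: len(v) for k, v in feature_sets.items()})
# {'levels 1-1': 68, 'levels 1-2': 94, 'levels 1-3': 106, 'levels 1-4': 176}
```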
Statistical analysis
Preprocessing
We imputed missing values with the random forest algorithm, since it does not require parametric model assumptions, whose violation can reduce the efficiency of the predictions (an additional table shows this in more detail [see Additional file 2]). We evaluated the estimated true imputation error using the normalized root mean squared error (NRMSE) for continuous variables and the proportion of falsely classified entries (PFC) for categorical variables. Both values were close to 0, indicating good imputation performance (an additional table shows this in more detail [see Additional file 3]); a sketch of this workflow follows this paragraph. After preprocessing, we employed t-tests and Pearson’s chi-squared tests to compare participant characteristics between the discovery and temporal/future validation sets. We conducted the Mann–Kendall test to examine secular trends in GDM treatment modality across calendar years. The discovery set (2007–2016) was stratified by calendar year and treatment modality for tenfold cross-validation. The temporal/future validation set (2017) was stratified by treatment modality for computing cross-validated prediction performance.
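The paper does not name its imputation software (missForest in R is a common choice for this approach); the following is a rough scikit-learn analog on simulated continuous data, not the authors' code.

```python
# Rough Python analog (scikit-learn) of random-forest imputation and the
# NRMSE check described above; a sketch on toy data, not the authors' code.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 5))   # toy continuous predictors
X = X_true.copy()
mask = rng.random(X.shape) < 0.1     # 10% missing completely at random
X[mask] = np.nan

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0,
)
X_imp = imputer.fit_transform(X)

# NRMSE on the masked entries (values near 0 indicate good imputation)
nrmse = np.sqrt(np.mean((X_true[mask] - X_imp[mask]) ** 2) / np.var(X_true[mask]))
print(round(nrmse, 3))
```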
Variable selection and full model development and comparison
We performed prediction with classification and regression trees (CART), least absolute shrinkage and selection operator (LASSO) regression, and the super learner (SL), each fitted with level 1, levels 1–2, levels 1–3, and levels 1–4 predictors. CART and LASSO regression were chosen as simple prediction methods against which to compare the SL. The SL specifies a set of candidate machine learning algorithms, namely the library, and combines their predictions through meta-learning via cross-validation [17]; a minimal sketch follows this paragraph. The SL has the asymptotic property of performing at least as well (in risk, defined here by the negative log-likelihood) as the best-fitting algorithm in the library [17]. Although the variables included in the final SL ensemble cannot easily be interpreted for their individual contributions, the SL can be used for optimal prediction performance and to benchmark simpler, less adaptive approaches [17].
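The paper does not specify its SL software (the SuperLearner framework in R is typical); the following is a rough scikit-learn stacking analog of the ensemble idea, shown with the simple library, on toy data.

```python
# Minimal stacking sketch of the super learner idea (scikit-learn analog,
# not the authors' implementation), using the "simple library" above.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Simple library: response mean, LASSO, and CART, combined by a
# meta-learner fitted on tenfold cross-validated predictions
library = [
    ("mean", DummyClassifier(strategy="prior")),
    ("lasso", LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, max_iter=5000)),
    ("cart", DecisionTreeClassifier(max_depth=6, criterion="gini")),
]
sl = StackingClassifier(estimators=library,
                        final_estimator=LogisticRegression(),
                        stack_method="predict_proba", cv=10)
sl.fit(X, y)
print(sl.predict_proba(X[:3]))
```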
We tuned the prediction methods as follows. In CART, the Gini index measured the impurity of each node with respect to the outcome, and a maximum depth of 6 was set as the stopping criterion. To account for potential error in estimating the risk curve, the regularization parameter in LASSO regression was selected as the value whose cross-validated error was within one standard error of the minimum [18]. For the SL, we considered a simple and a complex library for comparison. The simple library included the response mean, LASSO regression, and CART; the complex library additionally included random forest and extreme gradient boosting (XGBoost). Multiple XGBoost configurations were considered, crossing 10, 20, and 50 trees; maximum depths of 1 to 6; and shrinkage (regularization) values of 0.001, 0.01, and 0.1, for 54 configurations in total (enumerated in the sketch below).
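Concretely, the XGBoost grid above corresponds to 3 × 6 × 3 = 54 learners; a sketch of the enumeration, assuming the xgboost Python package (the paper does not name its software), is:

```python
# Enumerating the XGBoost grid described above: 3 x 6 x 3 = 54 configurations.
# Assumes the xgboost Python package; a sketch, not the authors' code.
from itertools import product
from xgboost import XGBClassifier

xgb_library = [
    XGBClassifier(n_estimators=n_trees, max_depth=depth, learning_rate=shrinkage)
    for n_trees, depth, shrinkage in product(
        (10, 20, 50), range(1, 7), (0.001, 0.01, 0.1)
    )
]
print(len(xgb_library))  # 54
```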
For the models using predictors at each level, prediction results were evaluated using tenfold cross-validated receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) statistics in the discovery and temporal/future validation sets. We used DeLong’s test to compare AUCs between different prediction algorithms at the same predictor level and within the same prediction algorithm across levels [19]. We used permutation-based variable importance, with 5 permutations per variable, to obtain the top 10 most important features: permuting one variable at a time, the method assigns importance as the difference in AUC before and after permutation [20]. The model with the highest AUC in the validation set was selected as the final full model.
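A minimal sketch of AUC-based permutation importance, using scikit-learn on toy data (a rough analog, not the authors' code), follows.

```python
# Sketch of permutation-based variable importance on the AUC scale
# (scikit-learn analog with toy data; not the authors' code).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# 5 permutations per variable; importance = drop in AUC after permutation
res = permutation_importance(model, X_val, y_val,
                             scoring="roc_auc", n_repeats=5, random_state=0)
top10 = res.importances_mean.argsort()[::-1][:10]
print(top10)
```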
Development of simpler models
To improve interpretability and potential clinical uptake, we used tenfold cross-validated logistic regression to develop simpler models in the discovery set based on a minimal set of the most important features at each level, as opposed to the full set of features used in the complex SL. We additionally selected interaction term(s), considering all cross-products, through stepwise forward and backward selection based on the Akaike information criterion. We evaluated the simplicity and predictive performance (cross-validated AUCs) of these simpler models on the validation set. Further, calibration of the uncalibrated model was examined via the integrated calibration index, which weights miscalibration by the distribution of predicted probabilities, coupled with a calibration plot. Isotonic regression was implemented for recalibration in the event of observed over- or under-prediction (a sketch follows this paragraph).
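The following sketch illustrates one common way to compute the integrated calibration index (as the mean absolute difference between predicted and smoothed observed probabilities) and to recalibrate with isotonic regression; it assumes scikit-learn and statsmodels on toy data and is not the authors' code.

```python
# Sketch of the integrated calibration index (ICI) and isotonic
# recalibration; a rough analog on toy data, not the authors' code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from statsmodels.nonparametric.smoothers_lowess import lowess

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_val)[:, 1]

# ICI: mean |predicted - smoothed observed| across the predicted probabilities
smoothed = lowess(y_val, p, return_sorted=False)
print("ICI:", np.mean(np.abs(p - smoothed)))

# Isotonic recalibration in the event of over- or under-prediction
iso = IsotonicRegression(out_of_bounds="clip").fit(p, y_val)
p_recal = iso.predict(p)
```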