The primary goals of the use of CAIADS
One of the primary goals of CAIADS was to grade colposcopic impressions in accordance with the latest ASCCP colposcopy terminology: normal/benign, low-grade, high-grade, and cancer [16, 17]. CAIADS was also expected to dichotomize colposcopic impressions at two hypothetical biopsy thresholds (low-grade or worse versus normal/benign, and high-grade or worse versus a less severe impression). These categories were used to identify an appropriate threshold for colposcopically guided biopsy and to guide biopsies for detecting the clinically relevant endpoint (pathology-confirmed HSIL+).
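The two hypothetical biopsy thresholds can be illustrated with a minimal sketch. It assumes only that the four impression grades are ordinal in the order listed above; the function name `biopsy_indicated` and the rank table are hypothetical and are not part of CAIADS.

```python
# Ordinal colposcopic impression grades, from least to most severe.
GRADES = ["normal/benign", "low-grade", "high-grade", "cancer"]
RANK = {grade: i for i, grade in enumerate(GRADES)}


def biopsy_indicated(impression: str, threshold: str) -> bool:
    """Return True if the impression is at or above the biopsy threshold."""
    return RANK[impression] >= RANK[threshold]


impressions = ["normal/benign", "low-grade", "high-grade", "cancer"]
# Threshold 1: low-grade or worse versus normal/benign.
low_or_worse = [biopsy_indicated(g, "low-grade") for g in impressions]
# Threshold 2: high-grade or worse versus a less severe impression.
high_or_worse = [biopsy_indicated(g, "high-grade") for g in impressions]
print(low_or_worse)   # [False, True, True, True]
print(high_or_worse)  # [False, False, True, True]
```

Lowering the threshold from high-grade to low-grade moves only the low-grade impressions into the biopsy group, which is the trade-off the two thresholds are meant to expose.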
Study patients and design
Between January 12, 2018, and December 30, 2018, anonymized digital records of patients, including colposcopic images, non-image information (cytology and HPV status), and pathological results, were retrospectively obtained from the archived databases of six hospitals across China (Additional file 1, Table S1), including Shenzhen Maternity and Child Healthcare Hospital (SZMCHH). The pathological results were the gold standard for developing CAIADS and validating its diagnostic performance. The study was approved by the institutional review board (IRB) of SZMCHH. The IRB of SZMCHH waived the need for informed consent because the archived datasets were retrospective and all personal information was fully anonymized.
All patients aged 24–65 years with indications for colposcopy underwent colposcopic imaging and biopsy, and those with pathological confirmation were eligible for our study. We excluded patients who lacked definitive pathological results. Pathological results were classified according to the WHO classification system: normal/benign, low-grade squamous intraepithelial lesion (LSIL), HSIL, and cancer. All pathology slides of punch biopsies were reviewed by pathologists from SZMCHH. Any disagreement was resolved by a panel of expert pathologists.
The digital records of each patient were split into two categories: (1) records containing at least five satisfactory colposcopic images, typically captured at sequential time points (around 0 s, 60 s, 90 s, 120 s, and 150 s), and (2) records containing non-image information (cytology and HPV status) and quality control information assessed by trained evaluators, for which the exclusion criteria are shown in Fig. 1. Sample images in JPEG format are shown in Additional file 1, Figure S1. The complete data that passed quality control were randomly sampled, stratified by the severity distribution of pathological results, and assigned to a training set and a tuning set for developing CAIADS and to a validation set for evaluating performance, in a ratio of 7:1:2. The three datasets were obtained by random sampling according to patient IDs, so that patients in the validation set were not used in the training phase.
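A patient-level split stratified by pathology category can be sketched as follows. This is an illustration under stated assumptions, not the study's actual procedure: the function `stratified_patient_split`, the random seed, the toy cohort, and the integer rounding rule are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_patient_split(patient_labels, ratios=(7, 1, 2), seed=0):
    """Split patient IDs into training/tuning/validation sets per pathology stratum.

    patient_labels: dict mapping patient ID -> pathology category.
    ratios: integer weights for (training, tuning, validation).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for pid, label in sorted(patient_labels.items()):
        by_label[label].append(pid)

    total = sum(ratios)
    splits = {"training": [], "tuning": [], "validation": []}
    for label in sorted(by_label):
        pids = by_label[label]
        rng.shuffle(pids)
        n_train = len(pids) * ratios[0] // total
        n_tune = len(pids) * ratios[1] // total
        splits["training"] += pids[:n_train]
        splits["tuning"] += pids[n_train:n_train + n_tune]
        # The remainder of each stratum goes to the validation set.
        splits["validation"] += pids[n_train + n_tune:]
    return splits


# Hypothetical toy cohort: 80 patients, 20 per pathology category.
categories = ["normal/benign", "LSIL", "HSIL", "cancer"]
cohort = {f"P{i:03d}": categories[i % 4] for i in range(80)}
splits = stratified_patient_split(cohort)
print({name: len(ids) for name, ids in splits.items()})
# {'training': 56, 'tuning': 8, 'validation': 16}
```

Because assignment is made on patient IDs rather than on individual images, all images of one patient fall into the same set, which is what keeps validation patients out of training.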
For the training set, all selected images were automatically uploaded to an online cervix image annotation tool. These images were analyzed by a group of eight experienced colposcopists from SZMCHH. They manually delineated the lesion areas and biopsy sites near the squamocolumnar junction of the cervical regions, labeling each according to the pathological result of the corresponding biopsy site. The pathological results were the gold standard. These analyses were supervised by expert colposcopists from the National Cancer Center. The details of annotation and the annotation tool are shown in Additional file 1, Figure S2. For the tuning and validation sets, the images had no manual annotations. For all datasets, we made no changes to non-image information.
Development of the CAIADS algorithm
Because colposcopists analyze both images and non-image information (cytology and HPV status) during colposcopic examinations, we developed CAIADS to simulate the diagnostic judgment of colposcopists as accurately as possible. The CAIADS algorithm consists of two deep-learning-based modules for grading colposcopic impressions and guiding biopsies, respectively. A detailed description of the CAIADS algorithm is presented in Additional file 1, Supplementary Method and Figure S3 [18,19,20]. Briefly, CAIADS first detected the cervical area of the images for subsequent feature extraction. Then, the extracted features were fused by a graphical convolutional network. Finally, the non-image information was concatenated with the fused image features to yield the graded impression. CAIADS also predicted the suspected lesion areas to limit the range for guiding biopsy sites.
The pipeline for colposcopic grading consisted of cervix detection, feature extraction, and feature fusion networks, whereas a U-Net [21] and a YOLO [22] were implemented for lesion area segmentation and biopsy site guiding, respectively. Because an accurate lesion area segmentation can effectively reduce the number of unnecessary biopsy sites that fall outside regions containing lesions, we implemented a semi-supervised framework, as shown in Additional file 1, Figure S4. The purpose of this framework was to utilize the tuning set (only with the image-level label) to further boost the segmentation performance of CAIADS. The semi-supervised framework developed on the training set was used to produce pseudo-labels for the tuning set. Then, the tuning set with pseudo-labels was mixed with the training set to fine-tune the U-Net. A subset was separated from the training set to monitor the performance of deep-learning networks during training and to prevent overfitting. Training of the system was halted if no performance increase was observed on the separated subset.
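The stopping rule described above can be sketched as a generic early-stopping loop. The source does not state how long training waited for an improvement, so the `patience` parameter, the function name, and the stand-in callbacks are assumptions made for illustration.

```python
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=100, patience=5):
    """Stop when the held-out score has not improved for `patience` epochs.

    train_one_epoch: callable that runs one pass over the training data.
    evaluate: callable returning a score on the held-out subset (higher is better).
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_one_epoch()
        score = evaluate()
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score


# Hypothetical stand-ins for a real training loop: scores rise, then plateau.
scores = iter([0.60, 0.70, 0.75, 0.74, 0.75, 0.73, 0.72, 0.71, 0.70, 0.69])
best_epoch, best_score = train_with_early_stopping(
    train_one_epoch=lambda: None,
    evaluate=lambda: next(scores),
    max_epochs=10,
    patience=3,
)
print(best_epoch, best_score)  # 2 0.75
```

In practice the model weights from `best_epoch` would be kept, so the held-out subset serves both to stop training and to select the final checkpoint.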
Validation of the CAIADS performance
We compared the agreement of the CAIADS colposcopic impressions and of the original colposcopic interpretation with pathology, which served as the gold standard. The original colposcopic interpretation was determined and recorded by colposcopists based on their assessment of the patient’s images and non-image information. In addition, the diagnostic performance of CAIADS at different hypothetical biopsy thresholds (low-grade or worse and high-grade or worse) for the detection of pathological HSIL+ was evaluated from three aspects. Firstly, we investigated whether the diagnostic performance of CAIADS could be improved by additional non-image information, compared with grading the images alone. Secondly, we compared the performance of CAIADS at the biopsy threshold of low-grade impression or worse versus high-grade impression or worse. Thirdly, the performance of CAIADS was compared with the original colposcopic interpretation by colposcopists. Finally, we tested the accuracy of CAIADS in predicting biopsy sites compared with ground truth biopsy sites.
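The threshold-based evaluation against pathological HSIL+ can be sketched as below. The toy impressions and pathology results are invented for illustration only, and `diagnostic_metrics` and the rank table are hypothetical names rather than part of the study's code.

```python
# Assumed ordinal ranks for colposcopic impressions (labels from the text).
GRADE_RANK = {"normal/benign": 0, "low-grade": 1, "high-grade": 2, "cancer": 3}
# Pathology categories counted as the HSIL+ endpoint.
HSIL_PLUS = {"HSIL", "cancer"}


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(impressions, pathologies, threshold):
    """Diagnostic metrics for detecting HSIL+ at a given impression threshold."""
    tp = fp = fn = tn = 0
    for impression, pathology in zip(impressions, pathologies):
        positive = GRADE_RANK[impression] >= GRADE_RANK[threshold]
        diseased = pathology in HSIL_PLUS
        if positive and diseased:
            tp += 1
        elif positive:
            fp += 1
        elif diseased:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
    }


# Invented toy data: eight patients with an impression and a pathology result.
impressions = ["normal/benign", "low-grade", "low-grade", "high-grade",
               "high-grade", "cancer", "normal/benign", "low-grade"]
pathologies = ["normal/benign", "LSIL", "HSIL", "HSIL",
               "LSIL", "cancer", "normal/benign", "normal/benign"]

low = diagnostic_metrics(impressions, pathologies, "low-grade")
high = diagnostic_metrics(impressions, pathologies, "high-grade")
print(low["sensitivity"], low["specificity"])    # 1.0 0.4
print(high["sensitivity"], high["specificity"])  # 0.6666666666666666 0.8
```

On this toy data the lower threshold catches every HSIL+ case at the cost of specificity, while the higher threshold reverses the balance.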
Statistical analyses
The receiver operating characteristic (ROC) curve was created by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity), and we calculated the area under the curve (AUC). The AUC, accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were reported with 95% confidence intervals (CIs) calculated by the Clopper-Pearson method. We defined the main metric as agreement with the pathological gold standard, measured using kappa values. The McNemar test was used to evaluate differences in diagnostic performance, including agreement, accuracy, sensitivity, and specificity. A two-sided p value of less than 0.05 was considered statistically significant. Statistical analyses were done using SAS 9.4 software (SAS Institute Inc., Cary, NC, USA), Python 3.6, and scikit-learn [23].
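For readers who want to reproduce the statistics on their own data, the following standard-library sketch shows a Clopper-Pearson interval, Cohen's kappa, and a McNemar test. It rests on several assumptions: the text does not say whether kappa was weighted or whether the McNemar test was exact, so the unweighted kappa and the exact binomial form are used here. The function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(f, target, increasing):
    """Bisection on [0, 1] for a monotone f such that f(p) == target."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) CI for a proportion with k successes in n trials."""
    lower = 0.0 if k == 0 else _solve(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, increasing=True)
    upper = 1.0 if k == n else _solve(
        lambda p: binom_cdf(k, n, p), alpha / 2, increasing=False)
    return lower, upper


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two equally long label sequences."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) * 0.5**n
    return min(1.0, 2 * tail)


print(clopper_pearson(8, 10))
print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.5
print(mcnemar_exact(0, 10))                     # 0.001953125
```

Here `b` and `c` are the counts of cases that one method classified correctly and the other did not, so the McNemar test compares two paired classifiers on the same patients.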