COMET, an innovative machine learning framework, integrates electronic health records (EHR) data and omics analyses through transfer learning, greatly improving predictive modeling and revealing biological insights from small sample sizes.
The research, published in the journal Nature Machine Intelligence, presents clinical and omics multimodal analysis enhanced with transfer learning (COMET), a framework that combines deep learning with transfer learning.
Advancements in omics technologies have transformed the comprehension of biological processes. Innovations in proteomics, metabolomics, transcriptomics, and additional assays have facilitated affordable assessments of analytes from the same sample. While these assays produce high-dimensional data, limitations in budget and clinical scenarios restrict the scale of omics cohorts. This necessitates the development of creative techniques to enhance the analysis of high-dimensional datasets.
Statistical techniques can control false positives in such settings, but far fewer methods exist for machine learning (ML). Some approaches use transfer learning, in which an ML model is trained on a pre-existing dataset and then applied to a smaller one. Contemporary deep learning approaches have also been combined with statistical models, but they rely primarily on learning from informative metadata or from omics data alone.
The COMET framework addresses these challenges by incorporating pretraining on extensive electronic health record (EHR) datasets and merging early and late fusion methodologies, which results in improved predictive capabilities and biological discoveries.
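The two fusion strategies named above can be illustrated generically. The following is a minimal sketch, assuming simple least-squares models and synthetic data; the function names are hypothetical, and the article does not describe how COMET itself combines early and late fusion.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns a prediction function."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([Xn, np.ones(len(Xn))]) @ coef

def early_fusion(X_ehr, X_omics, y):
    """Early fusion: concatenate both modalities, then fit one model."""
    model = fit_linear(np.hstack([X_ehr, X_omics]), y)
    return lambda e, o: model(np.hstack([e, o]))

def late_fusion(X_ehr, X_omics, y):
    """Late fusion: fit one model per modality, then average their predictions."""
    m_ehr, m_omics = fit_linear(X_ehr, y), fit_linear(X_omics, y)
    return lambda e, o: 0.5 * (m_ehr(e) + m_omics(o))

# Synthetic stand-ins for the two modalities (not real cohort data).
rng = np.random.default_rng(0)
X_ehr = rng.normal(size=(200, 4))
X_omics = rng.normal(size=(200, 6))
y = X_ehr @ rng.normal(size=4) + X_omics @ rng.normal(size=6)

early_pred = early_fusion(X_ehr, X_omics, y)(X_ehr, X_omics)
late_pred = late_fusion(X_ehr, X_omics, y)(X_ehr, X_omics)
```

Early fusion lets a single model learn interactions across modalities, while late fusion keeps each modality's model independent until the final prediction.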
The research and outcomes
In the current study, researchers presented COMET, a deep learning and transfer learning methodology that augments omics analyses. COMET can be used when EHR data are available for a large cohort and both EHR and omics data are available for a smaller cohort. The framework includes a strategy for embedding longitudinal EHR data, as well as pre-training and multimodal modeling.
COMET first trains an ML model exclusively on EHR data, then transfers its weights to a multimodal architecture that is trained and evaluated on a smaller sample with both omics and EHR data. COMET was first used to predict the onset of labor in a pregnancy cohort of 30,904 participants from Stanford Healthcare. Of these, 61 pregnant women (the omics cohort) provided multiple plasma samples during the final days of pregnancy, which were used to create a proteomics dataset measuring 1,317 proteins.
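The pretrain-then-transfer step can be sketched as follows. This is a minimal illustration, assuming a one-layer EHR encoder and synthetic data; the layer sizes and the names `pretrain_ehr`, `init_multimodal`, and `predict` are hypothetical and are not taken from the COMET implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# N_PROTEINS follows the article; EHR_DIM and HIDDEN are hypothetical sizes.
EHR_DIM, HIDDEN, N_PROTEINS = 32, 16, 1317

def pretrain_ehr(X, y, epochs=200, lr=0.01):
    """Fit a tiny EHR-only network by gradient descent and return its encoder weights."""
    W = rng.normal(0, 0.1, (X.shape[1], HIDDEN))
    w_head = rng.normal(0, 0.1, HIDDEN)
    for _ in range(epochs):
        H = np.tanh(X @ W)                      # hidden representation
        err = H @ w_head - y                    # residual of the squared-error loss
        grad_head = H.T @ err / len(y)
        grad_W = X.T @ (np.outer(err, w_head) * (1 - H**2)) / len(y)
        w_head -= lr * grad_head
        W -= lr * grad_W
    return W

def init_multimodal(W_pretrained):
    """Start the multimodal model from the pretrained EHR encoder plus a fresh omics branch."""
    return {
        "W_ehr": W_pretrained.copy(),                          # transferred weights
        "W_omics": rng.normal(0, 0.01, (N_PROTEINS, HIDDEN)),  # new, randomly initialised
        "w_head": rng.normal(0, 0.1, 2 * HIDDEN),              # new head over both branches
    }

def predict(params, X_ehr, X_omics):
    """Encode each modality, concatenate the encodings, and apply the shared head."""
    h = np.concatenate(
        [np.tanh(X_ehr @ params["W_ehr"]), np.tanh(X_omics @ params["W_omics"])],
        axis=1,
    )
    return h @ params["w_head"]

# Synthetic stand-ins: a large EHR-only cohort and a 61-person omics cohort.
X_large = rng.normal(size=(2000, EHR_DIM))
y_large = X_large @ rng.normal(0, 0.3, EHR_DIM)
W_pre = pretrain_ehr(X_large, y_large)

params = init_multimodal(W_pre)
X_ehr_small = rng.normal(size=(61, EHR_DIM))
X_omics_small = rng.normal(size=(61, N_PROTEINS))
preds = predict(params, X_ehr_small, X_omics_small)
```

In a full pipeline, the multimodal parameters would then be fine-tuned on the small cohort; the sketch stops at initialisation to keep the transfer step visible.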
EHR data spanning from the start of pregnancy to the time of blood sampling were used to predict the number of days until labor onset. After pre-training on EHR-only data (from 30,843 individuals), the learned parameters were transferred to a multimodal network that made predictions for the omics cohort. The model reached a Pearson correlation coefficient of 0.868 (95% confidence interval [0.825, 0.900]) between predicted and actual days until labor, indicating that COMET can be accurate in small cohorts with high-dimensional data.
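For readers unfamiliar with the metric, the sketch below computes a Pearson correlation and an approximate confidence interval. It assumes the Fisher z-transformation, which is one common choice; the article does not state how the reported interval was derived, and the sample count used here is a hypothetical placeholder.

```python
import math
from statistics import NormalDist

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for r using the Fisher z-transformation."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    q = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - q * se), math.tanh(z + q * se)

# Toy predicted vs. actual days until labor (illustrative values only).
predicted = [10.0, 25.0, 40.0, 55.0, 70.0]
actual = [12.0, 22.0, 43.0, 50.0, 75.0]
r = pearson_r(predicted, actual)

# n=150 is a hypothetical sample count, not a figure from the study.
low, high = fisher_ci(0.868, n=150)
```

The interval narrows as the number of samples grows, which is why correlation estimates from small omics cohorts are usually reported together with their intervals.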
COMET was then compared with baseline models that used only proteomics data, only EHR data, or both. These baselines were trained solely on omics cohort data, without pre-training. The EHR-only baseline performed worst, with a correlation of 0.768, while the proteomics-only model performed marginally better at 0.796. The combined baseline was the strongest of the baselines, with a correlation of 0.815, yet still lagged behind COMET.
To extract more profound insights, researchers applied t-distributed stochastic neighbor embedding (t-SNE) to visualize multimodal data by projecting the correlation matrix into two dimensions, revealing significant clusters of features based on their correlation patterns. Nearby features depict similar correlations with all other variables within the space. These clusters were annotated according to the medical concepts the EHR or protein features represent within those clusters. A variety of proteins demonstrated notable correlations with EHR variables.
The team calculated the importance of each protein feature. Proteins identified as significantly relevant in COMET models were associated with fetal development, pregnancy complications, and gestational age, consistent with established biological knowledge. Following this, COMET was utilized on a cancer cohort from the United Kingdom (UK) Biobank to forecast three-year cancer mortality. Participants included all patients diagnosed with any form of cancer within five years of enrollment.
A specific group of participants had blood samples analyzed for proteomics data. They were included in the omics cohort if their samples were gathered within one year of cancer diagnosis. Consistently, COMET delivered superior results in predicting three-year cancer mortality relative to all baselines, achieving an area under the receiver operating characteristic curve (AUROC) of 0.842, significantly surpassing the combined baseline (AUROC 0.786) and single-modality models. The prevalence of three-year mortality in the omics cohort was 5.5%.
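The AUROC values quoted above can be read as the probability that a randomly chosen patient who died receives a higher risk score than a randomly chosen survivor. The sketch below computes it directly from that pairwise definition; the function name and the toy labels and scores are illustrative assumptions, not study data.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = died within three years, 0 = survived.
labels = [0, 0, 1, 1]
risk_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, risk_scores))  # 0.75
```

Because the metric depends only on how positives rank against negatives, an uninformative model scores about 0.5 regardless of how rare the outcome is.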
Moreover, t-SNE was applied to visualize the correlation matrix, revealing diminished overlap between EHR and proteomics data modalities compared to labor onset data. However, significant correlations between EHR and proteomics data modalities were apparent when the correlation network was visualized, with each modality individually projected into two dimensions. Mortality factor 4-like protein 2 displayed the strongest correlations with EHR features, particularly concerning drug prescriptions, underscoring its potential as a prognostic biomarker.
A substantial proportion of proteins in cancer patients (66%) demonstrated no correlation with any EHR variable. Additionally, researchers evaluated the correlation between each EHR feature and all proteins, determining the maximum correlation across all proteins for each EHR feature. This uncovered numerous EHR features with low correlations to proteins in cancer patients, emphasizing the importance of incorporating multiple data modalities.
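The cross-modality correlation analysis described above can be sketched as follows. This is an illustration on synthetic data, assuming absolute Pearson correlations and an arbitrary cutoff for "no correlation"; the article states neither the threshold nor the significance test the researchers used.

```python
import numpy as np

def cross_correlation(ehr, prot):
    """Pearson correlation between every EHR column and every protein column."""
    ehr_z = (ehr - ehr.mean(axis=0)) / ehr.std(axis=0)
    prot_z = (prot - prot.mean(axis=0)) / prot.std(axis=0)
    return ehr_z.T @ prot_z / len(ehr)   # shape: (n_ehr_features, n_proteins)

# Synthetic stand-ins: 500 patients, 5 EHR features, 8 proteins.
rng = np.random.default_rng(0)
ehr = rng.normal(size=(500, 5))
prot = rng.normal(size=(500, 8))
prot[:, 0] = ehr[:, 0] + 0.1 * rng.normal(size=500)  # plant one strong link

C = cross_correlation(ehr, prot)

# Strongest correlation (in absolute value) across all proteins, per EHR feature.
max_per_ehr = np.abs(C).max(axis=1)

# Share of proteins whose strongest EHR correlation falls below a hypothetical cutoff.
THRESHOLD = 0.3  # assumption for illustration only
frac_uncorrelated = float(np.mean(np.abs(C).max(axis=0) < THRESHOLD))
```

EHR features with a low `max_per_ehr` value carry information that no single protein captures, which is the argument for modeling both modalities together.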
Proteins exhibiting greater feature importance in COMET models aligned with recognized cancer prognostic biomarkers. Notably, nine proteins that were deemed more significant in COMET models were statistically linked to mortality status, further substantiating the model’s biological relevance.
Conclusions
In conclusion, the investigation showcased COMET’s capability to enhance predictive modeling across various tasks through pre-training and transfer learning. COMET produced better-regularized models, which more accurately represented established biology. Furthermore, COMET models identified biologically pertinent proteins for specific health outcomes.
In the labor onset models, COMET highlighted proteins involved in pregnancy complications, immune regulation, and placental development, and its Pearson correlation values supported its predictive performance. For cancer mortality, the identified proteins were those involved in tumor proliferation and microenvironment modulation. Overall, COMET lays the groundwork for delineating intricate relationships between clinical phenotypes and molecular mechanisms.
Journal reference:
- Mataraso SJ, Espinosa CA, Seong D, et al. A machine learning approach to leveraging electronic health records for enhanced omics analysis. Nature Machine Intelligence, 2025. DOI: 10.1038/s42256-024-00974-9