Many of the symptoms characterized as the post-acute sequelae of SARS-CoV-2 infection (PASC) could have multiple causes or similarly seen in non-COVID patients. An accurate identification of phenotypes will be important to guide future research and the healthcare system to focus its efforts and resources on adequately controlled age- and gender-specific sequelae of COVID-19 infection. This retrospective electronic health records (EHR) cohort study, we applied a computational framework for knowledge discovery from clinical data, MLHO, to identify phenotypes that positively associate with a past positive PCR test for COVID-19. We evaluated the post-test phenotypes in two temporal windows at 3-6 and 6-9 months after the test and by age and gender. We utilized longitudinal diagnosis records stored in EHRs from Mass General Brigham (MGB) 57 thousand patients who tested positive or negative for COVID-19 and were not hospitalized. Statistical analyses were performed on data from March 2020 to March 2021. PCR test results and subsequent diagnosis records that were recorded for the first time two months or later after the PCR test. We identified 28 phenotypes among different age/gender cohorts or time windows that positively associated with a past SARS-CoV-2 infection. All identified phenotypes were newly recorded in patients’ medical records two months or longer after a COVID-19 PCR test in non-hospitalized patients regardless of the test result. Among these phenotypes, a new diagnosis record for anosmia and dysgeusia (OR 2.17, 95% CI [1.42 - 3.25]), alopecia (OR 3.54, 95% CI [2.92 - 4.3]), chest pain (OR 1.35, 95% CI [1.16 - 1.56]), or chronic fatigue syndrome (OR 1.81-2.28, 95% CI [1.38 - 3.68]) are the most significant indicators of a past COVID-19 infection, especially among women younger than 65. Among men, edema (OR 1.83, 95% CI [1.23 - 2.66]) and disease of nail (OR 3.54, 95% CI [1.63 - 7.29]) in patients 65 and older or proteinuria (OR 2.66, 95% CI [1.61 - 4.34]) in patients under 65 are associated with a positive COVID-19 PCR test in the past few months. Our approach avoids a flood of false positive discoveries, while offering a more probabilistic flexible criterion than the standard linear phenome-wide association study (PheWAS). These findings suggest that some of the previously identified post sequelae of COVID-19 may not be accurate and that most of the PASC are observed in patients under 65 years of age.Competing Interest StatementThe authors have declared no competing interest.Funding StatementThis work was supported by the National Human Genome Research Institute grant 3U01HG008685-05S2 and the National Library of Medicine grant T15LM007092.Author DeclarationsI confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.YesThe details of the IRB/oversight body that provided approval or exemption for the research described are given below:The use of clinical data in this study was approved by the MGB Human Research Committee with a waiver of informed consent.All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.YesI understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).Yes I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.YesData contains PHI and therefore is not publicly available.
The COVID-19 pandemic has devastated the world with health and economic wreckage. Precise estimates of adverse outcomes from COVID-19 could have led to better allocation of healthcare resources and more efficient targeted preventive measures, including insight into prioritizing how to best distribute a vaccination. We developed MLHO (pronounced as melo), an end-to-end Machine Learning framework that leverages iterative feature and algorithm selection to predict Health Outcomes. MLHO implements iterative sequential representation mining, and feature and model selection, for predicting patient-level risk of hospitalization, ICU admission, need for mechanical ventilation, and death. It bases this prediction on data from patients' past medical records (before their COVID-19 infection). MLHO's architecture enables a parallel and outcome-oriented model calibration, in which different statistical learning algorithms and vectors of features are simultaneously tested to improve prediction of health outcomes. Using clinical and demographic data from a large cohort of over 13,000 COVID-19-positive patients, we modeled the four adverse outcomes utilizing about 600 features representing patients' pre-COVID health records and demographics. The mean AUC ROC for mortality prediction was 0.91, while the prediction performance ranged between 0.80 and 0.81 for the ICU, hospitalization, and ventilation. We broadly describe the clusters of features that were utilized in modeling and their relative influence for predicting each outcome. Our results demonstrated that while demographic variables (namely age) are important predictors of adverse outcomes after a COVID-19 infection, the incorporation of the past clinical records are vital for a reliable prediction model. As the COVID-19 pandemic unfolds around the world, adaptable and interpretable machine learning frameworks (like MLHO) are crucial to improve our readiness for confronting the potential future waves of COVID-19, as well as other novel infectious diseases that may emerge.
The growing recognition of algorithmic bias has spurred discussions about fairness in artificial intelligence (AI) / machine learning (ML) algorithms. The increasing translation of predictive models into clinical practice brings an increased risk of direct harm from algorithmic bias; however, bias remains incompletely measured in many medical AI applications. Using data from over 56 thousand Mass General Brigham (MGB) patients with confirmed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), we evaluate unrecognized bias in four AI models developed during the early months of the pandemic in Boston, Massachusetts that predict risks of hospital admission, ICU admission, mechanical ventilation, and death after a SARS-CoV-2 infection purely based on their pre-infection longitudinal medical records.We discuss that while a model can be biased against certain protected groups (i.e., perform worse) in certain tasks, it can be at the same time biased towards another protected group (i.e., perform better). As such, current bias evaluation studies may lack a full depiction of the variable effects of a model on its subpopulations.If the goal is to make a change in a positive way, the underlying roots of bias need to be fully explored in medical AI. Only a holistic evaluation, a diligent search for unrecognized bias, can provide enough information for an unbiased judgment of AI bias that can invigorate follow-up investigations on identifying the underlying roots of bias and ultimately make a change.Competing Interest StatementThe authors have declared no competing interest.Funding StatementThis work was supported by the National Human Genome Research Institute grant 3U01HG008685-05S2.Author DeclarationsI confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.YesThe details of the IRB/oversight body that provided approval or exemption for the research described are given below:IRB of MGB gave ethical approval for this work.I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.YesI understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).YesI have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.YesProtected Health Information restrictions apply to the availability of the clinical data here, which were used under IRB approval for use only in the current study. As a result, this dataset is not publicly available.
This study aims to predict death after COVID-19 using only the past medical information routinely collected in electronic health records (EHRs) and to understand the differences in risk factors across age groups. Combining computational methods and clinical expertise, we curated clusters that represent 46 clinical conditions as potential risk factors for death after a COVID-19 infection. We trained age-stratified generalized linear models (GLMs) with component-wise gradient boosting to predict the probability of death based on what we know from the patients before they contracted the virus. Despite only relying on previously documented demographics and comorbidities, our models demonstrated similar performance to other prognostic models that require an assortment of symptoms, laboratory values, and images at the time of diagnosis or during the course of the illness. In general, we found age as the most important predictor of mortality in COVID-19 patients. A history of pneumonia, which is rarely asked in typical epidemiology studies, was one of the most important risk factors for predicting COVID-19 mortality. A history of diabetes with complications and cancer (breast and prostate) were notable risk factors for patients between the ages of 45 and 65 years. In patients aged 65-85 years, diseases that affect the pulmonary system, including interstitial lung disease, chronic obstructive pulmonary disease, lung cancer, and a smoking history, were important for predicting mortality. The ability to compute precise individual-level risk scores exclusively based on the EHR is crucial for effectively allocating and distributing resources, such as prioritizing vaccination among the general population.
OBJECTIVE: High-throughput electronic phenotyping algorithms can accelerate translational research using data from electronic health record (EHR) systems. The temporal information buried in EHRs is often underutilized in developing computational phenotypic definitions. This study aims to develop a high-throughput phenotyping method, leveraging temporal sequential patterns from EHRs.
MATERIALS AND METHODS: We develop a representation mining algorithm to extract 5 classes of representations from EHR diagnosis and medication records: the aggregated vector of the records (aggregated vector representation), the standard sequential patterns (sequential pattern mining), the transitive sequential patterns (transitive sequential pattern mining), and 2 hybrid classes. Using EHR data on 10 phenotypes from the Mass General Brigham Biobank, we train and validate phenotyping algorithms.
RESULTS: Phenotyping with temporal sequences resulted in a superior classification performance across all 10 phenotypes compared with the standard representations in electronic phenotyping. The high-throughput algorithm's classification performance was superior or similar to the performance of previously published electronic phenotyping algorithms. We characterize and evaluate the top transitive sequences of diagnosis records paired with the records of risk factors, symptoms, complications, medications, or vaccinations.
DISCUSSION: The proposed high-throughput phenotyping approach enables seamless discovery of sequential record combinations that may be difficult to assume from raw EHR data. Transitive sequences offer more accurate characterization of the phenotype, compared with its individual components, and reflect the actual lived experiences of the patients with that particular disease.
CONCLUSION: Sequential data representations provide a precise mechanism for incorporating raw EHR records into downstream machine learning. Our approach starts with user interpretability and works backward to the technology.
Electronic health records (EHRs) contain important temporal information about the progression of disease and treatment outcomes. This paper proposes a transitive sequencing approach for constructing temporal representations from EHR observations for downstream machine learning. Using clinical data from a cohort of patients with congestive heart failure, we mined temporal representations by transitive sequencing of EHR medication and diagnosis records for classification and prediction tasks. We compared the classification and prediction performances of the transitive sequential representations (bag-of-sequences approach) with the conventional approach of using aggregated vectors of EHR data (aggregated vector representation) across different classifiers. We found that the transitive sequential representations are better phenotype "differentiators" and predictors than the "atemporal" EHR records. Our results also demonstrated that data representations obtained from transitive sequencing of EHR observations can present novel insights about the progression of the disease that are difficult to discern when clinical data are treated independently of the patient's history.