Summary: We demonstrate several advantages of applying data mining techniques to time-dependent Electronic Medical Records (EMR). Specifically: 1) combining structured and unstructured variables improves the accuracy of a type-2 diabetes (T2D) classification algorithm; 2) a quantitative survey of multiple comorbidities, expressed as hazard ratios, is important in T2D, especially for cardiovascular complications; 3) analyzing time-dependent variables can clarify the time-dependent contributions of a variety of comorbidities, and specifically of the “obesity paradox”; and 4) an unbiased examination of physician treatment patterns reveals changes over time that are consistent with clinical trials.
Background: Cohorts assembled from EMR present a potentially powerful resource to study T2D and cardiovascular complications at population scale. Recent reports have demonstrated the utility of EMR analysis to discover genotype-phenotype correlations, sub-categories of disease, and adverse drug events.
Methods: We developed a classification algorithm to identify T2D patients based on characteristics including clinical notes, diagnosis and procedure codes, medications, and laboratory tests. We analyzed an EMR database at MGH and BWH, considering patients who received care between 1990 and 2013. We applied logistic regression with the adaptive LASSO using different combinations of variables: structured variables only, unstructured variables only, and the combination of all variables. To determine the level of association of clinical and demographic variables with mortality, we developed baseline and lagged time-varying Cox regression models that included an adjustment for ethnicity and time-varying covariates. To observe changes in the frequency ratios of medical concepts as a function of time, also considering the effects of clinical trial publications, we focused on heart failure related concepts extracted from clinical notes (e.g., Aldosterone, Biventricular Pacemaker). To assess how therapeutic relationships change over time, we calculated sparse covariance matrices.
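As an illustration of the variable-selection step, the following is a minimal sketch of logistic regression with an adaptive LASSO penalty. It is not the study's implementation: the text does not specify the pilot estimator, the weight exponent, the tuning parameter, or the solver, so the choices here (an unpenalized pilot fit, weights 1/|beta|^gamma, proximal gradient descent, and the helper names `fit_l1_logistic` and `adaptive_lasso_logistic`) are assumptions, and the toy data are invented. Features are assumed to be on comparable scales.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(X, y, penalties, lr=0.5, n_iter=2000):
    """Proximal gradient descent on the mean logistic loss plus
    sum_j penalties[j] * |beta_j|. The intercept is not penalized."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, grad = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            resid = sigmoid(b0 + sum(b * x for b, x in zip(beta, xi))) - yi
            g0 += resid
            for j in range(p):
                grad[j] += resid * xi[j]
        b0 -= lr * g0 / n
        beta = [soft_threshold(beta[j] - lr * grad[j] / n, lr * penalties[j])
                for j in range(p)]
    return b0, beta


def adaptive_lasso_logistic(X, y, lam=0.05, gamma=1.0, eps=1e-6):
    """Two-stage adaptive LASSO (hypothetical helper, assumed settings)."""
    p = len(X[0])
    # Stage 1: pilot fit with no L1 penalty.
    _, pilot = fit_l1_logistic(X, y, [0.0] * p)
    # Stage 2: small pilot coefficients receive large penalties.
    weights = [1.0 / (abs(b) ** gamma + eps) for b in pilot]
    return fit_l1_logistic(X, y, [lam * w for w in weights])


# Toy data (invented): column 0 carries signal, column 1 is balanced noise.
base = [(-2, 0), (-1, 0), (-1, 1), (1, 1), (1, 0), (2, 1)]
X = [[x1, x2] for x1, _ in base for x2 in (1.0, -1.0)]
y = [label for _, label in base for _ in (1.0, -1.0)]
intercept, coefs = adaptive_lasso_logistic(X, y)
print(coefs)
```

The per-feature weights are what distinguish the adaptive LASSO from the ordinary LASSO: a variable with a weak pilot coefficient is penalized heavily and tends to be dropped, while a strong one is shrunk only slightly. The same routine could be run on structured variables only, unstructured variables only, or their combination to compare the resulting classifiers.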
Results: Our classification algorithm identified 65,099 T2D patients with a specificity of 97% and a positive predictive value (PPV) of 96%. The definition of the “gold standard” included ≥ 1 measurement of HGB A1C ≥ 6.5%, among other criteria. Of these patients, 56,691 (87.1%) had two or more chronic conditions and 38,449 (59.1%) had four or more, demonstrating the complexity of the cohort that we created in comparison with administrative claims databases, which lack many clinical details. Cox regression models indicated statistically significant HRs > 1 for CHF, CAD, and CVD, and HRs < 1 for PCI and CABG. HRs for BMI were particularly interesting, as increasing levels were associated with significantly lower mortality compared to the reference BMI (< 25 kg/m²). When the results are further stratified into 1-, 3-, and 5-year analyses, this “obesity paradox” is most striking at the short-term follow-up of 1 year. This may be because patients with low BMI were suffering from chronic medical conditions (e.g., malignancy or inflammatory conditions) that increased their 1-year mortality. However, at 3- and 5-year follow-up, we do see an increase in mortality with increasing BMI levels, likely related to an increased burden of cardiovascular events.
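For readers less familiar with the reported metrics, the short sketch below shows how specificity and PPV are computed from validation counts, and re-derives the comorbidity percentages from the cohort sizes given above. The confusion-matrix counts are hypothetical, chosen only so that the arithmetic lands near the reported figures; they are not the study's chart-review data.

```python
def specificity(true_neg, false_pos):
    """Fraction of true non-cases that the classifier labels as non-cases."""
    return true_neg / (true_neg + false_pos)


def ppv(true_pos, false_pos):
    """Fraction of predicted cases that are true cases."""
    return true_pos / (true_pos + false_pos)


# Hypothetical validation counts (assumed, not from the study).
tp, fp, tn = 96, 4, 129
print(f"specificity = {specificity(tn, fp):.2f}, PPV = {ppv(tp, fp):.2f}")

# Cohort counts as reported in the text.
cohort_size = 65099
comorbidity_counts = [56691, 38449]
shares = [round(100 * c / cohort_size, 1) for c in comorbidity_counts]
print(shares)
```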
Discussion: We implemented classification, prediction, and natural language processing techniques in multiple scenarios to create and analyze a large and highly complex cohort, with the aim of better understanding patients as time-dependent entities digitally represented as a collection of data elements.