Summary Electronic health records (EHRs) contain important temporal information about the progression of disease and treatment outcomes. This paper proposes a transitive sequencing approach for constructing temporal representations from EHR observations for downstream machine learning. Using clinical data from a cohort of patients with congestive heart failure, we mined temporal representations by transitive sequencing of EHR medication and diagnosis records for classification and prediction tasks. We compared the classification and prediction performances of the transitive sequential representations (bag-of-sequences approach) with the conventional approach of using aggregated vectors of EHR data (aggregated vector representation) across different classifiers. We found that the transitive sequential representations are better phenotype “differentiators” and predictors than the “atemporal” EHR records. Our results also demonstrated that data representations obtained from transitive sequencing of EHR observations can present novel insights about the progression of the disease that are difficult to discern when clinical data are treated independently of the patient's history.
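As a rough illustration of the idea (not the paper's implementation), transitive sequencing can be read as turning a patient's timestamped records into the set of all temporally ordered code pairs, which then serve as bag-of-sequences features. The function name and the toy patient history below are hypothetical:

```python
from itertools import combinations

def transitive_sequences(records):
    """Build a bag of ordered code pairs from a patient's timestamped records.

    records: list of (timestamp, code) tuples, e.g. EHR medication/diagnosis codes.
    Returns the set of (earlier_code, later_code) pairs for every strictly
    ordered pair of records (the transitive closure of the observed sequence).
    """
    ordered = sorted(records)  # sort by timestamp
    pairs = set()
    for (t1, a), (t2, b) in combinations(ordered, 2):
        if t1 < t2:  # keep only strictly earlier-to-later pairs
            pairs.add((a, b))
    return pairs

# Hypothetical patient history: diagnosis/medication codes with day offsets
history = [(1, "DX:hypertension"), (30, "RX:lisinopril"), (120, "DX:CHF")]
bag = transitive_sequences(history)
# bag contains ("DX:hypertension", "RX:lisinopril"),
# ("DX:hypertension", "DX:CHF"), and ("RX:lisinopril", "DX:CHF")
```

Each patient's bag can then be vectorized (e.g., one-hot over observed pairs) for the downstream classifiers.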
Age is an important proxy for many life course trajectories, and has complex and understudied relationships with energy consumption. We evaluated the presence and the shape of an age-energy consumption profile in the U.S. residential sector, using household-level data from four waves of the Residential Energy Consumption Survey (RECS) in 1987, 1990, 2005, and 2009. We constructed pseudo-cohorts from Bayesian generalized linear model estimates to create micro-profiles for energy consumption across the life course. Overall, we found that residential energy consumption increases over the life course. Much of the increase in energy consumption is due to housing size. Variations in the age-energy consumption micro-profiles can be described by concave and convex functions that transform from one to another across the life course. We conclude with a demographic perspective on the future of residential energy demand in the U.S. The confluence of demographic and climatic changes will likely cause an amplification of effects, challenging the supply and demand of energy for the older population.
Hossein Estiri, Jeffrey G Klann, Sarah R Weiler, Ernest Alema-Mensah, R Joseph Applegate, Galina Lozinski, Nandan Patibandla, Kun Wei, William G Adams, Marc D Natter, Elizabeth O Ofili, Brian Ostasiewski, Alexander Quarshie, Gary E Rosenthal, Elmer V Bernstam, Kenneth D Mandl, and Shawn N Murphy. 2019. “A federated EHR network data completeness tracking system.” Journal of the American Medical Informatics Association, 26, 7, Pp. 637-645.
The study sought to design, pilot, and evaluate a federated data completeness tracking system (CTX) for assessing completeness in research data extracted from electronic health record data across the Accessible Research Commons for Health (ARCH) Clinical Data Research Network. The CTX applies a systems-based approach to design workflow and technology for assessing completeness across distributed electronic health record data repositories participating in a queryable, federated network. The CTX invokes 2 positive feedback loops that utilize open source tools (DQe-c and Vue) to integrate technology and human actors in a system geared for increasing capacity and taking action. A pilot implementation of the system involved 6 ARCH partner sites between January 2017 and May 2018. The ARCH CTX has enabled the network to monitor and, if needed, adjust its data management processes to maintain complete datasets for secondary use. The system allows the network and its partner sites to profile data completeness at both the network and partner-site levels. Interactive visualizations presenting the current state of completeness in the context of the entire network, as well as changes in completeness across time, were valued among the CTX user base. Distributed clinical data networks are complex systems. Top-down approaches that rely solely on technology to report data completeness may be necessary but not sufficient for improving the completeness (and quality) of data in large-scale clinical data networks. Improving and maintaining complete (high-quality) data in such complex environments entails sociotechnical systems that exploit technology and empower human actors to engage in the process of curating high-quality data. The CTX has increased the network’s capacity to rapidly identify data completeness issues and empowered ARCH partner sites to get involved in improving the completeness of their respective data repositories.
Background and Objective Electronic Health Record (EHR) data often include observation records that are unlikely to represent the “truth” about a patient at a given clinical encounter. Due to their high throughput, examples of such implausible observations are frequent in records of laboratory test results and vital signs. Outlier detection methods can offer low-cost solutions to flagging implausible EHR observations. This article evaluates the utility of a semi-supervised encoding approach (super-encoding) for constructing non-linear exemplar data distributions from EHR observation data and detecting non-conforming observations as outliers. Methods Two hypotheses are tested using experimental design and non-parametric hypothesis testing procedures: (1) adding demographic features (e.g., age, gender, race/ethnicity) can increase precision in outlier detection, and (2) sampling small subsets of the large EHR data can improve outlier detection by reducing the noise-to-signal ratio. The experiments involved applying 492 encoder configurations (involving different input features, architectures, sampling ratios, and error margins) to a set of 30 datasets of EHR observations, including laboratory test and vital sign records extracted from the Research Patient Data Registry (RPDR) at Partners HealthCare. Results Results were obtained from (30 × 492) 14,760 encoders. The semi-supervised encoding approach (super-encoding) outperformed conventional autoencoders in outlier detection. Adding the patient's age at the observation (encounter) to the baseline encoder, which included only the observation value as an input feature, slightly improved outlier detection. The nine top-performing encoders are introduced. The best outlier detection performance came from a semi-supervised encoder with the observation value as its single feature and a single hidden layer, built on one percent of the data with a one percent reconstruction error margin.
For all 30 observation types, at least one encoder configuration achieved a Youden's J index higher than 0.9999. Conclusion Given the multiplicity of distributions for a single observation in EHR data (i.e., the same observation represented under different names or units), as well as the non-linearity of human observations, encoding holds great promise for outlier detection in large-scale data repositories. https://github.com/hestiri/superencoder
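A minimal sketch of the reconstruction-error decision rule that super-encoding builds on: train an encoder on a small sample of the data, then flag observations whose reconstruction error exceeds a margin. The toy one-hidden-unit NumPy autoencoder below and its margin semantics (an absolute threshold in standardized units) are illustrative assumptions, not the released implementation linked above:

```python
import numpy as np

def flag_outliers(values, train_values=None, train_frac=0.01,
                  margin=1.0, epochs=500, lr=0.01, seed=0):
    """Flag observations by reconstruction error of a tiny autoencoder.

    Toy illustration only: a one-hidden-unit network z -> tanh(w1*z + b1)
    -> w2*h + b2 is fit by batch gradient descent on a small (presumed
    plausible) training sample, then any observation whose absolute
    reconstruction error exceeds `margin` is flagged as an outlier.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    mu, sigma = x.mean(), x.std() or 1.0
    z = (x - mu) / sigma  # standardize all observations
    if train_values is None:
        # train on a small random subset, echoing the paper's 1% sampling
        k = max(1, int(len(z) * train_frac))
        train = rng.choice(z, size=k, replace=False)
    else:
        train = (np.asarray(train_values, dtype=float) - mu) / sigma
    w1, b1, w2, b2 = 1.0, 0.0, 1.0, 0.0
    for _ in range(epochs):  # batch gradient descent on mean squared error
        h = np.tanh(w1 * train + b1)
        out = w2 * h + b2
        g = 2.0 * (out - train) / len(train)   # dL/d(out)
        dw2, db2 = np.sum(g * h), np.sum(g)
        back = g * w2 * (1.0 - h ** 2)          # backprop through tanh
        dw1, db1 = np.sum(back * train), np.sum(back)
        w1 -= lr * dw1; b1 -= lr * db1; w2 -= lr * dw2; b2 -= lr * db2
    recon = w2 * np.tanh(w1 * z + b1) + b2
    return np.abs(recon - z) > margin

# Hypothetical usage: plausible lab values plus one implausible entry
clean = [v / 10 for v in range(-20, 21)]
flags = flag_outliers(clean + [50.0], train_values=clean, margin=1.0)
# only the last (implausible) observation is flagged
```

Passing a presumed-clean training sample, as above, mirrors the semi-supervised framing; the unsupervised variant simply trains on a random subsample.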
In this paper we propose a household sorting model for the 50 largest US metropolitan regions and evaluate the model using 2010 Census data. To approximate residential locations for household cohorts, we specify a Cohort Location Model (CLM) built upon two principal assumptions about housing consumption and metropolitan development/land use patterns. According to our model, the expected distance from the household’s residential location to the city centre(s) increases with the age of the householder (as a proxy for changes in housing career over the life span). The CLM provides a flexible housing-based explanation for household sorting patterns in US metropolitan regions. Results from our analysis of US metropolitan regions show that households headed by individuals under the age of 35 are the most common cohort in centrally located areas. We also found that households headed by individuals over the age of 35 are most prevalent in peripheral locations, but their sorting was not statistically different across space.
Most determinants of health originate from the “contexts” in which we live, which have remained outside the confines of the U.S. healthcare system. This gap has left providers unprepared to operate with an ample understanding of the challenges patients may face beyond their purview. The recent shift to value-based care and the increasing prevalence of Electronic Health Record (EHR) systems provide opportunities to incorporate upstream contextual factors into care. We discuss how incorporating context into care is hindered by a chicken-and-egg dilemma: evidence on the utility of contextual data at the point of care is lacking because the data themselves are missing, and the data are missing due to the lack of an informatics infrastructure. We argue that if we build the informatics infrastructure today, EHRs can give tomorrow’s clinicians the tools and the data they need to transform U.S. healthcare from episodic and reactive to preventive and proactive. We also discuss system design considerations to improve the efficacy of the suggested informatics infrastructure, which include systematically prioritizing contextual data domains, developing interoperability standards, and ensuring that the integration of contextual data does not disrupt clinicians’ workflow.
Data variability is a commonly observed phenomenon in Electronic Health Record (EHR) data networks. A common question in scientific investigations of EHR data is whether cross-site and cross-time variability reflects an underlying data quality error at one or more contributing sites or actual differences driven by various idiosyncrasies of the healthcare settings. Although research analysts and data scientists commonly use statistical methods to detect and account for variability in analytic datasets, self-service tools for exploring cross-organizational variability in EHR data warehouses are lacking and could benefit from meaningful data visualizations. DQe-v, an interactive, database-agnostic tool for visually exploring variability in EHR data, provides such a solution. DQe-v is built on an open source platform, the R statistical software, with annotated scripts and a readme document that make it fully reproducible. To illustrate DQe-v’s functionality, we walk through its readme document, which includes a complete guide to installation, running the program, and interpreting the outputs. We also provide annotated R scripts and an example dataset as supplemental materials. DQe-v offers a self-service tool to visually explore data variability within EHR datasets irrespective of the data model. GitHub and CIELO host and distribute the tool and can facilitate collaboration across any interested community of users as we target improving usability, efficiency, and interoperability.
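DQe-v itself is an R visualization tool; purely to illustrate the kind of cross-site variability question it helps analysts explore, here is a hypothetical robust screen (deviation from the network median in median-absolute-deviation units) over per-site rates. The site names, rates, and cutoff are invented for the example:

```python
from statistics import median

def flag_variable_sites(site_rates, k=3.0):
    """Flag sites whose rate deviates from the network median by more than
    k median absolute deviations (MADs) -- a simple robust screen for the
    cross-site variability an analyst might then inspect visually."""
    rates = list(site_rates.values())
    med = median(rates)
    mad = median(abs(r - med) for r in rates) or 1e-9  # avoid divide-by-zero
    return [site for site, r in site_rates.items() if abs(r - med) / mad > k]

# Hypothetical share of encounters with a recorded diagnosis, per site
rates = {"siteA": 0.91, "siteB": 0.89, "siteC": 0.93, "siteD": 0.35}
outliers = flag_variable_sites(rates)  # flags "siteD"
```

Whether a flagged site reflects a data quality error or a genuine practice difference still requires the kind of human review the paper emphasizes.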
To provide an open source, interoperable, and scalable data quality assessment tool for evaluation and visualization of completeness and conformance in electronic health record (EHR) data repositories. This article describes the tool’s design and architecture and gives an overview of its outputs using a sample dataset of 200,000 randomly selected patient records with an encounter since January 1, 2010, extracted from the Research Patient Data Registry (RPDR) at Partners HealthCare. All the code and instructions to run the tool and interpret its results are provided in the Supplementary Appendix. DQe-c produces a web-based report that summarizes data completeness and conformance in a given EHR data repository through descriptive graphics and tables. Results from running the tool on the sample RPDR data are organized into 4 sections: load and test details, completeness test, data model conformance test, and test of missingness in key clinical indicators. Open science, interoperability across major clinical informatics platforms, and scalability to large databases are key design considerations for DQe-c. Iterative implementation of the tool across different institutions directed us to improve its scalability and interoperability and to find ways to facilitate local setup. EHR data quality assessment has been hampered by ad hoc processes. The architecture and implementation of DQe-c offer valuable insights for developing reproducible and scalable data science tools to assess, manage, and process data in clinical data repositories.
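To make the completeness test concrete, a toy sketch of per-column completeness over patient records; the field names and missingness rules (absent, None, or empty string) are assumptions for illustration, not DQe-c's actual logic:

```python
def completeness_report(rows, columns):
    """Percentage of non-missing values per column over a list of record
    dicts. A value counts as missing if the field is absent, None, or an
    empty string -- an assumed rule for this toy example."""
    n = len(rows)
    report = {}
    for col in columns:
        present = sum(1 for row in rows if row.get(col) not in (None, ""))
        report[col] = round(100.0 * present / n, 1) if n else 0.0
    return report

# Hypothetical extract with gaps in birth_date and race
patients = [
    {"patient_id": "p1", "birth_date": "1950-02-01", "race": ""},
    {"patient_id": "p2", "birth_date": None, "race": "unknown"},
    {"patient_id": "p3", "birth_date": "1962-11-30"},
]
completeness_report(patients, ["patient_id", "birth_date", "race"])
# {'patient_id': 100.0, 'birth_date': 66.7, 'race': 33.3}
```

A report of this shape, rendered as graphics and tables per table and column, is the kind of summary the web-based report presents.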