Semi-supervised encoding for outlier detection in clinical observation data


Hossein Estiri and Shawn N. Murphy. 2019. “Semi-supervised encoding for outlier detection in clinical observation data.” Computer Methods and Programs in Biomedicine, 181, Pp. 104830.


Background and Objective: Electronic Health Record (EHR) data often include observation records that are unlikely to represent the “truth” about a patient at a given clinical encounter. Because they are recorded at high throughput, laboratory test results and vital signs frequently contain such implausible observations. Outlier detection methods can offer low-cost solutions for flagging implausible EHR observations. This article evaluates the utility of a semi-supervised encoding approach (super-encoding) for constructing non-linear exemplar data distributions from EHR observation data and detecting non-conforming observations as outliers.

Methods: Two hypotheses are tested using experimental design and non-parametric hypothesis testing procedures: (1) adding demographic features (e.g., age, gender, race/ethnicity) can increase precision in outlier detection, and (2) sampling small subsets of the large EHR data can improve outlier detection by reducing the noise-to-signal ratio. The experiments involved applying 492 encoder configurations (involving different input features, architectures, sampling ratios, and error margins) to a set of 30 datasets of EHR observations, including laboratory tests and vital sign records, extracted from the Research Patient Data Registry (RPDR) at Partners HealthCare.

Results: Results are obtained from 14,760 encoders (30 datasets × 492 configurations). The semi-supervised encoding approach (super-encoding) outperformed conventional autoencoders in outlier detection. Adding the patient's age at the observation (encounter) to the baseline encoder, which included only the observation value as the input feature, slightly improved outlier detection. The nine top-performing encoders are introduced. The best outlier detection performance came from a semi-supervised encoder with observation value as the single feature and a single hidden layer, built on one percent of the data with a one percent reconstruction error margin. For all 30 observation types, at least one encoder configuration had a Youden's J index higher than 0.9999.

Conclusion: Given the multiplicity of distributions for a single observation in EHR data (i.e., the same observation represented with different names or units), as well as the non-linearity of human observations, encoding holds great promise for outlier detection in large-scale data repositories.
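
For readers unfamiliar with the evaluation metric named in the Results, Youden's J index is defined as sensitivity + specificity − 1, so a value near 1 means that almost all true outliers are flagged and almost no plausible observations are. The Python sketch below illustrates that definition only. The function name `youden_j` and the toy labels are hypothetical and are not taken from the article, which does not describe its evaluation code here.

```python
def youden_j(true_outlier, flagged):
    """Return Youden's J = sensitivity + specificity - 1.

    true_outlier and flagged are equal-length sequences of 0/1 (or bool)
    values: the reference label and the detector's decision for each
    observation. Both names are illustrative assumptions.
    """
    tp = fn = tn = fp = 0
    for is_outlier, is_flagged in zip(true_outlier, flagged):
        if is_outlier and is_flagged:
            tp += 1          # true outlier, correctly flagged
        elif is_outlier:
            fn += 1          # true outlier, missed
        elif is_flagged:
            fp += 1          # plausible value, wrongly flagged
        else:
            tn += 1          # plausible value, correctly left alone

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one outlier and one non-outlier")

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


# Toy example (made-up labels): 2 true outliers, 4 plausible observations.
truth = [1, 1, 0, 0, 0, 0]
flags = [1, 0, 0, 0, 0, 1]
j = youden_j(truth, flags)  # sensitivity 0.5, specificity 0.75
```

In this toy case one of two outliers is caught and one of four plausible values is wrongly flagged, giving J = 0.5 + 0.75 − 1 = 0.25. A detector that flags exactly the true outliers would score J = 1.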


SI: Data Quality Assessment