A Clustering Algorithm Based on Document Embedding to Identify Clinical Note Templates


This paper proposes a novel unsupervised document embedding based clustering algorithm to generate clinical note templates. We adapted Charikar’s SimHash to embed each clinical document into a vector representation. We modified the traditional K-means algorithm to merge any two clusters with centroids when they are very close. Under the K-means paradigm, our algorithm designates the cluster representative corresponding to the document vector closest to the centroid as the prototype template. On a corpus of clinical notes, we evaluated the feasibility of utilizing our algorithm at the individual author level. The corpus contains 1,063,893 clinical notes corresponding to 19,146 unique providers between January 2011 and July 2016. Our algorithm achieved more than 80% precision and runs in O(n) time complexity. We further validated our algorithm using human annotators who reported it is able to efficiently detect a real clinical document that can represent the other documents in the same cluster at both the department level and the individual clinician level.
DOI: 10.1007/s40745-020-00296-8
Last updated on 10/28/2020