Publications by Type: Journal Article

In Preparation
Tang C. (first author). In Preparation. “Evaluating Fairness Criteria That Address Disparities in Diabetes.” Journal of the American Medical Informatics Association.
Objective: We aim to compare different fairness criteria in their ability to identify disparities in the temporal modeling of minority populations (i.e., African American, Hispanic, Asian) with diabetes when predicting the trend of long-term outcomes (i.e., improvement, stagnation, decline).
Methods: We utilized a one-step feedback delayed model to evaluate three fairness criteria (i.e., maximum utility, equal opportunity, demographic parity) via synthetic data generated from 459,280 real-world diabetes cases. We characterized long-term outcomes in discussing these fairness criteria and identified where they exhibit qualitatively different behavior.
Results: All three fairness criteria can lead to all possible outcomes (i.e., improvement, stagnation, decline) in natural parameter regimes. Our results demonstrate that, without a careful model of delayed outcomes, both the institution and the patient may cause harm in cases where an unconstrained objective would not.
Conclusions: The delayed model can help foresee the impact a fairness criterion would have if enforced as a constraint in a classification system.
The abstract has been accepted for virtual Discover Brigham 2020.
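Two of the criteria named in the abstract above, demographic parity and equal opportunity, have standard textbook definitions that can be checked on any set of binary decisions. The following is a minimal sketch of those definitions only; it does not implement the one-step feedback delayed model, and the function names and the toy data (`toy_groups`) are illustrative assumptions, not taken from the study.

```python
def selection_rate(preds):
    """Fraction of individuals who receive the positive decision."""
    return sum(preds) / len(preds)


def true_positive_rate(labels, preds):
    """Fraction of truly positive individuals who receive the positive decision."""
    hits = [p for y, p in zip(labels, preds) if y == 1]
    if not hits:
        raise ValueError("group has no positive labels")
    return sum(hits) / len(hits)


def fairness_gaps(groups):
    """groups maps a group name to (labels, preds); returns both gaps.

    Demographic parity asks for equal selection rates across groups;
    equal opportunity asks for equal true positive rates across groups.
    """
    rates = [selection_rate(preds) for _, preds in groups.values()]
    tprs = [true_positive_rate(labels, preds) for labels, preds in groups.values()]
    return {
        "demographic_parity_gap": max(rates) - min(rates),
        "equal_opportunity_gap": max(tprs) - min(tprs),
    }


# Hypothetical toy data: (true labels, binary decisions) per group.
toy_groups = {
    "group_a": ([1, 1, 0, 0], [1, 0, 1, 0]),
    "group_b": ([1, 0, 0, 0], [1, 1, 0, 0]),
}
gaps = fairness_gaps(toy_groups)
```

In this toy example both groups are selected at the same rate, so demographic parity holds, while the true positive rates differ, so equal opportunity does not. That a decision rule can satisfy one criterion and violate the other is part of why the criteria can behave differently.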
Tang C. (first author). In Preparation. “Quantifying Emerging Data Capital: An Experiment in Social Media Clout.” Proceedings of the National Academy of Sciences of the United States of America.
Network analytics using a force-directed graph drawing algorithm can offer a variety of intuitive processes to understand social gravity. We assume that such gravity is influenced by social media clout, representing interpersonal relationships expressed by language response. While immediate or delayed language responses may define a distant or close relationship between two individuals, this research only considers the evolution of the topics discussed among in-group members (i.e., they are alike) for data privacy concerns. We first utilize a dynamic Latent Dirichlet Allocation model to extract a specified number of topics from two Facebook™ pages (10 topics for each) between October 1, 2016, and September 30, 2018. One page hosts a closed group with 18,946 members (as of September 2018) created on December 3, 2006; the other hosts an open group with 11,999 members (as of September 2018) created on May 10, 2008. A total of 3,952 people belong to both groups. Next, we present techniques for using social gravity as force-directed layouts to produce drawings of complex networks for these topics. The force-directed graphs are an intuitive representation of each topic’s origin and evolution in posts and of inter-human relations among the post categories, demonstrating real-time topic intensity. We then evaluate the resulting networks with Ramsey's theorem. Findings from the two venture-capital groups, one public and one private, provide direct evidence for exploring data capital opportunities offered by social media clout.



Force-directed layout figures (can be opened in Google Chrome)
Tang C. (corresponding author). Submitted. “Embedding, Aligning, and Reconstructing Clinical Notes to Explore Sepsis.” BMC Research Notes.
Background The underrepresentation of exploratory analysis tools for clinical notes has limited the diversity of data insights on medically relevant applications.
Results We characterize how exploratory analysis can affect representation learning on clinical narratives and present several self-developed tools to explore sepsis. Our experiments focus on patients with sepsis in the MIMIC-III Clinical Database or in our institution's research patient data repository. First, we found that global embeddings assist in learning local representations of clinical notes. Second, aligning at any specific time facilitates the use of learning models by pooling more available clinical notes to form a training set. Furthermore, reconstruction of the timeline enhances downstream-processing techniques by emphasizing temporal expressions and temporal relationships in clinical documentation. We demonstrate that clustering helps plot various types of clinical notes against a scale, which conveys a sense of the range or spread of the data and is useful for understanding data correlations.
Conclusions Appropriate exploratory analysis tools provide keen insights into preprocessing clinical notes, thereby further enhancing downstream analysis capabilities and making data-driven medicine possible. Our examples can help generate better data representations of clinical documentation for models with improved performance and interpretability.
Tang C. (corresponding author). Submitted. “HitCompl: Non-Equidistant Dynamic Bayesian Networks for Risk Prediction Using Electronic Health Records.” Data Mining and Knowledge Discovery.
Patients with chronic diseases are reported to be at risk for unexpected complications, which usually worsen disease severity and can lead to disability and even death. This study constructs an unsupervised framework on electronic health records (EHRs), which we call HitCompl, for understanding complication risks to support patients in managing disease progression. We first retrieve patients with a targeted chronic disease and cluster their exact diagnoses into several groups. Based on non-equidistant dynamic Bayesian networks, we then address the problem of tracking disease severity over time. That is, our approach models the time course of progressive disease status as irregular key time steps, and then discovers causality among almost all complications relating to the targeted chronic disease. Experiments on real-world EHRs of 9,484 patients with diabetes derived interesting clinical insights on diabetes complications. Our results also demonstrate that the HitCompl framework is often on par with deep learning models in both accuracy and efficiency for training and evaluation.
Tang C. (corresponding author). Submitted. “Unsupervised Synthetic Patient Simulations for Modeling the Progression of Type II Diabetes.” IEEE Journal of Biomedical and Health Informatics.

Type II diabetes is a preventable chronic disease. People with prediabetes have few symptoms, if any, and often do not discover their condition until complications develop. However, little is known about the progression of type II diabetes from prediabetes to overt diabetes due to a lack of data that can be used to track the natural history of the disease. In this study, we construct an unsupervised progression model for type II diabetes from a corpus of incomplete clinical data. By making use of the generative nature of our model, we introduce the notion of synthetic patients to simulate the entire progression path of type II diabetes. We demonstrate that modeling the full progression trajectory from a set of incomplete longitudinal medical records that only cover short segments of the progression enables prediction of the onset of complications from diabetes. Validation on a real-world patient cohort with type II diabetes associated with intensive care derived some interesting clinical insights, such as autoimmune diseases, one of the most infrequently reported complications of type II diabetes.


Tang C. (first author). 1/2/2021. “Estimating Time to Progression of Chronic Obstructive Pulmonary Disease with Tolerance.” IEEE Journal of Biomedical and Health Informatics, 25, 1, Pp. 175-180.
This paper proposes a tolerance range of the upper and lower boundaries of a preset time segment for the basic machine learning algorithms such as linear regression (LR) and support vector machines (SVMs) and investigates improvement rate (IR) on the accuracy in predicting mortality risk in patients on a corpus of clinical notes. The corpus includes pulmonary, cardiology, and radiology reports of 15,500 patients with chronic obstructive pulmonary disease who died between 2011 and 2017. Their performance is compared against a state-of-the-art long short-term memory recurrent neural network model. The results demonstrate an overall improvement by machine learning approaches when considering an optimal tolerance range: the average IR of LR is 90.1% and the maximum IR of SVMs is 66.2%. We achieved very similar results to deep learning. In addition, this paper contrasts two temporal visualizations on pulmonary notes, which consisted of representative sentences at each time segment prior to death based on a fraction of theme words produced by a latent Dirichlet allocation model.
IEEE JBHI featured our article as the cover page article.
Tang C. (first author). 12/16/2020. “Data Sovereigns for the World Economy.” Humanities and Social Sciences Communications, 7, 184.
With the rise of data capital and its instantaneous economic effects, existing data-sharing agreements have become complicated and are insufficient for capitalizing on the full value of the data resource. The challenge is to figure out how to derive benefits from data via the right to data portability. Among these, data ownership issues are complex and currently lack a concept that enables the right to data portability, is conducive to the free flow of cross-border data, and assists in the economic agglomeration of cyberspace. We propose defining the term “data sovereign” as a person or entity with the ability to possess and protect the data. First, the word “sovereign” is borrowed from the fundamental economic notion of William H. Hutt’s “consumer sovereignty.” This notion of sovereignty is strengthened by Max Weber’s classic definition of “power” – the ability to possess any resource. We envision that data capital would provide greater “cross-border” convenience for engaging in transactions and exchanges with very different cultures and societies. In our formulation, data sovereign status is achieved when one both possesses the data and can defend against any attack on that data. Using “force” to protect data does not imply an abandonment of data sharing. Rather, it should be easy for an organization to enable the sharing of data and data products internally or with trusted partners. Examples of an attack on the data might be a data breach scandal, identity theft, or data terrorism. In the future, numerous tedious, time-consuming, non-artistic, manual occupational tasks can be replaced by data products that are part of a global data economy.
Tang C. (first author). 8/8/2020. “An Annotated Dataset of Tongue Images Supporting Geriatric Disease Diagnosis.” Data in Brief, 32, Pp. 106153.
Hospitalized geriatric patients are a highly heterogeneous group often with variable diseases and conditions. Physicians, and geriatricians especially, are devoted to seeking non-invasive testing tools to support a timely, accurate diagnosis. Chinese tongue diagnosis, mainly based on the color and texture of the tongue, offers a unique solution. To develop a non-invasive assessment tool using machine learning in supporting a timely, accurate diagnosis in the elderly, we created an annotated dataset of 668 tongue images collected from hospitalized geriatric patients in a tertiary hospital in Shanghai, China. Images were captured via a light-field camera using CIELAB color space (to simulate human visual perception) and then were manually labeled by a panel of subject matter experts after chart reviewing patients’ clinical information documented in the hospital’s information system. We expect that the dataset can assist in implementing a systematic means of conducting Chinese tongue diagnosis, predicting geriatric syndromes using tongue appearance, and even developing an mHealth application to provide individualized health suggestions for the elderly.
IEEE DataPort: 699+ reviews, 9 more messages
Harvard Dataverse: 2400+ downloads, 2 more messages
A Data Display: 3 samples and 1 video
Tang C. (first author). 6/6/2020. “A Clustering Algorithm Based on Document Embedding to Identify Clinical Note Templates.” Annals of Data Science.
This paper proposes a novel unsupervised clustering algorithm based on document embedding to generate clinical note templates. We adapted Charikar’s SimHash to embed each clinical document into a vector representation. We modified the traditional K-means algorithm to merge any two clusters whose centroids are very close. Under the K-means paradigm, our algorithm designates the document whose vector is closest to the cluster centroid as the prototype template. On a corpus of clinical notes, we evaluated the feasibility of utilizing our algorithm at the individual author level. The corpus contains 1,063,893 clinical notes corresponding to 19,146 unique providers between January 2011 and July 2016. Our algorithm achieved more than 80% precision and runs in O(n) time. We further validated our algorithm using human annotators, who reported that it is able to efficiently detect a real clinical document that can represent the other documents in the same cluster at both the department level and the individual clinician level.
DOI: 10.1007/s40745-020-00296-8
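The SimHash embedding mentioned in the abstract above can be illustrated with a short sketch. This is the generic textbook form of Charikar's SimHash over word tokens, not the adapted version used in the paper; the whitespace tokenization, the MD5-based token hash, the 64-bit width, and all function names are assumptions made for illustration.

```python
import hashlib

BITS = 64  # fingerprint width; an illustrative choice


def token_hash(token):
    """Map a token to a 64-bit integer using the first 8 bytes of its MD5 digest."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(tokens):
    """Return a 64-bit SimHash fingerprint for a list of tokens."""
    counts = [0] * BITS
    for token in tokens:
        h = token_hash(token)
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical note text, tokenized by whitespace.
note = "patient denies chest pain and shortness of breath".split()
fp = simhash(note)
```

Documents that share most of their tokens tend to receive fingerprints with a small Hamming distance, which is what makes the fingerprint usable as an input to a clustering step.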
Tang C. (first author). 4/20/2020. “Following Data as it Crosses Borders During the COVID-19 Pandemic.” Journal of the American Medical Informatics Association.
Data changes the game in terms of how we respond to pandemics. Global data on disease trajectories and the effectiveness and economic impact of different social distancing measures are essential to facilitate effective local responses to pandemics. COVID-19 data flowing across geographic borders are extremely useful to public health professionals for many purposes such as accelerating the pharmaceutical development pipeline, and for making vital decisions about intensive care unit rooms, where to build temporary hospitals, or where to boost supplies of personal protection equipment, ventilators, or diagnostic tests. Sharing data enables quicker dissemination and validation of pharmaceutical innovations, as well as improved knowledge of what prevention and mitigation measures work. Even if physical borders around the globe are closed, it is crucial that data continues to transparently flow across borders to enable a data economy to thrive which will promote global public health through global cooperation and solidarity.
Tang C. (coauthor). 12/23/2019. “Heterogeneous network embedding enabling accurate disease association predictions.” BMC Medical Genomics, 12, Suppl 10, Pp. 186.
Background It is important to elucidate the complex biological mechanisms of various diseases in biomedical research. Recently, the growing generation of massive volumes of data in genomics, epigenomics, metagenomics, proteomics, metabolomics, nutriomics, etc., has resulted in the rise of systematic biological means of exploring complex diseases. However, the gap between the generation of these data and our ability to analyze them has gradually broadened. Furthermore, we observe that many of the aforementioned data can be represented by networks. Based on the vector representations learned by network embedding methods, entities that are close to each other but at present have no known direct links have high potential to be related and therefore are good candidate subjects for future biological research.
Results We integrate six public databases to construct a heterogeneous network containing three types of entities (i.e., genes, miRNAs, diseases). To tackle the inherent heterogeneity, we propose a network embedding method to learn a low-dimensional vector space which best preserves the relationships between entities. We then conduct disease-gene and disease-miRNA association predictions, the results of which show the superiority of our novel method over several state-of-the-art methods. Furthermore, many associations predicted by our method are verified in the latest real-world dataset.
Conclusions We propose a novel heterogeneous network embedding method which can make full use of the rich contextual information and structures of the heterogeneous network. We further demonstrate the effectiveness of our method in directing biological experiments, which can assist in identifying new hypotheses in biological investigation.
Tang C. (first author). 12/17/2019. “A Temporal Visualization of Chronic Obstructive Pulmonary Disease Progression Using Deep Learning and Unstructured Clinical Notes.” BMC Medical Informatics and Decision Making, 19, Suppl 8, Pp. 258.
Background Chronic obstructive pulmonary disease (COPD) is a progressive lung disease that is classified into stages based on disease severity. We aimed to characterize the time to progression prior to death in patients with COPD and to generate a temporal visualization that describes signs and symptoms during different stages of COPD progression.
Methods We present a two-step approach for visualizing COPD progression at the level of unstructured clinical notes. We included 15,500 COPD patients who both received care within Partners Healthcare’s network and died between 2011 and 2017. We first propose a four-layer deep learning model that utilizes a specially configured recurrent neural network to capture irregular time lapse segments. Using those irregular time lapse segments, we created a temporal visualization (the COPD atlas) to demonstrate COPD progression, which consisted of representative sentences at each time window prior to death based on a fraction of theme words produced by a latent Dirichlet allocation model. We evaluated our approach on an annotated corpus of COPD patients’ unstructured pulmonary, radiology, and cardiology notes.  
Results Experiments compared to the baselines showed that our proposed approach improved interpretability as well as the accuracy of estimating COPD progression.
Conclusions Our experiments demonstrated that the proposed deep-learning approach to handling temporal variation in COPD progression is feasible and can be used to generate a graphical representation of disease progression using information extracted from clinical notes.
Tang C. (first author). 6/28/2019. “Visualizing Literature Review Theme Evolution on Timeline Maps: Comparison Across Disciplines.” IEEE Access, 7, 1, Pp. 90597-90607.
Data-driven visualization techniques can be utilized to enhance the literature review process across different disciplines. In this work, 910 articles were retrieved using keyword search from bibliographic databases of two different disciplines (Computer Science – DBLP and Medicine – MEDLINE) between 2001 and 2016. These articles’ titles were processed using dynamic Latent Dirichlet Allocation to generate a set of themes/topics, which were subsequently classified and assigned to regions in a spatiotemporal geographical map. Resulting data visualizations from both repositories were manually reviewed by independent annotators. The results from DBLP and MEDLINE were comparable and, taken together, suggest potential benefits of increased future interaction among multidisciplinary fields. Our findings indicate that spiral timeline maps have the potential to help researchers acquire or compare knowledge efficiently without prior domain knowledge.
Tang C. (first author). 3/22/2019. “Medication Use for Childhood Pneumonia at a Children's Hospital in Shanghai, China: Analysis of Pattern Mining Algorithms.” JMIR Medical Informatics, 7, 1, Pp. e12577.
Background: Pattern mining utilizes multiple algorithms to explore objective and sometimes unexpected patterns in real-world data. This technique could be applied to electronic medical record data mining; however, it first requires a careful clinical assessment and validation.
Objective: The aim of this study was to examine the use of pattern mining techniques on a large clinical dataset to detect treatment and medication use patterns for childhood pneumonia.
Methods: We applied 3 pattern mining algorithms to 680,138 medication administration records from 30,512 childhood inpatients with diagnosis of pneumonia during a 6-year period at a children’s hospital in China. Patients’ ages ranged from 0 to 17 years, where 37.53% (11,453/30,512) were 0 to 3 months old, 86.55% (26,408/30,512) were under 5 years, 60.37% (18,419/30,512) were male, and 60.10% (18,338/30,512) had a hospital stay of 9 to 15 days. We used the FP-Growth, PrefixSpan, and USpan pattern mining algorithms. The first 2 are more traditional methods of pattern mining and mine a complete set of frequent medication use patterns. PrefixSpan also incorporates an administration sequence. The newer USpan method considers medication utility, defined by the dose, frequency, and timing of use of the 652 individual medications in the dataset. Together, these 3 methods identified the top 10 patterns from 6 age groups, forming a total of 180 distinct medication combinations. These medications encompassed the top 40 (73.66%, 500,982/680,138) most frequently used medications. These patterns were then evaluated by subject matter experts to summarize 5 medication use and 2 treatment patterns.
Results: We identified 5 medication use patterns: (1) antiasthmatics and expectorants and corticosteroids, (2) antibiotics and (antiasthmatics or expectorants or corticosteroids), (3) third-generation cephalosporin antibiotics with (or followed by) traditional antibiotics, (4) antibiotics and (medications for enteritis or skin diseases), and (5) (antiasthmatics or expectorants or corticosteroids) and (medications for enteritis or skin diseases). We also identified 2 frequent treatment patterns: (1) 42.89% (291,701/680,138) of specific medication administration records were of intravenous therapy with antibiotics, diluents, and nutritional supplements and (2) 11.53% (78,390/680,138) were of various combinations of inhalation of antiasthmatics, expectorants, or corticosteroids. Fleiss kappa for the subject experts’ evaluation was 0.693, indicating moderate agreement.
Conclusions: Utilizing a pattern mining approach, we summarized 5 medication use patterns and 2 treatment patterns. These warrant further investigation.
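The agreement statistic reported in the Results above, Fleiss' kappa, can be computed from a table of rating counts. The sketch below follows the standard formula; the function name and the small example tables are illustrative assumptions, and the study's own rating data are not reproduced here.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table of rating counts.

    table[i][j] is the number of raters who assigned subject i to
    category j. Every subject must be rated by the same number of raters.
    """
    n_subjects = len(table)
    n_raters = sum(table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in table):
        raise ValueError("each subject needs the same number (>= 2) of ratings")
    n_categories = len(table[0])

    # Mean observed agreement across subjects.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ) / n_subjects

    # Agreement expected by chance from the overall category proportions.
    proportions = [
        sum(row[j] for row in table) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")

    return (p_bar - p_expected) / (1 - p_expected)


# Hypothetical example: 2 subjects, 3 raters, 2 categories.
perfect = fleiss_kappa([[3, 0], [0, 3]])
split = fleiss_kappa([[2, 1], [1, 2]])
```

A value of 1 means the raters agree completely, while values at or below 0 mean agreement no better than chance.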
Tang C. (first author). 11/22/2018. “Rethinking Data Sharing at the Dawn of a Health Data Economy: A Viewpoint.” J Med Internet Res, 20, 11, Pp. e11519.
A healthcare data economy has begun to form, but its rise has been tempered by the profound lack of sharing of both data and data products such as models, intermediate results, and annotated training corpora, and this severely limits the potential for triggering economic cluster effects. Economic cluster effects represent a means to elicit benefit from economies of scale from internal data innovations and are beneficial because they may mitigate challenges from external sources. Within institutions, data product sharing is needed to spark data entrepreneurship and data innovation, and cross-institutional sharing is also critical, especially for rare conditions.
Tang C. (first author). 8/23/2017. “Comment Topic Evolution on a Cancer Institution's Facebook Page.” Appl Clin Inform., 8, 3, Pp. 854-865.
Objectives: Our goal was to identify and track the evolution of the topics discussed in free-text comments on a cancer institution’s social media page.
Methods: We utilized the Latent Dirichlet Allocation model to extract ten topics from free-text comments on a cancer research institution’s Facebook™ page between January 1, 2009, and June 30, 2014. We calculated Pearson correlation coefficients between the comment categories to demonstrate topic intensity evolution.
Results: A total of 4,335 comments were included in this study, from which ten topics were identified: greetings (17.3%), comments about the cancer institution (16.7%), blessings (10.9%), time (10.7%), treatment (9.3%), expressions of optimism (7.9%), tumor (7.5%), father figure (6.3%), and other family members & friends (8.2%), leaving 5.1% of comments unclassified. The comment distributions reveal an overall increasing trend during the study period. We discovered a strong positive correlation between greetings and other family members & friends (r=0.88; p<0.001), a positive correlation between blessings and the cancer institution (r=0.65; p<0.05), and a negative correlation between blessings and greetings (r=–0.70; p<0.05).
Conclusions: A cancer institution’s social media platform can provide emotional support to patients and family members. Topic analysis may help institutions better identify and support the needs (emotional, instrumental, and social) of their community and influence their social media strategy.
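The Pearson correlation coefficients used in the Methods above to compare comment categories can be computed as follows. This is a minimal sketch of the standard formula; the function name and the toy per-period counts are illustrative assumptions and are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Hypothetical comment counts per period for two topic categories.
topic_a = [1, 2, 3]
topic_b = [2, 4, 6]
r_same_direction = pearson_r(topic_a, topic_b)
r_opposite = pearson_r(topic_a, list(reversed(topic_b)))
```

Values near 1 indicate that two topics rise and fall together over time, and values near -1 indicate that one rises as the other falls.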

This paper was selected as one of the 15 candidate best papers (among 32,958 papers) in the cancer informatics section of the 2018 IMIA (International Medical Informatics Association) Yearbook.


Tang C. (first author). 2012. “Similarity Query of Time Series Sub-Sequences Based on LSH.” Jisuanji Xuebao (Chinese Journal of Computers), 35, 11, Pp. 2228-2236.
Subsequence similarity query, which includes range query and k-nearest-neighbor query, is an important operation on time series. Most existing algorithms are based on the Euclidean distance or the DTW distance, whose weakness is poor time efficiency. We propose a new distance measure based on locality-sensitive hashing (LSH), which greatly improves efficiency while ensuring the quality of the query results. We also propose an index structure named DS-Index. Using DS-Index, we prune the query candidates and thus propose two optimal algorithms: OLSH-Range and OLSH-kNN. Our experiments conducted on real stock exchange transaction sequence datasets show that the algorithms can quickly and accurately find similarity query results.
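The general idea of using LSH to prune subsequence candidates can be sketched as below. The abstract does not specify the hash family or the layout of DS-Index, so this example swaps in random-hyperplane (sign projection) hashing and a plain dictionary as the index; every name and parameter here is a hypothetical choice for illustration, not the paper's method.

```python
import random


def make_hyperplanes(window, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes of dimension `window`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(window)] for _ in range(n_bits)]


def signature(subsequence, planes):
    """Hash a subsequence to a tuple of sign bits after mean-centering."""
    mean = sum(subsequence) / len(subsequence)
    centered = [value - mean for value in subsequence]
    return tuple(
        1 if sum(p * v for p, v in zip(plane, centered)) >= 0 else 0
        for plane in planes
    )


def build_index(series, window, planes):
    """Map each signature to the start offsets of the windows that produce it."""
    index = {}
    for start in range(len(series) - window + 1):
        key = signature(series[start:start + window], planes)
        index.setdefault(key, []).append(start)
    return index


def candidate_starts(query, index, planes):
    """Return start offsets whose windows fall in the same bucket as the query."""
    return index.get(signature(query, planes), [])


# Hypothetical price-like series and a query taken from it.
series = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0]
planes = make_hyperplanes(window=4, n_bits=8)
index = build_index(series, 4, planes)
matches = candidate_starts(series[2:6], index, planes)
```

Only the windows that land in the query's bucket need an exact distance check afterwards, which is where the time saving over a full Euclidean or DTW scan would come from.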

This paper was awarded the Sa Shixuan Best Student Paper award at the 29th National Database Conference of China in 2012.


Tang C. (second author). 12/2011. “dbHCCvar: A Comprehensive Database of Human Genetic Variations in Hepatocellular Carcinoma.” Hum Mutat., 32, 12, Pp. E2308-16.
Hepatocellular carcinoma (HCC) is a common cancer with a high mortality rate. The pathogenesis of HCC is not completely understood, and highly efficient therapy is still unavailable. In the past several decades, various genetic variations such as mutations and polymorphisms have been reported to be associated with HCC risk, progression, survival, and recurrence. However, to our knowledge, these genetic variations have not been comprehensively and systematically compiled. In this study, we constructed dbHCCvar, a free online database of human genetic variations in HCC. Eligible publications were collected from PubMed, and detailed information and major research data from each eligible study were then extracted and recorded in our database. As a result, dbHCCvar contains almost all human genetic variations reported to date to be associated or not associated with HCC risk, clinical pathology, drug reaction, survival, or recurrence. It is expected that dbHCCvar will function as a useful tool for researchers to facilitate the search and identification of new genetic markers for HCC. dbHCCvar is free for all visitors at