In Preparation
Tang C. (first author). In Preparation. “Addressing Disparities in Diabetes Using Temporal Fairness Models.” Journal of the American Medical Informatics Association.
To address disparities in diabetes, we optimized a one-step feedback delayed model to evaluate three fairness criteria (i.e., maximum utility, equal opportunity, demographic parity) in 459,280 real-world diabetes cases. Temporal fairness models can help foresee the impact a fairness criterion would have if enforced as a constraint in a classification system. Our main finding is that disparities exist, as utility maximization results in relative harm, but that this might be mitigated with an unconstrained utility objective.
The abstract was accepted by the virtual Discover Brigham 2020.
The podium abstract was accepted by the AMIA 2021 Annual Symposium.
a prior version.pptx
Tang C. (corresponding author). In Preparation. “Evaluating Fairness Criteria in Obesity Subgroups to Assess Risk for Incident Diabetic Complications.” JAMA Network Open.
Objectives To compare how fairness criteria (maximum utility, equal opportunity, demographic parity) perform at predicting complications in subgroups of overweight and obese patients with type 2 diabetes.
Methods We conducted a retrospective cohort study of 459,280 patients with type 2 diabetes extracted from Mass General Brigham’s research patient data repository from January 2011 to December 2020. We characterized several obesity subgroups according to race/ethnicity (Caucasian, African American, Hispanic, and Asian), age (≤45 or >45), gender (male and female), and body mass index (overweight 25-29.9, affected by obesity 30-39.9, and morbid obesity 40 or greater). Maximum utility is focused on lowering the complication risk probability, as this will benefit the institution. Demographic parity results in equal selection rates across groups. Equal opportunity allows patients to independently choose to participate, resulting in equal true positive rates across groups. Fairness-criteria-based temporal machine learning models were utilized to describe and predict the risk of 11 incident diabetic complications (acute complications, cardiovascular, nephropathy, ophthalmopathy, peripheral vascular, cerebrovascular, neuropathy, metabolic complications, tumor, musculoskeletal, autoimmune diseases) through the derivation of outcome curves.
Results The outcome curves can be interpreted; for example, the maximum utility model was primarily in a state of relative harm (i.e., higher risk of cardiovascular, nephropathy, and metabolic complications) for the African American and Hispanic subgroups with morbid obesity when compared with the equal opportunity and demographic parity models.
Conclusions Our results suggest that fairness criteria can help select amongst trade-offs on personalized care, leading to a range of treatment planning requirements, and the ability to decrease medical costs.
The abstract was accepted for a poster presentation at the 9th Annual Obesity Research Incubator Session.
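The two group-level criteria named in the Methods above (equal selection rates for demographic parity, equal true positive rates for equal opportunity) can be illustrated with a minimal sketch. This is not the study's temporal model: the subgroup names, predictions, and outcomes below are made-up toy values, and the helper functions are hypothetical names introduced only for illustration.

```python
def selection_rate(preds):
    """Fraction of patients flagged (selected) by the model."""
    return sum(preds) / len(preds)


def true_positive_rate(preds, labels):
    """Fraction of patients with the outcome (label 1) that the model flagged."""
    flagged = [p for p, y in zip(preds, labels) if y == 1]
    return sum(flagged) / len(flagged)


# Toy, made-up predictions and outcomes for two hypothetical subgroups.
groups = {
    "A": {"preds": [1, 1, 0, 0], "labels": [1, 0, 1, 0]},
    "B": {"preds": [1, 0, 0, 0], "labels": [1, 1, 0, 0]},
}

rates = {
    name: {
        "selection_rate": selection_rate(g["preds"]),
        "tpr": true_positive_rate(g["preds"], g["labels"]),
    }
    for name, g in groups.items()
}

# Demographic parity compares selection rates; equal opportunity compares TPRs.
dp_gap = abs(rates["A"]["selection_rate"] - rates["B"]["selection_rate"])
eo_gap = abs(rates["A"]["tpr"] - rates["B"]["tpr"])
print(rates, dp_gap, eo_gap)
```

In this toy data the two subgroups have the same true positive rate but different selection rates, so equal opportunity holds while demographic parity does not.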
Tang C. (first author). In Preparation. “Evaluation of Hospital Rating Systems Through the Lens of Data Capital”.
Publicly reported quality and safety rating systems represent promising innovations for rating hospital performance. Still, rating the raters via the data is needed for meaningful comparisons and to establish integrated oversight. Data is now a kind of capital, on par with social and human capital, affecting hospital services. Using the lens of data capital, these ranking systems’ content (e.g., data, metrics) can be combined into a composite to offer another perspective.
Usnws network Usnews_1_Method Manual Vizient_2_Method Manual CMS_3__Method Manual Leapfrog_4__Method Manual Truven_5__Method Manual Newsweek_6_Method Manual.pdf semantic_triple.rar
Tang C. (first author). In Preparation. “Mobile Image Analysis for Urinalysis Strips Using Backpropagation Neural Network”.
Background Urine analysis has great potential in personalized care, given both its biological richness and its capacity to serve as a convenient and cost-effective medium for continuous health monitoring. Relevant diagnostics include, but are not limited to, urinary tract infection, kidney function, diabetes, pregnancy, and hydration testing. Smartphones and portable (or wearable) devices incorporate image sensors, offering a practical, accurate, and low-cost solution for initial self-diagnosis of disease, self-monitoring of health conditions, and preliminary examinations. This can help in developing new mHealth applications (apps).
Objectives This study aims to (1) develop an mHealth app that calls our proposed backpropagation (BP) neural network model for urinalysis strip image analysis, and (2) evaluate the feasibility of embedding model parameters in the app so that consumers keep control of their personal data by processing images on their mobile devices.
Methods We proposed a novel BP neural network-based model to identify color similarity between images shot by smartphone users and a standard colorimetric card. Our dataset contains 5,620 labeled urinalysis strip images. We chose four existing image recognition models as baselines. We designed two versions of the app for evaluation. One is a conventional informed consent-based personal data collector that sends users’ data to the server for image processing. The other embeds the model parameters to perform image analysis on the mobile device.
Results We evaluated our proposed model on our labeled dataset by randomly selecting two-thirds of the images as training data and the rest as testing data. The results indicate that our model performs much better than all baselines across a total of 5 testing items, with a maximum improvement rate of 28.2% and an average of 16.9%. We evaluated the two versions of the app on a subset of the data (457 urinalysis strip images). The findings demonstrate that the accuracy, efficiency, and consistency of the two are similar.
Conclusions While rich new streams of data have made it possible to tackle complex challenges in fields such as health care, we should be open about our data practices for new smart, connected products to protect individuals’ privacy choices. It is feasible for both parties to benefit from personal data collection through app design.
The abstract was accepted by the virtual Discover Brigham 2021.
uafig3.jpg uafig1.jpg uafig4.jpg uafig2.jpg
Tang C. (first author). In Preparation. “Multimodal Deep Learning in Health Care: Hopes and Challenges.” NEJM Catalyst.
An invited article in NEJM Catalyst AI theme issue.
Literature Review on Automatic Image-based Medical Report Generation
Tang C. (corresponding author). Submitted. “HitCompl: Non-Equidistant Dynamic Bayesian Networks for Risk Prediction Using Electronic Health Records.” In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Online: IEEE.
Patients with chronic diseases are reported to be at risk for unexpected complications, which usually worsen disease severity and can lead to disability and even death. This study constructs an unsupervised framework on electronic health records (EHRs), which we call HitCompl, for understanding complication risks to support patients in managing disease progression. We first retrieve patients with a targeted chronic disease to cluster their exact diagnoses into several groups. Based on non-equidistant dynamic Bayesian networks, we then address the problem of tracking disease severity over time. That is, our approach models the time course of progressive disease status as irregular key time steps, and then discovers causality among almost all complications relating to the targeted chronic disease. Experiments on real-world EHRs of 9,484 patients with diabetes derived interesting clinical insights on diabetes complications. Our results also demonstrate that the HitCompl framework is often on par with deep learning models in both accuracy and efficiency for training and evaluation.
Tang C. (corresponding author). Submitted. “Improving Research Patient Data Repositories from a Health Data Industry Viewpoint.” Journal of Medical Internet Research.

Electronic patient data is critical to clinical and translational science, so improving its infrastructure, the research patient data repository (RPDR), is undoubtedly a major endeavor in support of any strategy in biomedical data science. But the data science ecosystem, due to its inherently transdisciplinary nature, poses challenges to existing RPDRs and demands the creation of new ones. These call for a wide variety of functions, capabilities, and needs in administrative, educational, and organizational domains, to name just a few. The power of data science in the business realm is tremendous. It can reshape almost all prevailing views to generate a data industry viewpoint of how people interact with the value of data. This perspective borrows that viewpoint to promote best practices and innovations in RPDRs by showcasing previously unseen problems. These include deployment, contribution calculation, internal talent marketplace, data partnership, data sovereigns’ new capital assets, and cross-border data sharing.

Tang C. (corresponding author). Submitted. “Positive-Unlabeled Learning to Address Undercoding Problems When Using Deep Learning for Automated Clinical Coding.” In IEEE Journal of Biomedical and Health Informatics. Online: IEEE.
Deploying and implementing deep models on a large corpus with inconsistent and incorrect coding (including undercoding) is an understudied task. Poor medical coding accuracy can mislead the learning process of deep models. This study proposed a novel plugin algorithm based on positive-unlabeled learning to support learning models on undercoded clinical documents. Here we assume each code has the same probability of undercoding. Experiments conducted on the popular MIMIC-III dataset, comprising 46,157 discharge summaries from more than 40,000 patients, demonstrate the usefulness of the positive-unlabeled loss, with absolute improvements of 22.32, 2.09, and 20.83 on Mi-F (micro F1), Ma-F (macro F1), and EBF (example-based F1) scores respectively. Our results show that positive-unlabeled learning can relax the positive prediction penalty for codes not assigned to the samples.
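The idea of relaxing the penalty for predicting codes that were not assigned can be illustrated with a minimal sketch. This is not the loss from the paper, whose exact form is not given here. It is a plain binary cross-entropy in which the term for unassigned codes is scaled by a single weight, shared by all codes in line with the equal-undercoding assumption. The function name `relaxed_bce` and the parameter `unlabeled_weight` are hypothetical.

```python
import math


def relaxed_bce(probs, labels, unlabeled_weight=0.5, eps=1e-12):
    """Mean binary cross-entropy over codes.

    Codes with label 0 are treated as unlabeled rather than truly negative,
    so their penalty is scaled by `unlabeled_weight` (an assumed, illustrative
    hyperparameter; 1.0 recovers the standard cross-entropy).
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            total += -math.log(p)
        else:
            total += -unlabeled_weight * math.log(1.0 - p)
    return total / len(probs)


# Toy example: one assigned code, one unassigned code predicted with high probability.
probs = [0.9, 0.8]
labels = [1, 0]
standard = relaxed_bce(probs, labels, unlabeled_weight=1.0)
relaxed = relaxed_bce(probs, labels, unlabeled_weight=0.5)
print(standard, relaxed)
```

With the weight below 1.0, a confident prediction for an unassigned code costs less than under the standard loss, which is the behavior one would want if that code may simply have been left uncoded.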
Tang C. (first author). Submitted. “Quantifying Emerging Data Capital: An Experiment in Social Media Clout.” Harvard Data Science Review.

Significance An alternative method of using social media clout to explain the accumulation of capital assets is proposed. Measuring capital and its impacts on both individuals and society is of great interest to economists, social scientists, and policymakers. In this study, we make two interdisciplinary contributions from the data science perspective. First, we classify capital assets as data capital or non-data capital instead of tangibles or intangibles. Second, we introduce a data-driven experiment on social media clout to determine how individuals use capital accumulation for their own career or business prospects on social media platforms. Our findings indicate that social media data is both a social and economic resource and that users can utilize social media to maximize their own clout to gain potential benefits.

Data and code to replicate our results are available. As the two force-directed networks are dynamic, each run may look slightly different.
The submission has now been invited for a full manuscript.
A result map Subgroups corresponding to Appendix Table 3 Appendix
Tang C. (corresponding author). Submitted. “Using an Optimized Generative Model to Infer the Progression of Complications in Type 2 Diabetes Patients.” Computer Methods and Programs in Biomedicine.
Background People with type 2 diabetes often have few symptoms, if any, and do not discover their condition until complications develop. However, little is known about the progression of complications in type 2 diabetes due to data defects in electronic health records (e.g., incomplete records, discrete observation, irregular visits, and progression heterogeneity).
Objectives The aim of this study was to optimize a generative model to infer the stage of onset of associated complications in patients with type 2 diabetes.
Materials and Methods Our study utilized real-world longitudinal electronic health record data from 9,298 patients diagnosed with type 2 diabetes or prediabetes, spanning 11 years, from a regional healthcare delivery network of 17 hospitals in Shanghai, China. We used an optimized generative Markov-Bayesian-based model to generate 5,000 synthetic illness trajectories, which were manually reviewed by endocrinologists.
Results Optimizations that use anchor information to set model parameters turned highly sparse, noisy, irregular, and overly discrete EHRs into a specified number of complete synthetic illness trajectories covering diabetes-related complications. Given a target stage, it is straightforward to infer the risks of any complication at other stages, not only for transitions from an earlier state to a later state but also from a later state to an earlier one.
Conclusions Synthetic patient trajectories simulated by the generative model can compensate for a lack of real-world evidence over the desired longitudinal timeframe and contribute to treatment process discovery and conformance checking.
fig2.jpg fig4.jpg fig1.jpg fig3.jpg
Tang C. (corresponding author). Forthcoming. “Deep Reinforcement Learning for Transportation Network Combinatorial Optimization: A Survey.” Knowledge-Based Systems.
Traveling salesman and vehicle routing problems and their variants, as classic combinatorial optimization problems, have attracted considerable attention for decades because of their theoretical and practical value. Many classic algorithms have been proposed, for example, exact algorithms, heuristic algorithms, solution solvers, etc. Still, due to the complexity of these problems, even the most advanced traditional methods require too much computational time or are not well-defined mathematically; algorithm-based decision-making is no exception. Also, these methods cannot be generalized to a larger scale or to other similar problems. With the latest developments in machine and deep learning, it is widely believed to be feasible to apply reinforcement learning and other technologies to decision-making or to learning heuristics for combinatorial optimization. In this paper, we first give an overview of how to combine deep reinforcement learning with NP-hard combinatorial optimization, emphasizing general optimization problems as data points and exploring the relevant distribution of data used for learning in a given task. We next review state-of-the-art learning techniques related to combinatorial optimization problems on graphs. Then, we summarize the experimental methods of using reinforcement learning to solve combinatorial optimization problems and analyze the performance comparison of different algorithms. Lastly, we sort out the challenges encountered by deep reinforcement learning in solving combinatorial optimization problems and future research directions.
Tang C. (first author). 9/12/2021. “The Intersection of Big Data and Epidemiology for Epidemiologic Research: The Impact of the COVID-19 Pandemic.” International Journal for Quality in Health Care.
Big data epidemiology facilitates pandemic response by providing data-driven insights by utilizing big data tools that differ from traditional methods. Aspects regarding ‘garbage in, garbage out’, such as insufficient data, inaccessibility of data, missing data, uncertainty in handling data and bias in analysis or common findings are addressable by combining techniques across disciplines.
Graphical Abstract
Tang C. (first author). 4/14/2021. “Embedding, Aligning, and Reconstructing Clinical Notes to Explore Sepsis.” BMC Research Notes, 14, Pp. 136.
Objectives Our goal was to research and develop exploratory analysis tools for clinical notes, which are currently underrepresented, limiting the diversity of data insights for medically relevant applications.
Results We characterize how exploratory analysis can affect representation learning on clinical narratives and present several self-developed tools to explore sepsis. Our experiments focus on patients with sepsis in the MIMIC-III Clinical Database or in our institution’s research patient data repository. We found that global embeddings assist in learning local representations of clinical notes. Second, aligning at any specific time facilitates the use of learning models by pooling more available clinical notes to form a training set. Furthermore, reconstruction of the timeline enhances downstream-processing techniques by emphasizing temporal expressions and temporal relationships in clinical documentation. We demonstrate that clustering helps plot various types of clinical notes against a scale, which conveys a sense of the range or spread of the data and is useful for understanding data correlations. Appropriate exploratory analysis tools provide keen insights into preprocessing clinical notes, thereby further enhancing downstream analysis capabilities and making data-driven medicine possible. Our examples can help generate better data representation of clinical documentation for models with improved performance and interpretability.
Tang C. 2/1/2021. Data Capital: How Data is Reinventing Capital for Globalization. 1st ed., Pp. 391. Cham, Switzerland: Springer International Publishing AG (Signed the Contract in 2016).

This book defines and develops the concept of data capital. Using an interdisciplinary perspective, this book focuses on the key features of the data economy, systematically presenting the economic aspects of data science. The book (1) introduces an alternative interpretation on economists’ observation of which capital has changed radically since the twentieth century; (2) elaborates on the composition of data capital and its role as a factor of production; (3) describes morphological changes in data capital that influence its accumulation and circulation; (4) explains the rise of data capital as an underappreciated cause of phenomena from data sovereign, economic inequality, to stagnating productivity; (5) discusses hopes and challenges for industrial circles, the government and academia when faced with the intangible wealth brought by data (and by information and knowledge as well); (6) proposes the development of criteria for measuring and regulating data capital in the twenty-first century by looking at the prospects for data capital and its possible impact on future society.

Providing the first thorough introduction to the theory of data as capital, this book will be useful for those studying economics, data science, and business, as well as those in the financial industry who own, control, or wish to work with data resources.

Data Capital, n.
1. A human-created resource that is naturally one capital. 2. A digital, intangible capital form that claims to cover almost the digital part of all existing capital, from tangibles’ digital twin and intangibles’ measurable aspect, to financials. 3. The strategic economic resources for the data economy. 4. A parasitic economic logic to develop new forms of business that serve the industries within the first three categories of Fisher-Clark’s classification. 5. An intangible wealth marked by concentrations of information, knowledge, and wisdom unprecedented in human history. 6. A possible sovereign power that is subordinated to modern global architecture but has no physical boundaries. 7. The origin of a decentralized instrumentation power that asserts dominance over society and brings the opportunities for market democracy.
unnamed_2.jpg booksellerflyer bookcover.jpg Highlights, Acknowledge, and Contents Part I Part II Part III Part IV List of tables, illustrations, case studies, data sources, and definitions & Index
Tang C. (first author). 1/2/2021. “Estimating Time to Progression of Chronic Obstructive Pulmonary Disease with Tolerance.” IEEE Journal of Biomedical and Health Informatics, 1, 25, Pp. 175-180.
We defined tolerance range as the distance of observing similar disease conditions or functional status from the upper to the lower boundaries of a specified time interval. A tolerance range was identified for linear regression and support vector machines to optimize the improvement rate (defined as IR) on accuracy in predicting mortality risk in patients with chronic obstructive pulmonary disease using clinical notes. The corpus includes pulmonary, cardiology, and radiology reports of 15,500 patients who died between 2011 and 2017. Their performance was compared against a long short-term memory recurrent neural network. The results demonstrate an overall improvement by those basic machine learning approaches after considering an optimal tolerance range: the average IR of linear regression was 90.1% and the maximum IR of support vector machines was 66.2%. There was a similitude between the time segments produced by our tolerance algorithms and those produced by the long short-term memory.
IEEE JBHI featured our article as the cover page article.
jbih_removed_figure_2.jpg jbih_removed_figure_4.jpg jbih_fig1fig2_using_visio.docx copd_atlas_comparison_lstmtolerage.pptx 09086080.pdf 09313846.pdf
Tang C. (first author). 12/16/2020. “Data Sovereigns for the World Economy.” Humanities and Social Sciences Communications, 7, 184.

With the rise of data capital and its instantaneous economic effects, existing data-sharing agreements have become complicated and are insufficient for capitalizing on the full value of the data resource. The challenge is to figure out how to derive benefits from data via the right to data portability. Among these, data ownership issues are complex and currently lack a concept that enables the right to data portability, is conducive to the free flow of cross-border data, and assists in the economic agglomeration of cyberspace. We propose defining the term “data sovereign” as a person or entity with the ability to possess and protect the data. First, the word “sovereign” is borrowed from the fundamental economic notion of William H. Hutt’s “consumer sovereignty.” This notion of sovereignty is strengthened by Max Weber’s classic definition of “power” – the ability to possess any resource. We envision that data capital would provide greater “cross-border” convenience for engaging in transactions and exchanges with very different cultures and societies. In our formulation, data sovereign status is achieved when one both possesses the data and can defend against any attack on that data. Using “force” to protect data does not imply an abandonment of data sharing. Rather, it should be easy for an organization to enable the sharing of data and data products internally or with trusted partners. Examples of an attack on the data might be a data breach scandal, identity theft, or data terrorism. In the future, numerous tedious, time-consuming, non-artistry, manual occupational tasks can be replaced by data products that are part of a global data economy.


Tang C. (first author). 8/8/2020. “An Annotated Dataset of Tongue Images Supporting Geriatric Disease Diagnosis.” Data in Brief, 32, Pp. 106153.
Hospitalized geriatric patients are a highly heterogeneous group often with variable diseases and conditions. Physicians, and geriatricians especially, are devoted to seeking non-invasive testing tools to support a timely, accurate diagnosis. Chinese tongue diagnosis, mainly based on the color and texture of the tongue, offers a unique solution. To develop a non-invasive assessment tool using machine learning in supporting a timely, accurate diagnosis in the elderly, we created an annotated dataset of 668 tongue images collected from hospitalized geriatric patients in a tertiary hospital in Shanghai, China. Images were captured via a light-field camera using CIELAB color space (to simulate human visual perception) and then were manually labeled by a panel of subject matter experts after chart reviewing patients’ clinical information documented in the hospital’s information system. We expect that the dataset can assist in implementing a systematic means of conducting Chinese tongue diagnosis, predicting geriatric syndromes using tongue appearance, and even developing an mHealth application to provide individualized health suggestions for the elderly.
IEEE DataPort: 699+ reviews, 9 more messages
Harvard Dataverse: 2400+ downloads, 2 more messages
A Data Display: 3 samples and 1 video
Tang C. (first author). 6/6/2020. “A Clustering Algorithm Based on Document Embedding to Identify Clinical Note Templates.” Annals of Data Science.
This paper proposes a novel unsupervised document embedding based clustering algorithm to generate clinical note templates. We adapted Charikar’s SimHash to embed each clinical document into a vector representation. We modified the traditional K-means algorithm to merge any two clusters with centroids when they are very close. Under the K-means paradigm, our algorithm designates the cluster representative corresponding to the document vector closest to the centroid as the prototype template. On a corpus of clinical notes, we evaluated the feasibility of utilizing our algorithm at the individual author level. The corpus contains 1,063,893 clinical notes corresponding to 19,146 unique providers between January 2011 and July 2016. Our algorithm achieved more than 80% precision and runs in O(n) time complexity. We further validated our algorithm using human annotators who reported it is able to efficiently detect a real clinical document that can represent the other documents in the same cluster at both the department level and the individual clinician level.
DOI: 10.1007/s40745-020-00296-8
ads_fig6(time complexity).xlsx ads_fig2.png ads_fig6.png ads_fig1.jpg 10.1007_s40745-020-00296-8.pdf
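The SimHash embedding step mentioned in the abstract above can be illustrated with a minimal sketch of Charikar-style SimHash. It is not the authors' adapted version: the MD5-based token hash, the 64-bit width, whitespace tokenization, and the function names are all assumptions made for illustration.

```python
import hashlib


def simhash_vector(tokens, bits=64):
    """Signed bit-vote vector: each token adds +1 or -1 per bit of its hash.

    Uses the first 8 bytes of MD5 as a stable token hash, so bits must be <= 64.
    """
    counts = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return counts


def simhash_fingerprint(tokens, bits=64):
    """Binary fingerprint: 1 where the summed vote is positive, else 0."""
    return [1 if c > 0 else 0 for c in simhash_vector(tokens, bits)]


def hamming(a, b):
    """Number of positions where two fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


# Made-up note fragments, tokenized on whitespace for simplicity.
note_a = "patient presents with chest pain and shortness of breath".split()
note_b = "patient presents with chest pain and mild shortness of breath".split()

fp_a = simhash_fingerprint(note_a)
fp_b = simhash_fingerprint(note_b)
print(hamming(fp_a, fp_b))
```

The vote vectors from `simhash_vector` could serve as document vectors for a K-means-style clustering step, while the binary fingerprints support a quick Hamming-distance comparison between notes.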
Tang C. (first author). 4/20/2020. “Following Data as it Crosses Borders During the COVID-19 Pandemic.” Journal of the American Medical Informatics Association.
Data changes the game in terms of how we respond to pandemics. Global data on disease trajectories and the effectiveness and economic impact of different social distancing measures are essential to facilitate effective local responses to pandemics. COVID-19 data flowing across geographic borders are extremely useful to public health professionals for many purposes such as accelerating the pharmaceutical development pipeline, and for making vital decisions about intensive care unit rooms, where to build temporary hospitals, or where to boost supplies of personal protective equipment, ventilators, or diagnostic tests. Sharing data enables quicker dissemination and validation of pharmaceutical innovations, as well as improved knowledge of what prevention and mitigation measures work. Even if physical borders around the globe are closed, it is crucial that data continues to transparently flow across borders to enable a data economy to thrive which will promote global public health through global cooperation and solidarity.

This paper was mentioned by the 2021 IMIA Yearbook twice: in a survey of health informatics and health information management during the COVID-19 pandemic and in a survey of clinical information systems.

hms-todays_news_april_28_2020_paper_chase.pdf BWH research promotion link_JAMIA perspective ocaa063.pdf
Tang C. (coauthor). 12/23/2019. “Heterogeneous network embedding enabling accurate disease association predictions.” BMC Medical Genomics, 12, Suppl 10, Pp. 186.
Background It is important to elucidate the complex biological mechanisms of various diseases in biomedical research. Recently, the growing generation of massive volumes of data in genomics, epigenomics, metagenomics, proteomics, metabolomics, nutriomics, etc., has resulted in the rise of systematic biological means of exploring complex diseases. However, the gap between the generation of these multiple data and our ability to analyze them has gradually widened. Furthermore, we observe that many of the aforementioned data can be represented by networks. Based on the vector representations learned by network embedding methods, entities that are close to each other but at present do not have known direct links have high potential to be related and therefore are good candidate subjects for future biological research.
Results We integrate six public databases to construct a heterogeneous network containing three types of entities (i.e., genes, miRNAs, diseases). To tackle the inherent heterogeneity, we propose a network embedding method to learn a low-dimensional vector space which best preserves the relationships between entities. We then conduct disease-gene and disease-miRNA association predictions, the results of which show the superiority of our novel method over several state-of-the-art methods. Furthermore, many associations predicted by our method are verified in the latest real-world dataset.
Conclusions We propose a novel heterogeneous network embedding method which can make full use of the rich contextual information and structures of the heterogeneous network. We further demonstrate the effectiveness of our method in directing biological experiments, which can assist in identifying new hypotheses in biological investigation.