In Preparation
Tang C. (first author). In Preparation. “Evaluating Fairness Criteria That Address Disparities in Diabetes.” Journal of the American Medical Informatics Association.
Objective: We aim to compare fairness criteria on their ability to identify disparities in the temporal modeling of minority populations (i.e., African American, Hispanic, Asian) with diabetes when predicting the trend of long-term outcomes (i.e., improvement, stagnation, decline).
Methods: We utilized a one-step feedback delayed model to evaluate three fairness criteria (i.e., maximum utility, equal opportunity, demographic parity) via synthetic data generated from 459,280 real-world diabetes cases. We characterized long-term outcomes in discussing these fairness criteria and identified where they exhibit qualitatively different behavior.
Results: All three fairness criteria can lead to all possible outcomes (i.e., improvement, stagnation, decline) in natural parameter regimes. Our results demonstrate that, without a careful model of delayed outcomes, both the institution and the patient may cause harm in cases where an unconstrained objective would not.
Conclusions: The delayed model can help foresee the impact a fairness criterion would have if enforced as a constraint in a classification system.
The abstract has been accepted for virtual Discover Brigham 2020.
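As a rough illustration of two of the fairness criteria named in the abstract above, the sketch below measures how far a set of binary decisions is from demographic parity (equal selection rates across groups) and from equal opportunity (equal true-positive rates across groups). It uses the standard textbook definitions only; the function names and the toy labels are hypothetical and are not taken from the study, and the delayed-outcome model itself is not reproduced here.

```python
from typing import Sequence


def selection_rate(y_pred: Sequence[int]) -> float:
    """Fraction of cases that receive a positive decision."""
    return sum(y_pred) / len(y_pred)


def true_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of truly positive cases that receive a positive decision."""
    decisions_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(decisions_for_positives) / len(decisions_for_positives)


def demographic_parity_gap(pred_a: Sequence[int], pred_b: Sequence[int]) -> float:
    """Absolute difference in selection rates; 0 means demographic parity holds."""
    return abs(selection_rate(pred_a) - selection_rate(pred_b))


def equal_opportunity_gap(true_a, pred_a, true_b, pred_b) -> float:
    """Absolute difference in true-positive rates; 0 means equal opportunity holds."""
    return abs(true_positive_rate(true_a, pred_a) - true_positive_rate(true_b, pred_b))


# Toy labels and decisions for two hypothetical groups (illustrative only).
true_a, pred_a = [1, 1, 0, 0], [1, 1, 1, 0]
true_b, pred_b = [1, 1, 0, 0], [1, 0, 0, 0]

dp_gap = demographic_parity_gap(pred_a, pred_b)
eo_gap = equal_opportunity_gap(true_a, pred_a, true_b, pred_b)
print(f"demographic parity gap: {dp_gap:.2f}")
print(f"equal opportunity gap:  {eo_gap:.2f}")
```

A gap of zero means the corresponding criterion is satisfied exactly on the data at hand; in practice a small tolerance is usually allowed.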
Tang C. (corresponding author). In Preparation. “Literature Topic Modeling to Detect and Dispel Scientific Misconceptions.” In AMIA 2021 Annual Symposium. San Diego, CA, USA: AMIA.
In this paper, we investigate scientific misconceptions concerning epidemiological research to explore convergence in big data epidemiology. To this end, we detect increased concerns using a topic model that we previously designed for exploratory literature review, rather than conducting a full manual literature review. In the latter part of this study, we first distinguish between misconceptions that are easy to dispel (because they represent simple errors) and misconceptions that are, on the contrary, tough to dispel (because they are the product of pseudo explanations). We next propose spatiotemporal thinking as a fruitful means to deal with misconceptions. We define the core competencies and knowledge relevant to the practice of spatiotemporal thinking and discuss how they help us avoid misconceptions when converging big data and epidemiology.
Tang C. (first author). In Preparation. “Quantifying Emerging Data Capital: An Experiment in Social Media Clout.” Proceedings of the National Academy of Sciences of the United States of America.
Network analytics using a force-directed graph drawing algorithm can offer a variety of intuitive processes to understand social gravity. We assume that such gravity is influenced by social media clout, representing interpersonal relationships expressed by language response. While immediate or delayed language responses may define a distant or close relationship between two individuals, for data privacy reasons this research considers only the evolution of the topics discussed among in-group members (i.e., members who are alike). We first utilize a dynamic Latent Dirichlet Allocation model to extract a specified number of topics (10 for each page) from two Facebook™ pages between October 1, 2016, and September 30, 2018. One page belongs to a closed group with 18,946 members, created on December 3, 2006; the other belongs to an open group with 11,999 members, created on May 10, 2008 (member counts as of September 2018). A total of 3,952 people belong to both groups. Next, we present techniques for using social gravity as force-directed layouts to produce drawings of complex networks for these topics. The force-directed graphs are an intuitive representation of each topic’s origin and evolution in posts and of inter-human relations among the post categories, demonstrating real-time topic intensity. We then evaluate the resulting networks with Ramsey’s theorem. The findings from the two venture-capital groups, one public and one private, provide direct evidence for exploring the data capital opportunities offered by social media clout.



Force-directed layout figures (can be opened in Google Chrome)
Tang C. (corresponding author). Submitted. “Embedding, Aligning, and Reconstructing Clinical Notes to Explore Sepsis.” BMC Research Notes.
Background The underrepresentation of exploratory analysis tools for clinical notes has limited the diversity of data insights on medically relevant applications.
Results We characterize how exploratory analysis can affect representation learning on clinical narratives and present several self-developed tools to explore sepsis. Our experiments focus on patients with sepsis in the MIMIC-III Clinical Database or in our institution's research patient data repository. First, we found that global embeddings assist in learning local representations of clinical notes. Second, aligning at any specific time facilitates the use of learning models by pooling more available clinical notes to form a training set. Furthermore, reconstruction of the timeline enhances downstream-processing techniques by emphasizing temporal expressions and temporal relationships in clinical documentation. We demonstrate that clustering helps plot various types of clinical notes against a scale, which conveys a sense of the range or spread of the data and is useful for understanding data correlations.
Conclusions Appropriate exploratory analysis tools provide keen insights into preprocessing clinical notes, thereby further enhancing downstream analysis capabilities and making data-driven medicine possible. Our examples can help generate better data representation of clinical documentation for models with improved performance and interpretability.
Tang C. (corresponding author). Submitted. “HitCompl: Non-Equidistant Dynamic Bayesian Networks for Risk Prediction Using Electronic Health Records.” Data Mining and Knowledge Discovery.
Patients with chronic diseases are reported to be at risk for unexpected complications, which usually worsen disease severity and can lead to disability and even death. This study constructs an unsupervised framework on electronic health records (EHRs), which we call HitCompl, for understanding complication risks and supporting patients in managing disease progression. We first retrieve patients with a targeted chronic disease to cluster their exact diagnoses into several groups. Based on non-equidistant dynamic Bayesian networks, we then address the problem of tracking disease severity over time. That is, our approach models the time course of progressive disease status as irregular key time steps, and then discovers causality among almost all complications relating to the targeted chronic disease. Experiments on real-world EHRs of 9,484 patients with diabetes derived interesting clinical insights on diabetes complications. Our results also demonstrate that the HitCompl framework is often on par with deep learning models in both accuracy and efficiency for training and evaluation.
Tang C. (corresponding author). Submitted. “Unsupervised Synthetic Patient Simulations for Modeling the Progression of Type II Diabetes.” IEEE Journal of Biomedical and Health Informatics.

Type II diabetes is a preventable chronic disease. People with prediabetes have few symptoms, if any, and don’t discover their condition until complications develop. However, little is known about the progression of type II diabetes from prediabetes to overt diabetes due to a lack of data that can be used to track the natural history of the disease. In this study, we construct an unsupervised progression model for type II diabetes from a corpus of incomplete clinical data. By making use of the generative nature of our model, we introduce the notion of synthetic patients to simulate the entire progression path of type II diabetes. We demonstrate that modeling the full progression trajectory from a set of incomplete longitudinal medical records that only cover short segments of the progression enables prediction of the onset of complications from diabetes. Validation on a real-world cohort of patients with type II diabetes associated with intensive care yielded some interesting clinical insights, such as autoimmune diseases being among the most infrequently reported complications of type II diabetes.


fig3.jpg fig4.jpg fig1.jpg fig2.jpg
Tang C. 2/1/2021. Data Capital: How Data is Reinventing Capital for Globalization. 1st ed., Pp. 391. Cham, Switzerland: Springer International Publishing AG (Signed the Contract in 2016).

This book defines and develops the concept of data capital. Using an interdisciplinary perspective, this book focuses on the key features of the data economy, systematically presenting the economic aspects of data science. The book (1) introduces an alternative interpretation of economists’ observation that capital has changed radically since the twentieth century; (2) elaborates on the composition of data capital and its role as a factor of production; (3) describes morphological changes in data capital that influence its accumulation and circulation; (4) explains the rise of data capital as an underappreciated cause of phenomena ranging from data sovereigns and economic inequality to stagnating productivity; (5) discusses hopes and challenges for industrial circles, the government, and academia when faced with the intangible wealth brought by data (and by information and knowledge as well); (6) proposes the development of criteria for measuring and regulating data capital in the twenty-first century by looking at the prospects for data capital and its possible impact on future society.

Providing the first thorough introduction to the theory of data as capital, this book will be useful for those studying economics, data science, and business, as well as those in the financial industry who own, control, or wish to work with data resources.

Data Capital, n.
1. A human-created resource that is naturally one capital. 2. A digital, intangible capital form that claims to cover almost the digital part of all existing capital, from tangibles’ digital twin and intangibles’ measurable aspect, to financials. 3. The strategic economic resources for the data economy. 4. A parasitic economic logic to develop new forms of business that serve the industries within the first three categories of Fisher-Clark’s classification. 5. An intangible wealth marked by concentrations of information, knowledge, and wisdom unprecedented in human history. 6. A possible sovereign power that is subordinated to modern global architecture but has no physical boundaries. 7. The origin of a decentralized instrumentation power that asserts dominance over society and brings the opportunities for market democracy.
unnamed_2.jpg booksellerflyer bookcover.jpg Highlights, Acknowledgments, and Contents Part I Part II Part III Part IV List of tables, illustrations, case studies, data sources, and definitions & Index
Tang C. (first author). 1/2/2021. “Estimating Time to Progression of Chronic Obstructive Pulmonary Disease with Tolerance.” IEEE Journal of Biomedical and Health Informatics, 25, 1, Pp. 175-180.
This paper proposes a tolerance range for the upper and lower boundaries of a preset time segment for basic machine learning algorithms such as linear regression (LR) and support vector machines (SVMs), and investigates the improvement rate (IR) in accuracy when predicting mortality risk on a corpus of clinical notes. The corpus includes pulmonary, cardiology, and radiology reports of 15,500 patients with chronic obstructive pulmonary disease who died between 2011 and 2017. The performance of these algorithms is compared against a state-of-the-art long short-term memory recurrent neural network model. The results demonstrate an overall improvement by machine learning approaches when considering an optimal tolerance range: the average IR of LR is 90.1% and the maximum IR of SVMs is 66.2%. With the tolerance range, these approaches achieved results very similar to those of deep learning. In addition, this paper contrasts two temporal visualizations on pulmonary notes, which consisted of representative sentences at each time segment prior to death based on a fraction of theme words produced by a latent Dirichlet allocation model.
IEEE JBHI featured our article as the cover page article.
jbih_removed_figure_2.jpg jbih_removed_figure_4.jpg jbih_fig1fig2_using_visio.docx copd_atlas_comparison_lstmtolerage.pptx 09086080.pdf 09313846.pdf
Tang C. (first author). 12/16/2020. “Data Sovereigns for the World Economy.” Humanities and Social Sciences Communications, 7, 184.
With the rise of data capital and its instantaneous economic effects, existing data-sharing agreements have become complicated and are insufficient for capitalizing on the full value of the data resource. The challenge is to figure out how to derive benefits from data via the right to data portability. Data ownership issues in particular are complex and currently lack a concept that enables the right to data portability, is conducive to the free flow of cross-border data, and assists in the economic agglomeration of cyberspace. We propose defining the term “data sovereign” as a person or entity with the ability to possess and protect the data. First, the word “sovereign” is borrowed from the fundamental economic notion of William H. Hutt’s “consumer sovereignty.” This notion of sovereignty is strengthened by Max Weber’s classic definition of “power” – the ability to possess any resource. We envision that data capital would provide greater “cross-border” convenience for engaging in transactions and exchanges with very different cultures and societies. In our formulation, data sovereign status is achieved when one both possesses the data and can defend against any attack on that data. Using “force” to protect data does not imply an abandonment of data sharing. Rather, it should be easy for an organization to enable the sharing of data and data products internally or with trusted partners. Examples of an attack on the data might be a data breach scandal, identity theft, or data terrorism. In the future, numerous tedious, time-consuming, non-artistic, manual occupational tasks can be replaced by data products that are part of a global data economy.
Tang C. (first author). 8/8/2020. “An Annotated Dataset of Tongue Images Supporting Geriatric Disease Diagnosis.” Data in Brief, 32, Pp. 106153.
Hospitalized geriatric patients are a highly heterogeneous group often with variable diseases and conditions. Physicians, and geriatricians especially, are devoted to seeking non-invasive testing tools to support a timely, accurate diagnosis. Chinese tongue diagnosis, mainly based on the color and texture of the tongue, offers a unique solution. To develop a non-invasive assessment tool using machine learning in supporting a timely, accurate diagnosis in the elderly, we created an annotated dataset of 668 tongue images collected from hospitalized geriatric patients in a tertiary hospital in Shanghai, China. Images were captured via a light-field camera using CIELAB color space (to simulate human visual perception) and then were manually labeled by a panel of subject matter experts after chart reviewing patients’ clinical information documented in the hospital’s information system. We expect that the dataset can assist in implementing a systematic means of conducting Chinese tongue diagnosis, predicting geriatric syndromes using tongue appearance, and even developing an mHealth application to provide individualized health suggestions for the elderly.
IEEE DataPort: 699+ reviews, 9 more messages
Harvard Dataverse: 2400+ downloads, 2 more messages
A Data Display: 3 samples and 1 video
Tang C. (first author). 6/6/2020. “A Clustering Algorithm Based on Document Embedding to Identify Clinical Note Templates.” Annals of Data Science.
This paper proposes a novel unsupervised clustering algorithm based on document embedding to generate clinical note templates. We adapted Charikar’s SimHash to embed each clinical document into a vector representation. We modified the traditional K-means algorithm to merge any two clusters whose centroids are very close. Under the K-means paradigm, our algorithm designates the cluster representative corresponding to the document vector closest to the centroid as the prototype template. On a corpus of clinical notes, we evaluated the feasibility of utilizing our algorithm at the individual author level. The corpus contains 1,063,893 clinical notes corresponding to 19,146 unique providers between January 2011 and July 2016. Our algorithm achieved more than 80% precision and runs in O(n) time complexity. We further validated our algorithm using human annotators, who reported that it can efficiently detect a real clinical document that represents the other documents in the same cluster at both the department level and the individual clinician level.
DOI: 10.1007/s40745-020-00296-8
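The embedding step mentioned in the abstract above can be illustrated with a minimal sketch of the widely used token-hash form of SimHash, in which each token votes on every bit of a fixed-width fingerprint and similar documents end up with fingerprints that differ in few bits. This is not the authors' adaptation: the function names, the use of MD5 as the token hash, the 64-bit width, and the sample notes are all assumptions made for illustration.

```python
import hashlib
from collections import Counter
from typing import Iterable


def simhash(tokens: Iterable[str], bits: int = 64) -> int:
    """Return a SimHash fingerprint; bits must be a multiple of 8 and at most 128."""
    weights = Counter(tokens)  # token frequency acts as the vote weight
    votes = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A set bit votes +weight for that position, a clear bit votes -weight.
            votes[i] += weight if (token_hash >> i) & 1 else -weight
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Hypothetical note snippets, used only to exercise the functions.
note_1 = "patient reports shortness of breath and chronic cough".split()
note_2 = "patient reports shortness of breath and mild cough".split()
print(hamming_distance(simhash(note_1), simhash(note_2)))
```

Fingerprints like these can be compared by Hamming distance or unpacked into bit vectors before a centroid-based clustering step such as the modified K-means described in the abstract.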
ads_fig6(time complexity).xlsx ads_fig2.png ads_fig6.png ads_fig1.jpg 10.1007_s40745-020-00296-8.pdf
Tang C. (first author). 4/20/2020. “Following Data as it Crosses Borders During the COVID-19 Pandemic.” Journal of the American Medical Informatics Association.
Data changes the game in terms of how we respond to pandemics. Global data on disease trajectories and the effectiveness and economic impact of different social distancing measures are essential to facilitate effective local responses to pandemics. COVID-19 data flowing across geographic borders are extremely useful to public health professionals for many purposes, such as accelerating the pharmaceutical development pipeline and making vital decisions about intensive care unit rooms, where to build temporary hospitals, or where to boost supplies of personal protective equipment, ventilators, or diagnostic tests. Sharing data enables quicker dissemination and validation of pharmaceutical innovations, as well as improved knowledge of what prevention and mitigation measures work. Even if physical borders around the globe are closed, it is crucial that data continues to flow transparently across borders to enable a data economy to thrive, which will promote global public health through global cooperation and solidarity.
hms-todays_news_april_28_2020_paper_chase.pdf BWH research promotion link_JAMIA perspective ocaa063.pdf
Tang C. (coauthor). 12/23/2019. “Heterogeneous network embedding enabling accurate disease association predictions.” BMC Medical Genomics, 12, Suppl 10, Pp. 186.
Background Elucidating the complex biological mechanisms of various diseases is an important goal in biomedical research. Recently, the growing generation of massive volumes of data in genomics, epigenomics, metagenomics, proteomics, metabolomics, nutriomics, etc., has resulted in the rise of systematic biological means of exploring complex diseases. However, the gap between the generation of these multiple data types and our ability to analyze them has gradually broadened. Furthermore, we observe that many of the aforementioned data can be represented by networks. Based on the vector representations learned by network embedding methods, entities that are close to each other but at present do not have known direct links have high potential to be related and therefore are good candidate subjects for future biological research.
Results We integrate six public databases to construct a heterogeneous network containing three types of entities (i.e., genes, miRNAs, diseases). To tackle the inherent heterogeneity, we propose a network embedding method to learn a low-dimensional vector space that best preserves the relationships between entities. We conduct disease-gene and disease-miRNA association predictions, the results of which show the superiority of our novel method over several state-of-the-art methods. Furthermore, many associations predicted by our method are verified in the latest real-world dataset.
Conclusions We propose a novel heterogeneous network embedding method that can make full use of the rich contextual information and structures of the heterogeneous network. We further demonstrate the effectiveness of our method in directing biological experiments, which can assist in identifying new hypotheses in biological investigation.
Tang C. (first author). 12/17/2019. “A Temporal Visualization of Chronic Obstructive Pulmonary Disease Progression Using Deep Learning and Unstructured Clinical Notes.” BMC Medical Informatics and Decision Making, 19, Suppl 8, Pp. 258.
Background Chronic obstructive pulmonary disease (COPD) is a progressive lung disease that is classified into stages based on disease severity. We aimed to characterize the time to progression prior to death in patients with COPD and to generate a temporal visualization that describes signs and symptoms during different stages of COPD progression.
Methods We present a two-step approach for visualizing COPD progression at the level of unstructured clinical notes. We included 15,500 COPD patients who both received care within Partners Healthcare’s network and died between 2011 and 2017. We first propose a four-layer deep learning model that utilizes a specially configured recurrent neural network to capture irregular time lapse segments. Using those irregular time lapse segments, we created a temporal visualization (the COPD atlas) to demonstrate COPD progression, which consisted of representative sentences at each time window prior to death based on a fraction of theme words produced by a latent Dirichlet allocation model. We evaluated our approach on an annotated corpus of COPD patients’ unstructured pulmonary, radiology, and cardiology notes.  
Results Experiments compared to the baselines showed that our proposed approach improved interpretability as well as the accuracy of estimating COPD progression.
Conclusions Our experiments demonstrated that the proposed deep-learning approach to handling temporal variation in COPD progression is feasible and can be used to generate a graphical representation of disease progression using information extracted from clinical notes.
COPD_atlas.ppsx fig1.png fig2.png fig3.png fig4.png A COPD Temporal Visualization Poster s12911-019-0984-8.pdf
Tang C. (second author). 11/18/2019. “Data Reconstruction Based on Temporal Expressions in Clinical Notes.” In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Pp. 1004-1008. San Diego, CA, USA: IEEE.

Learning representations of clinical notes poses challenges in handling complex content that necessitates preprocessing steps to make the data more suitable for data mining. An important issue, addressed here, is that of temporal expressions, where cues indicate the time when clinical events occur. We present a three-step data reconstruction algorithm for transforming similar clinical entities (e.g., symptoms, complications) into sequential data through unsupervised annotation of temporal expressions. First, the data reconstruction algorithm detects if an expression has temporal intent. Second, it decomposes and rewrites the expression into non-temporal sub-expression and temporal constraints. Finally, it clusters similar non-temporal sub-expressions by using unsupervised sentence embedding under the modified K-medoids paradigm. We experimented with our proposed algorithm on clinical notes associated with chronic obstructive pulmonary disease (COPD). Visualizing reconstruction results of cardiology reports for a longitudinal cohort of patients with COPD demonstrated that this algorithm is feasible.

data_reconstruction_algorithm.pptx b349.pdf
Tang C. (first author). 6/28/2019. “Visualizing Literature Review Theme Evolution on Timeline Maps: Comparison Across Disciplines.” IEEE Access, 7, 1, Pp. 90597-90607.
Data-driven visualization techniques can be utilized to enhance the literature review process across different disciplines. In this work, 910 articles were retrieved using keyword search from bibliographic databases of two different disciplines (Computer Science – DBLP and Medicine – MEDLINE) between 2001 and 2016. These articles’ titles were processed using dynamic Latent Dirichlet Allocation to generate a set of themes/topics, which were subsequently classified and assigned to regions in a spatiotemporal geographical map. Resulting data visualizations from both repositories were manually reviewed by independent annotators. The results from DBLP and MEDLINE were comparable and, taken together, suggest potential benefits of increased future interaction among multidisciplinary fields. Our findings indicate that spiral timeline maps have the potential to help researchers acquire or compare knowledge efficiently without prior domain knowledge.
Tang C. (first author). 3/22/2019. “Medication Use for Childhood Pneumonia at a Children's Hospital in Shanghai, China: Analysis of Pattern Mining Algorithms.” JMIR Medical Informatics, 7, 1, Pp. e12577.
Background: Pattern mining utilizes multiple algorithms to explore objective and sometimes unexpected patterns in real-world data. This technique could be applied to electronic medical record data mining; however, it first requires a careful clinical assessment and validation.
Objective: The aim of this study was to examine the use of pattern mining techniques on a large clinical dataset to detect treatment and medication use patterns for childhood pneumonia.
Methods: We applied 3 pattern mining algorithms to 680,138 medication administration records from 30,512 childhood inpatients with diagnosis of pneumonia during a 6-year period at a children’s hospital in China. Patients’ ages ranged from 0 to 17 years, where 37.53% (11,453/30,512) were 0 to 3 months old, 86.55% (26,408/30,512) were under 5 years, 60.37% (18,419/30,512) were male, and 60.10% (18,338/30,512) had a hospital stay of 9 to 15 days. We used the FP-Growth, PrefixSpan, and USpan pattern mining algorithms. The first 2 are more traditional methods of pattern mining and mine a complete set of frequent medication use patterns. PrefixSpan also incorporates an administration sequence. The newer USpan method considers medication utility, defined by the dose, frequency, and timing of use of the 652 individual medications in the dataset. Together, these 3 methods identified the top 10 patterns from 6 age groups, forming a total of 180 distinct medication combinations. These medications encompassed the top 40 (73.66%, 500,982/680,138) most frequently used medications. These patterns were then evaluated by subject matter experts to summarize 5 medication use and 2 treatment patterns.
Results: We identified 5 medication use patterns: (1) antiasthmatics and expectorants and corticosteroids, (2) antibiotics and (antiasthmatics or expectorants or corticosteroids), (3) third-generation cephalosporin antibiotics with (or followed by) traditional antibiotics, (4) antibiotics and (medications for enteritis or skin diseases), and (5) (antiasthmatics or expectorants or corticosteroids) and (medications for enteritis or skin diseases). We also identified 2 frequent treatment patterns: (1) 42.89% (291,701/680,138) of specific medication administration records were of intravenous therapy with antibiotics, diluents, and nutritional supplements and (2) 11.53% (78,390/680,138) were of various combinations of inhalation of antiasthmatics, expectorants, or corticosteroids. Fleiss kappa for the subject experts’ evaluation was 0.693, indicating moderate agreement.
Conclusions: Utilizing a pattern mining approach, we summarized 5 medication use patterns and 2 treatment patterns. These warrant further investigation.
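The Fleiss kappa reported in the Results above measures agreement among a fixed number of raters beyond what chance would produce. The sketch below implements the standard formula on a table of per-item category counts; the function name and the tiny rating tables in the usage lines are illustrative assumptions and are not data from the study.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item must have the same number
    of ratings, and at least two raters are required."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement: based on the overall share of ratings in each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in shares)

    # Undefined (division by zero) when all ratings fall in a single category.
    return (observed - expected) / (1 - expected)


# Three raters, two items, two categories (illustrative tables only).
print(fleiss_kappa([[3, 0], [0, 3]]))  # raters agree completely on both items
print(fleiss_kappa([[2, 1], [1, 2]]))  # raters split two-to-one on both items
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than chance would give.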
medication_pattern_slices.ppt fig_2.png fig_3.png tangelalpdf.pdf
Tang C. (first author). 12/3/2018. “A Deep Learning Approach to Handling Temporal Variation in Chronic Obstructive Pulmonary Disease Progression.” In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Pp. 502-509. Madrid, Spain: IEEE.
Chronic Obstructive Pulmonary Disease (COPD) is a leading cause of mortality in the United States. Representing COPD progression using temporal graphs may offer critical clinical insights. Long short-term memory units in recurrent neural networks can process data with constant elapsed times between consecutive elements of a sequence but cannot handle irregular time intervals (i.e., segments of unequal duration). In this study, we propose a four-layer deep learning model that utilizes a specially configured recurrent neural network to capture irregular time lapse segments. Experiments on a corpus of COPD patients’ clinical notes compared to baseline algorithms showed that our model improved interpretability as well as the accuracy of estimating COPD progression.
Illustration of all three types of clinical notes in COPD patient (Fig. 4@Tableau).
B295.pdf nsf_award.jpg
Tang C. (coauthor). 12/3/2018. “Predicting Disease-Related Associations by Heterogeneous Network Embedding.” In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Pp. 548-555. Madrid, Spain: IEEE.
Elucidating biological mechanisms underlying complex diseases is an important goal in biomedical research. Recent advances in biological technology have enabled the generation of massive volumes of data in genomics, transcriptomics, proteomics, epigenomics, metagenomics, metabolomics, nutriomics, etc., leading to the emergence of a systems biology approach to investigating complex diseases. However, most of the data remain underutilized after their initial acquisition and analysis. There is a growing gap between the generation of the multifaceted data and our ability to integrate and analyze them. Inspired by the observation that many of the aforementioned data can be represented by networks, we propose a network-based model to encapsulate the rich information provided in each database and to connect across different databases. We integrate several public databases to construct a heterogeneous network in which nodes are entities such as genes, miRNAs, and diseases, and edges represent known relationships between them. One fundamental challenge is how to perform meaningful analysis on such a network, overcoming the intrinsic heterogeneity. We propose a network embedding method to learn a low-dimensional vector space that best preserves the known relationships between entities. Based on the learned vector representations, entities that are close to each other but currently do not have known direct connections are likely to have an association and therefore are good candidates for future investigation. In the experiments, we construct a heterogeneous network of genes, miRNAs, and diseases using data from six public databases. To evaluate the performance of the proposed method, we predict disease-gene and disease-miRNA associations. Comparison of our novel method with several state-of-the-art methods clearly demonstrates the advantage of our method, as it is the only one that takes full advantage of the rich contextual information provided by the heterogeneous network.
The encouraging results suggest that our method can provide help in identifying new hypotheses to guide future research. 
Tang C. (first author). 11/22/2018. “Rethinking Data Sharing at the Dawn of a Health Data Economy: A Viewpoint.” J Med Internet Res, 20, 11, Pp. e11519.
A healthcare data economy has begun to form, but its rise has been tempered by the profound lack of sharing of both data and data products such as models, intermediate results, and annotated training corpora, and this severely limits the potential for triggering economic cluster effects. Economic cluster effects represent a means to elicit benefit from economies of scale from internal data innovations and are beneficial because they may mitigate challenges from external sources. Within institutions, data product sharing is needed to spark data entrepreneurship and data innovation, and cross-institutional sharing is also critical, especially for rare conditions.