Publications by Type: Journal Article

In Preparation
Tang C. (first author). In Preparation. “Addressing Disparities in Diabetes Using Temporal Fairness Models.” Journal of the American Medical Informatics Association.
Abstract:
To address disparities in diabetes, we optimized a one-step delayed-feedback model to evaluate three fairness criteria (i.e., maximum utility, equal opportunity, and demographic parity) in 459,280 real-world diabetes cases. Temporal fairness models can help foresee the impact a fairness criterion would have if enforced as a constraint in a classification system. Our main finding is that disparities persist: maximizing utility results in relative harm, although this might be mitigated with an unconstrained utility objective.
The abstract has been accepted for virtual Discover Brigham 2020.
Tang C. (corresponding author). In Preparation. “Evaluating Fairness Criteria in Obesity Subgroups to Assess Risk for Incident Diabetic Complications.” JAMA Network Open.
Abstract:
Objectives To compare how fairness criteria (maximum utility, equal opportunity, demographic parity) perform at predicting complications in subgroups of overweight and obese patients with type 2 diabetes.
Methods We conducted a retrospective cohort study of 459,280 patients with type 2 diabetes extracted from Mass General Brigham’s research patient data repository from January 2011 to December 2020. We characterized several obesity subgroups according to race/ethnicity (Caucasian, African American, Hispanic, and Asian), age (≤45 years or >45 years), gender (male or female), and body mass index (overweight, 25-29.9; affected by obesity, 30-39.9; morbid obesity, 40 or greater). Maximum utility focuses on lowering the complication risk probability, which benefits the institution. Demographic parity results in equal selection rates across groups. Equal opportunity allows patients to choose independently whether to participate, resulting in equal true positive rates across groups. Fairness-criteria-based temporal machine learning models were used to describe and predict the risk of 11 incident diabetic complications (acute complications, cardiovascular, nephropathy, ophthalmopathy, peripheral vascular, cerebrovascular, neuropathy, metabolic complications, tumor, musculoskeletal, and autoimmune diseases) through the derivation of outcome curves.
Results The outcome curves can be interpreted; for example, the maximum utility model was primarily in a state of relative harm (i.e., higher risk of cardiovascular, nephropathy, and metabolic complications) for the African American and Hispanic subgroups with morbid obesity when compared with the equal opportunity and demographic parity models.
Conclusions Our results suggest that fairness criteria can help select among trade-offs in personalized care, leading to a range of treatment planning requirements and the ability to decrease medical costs.
The abstract was accepted for a poster presentation at the 9th Annual Obesity Research Incubator Session.
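The two group-based criteria named above can be illustrated with a toy computation (synthetic predictions and labels, invented for illustration; these are not the study’s models or data): demographic parity compares selection rates across groups, while equal opportunity compares true positive rates.

```python
# Hypothetical binary predictions and true labels for two groups.
def selection_rate(preds):
    """Fraction of the group selected (predicted positive)."""
    return sum(preds) / len(preds)

def true_positive_rate(preds, labels):
    """Fraction of truly positive members who were predicted positive."""
    pos = [p for p, y in zip(preds, labels) if y == 1]
    return sum(pos) / len(pos)

preds_a, labels_a = [1, 1, 0, 0], [1, 0, 1, 0]
preds_b, labels_b = [1, 0, 0, 0], [1, 1, 0, 0]

# Demographic parity is violated here (selection rates 0.5 vs 0.25),
# while equal opportunity holds (both true positive rates are 0.5).
```

This is why the criteria can disagree: equalizing one rate across groups generally leaves the other unequal, which is the kind of trade-off the outcome curves make visible.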
Tang C. (first author). In Preparation. “Mobile Image Analysis for Urinalysis Strips Using Backpropagation Neural Network.” JMIR mHealth and uHealth.
Abstract:
Background Urine analysis has great potential in personalized care, given both its biological richness and its capacity to serve as a convenient, cost-effective medium for continuous health monitoring. Relevant diagnostics include, but are not limited to, urinary tract infection, kidney function, diabetes, pregnancy, and hydration testing. Smartphones and portable (or wearable) devices incorporate image sensors, offering a practical, accurate, and low-cost solution for initial self-diagnosis of disease, self-monitoring of health conditions, and preliminary examinations. This can support the development of new mHealth applications (apps).
Objectives This study aims to (1) develop an mHealth app that calls our proposed backpropagation (BP) neural network model for urinalysis strip image analysis, and (2) evaluate the feasibility of embedding model parameters to give consumers control over personal data by performing image processing on mobile devices.
Methods We propose a novel BP neural network-based model to identify color similarity between urinalysis strip images taken by smartphone users and a standard colorimetric card. Our dataset contains 5,620 labeled urinalysis strip images. We chose four existing image recognition models as baselines. We designed two versions of the app for evaluation. One is a conventional, informed-consent-based personal data collector that sends users’ data to a server for image processing. The other embeds the model parameters to perform image analysis on the mobile device.
Results We experimented with our proposed model on the labeled dataset, randomly selecting two-thirds of the images as training data and the rest as testing data. The results indicate that our model performs much better than all baselines across a total of 5 testing items, with a maximum improvement rate of 28.2% and an average of 16.9%. We evaluated the two versions of the app on a subset of 457 urinalysis strip images. The findings demonstrate that the two versions are similar in accuracy, efficiency, and consistency.
Conclusions While rich new streams of data have made it possible to tackle complex challenges in fields such as health care, we should be open about our data practices for new smart, connected products to protect individuals’ privacy choices. It is feasible, through app design, for both parties to benefit from personal data collection.
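As a rough illustration of the modeling idea only (a minimal from-scratch sketch with invented reference colors, not the paper’s network or data), a one-hidden-layer backpropagation network can learn to map a normalized RGB reading from a strip pad to the nearest level of a hypothetical colorimetric card:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normalized RGB references for three card levels.
X = np.array([[0.9, 0.9, 0.2],   # level 0: yellow-ish pad
              [0.4, 0.7, 0.3],   # level 1: green-ish pad
              [0.2, 0.3, 0.6]])  # level 2: blue-ish pad
y = np.eye(3)                    # one-hot labels

W1 = rng.normal(0, 0.5, (3, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 3)); b2 = np.zeros(3)

def forward(x):
    h = np.tanh(x @ W1 + b1)                     # hidden layer
    z = h @ W2 + b2
    p = np.exp(z - z.max(axis=-1, keepdims=True))
    return h, p / p.sum(axis=-1, keepdims=True)  # softmax probabilities

for _ in range(3000):                            # plain gradient descent
    h, p = forward(X)
    d2 = (p - y) / len(X)                        # grad of cross-entropy wrt z
    d1 = (d2 @ W2.T) * (1 - h ** 2)              # backprop through tanh
    W2 -= 0.5 * (h.T @ d2); b2 -= 0.5 * d2.sum(0)
    W1 -= 0.5 * (X.T @ d1); b1 -= 0.5 * d1.sum(0)

def predict_level(rgb):
    """Return the index of the most probable card level for one reading."""
    return int(forward(np.asarray(rgb))[1].argmax())
```

Because the trained parameters are just a handful of small arrays (`W1`, `b1`, `W2`, `b2`), embedding them in a mobile app for on-device inference is straightforward, which is the feasibility question the second version of the app addresses.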
Tang C. (first author). In Preparation. “Quantifying Emerging Data Capital: An Experiment in Social Media Clout.” PNAS.
Abstract:

Force-directed network analytics provide an intuitive understanding of the layout process and its settings, such as the Barnes-Hut simulation (Ventimiglia and Wayne, 2003). Social gravity stresses the “moral” imperative for like-minded people to forge bonds dedicated to the common good. We assume that capital assets can be associated with social gravity; that is, with the sequences of linkages and social relationships influenced by social media clout. To offset black-box analysis of network functions, we utilize a dynamic Latent Dirichlet Allocation model to detect these language responses. While immediate or delayed language responses may indicate a distant or close relationship between two individuals, this research considers only the evolution of the topics discussed by in-group members (i.e., members who are alike) due to data privacy concerns. We first extracted a specified number of topics (10 each) from two venture capital groups’ Facebook™ pages between October 1, 2016, and September 30, 2018. The private group had 18,946 members, the public one had 11,999 members, and 3,952 members participated in both groups. Next, we present techniques for using social gravity, as manifest in force-directed graphs, to produce visualizations of the complex network of topics discussed in each group. The force-directed graphs are an intuitive representation of each topic’s origin and evolution in posts and of the interhuman relations among the post categories, demonstrating real-time topic intensity. We evaluated the networks with Ramsey’s Theorem (Weisstein, 1999). Our findings provide direct evidence in favor of exploring the data capital opportunities offered by social media clout.

Significance An alternative method of using social media clout to explain the accumulation of capital assets is proposed. Measuring capital and its impacts on both individuals and society is of great interest to economists, social scientists, and policymakers. In this study, we make two interdisciplinary contributions from the data science perspective. First, we classify capital assets as data capital or non-data capital instead of tangibles or intangibles. Second, we introduce a data-driven experiment on social media clout to determine how individuals use capital accumulation for their own career or business prospects on social media platforms. Our findings indicate that social media data is both a social and an economic resource and that users can maximize their own clout on social media to gain potential benefits.
Data and code to replicate our results are available. As the two force-directed networks are dynamic, each run may appear slightly different.
Tang C. (corresponding author). Submitted. “HitCompl: Non-Equidistant Dynamic Bayesian Networks for Risk Prediction Using Electronic Health Records.” Data Mining and Knowledge Discovery.
Abstract:
Patients with chronic diseases are at risk for unexpected complications, which usually worsen disease severity and can lead to disability and even death. This study constructs an unsupervised framework on electronic health records (EHRs), which we call HitCompl, for understanding complication risks so as to support patients in managing disease progression. We first retrieve patients with a targeted chronic disease and cluster their exact diagnoses into several groups. Based on non-equidistant dynamic Bayesian networks, we then address the problem of tracking disease severity over time. That is, our approach models the time course of progressive disease status as irregular key time steps and then discovers causality among almost all complications related to the targeted chronic disease. Experiments on real-world EHRs of 9,484 patients with diabetes derived interesting clinical insights on diabetes complications. Our results also demonstrate that the HitCompl framework is often on par with deep learning models in both accuracy and efficiency for training and evaluation.
Tang C. (first author). Submitted. “The Intersection of Big Data and Epidemiology for Epidemiologic Research.” International Journal for Quality in Health Care.
Abstract:
The sudden rise of big data in the public health sphere has led to increased misconceptions, such as “garbage in, garbage out,” when compared with traditional epidemiological methods. In actuality, big data comprises three critical elements: data, technology, and application. Common to big data and epidemiology is a focus on approaches for solving intricate problems. The largely non-overlapping preferences of the two fields may warrant a tighter integration. For example, epidemiologists are well versed in the science of study design and the art of causal inference, while data scientists have expertise in computational and visualization approaches for spatiotemporal data of high dimensionality. The intersection of big data and epidemiology can change the game in how we respond to pandemics like COVID-19. We recommend population-level thinking combined with spatiotemporal data analysis, which has great potential to transform big data and epidemiology, respectively.
Tang C. (corresponding author). Submitted. “Using an Optimized Generative Model to Infer the Progression of Complications in Type 2 Diabetes Patients.” JMIR Medical Informatics.
Abstract:
Background People with type 2 diabetes have few symptoms, if any, and often do not discover their condition until complications develop. However, little is known about the progression of complications in type 2 diabetes due to defects in electronic health record (EHR) data (e.g., incomplete records, discrete observations, irregular visits, and progression heterogeneity).
Objectives The aim of this study was to optimize a generative model to infer the stage of onset of associated complications in patients with type 2 diabetes.
Materials and Methods Our study utilized real-world longitudinal EHR data spanning 11 years from 9,298 patients diagnosed with type 2 diabetes or prediabetes in a 17-hospital regional healthcare delivery network in Shanghai, China. We used an optimized generative Markov-Bayesian-based model to generate 5,000 synthetic illness trajectories, which were manually reviewed by endocrinologists.
Results Optimizations using anchor information to set model parameters refined highly sparse, noisy, irregular, and overly discrete EHRs into a specified number of complete synthetic illness trajectories covering diabetes-related complications.
Discussion Given a target stage, it is straightforward to infer the risks of any complications at other stages, not merely from an earlier stage to a later one but also from a later stage to an earlier one.
Conclusions Synthetic patient trajectories simulated by the generative model can counter the lack of real-world evidence over a desired longitudinal timeframe while offering a strong level of privacy through a lower risk of identifying real patients.
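The generative idea can be sketched as a toy Markov chain (stage names and transition probabilities below are invented for illustration; the study’s model is an optimized Markov-Bayesian model fit to real EHRs):

```python
import random

# Hypothetical stages and transition probabilities for sampling
# synthetic illness trajectories.
TRANSITIONS = {
    "prediabetes":    [("type2_diabetes", 0.6), ("prediabetes", 0.4)],
    "type2_diabetes": [("nephropathy", 0.3), ("retinopathy", 0.2),
                       ("type2_diabetes", 0.5)],
    "nephropathy":    [("nephropathy", 1.0)],   # absorbing for simplicity
    "retinopathy":    [("retinopathy", 1.0)],
}

def sample_trajectory(start="prediabetes", steps=5, rng=None):
    """Sample one synthetic trajectory of `steps` transitions."""
    rng = rng or random.Random(0)
    state, path = start, [start]
    for _ in range(steps):
        states, probs = zip(*TRANSITIONS[state])
        state = rng.choices(states, weights=probs)[0]
        path.append(state)
    return path
```

Sampling many such trajectories yields a synthetic cohort whose stage-to-stage statistics can be inspected without exposing any real patient, which is the privacy benefit the conclusions describe.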
Tang C. (first author). 4/14/2021. “Embedding, Aligning, and Reconstructing Clinical Notes to Explore Sepsis.” BMC Research Notes, 14, Pp. 136.
Abstract:
Objectives Our goal was to research and develop exploratory analysis tools for clinical notes, which are currently underrepresented, limiting the diversity of data insights in medically relevant applications.
Results We characterize how exploratory analysis can affect representation learning on clinical narratives and present several self-developed tools to explore sepsis. Our experiments focus on patients with sepsis in the MIMIC-III Clinical Database or in our institution’s research patient data repository. We found that global embeddings assist in learning local representations of clinical notes. Second, aligning notes at any specific time facilitates the use of learning models by pooling more available clinical notes to form a training set. Furthermore, reconstruction of the timeline enhances downstream-processing techniques by emphasizing temporal expressions and temporal relationships in clinical documentation. We demonstrate that clustering helps plot various types of clinical notes against a scale, which conveys a sense of the range or spread of the data and is useful for understanding data correlations. Appropriate exploratory analysis tools provide keen insights into preprocessing clinical notes, thereby further enhancing downstream analysis capabilities and making data-driven medicine possible. Our examples can help generate better data representations of clinical documentation for models with improved performance and interpretability.
Tang C. (first author). 1/2/2021. “Estimating Time to Progression of Chronic Obstructive Pulmonary Disease with Tolerance.” IEEE Journal of Biomedical and Health Informatics, 25, 1, Pp. 175-180.
Abstract:
This paper proposes a tolerance range for the upper and lower boundaries of a preset time segment for basic machine learning algorithms such as linear regression (LR) and support vector machines (SVMs), and investigates the improvement rate (IR) in accuracy when predicting patients’ mortality risk on a corpus of clinical notes. The corpus includes pulmonary, cardiology, and radiology reports of 15,500 patients with chronic obstructive pulmonary disease who died between 2011 and 2017. Their performance is compared against a state-of-the-art long short-term memory recurrent neural network model. The results demonstrate an overall improvement by the machine learning approaches when considering an optimal tolerance range: the average IR of LR is 90.1% and the maximum IR of SVMs is 66.2%, results very similar to those of deep learning. In addition, this paper contrasts two temporal visualizations of pulmonary notes, which consist of representative sentences at each time segment prior to death based on a fraction of theme words produced by a latent Dirichlet allocation model.
IEEE JBHI featured our article on its cover.
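The tolerance mechanism can be made concrete with a small sketch (function names and the day counts are illustrative, not from the paper): a predicted time to death counts as correct when it falls within the upper and lower boundaries around the true time segment, and widening those boundaries is what produces the reported improvement rates.

```python
def within_tolerance(pred_days, true_days, lower=30, upper=30):
    """Accept a prediction inside [true - lower, true + upper]."""
    return true_days - lower <= pred_days <= true_days + upper

def tolerant_accuracy(preds, truths, lower=30, upper=30):
    """Fraction of predictions that land inside the tolerance range."""
    hits = sum(within_tolerance(p, t, lower, upper)
               for p, t in zip(preds, truths))
    return hits / len(truths)
```

Note that widening the tolerance can only leave accuracy unchanged or raise it, which is why the comparison against the strict (zero-tolerance) metric yields a non-negative improvement rate.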
Tang C. (first author). 12/16/2020. “Data Sovereigns for the World Economy.” Humanities and Social Sciences Communications, 7, 184.
Abstract:
With the rise of data capital and its instantaneous economic effects, existing data-sharing agreements have become complicated and are insufficient for capitalizing on the full value of the data resource. The challenge is to figure out how to derive benefits from data via the right to data portability. Among these, data ownership issues are complex and currently lack a concept that enables the right to data portability, is conducive to the free flow of cross-border data, and assists in the economic agglomeration of cyberspace. We propose defining the term “data sovereign” as a person or entity with the ability to possess and protect the data. First, the word “sovereign” is borrowed from the fundamental economic notion of William H. Hutt’s “consumer sovereignty.” This notion of sovereignty is strengthened by Max Weber’s classic definition of “power” – the ability to possess any resource. We envision that data capital would provide greater “cross-border” convenience for engaging in transactions and exchanges with very different cultures and societies. In our formulation, data sovereign status is achieved when one both possesses the data and can defend any attack on that data. Using “force” to protect data does not imply an abandonment of data sharing. Rather, it should be easy for an organization to enable the sharing of data and data products internally or with trusted partners. Examples of an attack on the data might be a data breach scandal, identity theft, or data terrorism. In the future, numerous tedious, time-consuming, non-creative, manual occupational tasks can be replaced by data products that are part of a global data economy.
Tang C. (first author). 8/8/2020. “An Annotated Dataset of Tongue Images Supporting Geriatric Disease Diagnosis.” Data in Brief, 32, Pp. 106153.
Abstract:
Hospitalized geriatric patients are a highly heterogeneous group often with variable diseases and conditions. Physicians, and geriatricians especially, are devoted to seeking non-invasive testing tools to support a timely, accurate diagnosis. Chinese tongue diagnosis, mainly based on the color and texture of the tongue, offers a unique solution. To develop a non-invasive assessment tool using machine learning in supporting a timely, accurate diagnosis in the elderly, we created an annotated dataset of 668 tongue images collected from hospitalized geriatric patients in a tertiary hospital in Shanghai, China. Images were captured via a light-field camera using CIELAB color space (to simulate human visual perception) and then were manually labeled by a panel of subject matter experts after chart reviewing patients’ clinical information documented in the hospital’s information system. We expect that the dataset can assist in implementing a systematic means of conducting Chinese tongue diagnosis, predicting geriatric syndromes using tongue appearance, and even developing an mHealth application to provide individualized health suggestions for the elderly.
IEEE DataPort: 699+ reviews, 9 more messages
Harvard Dataverse: 2400+ downloads, 2 more messages
Tang C. (first author). 6/6/2020. “A Clustering Algorithm Based on Document Embedding to Identify Clinical Note Templates.” Annals of Data Science.
Abstract:
This paper proposes a novel unsupervised document-embedding-based clustering algorithm to generate clinical note templates. We adapted Charikar’s SimHash to embed each clinical document into a vector representation. We modified the traditional K-means algorithm to merge any two clusters whose centroids are very close. Under the K-means paradigm, our algorithm designates as the prototype template the cluster representative corresponding to the document vector closest to the centroid. On a corpus of clinical notes, we evaluated the feasibility of utilizing our algorithm at the individual author level. The corpus contains 1,063,893 clinical notes from 19,146 unique providers between January 2011 and July 2016. Our algorithm achieves more than 80% precision and runs in O(n) time. We further validated the algorithm with human annotators, who reported that it can efficiently detect a real clinical document that represents the other documents in the same cluster at both the department level and the individual clinician level.
DOI: 10.1007/s40745-020-00296-8
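The embedding step can be illustrated with a minimal SimHash sketch (a generic implementation assuming simple MD5 token hashing, not the paper’s exact variant): notes instantiated from the same template share most tokens, so their fingerprints differ in only a few bits.

```python
import hashlib

def simhash(tokens, bits=64):
    """Charikar-style fingerprint: per-bit vote over token hashes."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Toy notes (invented): the first two differ by one token, the third is
# unrelated, so its fingerprint should be much farther away.
note_a = "patient seen in clinic today no acute distress".split()
note_b = "patient seen in clinic today mild acute distress".split()
note_c = "billing statement insurance copay amount due".split()
```

Because near-duplicate notes land at small Hamming distance, clustering on these fingerprints groups template instances together without pairwise text comparison, which is what keeps the overall algorithm linear in the number of documents.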
Tang C. (first author). 4/20/2020. “Following Data as it Crosses Borders During the COVID-19 Pandemic.” Journal of the American Medical Informatics Association.
Abstract:
Data changes the game in terms of how we respond to pandemics. Global data on disease trajectories and the effectiveness and economic impact of different social distancing measures are essential to facilitate effective local responses to pandemics. COVID-19 data flowing across geographic borders are extremely useful to public health professionals for many purposes, such as accelerating the pharmaceutical development pipeline and making vital decisions about intensive care unit rooms, where to build temporary hospitals, or where to boost supplies of personal protection equipment, ventilators, or diagnostic tests. Sharing data enables quicker dissemination and validation of pharmaceutical innovations, as well as improved knowledge of what prevention and mitigation measures work. Even if physical borders around the globe are closed, it is crucial that data continue to flow transparently across borders to enable a data economy to thrive, which will promote global public health through global cooperation and solidarity.
Tang C. (coauthor). 12/23/2019. “Heterogeneous network embedding enabling accurate disease association predictions.” BMC Medical Genomics, 12, Suppl 10, Pp. 186.
Abstract:
Background Elucidating the complex biological mechanisms of various diseases is important in biomedical research. Recently, the generation of massive volumes of data in genomics, epigenomics, metagenomics, proteomics, metabolomics, nutriomics, etc., has given rise to systematic biological means of exploring complex diseases. However, the gap between the generation of these multiple types of data and our ability to analyze them has gradually broadened. We observe that many of the aforementioned data can be represented by networks; based on the vector representations learned by network embedding methods, entities that are close to each other but do not at present have known direct links have high potential to be related and are therefore good candidate subjects for future biological research.
Results We integrate six public databases to construct a heterogeneous network containing three types of entities (i.e., genes, miRNAs, and diseases). To tackle the inherent heterogeneity, we propose a network embedding method to learn a low-dimensional vector space that best preserves the relationships between entities. We then conduct disease-gene and disease-miRNA association predictions, the results of which show the superiority of our novel method over several state-of-the-art methods. Furthermore, many associations predicted by our method are verified in the latest real-world dataset.
Conclusions We propose a novel heterogeneous network embedding method which can make full use of the rich contextual information and structures of heterogeneous network. We further demonstrate the effectiveness of our method in directing biological experiments, which can assist in identifying new hypotheses in biological investigation.
Tang C. (first author). 12/17/2019. “A Temporal Visualization of Chronic Obstructive Pulmonary Disease Progression Using Deep Learning and Unstructured Clinical Notes.” BMC Medical Informatics and Decision Making, 19, Suppl 8, Pp. 258.
Abstract:
Background Chronic obstructive pulmonary disease (COPD) is a progressive lung disease that is classified into stages based on disease severity. We aimed to characterize the time to progression prior to death in patients with COPD and to generate a temporal visualization that describes signs and symptoms during different stages of COPD progression.
Methods We present a two-step approach for visualizing COPD progression at the level of unstructured clinical notes. We included 15,500 COPD patients who received care within Partners Healthcare’s network and died between 2011 and 2017. We first propose a four-layer deep learning model that utilizes a specially configured recurrent neural network to capture irregular time lapse segments. Using those irregular time lapse segments, we created a temporal visualization (the COPD atlas) to demonstrate COPD progression, which consisted of representative sentences at each time window prior to death based on a fraction of theme words produced by a latent Dirichlet allocation model. We evaluated our approach on an annotated corpus of COPD patients’ unstructured pulmonary, radiology, and cardiology notes.
Results Experiments compared to the baselines showed that our proposed approach improved interpretability as well as the accuracy of estimating COPD progression.
Conclusions Our experiments demonstrated that the proposed deep-learning approach to handling temporal variation in COPD progression is feasible and can be used to generate a graphical representation of disease progression using information extracted from clinical notes.
Tang C. (first author). 6/28/2019. “Visualizing Literature Review Theme Evolution on Timeline Maps: Comparison Across Disciplines.” IEEE Access, 7, 1, Pp. 90597-90607.
Abstract:
Data-driven visualization techniques can be utilized to enhance the literature review process across different disciplines. In this work, 910 articles were retrieved using keyword search from bibliographic databases of two different disciplines (Computer Science – DBLP and Medicine – MEDLINE) between 2001 and 2016. These articles’ titles were processed using dynamic Latent Dirichlet Allocation to generate a set of themes/topics, which were subsequently classified and assigned to regions in a spatiotemporal geographical map. Resulting data visualizations from both repositories were manually reviewed by independent annotators. The results from DBLP and MEDLINE were comparable and, taken together, suggest potential benefits of increased future interaction among multidisciplinary fields. Our findings indicate that spiral timeline maps have the potential to help researchers acquire or compare knowledge efficiently without prior domain knowledge.
Tang C. (first author). 3/22/2019. “Medication Use for Childhood Pneumonia at a Children's Hospital in Shanghai, China: Analysis of Pattern Mining Algorithms.” JMIR Medical Informatics, 7, 1, Pp. e12577.
Abstract:
Background: Pattern mining utilizes multiple algorithms to explore objective and sometimes unexpected patterns in real-world data. This technique could be applied to electronic medical record data mining; however, it first requires a careful clinical assessment and validation.
Objective: The aim of this study was to examine the use of pattern mining techniques on a large clinical dataset to detect treatment and medication use patterns for childhood pneumonia.
Methods: We applied 3 pattern mining algorithms to 680,138 medication administration records from 30,512 childhood inpatients with a diagnosis of pneumonia during a 6-year period at a children’s hospital in China. Patients’ ages ranged from 0 to 17 years, where 37.53% (11,453/30,512) were 0 to 3 months old, 86.55% (26,408/30,512) were under 5 years, 60.37% (18,419/30,512) were male, and 60.10% (18,338/30,512) had a hospital stay of 9 to 15 days. We used the FP-Growth, PrefixSpan, and USpan pattern mining algorithms. The first 2 are more traditional pattern mining methods that mine a complete set of frequent medication use patterns; PrefixSpan additionally incorporates the administration sequence. The newer USpan method considers medication utility, defined by the dose, frequency, and timing of use of the 652 individual medications in the dataset. Together, these 3 methods identified the top 10 patterns from 6 age groups, forming a total of 180 distinct medication combinations. These medications encompassed the top 40 (73.66%, 500,982/680,138) most frequently used medications. These patterns were then evaluated by subject matter experts to summarize 5 medication use and 2 treatment patterns.
Results: We identified 5 medication use patterns: (1) antiasthmatics and expectorants and corticosteroids, (2) antibiotics and (antiasthmatics or expectorants or corticosteroids), (3) third-generation cephalosporin antibiotics with (or followed by) traditional antibiotics, (4) antibiotics and (medications for enteritis or skin diseases), and (5) (antiasthmatics or expectorants or corticosteroids) and (medications for enteritis or skin diseases). We also identified 2 frequent treatment patterns: (1) 42.89% (291,701/680,138) of specific medication administration records were of intravenous therapy with antibiotics, diluents, and nutritional supplements and (2) 11.53% (78,390/680,138) were of various combinations of inhalation of antiasthmatics, expectorants, or corticosteroids. Fleiss kappa for the subject experts’ evaluation was 0.693, indicating moderate agreement.
Conclusions: Utilizing a pattern mining approach, we summarized 5 medication use patterns and 2 treatment patterns. These warrant further investigation.
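The notion of a frequent medication combination can be conveyed with a brute-force toy miner (the transactions and support threshold below are invented; the study used FP-Growth, PrefixSpan, and USpan, which scale far beyond this sketch):

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support=0.5, max_size=2):
    """Return itemsets (up to max_size) whose support meets the threshold."""
    n = len(transactions)
    counts = Counter()
    for tx in transactions:
        items = sorted(set(tx))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {combo: c / n for combo, c in counts.items()
            if c / n >= min_support}

# Hypothetical per-patient medication orders.
orders = [["antibiotic", "expectorant", "corticosteroid"],
          ["antibiotic", "expectorant"],
          ["antibiotic", "diluent"],
          ["expectorant", "corticosteroid"]]
freq = frequent_itemsets(orders, min_support=0.5)
```

FP-Growth reaches the same frequent itemsets without enumerating every combination, and the sequence-aware variants (PrefixSpan, USpan) additionally respect administration order and utility, which this sketch ignores.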
medication_pattern_slices.ppt fig_2.png fig_3.png tangelalpdf.pdf
Tang C. (first author). 11/22/2018. “Rethinking Data Sharing at the Dawn of a Health Data Economy: A Viewpoint.” J Med Internet Res, 20, 11, Pp. e11519.
Abstract:
A healthcare data economy has begun to form, but its rise has been tempered by a profound lack of sharing of both data and data products such as models, intermediate results, and annotated training corpora; this severely limits the potential for triggering economic cluster effects. Economic cluster effects represent a means to derive benefit from economies of scale from internal data innovations and are beneficial because they may mitigate challenges from external sources. Within institutions, data product sharing is needed to spark data entrepreneurship and data innovation, and cross-institutional sharing is also critical, especially for rare conditions.
Tang C. (first author). 8/23/2017. “Comment Topic Evolution on a Cancer Institution's Facebook Page.” Appl Clin Inform., 8, 3, Pp. 854-865.
Abstract:
Objectives: Our goal was to identify and track the evolution of the topics discussed in free-text comments on a cancer institution’s social media page.
Methods: We utilized the Latent Dirichlet Allocation model to extract ten topics from free-text comments on a cancer research institution’s Facebook™ page between January 1, 2009, and June 30, 2014. We calculated Pearson correlation coefficients between the comment categories to demonstrate topic intensity evolution.
Results: A total of 4,335 comments were included in this study, from which ten topics were identified: greetings (17.3%), comments about the cancer institution (16.7%), blessings (10.9%), time (10.7%), treatment (9.3%), expressions of optimism (7.9%), tumor (7.5%), father figure (6.3%), and other family members & friends (8.2%), leaving 5.1% of comments unclassified. The comment distributions reveal an overall increasing trend during the study period. We discovered a strong positive correlation between greetings and other family members & friends (r=0.88; p<0.001), a positive correlation between blessings and the cancer institution (r=0.65; p<0.05), and a negative correlation between blessings and greetings (r=–0.70; p<0.05).
Conclusions: A cancer institution’s social media platform can provide emotional support to patients and family members. Topic analysis may help institutions better identify and support the needs (emotional, instrumental, and social) of their community and influence their social media strategy.
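The correlation step of the methods can be reproduced in miniature (the monthly counts below are toy numbers, not the study’s data): Pearson’s r over topic-intensity time series reveals which comment topics rise and fall together.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical monthly comment counts for three topics.
greetings = [3, 5, 8, 9, 12, 15]
family    = [2, 6, 7, 10, 11, 16]   # moves with greetings (r near +1)
blessings = [14, 12, 9, 8, 5, 2]    # moves against greetings (r near -1)
```

In the study the same comparison over LDA-derived topic intensities surfaced the positive greetings/family and negative blessings/greetings relationships reported in the results.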

This paper was selected as one of the 15 candidate best papers (among 32,958 papers) in the cancer informatics section of the 2018 IMIA (International Medical Informatics Association) Yearbook.

Tang C. (first author). 2012. “Similarity Query of Time Series Sub-Sequences Based on LSH.” Jisuanji Xuebao (Chinese Journal of Computers), 35, 11, Pp. 2228-2236.
Abstract:
Subsequence similarity query is an important operation on time series, including range queries and k-nearest-neighbor queries. Most existing algorithms are based on the Euclidean or DTW distance, whose weak point is time inefficiency. We propose a new distance measure based on locality-sensitive hashing (LSH), which greatly improves efficiency while ensuring the quality of the query results. We also propose an index structure named DS-Index. Using DS-Index, we prune query candidates and thus propose two optimized algorithms: OLSH-Range and OLSH-kNN. Experiments conducted on real stock exchange transaction sequence datasets show that our algorithms can quickly and accurately find similarity query results.

This paper was awarded the Sa Shixuan Best Student Paper Award at the 29th National Database Conference of China in 2012.
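The underlying idea can be illustrated with a generic random-projection LSH sketch (an assumption about the flavor of LSH; the paper defines its own LSH-based distance and the DS-Index): subsequences that are close in Euclidean space tend to fall on the same side of random hyperplanes, so short bit signatures can prune candidates before any exact distance computation.

```python
import random

def lsh_signature(seq, planes):
    """One bit per hyperplane: which side of the plane the sequence lies on."""
    return [1 if sum(a * b for a, b in zip(seq, p)) >= 0 else 0
            for p in planes]

def matching_bits(a, b):
    """Count signature positions on which two sequences agree."""
    return sum(x == y for x, y in zip(a, b))

random.seed(7)
planes = [[random.gauss(0, 1) for _ in range(8)] for _ in range(24)]

# Toy subsequences: `near` is a slightly noisy copy of the query,
# `far` is an unrelated oscillating sequence.
q    = [1, 2, 3, 4, 5, 6, 7, 8]
near = [1.1, 2.0, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1]
far  = [8, -7, 6, -5, 4, -3, 2, -1]
```

Only subsequences whose signatures largely agree with the query's need an exact Euclidean or DTW comparison, which is the source of the efficiency gain over scanning all candidates.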