Elucidating the biological mechanisms underlying complex diseases is an important goal in biomedical research. Recent advances in biological technology have enabled the generation of massive volumes of data in genomics, transcriptomics, proteomics, epigenomics, metagenomics, metabolomics, nutriomics, etc., leading to the emergence of systems biology approaches to investigating complex diseases. However, most of these data remain underutilized after their initial acquisition and analysis. There is a growing gap between the generation of multifaceted data and our ability to integrate and analyze them. Inspired by the observation that many of the aforementioned data can be represented by networks, we propose a network-based model to encapsulate the rich information provided in each database and to connect across different databases. We integrate several public databases to construct a heterogeneous network in which nodes are entities such as genes, miRNAs, and diseases, and edges represent known relationships between them. One fundamental challenge is how to perform meaningful analysis on such a network, overcoming its intrinsic heterogeneity. We propose a network embedding method to learn a low-dimensional vector space that best preserves the known relationships between entities. Based on the learned vector representations, entities that are close to each other but currently have no known direct connections are likely to be associated and are therefore good candidates for future investigation. In the experiments, we construct a heterogeneous network of genes, miRNAs, and diseases using data from six public databases. To evaluate the performance of the proposed method, we predict disease-gene and disease-miRNA associations. Comparison of our method with several state-of-the-art methods clearly demonstrates its advantage, as it is the only one that takes full advantage of the rich contextual information provided by the heterogeneous network.
The encouraging results suggest that our method can help identify new hypotheses to guide future research.
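The embed-then-rank idea behind this kind of association prediction can be illustrated with a minimal sketch. This is not the paper's actual method or data: the entity names and edges below are invented toy stand-ins, and the objective is a generic skip-gram-style loss with negative sampling, used here only to show how proximity in the learned space can flag unlinked entity pairs as candidates.

```python
import numpy as np

# Toy heterogeneous network; names and edges are illustrative
# stand-ins, not data from the paper's six databases.
entities = ["geneA", "geneB", "miR-1", "miR-2", "disease1", "disease2"]
edges = [("geneA", "disease1"), ("miR-1", "disease1"),
         ("geneA", "miR-1"), ("geneB", "disease2"), ("miR-2", "geneB")]

idx = {e: i for i, e in enumerate(entities)}
rng = np.random.default_rng(0)
dim, lr, n_neg = 8, 0.05, 2
emb = rng.normal(scale=0.3, size=(len(entities), dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# SGD with negative sampling: pull linked entities together in the
# vector space, push randomly paired entities apart.
for _ in range(5000):
    u, v = edges[rng.integers(len(edges))]
    i, j = idx[u], idx[v]
    g = 1.0 - sigmoid(emb[i] @ emb[j])        # positive (linked) pair
    ei = emb[i].copy()
    emb[i] += lr * g * emb[j]
    emb[j] += lr * g * ei
    for _ in range(n_neg):                     # random negative pairs
        k = int(rng.integers(len(entities)))
        if k in (i, j):
            continue
        g = -sigmoid(emb[i] @ emb[k])
        ei = emb[i].copy()
        emb[i] += lr * g * emb[k]
        emb[k] += lr * g * ei

def score(a, b):
    """Cosine similarity between two entities' embeddings."""
    va, vb = emb[idx[a]], emb[idx[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# Unlinked pairs that score highly are candidate associations.
print(score("geneA", "disease1"))  # direct edge, typically high
print(score("miR-2", "disease1"))  # no direct edge: a candidate if high
```

In practice the embedding objective would be trained on the full heterogeneous network, and predictions would be ranked lists of unlinked disease-gene or disease-miRNA pairs sorted by similarity.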
A healthcare data economy has begun to form, but its rise has been tempered by a profound lack of sharing of both data and data products such as models, intermediate results, and annotated training corpora, and this severely limits the potential for triggering economic cluster effects. Economic cluster effects are a means of capturing economies of scale from internal data innovations and are beneficial because they may mitigate challenges from external sources. Within institutions, data product sharing is needed to spark data entrepreneurship and data innovation, and cross-institutional sharing is also critical, especially for rare conditions.
Research clues can be expressed as coherent chains of keywords grouped by theme. Capturing research clues from the vast and expanding medical literature is valuable, yet it is difficult to automatically create clear visualizations of research clues despite the many competing summarization tools. In this paper, we propose a spiral-based linear classifier, which we call a regional classifier. The study emphasizes the development of visualization methods and the process of finding a specific research clue to track patient needs reported in the medical literature. When timelines are combined with a spiral geographical map, they form a geometric shape that helps reveal clues from different spatial viewpoints and under periodic constraints. Our evaluation showed that the regional classifier produces better visual effects than support vector machine classifiers: it covers the important concepts of each theme and can represent the relationships among papers in a way that captures continuous developments and changes in key themes.
Objectives: Our goal was to identify and track the evolution of the topics discussed in free-text comments on a cancer institution’s social media page.
Methods: We utilized the Latent Dirichlet Allocation model to extract ten topics from free-text comments on a cancer research institution’s Facebook™ page between January 1, 2009, and June 30, 2014. We calculated Pearson correlation coefficients between the comment categories to demonstrate topic intensity evolution.
Results: A total of 4,335 comments were included in this study, from which ten topics were identified: greetings (17.3%), comments about the cancer institution (16.7%), blessings (10.9%), time (10.7%), treatment (9.3%), expressions of optimism (7.9%), tumor (7.5%), father figure (6.3%), and other family members & friends (8.2%), leaving 5.1% of comments unclassified. The comment distributions reveal an overall increasing trend during the study period. We discovered a strong positive correlation between greetings and other family members & friends (r=0.88; p<0.001), a positive correlation between blessings and the cancer institution (r=0.65; p<0.05), and a negative correlation between blessings and greetings (r=–0.70; p<0.05).
Conclusions: A cancer institution’s social media platform can provide emotional support to patients and family members. Topic analysis may help institutions better identify and support the needs (emotional, instrumental, and social) of their community and influence their social media strategy.
This paper was selected as one of the 15 candidate best papers (among 32,958 papers) in the cancer informatics section of the 2018 IMIA (International Medical Informatics Association) Yearbook.
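The core of the analysis pipeline described above, topic extraction with Latent Dirichlet Allocation followed by Pearson correlation between topic intensities, can be sketched with standard libraries. The comments below are invented toy stand-ins for the Facebook data, and three topics are used instead of the study's ten to keep the example readable.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.stats import pearsonr

# Toy comments standing in for the institution's Facebook comments.
comments = [
    "thank you for the wonderful care and treatment",
    "praying for you god bless your family",
    "my father starts chemo treatment next week",
    "blessings and prayers for the doctors",
    "hello everyone greetings from our family",
    "the tumor responded well to treatment",
]

# Bag-of-words counts, then LDA; each row of doc_topic is one
# comment's distribution over topics.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(comments)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(X)

# Pearson correlation between two topics' intensities, analogous to
# correlating topic intensities across time periods in the study.
r, p = pearsonr(doc_topic[:, 0], doc_topic[:, 1])
print(f"r={r:.2f}, p={p:.3f}")
```

In the study, topic intensities would be aggregated per time bin before correlating, which is what reveals trends such as the reported positive correlation between greetings and family-related comments.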
Provides an introduction to the data industry for the field of economics
This book bridges the gap between economics and data science to help data scientists understand the economics of big data, and to enable economists to analyze the data industry. It begins by explaining data resources and introduces the data asset. The book defines a data industry chain, enumerates data enterprises' business models versus operating models, and proposes a mode of industrial development for the data industry. The author describes five types of enterprise agglomerations and multiple industrial cluster effects. A discussion of the establishment and development of laws and regulations related to the data industry is provided. In addition, the book discusses several scenarios on how to convert data driving forces into productivity that can then serve society. It is designed to serve as a reference and training guide for data scientists, data-oriented managers and executives, entrepreneurs, scholars, and government employees.
Defines and develops the concept of a “Data Industry,” and explains the economics of data to data scientists and statisticians
Includes numerous case studies and examples from a variety of industries and disciplines
Serves as a useful guide for practitioners and entrepreneurs in the business of data technology
The Data Industry: The Business and Economics of Information and Big Data is a resource for practitioners in the data science industry, government, and students in economics, business, and statistics.
Two recent articles (see link1, link2) published in the authoritative journal The Economist agree with the views expressed in this book.
Subsequence similarity query is an important operation on time series, including range queries and k-nearest-neighbor queries. Most existing algorithms are based on the Euclidean distance or the DTW distance, whose weak point is time inefficiency. We propose a new distance measure based on locality-sensitive hashing (LSH), which greatly improves efficiency while preserving the quality of the query results. We also propose an index structure named DS-Index. Using DS-Index, we prune query candidates and thereby propose two optimized algorithms: OLSH-Range and OLSH-kNN. Our experiments on real stock-exchange transaction sequence datasets show that the algorithms quickly and accurately find similarity query results.
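The abstract does not describe the internals of the proposed LSH distance or DS-Index, but the general filter-and-verify pattern can be sketched. The example below uses a standard random-hyperplane (SimHash-style) LSH, with Hamming distance on bit signatures as the cheap proxy distance and a flat signature list as a stand-in for an index; all parameters and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random-hyperplane LSH: subsequences whose signatures share many
# bits tend to be close in angle, so Hamming distance on signatures
# serves as a cheap proxy distance for candidate pruning.
WINDOW, BITS = 16, 32
planes = rng.normal(size=(BITS, WINDOW))

def signature(x):
    """Bit signature of a length-WINDOW subsequence."""
    return (planes @ x > 0).astype(np.uint8)

def hamming(a, b):
    return int(np.count_nonzero(a != b))

# A toy "stock price" series and a slightly perturbed query
# subsequence taken from offset 50.
series = np.cumsum(rng.normal(size=200))
query = series[50:50 + WINDOW] + rng.normal(scale=0.01, size=WINDOW)

# Precompute signatures for every subsequence (a flat stand-in for
# an index such as DS-Index), then do a range-style candidate scan.
sigs = [signature(series[i:i + WINDOW]) for i in range(len(series) - WINDOW)]
q_sig = signature(query)
candidates = [i for i, s in enumerate(sigs) if hamming(q_sig, s) <= 4]

# The true match at offset 50 survives the pruning; candidates are
# then verified with the exact distance measure.
print(candidates)
```

The efficiency gain comes from comparing compact bit signatures instead of raw subsequences, so the expensive exact distance is computed only for the surviving candidates.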
Hepatocellular carcinoma (HCC) is a common cancer with a high mortality rate. The pathogenesis of HCC is not completely understood, and highly effective therapy is still unavailable. Over the past several decades, various genetic variations such as mutations and polymorphisms have been reported to be associated with HCC risk, progression, survival, and recurrence. However, to our knowledge, these genetic variations have not been comprehensively and systematically compiled. In this study, we constructed dbHCCvar, a free online database of human genetic variations in HCC. Eligible publications were collected from PubMed, and detailed information and major research data from each eligible study were extracted and recorded in our database. As a result, dbHCCvar contains almost all human genetic variations reported to date as associated or not associated with HCC risk, clinical pathology, drug reaction, survival, or recurrence. We expect dbHCCvar to serve as a useful tool for researchers, facilitating the search for and identification of new genetic markers for HCC. dbHCCvar is free for all visitors at http://GenetMed.fudan.edu.cn/dbHCCvar.