In this paper, we investigate scientific misconceptions concerning epidemiological research to explore convergence in big data epidemiology. To achieve this aim, we detect growing concerns by applying a topic model that we previously designed for exploratory literature review, rather than conducting a full manual literature review. In the latter part of this study, we first distinguish between misconceptions that are easy to dispel (because they represent simple errors) and misconceptions that are, on the contrary, hard to dispel (because they are the product of pseudo-explanations). We next propose spatiotemporal thinking as a fruitful means of dealing with misconceptions. We define the core competencies and knowledge relevant to the practice of spatiotemporal thinking and discuss how they help us avoid misconceptions when converging big data and epidemiology.
Learning representations of clinical notes is challenging because their complex content requires preprocessing to make the data more suitable for data mining. An important issue, addressed here, is that of temporal expressions, in which cues indicate the time when clinical events occur. We present a three-step data reconstruction algorithm for transforming similar clinical entities (e.g., symptoms, complications) into sequential data through unsupervised annotation of temporal expressions. First, the algorithm detects whether an expression has temporal intent. Second, it decomposes and rewrites the expression into a non-temporal sub-expression and temporal constraints. Finally, it clusters similar non-temporal sub-expressions using unsupervised sentence embedding under a modified K-medoids paradigm. We evaluated the proposed algorithm on clinical notes associated with chronic obstructive pulmonary disease (COPD). Visualizing the reconstruction results for cardiology reports from a longitudinal cohort of patients with COPD demonstrated that the algorithm is feasible.
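The three steps described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the published algorithm: the regular expression stands in for temporal-intent detection, token-set Jaccard distance stands in for unsupervised sentence embeddings, and the clustering is plain alternating K-medoids, not the modified variant. The names `TEMPORAL`, `has_temporal_intent`, `decompose`, and `k_medoids` are hypothetical.

```python
import re

# Hypothetical pattern list (assumption); real clinical notes need far broader coverage.
TEMPORAL = re.compile(
    r"\b(?:\d+\s+(?:day|week|month|year)s?\s+ago|since\s+\d{4}|yesterday|today)\b",
    re.IGNORECASE,
)

def has_temporal_intent(expr):
    """Step 1: flag expressions that contain a temporal cue."""
    return TEMPORAL.search(expr) is not None

def decompose(expr):
    """Step 2: split into (non-temporal sub-expression, temporal constraints)."""
    constraints = TEMPORAL.findall(expr)
    rest = " ".join(TEMPORAL.sub(" ", expr).split())
    return rest, constraints

def jaccard_distance(a, b):
    """Token-set distance, a crude stand-in for sentence-embedding distance."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def k_medoids(items, k, distance, max_iter=20):
    """Step 3: plain alternating K-medoids; returns {medoid index: member indices}."""
    medoids = list(range(k))  # deterministic initialisation: the first k items
    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for i, item in enumerate(items):
            # A medoid always stays in its own cluster, so no cluster is empty.
            if i in clusters:
                nearest = i
            else:
                nearest = min(medoids, key=lambda m: distance(item, items[m]))
            clusters[nearest].append(i)
        # Re-pick each medoid as the member with the lowest total distance.
        new_medoids = [
            min(members, key=lambda c: sum(distance(items[c], items[j]) for j in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return clusters

notes = [
    "shortness of breath 3 days ago",
    "chest pain since 2015",
    "severe shortness of breath yesterday",
    "chest pain on exertion 2 weeks ago",
]
sub_expressions = [decompose(n)[0] for n in notes if has_temporal_intent(n)]
clusters = k_medoids(sub_expressions, k=2, distance=jaccard_distance)
```

In this toy run the two "shortness of breath" expressions and the two "chest pain" expressions fall into separate clusters, and each keeps its temporal constraint from `decompose` for ordering into a sequence.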
Chronic Obstructive Pulmonary Disease (COPD) is a leading cause of mortality in the United States. Representing COPD progression using temporal graphs may offer critical clinical insights. Long Short-Term Memory (LSTM) units in recurrent neural networks can process data with constant elapsed times between consecutive elements of a sequence but cannot handle irregular time intervals (i.e., segments of unequal duration). In this study, we propose a four-layer deep learning model that uses a specially configured recurrent neural network to capture irregular time-lapse segments. Experiments on a corpus of COPD patients’ clinical notes showed that, compared with baseline algorithms, our model improved both interpretability and the accuracy of estimating COPD progression.
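One common way to let a recurrent model account for unequal gaps between observations is to discount the carried state by the elapsed time. The sketch below shows that idea on a scalar recurrent cell. It is an illustrative assumption, not the four-layer model described above: the exponential `decay` function, the time constant `tau`, and the weights `w_x`, `w_h`, and `b` are all hypothetical.

```python
import math

def decay(delta_t, tau=30.0):
    """Discount factor in (0, 1]; tau (in days here) is an assumed time constant."""
    return math.exp(-delta_t / tau)

def run_cell(sequence, w_x=1.0, w_h=1.0, b=0.0, tau=30.0):
    """Scalar recurrent cell whose previous state is decayed by the elapsed time.

    sequence: iterable of (input value, time elapsed since the previous element).
    """
    h = 0.0
    states = []
    for x, delta_t in sequence:
        # The longer the gap, the less the previous state contributes.
        h = math.tanh(w_x * x + w_h * decay(delta_t, tau) * h + b)
        states.append(h)
    return states

# Same inputs, different gap before the second note: 1 day versus 365 days.
short_gap = run_cell([(1.0, 0.0), (0.0, 1.0)])
long_gap = run_cell([(1.0, 0.0), (0.0, 365.0)])
```

With a one-day gap the second state still carries most of the first observation, whereas after a year it is close to zero, which is the behaviour a fixed-interval recurrent unit cannot express.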
Illustration of all three types of clinical notes for a COPD patient (Fig. 4@Tableau).
Elucidating the biological mechanisms underlying complex diseases is an important goal in biomedical research. Recent advances in biological technology have enabled the generation of massive volumes of data in genomics, transcriptomics, proteomics, epigenomics, metagenomics, metabolomics, nutriomics, etc., leading to the emergence of systems biology approaches to investigating complex diseases. However, most of the data remain underutilized after their initial acquisition and analysis. There is a growing gap between the generation of these multifaceted data and our ability to integrate and analyze them. Inspired by the observation that many of the aforementioned data can be represented as networks, we propose a network-based model to encapsulate the rich information provided in each database and to connect across different databases. We integrate several public databases to construct a heterogeneous network in which nodes are entities such as genes, miRNAs, and diseases, and edges represent known relationships between them. One fundamental challenge is how to perform meaningful analysis on such a network while overcoming its intrinsic heterogeneity. We propose a network embedding method to learn a low-dimensional vector space that best preserves the known relationships between entities. Based on the learned vector representations, entities that are close to each other but currently have no known direct connection are likely to be associated and are therefore good candidates for future investigation. In the experiments, we construct a heterogeneous network of genes, miRNAs, and diseases using data from six public databases. To evaluate the performance of the proposed method, we predict disease-gene and disease-miRNA associations. Comparison with several state-of-the-art methods demonstrates the advantage of our method, as it is the only one that takes full advantage of the rich contextual information provided by the heterogeneous network. The encouraging results suggest that our method can help identify new hypotheses to guide future research.
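The candidate-selection step, in which entities that lie close together in the embedding space but have no known edge are proposed for investigation, can be sketched as follows. The vectors, node names, and known edge are toy values invented for illustration, cosine similarity is an assumed closeness measure, and `rank_candidates` is a hypothetical helper; the embedding method itself is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_candidates(embeddings, node_types, known_edges, source, target_type):
    """Rank nodes of target_type with no known edge to source, most similar first."""
    scored = [
        (name, cosine(embeddings[source], vec))
        for name, vec in embeddings.items()
        if node_types[name] == target_type
        and (source, name) not in known_edges
        and (name, source) not in known_edges
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy two-dimensional embeddings (hypothetical values, not learned from data).
embeddings = {
    "disease_X": [1.0, 0.0],
    "gene_A": [0.9, 0.1],
    "gene_B": [0.0, 1.0],
    "gene_C": [0.8, 0.6],
}
node_types = {"disease_X": "disease", "gene_A": "gene", "gene_B": "gene", "gene_C": "gene"}
known_edges = {("disease_X", "gene_A")}  # already recorded, so excluded from candidates

ranked = rank_candidates(embeddings, node_types, known_edges, "disease_X", "gene")
```

Here `gene_A` is skipped because its association is already known, and `gene_C` ranks above `gene_B` because its vector lies closer to that of `disease_X`.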
Research clues can be expressed as coherent chains of keywords grouped by theme. Capturing such clues from the vast and expanding medical literature is valuable. Yet it is difficult to automatically create clear visualizations of research clues, despite the presence of many competing summarization tools. In this paper, we propose a linear classifier based on a spiral, which we call a regional classifier. The study emphasizes the development of visualization methods and the process of finding a specific research clue to track patient needs reported in the medical literature. When timelines are combined with a spiral geographical map, they form a geometric shape that helps to reveal the clues from different spatial viewpoints and under periodic constraints. Our evaluation showed that the regional classifier produces better visual effects than support vector machine classifiers. It covers the important concepts of each theme and represents the relationships among papers in a way that captures continuous developments and changes in key themes.
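To make the idea of a timeline laid out on a spiral concrete, the sketch below places time stamps on an Archimedean spiral so that one full turn corresponds to one period. It is a generic layout under stated assumptions and not the regional classifier itself: the function `spiral_position`, the 12-month `period`, and the `spacing` between turns are hypothetical choices.

```python
import math

def spiral_position(t, period=12.0, spacing=1.0):
    """Map a time stamp t (e.g. months since the first paper) onto a spiral.

    Returns (x, y, angle, radius). Items one full period apart share an angle
    and differ in radius by exactly `spacing`.
    """
    angle = 2.0 * math.pi * ((t % period) / period)  # position within the period
    radius = spacing * (t / period)                  # grows steadily with time
    return radius * math.cos(angle), radius * math.sin(angle), angle, radius

# Two hypothetical papers published exactly one period (12 months) apart.
first = spiral_position(3.0)
second = spiral_position(15.0)
```

Because items from the same phase of successive periods line up along one ray, recurring themes appear as radial bands while long-term drift appears as movement around the spiral.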