# Publications

2016
Lin C, Miller T, Dligach D, Bethard S, Savova G. Improving Temporal Relation Extraction with Training Instance Augmentation, in Proceedings of the 15th Workshop on Biomedical Natural Language Processing. Berlin, Germany: Association for Computational Linguistics ; 2016 :108–113.
Miller T, Dligach D, Savova G. Unsupervised Document Classification with Informed Topic Models, in Proceedings of the 15th Workshop on Biomedical Natural Language Processing. Berlin, Germany: Association for Computational Linguistics ; 2016 :83–91.
Shain C, Bryce W, Jin L, Krakovna V, Doshi-Velez F, Miller T, Schuler W, Schwartz L. Memory-Bounded Left-Corner Unsupervised Grammar Induction on Child-Directed Input, in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan: The COLING 2016 Organizing Committee ; 2016 :964–975.
This paper presents a new memory-bounded left-corner parsing model for unsupervised raw-text syntax induction, using unsupervised hierarchical hidden Markov models (UHHMM). We deploy this algorithm to shed light on the extent to which human language learners can discover hierarchical syntax through distributional statistics alone, by modeling two widely-accepted features of human language acquisition and sentence processing that have not been simultaneously modeled by any existing grammar induction algorithm: (1) a left-corner parsing strategy and (2) limited working memory capacity. To model realistic input to human language learners, we evaluate our system on a corpus of child-directed speech rather than typical newswire corpora. Results beat or closely match those of three competing systems.
2015
Miller TA, Bethard S, Dligach D, Lin C, Savova GK. Extracting Time Expressions from Clinical Text, in Proceedings of the 2015 Workshop on Biomedical Natural Language Processing (BioNLP 2015). ; 2015.
Dligach D, Miller T, Savova GK. Semi-supervised Learning for Phenotyping Tasks, in AMIA Annual Symposium Proceedings. ; 2015.
2014
Lin C, Karlson EW, Dligach D, Ramirez MP, Miller TA, Mo H, Braggs NS, Cagan A, Gainer V, Denny JC, et al. Automatic identification of methotrexate-induced liver toxicity in patients with rheumatoid arthritis from the electronic medical record. Journal of the American Medical Informatics Association. 2014 :23–30.
Lin C, Miller T, Kho A, Bethard S, Dligach D, Pradhan S, Savova G. Descending-Path Convolution Kernel for Syntactic Structures. ACL. 2014;1 :81–86.
Convolution tree kernels are an efficient and effective method for comparing syntactic structures in NLP methods. However, current kernel methods such as subset tree kernel and partial tree kernel understate the similarity of very similar tree structures. Although soft-matching approaches can improve the similarity scores, they are corpus-dependent and match relaxations may be task-specific. We propose an alternative approach called descending path kernel which gives intuitive similarity scores on comparable structures. This method is evaluated on two temporal relation extraction tasks and demonstrates its advantage over rich syntactic representations.
Wu S, Miller T, Masanz J, Coarr M, Halgrim S, Carrell D, Clark C. Negation's Not Solved: Generalizability Versus Optimizability in Clinical Natural Language Processing. PLoS ONE. 2014;9 (11) :e112774.
Styler IV WF, Bethard S, Finan S, Palmer M, Pradhan SS, de Groen PC, Erickson B, Miller TA, Lin C, Savova GK, et al. Temporal annotation in the clinical domain. Transactions of the …. 2014;2 :143–154.
2013
Lin C, Karlson EW, Canhao H, Miller TA, Dligach D, Chen PJ, Perez RNG, Shen Y, Weinblatt ME, Shadick NA, et al. Automatic prediction of rheumatoid arthritis disease activity from the electronic medical records. PLoS ONE. 2013;8 (8) :e69932.
OBJECTIVE: We aimed to mine the data in the Electronic Medical Record to automatically discover patients' Rheumatoid Arthritis disease activity at discrete rheumatology clinic visits. We cast the problem as a document classification task where the feature space includes concepts from the clinical narrative and lab values as stored in the Electronic Medical Record. MATERIALS AND METHODS: The Training Set consisted of 2792 clinical notes and associated lab values. Test Set 1 included 1749 clinical notes and associated lab values. Test Set 2 included 344 clinical notes for which there were no associated lab values. The Apache clinical Text Analysis and Knowledge Extraction System was used to analyze the text and transform it into informative features to be combined with relevant lab values. RESULTS: Experiments over a range of machine learning algorithms and features were conducted. The best performing combination was linear kernel Support Vector Machines with Unified Medical Language System Concept Unique Identifier features with feature selection and lab values. The Area Under the Receiver Operating Characteristic Curve (AUC) is 0.831 ($\sigma$ = 0.0317), statistically significant as compared to two baselines (AUC = 0.758, $\sigma$ = 0.0291). Algorithms demonstrated superior performance on cases clinically defined as extreme categories of disease activity (Remission and High) compared to those defined as intermediate categories (Moderate and Low) and included laboratory data on inflammatory markers. CONCLUSION: Automatic Rheumatoid Arthritis disease activity discovery from Electronic Medical Record data is a learnable task approximating human performance. As a result, this approach might have several research applications, such as the identification of patients for genome-wide pharmacogenetic studies that require large sample sizes with precise definitions of disease activity and response to therapies.
Miller T, Bethard S, Dligach D, Pradhan S, Lin C, Savova G. Discovering Temporal Narrative Containers in Clinical Text, in Proceedings of the 2013 Workshop on Biomedical Natural Language Processing. Sofia, Bulgaria: Association for Computational Linguistics ; 2013 :18–26.
2012
Miller TA, Dligach D, Savova GK. Active Learning for Coreference Resolution, in Proceedings of BioNLP 2012. ; 2012 :73–81.
Polepalli Ramesh B, Prasad R, Miller T, Harrington B, Yu H. Automatic discourse connective detection in biomedical text. Journal of the American Medical Informatics Association. 2012;19 (5) :800–808.
OBJECTIVE: Relation extraction in biomedical text mining systems has largely focused on identifying clause-level relations, but increasing sophistication demands the recognition of relations at discourse level. A first step in identifying discourse relations involves the detection of discourse connectives: words or phrases used in text to express discourse relations. In this study supervised machine-learning approaches were developed and evaluated for automatically identifying discourse connectives in biomedical text. MATERIALS AND METHODS: Two supervised machine-learning models (support vector machines and conditional random fields) were explored for identifying discourse connectives in biomedical literature. In-domain supervised machine-learning classifiers were trained on the Biomedical Discourse Relation Bank, an annotated corpus of discourse relations over 24 full-text biomedical articles (~112,000 word tokens), a subset of the GENIA corpus. Novel domain adaptation techniques were also explored to leverage the larger open-domain Penn Discourse Treebank (~1 million word tokens). The models were evaluated using the standard evaluation metrics of precision, recall and F1 scores. RESULTS AND CONCLUSION: Supervised machine-learning approaches can automatically identify discourse connectives in biomedical text, and the novel domain adaptation techniques yielded the best performance: 0.761 F1 score. A demonstration version of the fully implemented classifier BioConn is available at: http://bioconn.askhermes.org.
Zheng J, Chapman WW, Miller TA, Lin C, Crowley RS, Savova GK. A system for coreference resolution for the clinical narrative. Journal of the American Medical Informatics Association. 2012;19 (4) :660–667.
Lin C, Canhao H, Miller TA, Dligach D, Plenge RM, Karlson EW, Savova GK. Feature engineering and selection for rheumatoid arthritis disease activity classification using electronic medical records. Proceedings of the 29th international ICML conference Workshop on Machine Learning for Clinical Data. 2012.
2011
Li D, Miller T, Schuler W. A Pronoun Anaphora Resolution System based on Factorial Hidden Markov Models, in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT '11). ; 2011 :1169–1178.
This paper presents a supervised pronoun anaphora resolution system based on factorial hidden Markov models (FHMMs). The basic idea is that the hidden states of FHMMs are an explicit short-term memory with an antecedent buffer containing recently described referents. Thus an observed pronoun can find its antecedent from the hidden buffer, or in terms of a generative model, the entries in the hidden buffer generate the corresponding pronouns. A system implementing this model is evaluated on the ACE corpus with promising performance.
2010
Schuler W, AbdelRahman S, Miller T, Schwartz L. Broad-Coverage Parsing Using Human-Like Memory Constraints. Computational Linguistics. 2010;36 (1) :1–30.
Human syntactic processing shows many signs of taking place within a general-purpose short-term memory. But this kind of memory is known to have a severely constrained storage capacity—possibly constrained to as few as three or four distinct elements. This article describes a model of syntactic processing that operates successfully within these severe constraints, by recognizing constituents in a right-corner transformed representation (a variant of left-corner parsing) and mapping this representation to random variables in a Hierarchic Hidden Markov Model, a factored time-series model which probabilistically models the contents of a bounded memory store over time. Evaluations of the coverage of this model on a large syntactically annotated corpus of English sentences, and the accuracy of a bounded-memory parsing strategy based on this model, suggest this model may be cognitively plausible.
Miller T. Generative Models of Disfluency. 2010.