Machine Learning

2018
J. P. Cortés, et al., “Ambulatory assessment of phonotraumatic vocal hyperfunction using glottal airflow measures estimated from neck-surface acceleration,” PLoS One, vol. 13, no. 12, e0209017, 2018. Publisher's Version | Abstract:
Phonotraumatic vocal hyperfunction (PVH) is associated with chronic misuse and/or abuse of voice that can result in lesions such as vocal fold nodules. The clinical aerodynamic assessment of vocal function has recently been shown to differentiate between patients with PVH and healthy controls and to provide meaningful insight into pathophysiological mechanisms associated with these disorders. However, current clinical assessment of PVH remains incomplete because of its inability to objectively identify the type and extent of detrimental phonatory function associated with PVH during daily voice use. The current study sought to address this issue by incorporating, for the first time in a comprehensive ambulatory assessment, glottal airflow parameters estimated from a neck-mounted accelerometer and recorded to a smartphone-based voice monitor. We tested this approach on 48 patients with vocal fold nodules and 48 matched healthy-control subjects who each wore the voice monitor for a week. Seven glottal airflow features were estimated every 50 ms using an impedance-based inverse filtering scheme, and seven high-order summary statistics of each feature were computed every 5 minutes over voiced segments. Based on univariate hypothesis testing, eight glottal airflow summary statistics were found to be statistically different between the patient and healthy-control groups. L1-regularized logistic regression for a supervised classification task yielded a mean (standard deviation) area under the ROC curve of 0.82 (0.25) and an accuracy of 0.83 (0.14). These results outperform the state of the art for the same classification task and provide a new avenue for improving the assessment and treatment of hyperfunctional voice disorders.
Paper
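A minimal sketch of the classification setup described in the abstract above: L1-regularized logistic regression on per-subject summary statistics of glottal airflow features, scored with cross-validated ROC AUC. The synthetic data, feature count, and 5-fold cross-validation scheme are illustrative placeholders, not the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(96, 8))   # 48 patients + 48 controls, 8 summary statistics
y = np.repeat([1, 0], 48)      # 1 = vocal fold nodules, 0 = matched control

# L1 penalty performs embedded feature selection over the summary statistics.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"AUC: {aucs.mean():.2f} ({aucs.std():.2f})")
```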
2017
T. F. Quatieri, et al., “Multimodal biomarkers to discriminate cognitive state,” in The Role of Technology in Clinical Neuropsychology, R. L. Kane and T. D. Parsons, Eds. Oxford University Press, 2017, pp. 409–443.
M. Borsky, D. D. Mehta, J. H. Van Stan, and J. Gudnason, “Modal and nonmodal voice quality classification using acoustic and electroglottographic features,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2281-2291, 2017. Publisher's Version | Abstract:
The goal of this study was to investigate the performance of different feature types for voice quality classification using multiple classifiers. The study compared the COVAREP feature set (glottal source features, frequency-warped cepstrum, and harmonic model features) against mel-frequency cepstral coefficients (MFCCs) computed from the acoustic voice signal, the acoustic-based glottal inverse filtered (GIF) waveform, and the electroglottographic (EGG) waveform. Our hypothesis was that MFCCs can capture the perceived voice quality from any of these three voice signals. Experiments were carried out on recordings from 28 participants with normal vocal status who were prompted to sustain vowels with modal and nonmodal voice qualities. Recordings were rated by an expert listener using the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V), and the ratings were transformed into a dichotomous label (presence or absence) for the prompted voice qualities of modal voice, breathiness, strain, and roughness. Classification was done using support vector machine, random forest, deep neural network, and Gaussian mixture model classifiers, which were built as speaker-independent using a leave-one-speaker-out strategy. The best classification accuracy, 79.97%, was achieved with the full COVAREP set. The harmonic model features were the best-performing subset, with 78.47% accuracy, and the static+dynamic MFCCs scored 74.52%. A closer analysis showed that MFCC and dynamic MFCC features were able to classify the modal, breathy, and strained voice quality dimensions from the acoustic and GIF waveforms; classification performance from the EGG waveform was reduced.
Paper
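The speaker-independent evaluation described above can be sketched as follows. The feature matrix, labels, and untuned RBF SVM are synthetic stand-ins for the study's COVAREP/MFCC features and tuned classifiers; only the leave-one-speaker-out protocol is the point of the example.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_frames, n_speakers = 2800, 28
X = rng.normal(size=(n_frames, 39))        # static + dynamic MFCCs (13 + deltas)
y = rng.integers(0, 2, size=n_frames)      # e.g., breathy vs. not breathy
speaker = rng.integers(0, n_speakers, size=n_frames)

# Hold out all frames of one speaker per fold so the model never sees
# the test speaker during training.
accs = []
for train, test in LeaveOneGroupOut().split(X, y, groups=speaker):
    model = SVC(kernel="rbf").fit(X[train], y[train])
    accs.append(model.score(X[test], y[test]))
print(f"Mean leave-one-speaker-out accuracy: {np.mean(accs):.3f}")
```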
J. H. Van Stan, D. D. Mehta, and R. E. Hillman, “Recent innovations in voice assessment expected to impact the clinical management of voice disorders,” Perspectives of the ASHA Special Interest Groups, vol. 2, no. SIG 3, pp. 4-13, 2017. Publisher's Version | Abstract:
This article provides a summary of some recent innovations in voice assessment expected to have an impact in the next 5–10 years on how patients with voice disorders are clinically managed by speech-language pathologists. Specific innovations discussed are in the areas of laryngeal imaging, ambulatory voice monitoring, and “big data” analysis using machine learning to produce new metrics for vocal health. Also discussed is the potential for using voice analysis to detect and monitor other health conditions.
Paper
M. Borsky, M. Cocude, D. D. Mehta, M. Zañartu, and J. Gudnason, “Classification of voice modes using neck-surface accelerometer data,” in International Conference on Acoustics, Speech, and Signal Processing, 2017. Abstract:
This study analyzes signals recorded using a neck-surface accelerometer from subjects producing speech with different voice modes. The purpose is to explore whether the recorded waveforms can capture the glottal vibratory patterns that can be related to the movement of the vocal folds and thus to voice quality. The accelerometer waveforms do not contain the supraglottal resonances, which makes the proposed method suitable for real-life voice quality assessment and monitoring, as it does not breach patient privacy. Experiments with a Gaussian mixture model classifier demonstrate that different voice qualities produce distinctly different accelerometer waveforms. Using a speaker-dependent classifier, the system achieved 80.2% frame-level and 89.5% utterance-level accuracy for classifying among modal, breathy, pressed, and rough voice modes. Finally, the article presents characteristic waveforms for each voice mode and discusses their attributes.
Paper
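A minimal sketch of a GMM voice-mode classifier of the kind described above: one Gaussian mixture per mode, frame-level decisions by maximum per-frame log-likelihood, and utterance-level decisions by summing log-likelihoods over all frames. The features are synthetic stand-ins for cepstra of the neck-surface acceleration signal, and the mixture size is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

MODES = ["modal", "breathy", "pressed", "rough"]
rng = np.random.default_rng(2)

# Hypothetical training data: per-mode frame features (n_frames x n_ceps).
train = {m: rng.normal(loc=i, size=(500, 13)) for i, m in enumerate(MODES)}
gmms = {m: GaussianMixture(n_components=8, covariance_type="diag",
                           random_state=0).fit(feats)
        for m, feats in train.items()}

utterance = rng.normal(loc=1, size=(120, 13))    # unseen frames
ll = np.stack([gmms[m].score_samples(utterance) for m in MODES])  # (modes, frames)
frame_labels = [MODES[i] for i in ll.argmax(axis=0)]   # frame-level decision
utt_label = MODES[ll.sum(axis=1).argmax()]             # utterance-level decision
print(utt_label, frame_labels[:5])
```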
2016
M. Borsky, D. D. Mehta, J. P. Gudjohnsen, and J. Gudnason, “Classification of voice modality using electroglottogram waveforms,” in INTERSPEECH, 2016. Abstract:
Improper function of the vocal folds can result in perceptually distorted speech that is typically identified with various speech pathologies or even some neurological diseases. As a consequence, researchers have focused on finding quantitative voice characteristics to objectively assess and automatically detect nonmodal voice types. The bulk of the research has focused on classifying voice modality using features extracted from the speech signal. This paper proposes a different approach that focuses on analyzing the signal characteristics of the electroglottogram (EGG) waveform. The core idea is that modal and different kinds of nonmodal voice types produce EGG signals that have distinct spectral/cepstral characteristics, and as a consequence they can be distinguished from each other using standard cepstral-based features and a simple multivariate Gaussian mixture model. The practical usability of this approach was verified in the task of classifying among modal, breathy, rough, pressed, and soft voice types. We achieved 83% frame-level accuracy and 91% utterance-level accuracy by training a speaker-dependent system.
Paper
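A sketch of the cepstral front end implied by this entry: standard MFCCs (plus delta coefficients) computed directly from the EGG waveform rather than the acoustic signal, ready to feed a GMM classifier like the one sketched earlier. The sampling rate, frame settings, and toy EGG signal are assumptions for illustration.

```python
import numpy as np
import librosa

sr = 10000                                   # assumed EGG sampling rate
t = np.arange(sr) / sr
egg = np.sin(2 * np.pi * 120 * t)            # toy 120 Hz "EGG" cycle train

# 13 cepstral coefficients per ~25 ms frame with a 10 ms hop.
mfcc = librosa.feature.mfcc(y=egg.astype(np.float32), sr=sr, n_mfcc=13,
                            n_fft=256, hop_length=100)
delta = librosa.feature.delta(mfcc)          # dynamic (delta) coefficients
features = np.vstack([mfcc, delta]).T        # (n_frames, 26), ready for a GMM
print(features.shape)
```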
C. E. Stepp, M. Zañartu, D. D. Mehta, and R. E. Hillman, “Hyperfunctional voice disorders: Current results, clinical implications, and future directions of a multidisciplinary research program,” Proceedings of the Annual Convention of the American Speech-Language-Hearing Association, 2016.
R. E. Hillman, D. D. Mehta, C. E. Stepp, J. H. Van Stan, and M. Zañartu, “Objective assessment of vocal hyperfunction,” The Journal of the Acoustical Society of America, vol. 139, pp. 2193-2194, 2016.
M. Ghassemi, Z. Syed, D. D. Mehta, J. H. Van Stan, R. E. Hillman, and J. V. Guttag, “Uncovering voice misuse using symbolic mismatch,” JMLR Workshop and Conference Proceedings, pp. 1-14, 2016. Paper
2015
J. R. Williamson, T. F. Quatieri, B. S. Helfer, G. Ciccarelli, and D. D. Mehta, “Segment-dependent dynamics in predicting Parkinson’s disease,” Proceedings of InterSpeech, pp. 518-522, 2015. Paper
D. D. Mehta, et al., “Using ambulatory voice monitoring to investigate common voice disorders: Research update,” Frontiers in Bioengineering and Biotechnology, vol. 3, no. 155, pp. 1-14, 2015. Publisher's Version | Abstract:
Many common voice disorders are chronic or recurring conditions that are likely to result from inefficient and/or abusive patterns of vocal behavior, referred to as vocal hyperfunction. The clinical management of hyperfunctional voice disorders would be greatly enhanced by the ability to monitor and quantify detrimental vocal behaviors during an individual’s activities of daily life. This paper provides an update on ongoing work that uses a miniature accelerometer on the neck surface below the larynx to collect a large set of ambulatory data on patients with hyperfunctional voice disorders (before and after treatment) and matched-control subjects. Three types of analysis approaches are being employed in an effort to identify the best set of measures for differentiating between hyperfunctional and normal patterns of vocal behavior: (1) ambulatory measures of voice use that include vocal dose and voice quality correlates, (2) aerodynamic measures based on glottal airflow estimates extracted from the accelerometer signal using subject-specific vocal system models, and (3) classification based on machine learning and pattern recognition approaches that have been used successfully in analyzing long-term recordings of other physiological signals. Preliminary results demonstrate the potential for ambulatory voice monitoring to improve the diagnosis and treatment of common hyperfunctional voice disorders.
Paper
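The vocal dose measures mentioned in the abstract above can be illustrated with a short worked sketch. The time-dose and cycle-dose definitions follow the standard formulations of Titze, Švec, and Popolo (2003); the frame-level f0 track and voicing decisions are synthetic stand-ins for ones estimated from the accelerometer signal.

```python
import numpy as np

frame_dt = 0.05                        # 50 ms analysis frames
rng = np.random.default_rng(3)
f0 = rng.uniform(150, 250, size=6000)  # Hz, five minutes of frames
voiced = rng.random(6000) < 0.3        # ~30% phonation time

# Time dose: total seconds of phonation in the window.
time_dose = voiced.sum() * frame_dt
# Cycle dose: estimated number of vocal fold oscillations,
# accumulated as f0 * frame duration over voiced frames.
cycle_dose = (f0[voiced] * frame_dt).sum()
print(f"time dose: {time_dose:.1f} s, cycle dose: {cycle_dose:.0f} cycles")
```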
T. F. Quatieri, et al., “Vocal biomarkers to discriminate cognitive load in a working memory task,” Proceedings of InterSpeech, pp. 2684-2688, 2015. Paper
2014
M. Ghassemi, et al., “Learning to detect vocal hyperfunction from ambulatory neck-surface acceleration features: Initial results for vocal fold nodules,” IEEE Transactions on Biomedical Engineering, vol. 61, no. 6, pp. 1668-1675, 2014. Publisher's Version | Abstract:
Voice disorders are medical conditions that often result from vocal abuse/misuse, which is referred to generically as vocal hyperfunction. Standard voice assessment approaches cannot accurately determine the actual nature, prevalence, and pathological impact of hyperfunctional vocal behaviors because such behaviors can vary greatly across the course of an individual's typical day and may not be clearly demonstrated during a brief clinical encounter. Thus, it would be clinically valuable to develop noninvasive ambulatory measures that can reliably differentiate vocal hyperfunction from normal patterns of vocal behavior. As an initial step toward this goal, we used an accelerometer taped to the neck surface to provide a continuous, noninvasive acceleration signal designed to capture some aspects of vocal behavior related to vocal fold nodules, a common manifestation of vocal hyperfunction. We gathered data from 12 female adult patients diagnosed with vocal fold nodules and 12 control speakers matched for age and occupation. We derived features from weeklong neck-surface acceleration recordings by using distributions of sound pressure level and fundamental frequency over 5-min windows of the acceleration signal and normalized these features so that intersubject comparisons were meaningful. We then used supervised machine learning to show that the two groups exhibit distinct vocal behaviors that can be detected using the acceleration signal. We were able to correctly classify 22 of the 24 subjects, suggesting that in the future, measures of the acceleration signal could be used to detect patients with the types of aberrant vocal behaviors associated with hyperfunctional voice disorders.
Paper
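A minimal sketch of the feature construction described above, assuming histograms of SPL and f0 over 5-min windows as the features, followed by z-normalization so that inter-subject comparisons are meaningful. The bin edges, window counts, and synthetic frame-level measures are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def window_features(spl, f0, voiced):
    """Histogram features for one 5-min window of frame-level measures."""
    spl_hist, _ = np.histogram(spl[voiced], bins=np.arange(50, 101, 5), density=True)
    f0_hist, _ = np.histogram(f0[voiced], bins=np.arange(70, 401, 15), density=True)
    return np.concatenate([spl_hist, f0_hist])

rng = np.random.default_rng(4)
windows = []
for _ in range(200):                              # a week's worth of 5-min windows
    spl = rng.normal(75, 6, size=6000)            # dB SPL per 50 ms frame
    f0 = rng.normal(200, 25, size=6000)           # Hz
    voiced = rng.random(6000) < 0.25
    windows.append(window_features(spl, f0, voiced))

X = np.array(windows)
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)  # z-normalize each feature
print(X.shape)                                     # rows feed a supervised classifier
```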
J. R. Williamson, T. F. Quatieri, B. S. Helfer, G. Ciccarelli, and D. D. Mehta, “Vocal and facial biomarkers of depression based on motor incoordination and timing,” Proceedings of the Fourth International Audio/Visual Emotion Challenge (AVEC 2014), 22nd ACM International Conference on Multimedia, pp. 65-72, 2014. Paper
2013
J. R. Williamson, T. F. Quatieri, B. S. Helfer, R. L. Horwitz, B. Yu, and D. D. Mehta, “Vocal and facial biomarkers of depression based on motor incoordination,” Third International Audio/Visual Emotion Challenge (AVEC 2013), 21st ACM International Conference on Multimedia, pp. 1-4, 2013. Paper
2012
M. Ghassemi, et al., “Detecting voice modes for vocal hyperfunction prevention,” Proceedings of the 7th Annual Workshop for Women in Machine Learning, 2012.
2011
R. E. Hillman and D. D. Mehta, “Ambulatory monitoring of daily voice use,” Perspectives on Voice and Voice Disorders, vol. 21, no. 2, pp. 56-61, 2011. Publisher's Version Paper