Research Investigations (Peer-Reviewed)

In Press
J. H. Van Stan, et al., “Differences in weeklong ambulatory vocal behavior between female patients with phonotraumatic lesions and matched controls,” Journal of Speech, Language, and Hearing Research, In Press.
J. Z. Lin, V. M. Espinoza, M. Zañartu, K. L. Marks, and D. D. Mehta, “Improved subglottal pressure estimation from neck-surface vibration in healthy speakers producing non-modal phonation,” IEEE Journal of Selected Topics in Signal Processing, In Press.
Subglottal air pressure plays a major role in voice production and is a primary factor in controlling voice onset, offset, sound pressure level, glottal airflow, vocal fold collision pressures, and variations in fundamental frequency. Previous work has shown promise for the estimation of subglottal pressure from an unobtrusive miniature accelerometer sensor attached to the anterior base of the neck during typical modal voice production across multiple pitch and vowel contexts. This study expands on that work to incorporate additional accelerometer-based measures of vocal function to compensate for non-modal phonation characteristics and achieve an improved estimation of subglottal pressure. Subjects with normal voices repeated /p/-vowel syllable strings from loud-to-soft levels in multiple vowel contexts (/a/, /i/, and /u/), pitch conditions (comfortable, lower than comfortable, higher than comfortable), and voice quality types (modal, breathy, strained, and rough). Subject-specific, stepwise regression models were constructed using root-mean-square (RMS) values of the accelerometer signal alone (baseline condition) and in combination with cepstral peak prominence, fundamental frequency, and glottal airflow measures derived using subglottal impedance-based inverse filtering. Five-fold cross-validation assessed the robustness of model performance using the root-mean-square error metric for each regression model. Each cross-validation fold exhibited up to a 25% decrease in prediction error when the model incorporated multi-dimensional aspects of the accelerometer signal compared with RMS-only models. Improved estimation of subglottal pressure for non-modal phonation was thus achievable, lending to future studies of subglottal pressure estimation in patients with voice disorders and in ambulatory voice recordings.
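The baseline condition described in the abstract above (RMS of the accelerometer signal as the only predictor, scored by five-fold cross-validated root-mean-square error) can be illustrated with a minimal sketch. All names here are hypothetical, and the single-predictor ordinary least-squares fit and the fold assignment are assumptions for illustration; this is not the authors' stepwise regression pipeline.

```python
import math
import random


def rms(samples):
    """Root-mean-square magnitude of one signal segment."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def kfold_rmse(xs, ys, k=5, seed=0):
    """Return the held-out RMSE of the linear fit for each of k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        squared = [(ys[i] - (slope * xs[i] + intercept)) ** 2
                   for i in held_out]
        errors.append(math.sqrt(sum(squared) / len(squared)))
    return errors
```

Here `xs` would hold one RMS value per vowel segment and `ys` the matching reference pressure. Comparing the per-fold errors of this baseline with those of a model that uses additional predictors is the kind of comparison the abstract reports.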
M. Brockmann-Bauser, J. H. Van Stan, M. Carvalho Sampaio, J. E. Bohlender, R. E. Hillman, and D. D. Mehta, “Effects of vocal intensity and fundamental frequency on cepstral peak prominence in patients with voice disorders and vocally healthy controls,” Journal of Voice, In Press.
J. T. Heaton, et al., “Aerodynamically driven phonation of individual vocal folds under general anesthesia in canines,” The Laryngoscope, In Press.

Objectives

We previously developed an instrument called the Aerodynamic Vocal Fold Driver (AVFD) for intraoperative magnified assessment of vocal fold (VF) vibration during microlaryngoscopy under general anesthesia. Excised larynx testing showed that the AVFD could provide useful information about the vibratory characteristics of each VF independently. The present investigation expands those findings by testing new iterations of the AVFD during microlaryngoscopy in the canine model.

Study Design

Animal model.

Methods

The AVFD is a handheld instrument that is positioned to contact the phonatory mucosa of either VF during microlaryngoscopy. Airflow delivered through the AVFD shaft to the subglottis drives the VF into phonation‐like vibration, which enables magnified observation of mucosal‐wave function with stroboscopy or high‐speed video. AVFD‐driven phonation was tested intraoperatively (n = 26 VFs) using either the original instrument design or smaller and larger versions three‐dimensionally printed from a medical grade polymer. A high‐fidelity pressure sensor embedded within the AVFD measured VF contact pressure. Characteristics of individual VF phonation were compared with typical two‐fold phonation and compared for VFs scarred by electrocautery (n = 4) versus controls (n = 22).

Results

Phonation was successful in all 26 VFs, even when scar prevented conventional bilateral phonation. The 15‐mm‐wide AVFD fits best within the anteroposterior dimension of the musculo‐membranous VF, and VF contact pressure correlated with acoustic output, driving pressures, and visible modes of vibration.

Conclusions

The AVFD can reveal magnified vibratory characteristics of individual VFs during microlaryngoscopy (e.g., without needing patient participation), potentially providing information that is not apparent or available during conventional awake phonation, which might facilitate phonosurgical decision making.

Level of Evidence


D. D. Deliyski, et al., “Laser-calibrated system for transnasal fiberoptic laryngeal high-speed videoendoscopy,” Journal of Voice, In Press.
H. Ghasemzadeh, D. D. Deliyski, D. S. Ford, J. B. Kobler, R. E. Hillman, and D. D. Mehta, “Method for vertical calibration of laser-projection transnasal fiberoptic high-speed videoendoscopy,” Journal of Voice, In Press.
The ability to provide absolute calibrated measurement of the laryngeal structures during phonation is of paramount importance to voice science and clinical practice. Calibrated three-dimensional measurement could provide essential information for modeling purposes, for studying the developmental aspects of vocal fold vibration, for refining functional voice assessment and treatment outcomes evaluation, and for more accurate staging and grading of laryngeal disease. Recently, a laser-calibrated transnasal fiberoptic endoscope compatible with high-speed videoendoscopy (HSV) and capable of providing three-dimensional measurements was developed. The optical principle employed is to project a grid of 7 × 7 green laser points across the field of view (FOV) at an angle relative to the imaging axis, such that (after calibration) the position of each laser point within the FOV encodes the vertical distance from the tip of the endoscope to the laryngeal tissues. The purpose of this study was to develop a precise method for vertical calibration of the endoscope. Investigating the position of the laser points showed that, besides the vertical distance, they also depend on the parameters of the lens coupler, including the FOV position within the image frame and the rotation angle of the endoscope. The presented automatic calibration method was developed to compensate for the effect of these parameters. Statistical image processing and pattern recognition were used to detect the FOV, the center of FOV, and the fiducial marker. This step normalizes the HSV frames to a standard coordinate system and removes the dependence of the laser-point positions on the parameters of the lens coupler. Then, using a statistical learning technique, a calibration protocol was developed to model the trajectories of all laser points as the working distance was varied. Finally, a set of experiments was conducted to measure the accuracy and reliability of every step of the procedure. 
The system was able to measure absolute vertical distance with mean percent error in the range of 1.7% to 4.7%, depending on the working distance.
M. E. Powell, et al., “Efficacy of videostroboscopy and high-speed videoendoscopy to obtain functional outcomes from perioperative ratings in patients with mass lesions,” Journal of Voice, In Press.
D. D. Mehta, et al., “Toward development of a vocal fold contact pressure probe: Bench-top validation of a dual-sensor probe using excised human larynx models,” Applied Sciences, vol. 9, no. 20, pp. 4360, 2019.
A critical element in understanding voice production mechanisms is the characterization of vocal fold collision, which is widely considered a primary etiological factor in the development of common phonotraumatic lesions such as nodules and polyps. This paper describes the development of a transoral, dual-sensor intraglottal/subglottal pressure probe for the simultaneous measurement of vocal fold collision and subglottal pressures during phonation using two miniature sensors positioned 7.6 mm apart at the distal end of a rigid cannula. Proof-of-concept testing was performed using excised whole-mount and hemilarynx human tissue aerodynamically driven into self-sustained oscillation, with systematic variation of the superior–inferior positioning of the vocal fold collision sensor. In the hemilarynx experiment, signals from the pressure sensors were synchronized with an acoustic microphone, a tracheal-surface accelerometer, and two high-speed video cameras recording at 4000 frames per second for top–down and en face imaging of the superior and medial vocal fold surfaces, respectively. As expected, the intraglottal pressure signal exhibited an impulse-like peak when vocal fold contact occurred, followed by a broader peak associated with intraglottal pressure build-up during the de-contacting phase. As subglottal pressure was increased, the peak amplitude of the collision pressure increased and typically reached a value below that of the average subglottal pressure. Results provide important baseline vocal fold collision pressure data with which computational models of voice production can be developed and in vivo measurements can be referenced.
G. Maguluri, D. Mehta, J. Kobler, J. Park, and N. Iftimia, “Synchronized, concurrent optical coherence tomography and videostroboscopy for monitoring vocal fold morphology and kinematics,” Biomedical Optics Express, vol. 10, no. 9, pp. 4450-4461, 2019.
Voice disorders affect a large number of adults in the United States, and their clinical evaluation heavily relies on laryngeal videostroboscopy, which captures the medial-lateral and anterior-posterior motion of the vocal folds using stroboscopic sampling. However, videostroboscopy does not provide direct visualization of the superior-inferior movement of the vocal folds, which yields important clinical insight. In this paper, we present a novel technology that complements videostroboscopic findings by adding the ability to image the coronal plane and visualize the superior-inferior movement of the vocal folds. The technology is based on optical coherence tomography, which is combined with videostroboscopy within the same endoscopic probe to provide spatially and temporally co-registered images of the mucosal wave motion, as well as vocal fold subsurface morphology. We demonstrate the capability of the rigid endoscopic probe, in a benchtop setting, to characterize the complex movement and subsurface structure of aerodynamically driven excised larynx models within the 50 to 200 Hz phonation range. Our preliminary results encourage future development of this technology with the goal of its use for in vivo laryngeal imaging.
M. Motie-Shirazi, et al., “Toward development of a vocal fold contact pressure probe: Sensor characterization and validation using synthetic vocal fold models,” Applied Sciences, vol. 9, no. 15, pp. 3002, 2019.
Excessive vocal fold collision pressures during phonation are considered to play a primary role in the formation of benign vocal fold lesions, such as nodules. The ability to accurately and reliably acquire intraglottal pressure has the potential to provide unique insights into the pathophysiology of phonotrauma. Difficulties arise, however, in directly measuring vocal fold contact pressures due to physical intrusion from the sensor that may disrupt the contact mechanics, as well as difficulty in determining probe/sensor position relative to the contact location. These issues are quantified and addressed through the implementation of a novel approach for identifying the timing and location of vocal fold contact, and measuring intraglottal and vocal fold contact pressures via a pressure probe embedded in the wall of a hemi-laryngeal flow facility. The accuracy and sensitivity of the pressure measurements are validated against ground truth values. Application to in vivo approaches is assessed by acquiring intraglottal and vocal fold contact pressures using a synthetic, self-oscillating vocal fold model in a hemi-laryngeal configuration, where the sensitivity of the measured intraglottal and vocal fold contact pressure relative to the sensor position is explored.
K. L. Marks, J. Z. Lin, A. Fox, L. E. Toles, and D. D. Mehta, “Impact of non-modal phonation on estimates of subglottal pressure from neck-surface acceleration in healthy speakers,” Journal of Speech, Language, and Hearing Research, vol. 62, no. 9, pp. 3339-3358, 2019.
Purpose: The purpose of this study was to evaluate the effects of nonmodal phonation on estimates of subglottal pressure (Ps) derived from the magnitude of a neck-surface accelerometer (ACC) signal and to confirm previous findings regarding the impact of vowel contexts and pitch levels in a larger cohort of participants.
Method: Twenty-six vocally healthy participants (18 women, 8 men) were asked to produce a series of p-vowel syllables with descending loudness in 3 vowel contexts (/a/, /i/, and /u/), 3 pitch levels (comfortable, high, and low), and 4 elicited phonatory conditions (modal, breathy, strained, and rough). Estimates of Ps for each vowel segment were obtained by averaging the intraoral air pressure plateau before and after each segment. The root-mean-square magnitude of the neck-surface ACC signal was computed for each vowel segment. Three linear mixed-effects models were used to statistically assess the effects of vowel, pitch, and phonatory condition on the linear relationship (slope and intercept) between Ps and ACC signal magnitude.
Results: Results demonstrated statistically significant linear relationships between ACC signal magnitude and Ps within participants but with increased intercepts for the nonmodal phonatory conditions; slopes were affected to a lesser extent. Vowel and pitch contexts did not significantly affect the linear relationship between ACC signal magnitude and Ps.
Conclusions: The classic linear relationship between ACC signal magnitude and Ps is significantly affected when nonmodal phonation is produced by a speaker. Future work is warranted to further characterize nonmodal phonatory characteristics to improve the ACC-based prediction of Ps during naturalistic speech production.
A. J. Ortiz, et al., “Automatic speech and singing classification in ambulatory recordings for normal and disordered voices,” The Journal of the Acoustical Society of America, vol. 146, no. 1, pp. EL22–EL27, 2019.
Ambulatory voice monitoring is a promising tool for investigating phonotraumatic vocal hyperfunction (PVH), associated with the development of vocal fold lesions. Since many patients with PVH are professional vocalists, a classifier was developed to better understand phonatory mechanisms during speech and singing. Twenty singers with PVH and 20 matched healthy controls were monitored with a neck-surface accelerometer–based ambulatory voice monitor. An expert-labeled ground truth data set was used to train a logistic regression on 15 subject-pairs with fundamental frequency and autocorrelation peak amplitude as input features. Overall classification accuracy of 94.2% was achieved on the held-out test set.
O. Murton, S. Shattuck-Hufnagel, J.-Y. Choi, and D. D. Mehta, “Identifying a creak probability threshold for an irregular pitch period detection algorithm,” The Journal of the Acoustical Society of America, vol. 145, no. 5, pp. EL379–EL385, 2019.
Irregular pitch periods (IPPs) are associated with grammatically, pragmatically, and clinically significant types of nonmodal phonation, but are challenging to identify. Automatic detection of IPPs is desirable because accurately hand-identifying IPPs is time-consuming and requires training. The authors evaluated an algorithm developed for creaky voice analysis to automatically identify IPPs in recordings of American English conversational speech. To determine a perceptually relevant threshold probability, frame-by-frame creak probabilities were compared to hand labels, yielding a threshold of approximately 0.02. These results indicate a generally good agreement between hand-labeled IPPs and automatic detection, calling for future work investigating effects of linguistic and prosodic context.
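The thresholding step described in the abstract above can be sketched as follows. The function names are hypothetical, and both the at-or-above comparison and the sensitivity/specificity summary are assumptions chosen for illustration; only the approximate threshold of 0.02 comes from the abstract.

```python
def label_frames(creak_probs, threshold=0.02):
    """Mark a frame as irregular when its creak probability reaches the threshold."""
    return [p >= threshold for p in creak_probs]


def frame_agreement(predicted, hand_labels):
    """Return (sensitivity, specificity) of predicted labels against hand labels."""
    pairs = list(zip(predicted, hand_labels))
    tp = sum(1 for p, h in pairs if p and h)
    tn = sum(1 for p, h in pairs if not p and not h)
    fp = sum(1 for p, h in pairs if p and not h)
    fn = sum(1 for p, h in pairs if not p and h)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Sweeping `threshold` over a range of values and recomputing the agreement each time is one simple way to look for a cut-off that matches hand labels.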
D. D. Mehta, V. M. Espinoza, M. Zañartu, J. H. Van Stan, and R. E. Hillman, “The difference between first and second harmonic amplitudes correlates between glottal airflow and neck-surface accelerometer signals during phonation,” The Journal of the Acoustical Society of America, vol. 145, no. 5, pp. EL386–EL392, 2019.
Miniature high-bandwidth accelerometers on the anterior neck surface are used in laboratory and ambulatory settings to obtain vocal function measures. This study compared the widely applied L1–L2 measure (historically, H1–H2)—the difference between the log-magnitude of the first and second harmonics—computed from the glottal airflow waveform with L1–L2 derived from the raw neck-surface acceleration signal in 79 vocally healthy female speakers. Results showed a significant correlation (r = 0.72) between L1–L2 values estimated from both airflow and accelerometer signals, suggesting that raw accelerometer-based estimates of L1–L2 may be interpreted as reflecting glottal physiological parameters and voice quality attributes during phonation.
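The L1–L2 measure defined in the abstract above (difference between the log-magnitudes of the first and second harmonics) can be illustrated with a minimal sketch. It assumes the fundamental frequency is already known, expresses levels in decibels, and measures each harmonic with a single-frequency Fourier projection; the function names are hypothetical and this is not the authors' analysis code.

```python
import math


def harmonic_level_db(signal, fs, freq):
    """Level in dB of the sinusoidal component of `signal` at `freq` Hz."""
    re = sum(x * math.cos(2 * math.pi * freq * n / fs)
             for n, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq * n / fs)
             for n, x in enumerate(signal))
    amplitude = 2 * math.hypot(re, im) / len(signal)
    return 20 * math.log10(amplitude)


def l1_minus_l2(signal, fs, f0):
    """Difference in dB between the first (f0) and second (2 * f0) harmonics."""
    return harmonic_level_db(signal, fs, f0) - harmonic_level_db(signal, fs, 2 * f0)
```

For a segment containing a whole number of periods, a second harmonic at half the amplitude of the first gives an L1–L2 of about 6 dB. The same function can be applied to an airflow waveform or to an accelerometer signal.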
J. A. Whitfield, Z. Kriegel, A. M. Fullenkamp, and D. D. Mehta, “Effects of concurrent manual task performance on connected speech acoustics in individuals with Parkinson disease,” Journal of Speech, Language, and Hearing Research, vol. 62, no. 7, pp. 2099–2117, 2019.
Purpose: Prior investigations suggest that simultaneous performance of more than 1 motor-oriented task may exacerbate speech motor deficits in individuals with Parkinson disease (PD). The purpose of the current investigation was to examine the extent to which performing a low-demand manual task affected the connected speech in individuals with and without PD.
Method: Individuals with PD and neurologically healthy controls performed speech tasks (reading and extemporaneous speech tasks) and an oscillatory manual task (a counterclockwise circle-drawing task) in isolation (single-task condition) and concurrently (dual-task condition).
Results: Relative to speech task performance, no changes in speech acoustics were observed for either group when the low-demand motor task was performed with the concurrent reading tasks. Speakers with PD exhibited a significant decrease in pause duration between the single-task (speech only) and dual-task conditions for the extemporaneous speech task, whereas control participants did not exhibit changes in any speech production variable between the single- and dual-task conditions.
Conclusions: Overall, there were little to no changes in speech production when a low-demand oscillatory motor task was performed with concurrent reading. For the extemporaneous task, however, individuals with PD exhibited significant changes when the speech and manual tasks were performed concurrently, a pattern that was not observed for control speakers.
J. A. Whitfield and D. D. Mehta, “Examination of clear speech in Parkinson disease using passage-level vowel space metrics,” Journal of Speech, Language, and Hearing Research, vol. 62, no. 7, pp. 2082–2098, 2019.
Purpose: The purpose of the current study was to characterize clear speech production for speakers with and without Parkinson disease (PD) using several measures of working vowel space computed from frequently sampled formant tracks.
Method: The 1st 2 formant frequencies were tracked for a reading passage that was produced using habitual and clear speaking styles by 15 speakers with PD and 15 healthy control speakers. Vowel space metrics were calculated from the distribution of frequently sampled formant frequency tracks, including vowel space hull area, articulatory–acoustic vowel space, and multiple vowel space density (VSD) measures based on different percentile contours of the formant density distribution.
Results: Both speaker groups exhibited significant increases in the articulatory–acoustic vowel space and VSD10, the area of the outermost (10th percentile) contour of the formant density distribution, from habitual to clear styles. These clarity-related vowel space increases were significantly smaller for speakers with PD than controls. Both groups also exhibited a significant increase in vowel space hull area; however, this metric was not sensitive to differences in the clear speech response between groups. Relative to healthy controls, speakers with PD exhibited a significantly smaller VSD90, the area of the most central (90th percentile), densely populated region of the formant space.
Conclusions: Using vowel space metrics calculated from formant traces of the reading passage, the current work suggests that speakers with PD do indeed reach the more peripheral regions of the vowel space during connected speech but spend a larger percentage of the time in more central regions of formant space than healthy speakers. Additionally, working vowel space metrics based on the distribution of formant data suggested that speakers with PD exhibited less of a clarity-related increase in formant space than controls, a trend that was not observed for perimeter-based measures of vowel space area.
J. P. Cortés, et al., “Ambulatory assessment of phonotraumatic vocal hyperfunction using glottal airflow measures estimated from neck-surface acceleration,” PLoS One, vol. 13, no. 12, pp. e0209017, 2018.
Phonotraumatic vocal hyperfunction (PVH) is associated with chronic misuse and/or abuse of voice that can result in lesions such as vocal fold nodules. The clinical aerodynamic assessment of vocal function has been recently shown to differentiate between patients with PVH and healthy controls to provide meaningful insight into pathophysiological mechanisms associated with these disorders. However, current clinical assessment of PVH is incomplete because of its inability to objectively identify the type and extent of detrimental phonatory function that is associated with PVH during daily voice use. The current study sought to address this issue by incorporating, for the first time in a comprehensive ambulatory assessment, glottal airflow parameters estimated from a neck-mounted accelerometer and recorded to a smartphone-based voice monitor. We tested this approach on 48 patients with vocal fold nodules and 48 matched healthy-control subjects who each wore the voice monitor for a week. Seven glottal airflow features were estimated every 50 ms using an impedance-based inverse filtering scheme, and seven high-order summary statistics of each feature were computed every 5 minutes over voiced segments. Based on univariate hypothesis testing, eight glottal airflow summary statistics were found to be statistically different between patient and healthy-control groups. L1-regularized logistic regression for a supervised classification task yielded a mean (standard deviation) area under the ROC curve of 0.82 (0.25) and an accuracy of 0.83 (0.14). These results outperform the state-of-the-art classification for the same classification task and provide a new avenue to improve the assessment and treatment of hyperfunctional voice disorders.
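The windowing described in the abstract above (one feature value every 50 ms, summarized every 5 minutes over voiced frames only) can be sketched as follows. The function name, the voiced-frame flags, and the three statistics shown are assumptions for illustration; the abstract does not list which seven summary statistics were used.

```python
import statistics

FRAME_SECONDS = 0.05      # one feature value every 50 ms (from the abstract)
WINDOW_SECONDS = 5 * 60   # summaries every 5 minutes (from the abstract)
FRAMES_PER_WINDOW = round(WINDOW_SECONDS / FRAME_SECONDS)


def window_summaries(values, voiced, frames_per_window=FRAMES_PER_WINDOW):
    """Summarize voiced frames of one feature in consecutive fixed-size windows.

    Returns one dict per window, or None for windows with fewer than
    two voiced frames.
    """
    summaries = []
    for start in range(0, len(values), frames_per_window):
        stop = start + frames_per_window
        chunk = [v for v, is_voiced in zip(values[start:stop], voiced[start:stop])
                 if is_voiced]
        if len(chunk) < 2:
            summaries.append(None)
            continue
        summaries.append({
            "mean": statistics.fmean(chunk),
            "stdev": statistics.stdev(chunk),
            "median": statistics.median(chunk),
        })
    return summaries
```

With the default settings each window spans 6000 frames, so a full day of monitoring yields 288 summary rows per feature.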
M. Brockmann-Bauser, J. E. Bohlender, and D. D. Mehta, “Acoustic perturbation measures improve with increasing vocal intensity in individuals with and without voice disorders,” Journal of Voice, vol. 32, no. 2, pp. 162-168, 2018.

Objectives

In vocally healthy children and adults, speaking voice loudness differences can significantly confound acoustic perturbation measurements. This study examines the effects of voice sound pressure level (SPL) on jitter, shimmer, and harmonics-to-noise ratio (HNR) in adults with voice disorders and a control group with normal vocal status.

Study Design

This is a matched case-control study.

Methods

We assessed 58 adult female voice patients matched according to approximate age and occupation with 58 vocally healthy women. Diagnoses included vocal fold nodules (n = 39, 67.2%), polyps (n = 5, 8.6%), and muscle tension dysphonia (n = 14, 24.1%). All participants sustained the vowel /a/ at soft, comfortable, and loud phonation levels. Acoustic voice SPL, jitter, shimmer, and HNR were computed using Praat. The effects of loudness condition, voice SPL, pathology, differential diagnosis, age, and professional voice use level on acoustic perturbation measures were assessed using linear mixed models and Wilcoxon signed rank tests.

Results

In both patient and normative control groups, increasing voice SPL correlated significantly (P < 0.001) with decreased jitter and shimmer, and increased HNR. Voice pathology and differential diagnosis were not linked to systematically higher jitter and shimmer. HNR levels, however, were statistically higher in the patient group than in the control group at comfortable phonation levels. Professional voice use level had a significant effect (P < 0.05) on jitter, shimmer, and HNR.

Conclusions

The clinical value of acoustic jitter, shimmer, and HNR may be limited if speaking voice SPL and professional voice use level effects are not controlled for. Future studies are warranted to investigate whether perturbation measures are useful clinical outcome metrics when controlling for these effects.

O. Murton, et al., “Acoustic speech analysis of patients with decompensated heart failure: A pilot study,” The Journal of the Acoustical Society of America, vol. 142, no. 4, pp. EL401-EL407, 2017.
This pilot study used acoustic speech analysis to monitor patients with heart failure (HF), which is characterized by increased intracardiac filling pressures and peripheral edema. HF-related edema in the vocal folds and lungs is hypothesized to affect phonation and speech respiration. Acoustic measures of vocal perturbation and speech breathing characteristics were computed from sustained vowels and speech passages recorded daily from ten patients with HF undergoing inpatient diuretic treatment. After treatment, patients displayed a higher proportion of automatically identified creaky voice, increased fundamental frequency, and decreased cepstral peak prominence variation, suggesting that speech biomarkers can be early indicators of HF.
M. Borsky, D. D. Mehta, J. H. Van Stan, and J. Gudnason, “Modal and nonmodal voice quality classification using acoustic and electroglottographic features,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2281-2291, 2017.
The goal of this study was to investigate the performance of different feature types for voice quality classification using multiple classifiers. The study compared the COVAREP feature set (which included glottal source features, frequency-warped cepstrum, and harmonic model features) against the mel-frequency cepstral coefficients (MFCCs) computed from the acoustic voice signal, acoustic-based glottal inverse filtered (GIF) waveform, and electroglottographic (EGG) waveform. Our hypothesis was that MFCCs can capture the perceived voice quality from any of these three voice signals. Experiments were carried out on recordings from 28 participants with normal vocal status who were prompted to sustain vowels with modal and nonmodal voice qualities. Recordings were rated by an expert listener using the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V), and the ratings were transformed into a dichotomous label (presence or absence) for the prompted voice qualities of modal voice, breathiness, strain, and roughness. The classification was done using support vector machines, random forests, deep neural networks, and Gaussian mixture model classifiers, which were built as speaker independent using a leave-one-speaker-out strategy. The best classification accuracy of 79.97% was achieved for the full COVAREP set. The harmonic model features were the best performing subset, with 78.47% accuracy, and the static+dynamic MFCCs scored at 74.52%. A closer analysis showed that MFCC and dynamic MFCC features were able to classify modal, breathy, and strained voice quality dimensions from the acoustic and GIF waveforms. Reduced classification performance was exhibited by the EGG waveform.