OBJECTIVE: Relative fundamental frequency (RFF) has been suggested as a potential acoustic measure of vocal effort. However, current clinical standards for RFF measures require time-consuming manual markings. Previous semi-automated algorithms have been developed to calculate RFF from microphone signals. The current study aimed to develop fully automated algorithms to calculate RFF from neck-surface accelerometer signals for ecological momentary assessment and ambulatory monitoring of voice. METHODS: A training set of 2646 /vowel-fricative-vowel/ utterances from 317 unique speakers, with and without voice disorders, was used to develop automated algorithms to calculate RFF values from neck-surface accelerometer signals. The algorithms first rejected utterances with poor vowel-to-noise ratios, then identified fricative locations, used signal features to determine voicing boundary cycles, and finally calculated the corresponding RFF values. These automated RFF values were compared to the clinical gold standard of manual RFF calculated from simultaneously collected microphone signals in a novel test set of 639 utterances from 77 unique speakers. RESULTS: Automated accelerometer-based RFF values resulted in an average mean bias error (MBE) across all cycles of 0.027 ST, with an MBE of 0.152 ST and -0.252 ST in the offset and onset cycles closest to the fricative, respectively. CONCLUSION: All MBE values were smaller than the expected changes in RFF values following successful voice therapy, suggesting that the current algorithms could be used for ecological momentary assessment and ambulatory monitoring via neck-surface accelerometer signals.
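The two quantities reported above can be sketched briefly. This is a minimal, hypothetical illustration (function names are ours, not from the paper): RFF expresses each voicing cycle's instantaneous f0 in semitones (ST) relative to the steady-state f0 of the adjacent vowel, and the mean bias error (MBE) is the signed average difference between automated and manual RFF values.

```python
import math

def rff_semitones(cycle_f0, steady_f0):
    """RFF of one voicing cycle, in semitones, relative to the
    steady-state f0 of the adjacent vowel."""
    return 12.0 * math.log2(cycle_f0 / steady_f0)

def mean_bias_error(estimates, references):
    """Signed mean difference (MBE) between automated and manual RFF values."""
    return sum(e - r for e, r in zip(estimates, references)) / len(estimates)

# A cycle at the same f0 as the steady portion has RFF = 0 ST.
print(rff_semitones(200.0, 200.0))  # 0.0
# A cycle one octave above the steady portion would be +12 ST.
print(rff_semitones(400.0, 200.0))  # 12.0
```

On this scale, the reported MBE values (0.027 to 0.252 ST in magnitude) are small fractions of a semitone.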
OBJECTIVE: Singers undergoing tonsillectomy are understandably concerned about possible sequelae to their voice. The surgical risks of laryngeal damage from intubation and upper airway scarring are valid reasons for singers to carefully consider their options for treatment of tonsil-related symptoms. No prior studies have statistically assessed objective voice outcomes in a group of adult singers undergoing tonsillectomy. This study determined the impact of tonsillectomy on the adult singing voice by determining whether there were statistically significant changes in preoperative versus postoperative acoustic, aerodynamic, and Voice-Related Quality of Life (VRQOL) measures. STUDY DESIGN: Prospective cohort study. SETTING: Tertiary referral academic hospital. SUBJECTS: Thirty singers undergoing tonsillectomy from 2012 to 2019. METHODS: Acoustic recordings were obtained with the Computerized Speech Lab (CSL) (Pentax CSL 4500) and analyzed with the Multidimensional Voice Program (MDVP) (Pentax MDVP) and Praat acoustic analysis software. Estimates of aerodynamic vocal efficiency were obtained and analyzed using the Phonatory Aerodynamic System (Pentax PAS 6600). Preoperative VRQOL scores were recorded, and singers were instructed to refrain from singing for 3 weeks following tonsillectomy. Repeat acoustic and aerodynamic measures as well as VRQOL scores were obtained at the first postoperative visit. RESULTS: Average postoperative acoustic (jitter, shimmer, HNR) and aerodynamic (sound pressure level divided by subglottal pressure) parameters related to laryngeal phonatory function did not differ significantly from preoperative measures. The only statistically significant change in postoperative measures of resonance was a decrease in the third formant (F3) for the /a/ vowel. Average VRQOL scores improved significantly from 79.8 (SD 18.7) preoperatively to 89 (SD 12.2) postoperatively (P = 0.007).
CONCLUSIONS: Tonsillectomy does not appear to alter laryngeal voice production in adult singers as measured by standard acoustic and aerodynamic parameters. The observed decrease in F3 for the /a/ vowel is hypothetically related to increasing the pharyngeal cross-sectional area by removing tonsillar tissue, but this would not be expected to appreciably impact the perceptual characteristics of the vowel. Singers' self-assessment (VRQOL) improved after tonsillectomy.
The ambulatory assessment of vocal function can be significantly enhanced by having access to physiologically based features that describe underlying pathophysiological mechanisms in individuals with voice disorders. This type of enhancement can improve methods for the prevention, diagnosis, and treatment of behaviorally based voice disorders. Unfortunately, the direct measurement of important vocal features such as subglottal pressure, vocal fold collision pressure, and laryngeal muscle activation is impractical in laboratory and ambulatory settings. In this study, we introduce a method to estimate these features during phonation from a neck-surface vibration signal through a framework that integrates a physiologically relevant model of voice production and machine learning tools. The signal from a neck-surface accelerometer is first processed using subglottal impedance-based inverse filtering to yield an estimate of the unsteady glottal airflow. Seven aerodynamic and acoustic features are extracted from the neck surface accelerometer and an optional microphone signal. A neural network architecture is selected to provide a mapping between the seven input features and subglottal pressure, vocal fold collision pressure, and cricothyroid and thyroarytenoid muscle activation. This non-linear mapping is trained solely with 13,000 Monte Carlo simulations of a voice production model that utilizes a symmetric triangular body-cover model of the vocal folds. The performance of the method was compared against laboratory data from synchronous recordings of oral airflow, intraoral pressure, microphone, and neck-surface vibration in 79 vocally healthy female participants uttering consecutive /pæ/ syllable strings at comfortable, loud, and soft levels. 
The mean absolute error and root-mean-square error for estimating the mean subglottal pressure were 191 Pa (1.95 cm H2O) and 243 Pa (2.48 cm H2O), respectively, which are comparable with previous studies but with the key advantage of not requiring subject-specific training and yielding more output measures. The validation of vocal fold collision pressure and laryngeal muscle activation was performed with synthetic values as reference. These initial results provide valuable insight for further vocal fold model refinement and constitute a proof of concept that the proposed machine learning method is a feasible option for providing physiologically relevant measures for laboratory and ambulatory assessment of vocal function.
Cepstrum-based voice measures, such as smoothed cepstral peak prominence (CPPS), are influenced by voice sound pressure level (SPL) in vocally healthy adults. Since it is unclear if similar effects hold in voice disordered adults and how these interact with natural fundamental frequency (fo) changes, this study examines voice SPL and fo effects on CPPS in women with vocal hyperfunction and vocally healthy controls.
Retrospective matched case-control study.
Fifty-eight women with vocal hyperfunction were individually matched with 58 vocally healthy women for occupation and approximate age. The patient group comprised women exhibiting phonotraumatic vocal hyperfunction associated with vocal fold nodules (n = 39) or polyps (n = 5), and nonphonotraumatic vocal hyperfunction associated with primary muscle tension dysphonia (n = 14). All participants sustained the vowel /a/ at soft, comfortable, and loud loudness conditions. Voice SPL, fo, and CPPS (dB) were computed from acoustic voice recordings using Praat. The effects of loudness condition, measured voice SPL, and fo on CPPS were assessed with linear mixed models. Pairwise correlations among voice SPL, fo, and CPPS were assessed using multiple regression analysis.
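The pairwise correlations reported in studies like this reduce to a single-predictor least-squares fit. The sketch below is a minimal, stdlib-only illustration of such a fit (e.g. CPPS in dB regressed on voice SPL in dB); it is not the linear mixed models or multiple regression the authors used, and the variable names are ours.

```python
def ols_r2(x, y):
    """Least-squares slope, intercept, and coefficient of determination (r2)
    for a single predictor, e.g. CPPS (dB) regressed on voice SPL (dB)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Perfectly linear toy data yields r2 = 1.0.
slope, intercept, r2 = ols_r2([60.0, 70.0, 80.0], [10.0, 12.0, 14.0])
print(slope, intercept, r2)  # 0.2 -2.0 1.0
```

An r2 of 0.53 (patients) or 0.45 (controls) under this definition means that voice SPL alone accounts for roughly half of the variance in CPPS.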
Increasing voice SPL correlated significantly (P < 0.001) with higher CPPS in both patient (r2 = 0.53) and normative groups (r2 = 0.45). fo had statistically significant effects on CPPS (P < 0.001), but with a weak relation for the patient (r2 = 0.02) and control groups (r2 = 0.05).
In women with and without voice disorder, CPPS is highly affected by the individual's voice SPL in vowel phonation. Future studies could investigate how these effects should be controlled for to improve the diagnostic value of acoustic-based cepstral measures.
Purpose The purpose of this viewpoint article is to facilitate research on vocal hyperfunction (VH). VH is implicated in the most commonly occurring types of voice disorders, but there remains a pressing need to increase our understanding of the etiological and pathophysiological mechanisms associated with VH to improve the prevention, diagnosis, and treatment of VH-related disorders. Method A comprehensive theoretical framework for VH is proposed based on an integration of prevailing clinical views and research evidence. Results The fundamental structure of the current framework is based on a previous (simplified) version that was published over 30 years ago (Hillman et al., 1989). A central premise of the framework is that there are two primary manifestations of VH (phonotraumatic VH and nonphonotraumatic VH) and that multiple factors contribute and interact in different ways to cause and maintain these two types of VH. Key hypotheses are presented about the way different factors may contribute to phonotraumatic VH and nonphonotraumatic VH and how the associated disorders may respond to treatment. Conclusions This updated and expanded framework is meant to help guide future research, particularly the design of longitudinal studies, which can lead to a refinement in knowledge about the etiology and pathophysiology of VH-related disorders. Such new knowledge should lead to further refinements in the framework and serve as a basis for improving the prevention and evidence-based clinical management of VH.
The goal of this study was to employ frequently used analysis methods and tasks to identify values for cepstral peak prominence (CPP) that can aid clinical voice evaluation. Experiment 1 identified CPP values to distinguish speakers with and without voice disorders. Experiment 2 was an initial attempt to estimate auditory-perceptual ratings of overall dysphonia severity using CPP values.
CPP was computed using the Analysis of Dysphonia in Speech and Voice (ADSV) program and Praat. Experiment 1 included recordings from 295 patients with medically diagnosed voice disorders and 50 vocally healthy control speakers. Speakers produced sustained /a/ vowels and the English language Rainbow Passage. CPP cutoff values that best distinguished patient and control speakers were identified. Experiment 2 analyzed recordings from 32 English speakers with varying dysphonia severity and provided preliminary validation of the Experiment 1 cutoffs. Speakers sustained the /a/ vowel and read four sentences from the Consensus Auditory-Perceptual Evaluation of Voice protocol. Trained listeners provided auditory-perceptual ratings of overall dysphonia for the recordings, which were estimated using CPP values in a linear regression model whose performance was evaluated using the coefficient of determination (r2).
Experiment 1 identified CPP cutoff values of 11.46 dB (ADSV) and 14.45 dB (Praat) for the sustained /a/ vowels and 6.11 dB (ADSV) and 9.33 dB (Praat) for the Rainbow Passage. CPP values below those thresholds indicated the presence of a voice disorder with up to 94.5% accuracy. In Experiment 2, CPP values estimated ratings of overall dysphonia with r2 values up to .74.
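The cutoff values above lend themselves to a simple screening rule. The following sketch is a hypothetical illustration (the lookup table encodes the cutoffs reported in Experiment 1; the function and key names are ours, not part of any published tool):

```python
# CPP cutoffs (dB) from Experiment 1; values below the cutoff suggest a
# voice disorder for the given analysis tool and speech task.
CPP_CUTOFFS = {
    ("ADSV", "vowel"): 11.46,
    ("Praat", "vowel"): 14.45,
    ("ADSV", "passage"): 6.11,
    ("Praat", "passage"): 9.33,
}

def flags_disorder(cpp_db, tool, task):
    """True if the CPP value falls below the tool- and task-specific cutoff."""
    return cpp_db < CPP_CUTOFFS[(tool, task)]

print(flags_disorder(10.0, "ADSV", "vowel"))   # True: 10.0 < 11.46
print(flags_disorder(15.0, "Praat", "vowel"))  # False: 15.0 >= 14.45
```

Note that the cutoffs are tool-specific: ADSV and Praat compute CPP differently, so a value from one program must never be compared against the other program's threshold.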
The CPP cutoff values identified in Experiment 1 provide normative reference points for clinical voice evaluation based on sustained /a/ vowels and the Rainbow Passage. Experiment 2 provides an initial predictive framework that can be used to relate CPP values to the auditory perception of overall dysphonia severity based on sustained /a/ vowels and Consensus Auditory-Perceptual Evaluation of Voice sentences.
Given the established linear relationship between neck surface vibration magnitude and mean subglottal pressure (Ps) in vocally healthy speakers, the purpose of this study was to better understand the impact of the presence of a voice disorder on this baseline relationship.
Data were obtained from participants with voice disorders representing a variety of glottal conditions, including phonotraumatic vocal hyperfunction, nonphonotraumatic vocal hyperfunction, and unilateral vocal fold paralysis. Participants were asked to repeat /p/-vowel syllable strings from loud-to-soft loudness levels in multiple vowel contexts (/pa/, /pi/, /pu/) and pitch levels (comfortable, higher than comfortable, lower than comfortable). Three statistical metrics were computed to analyze the regression line between neck surface accelerometer (ACC) signal magnitude and Ps within and across pitch, vowel, and voice disorder category: coefficient of determination (r2), slope, and intercept. Three linear mixed-effects models were used to evaluate the impact of voice disorder category, pitch level, and vowel context on the relationship between ACC signal magnitude and Ps.
The relationship between ACC signal magnitude and Ps was statistically different in patients with voice disorders than in vocally healthy controls; patients exhibited higher levels of Ps given similar values of ACC signal magnitude. Negligible effects were found for pitch condition within each voice disorder category, and negligible-to-small effects were found for vowel context. The mean of patient-specific r2 values was .63, ranging from .13 to .92.
The baseline, linear relationship between ACC signal magnitude and Ps is affected by the presence of a voice disorder, with the relationship being participant-specific. Further work is needed to improve ACC-based prediction of Ps across treatment and during naturalistic speech production.
Subglottal air pressure plays a major role in voice production and is a primary factor in controlling voice onset, offset, sound pressure level, glottal airflow, vocal fold collision pressures, and variations in fundamental frequency. Previous work has shown promise for the estimation of subglottal pressure from an unobtrusive miniature accelerometer sensor attached to the anterior base of the neck during typical modal voice production across multiple pitch and vowel contexts. This study expands on that work to incorporate additional accelerometer-based measures of vocal function to compensate for non-modal phonation characteristics and achieve an improved estimation of subglottal pressure. Subjects with normal voices repeated /p/-vowel syllable strings from loud-to-soft levels in multiple vowel contexts (/a/, /i/, and /u/), pitch conditions (comfortable, lower than comfortable, higher than comfortable), and voice quality types (modal, breathy, strained, and rough). Subject-specific, stepwise regression models were constructed using root-mean-square (RMS) values of the accelerometer signal alone (baseline condition) and in combination with cepstral peak prominence, fundamental frequency, and glottal airflow measures derived using subglottal impedance-based inverse filtering. Five-fold cross-validation assessed the robustness of model performance using the root-mean-square error metric for each regression model. Each cross-validation fold exhibited up to a 25% decrease in prediction error when the model incorporated multi-dimensional aspects of the accelerometer signal compared with RMS-only models. Improved estimation of subglottal pressure for non-modal phonation was thus achievable, lending to future studies of subglottal pressure estimation in patients with voice disorders and in ambulatory voice recordings.
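The evaluation scheme described above can be sketched in outline. This is a minimal, stdlib-only illustration of the two generic ingredients (contiguous k-fold index splitting and the root-mean-square error metric); it is not the authors' stepwise regression pipeline, and the function names are ours.

```python
import math

def k_fold_indices(n, k=5):
    """Split sample indices 0..n-1 into k contiguous folds for cross-validation."""
    fold_size = math.ceil(n / k)
    return [list(range(i, min(i + fold_size, n))) for i in range(0, n, fold_size)]

def rmse(predicted, actual):
    """Root-mean-square error between predicted and measured values,
    e.g. estimated vs. measured subglottal pressure."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(predicted))

print(k_fold_indices(10, 5))       # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
print(rmse([1.0, 2.0], [1.0, 2.0]))  # 0.0
```

In each fold, a model fit on the remaining folds would be scored with `rmse` on the held-out fold; the reported "up to a 25% decrease in prediction error" compares this per-fold RMSE between the multi-feature and RMS-only models.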
Irregular pitch periods (IPPs) are associated with grammatically, pragmatically, and clinically significant types of nonmodal phonation, but are challenging to identify. Automatic detection of IPPs is desirable because accurately hand-identifying IPPs is time-consuming and requires training. The authors evaluated an algorithm developed for creaky voice analysis to automatically identify IPPs in recordings of American English conversational speech. To determine a perceptually relevant threshold probability, frame-by-frame creak probabilities were compared to hand labels, yielding a threshold of approximately 0.02. These results indicate a generally good agreement between hand-labeled IPPs and automatic detection, calling for future work investigating effects of linguistic and prosodic context.
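The thresholding step described above amounts to comparing frame-by-frame creak probabilities against the perceptually derived cutoff. A minimal, hypothetical sketch (the function name is ours; the 0.02 default is the threshold reported above):

```python
def detect_ipp_frames(creak_probs, threshold=0.02):
    """Mark frames whose creak probability reaches the threshold as IPP frames.

    creak_probs: per-frame creak probabilities from a creaky-voice detector.
    """
    return [p >= threshold for p in creak_probs]

probs = [0.001, 0.015, 0.03, 0.5, 0.01]
print(detect_ipp_frames(probs))  # [False, False, True, True, False]
```

Consecutive flagged frames would then be merged into candidate IPP regions for comparison against hand labels.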
In vocally healthy children and adults, speaking voice loudness differences can significantly confound acoustic perturbation measurements. This study examines the effects of voice sound pressure level (SPL) on jitter, shimmer, and harmonics-to-noise ratio (HNR) in adults with voice disorders and a control group with normal vocal status.
This is a matched case-control study.
We assessed 58 adult female voice patients matched according to approximate age and occupation with 58 vocally healthy women. Diagnoses included vocal fold nodules (n = 39, 67.2%), polyps (n = 5, 8.6%), and muscle tension dysphonia (n = 14, 24.1%). All participants sustained the vowel /a/ at soft, comfortable, and loud phonation levels. Acoustic voice SPL, jitter, shimmer, and HNR were computed using Praat. The effects of loudness condition, voice SPL, pathology, differential diagnosis, age, and professional voice use level on acoustic perturbation measures were assessed using linear mixed models and Wilcoxon signed rank tests.
In both patient and normative control groups, increasing voice SPL correlated significantly (P < 0.001) with decreased jitter and shimmer, and increased HNR. Voice pathology and differential diagnosis were not linked to systematically higher jitter and shimmer. HNR levels, however, were statistically higher in the patient group than in the control group at comfortable phonation levels. Professional voice use level had a significant effect (P < 0.05) on jitter, shimmer, and HNR.
The clinical value of acoustic jitter, shimmer, and HNR may be limited if speaking voice SPL and professional voice use level effects are not controlled for. Future studies are warranted to investigate whether perturbation measures are useful clinical outcome metrics when controlling for these effects.
This pilot study used acoustic speech analysis to monitor patients with heart failure (HF), which is characterized by increased intracardiac filling pressures and peripheral edema. HF-related edema in the vocal folds and lungs is hypothesized to affect phonation and speech respiration. Acoustic measures of vocal perturbation and speech breathing characteristics were computed from sustained vowels and speech passages recorded daily from ten patients with HF undergoing inpatient diuretic treatment. After treatment, patients displayed a higher proportion of automatically identified creaky voice, increased fundamental frequency, and decreased cepstral peak prominence variation, suggesting that speech biomarkers can be early indicators of HF.
The goal of this study was to investigate the performance of different feature types for voice quality classification using multiple classifiers. The study compared the COVAREP feature set (glottal source features, frequency-warped cepstrum, and harmonic model features) against the mel-frequency cepstral coefficients (MFCCs) computed from the acoustic voice signal, acoustic-based glottal inverse filtered (GIF) waveform, and electroglottographic (EGG) waveform. Our hypothesis was that MFCCs can capture the perceived voice quality from any of these three voice signals. Experiments were carried out on recordings from 28 participants with normal vocal status who were prompted to sustain vowels with modal and nonmodal voice qualities. Recordings were rated by an expert listener using the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V), and the ratings were transformed into a dichotomous label (presence or absence) for the prompted voice qualities of modal voice, breathiness, strain, and roughness. The classification was done using support vector machines, random forests, deep neural networks, and Gaussian mixture model classifiers, which were built as speaker independent using a leave-one-speaker-out strategy. The best classification accuracy of 79.97% was achieved for the full COVAREP set. The harmonic model features were the best performing subset, with 78.47% accuracy, and the static+dynamic MFCCs scored at 74.52%. A closer analysis showed that MFCC and dynamic MFCC features were able to classify modal, breathy, and strained voice quality dimensions from the acoustic and GIF waveforms. Reduced classification performance was exhibited by the EGG waveform.
Successful voice training (e.g., singing lessons) and vocal rehabilitation (e.g., therapy for a voice disorder) involve learning complex vocal behaviors. However, there are no metrics describing how humans learn new vocal skills or predicting how long the improved behavior will persist post-therapy. To develop measures capable of describing and predicting vocal motor learning, a theory-based paradigm from limb motor control inspired the development of a virtual task where subjects throw projectiles at a target via modifications in vocal pitch and loudness. Ten subjects with healthy voices practiced this complex vocal task for five days. The many-to-one mapping between the execution variables pitch and loudness and resulting target error was evaluated using an analysis that quantified distributional properties of variability: tolerance, noise, and covariation costs (TNC costs). Lag-1 autocorrelation (AC1) and the detrended-fluctuation-analysis scaling index (SCI) analyzed temporal aspects of variability. Vocal data replicated limb-based findings: TNC costs were positively correlated with error; AC1 and SCI were modulated in relation to the task's solution manifold. The data suggest that vocal and limb motor learning are similar in how the learner navigates the solution space. Future work calls for investigating the game's potential to improve voice disorder diagnosis and treatment.
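Of the temporal measures mentioned above, lag-1 autocorrelation is straightforward to sketch. The following is a minimal, stdlib-only illustration of AC1 over a sequence of trial outcomes (the function name is ours); the DFA scaling index is omitted, as it requires a considerably longer computation.

```python
def lag1_autocorrelation(series):
    """Lag-1 autocorrelation (AC1) of a sequence of trial outcomes.

    Values near 0 indicate trial-to-trial independence; positive values
    indicate persistence, negative values indicate alternation."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in series)
    return num / den

# A strictly alternating sequence has strongly negative AC1.
print(lag1_autocorrelation([1.0, -1.0, 1.0, -1.0]))  # -0.75
# A monotonically drifting sequence has positive AC1.
print(lag1_autocorrelation([1.0, 2.0, 3.0, 4.0]))    # 0.25
```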
This study examined the relationship between the magnitude of neck-surface vibration (NSVMag; transduced with an accelerometer) and intraoral estimates of subglottal pressure (P'sg) during variations in vocal effort at 3 intensity levels.
Twelve vocally healthy adults produced strings of /pɑ/ syllables in 3 vocal intensity conditions, while increasing vocal effort during each condition. Measures were made of P'sg (estimated during stop-consonant closure), NSVMag (measured during the following vowel), sound pressure level, and respiratory kinematics. Mixed linear regression was used to analyze the relationship between NSVMag and P'sg with respect to total lung volume excursion, levels of lung volume initiation and termination, airflow, laryngeal resistance, and vocal efficiency across intensity conditions.
NSVMag was significantly related to P'sg (p < .001), and there was a significant, although small, interaction between NSVMag and intensity condition. Total lung excursion was the only additional variable contributing to predicting the NSVMag-P'sg relationship.
NSVMag closely reflects P'sg during variations of vocal effort; however, the relationship changes across different intensities in some individuals. Future research should explore additional NSV-based measures (e.g., glottal airflow features) to improve estimation accuracy during voice production.
Relative fundamental frequency (RFF) has shown promise as an acoustic measure of voice, but the subjective and time-consuming nature of its manual estimation has made clinical translation infeasible. Here, a faster, more objective algorithm for RFF estimation is evaluated in a large and diverse sample of individuals with and without voice disorders.
Acoustic recordings were collected from 154 individuals with voice disorders and 36 age- and sex-matched controls with typical voices. These recordings were split into training and 2 testing sets. Using an algorithm tuned to the training set, semi-automated RFF estimates in the testing sets were compared to manual RFF estimates derived from 3 trained technicians.
The semi-automated RFF estimations were highly correlated (r = 0.82-0.91) with the manual RFF estimates.
Fast and more objective estimation of RFF makes large-scale RFF analysis feasible. This algorithm allows for future work to optimize RFF measures and expand their potential for clinical voice assessment.
Purpose The purpose of this article is to examine the ability of an acoustic measure, relative fundamental frequency (RFF), to distinguish between two subtypes of vocal hyperfunction (VH): phonotraumatic (PVH) and non-phonotraumatic (NPVH).
Method RFF values were compared among control individuals with typical voices (N = 49), individuals with PVH (N = 54), and individuals with NPVH (N = 35).
Results Offset Cycle 10 RFF differed significantly among all 3 groups with values progressively decreasing for controls, individuals with NPVH, and individuals with PVH. Individuals with PVH also had lower Offset Cycles 8 and 9 relative to the other 2 groups and lower RFF values for Offset Cycle 7 relative to controls. There was also a trend for lower Onset Cycle 1 RFF values for the PVH group compared with the NPVH group.
Conclusions RFF values were significantly different between controls and individuals with VH and also between the two subtypes of VH. This study adds further support to the notion that the differences between these two subsets of VH may be functional as well as structural.
This article provides a summary of some recent innovations in voice assessment expected to have an impact in the next 5–10 years on how patients with voice disorders are clinically managed by speech-language pathologists. Specific innovations discussed are in the areas of laryngeal imaging, ambulatory voice monitoring, and “big data” analysis using machine learning to produce new metrics for vocal health. Also discussed is the potential for using voice analysis to detect and monitor other health conditions.
This study analyzes signals recorded using a neck-surface accelerometer from subjects producing speech with different voice modes. The purpose is to explore whether the recorded waveforms can capture the glottal vibratory patterns that can be related to the movement of the vocal folds and thus voice quality. The accelerometer waveforms do not contain the supraglottal resonances that encode speech content, which makes the proposed method suitable for real-life voice quality assessment and monitoring without breaching patient privacy. The experiments with a Gaussian mixture model classifier demonstrate that different voice qualities produce distinctly different accelerometer waveforms. The system achieved 80.2% and 89.5% for frame- and utterance-level accuracy, respectively, for classifying among modal, breathy, pressed, and rough voice modes using a speaker-dependent classifier. Finally, the article presents characteristic waveforms for each modality and discusses their attributes.
Improper function of the vocal folds can result in perceptually distorted speech that is typically associated with various speech pathologies or even certain neurological diseases. As a consequence, researchers have focused on finding quantitative voice characteristics to objectively assess and automatically detect nonmodal voice types. The bulk of the research has focused on classifying the speech modality by using features extracted from the speech signal. This paper proposes a different approach that focuses on analyzing the signal characteristics of the electroglottogram (EGG) waveform. The core idea is that modal and different kinds of nonmodal voice types produce EGG signals that have distinct spectral/cepstral characteristics. As a consequence, they can be distinguished from each other by using standard cepstral-based features and a simple multivariate Gaussian mixture model. The practical usability of this approach has been verified in the task of classifying among modal, breathy, rough, pressed, and soft voice types. We achieved 83% frame-level accuracy and 91% utterance-level accuracy by training a speaker-dependent system.