The aim of this study was to establish reliability and validity for self-ratings of vocal status obtained during the daily activities of patients with vocal hyperfunction (VH) and matched controls.
Eighty-four patients with VH and 74 participants with normal voices answered 3 vocal status questions (difficulty producing soft, high-pitched phonation [D-SHP]; discomfort; and fatigue) on an ambulatory voice monitor at the beginning, at 5-hr intervals, and at the end of each day (7 total days). Two subsets of the patient group answered the questions during a 2nd week after voice therapy (29 patients) or laryngeal surgery (16 patients).
High reliability resulted for patients (Cronbach's α = .88) and controls (α = .95). Patients reported higher D-SHP, discomfort, and fatigue (Cohen's d = 1.62-1.92) compared with controls. Patients posttherapy and postsurgery reported significantly improved self-ratings of vocal status relative to their pretreatment ratings (d = 0.70-1.13). Within-subject changes in self-ratings greater than 20 points were considered clinically meaningful.
Ratings of D-SHP, discomfort, and fatigue have adequate reliability and validity for tracking vocal status throughout daily life in patients with VH and vocally healthy individuals. These questions could help investigate the relationship between vocal symptom variability and putative contributing factors (e.g., voice use/rest, emotions).
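For readers unfamiliar with the two statistics reported above, the following is a minimal sketch of how Cronbach's α and Cohen's d are conventionally computed. The data layout (one list of scores per question, aligned across respondents) and the function names are assumptions made for illustration; the study's exact computation is not described in the abstract.

```python
from statistics import mean, variance

def cronbach_alpha(items):
    """Cronbach's alpha for k items.

    items: list of k lists; items[j][i] is respondent i's score on item j
    (hypothetical layout, assumed for this sketch).
    """
    k = len(items)
    # Total score per respondent, summed across items.
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_var = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5
```

By the usual rule of thumb, d values above 0.8 are read as large effects, which is the range reported for the patient-control differences.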
Ambulatory monitoring of real-world voice characteristics and behavior has the potential to provide important assessment of voice and speech disorders and psychological and emotional state. This paper reports on the development of a novel lightweight, wireless voice monitor that synchronously records dual-channel data from an acoustic microphone and a neck-surface accelerometer embedded on a flex circuit. Lombard speech effects were investigated in pilot data from four adult speakers with normal vocal function who read a phonetically balanced paragraph in the presence of different ambient acoustic noise levels. Whereas the signal-to-noise ratio (SNR) of the microphone signal decreased in the presence of increasing ambient noise level, the SNR of the accelerometer sensor remained high. Lombard speech properties were thus robustly computed from the accelerometer signal and observed in all four speakers, who exhibited increases in average estimates of sound pressure level (+2.3 dB), fundamental frequency (+21.4 Hz), and cepstral peak prominence (+1.3 dB) from quiet to loud ambient conditions. Future work calls for ambulatory data collection in naturalistic environments, where the microphone acts as a sound level meter and the accelerometer functions as a noise-robust voicing sensor to assess voice disorders, neurological conditions, and cognitive load.
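The SNR comparison between the two sensors can be illustrated with a simple RMS-based estimate. This is only a sketch: it assumes that a speech-active segment and a noise-only segment have already been selected from each channel, and the function names are hypothetical rather than taken from the monitor's software.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))

def snr_db(speech_segment, noise_segment):
    """SNR in dB, treating the speech-active segment as the signal
    and a speech-free segment of the same channel as the noise."""
    return 20 * math.log10(rms(speech_segment) / rms(noise_segment))
```

Applying `snr_db` separately to the microphone and accelerometer channels at each ambient noise level would give the kind of per-sensor comparison summarized above.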
Special Session: New trends in imaging for speech production (Speech Communication Technical Committee)
During clinical voice assessment, laryngologists and speech-language pathologists rely heavily on laryngeal endoscopy with videostroboscopy to evaluate pathology and dysfunction of the vocal folds. The cost effectiveness, ease of use, and synchronized audio and visual feedback provided by videostroboscopic assessment serve to maintain its predominant clinical role in laryngeal imaging. However, significant drawbacks include only two-dimensional spatial imaging and the lack of subsurface morphological information. A novel endoscope will be presented that integrates optical coherence tomography that is spatially and temporally co-registered with laryngeal videoendoscopic technology through a common path probe. Optical coherence tomography is a non-contact, micron-resolution imaging technology that acts as a visual ultrasound that employs a scanning laser to measure reflectance properties at air-tissue and tissue-tissue boundaries. Results obtained from excised larynx experiments demonstrate enhanced visualization of three-dimensional vocal fold tissue kinematics and subsurface morphological changes during phonation. Real-time, calibrated three-dimensional imaging of the mucosal wave and subsurface layered microstructure of vocal fold tissue is expected to benefit in-office evaluation of benign and malignant tissue lesions. Future work calls for the in vivo evaluation of the technology in patients before and after surgical management of these types of lesions.
The purpose of this study was to determine the validity of preliminary reports showing that glottal aerodynamic measures can identify pathophysiological phonatory mechanisms for phonotraumatic and nonphonotraumatic vocal hyperfunction, which are each distinctly different from normal vocal function.
Glottal aerodynamic measures (estimates of subglottal air pressure, peak-to-peak airflow, maximum flow declination rate, and open quotient) were obtained noninvasively using a pneumotachograph mask with an intraoral pressure catheter in 16 women with organic vocal fold lesions, 16 women with muscle tension dysphonia, and 2 associated matched control groups with normal voices. Subjects produced /pae/ syllable strings from which glottal airflow was estimated using inverse filtering during /ae/ vowels, and subglottal pressure was estimated during /p/ closures. All measures were normalized for sound pressure level (SPL) and statistically tested for differences between patient and control groups.
All SPL-normalized measures were significantly lower in the phonotraumatic group as compared with measures in its control group. For the nonphonotraumatic group, only SPL-normalized subglottal pressure and open quotient were significantly lower than measures in its control group.
Results of this study confirm previous hypotheses and preliminary results indicating that SPL-normalized estimates of glottal aerodynamic measures can be used to describe the different pathophysiological phonatory mechanisms associated with phonotraumatic and nonphonotraumatic vocal hyperfunction.
Glottal inverse filtering aims to estimate the glottal airflow signal from a speech signal for applications such as speaker recognition and clinical voice assessment. Nonetheless, evaluation of inverse filtering algorithms has been challenging due to the practical difficulties of directly measuring glottal airflow. Apart from this, it is acknowledged that the performance of many methods degrades in voice conditions that are of great interest, such as breathiness, high pitch, soft voice, and running speech. This paper presents a comprehensive, objective, and comparative evaluation of state-of-the-art inverse filtering algorithms that takes advantage of speech and glottal airflow signals generated by a physiological speech synthesizer. The synthesizer provides a physics-based simulation of the voice production process and thus an adequate test bed for revealing the temporal and spectral performance characteristics of each algorithm. Included in the synthetic data are continuous speech utterances and sustained vowels, which are produced with multiple voice qualities (pressed, slightly pressed, modal, slightly breathy, and breathy), fundamental frequencies, and subglottal pressures to simulate the natural variations in real speech. In evaluating the accuracy of a glottal flow estimate, multiple error measures are used, including an error in the estimated signal that measures overall waveform deviation, as well as an error in each of several clinically relevant features extracted from the glottal flow estimate. Waveform errors calculated from glottal flow estimation experiments exhibited mean values around 30% for sustained vowels, and around 40% for continuous speech, of the amplitude of the true glottal flow derivative. Closed-phase approaches showed remarkable stability across different voice qualities and subglottal pressures.
The algorithms of choice, as suggested by significance tests, are closed-phase covariance analysis for the analysis of sustained vowels, and sparse linear prediction for the analysis of continuous speech. Results of data subset analysis suggest that analysis of close rounded vowels is an additional challenge in glottal flow estimation.
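One plausible form of the waveform error mentioned above is sketched below. The definition is an assumption: the abstract states only that the error is expressed relative to the amplitude of the true glottal flow derivative, so the choice of mean absolute deviation, peak-to-peak normalization, and a first-difference derivative is illustrative and not necessarily the paper's measure.

```python
def flow_derivative(flow, fs):
    """First-difference approximation of the time derivative (fs in Hz)."""
    return [(b - a) * fs for a, b in zip(flow, flow[1:])]

def waveform_error_percent(true_flow, estimated_flow, fs):
    """Mean absolute deviation between estimated and true flow derivatives,
    as a percentage of the true derivative's peak-to-peak amplitude.
    (Assumed definition, for illustration only.)"""
    d_true = flow_derivative(true_flow, fs)
    d_est = flow_derivative(estimated_flow, fs)
    amplitude = max(d_true) - min(d_true)
    deviation = sum(abs(e - t) for e, t in zip(d_est, d_true)) / len(d_true)
    return 100 * deviation / amplitude
```

Because the synthesizer supplies the true glottal flow, a measure of this kind can be computed directly for every algorithm and voice condition, which is not possible with natural speech recordings.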
Purpose The purpose of this article is to examine the ability of an acoustic measure, relative fundamental frequency (RFF), to distinguish between two subtypes of vocal hyperfunction (VH): phonotraumatic (PVH) and non-phonotraumatic (NPVH).
Method RFF values were compared among control individuals with typical voices (N = 49), individuals with PVH (N = 54), and individuals with NPVH (N = 35).
Results Offset Cycle 10 RFF differed significantly among all 3 groups with values progressively decreasing for controls, individuals with NPVH, and individuals with PVH. Individuals with PVH also had lower RFF values for Offset Cycles 8 and 9 relative to the other 2 groups and lower RFF values for Offset Cycle 7 relative to controls. There was also a trend for lower Onset Cycle 1 RFF values for the PVH group compared with the NPVH group.
Conclusions RFF values were significantly different between controls and individuals with VH and also between the two subtypes of VH. This study adds further support to the notion that the differences between these two subsets of VH may be functional as well as structural.
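As background on the measure, RFF expresses the fundamental frequency of each vocal cycle around a voiceless consonant in semitones relative to a reference cycle. The sketch below assumes the common convention in which offset cycles are referenced to Offset Cycle 1 and onset cycles to Onset Cycle 10 (the cycles farthest from the consonant); the function name and the example values are hypothetical.

```python
import math

def rff_semitones(cycle_f0_hz, reference_index):
    """Convert per-cycle f0 values (Hz) to semitones relative to the
    cycle at reference_index."""
    reference = cycle_f0_hz[reference_index]
    return [12 * math.log2(f0 / reference) for f0 in cycle_f0_hz]

# Hypothetical f0 values for the 10 offset cycles preceding a voiceless consonant.
offset_f0 = [200, 200, 199, 199, 198, 197, 196, 194, 191, 187]
offset_rff = rff_semitones(offset_f0, reference_index=0)  # Offset Cycle 1 as reference
```

Under this convention, a more negative value at Offset Cycle 10 corresponds to a larger drop in fundamental frequency going into the consonant.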
This article provides a summary of some recent innovations in voice assessment expected to have an impact in the next 5–10 years on how patients with voice disorders are clinically managed by speech-language pathologists. Specific innovations discussed are in the areas of laryngeal imaging, ambulatory voice monitoring, and “big data” analysis using machine learning to produce new metrics for vocal health. Also discussed is the potential for using voice analysis to detect and monitor other health conditions.
This study analyzes signals recorded using a neck-surface accelerometer from subjects producing speech with different voice modes. The purpose is to explore whether the recorded waveforms can capture the glottal vibratory patterns, which can be related to the movement of the vocal folds and thus voice quality. The accelerometer waveforms do not contain the supraglottal resonances, and this characteristic makes the proposed method suitable for real-life voice quality assessment and monitoring because it does not breach patient privacy. The experiments with a Gaussian mixture model classifier demonstrate that different voice qualities produce distinctly different accelerometer waveforms. The system achieved 80.2% and 89.5% for frame- and utterance-level accuracy, respectively, for classifying among modal, breathy, pressed, and rough voice modes using a speaker-dependent classifier. Finally, the article presents characteristic waveforms for each modality and discusses their attributes.
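The distinction between frame-level and utterance-level accuracy can be illustrated as follows. The sketch assumes that each frame has already been assigned a label by the classifier and that an utterance takes the majority label of its frames. The majority-vote rule is an assumption; the study may instead have combined frame likelihoods.

```python
from collections import Counter

def frame_accuracy(utterances):
    """utterances: list of (true_label, [predicted label for each frame])."""
    total = sum(len(frames) for _, frames in utterances)
    correct = sum(frames.count(true) for true, frames in utterances)
    return correct / total

def utterance_accuracy(utterances):
    """An utterance is correct if its most frequent frame label is the true label."""
    correct = sum(
        1 for true, frames in utterances
        if Counter(frames).most_common(1)[0][0] == true
    )
    return correct / len(utterances)
```

Pooling decisions over many frames tends to average out isolated frame errors, which is consistent with the utterance-level accuracy exceeding the frame-level accuracy.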
Purpose Ambulatory voice biofeedback has the potential to significantly improve voice therapy effectiveness by targeting carryover of desired behaviors outside the therapy session (i.e., retention). This study applies motor learning concepts (reduced frequency and delayed, summary feedback) that demonstrate increased retention to ambulatory voice monitoring for training nurses to talk more softly during work hours.
Method Forty-eight nurses with normal voices wore the Voice Health Monitor (Mehta, Zañartu, Feng, Cheyne, & Hillman, 2012) for 6 days: 3 baseline days, 1 biofeedback day, 1 short-term retention day, and 1 long-term retention day. Participants were block-randomized into 3 different biofeedback groups: 100%, 25%, and Summary. Performance was measured in terms of compliance time below a participant-specific vocal intensity threshold.
Results All participants exhibited a significant increase in compliance time (Cohen's d = 4.5) during biofeedback days compared with baseline days. The Summary feedback group exhibited statistically smaller performance reduction during both short-term (d = 1.14) and long-term (d = 1.04) retention days compared with the 100% feedback group.
Conclusions These findings suggest that modifications in feedback frequency and timing affect retention of a modified vocal behavior in daily life. Future work calls for studying the potential beneficial impact of ambulatory voice biofeedback in participants with behaviorally based voice disorders.
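The performance measure used in this study, compliance time below a participant-specific vocal intensity threshold, can be sketched as the percentage of voiced frames whose level falls below that threshold. The frame representation (`None` for unvoiced frames) and the function name are assumptions for illustration, not the Voice Health Monitor's actual data format.

```python
def compliance_percent(frame_levels_db, threshold_db):
    """Percentage of voiced frames with level below the participant's threshold.

    frame_levels_db: per-frame level in dB, or None for unvoiced frames
    (hypothetical representation).
    """
    voiced = [level for level in frame_levels_db if level is not None]
    if not voiced:
        raise ValueError("no voiced frames to evaluate")
    below = sum(1 for level in voiced if level < threshold_db)
    return 100 * below / len(voiced)
```

Comparing this percentage across baseline, biofeedback, and retention days gives the kind of within-participant contrast reported in the Results.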
Purpose Ambulatory voice biofeedback (AVB) has the potential to significantly improve voice therapy effectiveness by targeting one of the most challenging aspects of rehabilitation: carryover of desired behaviors outside of the therapy session. Although initial evidence indicates that AVB can alter vocal behavior in daily life, retention of the new behavior after biofeedback has not been demonstrated. Motor learning studies repeatedly have shown retention-related benefits when reducing feedback frequency or providing summary statistics. Therefore, novel AVB settings that are based on these concepts are developed and implemented.
Method The underlying theoretical framework and resultant implementation of innovative AVB settings on a smartphone-based voice monitor are described. A clinical case study demonstrates the functionality of the new relative frequency feedback capabilities.
Results With new technical capabilities, 2 aspects of feedback are directly modifiable for AVB: relative frequency and summary feedback. Although reduced-frequency AVB was associated with improved carryover of a therapeutic vocal behavior (i.e., reduced vocal intensity) in a patient post-excision of vocal fold nodules, causation cannot be assumed.
Conclusions Timing and frequency of AVB schedules can be manipulated to empirically assess generalization of motor learning principles to vocal behavior modification and test the clinical effectiveness of AVB with various feedback schedules.
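The relative-frequency concept can be made concrete with a small scheduling sketch: with a relative frequency of 25%, only one in four threshold violations triggers a cue, whereas 100% cues every violation. The deterministic spacing used here is an assumption made for clarity; the abstract does not state how the smartphone application distributes its cues.

```python
def cue_schedule(n_violations, relative_frequency):
    """Return one boolean per violation: True if a cue is delivered.

    Cues are spaced evenly so that the delivered fraction matches
    relative_frequency (e.g. 0.25 cues every 4th violation).
    """
    cues, credit = [], 0.0
    for _ in range(n_violations):
        credit += relative_frequency
        if credit >= 1.0:
            cues.append(True)
            credit -= 1.0
        else:
            cues.append(False)
    return cues
```

Summary feedback would instead withhold all cues during speech and report an aggregate statistic, such as the percentage of time spent above threshold, at the end of a time window.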
indirect physiological signal to predict the phase of the vocal fold vibratory cycle for sampling. Simulated stroboscopy (SS) extracts the phase of the glottal cycle directly from the changing glottal area in the high-speed videoendoscopy (HSV) image sequence. The purpose of this study is to determine the reliability of SS relative to VS for clinical assessment of vocal fold vibratory function in patients with mass lesions.
Methods VS and SS recordings were obtained from 28 patients with vocal fold mass lesions before and after phonomicrosurgery and 17 controls who were vocally healthy. Two clinicians rated clinically relevant vocal fold vibratory features using both imaging techniques, indicated their internal level of confidence in the accuracy of their ratings, and provided reasons for low or no confidence.
Results SS had fewer asynchronous image sequences than VS. Vibratory outcomes were able to be computed for more patients using SS. In addition, raters demonstrated better interrater reliability and reported equal or higher levels of confidence using SS than VS.
Conclusion Stroboscopic techniques based on extracting the phase directly from the HSV image sequence are more reliable than acoustic-based VS. Findings suggest that SS derived from high-speed videoendoscopy is a promising improvement over current VS systems.
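The idea behind SS can be sketched in a few lines: locate glottal cycles in the glottal area waveform measured from the HSV frames, then pick one frame per cycle at a phase that advances slightly each time, producing the familiar slow-motion effect. The cycle detector (upward crossings of the mean area) and the phase step are assumptions chosen for simplicity, not the study's algorithm.

```python
def cycle_starts(glottal_area):
    """Frame indices where the area rises through its mean (one per cycle)."""
    mean_area = sum(glottal_area) / len(glottal_area)
    return [
        i for i in range(1, len(glottal_area))
        if glottal_area[i - 1] < mean_area <= glottal_area[i]
    ]

def simulated_strobe_frames(glottal_area, phase_step=0.1):
    """Select one frame per glottal cycle, advancing the sampled phase
    by phase_step (as a fraction of a cycle) from one cycle to the next."""
    starts = cycle_starts(glottal_area)
    frames = []
    for k, (start, end) in enumerate(zip(starts, starts[1:])):
        phase = (k * phase_step) % 1.0
        frames.append(start + round(phase * (end - start)))
    return frames
```

Because the phase is taken from the image sequence itself rather than predicted from an indirect signal, the selected frames stay synchronized with the vibration even when the cycle length varies.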
It has been proven that the improper function of the vocal folds can result in perceptually distorted speech that is typically identified with various speech pathologies or even some neurological diseases. As a consequence, researchers have focused on finding quantitative voice characteristics to objectively assess and automatically detect non-modal voice types. The bulk of the research has focused on classifying the speech modality by using the features extracted from the speech signal. This paper proposes a different approach that focuses on analyzing the signal characteristics of the electroglottogram (EGG) waveform. The core idea is that modal and different kinds of non-modal voice types produce EGG signals that have distinct spectral/cepstral characteristics. As a consequence, they can be distinguished from each other by using standard cepstral-based features and a simple multivariate Gaussian mixture model. The practical usability of this approach has been verified in the task of classifying among modal, breathy, rough, pressed and soft voice types. We have achieved 83% frame-level accuracy and 91% utterance-level accuracy by training a speaker-dependent system.