The contrast sensitivity function (CSF) relates the visibility of a spatial pattern to both its size and contrast, and is therefore a more comprehensive assessment of visual function than acuity, which only determines the smallest resolvable pattern size. Because of the additional dimension of contrast, estimating the CSF can be more time-consuming. Here, we compare two methods for rapid assessment of the CSF that were implemented on a tablet device. For a single-trial assessment, we asked 63 myopes and 38 emmetropes to tap the peak of a "sweep grating" on the tablet's touch screen. For a more precise assessment, subjects performed 50 trials of the quick CSF method in a 10-AFC letter recognition task. Tests were performed with and without optical correction, and in monocular and binocular conditions; one condition was measured twice to assess repeatability.
Results show that both methods are highly correlated; using both common and novel measures for test-retest repeatability, however, the quick CSF delivers more precision with testing times of under three minutes. Further analyses show how a population prior can improve convergence rate of the quick CSF, and how the multi-dimensional output of the quick CSF can provide greater precision than scalar outcome measures.
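One common way to quantify test-retest repeatability is the Bland-Altman coefficient of repeatability: 1.96 times the standard deviation of the test-retest differences. The abstract does not say which measures were used, so the following is only a minimal sketch of that standard measure. The function names and the sensitivity values are made up for illustration.

```python
import statistics

def mean_difference(test, retest):
    """Mean test-retest difference (systematic bias between sessions)."""
    return statistics.mean(a - b for a, b in zip(test, retest))

def coefficient_of_repeatability(test, retest):
    """Bland-Altman coefficient of repeatability: 1.96 * SD of the differences."""
    diffs = [a - b for a, b in zip(test, retest)]
    return 1.96 * statistics.stdev(diffs)

# Hypothetical log contrast sensitivities from two runs of the same condition.
test = [1.0, 1.2, 1.4, 1.1]
retest = [1.1, 1.1, 1.5, 1.0]
bias = mean_difference(test, retest)
cor = coefficient_of_repeatability(test, retest)
```

A smaller coefficient means that two measurements of the same observer agree more closely, so a more precise test yields a smaller value.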
We here present parts of our ongoing work to facilitate the large-scale analysis of smooth pursuit eye movements made while viewing dynamic natural scenes. Classification of smooth pursuit episodes can be difficult in the presence of eye-tracking noise, and we thus recently proposed an algorithm that clusters gaze recordings from several observers in order to improve classification robustness. We have now implemented a publicly available tool that allows for generation of a ground truth benchmark by assisted hand-labelling of video gaze data. Based on the labelling produced with the tool, we present preliminary evaluation results for our smooth pursuit classification approach in comparison to state-of-the-art algorithms. Overall, human observers spend more than 12% of their viewing time performing smooth pursuit, which emphasizes the importance of investigating smooth pursuit behaviour in naturalistic contexts.
Gaze holds great potential for fast and intuitive hands-free user interaction. However, existing methods typically suffer from the Midas touch problem, i.e. the difficult distinction between gaze for perception and for user action; proposed solutions have required custom-tailored, application-specific user interfaces. Here, we present SPOCK, a novel gaze interaction method based on smooth pursuit eye movements requiring only minimal extensions to button-based interfaces. Upon looking at a UI element, two overlaid dynamic stimuli appear and tracking one of them triggers activation. In contrast to fixations and saccades, smooth pursuits are not only easily performed, but also easily suppressed, thus greatly reducing the Midas touch problem. We evaluated SPOCK against dwell time, the state-of-the-art gaze interaction method, in a simple target selection and a more challenging multiple-choice scenario. At higher task difficulty, unintentional target activations were reduced almost 15-fold by SPOCK, making this a promising method for gaze interaction.
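The abstract does not specify how tracking of one of the two stimuli is detected. One plausible approach, sketched here purely as an assumption, is to correlate the recent gaze trajectory with each stimulus trajectory and activate only when one correlation exceeds a threshold. The one-dimensional (horizontal) traces, the function names, and the threshold of 0.8 are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either trace has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(vx * vy)

def select_target(gaze_x, targets_x, threshold=0.8):
    """Index of the stimulus whose motion the gaze follows, or None."""
    best_index, best_r = None, threshold
    for i, target in enumerate(targets_x):
        r = pearson(gaze_x, target)
        if r > best_r:
            best_index, best_r = i, r
    return best_index

# Two hypothetical stimuli moving in opposite horizontal directions.
moving_left = [-float(t) for t in range(20)]
moving_right = [float(t) for t in range(20)]
pursuit_gaze = [t + 0.5 * (-1) ** t for t in range(20)]   # noisy tracking
fixation_gaze = [0.3 * (-1) ** t for t in range(20)]      # jitter only

chosen = select_target(pursuit_gaze, [moving_left, moving_right])
ignored = select_target(fixation_gaze, [moving_left, moving_right])
```

In this toy example the noisy pursuit trace selects the right-moving stimulus, while the fixation jitter selects nothing. Not activating on gaze that merely rests on an element is the behaviour needed to avoid the Midas touch problem.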
While many elaborate algorithms to classify eye movements into fixations and saccades exist, detection of smooth pursuit eye movements is still challenging. Smooth pursuits do not occur for the predominantly studied static stimuli; for dynamic stimuli, it is difficult to distinguish small gaze displacements due to noise from smooth pursuit. We propose to improve noise robustness by combining information from multiple recordings: if several people show similar gaze patterns that are neither fixations nor saccades, these episodes are likely smooth pursuits. We evaluated our approach against two baseline algorithms on a hand-labelled subset of the GazeCom data set of dynamic natural scenes, using three different clustering algorithms to determine gaze similarity. Results show that our approach achieves a very substantial increase in precision at improved recall over state-of-the-art algorithms that consider individual gaze traces only.
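The core idea, that gaze samples shared by several observers are more likely to be pursuit than noise, can be illustrated with a simple neighbour count. This is a simplified stand-in, not one of the three clustering algorithms evaluated in the paper. The sample layout and the parameters `radius`, `window`, and `min_observers` are assumptions chosen for illustration.

```python
import math

def label_pursuit_candidates(samples, radius=2.0, window=0.04, min_observers=3):
    """Flag samples that enough *other* observers share in space and time.

    samples: list of (observer, t, x, y) tuples that remain after fixations
    and saccades have been removed by some earlier step (not shown here).
    """
    labels = []
    for obs, t, x, y in samples:
        neighbours = {o for o, t2, x2, y2 in samples
                      if o != obs
                      and abs(t2 - t) <= window
                      and math.hypot(x2 - x, y2 - y) <= radius}
        labels.append(len(neighbours) >= min_observers)
    return labels

# Four observers follow the same moving target; a fifth looks elsewhere.
samples = [(obs, 0.04 * k, 0.5 * k + 0.1 * obs, 0.0)
           for obs in range(4) for k in range(5)]
samples += [(9, 0.04 * k, 20.0, 10.0) for k in range(5)]
labels = label_pursuit_candidates(samples)
```

The quadratic loop is fine for a toy example; real recordings would need a spatial index or a proper clustering library.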
A key property of human visual behavior is the very frequent movement of our eyes to potentially relevant information in the environment. Observers thus continuously have to prioritize the information to which they direct their eyes. Research in this field has been hampered by a lack of appropriate measures and tools. Here, we propose and validate a novel measure of priority that takes advantage of the variability in the natural viewing behavior of individual observers. In short, our measure assumes that priority is low when observers' gaze behavior is inconsistent and high when it is very consistent. We calculated priority for gaze data obtained during an experiment in which participants viewed dynamic natural scenes while we simultaneously recorded their gaze position and brain activity using functional magnetic resonance imaging. Our priority measure shows only limited correlation with various saliency, surprise, and motion measures, indicating it is assessing a distinct property of visual behavior. Finally, we correlated our priority measure with the BOLD signal, thereby revealing activity in a select number of human occipital and parietal areas. This suggests the presence of a cortical network involved in computing and representing viewing priority. We conclude that our new analysis method allows for empirically establishing the priority of events in near-natural vision paradigms.
Background: Impaired low-contrast visual acuity (LCVA) is common in multiple sclerosis (MS) and other neurological diseases. Its assessment is often limited to selected contrasts, for example, 2.5% or 1.25%. Computerized adaptive testing with the quick contrast-sensitivity function (qCSF) method allows assessment across expanded contrast and spatial frequency ranges.
Objective: The objective of this article is to compare qCSF with high- and low-contrast charts and patient-reported visual function.
Methods: We enrolled 131 consecutive MS patients (mean age 39.6 years) to assess high-contrast visual acuity (HCVA) at 30 cm and 5 m, low-contrast vision with Sloan charts at 2.5% and 1.25%, qCSF and the National Eye Institute Visual Functioning Questionnaire (NEIVFQ). Associations between the different measures were estimated with linear regression models corrected for age, gender and multiple testing.
Results: The association between qCSF and Sloan charts (R² = 0.68) was higher than with HCVA (5 m: R² = 0.5; 30 cm: R² = 0.41). The highest association with NEIVFQ subscales was observed for qCSF (R² 0.20–0.57), while Sloan charts were not associated with any NEIVFQ subscale after correction for multiple testing.
Conclusion: The qCSF is a promising new outcome for low-contrast vision in MS and other neurological diseases. Here we show a closer link to patient-reported visual function than standard low- and high-contrast charts.
The Contrast Sensitivity Function relates the spatial frequency and contrast of a spatial pattern to its visibility and thus provides a fundamental description of visual function. However, the current clinical standard of care typically restricts assessment to visual acuity, i.e. the smallest stimulus size that can be resolved at full contrast; alternatively, tests of contrast sensitivity are typically restricted to assessment of the lowest visible contrast for a fixed letter size. This restriction to one-dimensional subspaces of a two-dimensional space was necessary when stimuli were printed on paper charts and simple scoring rules were applied manually. More recently, however, computerized testing and electronic screens have enabled more flexible stimulus displays and more complex test algorithms. For example, the quick CSF method uses a Bayesian adaptive procedure and an information maximization criterion to select only informative stimuli; testing times to precisely estimate the whole contrast sensitivity function are reduced to 2-5 minutes. Here, we describe the implementation of the quick CSF method in a medical device. We make several usability enhancements to make it suitable for use in clinical settings. A first usability study shows excellent results, with a mean System Usability Scale score of 86.5.
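The principle of Bayesian adaptive stimulus selection with an information maximization criterion can be shown in a one-dimensional toy version. This sketch estimates a single contrast threshold rather than the full contrast sensitivity function, and it is not the quick CSF implementation. The Weibull psychometric function, its parameters, and the threshold grid are assumptions.

```python
import math

def p_correct(contrast, threshold, guess=0.1, slope=2.0, lapse=0.04):
    """Assumed Weibull psychometric function for a 10-alternative task."""
    detect = 1.0 - math.exp(-((contrast / threshold) ** slope))
    return guess + (1.0 - guess - lapse) * detect

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def update(prior, thresholds, contrast, correct):
    """Bayes rule: posterior over candidate thresholds after one response."""
    likelihood = [p_correct(contrast, t) if correct
                  else 1.0 - p_correct(contrast, t) for t in thresholds]
    unnorm = [l * p for l, p in zip(likelihood, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def expected_entropy(prior, thresholds, contrast):
    """Posterior entropy averaged over the two possible responses."""
    p_yes = sum(p * p_correct(contrast, t) for p, t in zip(prior, thresholds))
    return (p_yes * entropy(update(prior, thresholds, contrast, True))
            + (1.0 - p_yes) * entropy(update(prior, thresholds, contrast, False)))

def next_contrast(prior, thresholds, candidates):
    """Pick the stimulus expected to leave the least uncertainty."""
    return min(candidates, key=lambda c: expected_entropy(prior, thresholds, c))

thresholds = [0.01 * 2 ** i for i in range(7)]      # hypothetical grid, 1% to 64%
prior = [1.0 / len(thresholds)] * len(thresholds)   # flat prior
stimulus = next_contrast(prior, thresholds, thresholds)
posterior = update(prior, thresholds, stimulus, correct=True)
```

Repeating the select-respond-update cycle concentrates the posterior on the observer's threshold. Stimuli whose outcome is nearly certain under the current posterior carry little information and are therefore rarely chosen.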
The contrast sensitivity function (CSF) provides a fundamental characterization of spatial vision, important for basic and clinical applications, but its long testing times have prevented easy, widespread assessment. The quick CSF method was developed using a 2AFC grating orientation identification task (Lesmes, Lu, Baek, & Albright, 2010), and obtained precise CSF assessments while reducing the testing burden to only 50 trials. In this study, we attempt to further improve the quick CSF’s efficiency by exploiting the properties of psychometric functions in multiple-alternative forced choice (m-AFC) tasks. A simulation study evaluated the effect of the number of alternatives m on the efficiency of the sensitivity measurement by the quick CSF, and a psychophysical study validated the quick CSF in a 10AFC task. We found that increasing the number of alternatives of the forced-choice task greatly improved the efficiency of CSF assessment in both simulation and psychophysical studies. A quick CSF method based on a 10-letter identification task can assess the CSF with an average standard deviation of 0.10 decimal log units in less than 2 minutes.
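The property of m-AFC psychometric functions most relevant here is the guessing rate of 1/m: with more alternatives, chance performance drops and the function spans a wider range of response probabilities. The sketch below illustrates this with a Weibull form. The functional form, slope, and lapse rate are assumptions, not values taken from the study.

```python
import math

def p_correct(contrast, threshold, m, slope=2.0, lapse=0.04):
    """Probability of a correct response in an m-AFC task (assumed Weibull form)."""
    guess = 1.0 / m                      # chance level with m alternatives
    detect = 1.0 - math.exp(-((contrast / threshold) ** slope))
    return guess + (1.0 - guess - lapse) * detect

for m in (2, 4, 10):
    floor = p_correct(0.0, 1.0, m)       # invisible stimulus: pure guessing
    ceiling = p_correct(100.0, 1.0, m)   # clearly visible: limited by lapses
    print(f"{m}-AFC: p(correct) ranges from {floor:.2f} to {ceiling:.2f}")
```

Under these assumptions the 2AFC function spans only 0.50 to 0.96, whereas the 10AFC function spans 0.10 to 0.96. A correct response is then much less likely to be a lucky guess, so each trial is more informative.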
Saliency prediction typically relies on hand-crafted (multiscale) features that are combined in different ways to form a “master” saliency map, which encodes local image conspicuity. Recent improvements to the state of the art on standard benchmarks such as MIT1003 have been achieved mostly by incrementally adding more and more hand-tuned features (such as car or face detectors) to existing models [18,4,22,34]. In contrast, we here follow an entirely automatic data-driven approach that performs a large-scale search for optimal features. We identify those instances of a richly-parameterized bio-inspired model family (hierarchical neuromorphic networks) that successfully predict image saliency. Because of the high dimensionality of this parameter space, we use automated hyperparameter optimization to efficiently guide the search. The optimal blend of such multilayer features combined with a simple linear classifier achieves excellent performance on several image saliency benchmarks. Our models outperform the state of the art on MIT1003, on which features and classifiers are learned. Without additional training, these models generalize well to two other image saliency data sets, Toronto and NUSEF, despite their different image content. Finally, our algorithm scores best of all the 23 models evaluated to date on the MIT300 saliency challenge, which uses a hidden test set to facilitate an unbiased comparison.
The fundamental role of the visual system is to guide behavior in natural environments. In order to optimize information transmission many animals have evolved a non-homogeneous retina and serially sample visual scenes by saccadic eye movements. Such eye movements, however, introduce high-speed retinal motion and decouple external and internal reference frames. Until now, these processes have only been studied with unnatural stimuli, eye movement behavior, and tasks. These experiments confound retinotopic and geotopic coordinate systems and may probe a non-representative functional range. Here we develop a real-time gaze-contingent display with precise spatio-temporal control over high-definition natural movies. In an active condition, human observers freely watched nature documentaries and indicated the location of periodic narrow-band contrast increments relative to their gaze position. In a passive condition under central fixation, the same retinal input was replayed to each observer by updating the video's screen position. Comparison of visual sensitivity between conditions revealed three mechanisms which the visual system has adapted to compensate for peri-saccadic vision changes. Under natural conditions, we show that reduced visual sensitivity during eye movements can be explained simply by the high retinal speed during a saccade without recourse to an extra-retinal mechanism of active suppression; give evidence for enhanced sensitivity immediately after an eye movement, indicative of visual receptive fields remapping in anticipation of forthcoming spatial structure; and demonstrate that perceptual decisions can be made in world rather than retinal coordinates.
This study investigated how to teach perceptual tasks, that is, classifying fish locomotion, through eye movement modeling examples (EMME). EMME consisted of a replay of eye movements of a didactically behaving domain expert (model), which had been recorded while he executed the task, superimposed onto the video stimulus. Seventy-five students were randomly assigned to one of three conditions: In two experimental conditions (EMME) the model’s eye movements were superimposed onto the video either as a dot or as a spotlight, whereas the control group studied only the videos without the model’s eye movements. In all conditions, students listened to the expert’s verbal explanations. Results showed that both types of EMME guided students’ attention during example study. Subsequent to learning, students performed a classification task for novel test stimuli without any support. EMME improved visual search and enhanced interpretation of relevant information for those novel stimuli compared to the control group; these effects were further moderated by the specific display. Thus, EMME during training can foster learning and improve performance on novel perceptual stimuli.
Interdisciplinary research in human vision and electronic imaging has greatly contributed to the current state of the art in imaging technologies. Image compression and image quality are prominent examples, and the progress made in these areas relies on a better understanding of what natural images are and how they are perceived by the human visual system. A key research question has been: given the (statistical) properties of natural images, what are the most efficient and perceptually relevant image representations, and what are the most prominent and descriptive features of images and videos?
We give an overview of how these topics have evolved over the 25 years of HVEI conferences and how they have influenced the current state of the art. There are a number of striking parallels between human vision and electronic imaging. The retina does lateral inhibition, one of the early coders was using a Laplacian pyramid; primary visual cortical areas have orientation- and frequency-selective neurons, the current JPEG standard defines similar wavelet transforms; the brain uses a sparse code, engineers are currently excited about sparse coding and compressed sensing. Some of this has indeed happened at the HVEI conferences and we would like to distill that.
Looking at the right place at the right time is a critical component of driving skill. Therefore, gaze guidance has the potential to become a valuable driving assistance system. In previous work, we have already shown that complex gaze-contingent stimuli can guide attention and reduce the number of accidents in a simple driving simulator. We here set out to investigate whether cues that are simple enough to be implemented in a real car can also capture gaze during a more realistic driving task in a high-fidelity driving simulator. This immediately raises another question, namely how such cues would interfere with the driving task itself.
We used a state-of-the-art, wide-field-of-view driving simulator with an integrated eye tracker. Gaze-contingent warnings were implemented using two arrays of light-emitting diodes horizontally fitted below and above the simulated windshield. Twelve volunteers drove along predetermined routes in the simulated environment populated with autonomous traffic. Warnings were triggered during the approach to half of the intersections, cueing either towards the right or to the left. The remaining intersections were not cued, and served as controls. A preliminary analysis shows that gaze-contingent cues led to a significant shift in gaze position towards the highlighted direction.
Algorithms using “bag of features”-style video representations currently achieve state-of-the-art performance on action recognition tasks, such as the challenging Hollywood2 benchmark. These algorithms are based on local spatiotemporal descriptors that can be extracted either sparsely (at interest points) or densely (on regular grids), with dense sampling typically leading to the best performance. Here, we investigate the benefit of space-variant processing of inputs, inspired by attentional mechanisms in the human visual system. We employ saliency-mapping algorithms to find informative regions, and descriptors corresponding to these regions are either used exclusively or given greater representational weight (additional codebook vectors). This approach is evaluated with three state-of-the-art action recognition algorithms and several saliency algorithms. We also use saliency maps derived from human eye movements to probe the limits of the approach. Saliency-based pruning allows up to 70% of descriptors to be discarded, while maintaining high performance on Hollywood2. Meanwhile, pruning of 20-50% (depending on model) can even improve recognition. Further improvements can be obtained by combining representations learned separately on salience-pruned and unpruned descriptor sets. Not surprisingly, using the human eye movement data gives the best mean Average Precision (mAP; 61.9%), providing an upper bound on what is possible with a high-quality saliency map. Even without such external data, the Dense Trajectories model enhanced by automated saliency-based descriptor sampling achieves the best mAP (60.0%) reported on Hollywood2 to date.
Patients with hemispatial neglect are severely impaired in orienting their attention to contralesional hemispace. Although motion is one of the strongest attentional cues in humans, it is still unknown how neglect patients visually explore their moving real-world environment.
We therefore recorded eye movements at bedside in 19 patients with hemispatial neglect following acute right hemisphere stroke, 14 right-brain damaged patients without neglect and 21 healthy control subjects. Videos of naturalistic real-world scenes were presented first in a free viewing condition together with static images, and subsequently in a visual search condition. We analyzed number and amplitude of saccades, fixation durations and horizontal fixation distributions. Novel computational tools allowed us to assess the impact of different scene features (static and dynamic contrast, colour, brightness) on patients' gaze.
Independent of the different stimulus conditions, neglect patients showed decreased numbers of fixations in contralesional hemispace (ipsilesional fixation bias) and increased fixation durations in ipsilesional hemispace (disengagement deficit). However, in videos, left-hemifield fixations of neglect patients landed on regions with particularly high dynamic contrast. Furthermore, dynamic scenes with few salient objects led to a significant reduction of the pathological ipsilesional fixation bias. In visual search, moving targets in the neglected hemifield were more frequently detected than stationary ones. The top-down influence (search instruction) could reduce neither the ipsilesional fixation bias nor the impact of bottom-up features.
Our results provide evidence for a strong impact of dynamic bottom-up features on neglect patients' scanning behaviour. They support the neglect model of an attentional priority map in the brain being imbalanced towards ipsilesional hemispace, which can be counterbalanced by strong contralateral motion cues. Taking into account the lack of top-down control in neglect patients, bottom-up stimulation with moving real-world stimuli may be a promising candidate for future neglect rehabilitation schemes.
Local spatiotemporal descriptors are being successfully used as a powerful video representation for action recognition. Particularly competitive recognition performance is achieved when these descriptors are densely sampled on a regular grid; in contrast to existing approaches that are based on features at interest points, dense sampling captures more contextual information, albeit at high computational cost. We here combine advantages of both dense and sparse sampling. Once descriptors are extracted on a dense grid, we prune them either randomly or based on a sparse saliency mask of the underlying video. The method is evaluated using two state-of-the-art algorithms on the challenging Hollywood2 benchmark. Classification performance is maintained with as few as 30% of descriptors, while more modest saliency-based pruning of descriptors yields improved performance. With roughly 80% of descriptors of the Dense Trajectories model, we outperform all previously reported methods, obtaining a mean average precision of 59.5%.
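The two pruning strategies can be sketched on a toy grid as follows. The descriptor layout `(x, y, vector)`, the function names, and the rule of keeping the descriptors at the most salient locations are illustrative assumptions. The abstract does not specify how the saliency mask is applied.

```python
import random

def _n_keep(descriptors, keep_fraction):
    """Number of descriptors to retain (at least one if any exist)."""
    return min(len(descriptors), max(1, round(keep_fraction * len(descriptors))))

def prune_by_saliency(descriptors, saliency, keep_fraction):
    """Keep the descriptors located at the most salient grid positions.

    descriptors: list of (x, y, vector); saliency: 2D list indexed [y][x].
    """
    ranked = sorted(descriptors, key=lambda d: saliency[d[1]][d[0]], reverse=True)
    return ranked[:_n_keep(descriptors, keep_fraction)]

def prune_randomly(descriptors, keep_fraction, seed=0):
    """Baseline: keep a uniformly random subset of the same size."""
    return random.Random(seed).sample(descriptors, _n_keep(descriptors, keep_fraction))

# Hypothetical 5x2 dense grid with a placeholder descriptor at every position.
saliency = [[0.1, 0.2, 0.9, 0.8, 0.1],
            [0.0, 0.3, 0.7, 0.4, 0.2]]
descriptors = [(x, y, [float(x), float(y)]) for y in range(2) for x in range(5)]

kept = prune_by_saliency(descriptors, saliency, keep_fraction=0.3)
baseline = prune_randomly(descriptors, keep_fraction=0.3)
```

Both functions return the same number of descriptors, so downstream differences in recognition performance can be attributed to which descriptors are kept rather than how many.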