Interdisciplinary research in human vision and electronic imaging has greatly contributed to the current state of the art in imaging technologies. Image compression and image quality are prominent examples, and the progress made in these areas relies on a better understanding of what natural images are and how they are perceived by the human visual system. A key research question has been: given the (statistical) properties of natural images, what are the most efficient and perceptually relevant image representations, and what are the most prominent and descriptive features of images and videos? We give an overview of how these topics have evolved over the 25 years of HVEI conferences and how they have influenced the current state of the art. There are a number of striking parallels between human vision and electronic imaging: the retina performs lateral inhibition, and one of the earliest image coders used a Laplacian pyramid; primary visual cortical areas contain orientation- and frequency-selective neurons, and the JPEG 2000 standard defines similar wavelet transforms; the brain uses a sparse code, and engineers are currently excited about sparse coding and compressed sensing. Much of this exchange has indeed happened at the HVEI conferences, and we aim to distill it here.
Looking at the right place at the right time is a critical component of driving skill. Therefore, gaze guidance has the potential to become a valuable driving assistance system. In previous work, we have already shown that complex gaze-contingent stimuli can guide attention and reduce the number of accidents in a simple driving simulator. We here set out to investigate whether cues that are simple enough to be implemented in a real car can also capture gaze during a more realistic driving task in a high-fidelity driving simulator. This immediately raises another question, namely how such cues would interfere with the driving task itself.
We used a state-of-the-art, wide-field-of-view driving simulator with an integrated eye tracker. Gaze-contingent warnings were implemented using two horizontal arrays of light-emitting diodes fitted below and above the simulated windshield. Twelve volunteers drove along predetermined routes in the simulated environment populated with autonomous traffic. Warnings were triggered during the approach to half of the intersections, cueing either to the left or to the right; the remaining intersections were not cued and served as controls. A preliminary analysis shows that gaze-contingent cues led to a significant shift in gaze position towards the highlighted direction.
This study investigated how to teach perceptual tasks, that is, classifying fish locomotion, through eye movement modeling examples (EMME). EMME consisted of a replay of eye movements of a didactically behaving domain expert (model), which had been recorded while he executed the task, superimposed onto the video stimulus. Seventy-five students were randomly assigned to one of three conditions: In two experimental conditions (EMME) the model’s eye movements were superimposed onto the video either as a dot or as a spotlight, whereas the control group studied only the videos without the model’s eye movements. In all conditions, students listened to the expert’s verbal explanations. Results showed that both types of EMME guided students’ attention during example study. Subsequent to learning, students performed a classification task for novel test stimuli without any support. EMME improved visual search and enhanced interpretation of relevant information for those novel stimuli compared to the control group; these effects were further moderated by the specific display. Thus, EMME during training can foster learning and improve performance on novel perceptual stimuli.
The fundamental role of the visual system is to guide behavior in natural environments. In order to optimize information transmission, many animals have evolved a non-homogeneous retina and serially sample visual scenes by saccadic eye movements. Such eye movements, however, introduce high-speed retinal motion and decouple external and internal reference frames. Until now, these processes have only been studied with unnatural stimuli, eye movement behavior, and tasks. These experiments confound retinotopic and geotopic coordinate systems and may probe a non-representative functional range. Here we develop a real-time gaze-contingent display with precise spatio-temporal control over high-definition natural movies. In an active condition, human observers freely watched nature documentaries and indicated the location of periodic narrow-band contrast increments relative to their gaze position. In a passive condition under central fixation, the same retinal input was replayed to each observer by updating the video's screen position. Comparison of visual sensitivity between conditions revealed three mechanisms by which the visual system compensates for peri-saccadic changes in vision. Under natural conditions, we show that reduced visual sensitivity during eye movements can be explained simply by the high retinal speed during a saccade, without recourse to an extra-retinal mechanism of active suppression; give evidence for enhanced sensitivity immediately after an eye movement, indicative of receptive-field remapping in anticipation of forthcoming spatial structure; and demonstrate that perceptual decisions can be made in world rather than retinal coordinates.
We here model peripheral vision in a compressed sensing framework as a strategy of optimally guessing what stimulus corresponds to a sparsely encoded peripheral representation, and find that typical letter-crowding effects naturally arise from this strategy. The model is simple as it consists of only two convergence stages. We apply the model to the problem of crowding effects in reading. First, we show a few instructive examples of letter images that were reconstructed from encodings with different convergence rates. Then, we present an initial analysis of how the choice of model parameters affects the distortion of isolated and flanked letters.
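As a rough illustration of the guessing strategy, a sparse code can be recovered from a sub-sampled ("converged") measurement and used to reconstruct the stimulus; reconstruction errors of the kind that produce crowding grow as convergence increases. The sketch below is a minimal toy version, assuming a random dictionary and scikit-learn's orthogonal matching pursuit; the dictionary, encoder, and parameters are illustrative, not those of the paper.

    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    rng = np.random.default_rng(0)
    n_pixels, n_atoms = 256, 512            # 16x16 patch, overcomplete dictionary
    D = rng.standard_normal((n_pixels, n_atoms))
    D /= np.linalg.norm(D, axis=0)          # unit-norm dictionary atoms

    # Convergence stage: far fewer measurements than pixels
    n_meas = 64                             # stronger convergence -> more distortion
    P = rng.standard_normal((n_meas, n_pixels)) / np.sqrt(n_meas)

    x = rng.standard_normal(n_pixels)       # stand-in for a (flanked) letter patch
    y = P @ x                               # sparse peripheral encoding

    # "Optimal guess": sparsest code whose projection matches the measurement
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=20, fit_intercept=False)
    omp.fit(P @ D, y)
    x_hat = D @ omp.coef_                   # reconstructed, possibly crowded, patch
    print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))

Decreasing n_meas in this toy setup corresponds to a higher convergence rate and yields larger reconstruction distortions, analogous to the letter distortions analyzed in the paper.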
We here study the predictability of eye movements when viewing high-resolution natural videos. We use three recently published gaze data sets that contain a wide range of footage, from scenes of almost still-life character to professionally made, fast-paced advertisements and movie trailers. Inter-subject gaze variability differs significantly between data sets, with variability being lowest for the professional movies. We then evaluate three state-of-the-art saliency models on these data sets. A model that is based on the invariants of the structure tensor and that combines very generic, sparse video representations with machine learning techniques outperforms the two reference models; performance is further improved for two data sets when the model is extended to a perceptually inspired colour space. Finally, a combined analysis of gaze variability and predictability shows that eye movements on the professionally made movies are the most coherent (due to implicit gaze-guidance strategies of the movie directors), yet the least predictable (presumably due to the frequent cuts). Our results highlight the need for standardized benchmarks to comparatively evaluate eye movement prediction algorithms.
Our study explores the potential of gaze guidance in driving and analyses eye movements and driving behaviour in safety-critical situations. We collected eye movements from subjects instructed to drive pre-determined routes in a driving simulator. While driving, the subjects performed various cognitive tasks designed to divert their attention away from the road. The 30 subjects were equally divided into two groups, a control group and a gaze-guidance group. For the latter, potentially dangerous events, such as a pedestrian suddenly crossing the street, were highlighted with temporally transient gaze-contingent cues, which were triggered if the subject did not look at the pedestrian. For the group that drove with gaze guidance, eye movements show reduced variability after the gaze-capturing event and shorter reaction times to it. More importantly, gaze guidance leads to safer driving behaviour and a significantly reduced number of collisions.
Patients with hemispatial neglect are severely impaired in orienting their attention to contralesional hemispace. Although motion is one of the strongest attentional cues in humans, it is still unknown how neglect patients visually explore their moving real-world environment.
We therefore recorded eye movements at bedside in 19 patients with hemispatial neglect following acute right-hemisphere stroke, 14 right-brain-damaged patients without neglect, and 21 healthy control subjects. Videos of naturalistic real-world scenes were presented, first in a free-viewing condition together with static images and subsequently in a visual search condition. We analyzed the number and amplitude of saccades, fixation durations, and horizontal fixation distributions. Novel computational tools allowed us to assess the impact of different scene features (static and dynamic contrast, colour, brightness) on patients' gaze.
Independent of the different stimulus conditions, neglect patients showed decreased numbers of fixations in contralesional hemispace (ipsilesional fixation bias) and increased fixation durations in ipsilesional hemispace (disengagement deficit). However, in videos, left-hemifield fixations of neglect patients landed on regions with particularly high dynamic contrast. Furthermore, dynamic scenes with few salient objects led to a significant reduction of the pathological ipsilesional fixation bias. In visual search, moving targets in the neglected hemifield were detected more frequently than stationary ones. The top-down influence (search instruction) could neither reduce the ipsilesional fixation bias nor the impact of bottom-up features.
Our results provide evidence for a strong impact of dynamic bottom-up features on neglect patients' scanning behaviour. They support the neglect model of an attentional priority map in the brain being imbalanced towards ipsilesional hemispace, which can be counterbalanced by strong contralateral motion cues. Taking into account the lack of top-down control in neglect patients, bottom-up stimulation with moving real-world stimuli may be a promising candidate for future neglect rehabilitation schemes.
Since visual attention-based computer vision applications have gained popularity, ever more complex, biologically inspired models seem to be needed to predict salient locations (or interest points) in naturalistic scenes. In this paper, we explore how far one can go in predicting eye movements by using only basic signal processing, such as image representations derived from efficient coding principles, and machine learning. To this end, we gradually increase the complexity of a model from simple single-scale saliency maps computed on grayscale videos to spatio-temporal multiscale and multispectral representations. Using a large collection of eye movements on high-resolution videos, supervised learning techniques fine-tune the free parameters whose addition is inevitable with increasing complexity. The proposed model, although very simple, demonstrates significant improvement in predicting salient locations in naturalistic videos over four baseline models, under two distinct data labelling scenarios.
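The simplest starting point of such a model hierarchy can be sketched in a few lines. The following center-surround map on a single grayscale frame is a hypothetical illustration of a "single-scale saliency map", not the paper's exact feature set; scales and normalization are arbitrary.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def single_scale_saliency(frame, sigma_center=1.0, sigma_surround=8.0):
        # Center-surround difference of Gaussian-blurred copies of the frame
        f = frame.astype(np.float64)
        s = np.abs(gaussian_filter(f, sigma_center) -
                   gaussian_filter(f, sigma_surround))
        return s / (s.max() + 1e-12)        # normalize to [0, 1]

    saliency = single_scale_saliency(np.random.rand(1080, 1920))  # toy frame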
Local spatiotemporal descriptors are being successfully used as a powerful video representation for action recognition. Particularly competitive recognition performance is achieved when these descriptors are densely sampled on a regular grid; in contrast to existing approaches that are based on features at interest points, dense sampling captures more contextual information, albeit at high computational cost. We here combine advantages of both dense and sparse sampling. Once descriptors are extracted on a dense grid, we prune them either randomly or based on a sparse saliency mask of the underlying video. The method is evaluated using two state-of-the-art algorithms on the challenging Hollywood2 benchmark. Classification performance is maintained with as little as 30% of descriptors, while more modest saliency-based pruning of descriptors yields improved performance. With roughly 80% of descriptors of the Dense Trajectories model, we outperform all previously reported methods, obtaining a mean average precision of 59.5%.
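A minimal sketch of the two pruning strategies, assuming descriptors have already been extracted on a dense grid with integer (t, y, x) positions and that a saliency volume for the video is available (all names hypothetical):

    import numpy as np

    def prune_by_saliency(desc, pos, saliency, keep_fraction=0.3):
        # desc: (N, D) descriptors; pos: (N, 3) integer (t, y, x) locations;
        # saliency: (T, H, W) volume. Keep only the most salient fraction.
        scores = saliency[pos[:, 0], pos[:, 1], pos[:, 2]]
        thresh = np.quantile(scores, 1.0 - keep_fraction)
        keep = scores >= thresh
        return desc[keep], pos[keep]

    def prune_randomly(desc, pos, keep_fraction=0.3, seed=0):
        # Random pruning baseline used for comparison
        keep = np.random.default_rng(seed).random(len(desc)) < keep_fraction
        return desc[keep], pos[keep]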
Algorithms using “bag of features”-style video representations currently achieve state-of-the-art performance on action recognition tasks, such as the challenging Hollywood2 benchmark. These algorithms are based on local spatiotemporal descriptors that can be extracted either sparsely (at interest points) or densely (on regular grids), with dense sampling typically leading to the best performance. Here, we investigate the benefit of space-variant processing of inputs, inspired by attentional mechanisms in the human visual system. We employ saliency-mapping algorithms to find informative regions and descriptors corresponding to these regions are either used exclusively, or are given greater representational weight (additional codebook vectors). This approach is evaluated with three state-of-the-art action recognition algorithms, and using several saliency algorithms. We also use saliency maps derived from human eye movements to probe the limits of the approach. Saliency-based pruning allows up to 70% of descriptors to be discarded, while maintaining high performance on Hollywood2. Meanwhile, pruning of 20-50% (depending on model) can even improve recognition. Further improvements can be obtained by combining representations learned separately on salience-pruned and unpruned descriptor sets. Not surprisingly, using the human eye movement data gives the best mean Average Precision (mAP; 61.9%), providing an upper bound on what is possible with a high-quality saliency map. Even without such external data, the Dense Trajectories model enhanced by automated saliency-based descriptor sampling achieves the best mAP (60.0%) reported on Hollywood2 to date.
A less studied component of gaze allocation in dynamic real-world scenes is the time lag of eye movements in responding to dynamic attention-capturing events. Despite the vast amount of research on anticipatory gaze behaviour in natural situations, such as action execution and observation, little is known about the predictive nature of eye movements when viewing different types of natural or realistic scene sequences. In the present study, we quantify the degree of anticipation during the free viewing of dynamic natural scenes. A cross-correlation analysis of image-based saliency maps with an empirical saliency measure derived from eye movement data reveals the existence of predictive mechanisms responsible for a near-zero average lag between dynamic changes of the environment and the responding eye movements. We also show that the degree of anticipation is reduced when moving away from natural scenes by introducing camera motion, jump cuts, and film editing.
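The lag analysis can be sketched as follows, assuming per-frame scalar signals (e.g., model saliency sampled at fixated locations, and an empirical gaze-derived counterpart) of equal length; names and preprocessing are illustrative only:

    import numpy as np

    def peak_lag(model_signal, gaze_signal):
        # Frame-wise z-scored signals; returns the lag (in frames) at which
        # their cross-correlation peaks. A near-zero lag indicates that gaze
        # keeps up with dynamic changes, i.e., anticipatory eye movements.
        m = (model_signal - model_signal.mean()) / model_signal.std()
        g = (gaze_signal - gaze_signal.mean()) / gaze_signal.std()
        xcorr = np.correlate(g, m, mode="full")
        lags = np.arange(-len(m) + 1, len(m))
        return lags[np.argmax(xcorr)]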
Contrast sensitivity has been extensively studied over the last decades, and there are well-established models of early vision that were derived by presenting the visual system with synthetic stimuli such as sine-wave gratings near threshold contrasts. Natural scenes, however, contain a much wider distribution of orientations, spatial frequencies, and both luminance and contrast values. Furthermore, humans typically move their eyes two to three times per second under natural viewing conditions, but most laboratory experiments require subjects to maintain central fixation. We here describe a gaze-contingent display capable of performing real-time contrast modulations of video in retinal coordinates, thus allowing us to study contrast sensitivity during active viewing of dynamic scenes. Our system computes a Laplacian pyramid for each frame, which efficiently represents individual frequency bands. Each output pixel is then computed as a locally weighted sum of pyramid levels to introduce local contrast changes as a function of gaze. Our GPU implementation achieves real-time performance of more than 100 fps on high-resolution video (1920 by 1080 pixels) with a synthesis latency of only 1.5 ms. Psychophysical data show that contrast sensitivity is greatly decreased in natural videos and under dynamic viewing conditions. Synthetic stimuli therefore characterize natural vision only poorly.
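The analysis/synthesis step can be sketched as follows, assuming OpenCV, grayscale frames whose dimensions are divisible by 2**levels, and a made-up Gaussian gain function; the actual system uses calibrated, experiment-specific per-level weights and a GPU implementation.

    import numpy as np
    import cv2

    def laplacian_pyramid(frame, levels=4):
        pyr, cur = [], frame.astype(np.float32)
        for _ in range(levels):
            down = cv2.pyrDown(cur)
            pyr.append(cur - cv2.pyrUp(down))   # band-pass level
            cur = down
        pyr.append(cur)                         # low-pass residual
        return pyr

    def synthesize(pyr, gaze_xy, sigma=150.0, floor=0.3):
        # Each output pixel is a locally weighted sum of pyramid levels; this
        # example keeps full contrast at gaze and attenuates the periphery.
        h, w = pyr[0].shape[:2]
        yy, xx = np.mgrid[0:h, 0:w].astype(np.float32)
        d2 = (xx - gaze_xy[0]) ** 2 + (yy - gaze_xy[1]) ** 2
        gain = floor + (1.0 - floor) * np.exp(-d2 / (2.0 * sigma ** 2))
        out = pyr[-1]
        for _ in range(len(pyr) - 1):           # upsample residual to full size
            out = cv2.pyrUp(out)
        for i, band in enumerate(pyr[:-1]):
            up = band
            for _ in range(i):                  # upsample band to full size
                up = cv2.pyrUp(up)
            out = out + gain * up               # locally attenuated contrast
        return np.clip(out, 0, 255).astype(np.uint8)

Usage: modulated = synthesize(laplacian_pyramid(frame), gaze_xy=(960, 540)), where gaze_xy is the current gaze position in pixel coordinates.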
Open-source eye trackers are continuously improving in flexibility and tracking quality, and their low cost gives them the potential to bring gaze-controlled applications to a wider audience or even the mass market. We here present a new portable, low-cost, head-mounted eye-tracking system based on the open-source ITU Gaze Tracker software. The setup consists of a pair of self-built tracking glasses with attached cameras for eye and scene recording. The software was significantly extended, adding functionality for calibration in space, scene recording, synchronization of eye and scene videos, and offline tracking. Results of indoor and outdoor evaluations show that our system provides a useful tool for low-cost portable eye tracking; the software is publicly available.
We investigate the contribution of local spatio-temporal variation of image intensity to saliency. To measure different types of variation, we use the geometrical invariants of the structure tensor. With a video represented in spatial axes x and y and temporal axis t, the n-dimensional structure tensor can be evaluated for different combinations of axes (2D and 3D) and also for the (degenerate) case of only one axis. The resulting features are evaluated on several spatio-temporal scales in terms of how well they can predict eye movements on complex videos. We find that a 3D structure tensor is optimal: the most predictive regions of a movie are those where intensity changes along all spatial and temporal directions. Among two-dimensional variations, the axis pair yt, which is sensitive to horizontal translation, outperforms xy and xt by a large margin, and is even superior in prediction to two baseline models of bottom-up saliency.
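A sketch of the 3D case, assuming SciPy and illustrative smoothing scales: the three geometrical invariants H (trace), S (sum of second-order minors), and K (determinant) of the structure tensor signal intensity change along at least one, two, or all three spatio-temporal directions, respectively.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def structure_tensor_invariants(video, grad_sigma=1.0, tensor_sigma=2.0):
        # video: (T, H, W) float array; returns per-voxel invariants H, S, K.
        ft, fy, fx = np.gradient(gaussian_filter(video, grad_sigma))
        prods = {'tt': ft * ft, 'ty': ft * fy, 'tx': ft * fx,
                 'yy': fy * fy, 'yx': fy * fx, 'xx': fx * fx}
        J = {k: gaussian_filter(v, tensor_sigma) for k, v in prods.items()}
        H = J['tt'] + J['yy'] + J['xx']
        S = (J['tt'] * J['yy'] - J['ty'] ** 2 +
             J['tt'] * J['xx'] - J['tx'] ** 2 +
             J['yy'] * J['xx'] - J['yx'] ** 2)
        K = (J['tt'] * (J['yy'] * J['xx'] - J['yx'] ** 2)
             - J['ty'] * (J['ty'] * J['xx'] - J['yx'] * J['tx'])
             + J['tx'] * (J['ty'] * J['yx'] - J['yy'] * J['tx']))
        return H, S, K

K is nonzero only where intensity varies along all three axes, matching the finding above that such regions are the most predictive of gaze.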
Eye tracking has become cheaper and more robust over the past few years, and it will soon be feasible to deploy eye tracking in the mass market. One application area in which an average consumer might benefit from eye tracking is computer games, where gaze direction can add another dimension of input. Progress in this direction will also be of high relevance to disabled users who lack the dexterity to control the input modalities traditionally used in computer games. Not only could gaming with gaze be enjoyable in itself, but the virtual world of multi-player games might also be one arena where disabled users could meet non-disabled users on an equal footing. However, for a satisfactory gaming experience, it does not suffice to simply replace the mouse with a gaze cursor; usually, changes to the game play will also have to be made. In this paper, we present an open-source game that we adapted so that it can be controlled either by mouse or by gaze direction. We show results from a small tournament indicating that gaze is an equal, if not superior, input modality for this game.
For corresponding open-source software, see http://scholar.harvard.edu/mdorr/scholar_software/gaze-controlled-breakout