Imagine a world where machines can see and understand the world the way humans do. Rapid progress in artificial intelligence has led to smartphones that recognize faces, cars that detect pedestrians, and algorithms that suggest diagnoses from clinical images, among many other applications. The success of computer vision is founded on a deep understanding of the neural circuits in the brain responsible for visual processing. This book introduces the neuroscientific study of neuronal computations in visual cortex alongside the cognitive-psychological understanding of visual cognition and the burgeoning field of biologically inspired artificial intelligence. Topics include the neurophysiological investigation of visual cortex, visual illusions, visual disorders, deep convolutional neural networks, machine learning, and generative adversarial networks, among others. It is an ideal resource for students and researchers seeking to build bridges across different approaches to studying and building visual systems.
Visual perception involves the rapid formation of a coarse image representation at the onset of visual processing, which is iteratively refined by later computational processes. These early versus late time windows approximately map onto feedforward and feedback processes, respectively. State-of-the-art convolutional neural networks, the main engine behind recent machine vision successes, are feedforward architectures. Their successes and limitations provide critical information regarding which visual tasks can be solved by purely feedforward processes and which require feedback mechanisms. We provide an overview of recent work in cognitive neuroscience and machine vision that highlights the possible role of feedback processes in visual recognition and beyond. We conclude by discussing important open questions for future research.
Objects and their parts can be visually recognized from purely spatial or purely temporal information, but the mechanisms integrating space and time are poorly understood. Here we show that visual recognition of objects and actions can be achieved by efficiently combining spatial and motion cues in configurations where each source on its own is insufficient for recognition. This analysis is obtained by identifying minimal videos: these are temporally short and spatially small video clips in which objects, parts, and actions can be reliably recognized, but any reduction in either space or time makes them unrecognizable. Human recognition in minimal videos is invariably accompanied by full interpretation of the internal components of the video. State-of-the-art deep convolutional networks for dynamic recognition cannot replicate human behavior in these configurations. The gap between human and machine vision demonstrated here is due to critical mechanisms for full spatiotemporal interpretation that are lacking in current computational models.
A longstanding question in sensory neuroscience is what types of stimuli drive neurons to fire. The characterization of effective stimuli has traditionally been based on a combination of intuition, insights from previous studies, and luck. A new method termed XDream (EXtending DeepDream with real-time evolution for activation maximization) combined a generative neural network and a genetic algorithm in a closed loop to create strong stimuli for neurons in the macaque visual cortex. Here we extensively and systematically evaluate the performance of XDream. We used ConvNet units as in silico models of neurons, enabling experiments that would be prohibitive with biological neurons. We evaluated how the method compares to brute-force search, and how well the method generalizes to different neurons and processing stages. We also explored design and parameter choices. XDream can efficiently find preferred features for visual units without any prior knowledge about them. XDream extrapolates to different layers, architectures, and developmental regimes, performing better than brute-force search, and often better than exhaustive sampling of >1 million images. Furthermore, XDream is robust to choices of multiple image generators, optimization algorithms, and hyperparameters, suggesting that its performance is locally near-optimal. Lastly, we found no significant advantage to problem-specific parameter tuning. These results establish expectations and provide practical recommendations for using XDream to investigate neural coding in biological preparations. Overall, XDream is an efficient, general, and robust algorithm for uncovering neuronal tuning preferences using a vast and diverse stimulus space. XDream is implemented in Python, released under the MIT License, and works on Linux, Windows, and macOS.
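The generator-plus-genetic-algorithm closed loop described above can be sketched in miniature. Everything below is a hypothetical stand-in: a linear "generator" replaces the GAN, a synthetic linear "neuron" replaces the recorded unit, and the population sizes and mutation scale are illustrative choices, not XDream's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical): a linear "generator" mapping latent codes to
# images, and a black-box "neuron" whose response is scored per image.
G = rng.normal(size=(64, 16))        # generator weights: 16-d code -> 64-pixel image
preferred = rng.normal(size=64)      # the unit's preferred pattern (unknown to the optimizer)

def generate(code):
    return G @ code

def neuron_response(image):
    # The optimizer only sees this scalar score, mimicking a recorded firing rate.
    return float(preferred @ image)

def evolve(pop_size=50, n_gen=100, elite=10, sigma=0.3):
    """Genetic algorithm over latent codes: keep the top-scoring codes,
    refill the population with mutated copies, and repeat."""
    pop = rng.normal(size=(pop_size, 16))
    for _ in range(n_gen):
        scores = np.array([neuron_response(generate(c)) for c in pop])
        parents = pop[np.argsort(scores)[-elite:]]                 # selection
        children = parents[rng.integers(elite, size=pop_size - elite)]
        children = children + sigma * rng.normal(size=children.shape)  # mutation
        pop = np.vstack([parents, children])
    scores = np.array([neuron_response(generate(c)) for c in pop])
    return pop[np.argmax(scores)], scores.max()

best_code, best_score = evolve()
```

In this toy setting the evolved code reliably scores far above randomly sampled codes, which is the sense in which the closed loop outperforms brute-force search.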
Adaptation is a fundamental property of sensory systems that can alter subjective experience in light of recently encountered information. Adaptation has been postulated to arise from recurrent circuit mechanisms or as a consequence of neuron-intrinsic suppression. However, it is unclear whether intrinsic suppression by itself can account for effects beyond reduced responses. Here, we test the hypothesis that complex adaptation phenomena can emerge from intrinsic suppression cascading through a feedforward model of visual processing. A deep convolutional neural network with intrinsic suppression captured neural signatures of adaptation including novelty detection, enhancement, and tuning curve shifts, while producing aftereffects consistent with human perception. When the adaptation mechanism was trained on a task in which repeated inputs affect recognition performance, the intrinsic mechanism generalized better than a recurrent neural network. Our results demonstrate that feedforward propagation of intrinsic suppression changes the functional state of the network, reproducing key neurophysiological and perceptual properties of adaptation.
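A minimal sketch of one way activity-dependent intrinsic suppression can be modeled at the single-unit level; the update rule and the constants alpha and beta are illustrative assumptions, not the model's exact parameterization:

```python
import numpy as np

def run_unit(inputs, alpha=0.9, beta=0.7):
    """Single unit with intrinsic suppression (hypothetical parameterization):
    a suppression state s accumulates with each response and decays
    exponentially, scaling down subsequent responses."""
    s, responses = 0.0, []
    for x in inputs:
        r = max(x, 0.0) * (1.0 - s)              # response, scaled by suppression
        s = alpha * s + beta * (1 - alpha) * r   # suppression builds with activity
        responses.append(r)
    return responses

# Repeating the same input yields repetition suppression; after a pause the
# suppression state decays and the response partially recovers.
repeated = run_unit([1.0] * 5)
recovery = run_unit([1.0] * 3 + [0.0] * 7 + [1.0])
```

Even this single-unit rule, cascaded through the layers of a feedforward network, changes the network's functional state from moment to moment, which is the mechanism the abstract refers to.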
What specific features should visual neurons encode, given the infinity of real-world images and the limited number of neurons available to represent them? We investigated neuronal selectivity in monkey inferotemporal cortex via the vast hypothesis space of a generative deep neural network, avoiding assumptions about features or semantic categories. A genetic algorithm searched this space for stimuli that maximized neuronal firing. This led to the evolution of rich synthetic images of objects with complex combinations of shapes, colors, and textures, sometimes resembling animals or familiar people, other times revealing novel patterns that did not map to any clear semantic category. These results expand our conception of the dictionary of features encoded in the cortex, and the approach can potentially reveal the internal representations of any system whose input can be captured by a generative model.
Rapid and flexible learning during behavioral choices is critical to our daily endeavors and constitutes a hallmark of dynamic reasoning. An important paradigm to examine flexible behavior involves learning new arbitrary associations mapping visual inputs to motor outputs. We conjectured that visuomotor rules are instantiated by translating visual signals into actions through dynamic interactions between visual, frontal, and motor cortex. We evaluated the neural representation of such visuomotor rules by performing intracranial field potential recordings in epilepsy subjects during a rule-learning delayed match-to-behavior task. Learning new visuomotor mappings led to the emergence of specific responses associating visual signals with motor outputs in three anatomical clusters in frontal, anteroventral temporal, and posterior parietal cortex. After learning, mapping-selective signals during the delay period showed interactions with visual and motor signals. These observations provide initial steps towards elucidating the dynamic circuits underlying flexible behavior and how communication between subregions of frontal, temporal, and parietal cortex leads to rapid learning of task-relevant choices.
Interictal spikes (IIS) are bursts of neuronal depolarization observed electrographically between periods of seizure activity in epilepsy patients. IISs are difficult to characterize morphologically, and their effects on neurophysiology and cognitive function are poorly understood. Currently, IIS detection requires laborious manual assessment and marking of electroencephalography (EEG/iEEG) data. This practice is also subjective, as the clinician must select the implicit threshold that EEG activity has to exceed in order to be considered a spike. The work presented here details the development and implementation of a simple automated IIS detection algorithm. This preliminary study utilized intracranial EEG recordings collected from 7 epilepsy patients; IISs were marked by a single physician, for a total of 1339 IISs across 68 active electrodes. The proposed algorithm implements a simple threshold rule that scans through iEEG data and identifies IISs using normalization techniques that eliminate the need for a more complex detector. The efficacy of the algorithm was determined by evaluating the sensitivity and specificity of the detector across a range of thresholds, and an approximate optimal threshold was determined from these results. With an average true positive rate of over 98% and a false positive rate below 2%, the algorithm's accuracy supports its use as a reliable tool to detect IISs, with direct applications in localizing seizure onset zones, detecting seizure onset, and understanding cognitive impairment due to IISs. Furthermore, due to its speed and simplicity, this algorithm can be used for real-time detection of IISs, which will ultimately allow physicians to study their clinical implications with high temporal resolution and individual adaptation.
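The threshold rule described above might be sketched as follows. The z-score normalization, the threshold value, and the refractory merging window are illustrative assumptions for the sketch, not the study's exact pipeline:

```python
import numpy as np

def detect_spikes(signal, fs, threshold=4.0, min_separation=0.2):
    """Hypothetical sketch of a normalized threshold detector: z-score the
    trace so one threshold (in standard deviations) applies across channels,
    mark threshold crossings, and merge crossings closer than
    min_separation seconds into a single event."""
    z = (signal - signal.mean()) / signal.std()
    above = np.flatnonzero(np.abs(z) > threshold)
    events = []
    for idx in above:
        if not events or (idx - events[-1]) / fs > min_separation:
            events.append(int(idx))
    return events

# Synthetic demo: Gaussian background activity with two large transients.
rng = np.random.default_rng(1)
sig = rng.normal(size=2000)
sig[500:505] += 12.0
sig[1500:1505] += 12.0
events = detect_spikes(sig, fs=1000)
```

Because the trace is normalized per channel, the same threshold (here 4 SD) can be scanned over a range of values to trade off sensitivity against specificity, as in the evaluation described above.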
Searching for a target object in a cluttered scene constitutes a fundamental challenge in daily vision. Visual search must be selective enough to discriminate the target from distractors, invariant to changes in the appearance of the target, efficient enough to avoid exhaustive exploration of the image, and able to generalize to locate novel target objects without object-specific training (zero-shot). Previous work on visual search has focused on searching for perfect matches of a target after extensive category-specific training. Here, we show for the first time that humans can efficiently and invariantly search for natural objects in complex scenes. To gain insight into the mechanisms that guide visual search, we propose a biologically inspired computational model that can locate targets without exhaustive sampling and that can generalize to novel objects. The model provides an approximation to the mechanisms integrating bottom-up and top-down signals during search in natural scenes.
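One common way to implement top-down target guidance of this kind (a hedged sketch, not necessarily the model's exact computation) is to score each scene location by the similarity between its feature vector and the target's feature vector, then fixate the maxima of the resulting attention map:

```python
import numpy as np

def attention_map(scene_feats, target_feat):
    """Hypothetical top-down modulation: cosine similarity between the
    target's feature vector and the feature vector at each scene location.
    scene_feats has shape (H, W, C); target_feat has shape (C,)."""
    norms = np.linalg.norm(scene_feats, axis=-1) * np.linalg.norm(target_feat)
    return (scene_feats @ target_feat) / np.maximum(norms, 1e-8)

# Toy example: a 4x4 feature map with 8-d features; the target's features
# are planted at location (2, 1), which the map should single out.
rng = np.random.default_rng(0)
scene = rng.normal(size=(4, 4, 8))
target = rng.normal(size=8)
scene[2, 1] = target
amap = attention_map(scene, target)
loc = np.unravel_index(np.argmax(amap), amap.shape)
```

Because the similarity is computed in feature space rather than pixel space, the same mechanism generalizes to a novel target: no location needs to match the target image exactly, only its features.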
The extent to which the details of past experiences are retained or forgotten remains controversial. Some studies suggest massive storage, while others describe memories as fallible summary recreations of original events. The discrepancy can be ascribed to the content of memories and how memories are evaluated. Many studies have focused on recalling lists of words/pictures, which lack the critical ingredients of real-world memories. Here we quantified the ability to remember details about one hour of real life. We recorded video and eye movements while subjects walked along specified routes and evaluated whether they could distinguish video clips from their own experience from foils. Subjects were minimally above chance in remembering the minutiae of their experiences. Recognition of specific events could be partly explained by a machine-learning model of video contents. These results quantify recognition memory for events in real life and show that the details of everyday experience are largely not retained in memory.
Making inferences from partial information constitutes a critical aspect of cognition. During visual perception, pattern completion enables recognition of poorly visible or occluded objects. We combined psychophysics, physiology, and computational models to test the hypothesis that pattern completion is implemented by recurrent computations, and present three pieces of evidence consistent with this hypothesis. First, subjects robustly recognized objects even when they were rendered <15% visible, but recognition was largely impaired when processing was interrupted by backward masking. Second, invasive physiological responses along the human ventral cortex exhibited visually selective responses to partially visible objects that were delayed compared with whole objects, suggesting the need for additional computations. These physiological delays were correlated with the effects of backward masking. Third, state-of-the-art feedforward computational architectures were not robust to partial visibility. However, recognition performance was recovered when the model was augmented with attractor-based recurrent connectivity. The recurrent model was able to predict which images of heavily occluded objects were easier or harder for humans to recognize, could capture the effect of introducing a backward mask on recognition behavior, and was consistent with the physiological delays along the human ventral visual stream. These results provide a strong plausibility argument for the role of recurrent computations in making visual inferences from partial information.
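As a toy illustration of attractor-based recurrent completion, a classic Hopfield-style network (used here only as an analogy to the augmented model, not as its actual architecture) stores patterns in symmetric weights; iterating the recurrent dynamics pulls a heavily occluded pattern back to the stored attractor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Store P binary patterns of N units in Hebbian weights (zero diagonal).
N, P = 200, 5
patterns = rng.choice([-1, 1], size=(P, N))
W = (patterns.T @ patterns) / N
np.fill_diagonal(W, 0.0)

def recall(x, n_steps=10):
    """Recurrent completion: repeatedly update all units from the weighted
    input until the state settles into a stored attractor."""
    for _ in range(n_steps):
        x = np.sign(W @ x)
        x[x == 0] = 1
    return x

# "Occlude" 60% of the units by clamping them to a constant value.
occluded = patterns[0].astype(float).copy()
occluded[:120] = 1.0
completed = recall(occluded)
overlap = float((completed * patterns[0]).mean())   # 1.0 means perfect recall
```

The point of the analogy is the dynamics: completion requires several recurrent iterations, consistent with the delayed physiological responses to occluded objects described above.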
Pattern recognition involves building a mental model to interpret incoming inputs. This incoming information is often incomplete, and the mental model must extrapolate to complete the patterns, a process that is constrained by the statistical regularities in nature. Examples of pattern completion include the identification of objects presented under unfavorable luminance and the interpretation of speech corrupted by acoustic noise.
There has been extensive discussion in the literature about the extent to which cortical representations can be described as localist or distributed. Here, we discuss a simple null model that encompasses a family of related architectures describing the transformation of signals throughout the parts of the visual system involved in object recognition. This family of models constitutes a rigorous first approximation to explain the neurophysiological properties of ventral visual cortex. This null model contains both distributed and local representations throughout the entire hierarchy of computations and the responses of individual units are meaningful and interpretable when encoding is adequately defined for each computational stage.
Neurons in the cerebral cortex respond inconsistently to a repeated sensory stimulus, yet they underlie our stable sensory experiences. Although the nature of this variability is unknown, its ubiquity has encouraged the general view that each cell produces random spike patterns that noisily represent its response rate. In contrast, here we show that reversibly inactivating distant sources of either bottom-up or top-down input to cortical visual areas in the alert primate reduces both the spike train irregularity and the trial-to-trial variability of single neurons. A simple model in which a fraction of the pre-synaptic input is silenced can reproduce this reduction in variability, provided that there exist temporal correlations primarily within, but not between, excitatory and inhibitory input pools. A large component of the variability of cortical neurons may therefore arise from synchronous input produced by signals arriving from multiple sources.
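A toy calculation illustrates why silencing part of a correlated input pool strongly reduces response variability: with pairwise correlation rho among n sources, the variance of the summed input contains a rho·n² term, so input fluctuations shrink roughly in proportion to pool size, whereas for independent sources they shrink only as the square root. All numbers here are illustrative, not fits to the data:

```python
import numpy as np

def net_input_std(n_active, rho, n_trials=20000, rng=None):
    """Std across trials of the summed input from a pool of n_active sources
    sharing pairwise correlation rho. Toy construction: each source mixes a
    shared fluctuation (weight sqrt(rho)) with private noise (sqrt(1-rho))."""
    rng = rng or np.random.default_rng(0)
    shared = rng.normal(size=n_trials)
    private = rng.normal(size=(n_trials, n_active))
    sources = np.sqrt(rho) * shared[:, None] + np.sqrt(1 - rho) * private
    return sources.sum(axis=1).std()

# Silencing half of a correlated pool roughly halves the input fluctuations
# (variance ~ rho * n^2); for an independent pool the std only drops by ~sqrt(2).
full_corr, half_corr = net_input_std(200, 0.2), net_input_std(100, 0.2)
full_ind, half_ind = net_input_std(200, 0.0), net_input_std(100, 0.0)
```

This is the sense in which within-pool temporal correlations make distant input a large component of cortical variability: removing even a fraction of a synchronized pool removes a disproportionate share of the fluctuations.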
Episodic memories are long lasting and full of detail, yet imperfect and malleable. We quantitatively evaluated recollection of short audiovisual segments from movies as a proxy to real-life memory formation in 161 subjects, from 15 minutes up to a year after encoding. Memories were reproducible within and across individuals, showed the typical decay with time elapsed between encoding and testing, were fallible yet accurate, and were insensitive to low-level stimulus manipulations but sensitive to high-level stimulus properties. Remarkably, memorability was also high for single movie frames, even one year post-encoding. To evaluate what determines the efficacy of long-term memory formation, we developed an extensive set of content annotations that included actions, emotional valence, visual cues, and auditory cues. These annotations enabled us to document the content properties that showed a stronger correlation with recognition memory and to build a machine-learning computational model that accounted for episodic memory formation in single events for group averages and individual subjects with an accuracy of up to 80%. These results provide initial steps towards the development of a quantitative computational theory capable of explaining the subjective filtering steps that lead to how humans learn and consolidate memories.
The ability to integrate 'omics' (i.e. transcriptomics and proteomics) is becoming increasingly important to the understanding of regulatory mechanisms. There are currently no tools available to identify differentially expressed genes (DEGs) across different 'omics' data types or multi-dimensional data including time courses. We present fCI (f-divergence Cut-out Index), a model capable of simultaneously identifying DEGs from continuous and discrete transcriptomic, proteomic and integrated proteogenomic data. We show that fCI can be used across multiple diverse sets of data and can unambiguously find genes that show functional modulation, developmental changes or misregulation. Applying fCI to several proteogenomics datasets, we identified a number of important genes that showed distinctive regulation patterns. The package fCI is available at R Bioconductor and http://software.steenlab.org/fCI/.