Li et al. present a purely engineering-based approach to active speaker localization. Their system uses Viola and Jones' object detection algorithm for face detection and cross-correlation for auditory speaker localization.

Humans can orient towards emotional human faces faster than towards neutral human faces.

The time it takes to elicit a visual cortical response plus the time to elicit a saccade from cortex (FEF) is longer than the time it takes for humans to orient towards faces.

Nakano et al. take this as further evidence for a sub-cortical (retinotectal) route of face detection.

Patients with lesions in V1 (striate cortex) were found to still be able to discriminate the gender and expression of faces.

Neurons in the monkey pulvinar react extremely fast to visually perceived faces (50ms).

The superior colliculus does not receive any signals from short-wavelength cones (S-cones) in the retina.

Nakano et al. presented an image of either a butterfly or a neutral or emotional face to their participants. The stimuli were either grayscale or color-scale images, where color-scale images were isoluminant and only varied in their yellow-green color values. Since information from S-cones does not reach the superior colliculus, these faces were presumably only processed in visual cortex.

Nakano et al. found that their participants reacted to gray-scale emotional faces faster than to gray-scale neutral faces and to gray-scale faces faster than to gray-scale butterflies. Their participants reacted somewhat faster to color-scale faces than to color-scale butterflies, but this effect was much smaller than for gray-scale images. Also, the difference in reaction time to color-scale emotional faces was not significantly different from that to color-scale neutral faces.

Nakano et al. take this as further evidence of sub-cortical face detection and in particular of emotional sub-cortical face detection.

Humans can orient towards human faces faster than towards other visual stimuli (within 100ms).

In many audio-visual localization tasks, humans integrate information optimally.

In some audio-visual time discrimination tasks, humans do not integrate information optimally.

Activity in the auditory cortex is modulated by visual stimuli.

Visual receptive fields in the superficial hamster SC do not vary substantially in RF size with RF eccentricity.

Visual receptive field sizes change with eccentricity in the deep SC; they do not in the superficial hamster SC.

Masking visual face stimuli (i.e. presenting faces too briefly to be detected consciously, then presenting a masking stimulus) can evoke measurable changes in skin conductance.

Masking visual face stimuli can evoke responses in the SC, pulvinar, and amygdala.

The right amygdala responded differently to masked than to unmasked stimuli, while the left did not in Morris et al.'s experiments.

SOM-based algorithms have been used to model several features of natural visual processing.

Miikkulainen et al. use their SOM-based algorithms to model the visual cortex.

Miikkulainen et al. use a hierarchical version of their SOM-based algorithm to model the natural development of visual capabilities.

Retinal waves of spontaneous activity occur before photoreceptors develop.

They are thought to be involved in setting up the spatial organization of the visual pathway.

LGN and V1 have distinct territories for each eye (eye-specific laminae in LGN, ocular dominance columns in V1).

These eye-specific territories in LGN and V1 arise only after the initial projections from the retina are made but, in higher mammals, before birth.

Most neurons in the visual cortex (outside of layer 4) are binocular.

Usually, input from one eye is dominant, however.

The distribution of monocular dominance in visual cortex neurons is drastically affected by monocular stimulus deprivation during early development.

Competition appears to be a major factor in organizing the visual system.

Visual capture is weaker for stimuli in the periphery, where visual localization is less reliable relative to auditory localization, than at the center of the visual field.

Many visual person detection methods rely on a single feature to detect people: they compute a histogram of that feature's strength across the image and then derive a likelihood for a pixel or region by assuming a Gaussian distribution over the distances of pixels or histograms belonging to a face. This assumed distribution has been validated in practice (for certain cases).
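
A minimal sketch of this kind of likelihood computation, assuming a generic scalar feature map (e.g. a skin-color probability image); function and parameter names are illustrative and not taken from any specific method cited here.

```python
import numpy as np

def face_likelihood(feature_map, face_value, face_std):
    """Per-pixel likelihood of belonging to a face, assuming that the distances
    of face pixels' feature values from a reference value are Gaussian-distributed
    (an illustrative stand-in for the histogram-based methods described above)."""
    distance = np.abs(feature_map - face_value)
    return np.exp(-0.5 * (distance / face_std) ** 2) / (face_std * np.sqrt(2 * np.pi))

# Usage: feature_map could be a skin-color probability image with values in [0, 1].
feature_map = np.random.rand(240, 320)
likelihood = face_likelihood(feature_map, face_value=0.8, face_std=0.1)
```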

Person tracking can combine cues from single modalities (like motion and color cues), or from different modalities (like auditory and visual cues).

Since we analyze complex visual scenes in chunks by saccading from one location to another, information about saccades must be used to break the constant stream of data coming from the eyes into chunks belonging to different locations in the visual field.

By contrasting performance in a condition in which their test subjects actually made saccades with one in which only the image in front of their eyes was exchanged, Paradiso et al. showed that explicit information about saccades --- not just the change of visual input itself --- is responsible for resetting visual processing.

While the signal indicating a saccade could be proprioceptive, the timing in Paradiso et al.'s experiments hints at corollary discharge.

Palmer and Ramsey show that lack of awareness of a visual lip stream does not inhibit learning of its relevance for a visual localization task: the subliminal lip stream influences visual attention and affects the subjects' performance.

They also showed that similar subliminal lip streams did not affect the occurrence of the McGurk effect.

Together, this suggests that awareness of a visual stimulus is not always needed to use it for guiding visual attention, but sometimes it is needed for multisensory integration to occur (following Palmer and Ramsey's definition).

Frassinetti et al. showed that humans detect near-threshold visual stimuli with greater reliability if these stimuli are connected with spatially congruent auditory stimuli (and vice versa).

Response properties in mouse superficial SC neurons are not strongly influenced by experience.

How strongly the development of SC neurons depends on experience (and how well developed they are at birth) differs from species to species; just because the superficial mouse SC is developed at birth does not mean it is in other species (and I believe responsiveness in cats develops with experience).

Response properties of superficial mouse SC neurons are different from those found in mouse V1 neurons.

Response properties of superficial SC neurons are different in different animals.

Search targets which share few features with mutually similar distractors surrounding them are said to `pop out': it seems to require hardly any effort to identify them and search for them is very fast.

Search targets that share most features with their surroundings, on the other hand, require much more time to be identified.

Gottlieb et al. found that the most salient and the most task-relevant visual stimuli evoke the greatest response in LIP.

Laurenti et al. found in an audio-visual color identification task that redundant, congruent, semantic auditory information (the utterance of a color word) can decrease latency in response to a stimulus (the color of a circle displayed to the subject). Incongruent semantic visual or auditory information (a written or uttered color word) can increase response latency. However, congruent semantic visual information (a written color word) does not decrease response latency.

The enhancements in response latencies in Laurenti et al.'s audio-visual color discrimination experiments were greater (response latencies were shorter) than predicted by the race model.
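
For reference, the bound usually tested in such redundant-target experiments is Miller's race model inequality,

\[
P(\mathrm{RT} \le t \mid AV) \;\le\; P(\mathrm{RT} \le t \mid A) + P(\mathrm{RT} \le t \mid V) \quad \text{for all } t;
\]

response-time distributions that exceed this bound cannot be explained by a race between independent unisensory processes.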

Rucci et al. present a robotic system based on their neural model of audiovisual localization.

There are a number of approaches to audio-visual localization; some are implemented on actual robots, others exist only as theoretical ANN or algorithmic models.

Rucci et al. present an algorithm which performs auditory localization and combines auditory and visual localization in a common SC map. The mapping between the representations is learned using value-dependent learning.

O'Regan and Noë speak of the geometric laws that govern the relationship between moving the eyes and body and the change of an image in the retina.

The geometry of the changes—straight lines becoming curves on the retina when an object moves in front of the eyes—are not accessible to the visual system, initially, because nothing tells the brain about the spatial relations between photoreceptors in the retina.

O'Regan and Noë claim that the structure of the laws governing visual sensory-motor contingencies is different from the structure of other sensory-motor contingencies and that this difference gives rise to different phenomenology.

Saccades evoked by electric stimulation of the deep SC can be deviated towards the target of visual spatial attention. This is the case even if the task forbids a saccade towards the target of visual spatial attention.

Activation build-up in build-up neurons is modulated by spatial attention.

There has been extensive research into the phenomenon that is visually guided flight in flies.

Polarization of light is used by flies for long-range orientation with respect to the sun.

The topographic map of visual space in the sSC is retinotopic.

The motor map in the dSC is retinotopic.

Both visual and auditory neurons in the deep SC usually prefer moving stimuli and are direction selective.

The range of directions deep SC neurons are selective for is usually wide.

Rucci et al. model learning of audio-visual map alignment in the barn owl SC. In their model, projections from the retina to the SC are fixed (and visual RFs are therefore static) and connections from ICx are adapted through value-dependent learning.

Neurons at later stages in the hierarchy of visual processing extract very complex features (like faces).

Spatial attention raises baseline activity in neurons whose RF are where the attention is even without a visual stimulus (in visual cortex).

Unilateral lesions in brain areas associated with attention can lead to visuospatial neglect: the failure to consider anything within a certain region of the visual field. In extreme cases this can mean that patients, e.g., only read from one side of a book.

Kastner and Ungerleider propose that the top-down signals which lead to the effects of visual attention originate from brain regions outside the visual cortex.

Regions whose lesions can induce visuospatial neglect include:

  • the parietal lobe, in particular the inferior part,
  • temporo-parietal junction,
  • the anterior cingulate cortex,
  • basal ganglia,
  • thalamus,
  • the pulvinar nucleus.

Spatial attention can enhance the activity of SC neurons whose receptive fields overlap the attended region.

There are visuo-somatosensory neurons in the putamen.

Graziano and Gross found visuo-somatosensory neurons in those regions of the putamen which code for arms and the face in somatosensory space.

Visuo-somatosensory neurons in the putamen with somatosensory RFs in the face are very selective: They seem to respond to visual stimuli consistent with an upcoming somatosensory stimulus (close-by objects approaching to the somatosensory RFs of the neurons).

Graziano and Gross report on visuo-somatosensory cells in the putamen in which remapping seems to be happening: Those cells responded to visual stimuli only when the animal could see the arm in which the somatosensory RF of those cells was located.

There are reports of highly selective, purely visual cells in the putamen. One report is of a cell which responded best to a human face.

Responses of visuo-tactile neurons in Brodmann area 7b, the ventral intraparietal area, and inferior premotor area 6 are similar to those found in the putamen.

Cells in MST respond to and are selective to optic flow.

Some cells in MST are multisensory.

Visuo-vestibular cells in MST perform multisensory integration in the sense that their response to multisensory stimuli is different from their response to either of the uni-sensory cues.

Visuo-vestibular cells tend to be selective for visual and vestibular self-motion cues which indicate motion in the same direction.

The responses of some visuo-vestibular cells were enhanced and those of others depressed by combined visuo-vestibular cues.

Visual information seems to override vestibular information in estimating heading direction.

FAES is not exclusively auditory.

AEV is partially, but not consistently, retinotopic.

Receptive fields in AEV tend to be smaller for cells with RF centers at the center of the visual field than for those with RF centers in the periphery.

AEV is not exclusively (but mostly) visual.

RFs in AEV are relatively large.

Casey et al. use their ANN in a robotic system for audio-visual localization.

Casey et al. focus on making their system work in real time and with complex stimuli and compromise on biological realism.

(Retinal) visual input to the left SC mainly originates in the retina of the right eye and vice-versa.

The visual and auditory maps in the deep SC are in spatial register.

In the deep SC of the owl, auditory receptive fields tend to be larger than, and to contain, the corresponding visual receptive fields.

The superficial SC of the owl is strongly audio-visual.

The receptive fields of certain neurons in the cat's deep SC shift when the eye position is changed. Thus, the map of auditory space in the deep SC is temporarily realigned to stay in register with the retinotopic map.

In an fMRI experiment, Schneider found that spatial attention and switching between modes of attention (attending to moving or to colored stimuli) strongly affected SC activation, but results for feature-based attention were inconclusive.

The fact that Schneider did not find conclusive evidence for modulation of neural responses by feature-based attention might be related to the fact that the superficial SC does not seem to receive color-based information and deep SC seems to receive color-based information only via visual cortex.

An auditory and a visual stimulus, separated in time, may be perceived as one audio-visual stimulus, seemingly occurring at the same point in time.

If an auditory and a visual stimulus are close together, spatially, then they are more likely perceived as one cross-modal stimulus than if they are far apart—even if they are separated temporally.

In a sensorimotor synchronization task, Aschersleben and Bertelson found that an auditory distractor biased the temporal perception of a visual target stimulus more strongly than the other way around.

Sanchez-Riera et al. use a probabilistic model for audio-visual active speaker localization on a humanoid robot (the Nao robot).

Sanchez-Riera et al. use the Bayesian information criterion to choose the number of speakers in their audio-visual active speaker localization system.
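
For reference, the Bayesian information criterion in its standard form is

\[
\mathrm{BIC} = k \ln n - 2 \ln \hat{L},
\]

where $k$ is the number of model parameters, $n$ the number of observations, and $\hat{L}$ the maximized likelihood; the candidate number of speakers with the lowest BIC is selected. (The formula is standard; its use for speaker counting is as described in the note above.)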

Sanchez-Riera et al. use the Waldboost face detection system for visual processing.

Yan et al. present a system which uses auditory and visual information to learn an audio-motor map (in a functional sense) and orient a robot towards a speaker. Learning is online.

Yan et al. use the standard Viola-Jones face detection algorithm for visual processing.

Viola and Jones presented a fast and robust object detection system based on the following (a minimal integral-image sketch follows the list):

  1. a computationally fast way to extract features from images,
  2. the AdaBoost machine learning algorithm,
  3. cascades of weak classifiers with increasing complexities.
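
A minimal sketch of the first ingredient, the integral image, which makes sums over rectangular (Haar-like) feature regions computable with four array lookups. This is a generic illustration, not Viola and Jones' implementation.

```python
import numpy as np

def integral_image(img):
    """Entry (y, x) holds the sum of all pixels above and to the left of (y, x), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in the rectangle [top, bottom] x [left, right] via four lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

# Usage: the difference of two rect_sum calls yields a Haar-like feature response.
img = np.random.randint(0, 256, (24, 24))
ii = integral_image(img)
feature = rect_sum(ii, 0, 0, 11, 23) - rect_sum(ii, 12, 0, 23, 23)
```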

Sanchez-Riera et al. do not report on localization accuracy, but on correct speaker detections.

Li et al. report that, in their experiment, audio-visual active speaker localization is as good as visual active-speaker localization ($\sim 1^\circ$) as long as speakers are within the visual field.

Outside of the visual field, localization accuracy varies between $1^\circ$ and $10^\circ$. The authors do not provide a detailed quantitative evaluation of localization accuracy.

Yan et al. explicitly do not integrate auditory and visual localization. Given multiple visual and an auditory localization, they associate the auditory localization with that visual localization which is closest, using the visual localization as the localization of the audio-visual object.

In determining the position of the audio-visual object, Yan et al. handle the possibility that the actual source of the stimulus has only been heard, not seen. They decide whether that is the case by estimating the probability that the auditory localization belongs to any of the detected visual targets and comparing to the baseline probability that the auditory target has not been detected, visually.
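
A rough sketch of this kind of association step. The Gaussian association score, the threshold p_unseen, and all names are illustrative assumptions, not Yan et al.'s exact formulation.

```python
import numpy as np

def associate_auditory(aud_angle, vis_angles, sigma_aud, p_unseen):
    """Associate an auditory localization with the closest visual detection,
    unless it is more likely that the sound source was not detected visually."""
    if len(vis_angles) == 0:
        return aud_angle                      # nothing seen: fall back to audition
    vis_angles = np.asarray(vis_angles, dtype=float)
    # unnormalized association scores (Gaussian in angular distance)
    scores = np.exp(-0.5 * ((vis_angles - aud_angle) / sigma_aud) ** 2)
    best = int(np.argmax(scores))
    if scores[best] > p_unseen:               # compare against the "heard but not seen" baseline
        return vis_angles[best]               # use the (more precise) visual localization
    return aud_angle                          # treat the source as auditory-only

print(associate_auditory(12.0, [10.0, -35.0], sigma_aud=5.0, p_unseen=0.2))
```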

Yan et al. do not evaluate the accuracy of audio-visual localization.

Voges et al. present an engineering approach to audio-visual active speaker localization.

Voges et al. use a difference image to detect and localize moving objects (humans).

Voges et al. use the strength of the visual detection signal (the peak value of the column-wise sum of the difference image) as a proxy for the confidence of visual detection.

They use visual localization whenever this signal strength is above a certain threshold, and auditory localization if it is below that threshold.
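
A minimal sketch of this decision rule, assuming a single camera with a known horizontal field of view; parameter names and the pixel-to-azimuth mapping are illustrative, not Voges et al.'s implementation.

```python
import numpy as np

def localize(frame, prev_frame, aud_azimuth, threshold, fov_deg=60.0):
    """Use the peak of the column-wise sum of a difference image as the visual
    detection signal; fall back to auditory localization when it is too weak."""
    diff = np.abs(frame.astype(float) - prev_frame.astype(float))
    column_activity = diff.sum(axis=0)        # column-wise sum of the difference image
    peak_col = int(np.argmax(column_activity))
    confidence = column_activity[peak_col]    # peak value as detection confidence
    if confidence >= threshold:
        width = frame.shape[1]
        return (peak_col / (width - 1) - 0.5) * fov_deg   # map peak column to azimuth
    return aud_azimuth                        # visual detection too weak: use audition
```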

In Voges et al.'s system, auditory localization serves as a backup in case visual localization fails, and for disambiguation in case more than one visual target is detected.

Voges et al. do not evaluate the accuracy of audio-visual localization.

Aarabi present a system for audio-visual localization in azimuth and depth which they demonstrate in an active-speaker localization task.

Aarabi choose (adaptive) difference images for visual localization to avoid relying on domain knowledge.

Aarabi use ITD (computed using cross-correlation) and ILD in an array of 3 microphones for auditory localization.
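
A minimal sketch of ITD estimation by cross-correlation for a single microphone pair; this is a generic illustration (Aarabi's system uses three microphones and additionally ILD).

```python
import numpy as np

def itd_seconds(left, right, fs):
    """Estimate the interaural time difference as the lag that maximizes the
    cross-correlation of the two microphone signals."""
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)  # lag in samples; positive: left lags right
    return lag / fs
```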

Kushal et al. present an engineering approach to audio-visual active speaker localization.

Kushal et al. do not evaluate the accuracy of audio-visual localization quantitatively. They do show a graph for visual-only, audio-visual, and audio-visual and temporal localization during one test run. That graph seems to indicate that multisensory and temporal integration prevent misdetections—they do not seem to improve localization much.

Kushal et al. use an EM algorithm to integrate audio-visual information for active speaker localization statically and over time.

Studies on audio-visual active speaker localization usually do not report on in-depth evaluations of audio-visual localization accuracy. The reason is, probably, that auditory information is only used as a backup for cases when visual localization fails or for disambiguation in case visual information is not sufficient to tell which of the visual targets is the active speaker.

When visual detection succeeds, it is usually precise enough.

Therefore, active speaker localization is probably a misnomer. It should be called active speaker identification.

Humans adapt to an auditory scene's reverberation and noise conditions. They use visual scene recognition to recall reverberation and noise conditions of familiar environments.

A system that stores multiple trained speech recognition models for different environments and retrieves them guided by visual scene recognition has improved speech recognition in reverberated and noisy environments.

Different types of retinal ganglion cells project to different lamina in the zebrafish optic tectum.

The lamina a retinal ganglion cell projects to in the zebrafish optic tectum does not change in the fish's early development. This is in contrast with other animals.

However, the position within the lamina does change.

SC receives input from and represents all sensory modalities used in phasic orienting: vision, audition, somesthesis (haptic), nociception, infrared, electroception, the magnetic sense, and echolocation.

Weir and Suver review experiments on the visual system of flies, specifically on the dendritic and network properties of the VS and HS systems, which respond to apparent motion in the vertical and horizontal planes, respectively.

Stroop presented color words which were either presented in the color they meant (congruent) or in a different (incongruent) color. He asked participants to name the color in which the words were written and observed that participants were faster in naming the color when it was congruent than when it was incongruent with the meaning of the word.

The Stroop test has been used to argue that reading is an automatic task for proficient readers.

Fixating some point in space enhances spoken-language understanding if the words come from that point in space. The effect is strongest when fixating a visual stream showing lips consistent with the utterances, but it also works if the visual display is random. The effect is further enhanced if fixation is combined with a visual task that is sufficiently complex.

Fixating at some point in space can impede language understanding if the utterances do not emanate from the focus of visual attention and there are auditory distractors which do.

Goldberg and Wurtz found that neurons in the superficial SC respond more vigorously to visual stimuli in their receptive field if the current task is to make a saccade to the stimuli.

Responses of superficial SC neurons do not depend solely on intrinsic stimulus properties.

Some task-dependency in representations may arise from embodied learning where actions bias experiences being learned from.

Conversely, the narrow range of disparities reflected in disparity-selective cells in visual cortex neurons might be due to goal-directed feature learning.

The SC is multisensory: it reacts to visual, auditory, and somatosensory stimuli. It does not only initiate gaze shifts, but also other motor behaviour.

The deeper levels of SC are the targets of projections from cortex, auditory, somatosensory and motor systems in the brain.

Moving the eyes shifts the auditory and somatosensory maps in the SC.

(Some) SC neurons in the newborn cat are sensitive to tactile stimuli at birth, to auditory stimuli a few days postnatally, and to visual stimuli last.

Visual responsiveness develops in the cat first from top to bottom in the superficial layers, then, after a long pause, from top to bottom in the lower layers.

The basic topography of retinotectal projections is set up by chemical markers. This topography is coarse and is refined through activity-dependent development.

There's a retinotopic, polysynaptic pathway from the SC through LGN.

Kao et al. did not find visually responsive neurons in the deep layers of the cat SC within the first three postnatal weeks.

Overt visual function can be observed in developing kittens at the same time or before visually responsive neurons can first be found in the deep SC.

Santangelo and Macaluso describe typical experiments for studying visual attention.

Frontal eye fields (FEF) and intraparietal sulcus (IPS) have been associated with voluntary orienting of visual attention.

Santangelo and Macaluso provide a review of the recent literature on visual and auditory attention.

Frontoparietal regions play a key role in spatial orienting in unisensory studies of visual and auditory attention.

There also seems to be modality-specific attention, which globally de-activates attention in one modality and activates it in another.

As a computer scientist I would call de-activating one modality completely a special case of selective attention in that modality.

Localized auditory cues can exogenously orient visual attention.

Santangelo and Macaluso state that multisensory integration and attention are probably separate processes.

Maybe attention controls whether or not multi-sensory integration (MSI) happens at all (at least in SC)? That would be in line with findings that without input from AES and rLS, there's no MSI.

Are AES and rLS cat homologues to the regions cited by Santangelo and Macaluso as regions responsible for auditory and visual attention?

Task-irrelevant visual cues do not affect visual orienting (visual spatial attention). Task-irrelevant auditory cues, however, seem to do so.

Santangelo and Macaluso suggest that whether or not the effects of endogenous attention dominate the ones of bottom-up processing (automatic processing) depends on semantic association, be it linguistic or learned association (like dogs and barking, cows and mooing).

Santangelo and Macaluso state that "the same frontoparietal attention control systems are ... activated in spatial orienting tasks for both the visual and auditory modality..."

De Kamps and van der Velde argue for combinatorial productivity and systematicity as fundamental concepts for cognitive representations. They introduce a neural blackboard architecture which implements these principles for visual processing and in particular for object-based attention.

De Kamps and van der Velde use their blackboard architecture for two very different tasks: representing sentence structure and object attention.

Deco and Rolls introduce a system that uses a trace learning rule to learn recognition of more and more complex visual features in successive layers of a neural architecture. In each layer, the specificity of the features increases together with the receptive fields of neurons until the receptive fields span most of the visual range and the features actually code for objects. This model thus is a model of the development of object-based attention.
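
One common form of the trace learning rule used in such hierarchies updates weights with a temporal trace of postsynaptic activity (the exact variant in Deco and Rolls' model may differ):

\[
\bar{y}^{\,t} = (1-\eta)\, y^{t} + \eta\, \bar{y}^{\,t-1}, \qquad \Delta w_{j} = \alpha\, \bar{y}^{\,t} x_{j}^{t},
\]

where $y^{t}$ is the postsynaptic activity, $\bar{y}^{\,t}$ its trace, $x_{j}^{t}$ the $j$-th presynaptic input, $\eta$ the trace parameter, and $\alpha$ the learning rate. The trace ties together inputs occurring close in time, which is what allows invariant, increasingly complex features to build up across layers.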

Jack and Thurlow found that the degree to which a puppet resembled an actual speaker (whether it had eyes and a nose, whether it had a lower jaw moving with the speech etc.) and whether the lips of an actual speaker moved in synch with heard speech influenced the strength of the ventriloquism effect.

In one of their experiments, Warren et al. had their subjects localize visual or auditory components of visual-auditory stimuli (videos of people speaking and the corresponding sound). Stimuli were made `compelling' by playing video and audio in sync and `uncompelling' by introducing a temporal offset.

They found that their subjects performed as under a `unity assumption' when told they would perceive cross-sensory stimuli and when the stimuli were `compelling', and under a low `unity assumption' when they were told there could be separate auditory or visual stimuli and/or the stimuli were made `uncompelling'.

Bertelson et al. did not find a shift of sound source localization due to manipulated endogenous visual spatial attention—localization was shifted only due to (the salience of) light flashes which would induce (automatic, mandatory) exogenous attention.

The deeper levels of SC receive virtually no primary visual input (in cats and ferrets).

Visual receptive fields in the superficial monkey SC do vary substantially in RF size with RF eccentricity.

In some animals, receptive field sizes change substantially with RF eccentricity; in others, they do not.

Ocular dominance stripes are stripes in visual brain regions in which retinal projections of one eye or the other terminate alternatingly.

Bell et al. found that playing a sound before a visual target stimulus did not increase activity in the neurons they monitored for long enough to lead to (neuron-level) multisensory integration.

Hearing someone say 'ba' while seeing them say 'ga' can make one perceive them as saying 'da'. This is called the `McGurk effect'.

Ernst and Banks show that humans combine visual and haptic information optimally in a height estimation task.
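
The maximum-likelihood combination rule against which such performance is usually compared weights each cue by its reliability:

\[
\hat{s} = w_{V}\hat{s}_{V} + w_{H}\hat{s}_{H}, \qquad w_{V} = \frac{1/\sigma_{V}^{2}}{1/\sigma_{V}^{2} + 1/\sigma_{H}^{2}}, \qquad w_{H} = 1 - w_{V},
\]

so that the combined estimate has variance $\sigma^{2} = \sigma_{V}^{2}\sigma_{H}^{2}/(\sigma_{V}^{2} + \sigma_{H}^{2})$, which is lower than either unimodal variance.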

Before a saccade is made, the region that will be the target of that saccade is perceived with higher visual contrast.

Different parts of the visual field feed into the cortical and subcortical visual pathways more or less strongly in humans.

The nasal part of the visual field feeds more into the cortical pathway while the peripheral part feeds more into the sub-cortical pathway.

In one experiment, newborns reacted to faces only if they were (exclusively) visible in their peripheral visual field, supporting the theory that the sub-cortical pathway of visual processing plays a major role in orienting towards faces in newborns.

It makes sense that sub-cortical visual processing uses peripheral information more than cortical processing:

  • sub-cortical processing is concerned with latent monitoring of the environment for potential dangers (or conspecifics)
  • sub-cortical processing is concerned with watching the environment and guiding attention in cortical processing.

Auditory signals gain relevance during saccades as visual perception is unreliable during saccades.

Auditory signals would therefore be good candidates for feedback if saccade control is closed-loop.

If visual cues were absolutely necessary for the formation of an auditory space map, then no auditory space map should develop without visual cues. Since an auditory space map develops also in blind(ed) animals, visual cues cannot be strictly necessary.

Many localized perceptual events are either only visual or only auditory. It is therefore not plausible that only audio-visual percepts contribute to the formation of an auditory space map.

Visual information plays a role, but does not seem to be necessary for the formation of an auditory space map.

The auditory space maps developed by animals without patterned visual experience seem to be degraded only in some species (in guinea pigs and barn owls, but not in ferrets or cats).

Visual input does seem to be necessary to ensure spatial audio-visual map-register.

Visual localization has much greater precision and reliability than auditory localization. This seems to be one reason for vision guiding hearing (in this particular context) and not the other way around.

It is unclear and disputed whether visual dominance in adaptation is hard-wired or a result of the quality of respective stimuli.

Most of the multi-sensory neurons in the (cat) SC are audio-visual followed by visual-somatosensory, but all other combinations can be found.

One reason for specifically studying multi-sensory integration in the (cat) SC is that there is a well-understood connection between input stimuli and overt behavior.

Feldman gives a functional explanation of the stable world illusion, but he does not seem to explain "Subjective Unity of Perception".

Feldman states that enough is known about what he calls "Visual Feature Binding", so as not to call it a problem anymore.

Feldman explains Visual Feature Binding by the fact that all the features detected in the fovea usually belong together (because it is so small), and through attention. He cites Chikkerur et al.'s Bayesian model of the role of spatial and object attention in visual feature binding.

Feldman states that "Neural realization of variable binding is completely unsolved".

55% of the neocortex is visual.

Neurons at low stages in the hierarchy of visual processing extract simple, localized features.

Color opponency and center-surround opponency arise first in LGN.

The visual system (of primates) contains a number of channels for different types of visual information:

  • color
  • shape
  • motion
  • texture
  • 3D

Separating visual processing into channels by the kind of feature it is based on is beneficial for efficient coding: feature combinations can be coded combinatorially.

There are very successful solutions to isolated problems in computer vision (CV). These solutions are flat, however, in the sense that they are implemented as a single process from feature extraction to information interpretation. A CV system based on such solutions can suffer from redundant computation and coding. Modeling a CV

Nearly all projections from the retinae go through LGN.

All visual areas from V1 to V2 and MT are retinotopic.

The ventral pathway of visual processing is weakly retinotopically organized.

The complexity of the features (or combinations of features) that neurons in the ventral pathway react to increases up to the object level. Most neurons, however, react to feature combinations that are below the object level.

The dorsal pathway of visual processing consists of area MST (a motion area) and visual areas in the posterior parietal cortex (PPC).

The complexity of motion patterns neurons in the dorsal pathway are responsive to increases along the pathway. This is similar to neurons in the ventral pathway which are responsive to progressively more complex feature combinations.

Receptive fields in the dorsal pathway of visual processing are less retinotopic and more head-centered.

Parvocellular ganglion cells are color sensitive, have small receptive fields and are focused on foveal vision.

Magnocellular ganglion cells have lower spatial and higher temporal resolution than parvocellular cells.

There are shortcuts between the levels of visual processing in the visual cortex.

Certain neurons in V1 are sensitive to simple features:

  • edges,
  • gratings,
  • line endings,
  • motion,
  • color,
  • disparity

Certain receptive fields in the cat striate cortex can be modeled reasonably well using linear filters, more specifically Gabor filters.

Simple cells are sensitive to the phase of gratings, whereas complex cells are not and have larger receptive fields.

Some cells in V1 are sensitive to binocular disparity.

LIP has been suggested to contain a saliency map of the visual field, to guide visual attention, and to decide about saccades.

Neurons in the superficial SC are almost exclusively visual in most species.

The receptive fields of LGN cells can be described as either an excitatory area inside an inhibitory area or the reverse.

The receptive field properties of neurons in the cat striate cortex have been modeled as linear filters. In particular three types of linear filters have been proposed:

  • Gabor filters,
  • filters based on second differentials of Gaussian functions,
  • difference of Gaussians filters.

Hawken and Parker studied the response patterns of a large number of cells in the cat striate cortex and found that Gabor filters, filters based on second differentials of Gaussian functions, and difference-of-Gaussians filters all model these response patterns well, quantitatively.

They found, however, that difference-of-Gaussians filters strongly outperformed the other models.

Difference-of-Gaussians filters are parsimonious candidates for modeling the receptive fields of striate cortex cells: the kind of difference of Gaussians found in striate cortex (Gaussians with different peak locations) can itself be computed linearly from the differences of Gaussians that model the receptive fields of LGN cells (where the peaks coincide), and LGN provides the input to the striate cortex.

Both simple and complex cells' receptive fields can be described using difference-of-Gaussians filters.
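
For concreteness, generic one-dimensional forms of two of these filter families are (textbook parameterizations; the models actually fitted in the cited studies add further parameters):

\[
g(x) = \exp\!\left(-\frac{x^{2}}{2\sigma^{2}}\right)\cos\!\left(\frac{2\pi x}{\lambda} + \phi\right), \qquad
d(x) = k_{1}\exp\!\left(-\frac{(x-x_{1})^{2}}{2\sigma_{1}^{2}}\right) - k_{2}\exp\!\left(-\frac{(x-x_{2})^{2}}{2\sigma_{2}^{2}}\right),
\]

where $g$ is a Gabor filter and $d$ a difference-of-Gaussians filter. With coinciding peaks ($x_{1}=x_{2}$), $d$ describes the center-surround receptive fields of LGN cells; with offset peaks, it describes oriented striate cortex receptive fields, which is the linear-composition argument made above.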

"Natural images are statistically redundant."

It seems as though the primate trichromatic visual system is well-suited to capture the distribution of colors in natural scenes.

By optimizing the sparseness (or coding efficiency) of functions for representing natural images, one can arrive at tuning functions similar to those found in simple cells (the objective typically optimized is sketched after the list). They are

  • spatially localized
  • oriented
  • band-pass filters with different spatial frequencies.
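
The objective typically optimized in this line of work (in the spirit of Olshausen and Field's sparse coding; notation illustrative) trades reconstruction error against the sparseness of the coefficients:

\[
E = \sum_{x,y}\Big[I(x,y) - \sum_{i} a_{i}\,\phi_{i}(x,y)\Big]^{2} + \lambda \sum_{i} S(a_{i}),
\]

where $I$ is an image patch, $\phi_{i}$ are the basis functions being learned, $a_{i}$ their coefficients, $S$ a sparseness penalty, and $\lambda$ its weight. Basis functions learned this way end up spatially localized, oriented, and band-pass, as listed above.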

LGN cells respond in a whitened (i.e. efficient) way to natural images, but their responses to white noise, for example, are not white. They are thus well-adapted to natural images from the efficient-coding point of view.

One hypothesis about early visual processing is that it tries to preserve (and enhance) as much information about the visual stimuli (with as little effort) as possible. Findings about efficiency in visual processing seem to validate this hypothesis.

A complete theory of early visual processing would need to address more aspects than coding efficiency, optimal representation and cleanup. Tasks and implementation would have to be taken into account.

Saccade targets tend to be the centers of objects.

When reading, preferred viewing locations (PVL)---the centers of the distributions of fixation targets---are typically located slightly left of the center of words.

When reading, the standard deviation of the distribution of fixation targets within a word increases with the distance between the start and end of a saccade.

Pajak and Nuthmann found that saccade targets are typically at the center of objects. This effect is strongest for large objects.

Early visual neurons (eg. in V1) do not seem to encode probabilities.

I'm not so sure that early visual neurons don't encode probabilities. The question is: which probabilities do they encode? That of a line being there?

The optic nerve does not have the bandwidth to transmit all the light receptors' activities. Some compression occurs already in the eye.

Magnocellular ganglion cells have large receptive fields.

The M-stream of visual processing is formed by magnocellular ganglion cells, the P-stream by parvocellular ganglion cells.

The M-stream is thought to deal with motion detection and analysis, while the P-stream seems to be involved in processing color and form.

Cells in inferotemporal cortex are highly selective to the point where they approach being grandmother cells.

There are cells in inferotemporal cortex which respond to (specific views on / specific parts of) faces, hands, walking humans and others.

Predictive coding and biased competition are closely related concepts. Spratling combines them in his model and uses it to explain visual saliency.

The retina projects to the superficial SC directly.

Mishkin et al. proposed a theory suggesting that visual processing runs in two pathways: the `what' and the `where' pathway.

The `what' pathway runs ventrally from the striate and prestriate to the inferior temporal cortex. This pathway is supposed to deal with the identification of objects.

The `where' pathway runs dorsally from striate and prestriate cortex to the inferior parietal cortex. This pathway is supposed to deal with the localization of objects.

Mishkin et al. already recognized the question of how and where the information carried in the different pathways could be integrated. They speculated that some of the targets of projections from the pathways, e.g. in the limbic system or the frontal lobe, could be convergence sites. Mishkin et al. stated that some preliminary results suggest that the hippocampal formation might play an important role.

Because of the distance between the focal point of the lens and the point of rotation of the biological eye, depth information can be inferred from shifts of objects' projections on the retina during eye movements.

Von der Malsburg introduces a simple model of self-organization which explains the organization of direction-sensitive cells in the human visual cortex.

Georgopoulos et al. introduced the notion of population coding and population vector readout.
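
In its basic form (normalization details vary between studies), the population vector sums the cells' preferred directions weighted by their firing rates:

\[
\vec{P} = \sum_{i} r_{i}\, \vec{c}_{i},
\]

where $r_{i}$ is the (baseline-corrected) firing rate of neuron $i$ and $\vec{c}_{i}$ its preferred direction; the direction of $\vec{P}$ serves as the readout of the encoded movement direction.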

Pitti et al. claim that their model explains preference for face-like visual stimuli and that their model can help explain imitation in newborns. According to their model, the SC would develop face detection through somato-visual integration.