SOMs can be used as a means of learning principal manifolds.
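
A minimal sketch of this idea, assuming a one-dimensional SOM learning a noisy one-dimensional manifold (a sine arc) embedded in 2D; all parameter values are illustrative, not taken from any particular source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a 1D principal manifold embedded in 2D: a sine arc.
t = rng.uniform(0, 2 * np.pi, 3000)
data = np.column_stack([t, np.sin(t)]) + rng.normal(0, 0.1, (3000, 2))

# 1D SOM: a chain of units, each with a 2D weight vector.
n_units = 30
weights = rng.uniform(data.min(0), data.max(0), (n_units, 2))
grid = np.arange(n_units)                     # positions on the SOM lattice

n_iter = 10000
for i in range(n_iter):
    x = data[rng.integers(len(data))]
    # Best-matching unit: the unit whose weight vector is closest to x.
    bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))
    # Learning rate and neighbourhood radius decay over time.
    lr = 0.5 * (0.01 / 0.5) ** (i / n_iter)
    radius = 8.0 * (0.5 / 8.0) ** (i / n_iter)
    # Gaussian neighbourhood on the lattice pulls the BMU's neighbours along.
    h = np.exp(-((grid - bmu) ** 2) / (2 * radius ** 2))
    weights += lr * h[:, None] * (x - weights)

# The chain of weight vectors should now trace the sine arc.
print(weights.round(2))
```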

A model of a theory is a substitution of the theory's variables by objects (individuals) that satisfies all of the theory's sentences.

Many visual person detection methods use a single feature to detect people and create a histogram of that feature's strength across the image. They then compute a likelihood for a pixel or region by assuming a Gaussian distribution of the distances of pixels or histograms belonging to a face. This distribution has been validated in practice (for certain cases).

In order to work with spatial information from different sensory modalities and use it for motor control, coordinate transformation must happen at some point during information processing. Pouget and Sejnowski state that in many instances such transformations are non-linear. They argue that functions describing receptive fields and neural activation can be thought of and used as basis functions for the approximation of non-linear functions such as those occurring in sensory-motor coordinate transformation.
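
A minimal sketch of the basis-function idea (an illustration with made-up tuning shapes and a made-up target mapping, not Pouget and Sejnowski's actual model): responses of units tuned to retinal position and modulated by eye position serve as basis functions, and a linear readout of them approximates a non-linear sensory-motor mapping.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: retinal position x and eye position e, both in degrees.
x = rng.uniform(-40, 40, size=2000)
e = rng.uniform(-40, 40, size=2000)

# Hypothetical non-linear target: some motor quantity that depends
# non-linearly on both variables (chosen arbitrarily for illustration).
target = np.tanh((x + e) / 20.0) * np.cos(e / 15.0)

# Basis functions: Gaussians over x multiplied by sigmoids of e ("gain fields").
x_centers = np.linspace(-40, 40, 15)
e_centers = np.linspace(-40, 40, 15)
sigma = 8.0

def basis_responses(x, e):
    gx = np.exp(-(x[:, None] - x_centers[None, :]) ** 2 / (2 * sigma ** 2))
    ge = 1.0 / (1.0 + np.exp(-(e[:, None] - e_centers[None, :]) / sigma))
    # Outer product of the two sets of responses for every sample.
    return (gx[:, :, None] * ge[:, None, :]).reshape(len(x), -1)

B = basis_responses(x, e)

# Linear readout fitted by least squares.
w, *_ = np.linalg.lstsq(B, target, rcond=None)

# Evaluate on held-out samples.
x_test = rng.uniform(-40, 40, size=500)
e_test = rng.uniform(-40, 40, size=500)
pred = basis_responses(x_test, e_test) @ w
true = np.tanh((x_test + e_test) / 20.0) * np.cos(e_test / 15.0)
print("RMSE:", np.sqrt(np.mean((pred - true) ** 2)))
```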

Lee and Mumford interpret the visual pathway in terms of Bayesian belief propagation: each stage in the processing uses output from the one further up as contextual information and output from the one further down as evidence to update its belief and corresponding output.

Each layer thus calculates probabilities of features of the visual display given noisy and ambiguous input.

Lee and Mumford link their theory to resonance and predictive coding.

Hornik et al. define a squashing function as a non-decreasing function $\Psi:\mathbb{R}\rightarrow [0,1]$ whose limit at infinity is 1 and whose limit at negative infinity is 0.
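
For example, the logistic function satisfies this definition:

$$\Psi(x)=\frac{1}{1+e^{-x}},\qquad \lim_{x\to\infty}\Psi(x)=1,\qquad \lim_{x\to-\infty}\Psi(x)=0,$$

and $\Psi$ is non-decreasing; the Heaviside step function is another example.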

Sun states:

Any amount of detail of a "mechanism" [...] (provided that it is Turing computable) can be described in an algorithm, while it may not be the case that it can be described through mathematical equations (that is to say, algorithms are more expressive).

However, $\mu$-recursive functions are Turing-complete and they can be expressed in mathematical equations.
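
For illustration, addition on the natural numbers is defined by two recursion equations (primitive recursion, a special case of $\mu$-recursion), with $S$ the successor function:

$$\mathrm{add}(x,0)=x,\qquad \mathrm{add}(x,S(y))=S(\mathrm{add}(x,y)).$$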

Actually, I believe most computational cognitive models (which is what Sun writes about) can be expressed in relatively simple, though long, recursive equations.

Neurons' activities are most informative of the value of stimulus properties in the region where their tuning functions are maximal or have maximal slope.

Which of the two regions is the most informative depends on the variability (noise) of the neurons' responses.
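
One standard way to make this precise (a textbook result, stated here as background rather than taken from any of the papers discussed) is the Fisher information of a single neuron with tuning function $f(s)$:

$$I_{\mathrm{Gauss}}(s)=\frac{f'(s)^2}{\sigma^2},\qquad I_{\mathrm{Poisson}}(s)=\frac{f'(s)^2}{f(s)},$$

so with constant-variance Gaussian noise the most informative region is where the slope is maximal, while Poisson noise additionally discounts regions of high mean response.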

Tabareau et al. propose a scheme for a transformation from the topographic mapping in the SC to the temporal code of the saccadic burst generators.

According to their analysis, that code needs to be either linear or logarithmic.

A deep SC neuron which receives enough information from one modality to reliably determine whether a stimulus is in its receptive field does not improve its performance much by integrating information from another modality.

Patton et al. use this insight to explain the diversity of uni-sensory and multisensory neurons in the deep SC.

In hypothesis testing, we usually know that neither the null hypothesis nor the alternative hypothesis can be fully true. They can at best be approximations to, i.e. different from, reality. However, the procedure of hypothesis testing consists of testing which of the two is more likely to be true given a sample, not which of the two is the better approximation. Thus, strictly speaking, we're usually applying hypothesis testing to problems the theory was not designed for.

Akaike's information criterion is strongly linked to information theory and the maximum likelihood principle.
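
For a model with $k$ free parameters and maximized likelihood $\hat{L}$, it is defined as

$$\mathrm{AIC} = 2k - 2\ln\hat{L},$$

which rewards goodness of fit through the likelihood term and penalizes model complexity through the number of parameters; lower values are better.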

Ma, Beck, Latham and Pouget argue that optimal integration of population-coded probabilistic information can be achieved by simply adding the activities of neurons with identical receptive fields. The preconditions for this to hold (a numerical sketch follows the list) are

  • independent Poisson (or other "Poisson-like") noise in the input,
  • identically-shaped tuning curves in input neurons, and
  • a point-to-point connection from neurons in different populations with identical receptive fields to the same output neuron.
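
A minimal numerical sketch of the idea (my own illustration under the preconditions above, not code from Ma et al.): two populations with identical Gaussian tuning curves and independent Poisson noise respond to the same stimulus, and the posterior computed from the summed spike counts matches the combined posterior of the two populations, up to the approximately stimulus-independent term $\sum_i f_i(s)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stimulus space and identical Gaussian tuning curves for both populations.
s_grid = np.linspace(-10, 10, 401)   # candidate stimulus values
centers = np.linspace(-10, 10, 50)   # preferred stimuli of the neurons
gain, sigma = 20.0, 1.5

def mean_rates(s):
    """Mean spike counts of one population for stimulus value(s) s."""
    s = np.atleast_1d(s)
    return gain * np.exp(-(s[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

# Independent Poisson responses of two populations (say, visual and auditory)
# with identical receptive fields, to the same stimulus.
s_true = 2.0
r1 = rng.poisson(mean_rates(s_true)[0])
r2 = rng.poisson(mean_rates(s_true)[0])

def posterior(r):
    """Posterior over s_grid for Poisson noise and a flat prior."""
    f = mean_rates(s_grid)                  # shape: (len(s_grid), n_neurons)
    logp = r @ np.log(f).T - f.sum(axis=1)  # log-likelihood up to a constant
    p = np.exp(logp - logp.max())
    return p / p.sum()

# Posterior from the point-to-point sum of the two populations ...
post_sum = posterior(r1 + r2)
# ... versus the product of the two individual posteriors (optimal combination).
post_prod = posterior(r1) * posterior(r2)
post_prod /= post_prod.sum()

print("max abs difference:", np.abs(post_sum - post_prod).max())
print("MAP estimate from summed activity:", s_grid[np.argmax(post_sum)])
```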

It's hard to interpret Ma et al.'s paper unambiguously, but it seems that, according to Renart and van Rossum, any other non-flat profile would also transmit the information optimally, although the decoding scheme might have to be different.

Renart and van Rossum discuss optimal connection weight profiles between layers in a feed-forward neural network. They come to the conclusion that, if neurons in the input population have broad tuning curves, then Mexican-hat-like connectivity profiles are optimal.

Renart and van Rossum state that any non-flat connectivity profile between input and output layers in a feed-forward network yields optimal transmission if there is no noise in the output.

The model due to Ma et al. is simple and it requires no learning.

Colonius and Diederich argue that deep-SC neurons' spiking behavior can be interpreted as a vote for a target rather than a non-target being in their receptive field.

This is similar to Anastasio et al.'s previous approach.

There are a number of problems with Colonius' and Diederich's idea that deep-SC neurons' binary spiking behavior can be interpreted as a vote for a target rather than a non-target being in their RF. First, these neurons' RFs can be very broad, and the strength of their response is a function of how far away the stimulus is from the center of their RFs. Second, the response strength is also a function of stimulus strength. It needs some arguing, but to me it seems more likely that the response encodes the probability of a stimulus being in the center of the RF.

Colonius and Diederich argue that, given their Bayesian, normative model of neurons' response behavior, neurons responding to only one sensory modality outperform neurons responding to multiple sensory modalities.

Colonius' and Diederich's explanation for uni-sensory neurons in the deep SC has a few weaknesses: First, they model the input spiking activity for both the target and the non-target case as Poisson distributed. This is a problem because the input spiking activity is really a function of the target's distance from the center of the RF. Second, they explicitly model the probability of a target being visible as independent of the probability of it being audible.

If SC neurons' spiking behavior can be interpreted as a vote for a target rather than a non-target being in their receptive field, then the decisions must be made somewhere else, because such votes do not take utility into account.

Freeman and Dale discuss three measures for detecting bimodality in an observed probability distribution:

  • The bimodality coefficient (BC),
  • Hartigan's dip statistic (HDS), and
  • Akaike's information criterion between one-component and two-component distribution models (AID).

Measures for detecting bimodality can be used to detect whether psychometric measurements include cases in which behavior was caused by different cognitive processes (like intuitive and rational processing).

According to Freeman and Dale, Hartigan's dip statistic is more robust against skew than either the bimodality coefficient or Akaike's information criterion.

The bimodality coefficient can be unstable with small sample sizes (n<10).
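
A minimal sketch of how the bimodality coefficient is commonly computed, using the sample-size-corrected formula reported in the bimodality literature (e.g. Pfister et al.); the correction term $3(n-1)^2/((n-2)(n-3))$ is also where the small-sample instability comes from, and the conventional benchmark of $5/9 \approx 0.555$ should be treated as an assumption to check against the sources.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def bimodality_coefficient(x):
    """Sample bimodality coefficient; values above ~5/9 are taken to suggest bimodality."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    g1 = skew(x)       # sample skewness
    g2 = kurtosis(x)   # sample excess kurtosis
    return (g1 ** 2 + 1) / (g2 + 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

rng = np.random.default_rng(0)
unimodal = rng.normal(0.0, 1.0, 500)
bimodal = np.concatenate([rng.normal(-2.0, 1.0, 250), rng.normal(2.0, 1.0, 250)])

print("unimodal BC:", round(bimodality_coefficient(unimodal), 3))  # typically < 5/9
print("bimodal  BC:", round(bimodality_coefficient(bimodal), 3))   # typically > 5/9
```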

Bimodality measures for probability distributions are affected by

  • distance between modes,
  • proportion (relative gain) of modes, and
  • proportion of skew.

Of the three, Freeman and Dale found distance between modes to have the greatest impact on the measures they chose.

In Freeman and Dale's simulations, Hartigan's dip statistic was the most sensitive in detecting bimodality.

In Freeman and Dale's simulations, Hartigan's dip statistic was strongly influenced by proportion between modes.

In Freeman and Dale's simulations, the bimodality coefficient suffered from interactions between skew and proportion between modes.

According to Freeman and Dale, the bimodality coefficient uses the heuristic that bimodal distributions are often asymmetric, which leads to high skew and low kurtosis.

It therefore makes sense that it may detect false positives for uni-modal distributions with high skew and low kurtosis.

Freeman and Dale "are inclined to recommend" Hartigan's dip statistic to detect bimodality.

Intuitively, Akaike's information criterion between one-component and two-component distribution models (AID) tests whether one model or another describes the data better, with a penalty for model complexity.

Freeman and Dale found Akaike's information criterion between one-component and two-component distribution models (AID) to be very sensitive to, but also highly biased towards, bimodality.

Pfister et al. recommend using Hartigan's dip statistic and the bimodality coefficient plus visual inspection to detect bimodality.

There is no cost function that SOM learning minimizes.

In implementing GTM for some specific use case, one chooses a noise model (at least one per data dimension).

Under certain conditions, Q-learning is guaranteed to converge to the optimal action-value function $Q^*$, from which an optimal policy can be derived.

The function $\mathcal{Q}$ induces a strategy which always takes the action $a$ with the highest expected (cumulative, discounted) reward.
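
A minimal tabular Q-learning sketch on a toy chain environment (the environment, parameters, and variable names are illustrative assumptions, not taken from any particular source); the induced greedy strategy is read off at the end by taking $\arg\max_a \mathcal{Q}(s,a)$ in every state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain environment: states 0..4; action 0 moves left, action 1 moves right.
# Entering the rightmost state yields reward 1 and ends the episode.
n_states, n_actions = 5, 2

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    done = s_next == n_states - 1
    return s_next, reward, done

Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate

for episode in range(2000):
    s = int(rng.integers(n_states - 1))   # random non-terminal start state
    done = False
    while not done:
        # Epsilon-greedy action selection.
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a').
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

greedy_policy = Q.argmax(axis=1)   # the induced strategy: best action in each state
print("Q-values:\n", Q.round(3))
print("greedy strategy (0 = left, 1 = right):", greedy_policy)
```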

The Kullback–Leibler divergence is not symmetric, does not satisfy the triangle inequality and is therefore not a metric.

Kullback-Leibler divergence can be used heuristically as a distance between probability distributions.

Kullback-Leibler divergence $D_{KL}(P,Q)$ between probability distributions $P$ and $Q$ can be interpreted as the information lost when approximating $P$ by $Q$.

For discrete probability distributions $P,Q$ with the set of outcomes $E$, Kullback-Leibler divergence is defined as $$D_{KL}(P,Q)=\sum_{e\in E} P(e)\log\left(\frac{P(e)}{Q(e)}\right).$$
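
A small numerical illustration of this definition and of the asymmetry noted above (the two example distributions are arbitrary):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P, Q) for discrete distributions given as arrays of probabilities."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                     # terms with P(e) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.4, 0.4, 0.2])

print("D_KL(P, Q) =", kl_divergence(p, q))
print("D_KL(Q, P) =", kl_divergence(q, p))   # differs from the above: not symmetric
```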

Computer simulations have been used as early as at least the 1960s to study problems in statistics.

MLE provides an optimal method of reading population codes.

It's hard to implement MLE on population codes using neural networks.

Depending on the application, tuning curves, and noise properties, threshold linear networks calculating population vectors can perform similarly to MLE.

There seems to be a linear relationship between the mean and variance of neural responses in cortex. This is similar to a Poisson distribution, where the variance equals the mean; however, the constant of proportionality does not seem to be one in biology.

Seung and Sompolinsky introduce maximum likelihood estimation (MLE) as one possible mechanism for neural read-out. However, they state that it is not clear whether MLE can be implemented in a biologically plausible way.

Seung and Sompolinsky show that, in a population code with wide tuning curves and Poisson noise, and under the conditions described in their paper, the response of neurons near threshold carries exceptionally high information.

A principal manifold can only be learned correctly using a SOM if

  • the SOM's dimensionality is the same as that of the principal manifold,
  • the noise does not 'smear' the manifold so much that it becomes indistinguishable from a manifold of higher dimensionality, and
  • there are enough data points to infer the manifold behind the noise.

The SOM has ancestors in von der Malsburg's "Self-Organization of Orientation Sensitive Cells in the Striate Cortex" and other early models of self-organization.

SOMs tend to have greater unit densities for points in data space with high data density. They do not follow the density strictly, however.

Zhang examines conditions under which Naïve Bayes classifiers can be optimal despite their inherent limitations, and argues that these conditions are common enough to explain some of their success.