Without ground-truth feedback, a system that learns from a data set of corresponding noisy values needs at least three modalities to learn their individual reliabilities.
One way of doing this is to learn the pairwise correlations between modalities.
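
To see why three modalities can suffice, here is a minimal sketch under assumptions that are not in the note itself: each modality observes the same latent value plus independent zero-mean noise, and the pairwise statistic used is the variance of the difference between two modalities rather than a correlation coefficient. Under those assumptions, Var(x_i - x_j) is the sum of the two noise variances, so three modalities give three equations in three unknowns, while two modalities give only one equation in two unknowns. All names and noise levels below are hypothetical.

```python
import random


def diff_variance(a, b):
    """Sample variance of the element-wise difference a - b."""
    d = [x - y for x, y in zip(a, b)]
    mean = sum(d) / len(d)
    return sum((v - mean) ** 2 for v in d) / (len(d) - 1)


def noise_variances(x1, x2, x3):
    """Solve Var(xi - xj) = var_i + var_j for the three noise variances.

    Assumes additive, mutually independent noise on a shared latent value.
    """
    v12 = diff_variance(x1, x2)
    v13 = diff_variance(x1, x3)
    v23 = diff_variance(x2, x3)
    return (
        (v12 + v13 - v23) / 2,
        (v12 + v23 - v13) / 2,
        (v13 + v23 - v12) / 2,
    )


rng = random.Random(0)
true_sd = (0.5, 1.0, 2.0)  # hypothetical noise levels, unknown to the learner
signal = [rng.uniform(-10.0, 10.0) for _ in range(20000)]
observations = [[s + rng.gauss(0.0, sd) for s in signal] for sd in true_sd]

est_var = noise_variances(*observations)
print([round(v, 2) for v in est_var])  # compare with the true 0.25, 1.0, 4.0
```

The latent `signal` is used only to generate the data; the estimator never sees it.
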
It is not enough to take the best hypothesis under the currently learned reliability model and use it in place of ground truth to learn the variances of the individual modalities.
If the algorithm comes to believe that one modality has near-perfect reliability, that modality will determine every subsequent best hypothesis.
In effect, that modality will *be* ground truth for the algorithm, and it will only learn how well the others predict it.
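
The failure mode described above can be reproduced with a small simulation. This is a sketch of a generic naive loop, not the procedure of any cited paper: it fuses the modalities by inverse-variance weighting, treats the fused value as if it were ground truth, and re-estimates each variance from the residuals. The function name, the noise levels, the starting beliefs, and the variance floor (added only to avoid division by zero) are all assumptions made for illustration.

```python
import random


def self_train(observations, variances, iterations=20, floor=1e-12):
    """Naive self-training: fuse, then relearn variances from the fusion."""
    n = len(observations[0])
    for _ in range(iterations):
        # Best hypothesis under the current model: inverse-variance weighting.
        weights = [1.0 / v for v in variances]
        total = sum(weights)
        fused = [
            sum(w * obs[k] for w, obs in zip(weights, observations)) / total
            for k in range(n)
        ]
        # Use the fused hypothesis in place of ground truth.
        variances = [
            max(sum((obs[k] - fused[k]) ** 2 for k in range(n)) / n, floor)
            for obs in observations
        ]
    return variances


rng = random.Random(1)
true_sd = (2.0, 0.5, 1.0)  # hypothetical: modality 0 is actually the noisiest
signal = [rng.uniform(-10.0, 10.0) for _ in range(5000)]
observations = [[s + rng.gauss(0.0, sd) for s in signal] for sd in true_sd]

# Start from the mistaken belief that modality 0 is near-perfect.
learned = self_train(observations, [1e-6, 1.0, 1.0])
print(learned)
```

In this setup the belief never corrects itself: the learned variance of modality 0 stays near zero, and the other two variances converge to their mean squared deviation from modality 0 (roughly the sum of the two true noise variances) instead of their own noise levels.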

Zhou et al. use an approach similar to that of Bauer et al.
They do not use the *pairwise cross-correlation* between input modalities, only the *variances* of the individual modalities.
It is unclear how they handle the case where one modality essentially becomes ground truth for the algorithm.