# Loss function for regression with uncertain labels

I have a regression task, for which I’m training a model with MSE loss. So for label $$y$$ and estimation $$\hat{y}$$ the loss is
$$\ell(y,\hat{y})=(y-\hat{y})^2$$
However, there is uncertainty in the “true” labels, and it varies across labels. That is, each true label is drawn from a distribution for which I can obtain a reasonable estimate of any statistic, e.g. the standard deviation.

I’d like the loss to reflect the variation in the true label $$y$$. I thought about simply normalizing by the standard deviation of each label

$$\ell\left(y,\hat{y}\right)=\left(\frac{y-\hat{y}}{\sigma\left(y\right)}\right)^{2}$$

Or, since sometimes $$\sigma(y)=0$$, maybe

$$\ell\left(y,\hat{y}\right)=\left(\frac{y-\hat{y}}{1+\sigma\left(y\right)}\right)^{2}$$

But this seems too ad-hoc. Is there a standard theory or approach people use in this sort of situation?

Cross Validated Asked on November 14, 2021

The usual approach in statistics is to assume that the errors $$\epsilon_i = y_i - E[y_i|x]$$ are homoscedastic with variance $$\sigma^2$$. This assumption, together with independence of the errors, leads to least squares as the loss function for estimating $$E[y_i|x]$$.

If your measurements of $$y$$ are themselves noisy, the variance of the errors becomes $$\sigma^2 + \sigma(y_i)^2$$. This results in a loss function of $$\sum_i w_i(y_i-\hat y_i)^2$$, where $$w_i = (\sigma^2 + \sigma(y_i)^2)^{-1}$$, i.e. each squared residual is weighted by the inverse of its variance.
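To see where the inverse-variance weighting comes from, one can write out a Gaussian likelihood. This is only a sketch under added assumptions that are not stated above: the model error $$\epsilon_i$$ and the label noise $$\eta_i$$ (a symbol introduced here for illustration) are independent and normally distributed, and $$v_i$$ is shorthand for the total variance of observation $$i$$.

$$y_i = E[y_i|x] + \epsilon_i + \eta_i, \qquad \epsilon_i \sim N(0,\sigma^2), \qquad \eta_i \sim N\left(0,\sigma(y_i)^2\right)$$

$$-\log L = \frac{1}{2}\sum_i\left[\log\left(2\pi v_i\right) + \frac{(y_i-\hat y_i)^2}{v_i}\right], \qquad v_i = \sigma^2 + \sigma(y_i)^2$$

For a fixed $$\sigma^2$$ the first term does not depend on $$\hat y_i$$, so minimizing the negative log-likelihood over the predictions amounts to minimizing $$\sum_i (y_i-\hat y_i)^2 / v_i$$. Note that $$v_i > 0$$ whenever $$\sigma^2 > 0$$, so labels with $$\sigma(y_i)=0$$ need no special handling.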

The problem is that $$\sigma^2$$, a.k.a. the residual variance, is not known and has to be estimated, and it can't be estimated after fitting the rest of the model, because the model needs it to define the loss function. A solution is given by Iteratively Reweighted Least Squares. It is a fairly intuitive algorithm; one simple explanation is available in section 2.3 of this document.
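The alternating idea can be sketched in a few lines of NumPy. This is a minimal illustration, not necessarily the exact procedure in the referenced document: it assumes a linear model, and the function name `fit_reweighted_least_squares`, the simple moment-based update for the residual variance, and the small floor on that estimate are all choices made here for the example.

```python
import numpy as np

def fit_reweighted_least_squares(X, y, label_sd, n_iter=50, tol=1e-8):
    """Linear regression with known per-label noise and unknown residual variance.

    X        : (n, p) design matrix (include a column of ones for an intercept)
    y        : (n,) noisy labels
    label_sd : (n,) known standard deviation of each label (zeros are allowed)
    Returns (beta, sigma2, weights).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    label_var = np.asarray(label_sd, dtype=float) ** 2
    n = X.shape[0]

    # Start from ordinary least squares, i.e. equal weights.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = 0.0
    w = np.ones(n)
    for _ in range(n_iter):
        resid = y - X @ beta
        # Assumed moment update: E[resid_i^2] is roughly sigma2 + label_var[i].
        # The floor keeps the weights finite when a label has zero noise.
        new_sigma2 = max(np.mean(resid ** 2 - label_var), 1e-12)
        # Inverse-variance weights for the next fit.
        w = 1.0 / (new_sigma2 + label_var)
        # Weighted least squares is OLS on rows scaled by sqrt(w).
        sw = np.sqrt(w)
        new_beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = (abs(new_sigma2 - sigma2) < tol
                     and np.allclose(new_beta, beta, atol=tol))
        beta, sigma2 = new_beta, new_sigma2
        if converged:
            break
    return beta, sigma2, w
```

Calling `fit_reweighted_least_squares(X, y, label_sd)` returns the coefficients, the estimated residual variance, and the final weights. For a neural network the same weights could be plugged into a per-sample weighted MSE, re-estimating the residual variance between epochs; that extension is a suggestion rather than something the answer spells out.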

Answered by carlo on November 14, 2021
