Classification with noisy labels where the noise is structured, not random

I am building a classification model from training data with noisy labels: roughly 70% of the records are labeled correctly and roughly 30% are labeled incorrectly. Given this, how can I quantify the error rate of my model? For example, if I get 85% accuracy on the test set, how much of that 85% comes from the 70% of records that are actually labeled correctly?
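To make the arithmetic concrete, here is a minimal sketch of the worst-case bounds. It assumes the test set has the same ~30% label noise as the training set and assumes nothing about how the noise relates to the predictions. The function name `true_accuracy_bounds` is my own, not from any library.

```python
def true_accuracy_bounds(observed_acc, label_correct_rate):
    """Worst-case bounds, with no assumption about the noise structure.

    observed_acc: fraction of test records where prediction == (noisy) label.
    label_correct_rate: fraction of test labels that are actually correct
        (assumed here to match the ~70% figure for the training data).
    """
    a, c = observed_acc, label_correct_rate

    # Fraction of records that are BOTH counted as hits and correctly labeled.
    # These are standard bounds on the probability of an intersection.
    overlap_low = max(0.0, a + c - 1.0)
    overlap_high = min(a, c)

    # True accuracy (prediction == true label):
    # lowest when no mislabeled record is predicted correctly,
    # highest when every record that disagrees with a wrong label is right.
    acc_low = overlap_low
    acc_high = 1.0 - abs(a - c)

    return (overlap_low, overlap_high), (acc_low, acc_high)


overlap, accuracy = true_accuracy_bounds(0.85, 0.70)
print("hits on correctly labeled records: %.2f to %.2f" % overlap)
print("true accuracy:                     %.2f to %.2f" % accuracy)
```

With 85% observed accuracy and 70% correct labels, between roughly 55% and 70% of all test records are hits on correctly labeled data, and the true accuracy lies somewhere between roughly 55% and 85%. These bounds are wide precisely because nothing is assumed about the noise, which is why I am looking for something tighter.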

I should add that the labels are not mislabeled completely at random. There is certainly a relationship between my predictors and whether or not the label is correct. I have a few hundred possible labels and around 1 million records. The data are survey responses describing occupations, so common mislabelings involve write-ins such as “Office manager”, which could land in any number of codes.

Is there any literature on this? Maybe some sort of confidence interval I can build for the error rate?
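As an illustration of the kind of interval I have in mind, here is a minimal sketch that assumes I could hand-verify a small random sample of test records and count how often the model's prediction matches the verified label. It uses the Wilson score interval for a binomial proportion. The counts below are hypothetical, not from my data, and `wilson_interval` is my own helper name.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical audit: 400 randomly sampled test records checked by hand,
# of which the model's prediction matched the verified label 260 times.
low, high = wilson_interval(successes=260, n=400)
print("true accuracy, approx. 95%% interval: %.3f to %.3f" % (low, high))
```

Because the audited sample is drawn at random, this would not depend on how the label noise is structured, but it does require manual verification, and I do not know whether something comparable exists without it.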

Cross Validated Asked on November 14, 2021

0 Answers
