I am building a classification model on training data with noisy labels: roughly 70% of the records are labeled correctly and roughly 30% are labeled incorrectly. Given this, how can I quantify the error rate of my model? For example, if I get 85% accuracy on the test set, how many of those 85% come from the 70% of records that are actually labeled correctly?
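To make the arithmetic in that example concrete, here is a minimal sketch of the worst-case and best-case bounds that follow from the two numbers alone. It assumes the test labels carry the same ~30% noise as the training labels and that there are more than two classes; it assumes nothing about how the noise depends on the predictors. The function name `true_accuracy_bounds` is made up for illustration.

```python
def true_accuracy_bounds(agreement, label_accuracy):
    """Bounds implied by two marginal rates, with no independence assumption.

    agreement      -- share of test records where the prediction matches the
                      (possibly wrong) recorded label, e.g. 0.85
    label_accuracy -- share of test records whose recorded label is correct,
                      e.g. 0.70 (assumed equal to the training noise level)
    """
    # Share of all records that are both correctly labeled and matched by
    # the model (Frechet bounds on the joint probability).
    joint_lo = max(0.0, agreement + label_accuracy - 1.0)
    joint_hi = min(agreement, label_accuracy)

    # True accuracy = P(prediction equals the true class).
    # Worst case: every match on a mislabeled record is a true error and
    # every mismatch is also wrong (possible with more than two classes).
    true_lo = joint_lo
    # Best case: the model matches every correctly labeled record and
    # "disagrees correctly" on as many mislabeled records as possible.
    true_hi = 1.0 - abs(agreement - label_accuracy)

    return {
        "joint": (joint_lo, joint_hi),
        "true_accuracy": (true_lo, true_hi),
    }


bounds = true_accuracy_bounds(agreement=0.85, label_accuracy=0.70)
for name, (lo, hi) in bounds.items():
    print(f"{name}: between {lo:.2f} and {hi:.2f}")
```

With these inputs, somewhere between 55% and 70% of all test records are both correctly labeled and predicted to match, and the true accuracy could lie anywhere from 55% to 85%. The bounds are wide because the two percentages alone say nothing about how model errors and label errors overlap.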
I should add that the mislabeling is not completely random. There is certainly a relationship between my predictors and whether or not the label is correct. I have a few hundred possible labels and around 1 million records. The data are survey responses describing occupations, so common mislabelings involve write-ins containing phrases such as “Office manager”, which could land in any number of codes.
Is there any literature on this? Maybe some sort of confidence interval I can build for the error rate?
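As an illustration of the kind of interval I have in mind, here is a minimal sketch of a Wilson score interval for a proportion. It assumes a small random sample of test records could be hand-verified so that the true labels are known; that audit sample and the counts below are hypothetical, not something I currently have. The function name `wilson_interval` is also just illustrative.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.

    successes -- number of audited records the model got right
    n         -- number of audited records (a simple random sample)
    z         -- normal quantile; 1.959964 gives roughly 95% coverage
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical audit: 200 hand-verified records, 170 predicted correctly.
low, high = wilson_interval(successes=170, n=200)
print(f"accuracy on audited sample: 95% interval from {low:.3f} to {high:.3f}")
```

The interval is only valid for accuracy against the verified labels, and only if the audited records are a random sample of the population the model will be applied to.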