How to train a machine learning algorithm with multiple labels

I have the following challenge and I very much hope that there is a solution to it. I also suspect that there is a simple approach to it. I just don't see it at the moment. Any help or advice is highly appreciated.

So, I have the following situation:

I asked people to label about 1000 data points (each twice) on a 5-point scale whose scores are not equidistant. The texts were assessed with regard to several qualitative characteristics (such as comprehensibility). As was to be expected, the labelers did not always agree in their assessments. An analysis of the inter-rater reliability, however, showed "substantial" reliability (according to Landis and Koch).

Now I want to use the labelled data as input for a machine learning algorithm (e.g. SVM and Random Forest). The challenge is how to prepare the data in advance, because the same sample currently has different labels.

Averaging the different labels does not seem reasonable to me. Are there standard procedures for adjusting the data set in advance?

You would help me a lot!

Thanks a lot in advance.

Topic labels multiclass-classification supervised-learning machine-learning

Category Data Science


If you intend to use a summary statistic, you should choose it so that it suits your task, meaning that it captures most of the relevant information. There is usually no universally best solution for this; it depends on the problem. You did not say what your problem is about, so I can't help much there, but the median value may be worth trying.
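
To make the median suggestion concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the sample ids, the ratings, and the helper names `aggregate_labels` and `expand_labels` are hypothetical and not taken from the question. With only two raters the ordinary median is the same as the average, so the sketch uses `statistics.median_low`, which always returns one of the labels that was actually given and therefore stays on the ordinal scale. It also shows one possible alternative, which is to skip aggregation and keep each rating as its own training row.

```python
from statistics import median_low

# Hypothetical ratings: sample id -> labels from the two raters (1..5 ordinal scale).
ratings = {
    "text_01": [4, 4],
    "text_02": [2, 3],
    "text_03": [5, 3],
}

def aggregate_labels(ratings):
    """Collapse each sample's ratings to one label that is itself on the scale."""
    # median_low returns the lower of the two middle values for even-sized input,
    # so the result is always an observed label, never an in-between value.
    return {sample_id: median_low(labels) for sample_id, labels in ratings.items()}

def expand_labels(ratings):
    """Alternative: keep every rating as its own (sample, label) training row."""
    return [
        (sample_id, label)
        for sample_id, labels in ratings.items()
        for label in labels
    ]

aggregated = aggregate_labels(ratings)
expanded = expand_labels(ratings)

print(aggregated)
print(expanded)
```

Whether the lower median, the higher one (`statistics.median_high`), or the expanded rows work better depends on the task, so it is probably worth comparing them with cross-validation rather than assuming one is right.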
