Evaluation method for a multi-class classification problem modeled as binary classification problems

I should mention that even though I have some basic knowledge of ML, this is the first big ML project I am working on, and for my research proposal I need to suggest an evaluation metric.

The problem is a multiclass (16 classes) classification problem where one data point can belong to multiple classes (not ranking-based, though). I plan to model it as a separate binary classification problem for each class, but I was not able to find evaluation metrics that clearly apply to this setup. So, first of all, should I evaluate performance for each class individually (how well the class A classifier is working), should I go for a general evaluation (this data point belongs to A, B and C but was classified as A and B only), or both? Second, what kind of metrics can I look at? Finally, I haven't started working on the data yet, but I expect an unbalanced distribution across my classes. Would that affect my results?

Topic evaluation classification

Category Data Science


Unbalanced data will definitely be a problem and should be addressed. In particular, "accuracy" will no longer be a dependable metric if you decide to use the unbalanced data directly, so you should use other metrics that are more reliable in such scenarios. Which ones are suitable also depends on the data distribution you have. Here is a discussion of how each metric performs in different situations.
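To make the point about accuracy concrete, here is a minimal sketch in plain Python. The data is made up for illustration (a hypothetical rare class with 2 positives out of 20 samples), and the helper names `accuracy` and `precision_recall_f1` are my own, not from any library. It compares accuracy with precision, recall and F1 for a classifier that never predicts the rare class.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples where the prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for one binary label (lists of 0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention used here: a metric with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up data: the class is rare (2 positives, 18 negatives).
y_true = [1, 1] + [0] * 18
# A useless classifier that always predicts "negative".
y_pred = [0] * 20

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print(f"accuracy={acc:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The classifier gets 90% accuracy while finding none of the positives, which is why recall and F1 drop to zero. If you use scikit-learn, `precision_score`, `recall_score` and `f1_score` in `sklearn.metrics` compute the same quantities.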

Apart from the choice of a proper metric, there are other ways to deal with imbalanced data. Here you can find some of the methods, but probably the best-known ones are oversampling and undersampling, in which you equalize the number of samples in each class, either by duplicating samples of the smaller class or by dropping samples of the larger one.
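As a rough illustration of oversampling, the sketch below duplicates randomly chosen samples of the smaller class until both classes have the same size. It is plain Python on made-up data; `random_oversample` is a hypothetical helper written for this example. In practice you would apply this to the training split only, so that duplicated samples do not leak into the evaluation data.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw with replacement from the existing samples of this class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Made-up data: 10 samples (identified by index), 2 positive and 8 negative.
samples = list(range(10))
labels = [1, 1] + [0] * 8

balanced_x, balanced_y = random_oversample(samples, labels)
class_counts = Counter(balanced_y)
print(class_counts)
```

Undersampling is the mirror image: instead of duplicating minority samples, you randomly drop majority samples down to the minority size. The imbalanced-learn package offers ready-made versions of both (for example `RandomOverSampler` and `RandomUnderSampler`), if you prefer not to write your own.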

Regarding individual versus general evaluation, this probably depends on your own preference and on the classes you have in the problem. But per-class evaluation will be useful at least for fine-tuning your model: you would not want a model that performs very well on some classes but very badly on others. Such a situation might indicate that the features you use are not equally informative for every class, or that you need more data for the weak classes.
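The two views can be computed side by side. The sketch below uses made-up label sets for three hypothetical classes A, B and C (the first sample mirrors the example from the question: it belongs to A, B and C but is predicted as A and B only). It reports per-class F1, then macro- and micro-averaged F1, exact-match (subset) accuracy and Hamming loss as overall scores. All names are illustrative and the code is plain Python.

```python
LABELS = ["A", "B", "C"]

# Made-up data: each sample is the set of labels it has / was predicted to have.
true_sets = [{"A", "B", "C"}, {"A"}, {"B"}, {"A", "C"}]
pred_sets = [{"A", "B"}, {"A"}, {"B", "C"}, {"A", "C"}]


def counts(label):
    """True positives, false positives, false negatives for one label."""
    pairs = list(zip(true_sets, pred_sets))
    tp = sum(label in t and label in p for t, p in pairs)
    fp = sum(label not in t and label in p for t, p in pairs)
    fn = sum(label in t and label not in p for t, p in pairs)
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), reported as 0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Individual view: one score per class.
per_label_f1 = {label: f1(*counts(label)) for label in LABELS}

# General view, variant 1: macro average (every class weighs the same).
macro_f1 = sum(per_label_f1.values()) / len(LABELS)

# General view, variant 2: micro average (pool the counts over all classes).
total_tp, total_fp, total_fn = (sum(c) for c in zip(*(counts(l) for l in LABELS)))
micro_f1 = f1(total_tp, total_fp, total_fn)

# Exact match: a sample counts only if the whole label set is right.
subset_accuracy = sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)

# Hamming loss: fraction of individual label decisions that are wrong.
hamming_loss = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets)) / (
    len(true_sets) * len(LABELS)
)

print(per_label_f1)
print(macro_f1, micro_f1, subset_accuracy, hamming_loss)
```

On this toy data the per-class scores show that only class C is weak, which the overall numbers alone would hide. With unbalanced classes, macro averaging lets rare classes count as much as frequent ones, while micro averaging is dominated by the frequent ones, so reporting both alongside the per-class scores is a reasonable default. scikit-learn exposes the same metrics through `f1_score` (with `average="macro"` or `"micro"`), `accuracy_score` and `hamming_loss` on binary indicator matrices.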

And finally, relatively small differences in class sizes may not be a problem and can be ignored, though where to draw the line is more of a judgment call. What are the relative sizes of your classes?
