What can the relative performance of various algorithms on a problem tell me about the data and the problem?
Hi, I am currently applying various algorithms to a classification problem to assess which ones work better, so that I can then fine-tune the best performers from this first pass. I am a beginner, so I use Weka for now. I understand basic ML concepts but am not yet familiar with the details of the algorithms.
I observed that, on my problem, RBF networks performed vastly worse than IBk and the other k-based methods.
From what I read about RBF networks in the Weka documentation: "It implements a normalized Gaussian radial basis function network. It uses the k-means clustering algorithm to provide the basis functions and learns either a logistic regression (discrete class problems) or linear regression (numeric class problems) on top of that. Symmetric multivariate Gaussians are fit to the data from each cluster. If the class is nominal it uses the given number of clusters per class. It standardizes all numeric attributes to zero mean and unit variance."
So basically, it also uses k-means as a first step (to place the basis functions). Yet for some reason I get the worst results with it on my metric (ROC AUC), while the k-based methods are among the best. Can I deduce something important about my data from this fact, e.g. that it does not follow a Gaussian distribution, or that it is not well suited to logistic regression, or something else I cannot figure out?
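For context, here is roughly the comparison I am running, written out with Weka's Java API instead of the GUI. This is only a minimal sketch: the file name data.arff, the choice of k = 5 for IBk, and the assumption that the class is the last attribute are mine, and I use the default RBFNetwork options (note that RBFNetwork ships with older Weka versions; in 3.8 it is an optional package).

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.RBFNetwork;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareRbfVsKnn {
    public static void main(String[] args) throws Exception {
        // Load the dataset; assumes the class attribute is the last one.
        Instances data = DataSource.read("data.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // RBF network: k-means places the basis-function centres, then a
        // logistic regression is learned on top (all default options here).
        Classifier rbf = new RBFNetwork();

        // IBk: plain k-nearest neighbours with k = 5 (an arbitrary choice).
        Classifier knn = new IBk(5);

        for (Classifier cls : new Classifier[] { rbf, knn }) {
            Evaluation eval = new Evaluation(data);
            // 10-fold cross-validation with a fixed seed for repeatability.
            eval.crossValidateModel(cls, data, 10, new Random(1));
            System.out.printf("%s  AUC(class 0) = %.3f%n",
                    cls.getClass().getSimpleName(), eval.areaUnderROC(0));
        }
    }
}
```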
I also observed that random forests get results similar to the k-based methods, and that adding a filter to reduce dimensionality improved the random forests, with random projection working better than PCA.
Does this last point mean that there is a lot of randomness in my data, so that a random dimensionality reduction works better than a "principled" reduction like PCA? And what can I deduce from the fact that random forests perform on par with the k-based methods? (A sketch of this filtered setup follows below.)
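This is the kind of filtered setup I mean, again only a sketch: the target of 10 attributes for the random projection is an arbitrary value I picked, and PCA is left at its defaults.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;
import weka.filters.unsupervised.attribute.RandomProjection;

public class CompareProjections {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Random projection down to 10 attributes (arbitrary target size).
        RandomProjection rp = new RandomProjection();
        rp.setNumberOfAttributes(10);

        // PCA; by default keeps enough components to cover 95% of variance.
        PrincipalComponents pca = new PrincipalComponents();

        for (Filter filter : new Filter[] { rp, pca }) {
            // FilteredClassifier re-applies the filter inside each CV fold,
            // so the dimensionality reduction never sees the test data.
            FilteredClassifier fc = new FilteredClassifier();
            fc.setFilter(filter);
            fc.setClassifier(new RandomForest());

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(fc, data, 10, new Random(1));
            System.out.printf("%s  AUC(class 0) = %.3f%n",
                    filter.getClass().getSimpleName(), eval.areaUnderROC(0));
        }
    }
}
```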
I feel there is some meaning here, but I am not skilled enough to see what it is, and I would be very glad for any insights. Thanks in advance.
Topic weka random-forest dimensionality-reduction k-means machine-learning
Category Data Science