What is the most effective unsupervised ML algorithm to use when outliers are present in a data set?

Question

Ross leavitt

May 31, 2022, 23:07

I am analyzing a portfolio of about 225 stocks and have collected three attributes for each: "Price/Earnings ratio", "Return on Assets", and "Earnings per share growth". I would like to cluster the stocks into 3 or 4 groups based on these attributes. However, the data set contains substantial outliers, and I would like to keep them in rather than remove them. Which ML algorithm would be best suited for this? I have been told that k-means would not work well, since the outliers would skew the cluster centroids. Any and all thoughts welcome!

Topic unsupervised-learning outlier algorithms machine-learning

Category Data Science

bapowell · Accepted Answer · April 24, 2020, 15:16

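DBSCAN is a density-based clustering method designed for data that contain noise. The user sets a neighborhood radius and the minimum number of points needed to form a dense region, both of which can hopefully be informed by the problem. Points that do not fall in any sufficiently dense region are labeled as noise instead of being forced into a cluster. Note that DBSCAN determines the number of clusters itself, so you cannot ask it for exactly 3 or 4 groups.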
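To make the noise handling concrete, here is a minimal pure-Python sketch of the DBSCAN idea. The function names `dbscan` and `region_query`, the parameter values, and the sample rows are all made up for illustration, and the rows are assumed to be already standardized so that no single attribute dominates the distance.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Made-up, already-standardized (P/E, ROA, EPS growth) rows; not real stock data.
stocks = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (0.0, 0.1, 0.1), (0.1, 0.1, 0.0),
    (2.0, 2.0, 2.0), (2.1, 2.0, 2.1), (2.0, 2.1, 2.0), (2.1, 2.1, 2.1),
    (9.0, -7.0, 8.0),  # an extreme outlier
]
labels = dbscan(stocks, eps=0.5, min_samples=3)
print(labels)  # the outlier gets label -1 and does not distort either cluster
```

In practice you would probably not write this yourself: scikit-learn's `sklearn.cluster.DBSCAN` exposes the same two knobs as `eps` and `min_samples` and also marks noise points with the label -1.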

xChesster · Accepted Answer · March 24, 2020, 03:43

You could try a hierarchical clustering approach. For example, first find K clusters over all the data points. Then, within each of those K clusters, find a further set of sub-clusters from that cluster's own points to refine the grouping.
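The two-level idea can be sketched roughly as follows. This is only one possible reading of the approach: it assumes plain k-means at both levels, with a deterministic farthest-point initialization chosen to keep the example reproducible. The names `kmeans` and `two_level`, the values of `k_top` and `k_sub`, and the sample rows are hypothetical.

```python
from math import dist


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return labels


def two_level(points, k_top, k_sub):
    """Label each point (top_cluster, sub_cluster) by clustering twice."""
    top = kmeans(points, k_top)
    result = [None] * len(points)
    for c in range(k_top):
        idx = [i for i, lab in enumerate(top) if lab == c]
        if not idx:
            continue
        members = [points[i] for i in idx]
        sub = kmeans(members, min(k_sub, len(members)))
        for i, s in zip(idx, sub):
            result[i] = (c, s)
    return result


# Made-up, already-standardized (P/E, ROA, EPS growth) rows; not real stock data.
stocks = [
    (0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.0, 0.3, 0.0),
    (1.5, 1.5, 1.5), (1.7, 1.5, 1.5), (1.5, 1.8, 1.5),
    (6.0, 6.0, 6.0), (6.2, 6.0, 6.0), (6.0, 6.3, 6.0),
    (9.0, 9.0, 9.0),  # an outlier that the top level lumps in with the group above
]
for row, label in zip(stocks, two_level(stocks, k_top=2, k_sub=2)):
    print(row, label)
```

With this toy data the top level groups the outlier with its nearest neighbors, and the second pass then separates it into a sub-cluster of its own. Farthest-point seeding tends to pick outliers as seeds, which helps here but is a design choice of this sketch, not part of the suggestion above.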
