How to assign every data point to a cluster with HDBSCAN so that there is no noise?

Question

sogu

April 21, 2022 18:02

TASK

I am clustering products described by about 70 dimensions, e.g. price, rating (out of 5), and product tag (cleaning, toy, food, fruits).
I use HDBSCAN to do it.

GOAL

The goal is that when users come to our site, I can show them products similar to the one they are viewing.

QUESTION

How can I make every data point part of a cluster, so that no points are labeled as noise?

CODE

import hdbscan
import matplotlib.pyplot as plt
import seaborn as sns

clusterer = hdbscan.HDBSCAN(min_cluster_size=10,  # smallest collection of data points you consider a cluster
                            min_samples=1  # the LARGER this value, the more points are declared NOISE
                           ).fit(data)

# `projection` is assumed to be a 2-D embedding of `data` computed elsewhere (not shown here)
color_palette = sns.color_palette('Paired', 2000)
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)  # grey for noise points (label -1)
                  for x in clusterer.labels_]
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=20, linewidth=0, c=cluster_member_colors, alpha=0.25)


labels = clusterer.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)

Topic noise unsupervised-learning dbscan python clustering

Category Data Science

Multivac · Accepted Answer · April 3, 2021 15:13

DBSCAN will always mark some points as noise according to the epsilon and min_samples parameters, so there is no way to avoid that unless you have very compact, well-defined clusters, which seems unlikely. The short answer is to use another clustering algorithm, such as a Gaussian mixture, k-means, or Birch, all of which assign every point to a cluster.
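To illustrate the point about alternative algorithms, here is a minimal sketch using scikit-learn's KMeans and GaussianMixture. The synthetic array `X` and the choice of two clusters are assumptions made for the example, not values from the question. Both models return a non-negative label for every row, including the outliers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for the product features: two blobs plus a few outliers
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
    rng.uniform(-10, 15, size=(5, 2)),
])

# Both algorithms need the number of clusters up front (assumed to be 2 here)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# Neither algorithm has a noise label, so the smallest label is never -1
print(kmeans_labels.min(), gmm_labels.min())
```

The trade-off is that the number of clusters has to be chosen in advance, and genuine outliers are forced into whichever cluster is nearest.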

If your problem really requires DBSCAN, you can try applying a quantile transformation (with a uniform output distribution) prior to clustering.

Example:

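from hdbscan import HDBSCAN
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

# Fill missing values with 0, then map each feature to a uniform distribution
scaler = Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value=0)),
                   ("transformer", QuantileTransformer(output_distribution="uniform"))])

model = Pipeline([("scaler", scaler),
                  ("cluster", HDBSCAN(min_cluster_size=10, min_samples=1))]).fit(data)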

Dimitrios Panagopoulos · Accepted Answer · April 1, 2021 12:35

I don't think you can. One option is to run the clustering and then assign the noise points to the closest cluster: for every noise point, find the closest point that belongs to a cluster and give the noise point that cluster's label.
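The reassignment step described above can be sketched as follows. The helper name `assign_noise_to_nearest_cluster` and the toy arrays are hypothetical, and Euclidean distance via scikit-learn's NearestNeighbors is an assumption. With real data, `labels` would be `clusterer.labels_` and `X` the same (scaled) features that were passed to the clusterer.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def assign_noise_to_nearest_cluster(X, labels):
    """Return a copy of labels where each noise point (-1) takes the label
    of its nearest clustered point (Euclidean distance)."""
    X = np.asarray(X)
    labels = np.asarray(labels).copy()
    noise = labels == -1
    if not noise.any() or noise.all():
        return labels  # nothing to reassign, or no cluster to assign to
    # Index only the points that already belong to a cluster
    nn = NearestNeighbors(n_neighbors=1).fit(X[~noise])
    _, idx = nn.kneighbors(X[noise])
    labels[noise] = labels[~noise][idx[:, 0]]
    return labels


# Toy data standing in for the output of a clustering run
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.5, 0.2], [4.0, 4.5]])
labels = np.array([0, 0, 1, 1, -1, -1])
filled = assign_noise_to_nearest_cluster(X, labels)
print(filled)  # [0 0 1 1 0 1]
```

Every point ends up with a cluster label, but points far from all clusters are still attached to one, so the recommendations for such products may be weak.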
