How can I make HDBSCAN assign every data point to a cluster, so there is no noise?

TASK

  • I am clustering products described by about 70 dimensions, e.g. price, rating (out of 5), product tag (cleaning, toy, food, fruits)
  • I use HDBSCAN to do it

GOAL

  • The goal is that when users visit our site, I can show them products similar to the one they are viewing.

QUESTION

  • How can I get every data point to be part of a group, so that nothing is labeled as noise?

CODE

import hdbscan
import matplotlib.pyplot as plt
import seaborn as sns

# `data` is the product feature matrix; `projection` is assumed to be a 2-D
# embedding of `data` (one row per product) computed beforehand.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10,  # smallest collection of data points you consider a cluster
                            min_samples=1  # the LARGER this value, the more points are declared NOISE
                           ).fit(data)

color_palette = sns.color_palette('Paired', 2000)
# Clustered points (label >= 0) get a palette color; noise (label -1) is gray.
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
# Desaturate each color by the point's cluster membership strength.
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=20, linewidth=0, c=cluster_member_colors, alpha=0.25)


labels = clusterer.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)

Topic noise unsupervised-learning dbscan python clustering

Category Data Science


DBSCAN will always mark some points as noise according to the epsilon and min_samples parameters, so there is no way to avoid it unless you have very compact, well-defined clusters, which seems unlikely. The short answer is to use another clustering algorithm, such as Gaussian mixture, k-means, or Birch.
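As a rough illustration of that suggestion, here is a minimal sketch using scikit-learn's KMeans on made-up data. The toy array and the choice of n_clusters=2 are assumptions for the example only; with real product data you would have to pick the number of clusters yourself.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the real product feature matrix (rows = products).
rng = np.random.default_rng(0)
toy_data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 3)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 3)),
])

# n_clusters is an assumption; k-means needs it chosen up front.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_data)
labels = kmeans.labels_

# Every point gets a label in {0, ..., n_clusters - 1}; k-means has no -1 noise label.
print(np.unique(labels))
```

The trade-off is that outliers are forced into whichever cluster center is nearest, so some clusters may contain products that are not really similar.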

If your problem really requires DBSCAN, you can try applying a (uniform) quantile transformation before clustering.

Example:

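from hdbscan import HDBSCAN
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer
scaler = Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value=0)), ("transformer", QuantileTransformer(output_distribution="uniform"))])

model = Pipeline([("scaler", scaler), ("cluster", HDBSCAN(min_cluster_size=10, min_samples=1))]).fit(data)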

I don't think you can. Instead, you could run the clustering and then assign the noise points to the closest cluster: for every noise point, find the closest point that belongs to a cluster and give the noise point that cluster's label.
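The idea above could be sketched roughly as follows, using only NumPy. The function name assign_noise_to_nearest_cluster and the toy arrays are made up for illustration, and Euclidean distance is an assumption; in practice you would pass your own feature matrix and clusterer.labels_.

```python
import numpy as np


def assign_noise_to_nearest_cluster(data, labels):
    """Return a copy of `labels` where each noise point (-1) takes the label
    of its nearest clustered point (Euclidean distance)."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels).copy()

    noise = labels == -1
    if not noise.any() or noise.all():
        # Nothing to reassign, or no clustered point to borrow a label from.
        return labels

    clustered_points = data[~noise]
    clustered_labels = labels[~noise]

    for i in np.flatnonzero(noise):
        # Distance from this noise point to every clustered point.
        distances = np.linalg.norm(clustered_points - data[i], axis=1)
        labels[i] = clustered_labels[np.argmin(distances)]
    return labels


# Toy data standing in for the real feature matrix and clusterer.labels_.
toy_data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0],
                     [5.1, 5.0], [0.3, 0.2], [4.0, 4.5]])
toy_labels = np.array([0, 0, 1, 1, -1, -1])

full_labels = assign_noise_to_nearest_cluster(toy_data, toy_labels)
print(full_labels)
```

The loop computes distances one noise point at a time, which keeps memory use small. For a large catalogue, a nearest-neighbor index would probably be faster, and if you applied a transformation before clustering, the distances should be computed on the transformed features.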
