Need an explanation of the for loop in the DBSCAN algorithm demo

In the following DBSCAN code, as a beginner I need an explanation of what happens to the data in the bottom for loop, and why. # Generate sample data import numpy as np from sklearn.cluster import DBSCAN from sklearn import metrics from sklearn.datasets import make_blobs from sklearn.preprocessing import StandardScaler centers = [[1, 1], [-1, -1], [1, -1]] X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0) X = StandardScaler().fit_transform(X) # Compute DBSCAN db = DBSCAN(eps=0.3, min_samples=10).fit(X) core_samples_mask = np.zeros_like(db.labels_, dtype=bool) …
Category: Data Science
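
A minimal sketch of what a per-label loop of this kind usually does, assuming the truncated loop follows the common pattern of visiting each cluster label and splitting that cluster's points into core and non-core samples (typically so they can be drawn differently). The labels and core indices below are made up for illustration, and plain lists stand in for the NumPy arrays so the sketch runs without any libraries.

```python
# Hypothetical DBSCAN output for 8 points; -1 marks noise.
labels = [0, 0, 1, -1, 1, 0, 1, -1]
core_sample_indices = [0, 1, 4, 6]  # assumed indices of the core points

# Boolean mask: True where the point is a core sample.
core_samples_mask = [False] * len(labels)
for i in core_sample_indices:
    core_samples_mask[i] = True

groups = {}
for k in sorted(set(labels)):  # one pass per cluster label, including noise
    # True for the points that belong to label k.
    class_member_mask = [lab == k for lab in labels]
    pairs = list(zip(class_member_mask, core_samples_mask))
    # Members of k that are core points.
    core = [i for i, (member, is_core) in enumerate(pairs) if member and is_core]
    # Members of k that are not core points (border points, or noise for k == -1).
    non_core = [i for i, (member, is_core) in enumerate(pairs) if member and not is_core]
    groups[k] = {"core": core, "non_core": non_core}

print(groups)
```

In a plotting loop, the two index lists for each label would typically be drawn with different marker sizes, with noise given its own colour.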

DBSCAN with ring-like data

I've been using the Python DBSCAN implementation from sklearn.cluster. The problem is that I'm working with 360° lidar data, which means that my data has a ring-like structure. To illustrate my problem, take a look at this picture. The colours of the points are the groups assigned by DBSCAN (please ignore the crosses, they don't have anything to do with the task). In the picture I have circled two groups which should be considered the same group, as …
Category: Data Science
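
A small sketch of one way to handle wraparound, assuming (the excerpt does not confirm this) that the two circled groups are split at the 0°/360° seam of the scan angle. The helper names are hypothetical. Either the angular distance is made circular, or each angle is mapped onto the unit circle so that ordinary Euclidean distance treats 359° and 1° as close.

```python
import math

def angular_distance_deg(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def to_unit_circle(angle_deg):
    """Map an angle to (cos, sin) so that the seam at 0/360 disappears."""
    r = math.radians(angle_deg)
    return (math.cos(r), math.sin(r))

# 359 and 1 degrees are 358 apart numerically but only 2 apart on the ring.
print(angular_distance_deg(359.0, 1.0))
print(math.dist(to_unit_circle(359.0), to_unit_circle(1.0)))
```

A circular distance like this could be supplied to a clustering routine as a custom or precomputed metric, while the unit-circle mapping works with the default Euclidean metric.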

Clustering Tweet Data using DBSCAN Algorithm

I am clustering tweets using the DBSCAN algorithm. I apply all the preprocessing steps and convert the sentences to vector format before applying the algorithm. However, it always puts a lot of tweets into the '0' class. The following is the plot showing eps against the number of clusters. The following are the parameters that I pass: dbscan = DBSCAN(eps=0.15, min_samples=2, metric='cosine').fit(x) The following are the resulting clusters (label: count): -1: 1221, 0: 1349, 1: 2, 2: 2, 3: 4 ... …
Category: Data Science
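
To make the parameters above concrete, here is a minimal sketch of cosine distance (1 minus cosine similarity) and the neighbour test it implies for eps=0.15. The vectors are toy values, not tweet embeddings, and the helper names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def is_neighbor(u, v, eps=0.15):
    """Two points are eps-neighbours when their cosine distance is at most eps."""
    return cosine_distance(u, v) <= eps

print(cosine_distance([3, 4], [4, 3]))  # similar direction, small distance
print(is_neighbor([3, 4], [4, 3]))
print(is_neighbor([1, 0], [0, 1]))      # orthogonal vectors, distance 1
```

With min_samples=2, any point that has at least one other point within eps becomes a core point, so chains of pairwise-close points can link into one large cluster.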

Is my data good for (DBSCAN) clustering?

I have a dataset consisting of 50k elements with 40 features each. I want to try to cluster the data as it is, without any dimensionality reduction. The main algorithm I am considering is DBSCAN, since it is more versatile and I can accept some points ending up as noise. However, how can I judge whether the clustering is "significant", since I can't plot the clusters against the data? Trying to select the parameters for the …
Category: Data Science

Shall I use ordinal encoding or One-Hot-Encoding when using DBSCAN for content clustering on websites?

I want to cluster the preparation steps on cooking recipe websites into one cluster so that I can distinguish them from the rest of the website. To achieve this, I extracted the DOM path (e.g. body->div->div->table->tr ....) for each text node of the website and applied one-hot encoding before running the DBSCAN clustering algorithm. My hope was that DBSCAN would group not only 100% identical DOM paths into one common cluster, because sometimes one preparation step is e.g. in …
Category: Data Science

How to assign every data point to some group with HDBSCAN, so that there is no noise?

TASK: I am clustering products with about 70 dimensions, e.g. price, rating (5/5), product tag (cleaning, toy, food, fruits). I use HDBSCAN to do it. GOAL: When users come to our site, I can show them products similar to what they are viewing. QUESTION: How do I make every data point part of a group, so that there is no noise? CODE: clusterer = hdbscan.HDBSCAN(min_cluster_size=10,#smallest collection of data points you consider a cluster min_samples=1 …
Category: Data Science

DBSCAN on textual and numerical columns

I have a dataset with two columns, title and price (for example, rows sentence1, 12 and sentence2, 13). I have used doc2vec to convert the sentences into vectors of size 100, as below: LabeledSentence1 = gensim.models.doc2vec.TaggedDocument all_content = [] j=0 for title in query_result['title_clean'].values: all_content.append(LabeledSentence1(title,[j])) j+=1 print("Number of texts processed: ", j) cores = multiprocessing.cpu_count() d2v_model = Doc2Vec(dm=1, vector_size=100, negative=5, hs=0, min_count=2, sample = 0, workers=cores, alpha=0.025, min_alpha=0.001) d2v_model.build_vocab([x for x in tqdm(all_content)]) all_content = utils.shuffle(all_content) d2v_model.train(all_content,total_examples=len(all_content), epochs=30) So d2v_model.docvecs.doctag_syn0 returns me vectors of …
Category: Data Science

The actual results and results from pickle files are not matching in pandas for DBSCAN clustering

I've built a DBSCAN clustering model. The output result and the result after using the pickle files do not match. Based on the HD and MC columns, I am clustering the WT column. data = HD,MC Target = WT Below, for the 1st record the cluster is 0, but after running it from the 'pkl' file, the predicted result is [-1]. Dataframe:
HD   MC     WT   Cluster
200  Other  4.5  0
150  Pep    5.6  0
100  Pla    35   -1
50   Same   15   0
…
Category: Data Science

Is it safe to use labels created from unsupervised model to train a supervised model using the same data?

I have a dataset in which I have to detect anomalies. I take a subset of the data (let's call it subset A) and apply the DBSCAN algorithm to detect anomalies in set A. Once the anomalies are detected, I use the DBSCAN labels to create a label variable (anomaly: 1, non-anomaly: 0) in dataset A. Then I train a supervised algorithm on dataset A to predict the anomalies, using the label as the dependent/target variable, and finally use the trained supervised model to …
Category: Data Science

Clustering events in a sequence

I have a sequence of recurring events that I would like to group together to represent different operational activities of the underlying process. These events may or may not occur in a particular order. Consequently, I would like to investigate whether any relationship exists between the events. Are there any better methods than hierarchical clustering? I might want to build a model that can determine the operational activity based on the events it recognized as belonging to …
Category: Data Science

DBSCAN getting one huge cluster with noisy points

I'm currently trying to cluster customer service email answers (NLP). When I use DBSCAN with TF-IDF embeddings + Annoy indexes, I get good clusters. But when I use DBSCAN with FastText embeddings + Annoy indexes, I get good clusters except for the cluster with label zero (0), which seems to include lots of noisy points (that should be labeled -1 instead of 0). Does anyone have an idea of what could cause this? I'm using eps=0.5 in both cases.
Category: Data Science

How to use a cosine distance matrix for clustering algorithms like mean-shift, DBSCAN, and OPTICS?

I am trying to compare different clustering algorithms on my text data. I first calculated the tf-idf matrix and used it to build the cosine distance matrix (from cosine similarity). Then I used this distance matrix for K-means and hierarchical clustering (Ward and dendrogram). I want to use the distance matrix for mean-shift, DBSCAN, and OPTICS. Below is the part of the code showing the distance matrix. from sklearn.feature_extraction.text import TfidfVectorizer #define vectorizer parameters tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2, stop_words='english', use_idf=True, tokenizer=tokenize_and_stem, …
Category: Data Science

What's an appropriate clustering quality estimate / metric for precomputed distance in HDBSCAN?

HDBSCAN supports estimation of clusters from precomputed distances. However, the Python implementation of HDBSCAN (scikit-contrib) doesn't create minimum spanning trees in the absence of raw data when precomputed distance matrices are provided as inputs. Therefore, it doesn't compute the relative_validity score or DBCV score to facilitate hyperparameter tuning in such instances. I am trying to use a Euclidean projection (square-root transform) of the Gower dissimilarity composite (without Podani's option) as a precomputed metric in HDBSCAN. Since distance-based scores like Silhouette are …
Category: Data Science

How to automatically cluster a set of parallel curves?

I have an ensemble of datasets, each one containing one or more parallel curves in a 2-dimensional domain. Each curve is formed by individual 2-dimensional points. What I want: I am trying to automatically cluster the curves to extract information such as which points belong to each curve and how many curves there are in the given dataset. The only clustering algorithm that comes to my mind for this task is DBSCAN, which needs two parameters that must be …
Category: Data Science

When to use standardization, normalization, or both?

I have a dataset with variables on different scales, as shown in the figure below. I need to group individuals together and I'm testing algorithms like K-means and DBSCAN. In all tests I'm extracting the first two principal components with PCA. When I don't apply any transformation before PCA (neither standardization nor normalization), almost all individuals are in a single cluster. The same happens when I apply one or the other transformation (standardization OR normalization). I only get meaningful results if I …
Category: Data Science
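
For reference, a minimal sketch of the two transformations being compared, assuming "standardization" means z-scoring (zero mean, unit variance) and "normalization" means min-max scaling to [0, 1]; the excerpt does not define either term. The function names are hypothetical and the input is a toy column.

```python
import statistics

def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_normalize(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

column = [10.0, 20.0, 30.0]
print(standardize(column))
print(min_max_normalize(column))
```

Both are per-feature linear rescalings; they differ in what they fix (mean and variance versus the range), which changes how much each feature contributes to distances and to PCA.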

Is there any clustering algorithm to find longest continuous subsequences?

I have data which contains the access durations of some items. Example: t0~t5 are the access time slots; 1 means the item was accessed in that slot, 0 means it wasn't.
ID,t0,t1,t2,t3,t4
0,0,0,1,1,1
1,0,1,1,1,1
2,0,1,1,0,0
3,1,1,0,0,1
4,1,1,0,0,1
In the above example, the group ID=0,1 is what I want. ID=3,4 are not, because although the distance between them is short, their accesses are not continuous. I tried KMeans and DBSCAN; both cluster ID=3,4 as one group, which makes sense. But it doesn't do what …
Category: Data Science
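
A small sketch of one possible feature for this, assuming "continuous" means the longest unbroken run of 1s in a row; that reading is an assumption, and the helper name is hypothetical. The rows are the ones from the example above.

```python
def longest_run_of_ones(bits):
    """Length of the longest consecutive stretch of 1s in a 0/1 sequence."""
    best = current = 0
    for b in bits:
        current = current + 1 if b == 1 else 0  # extend the run or reset it
        best = max(best, current)
    return best

# ID -> access flags for t0..t4, copied from the example.
rows = {
    0: [0, 0, 1, 1, 1],
    1: [0, 1, 1, 1, 1],
    2: [0, 1, 1, 0, 0],
    3: [1, 1, 0, 0, 1],
    4: [1, 1, 0, 0, 1],
}
runs = {item_id: longest_run_of_ones(bits) for item_id, bits in rows.items()}
print(runs)
```

Under this reading, IDs 0 and 1 stand out with runs of 3 and 4, while IDs 2, 3 and 4 all have a longest run of 2, so a run-length feature separates the rows in a way that plain distance between the 0/1 vectors does not.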

Types of artificial anomalies

I am working on some algorithms for anomaly detection. The dataset is clean of anomalies, so I want to add some artificial ones. I have already added some: I take the maximum value of the dataset and add 20-25%, meaning the added anomalies are 20 to 25% larger than the maximum value. Are there any other types of anomalies that would be good to have in a dataset for an anomaly detection algorithm? My dataset contains integers and floats.
Category: Data Science
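
The injection scheme described above can be sketched as follows, assuming a single numeric column whose maximum is positive. The function name, the seed, and the toy data are hypothetical.

```python
import random

def inject_high_anomalies(values, count, low=1.20, high=1.25, seed=0):
    """Return `count` artificial values 20-25% above the column maximum.

    Assumes max(values) is positive; for a negative maximum, scaling by a
    factor above 1 would push the value further down instead of up.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    peak = max(values)
    return [peak * rng.uniform(low, high) for _ in range(count)]

data = [3, 7.5, 10, 4]
anomalies = inject_high_anomalies(data, count=5)
print(anomalies)
```

Every generated value lies between 12.0 and 12.5 here, since the maximum of the toy data is 10.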

How to calculate diameter of clusters for DBSCAN?

I've created several clusters for my task. Now I'd like to know the distance between the two farthest points in each cluster. # Generate sample data X = np.loadtxt('C:/1.csv', delimiter=',') X = StandardScaler().fit_transform(X) # ############################################################################# # Compute DBSCAN db = DBSCAN(eps=0.8, min_samples=20).fit(X) core_samples_mask = np.zeros_like(db.labels_, dtype=bool) core_samples_mask[db.core_sample_indices_] = True labels = db.labels_ # Number of clusters in labels, ignoring noise if present. n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0) n_noise_ = list(labels).count(-1) print('Estimated number of clusters: %d' …
Category: Data Science
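
A minimal sketch of the quantity being asked about, taking the diameter of a cluster to be the largest pairwise Euclidean distance among its members. The function name and the toy points are hypothetical, and noise points (label -1) are skipped.

```python
import math
from itertools import combinations

def cluster_diameters(points, labels):
    """Map each cluster label to the largest distance between two of its points."""
    clusters = {}
    for point, label in zip(points, labels):
        if label == -1:  # DBSCAN marks noise with -1; it is not a cluster
            continue
        clusters.setdefault(label, []).append(point)
    return {
        label: max(
            (math.dist(a, b) for a, b in combinations(members, 2)),
            default=0.0,  # a single-point cluster has diameter 0
        )
        for label, members in clusters.items()
    }

points = [(0, 0), (3, 4), (1, 1), (10, 10), (10, 12), (50, 50)]
labels = [0, 0, 0, 1, 1, -1]
print(cluster_diameters(points, labels))
```

This checks every pair, so the cost grows quadratically with cluster size; for large clusters, restricting the search to the convex hull vertices is a common shortcut.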

Distance between any two points after DBSCAN

DBSCAN is a clustering model that is also robust at detecting outliers. A parameter $\epsilon$, i.e. the radius, is an input of the algorithm; a point is said to be an outlier if its circle of radius $\epsilon$ contains no point other than the center itself. I have detected the outliers for a dataset, but then I observed that all pairwise distances are less than $\epsilon$. I'm confused now: is my understanding of DBSCAN wrong, or should there be some …
Category: Data Science
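
A minimal sketch of the usual DBSCAN point categories, which may help with the situation described: a point is a core point when its $\epsilon$-neighbourhood (counting the point itself, the convention scikit-learn uses) holds at least min_samples points, a border point when it is not core but lies within $\epsilon$ of a core point, and noise otherwise. The function name and the toy points are hypothetical.

```python
import math

def classify_points(points, eps, min_samples):
    """Label each point as 'core', 'border' or 'noise' under DBSCAN's rules."""
    n = len(points)
    # Neighbourhoods include the point itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds

close_points = [(0, 0), (0.1, 0), (0, 0.1)]  # all pairwise distances < eps
print(classify_points(close_points, eps=0.5, min_samples=5))
print(classify_points(close_points, eps=0.5, min_samples=3))
```

With three mutually close points and min_samples=5, no neighbourhood is large enough, so every point is noise even though all pairwise distances are below $\epsilon$; lowering min_samples to 3 makes all of them core points.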

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.