I am trying to plot a dendrogram to cluster data but this error is stopping me. My data is here. I first chose columns to work with:

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

df_euro = pd.read_csv('https://assets.datacamp.com/production/repositories/655/datasets/2a1f3ab7bcc76eef1b8e1eb29afbd54c4ebf86f2/eurovision-2016.csv')
samples = df_euro.iloc[:, 2:7].values[:42]
country_names = df_euro.iloc[:, 1].values[:42]

# Calculate the linkage: mergings
mergings = linkage(samples, method='complete')

# Plot the dendrogram, labeling the leaves with the country names
# (note: the original code passed an undefined variable `y` here)
dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=6)
plt.show()
```

But I'm getting this error which I can't understand. I googled it and …
I am analyzing a portfolio of about 225 stocks and have data for each of them on "Price/Earnings ratio", "Return on Assets", and "Earnings per share growth". I would like to cluster these stocks into 3 or 4 groups based on their attributes. However, there are substantial outliers in the data set; instead of removing them altogether, I would like to keep them in. What ML algorithm would be best suited for this? I have been told …
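Since the outliers should stay in, one hedged option is to make the preprocessing robust rather than the clusterer itself: scale with median/IQR (`RobustScaler`) so the extreme values don't dominate the distances, then run ordinary k-means. A minimal sketch on invented stand-in data (the real P/E, ROA, and EPS-growth values would replace `X`):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for the real features: P/E, ROA, EPS growth for 225 stocks
X = rng.normal(size=(225, 3))
X[:5] *= 50  # a few extreme outliers, kept in on purpose

# RobustScaler centers on the median and scales by the IQR,
# so the outliers do not dominate the feature scales
X_scaled = RobustScaler().fit_transform(X)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(labels))  # cluster sizes
```

A density-based method such as DBSCAN is another option, since it leaves outliers unassigned instead of forcing them into a group.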
I want to detect anomalies in a bank data set using an unsupervised learning method. However, in this data set, all columns except time and amount are categorical, and about half of them have more than 90 percent missing values. The goal is to detect anomalies in this data set through unsupervised learning. I'm currently using an autoencoder to approach it, but I wondered if this would work. Also, because the purpose is to detect whether data is abnormal when data comes …
I am currently looking into how to cluster data with hierarchical dependencies. An example of a problem that I want to cluster: we would like to cluster cities to identify similar characteristics with respect to inhabitants. As input data, I have some characteristics such as the age, weight, height and sex of the inhabitants. Each city will therefore be modeled by a vector, e.g. x_1 = number of people aged 20 years old, x_2 = number of people …
I have a set of curves in 2D space, each expressed as a set of (sampled) data points. Each set has more or less the same number of items; eventually I guess I'll use binning to make sure the number of points is the same (say 50) if that can help. I would like to cluster the curves into N groups, and computing N should be part of the solution. Possible translations on the first dimension are irrelevant. I have …
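A minimal sketch of one possible approach, built only from the assumptions stated in the question: resample each curve to 50 points, subtract the mean of the first coordinate so translations there drop out, then cluster hierarchically and pick N by silhouette score. The example curves below are invented:

```python
import numpy as np
from scipy.interpolate import interp1d
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def to_feature(curve, n=50):
    """Resample a (m, 2) curve to n points and drop the x-translation."""
    curve = np.asarray(curve, dtype=float)
    t = np.linspace(0, 1, len(curve))
    tt = np.linspace(0, 1, n)
    x = interp1d(t, curve[:, 0])(tt)
    y = interp1d(t, curve[:, 1])(tt)
    x -= x.mean()  # translation on the first dimension is irrelevant
    return np.concatenate([x, y])

# Invented curves: a sine and a line, each duplicated with an x-shift
curves = []
for shift in (0.0, 5.0):
    s = np.linspace(0, 2 * np.pi, 60)
    curves.append(np.column_stack([s + shift, np.sin(s)]))
    curves.append(np.column_stack([s + shift, 0.3 * s]))

F = np.array([to_feature(c) for c in curves])

# Choose N by silhouette over a small candidate range
best = max(range(2, len(F)), key=lambda k: silhouette_score(
    F, AgglomerativeClustering(n_clusters=k).fit_predict(F)))
labels = AgglomerativeClustering(n_clusters=best).fit_predict(F)
print(best, labels)
```

Here the two shifted sines end up together regardless of their x-offset, which is exactly the invariance asked for.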
Is there any way to fix the K-Means cluster labels? I am working with 4 clusters, and whenever I run the Python program from the beginning, the cluster labels change. Is it possible to fix the cluster labels? I have tried playing with the random_state parameter, but it does not seem to work.
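Setting `random_state` only makes a single run reproducible; the label ids themselves remain arbitrary. One common workaround is to relabel the clusters by a deterministic property of the centroids, for example sorted by the first coordinate, so the same physical cluster always gets the same id. A sketch on made-up, well-separated data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng()  # deliberately unseeded: labels may vary per run
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 2)) for c in (0, 3, 6, 9)])

km = KMeans(n_clusters=4, n_init=10).fit(X)

# Relabel so that label 0 is the cluster whose centroid has the smallest
# first coordinate; the ordering is now independent of the run
order = np.argsort(km.cluster_centers_[:, 0])
remap = np.empty(4, dtype=int)
remap[order] = np.arange(4)
stable_labels = remap[km.labels_]
```

Any deterministic sort key works (centroid norm, cluster size, etc.) as long as it distinguishes the clusters.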
I'm trying to implement the unsupervised k-means algorithm for sentiment analysis of the IMDB movie dataset created by Stanford. The steps that I followed are:

1) Load the comments.
2) Apply tokenization and stemming, then use the tf-idf algorithm to create a tf-idf matrix.
3) Use the k-means algorithm to divide the data into 2 clusters.

My problem is how to validate the clusters; I have labeled test data. I want to check if all the negative examples go in one cluster …
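With labeled test data available, one way to validate the steps above is to map each cluster id to the majority true label among its members and compute accuracy, or use a permutation-invariant score such as the adjusted Rand index, which ignores which cluster got which id. A toy sketch with invented arrays standing in for the real outputs:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical outputs: cluster ids from k-means and the known test labels
clusters = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_true   = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # 1 = positive, 0 = negative

# Cluster ids are arbitrary, so map each cluster to its majority true label
mapping = {c: np.bincount(y_true[clusters == c]).argmax()
           for c in np.unique(clusters)}
y_pred = np.array([mapping[c] for c in clusters])

accuracy = (y_pred == y_true).mean()
ari = adjusted_rand_score(y_true, clusters)  # label-permutation invariant
print(accuracy, ari)
```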
I am wondering whether I can perform any kind of cross-validation or GridSearchCV for unsupervised learning. The thing is that I have the ground-truth labels (but since it is unsupervised, I just drop them for training and then reuse them for measuring accuracy, AUC, AUCPR, and F1-score over the test set). Is there any way to do this?
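One hedged pattern for this: run an ordinary K-fold loop, select hyperparameters with a purely unsupervised criterion (e.g. silhouette on the held-out fold), and bring the ground-truth labels back only for the final evaluation. A sketch using the iris data as a stand-in for the real dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)  # stand-in; y is ignored during training

results = {}
for k in (2, 3, 4):
    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                     random_state=0).split(X):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[train_idx])
        test_labels = km.predict(X[test_idx])
        # Model selection uses an unsupervised criterion only
        scores.append(silhouette_score(X[test_idx], test_labels))
    results[k] = np.mean(scores)

best_k = max(results, key=results.get)
# Ground-truth labels enter only at final evaluation time
final = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
print(best_k, adjusted_rand_score(y, final))
```

`GridSearchCV` itself can also be used with a custom scorer, but the manual loop makes the train/evaluate separation explicit.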
I have been struggling with this problem for a while now and I finally decided to post a question here to get some help. The problem I'm trying to solve is about predictive maintenance. Specifically, a system produces 2 kinds of maintenance messages when it runs: a basic-msg and a fatal-msg. A basic message indicates that there is a problem with the system that needs to be checked (it's not serious); a fatal-msg, on the other hand, signals that the …
I have a dataset which has demographic data available for a list of new customers. The data doesn't include the customers' transaction data. I want to identify the top 100 potential customers among these customers. I'm aware that we can make use of clustering to segment these customers. However, I have two more variables in my data, Rank and Value. What approach should be taken when the rank and value of customers are given? How do we interpret the clusters …
I am working with the LDA (Latent Dirichlet Allocation) model from sklearn and I have a question about reusing the model I have. After training my model with data, how do I use it to make a prediction on new data? Basically, the goal is to read the content of an email.

```python
countVectorizer = CountVectorizer(stop_words=stop_words)
termFrequency = countVectorizer.fit_transform(corpus)
featureNames = countVectorizer.get_feature_names()

model = LatentDirichletAllocation(n_components=3)
model.fit(termFrequency)

joblib.dump(model, 'lda.pkl')
# lda_from_joblib = joblib.load('lda.pkl')
```

I save my model using joblib. Now I want …
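One detail worth noting: to score a new email later, the fitted `CountVectorizer` must be saved alongside the LDA model, because its vocabulary maps new text into the same feature space, and the new text must go through `transform()` (never `fit()`). A self-contained sketch with an invented corpus:

```python
import os
import tempfile
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical tiny corpus standing in for the real emails
corpus = ["free money offer now", "meeting agenda project notes",
          "win a free prize today", "project deadline meeting tomorrow"]

vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(corpus)
model = LatentDirichletAllocation(n_components=3, random_state=0).fit(tf)

# Persist BOTH the model and the fitted vectorizer
outdir = tempfile.mkdtemp()
joblib.dump(model, os.path.join(outdir, 'lda.pkl'))
joblib.dump(vectorizer, os.path.join(outdir, 'vectorizer.pkl'))

# Later: load them back and score a new email with transform(), not fit()
model = joblib.load(os.path.join(outdir, 'lda.pkl'))
vectorizer = joblib.load(os.path.join(outdir, 'vectorizer.pkl'))
topic_dist = model.transform(vectorizer.transform(["new project meeting"]))
print(topic_dist)  # one row of topic probabilities summing to 1
```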
I'm developing an anomaly detection program in Python. The main idea is to create a new LSTM model every day, train it on the previous 7 days, and predict the next day. Then, using thresholds, find anomalies day by day. I've already implemented that, and these thresholds are working well:

upper threshold = trimmed_mean + (K * interquartile_range)
lower threshold = trimmed_mean - (K * interquartile_range)

where trimmed_mean and interquartile_range are calculated on the prediction error (real curve …
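The two thresholds above can be computed directly with `scipy.stats.trim_mean` and percentiles. A sketch on synthetic prediction errors (the 10% trim proportion and K = 1.5 are assumptions, not values from the question):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
errors = rng.normal(0, 1, 1000)  # stand-in for LSTM prediction errors
errors[::100] += 15              # inject a few anomalous spikes

K = 1.5
trimmed_mean = stats.trim_mean(errors, proportiontocut=0.1)
q1, q3 = np.percentile(errors, [25, 75])
iqr = q3 - q1

# The thresholds from the question
upper = trimmed_mean + K * iqr
lower = trimmed_mean - K * iqr
anomalies = np.flatnonzero((errors > upper) | (errors < lower))
print(len(anomalies))
```

Both statistics are resistant to the spikes themselves, which is what makes this thresholding usable on error series that contain the very anomalies being hunted.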
There is manufacturing data with 10 process variables. It is tabular data, and it is not labeled as normal or defective. Do you have a paper that uses only unlabeled data to predict defects or to find the variables that affect them? I thought about using an outlier detection algorithm (Isolation Forest, autoencoder) to predict defects, but I can't find a way because I don't know the exact defect rate. I can't think of a way to verify it, so I'd …
I am dealing with time series data with 200K+ records (every minute for 6 months) from a gas turbine, and I am trying to detect the fault early (0 or 1 = fault). The issues with the data are:

1. The fault occurred only 5 times (observed via sudden shutdowns), which makes the data hugely imbalanced.
2. (Unsupervised) There is no binary output. I used 2 of the variables as my output and used them for binary clustering (k-means), but the result is not very good, as there are false …
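Given only 5 known faults in 200K+ records, one unsupervised alternative to binary k-means is an anomaly detector such as Isolation Forest, with `contamination` set near the (rare) fault rate so the decision threshold matches the imbalance. A sketch on invented sensor-like data; the shapes and values are made up:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Stand-in for turbine sensor readings: mostly normal operation...
X_normal = rng.normal(0, 1, size=(2000, 4))
# ...plus 5 fault windows that look very different
X_fault = rng.normal(6, 1, size=(5, 4))
X = np.vstack([X_normal, X_fault])

# contamination is set slightly above the expected fault rate (assumption)
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)  # -1 = anomaly, 1 = normal
fault_flags = np.flatnonzero(pred == -1)
print(len(fault_flags))
```

This sidesteps the imbalance entirely: the model learns what "normal" looks like and flags departures, instead of trying to carve the data into two balanced clusters.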
I have a univariate time series (one value per time sample; sampling time: 66.66 microseconds, 151 samples per sampling time) coming from a Scala customer. This time series contains time frames, each of which is 8K (frequencies) * 151 (time samples) in 0.5 sec (overall 1.2288 million samples per half second). I need to find anomalies based on the different rows (frequencies) and report which rows (frequencies) are anomalous, using an unsupervised learning method. Do you have an …
How can I perform conceptual clustering in sklearn? My use case is that I have English Wikipedia articles that I'm doing unsupervised learning on (tf-idf -> truncated SVD -> l2 normalization), and I'd like to create a hierarchy for them such that the nodes at the top are the most general articles (e.g. Programming Languages -> Functional Languages -> Haskell). I tried using hierarchy.linkage, but it seems that the algorithm uses O(n^2) space, and I ran out of memory. I …
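If scipy's `linkage` runs out of memory, sklearn's `Birch` may be worth trying: it builds a CF-tree incrementally (and supports `partial_fit` for chunked data), so it never materializes the n² pairwise-distance matrix. A sketch on random stand-in vectors in place of the real l2-normalized SVD features:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Stand-in for the l2-normalized SVD vectors of the articles
X = rng.normal(size=(2000, 50))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Birch maintains a compact CF-tree summary of the data, then runs a
# global clustering step on the tree's subcluster centroids only
birch = Birch(threshold=0.5, n_clusters=20).fit(X)
labels = birch.labels_
print(len(np.unique(labels)))
```

The resulting CF-tree is itself a hierarchy over the subclusters, though it groups by vector similarity rather than by concept generality, so it is an approximation of conceptual clustering, not a faithful implementation of it.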
I'm a beginner and I have a question. Can clustering results based on probability be used for supervised learning? I have manufacturing data with 80,000 rows. It is not labeled, but there is information that the defect rate is 7.2%. Can the result of clustering, with hyperparameters tuned based on the defect rate, be applied to supervised learning? Is there a paper like this? Is this method a big problem from a data perspective? When using this method, what is the verification …
K-means clustering tries to minimize the within-cluster scatter and maximize the distances between clusters, and it does so on all attributes. I am learning about this method on several datasets. To illustrate: in one of the datasets, countries are compared based on attributes related to their Human Development Index. However, some of the attributes are completely unrelated to this dimension, for example the total population of the countries. How should I deal with these attributes? As mentioned before, k-means tries to minimize the scatter …
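Two common remedies are standardizing the features so scale alone cannot dominate the distances, and then down-weighting (or zeroing out) attributes judged irrelevant to the dimension of interest. A sketch with an invented population column whose huge scale would otherwise swamp the HDI-related features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 100
# Two HDI-related features with genuine cluster structure...
hdi_features = np.vstack([rng.normal(0, 1, (n // 2, 2)),
                          rng.normal(5, 1, (n // 2, 2))])
# ...and an unrelated attribute on a huge scale (total population)
population = rng.uniform(1e5, 1e9, (n, 1))

X = np.hstack([hdi_features, population])

# Standardize so no attribute dominates by scale alone, then weight:
# a weight of 0 simply drops the irrelevant attribute from the distance
weights = np.array([1.0, 1.0, 0.0])
X_w = StandardScaler().fit_transform(X) * weights

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_w)
```

Without the scaling and weighting, the population column alone would determine the clusters, since squared Euclidean distance is dominated by the largest-scale feature.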
I perform clustering on a one-dimensional dataset and I need a way to automatically decide the optimal number of clusters from $k \in \{2, 3, 4, 5, 6\}$. The number of observations to cluster is low (usually around 10-13). I think I'd need to try optimising for one of two goals (or both at the same time) and see what works best: to achieve a partitioning with the lowest within-cluster variances. Intuitively, I would go for something like the average within-cluster …
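With only ~10-13 one-dimensional points and $k \in \{2, \dots, 6\}$, a brute-force sweep over k with the silhouette score (which trades off within-cluster tightness against between-cluster separation, i.e. both goals at once) is essentially free. A sketch with made-up values:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Invented 1-D data with three obvious groups
x = np.array([1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0, 9.1, 9.2, 9.3]).reshape(-1, 1)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
    scores[k] = silhouette_score(x, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

For purely 1-D data, exact methods (e.g. dynamic-programming k-means such as Ckmeans.1d.dp) also exist, but with this few points the sweep above is already exhaustive in practice.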
My background knowledge: basically, supervised learning is based on labeled data. Using the labeled data, the machine can study and determine results for unlabeled data. To do that, for example, if we handle an image task, manpower is essentially needed to crop the raw photos, label them, and upload them to the server as the fundamental labeled data. I know it sounds weird, but I'm just curious whether there are any algorithms/systems to create labels automatically for supervised learning.