I am trying to plot a dendrogram to cluster data but this error is stopping me. My data is here. I first chose columns to work with:

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

df_euro = pd.read_csv('https://assets.datacamp.com/production/repositories/655/datasets/2a1f3ab7bcc76eef1b8e1eb29afbd54c4ebf86f2/eurovision-2016.csv')
samples = df_euro.iloc[:, 2:7].values[:42]
country_names = df_euro.iloc[:, 1].values[:42]

# Calculate the linkage: mergings
mergings = linkage(samples, method='complete')

# Plot the dendrogram, labeling the leaves with the country names
# (note: the original code passed an undefined variable `y` here)
dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=6)
plt.show()
```

But I'm getting this error which I can't understand. I googled it and …
I am analyzing a portfolio of about 225 stocks and have data for each of them on "Price/Earnings ratio", "Return on Assets", and "Earnings per share growth". I would like to cluster these stocks into 3 or 4 groups based on their attributes. However, there are substantial outliers in the data set; instead of removing them altogether, I would like to keep them in. What ML algorithm would be best suited for this? I have been told …
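Since the outliers should stay in, one hedged option is to make the preprocessing robust rather than the clusterer itself: scale with median/IQR (`RobustScaler`) so the extreme values don't dominate the distances, then run ordinary k-means. A minimal sketch on invented stand-in data (the real P/E, ROA, and EPS-growth values would replace `X`):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for the real features: P/E, ROA, EPS growth for 225 stocks
X = rng.normal(size=(225, 3))
X[:5] *= 50  # a few extreme outliers, kept in on purpose

# RobustScaler centers on the median and scales by the IQR,
# so the outliers do not dominate the feature scales
X_scaled = RobustScaler().fit_transform(X)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(labels))  # cluster sizes
```

A density-based method such as DBSCAN is another option, since it leaves outliers unassigned instead of forcing them into a group.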
I want to detect anomalies in a bank data set using an unsupervised learning method. However, in this data set, all columns except time and amount are categorical, and about half of them have more than 90 percent missing values. The goal is to detect anomalies in this data set through unsupervised learning. I'm currently using an autoencoder to approach it, but I wondered if this would work. Also, because the purpose is to detect whether data is abnormal when data comes …
I am currently looking into how to cluster data with hierarchical dependencies. An example of a problem that I want to cluster: we would like to cluster cities to identify similar characteristics with respect to inhabitants. As input data, I have some characteristics such as the age, weight, height and sex of the inhabitants. Each city will therefore be modeled by a vector, e.g. x_1 = number of people aged 20 years old, x_2 = number of people …
I have a set of curves in 2D space, each expressed as a set of (sampled) data points. Each set has more or less the same number of items; eventually I guess I'll use binning to make sure the number of points is the same (say 50) if that can help. I would like to cluster the curves into N groups, and computing N should be part of the solution. Possible translations on the first dimension are irrelevant. I have …
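A minimal sketch of one possible approach, built only from the assumptions stated in the question: resample each curve to 50 points, subtract the mean of the first coordinate so translations there drop out, then cluster hierarchically and pick N by silhouette score. The example curves below are invented:

```python
import numpy as np
from scipy.interpolate import interp1d
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def to_feature(curve, n=50):
    """Resample a (m, 2) curve to n points and drop the x-translation."""
    curve = np.asarray(curve, dtype=float)
    t = np.linspace(0, 1, len(curve))
    tt = np.linspace(0, 1, n)
    x = interp1d(t, curve[:, 0])(tt)
    y = interp1d(t, curve[:, 1])(tt)
    x -= x.mean()  # translation on the first dimension is irrelevant
    return np.concatenate([x, y])

# Invented curves: a sine and a line, each duplicated with an x-shift
curves = []
for shift in (0.0, 5.0):
    s = np.linspace(0, 2 * np.pi, 60)
    curves.append(np.column_stack([s + shift, np.sin(s)]))
    curves.append(np.column_stack([s + shift, 0.3 * s]))

F = np.array([to_feature(c) for c in curves])

# Choose N by silhouette over a small candidate range
best = max(range(2, len(F)), key=lambda k: silhouette_score(
    F, AgglomerativeClustering(n_clusters=k).fit_predict(F)))
labels = AgglomerativeClustering(n_clusters=best).fit_predict(F)
print(best, labels)
```

Here the two shifted sines end up together regardless of their x-offset, which is exactly the invariance asked for.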
Is there any way to fix the K-Means cluster labels? I am working with 4 clusters, and whenever I run the Python program from the beginning, the cluster labels change. Is it possible to fix the cluster labels? I have tried playing with the random_state parameter, but it does not seem to work.
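Setting `random_state` only makes a single run reproducible; the label ids themselves remain arbitrary. One common workaround is to relabel the clusters by a deterministic property of the centroids, for example sorted by the first coordinate, so the same physical cluster always gets the same id. A sketch on made-up, well-separated data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng()  # deliberately unseeded: labels may vary per run
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(30, 2)) for c in (0, 3, 6, 9)])

km = KMeans(n_clusters=4, n_init=10).fit(X)

# Relabel so that label 0 is the cluster whose centroid has the smallest
# first coordinate; the ordering is now independent of the run
order = np.argsort(km.cluster_centers_[:, 0])
remap = np.empty(4, dtype=int)
remap[order] = np.arange(4)
stable_labels = remap[km.labels_]
```

Any deterministic sort key works (centroid norm, cluster size, etc.) as long as it distinguishes the clusters.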
I'm trying to implement the unsupervised k-means algorithm for sentiment analysis of the IMDB movie dataset created by Stanford. The steps that I followed are:

1) Load the comments.
2) Apply tokenization and stemming, then use the tf-idf algorithm to create a tf-idf matrix.
3) Use the k-means algorithm to divide the data into 2 clusters.

My problem is how to validate the clusters; I have labeled test data. I want to check if all the negative examples go in one cluster …
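With labeled test data available, one way to validate the steps above is to map each cluster id to the majority true label among its members and compute accuracy, or use a permutation-invariant score such as the adjusted Rand index, which ignores which cluster got which id. A toy sketch with invented arrays standing in for the real outputs:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical outputs: cluster ids from k-means and the known test labels
clusters = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_true   = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # 1 = positive, 0 = negative

# Cluster ids are arbitrary, so map each cluster to its majority true label
mapping = {c: np.bincount(y_true[clusters == c]).argmax()
           for c in np.unique(clusters)}
y_pred = np.array([mapping[c] for c in clusters])

accuracy = (y_pred == y_true).mean()
ari = adjusted_rand_score(y_true, clusters)  # label-permutation invariant
print(accuracy, ari)
```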
I am wondering whether I can perform any kind of cross-validation or GridSearchCV for unsupervised learning. The thing is that I have the ground-truth labels (but since it is unsupervised, I just drop them for training and then reuse them for measuring accuracy, AUC, AUCPR, and F1-score over the test set). Is there any way to do this?
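One hedged pattern for this: run an ordinary K-fold loop, select hyperparameters with a purely unsupervised criterion (e.g. silhouette on the held-out fold), and bring the ground-truth labels back only for the final evaluation. A sketch using the iris data as a stand-in for the real dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)  # stand-in; y is ignored during training

results = {}
for k in (2, 3, 4):
    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                     random_state=0).split(X):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[train_idx])
        test_labels = km.predict(X[test_idx])
        # Model selection uses an unsupervised criterion only
        scores.append(silhouette_score(X[test_idx], test_labels))
    results[k] = np.mean(scores)

best_k = max(results, key=results.get)
# Ground-truth labels enter only at final evaluation time
final = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
print(best_k, adjusted_rand_score(y, final))
```

`GridSearchCV` itself can also be used with a custom scorer, but the manual loop makes the train/evaluate separation explicit.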
I have been struggling with this problem for a while now and I finally decided to post a question here to get some help. The problem I'm trying to solve is about predictive maintenance. Specifically, a system produces 2 kinds of maintenance messages when it runs: a basic-msg and a fatal-msg. A basic message indicates that there is a problem with the system that needs to be checked (it's not serious); a fatal-msg, on the other hand, signals that the …
I have a dataset which has demographic data available for a list of new customers. The data doesn't include the customers' transaction data. I want to identify the top 100 potential customers among these customers. I'm aware that we can make use of clustering to segment these customers. However, I have two more variables in my data, Rank and Value. What approach should be taken when the rank and value of customers are given? How do we interpret the clusters …
I am working with the LDA (Latent Dirichlet Allocation) model from sklearn and I have a question about reusing the model I have. After training my model with data, how do I use it to make a prediction on new data? Basically, the goal is to read the content of an email.

```python
countVectorizer = CountVectorizer(stop_words=stop_words)
termFrequency = countVectorizer.fit_transform(corpus)
featureNames = countVectorizer.get_feature_names()

model = LatentDirichletAllocation(n_components=3)
model.fit(termFrequency)

joblib.dump(model, 'lda.pkl')
# lda_from_joblib = joblib.load('lda.pkl')
```

I save my model using joblib. Now I want …
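One detail worth noting: to score a new email later, the fitted `CountVectorizer` must be saved alongside the LDA model, because its vocabulary maps new text into the same feature space, and the new text must go through `transform()` (never `fit()`). A self-contained sketch with an invented corpus:

```python
import os
import tempfile
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical tiny corpus standing in for the real emails
corpus = ["free money offer now", "meeting agenda project notes",
          "win a free prize today", "project deadline meeting tomorrow"]

vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(corpus)
model = LatentDirichletAllocation(n_components=3, random_state=0).fit(tf)

# Persist BOTH the model and the fitted vectorizer
outdir = tempfile.mkdtemp()
joblib.dump(model, os.path.join(outdir, 'lda.pkl'))
joblib.dump(vectorizer, os.path.join(outdir, 'vectorizer.pkl'))

# Later: load them back and score a new email with transform(), not fit()
model = joblib.load(os.path.join(outdir, 'lda.pkl'))
vectorizer = joblib.load(os.path.join(outdir, 'vectorizer.pkl'))
topic_dist = model.transform(vectorizer.transform(["new project meeting"]))
print(topic_dist)  # one row of topic probabilities summing to 1
```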
I'm developing an anomaly detection program in Python. The main idea is to create a new LSTM model every day, train it on the previous 7 days, and predict the next day. Then, using thresholds, find anomalies day by day. I've already implemented that, and these thresholds are working well:

upper threshold = trimmed_mean + (K * interquartile_range)
lower threshold = trimmed_mean - (K * interquartile_range)

where trimmed_mean and interquartile_range are calculated on the prediction error (real curve …
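The two thresholds above can be computed directly with `scipy.stats.trim_mean` and percentiles. A sketch on synthetic prediction errors (the 10% trim proportion and K = 1.5 are assumptions, not values from the question):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
errors = rng.normal(0, 1, 1000)  # stand-in for LSTM prediction errors
errors[::100] += 15              # inject a few anomalous spikes

K = 1.5
trimmed_mean = stats.trim_mean(errors, proportiontocut=0.1)
q1, q3 = np.percentile(errors, [25, 75])
iqr = q3 - q1

# The thresholds from the question
upper = trimmed_mean + K * iqr
lower = trimmed_mean - K * iqr
anomalies = np.flatnonzero((errors > upper) | (errors < lower))
print(len(anomalies))
```

Both statistics are resistant to the spikes themselves, which is what makes this thresholding usable on error series that contain the very anomalies being hunted.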
There is manufacturing data with 10 process variables. It is tabular data, and it is not labeled as normal or defective. Do you have a paper that uses only unlabeled data to predict defects or to find the variables that affect them? I thought about using an outlier detection algorithm (Isolation Forest, autoencoder) to predict defects, but I can't find a way because I don't know the exact defect rate. I can't think of a way to verify it, so I'd …
I am dealing with time series data with 200K+ records (every minute for 6 months) from a gas turbine, and I am trying to detect the fault early (0 or 1 = fault). The issues with the data are:

1. The fault occurred only 5 times (observed via sudden shutdowns), which makes the data hugely imbalanced.
2. (Unsupervised) There is no binary output. I used 2 of the variables as my output and used them for binary clustering (k-means), but the result is not very good, as there are false …
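Given only 5 known faults in 200K+ records, one unsupervised alternative to binary k-means is an anomaly detector such as Isolation Forest, with `contamination` set near the (rare) fault rate so the decision threshold matches the imbalance. A sketch on invented sensor-like data; the shapes and values are made up:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Stand-in for turbine sensor readings: mostly normal operation...
X_normal = rng.normal(0, 1, size=(2000, 4))
# ...plus 5 fault windows that look very different
X_fault = rng.normal(6, 1, size=(5, 4))
X = np.vstack([X_normal, X_fault])

# contamination is set slightly above the expected fault rate (assumption)
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)  # -1 = anomaly, 1 = normal
fault_flags = np.flatnonzero(pred == -1)
print(len(fault_flags))
```

This sidesteps the imbalance entirely: the model learns what "normal" looks like and flags departures, instead of trying to carve the data into two balanced clusters.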
I have a univariate time series (one value per time sample; sampling time: 66.66 microseconds, 151 samples per sampling time) coming from a Scala customer. This time series contains time frames, each of which is 8K (frequencies) * 151 (time samples) in 0.5 sec (overall 1.2288 million samples per half second). I need to find anomalies based on the different rows (frequencies) and report which rows (frequencies) are anomalous, using an unsupervised learning method. Do you have an …
How can I perform conceptual clustering in sklearn? My use case is that I have English Wikipedia articles that I'm doing unsupervised learning on (tf-idf -> truncated SVD -> l2 normalization), and I'd like to create a hierarchy for them such that the nodes at the top are the most general articles (e.g. Programming Languages -> Functional Languages -> Haskell). I tried using hierarchy.linkage, but it seems that the algorithm uses O(n^2) space, and I ran out of memory. I …
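If scipy's `linkage` runs out of memory, sklearn's `Birch` may be worth trying: it builds a CF-tree incrementally (and supports `partial_fit` for chunked data), so it never materializes the n² pairwise-distance matrix. A sketch on random stand-in vectors in place of the real l2-normalized SVD features:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Stand-in for the l2-normalized SVD vectors of the articles
X = rng.normal(size=(2000, 50))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Birch maintains a compact CF-tree summary of the data, then runs a
# global clustering step on the tree's subcluster centroids only
birch = Birch(threshold=0.5, n_clusters=20).fit(X)
labels = birch.labels_
print(len(np.unique(labels)))
```

The resulting CF-tree is itself a hierarchy over the subclusters, though it groups by vector similarity rather than by concept generality, so it is an approximation of conceptual clustering, not a faithful implementation of it.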
I'm a beginner and I have a question. Can clustering results based on probability be used for supervised learning? I have manufacturing data with 80,000 rows. It is not labeled, but there is information that the defect rate is 7.2%. Can the result of clustering, with hyperparameters tuned based on the defect rate, be applied to supervised learning? Is there a paper like this? Is this method a big problem from a data perspective? When using this method, what is the verification …
K-means clustering tries to minimize the within-cluster scatter and maximize the distances between clusters, and it does so on all attributes. I am learning about this method on several datasets. To illustrate: in one of the datasets, countries are compared based on attributes related to their Human Development Index. However, some of the attributes are completely unrelated to this dimension, for example the total population of the countries. How should I deal with these attributes? As mentioned before, k-means tries to minimize the scatter …
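Two common remedies are standardizing the features so scale alone cannot dominate the distances, and then down-weighting (or zeroing out) attributes judged irrelevant to the dimension of interest. A sketch with an invented population column whose huge scale would otherwise swamp the HDI-related features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 100
# Two HDI-related features with genuine cluster structure...
hdi_features = np.vstack([rng.normal(0, 1, (n // 2, 2)),
                          rng.normal(5, 1, (n // 2, 2))])
# ...and an unrelated attribute on a huge scale (total population)
population = rng.uniform(1e5, 1e9, (n, 1))

X = np.hstack([hdi_features, population])

# Standardize so no attribute dominates by scale alone, then weight:
# a weight of 0 simply drops the irrelevant attribute from the distance
weights = np.array([1.0, 1.0, 0.0])
X_w = StandardScaler().fit_transform(X) * weights

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_w)
```

Without the scaling and weighting, the population column alone would determine the clusters, since squared Euclidean distance is dominated by the largest-scale feature.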
I perform clustering on a one-dimensional dataset and I need a way to automatically decide the optimal number of clusters from $k \in \{2, 3, 4, 5, 6\}$. The number of observations to cluster is low (usually around 10-13). I think I'd need to try optimising for one of two goals (or both at the same time) and see what works best: to achieve a partitioning with the lowest within-cluster variances. Intuitively, I would go for something like the average within-cluster …
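With only ~10-13 one-dimensional points and $k \in \{2, \dots, 6\}$, a brute-force sweep over k with the silhouette score (which trades off within-cluster tightness against between-cluster separation, i.e. both goals at once) is essentially free. A sketch with made-up values:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Invented 1-D data with three obvious groups
x = np.array([1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0, 9.1, 9.2, 9.3]).reshape(-1, 1)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
    scores[k] = silhouette_score(x, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

For purely 1-D data, exact methods (e.g. dynamic-programming k-means such as Ckmeans.1d.dp) also exist, but with this few points the sweep above is already exhaustive in practice.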
My background knowledge: basically, supervised learning is based on labeled data. Using the labeled data, the machine can study and determine results for unlabeled data. To do that, for example, if we handle an image task, manpower is essentially needed to crop the raw photos, label them, and upload them to the server as the fundamental labeled data. I know it sounds weird, but I'm just curious whether there are any algorithms/systems to create labels automatically for supervised learning.