Unsupervised Hierarchical Agglomerative Clustering

I've read a number of papers where the authors talk about "Unsupervised Hierarchical Agglomerative Clustering". They seem to imply that the algorithm determines the number of clusters based on a hyper-parameter: we define the heterogeneity metric within a cluster to be the average of all-pair Jaccard distances, and at each step merge two clusters if the heterogeneity of the resulting cluster is below a specified threshold. When I search for Python implementations of agglomerative clustering I keep coming up with …
Category: Data Science

How to score different clusters of features for predictiveness?

I have a set of true/false data that represents whether a given feature was active when the data snapshot was recorded. Data snapshots are recorded when the user takes an action. The goal is to find clusters of features that were true at the same time and that are predictive of the user taking said action. To provide some more context, I'm working on a program that is meant to analyze data recorded while players play …
Category: Data Science
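
One simple way to score a candidate set of boolean features for predictiveness, assuming the snapshots carry an action label, is "lift": how much more likely the action is when every feature in the set is true, compared to its base rate. The column names (`f1`, `f2`, `f3`, `action`) and the toy data below are hypothetical:

```python
# Sketch: score a feature set by lift = P(action | all features True) / P(action).
# Values above 1 mean the feature combination makes the action more likely.
# All column names and data here are made up for illustration.
import pandas as pd

def lift(df: pd.DataFrame, features: list, target: str = "action") -> float:
    """P(target | all features True) / P(target); 0.0 if the set never co-occurs."""
    mask = df[features].all(axis=1)
    if not mask.any():
        return 0.0
    return df.loc[mask, target].mean() / df[target].mean()

df = pd.DataFrame({
    "f1":     [1, 1, 0, 1, 0, 1, 0, 0],
    "f2":     [1, 1, 0, 1, 0, 0, 1, 0],
    "f3":     [0, 1, 1, 0, 0, 1, 0, 1],
    "action": [1, 1, 0, 1, 0, 0, 0, 0],
}).astype(bool)

print(round(lift(df, ["f1", "f2"]), 3))  # 2.667: the pair co-occurs only with the action
```

From here one could enumerate the feature subsets produced by a clustering step and rank them by lift, preferably with a minimum-support cutoff so rare combinations don't dominate.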

Understanding hierarchical clustering features importance

I made a hierarchical clustering with scikit-learn: selected_model = AgglomerativeClustering(n_clusters=8); hierarchical_clustering8 = selected_model.fit_predict(answers). This classification was done on the basis of 50 features and led me to 8 clusters. How can I proceed to determine the importance of each feature in this classification? My goal is to determine the most important and least important features for each cluster, and to be able to explain each cluster.
Category: Data Science
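
Hierarchical clustering has no built-in feature importances, but one common workaround is to compare each feature's mean inside a cluster to its overall mean, scaled by the overall standard deviation: features with large |z| are the ones that characterize that cluster. A sketch, with `answers` replaced by random stand-in data and 50 hypothetical feature names:

```python
# Sketch: rank features per cluster by how far the cluster mean deviates from
# the overall mean, in units of the overall standard deviation. `answers` is
# random stand-in data; in the question it is the real 50-feature dataset.
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
answers = pd.DataFrame(rng.normal(size=(200, 50)),
                       columns=[f"q{i}" for i in range(50)])

labels = AgglomerativeClustering(n_clusters=8).fit_predict(answers)

overall_mean = answers.mean()
overall_std = answers.std()

for k in range(8):
    z = (answers[labels == k].mean() - overall_mean) / overall_std
    top = z.abs().sort_values(ascending=False).head(3)
    print(f"cluster {k}: most distinctive features -> {list(top.index)}")
```

An alternative is to train a classifier (e.g. a random forest) to predict the cluster labels and read off its feature importances, one-vs-rest per cluster.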

ggplot2 for Cluster analysis (non-readable row names)

I have made a cluster analysis and ended up with a dendrogram; however, the row names are not readable (marked with a red rectangle). Is there a way to adjust this? library("reshape2"); library("purrr"); library("dplyr"); library("dendextend"); dendro <- as.dendrogram(aggl.clust.c); dendro.col <- dendro %>% set("branches_k_color", k = 5, value = c("darkslategray", "darkslategray4", "darkslategray3", "gold", "gold2")) %>% set("branches_lwd", 0.6) %>% set("labels_colors", value = c("darkslategray")) %>% set("labels_cex", 0.5); ggd1 <- as.ggdend(dendro.col); ggplot(ggd1, theme = theme_minimal()) + labs(x = "Num. observations", y = "Height", …
Category: Data Science

Results interpretation of AgglomerativeClustering labelling

First of all, I would like to say that I'm quite new to Python and even newer to scikit-learn, and I'm also a self-learner, so please forgive my banal question, though it doesn't look banal to me. So, I have the following cosine similarity matrix as a DataFrame:

      m1     m2     m3     m4     m5
m1  1.000  0.179  0.775  0.673  0.544
m2  0.299  1.000  0.333  0.521  0.232
m3  0.656  0.440  1.000  0.444  0.722
m4  0.578  0.154  0.623  1.000  0.891
m5  …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.