So I am performing k-means clustering on RFM variables (Recency, Frequency, Monetary). The RFM variables are in the form of quantiles (1-4). I used PCA and found the PCA components. I then used the elbow method to find the optimal number of clusters, which I use in the k-means algorithm. Could anyone tell me whether this is a correct method? Further, on the graph of the clusters I get, the axes range from -3 to 3 and I …
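A minimal sketch of the pipeline described above (random quantile scores stand in for the real RFM table; the cluster count of 3 is an illustrative choice, not a recommendation):

```python
# Scale -> PCA -> elbow -> k-means, on hypothetical RFM quantile data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
rfm = rng.integers(1, 5, size=(200, 3)).astype(float)  # quantile scores 1-4

X = StandardScaler().fit_transform(rfm)
X_pca = PCA(n_components=2).fit_transform(X)

# Elbow method: inspect inertia as k grows, look for the "bend".
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca).inertia_
            for k in range(1, 7)}

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pca)
print(sorted(set(labels)))
```

The -3 to 3 axis range is expected here: after standardization the PCA scores are roughly zero-mean with unit-scale variance, so most points fall within a few units of the origin.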
So I worked on a hierarchical clustering algorithm to determine which items are most similar and which attributes are most important. I have two tables. Table 1 contains a bunch of item codes and their attributes (brand, flavor, sales, and so on). It looks something like:

Item_code | Brand | Flavor | Caloric_content | ... | sales
006891313 | Coke | Original | 0 | ...
002349823 | Fanta | Orange | 200 | ...

The other …
I've recently started working on an application for visualization of really big datasets. While reading online, it became apparent that most people use HDF5 for storing big, multi-dimensional datasets, as it offers the versatility to allow many dimensions, has no file size limits, and is transferable between OSs. My question is how best to deal with very large files. I am working with datasets that have three dimensions, all of which have a large number of components (example size: 62,500 x 500,000 …
I am writing an algorithm that detects activities based on wearable data. I would like to try out a hierarchical approach (Local Classifier Per Parent Node structure). In the first level, I determine the intensity of the activity (1 classifier), and in the second level I determine the activity label (3 classifiers). I am, however, struggling with how to approach the training/testing/validation of such a structure. What I have done so far is: split the data into 2 …
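The Local Classifier Per Parent Node idea can be sketched on synthetic data (the features, the two-level labels, and the choice of logistic regression here are all hypothetical stand-ins for the wearable data): one level-1 classifier predicts intensity, then a separate level-2 classifier per intensity value predicts the activity label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
intensity = (X[:, 0] > 0).astype(int)                    # level 1: low / high
activity = intensity * 2 + (X[:, 1] > 0).astype(int)     # level 2: labels per parent

level1 = LogisticRegression(max_iter=1000).fit(X, intensity)
# One local classifier per parent node, trained only on that parent's rows.
level2 = {p: LogisticRegression(max_iter=1000).fit(X[intensity == p],
                                                   activity[intensity == p])
          for p in (0, 1)}

def predict(x):
    p = level1.predict(x.reshape(1, -1))[0]              # route through the parent
    return p, level2[p].predict(x.reshape(1, -1))[0]

print(predict(X[0]))
```

Note that each level-2 classifier sees only rows whose true parent label matches, which is one common training choice for this structure; at test time, errors at level 1 propagate to level 2, which is exactly why the evaluation split deserves care.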
I will be grateful for any help. I have multivariate time series, each of which has a unique ID. Also, there is a variable giving information about the trend type of the ID from the point of view of a single variable that we consider important. The problem is, I need to understand how the behaviour (or trends) of the other variables (time series within an ID) affects the inclusion of the ID in a specific stated trend category. I …
I have genes from different species: Gene A, Gene B, Gene C, ... Gene Z. Some genes are similar to each other:

A & G are 96% similar
C & H are 92% similar
G & B are 89% similar
G & T are 85% similar
...
K & F are 52% similar

I want to classify these genes into groups of species: Species A, B, T, G are the same species; Species C, H, N, R, …
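One way to get groups like the ones described above is single-linkage-style grouping: treat each "X & Y are p% similar" pair as an edge, keep edges above a similarity threshold, and take connected components with union-find. The pairs below are from the question; the 80% threshold is an illustrative assumption.

```python
# Group genes into species via union-find over high-similarity pairs.
similar = [("A", "G", 96), ("C", "H", 92), ("G", "B", 89),
           ("G", "T", 85), ("K", "F", 52)]
THRESHOLD = 80  # hypothetical cut-off for "same species"

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

for a, b, pct in similar:
    if pct >= THRESHOLD:
        union(a, b)

groups = {}
for gene in {g for a, b, _ in similar for g in (a, b)}:
    groups.setdefault(find(gene), set()).add(gene)

print(sorted(map(sorted, groups.values())))
# -> [['A', 'B', 'G', 'T'], ['C', 'H'], ['F'], ['K']]
```

With the 80% threshold, A, B, G, T end up in one group, matching the expected species grouping in the question; K and F stay separate because 52% falls below the cut-off.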
I have input features of a hierarchical structure. Each feature consists of a header element and 0 to n subfeatures of the same structure. Also, there is no upper limit for n and n can be different from feature to feature. It should also be possible to establish relationships between features with a different number of subfeatures. How can I format this data so that it can be used to train different (machine) learning algorithms? Example of one input feature …
I have a series of multiple-response (dummy) variables describing causes for canceled visits. A visit can have multiple reasons for the cancellation. My goal is to create a single mutually exclusive variable from the dummy variables in a hierarchical way. For example, in my sample data below the rank of my variables is as follows: Medical, NoID, and Refuse. E.g., if a visit was canceled due to medical and lack-of-ID reasons, I would like to recode …
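A sketch of this hierarchical recode with `numpy.select`, which takes the first matching condition in order (the four-row sample here is hypothetical; the priority Medical > NoID > Refuse is from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Medical": [1, 0, 0, 0],
    "NoID":    [1, 1, 0, 0],
    "Refuse":  [0, 1, 1, 0],
})

priority = ["Medical", "NoID", "Refuse"]        # first match wins
conditions = [df[c].eq(1) for c in priority]
df["Reason"] = np.select(conditions, priority, default="Other")
print(df["Reason"].tolist())  # -> ['Medical', 'NoID', 'Refuse', 'Other']
```

Row 1 has both Medical and NoID set, and resolves to "Medical" because it ranks first in the hierarchy.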
So I have a dataset with variables whose units of measurement are milligrams, kilograms, and quintals. Should I use StandardScaler or MinMaxScaler to scale the dataset?
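A small comparison of the two scalers on a toy column (not the real data): StandardScaler gives zero mean and unit variance, MinMaxScaler maps values to [0, 1]. Either removes the unit-of-measurement effect; converting everything to a common unit first is also worth considering.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

Xs = StandardScaler().fit_transform(X)   # zero mean, unit variance
Xm = MinMaxScaler().fit_transform(X)     # rescaled to [0, 1]

print(Xs.mean(), Xs.std())
print(Xm.min(), Xm.max())  # -> 0.0 1.0
```

StandardScaler is the usual default for distance-based methods when outliers are moderate; MinMaxScaler is sensitive to extreme values because the min and max define the range.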
Assume we have a dendrogram (hierarchical clustering tree). Can we define a partitioning of the data into K clusters by cutting the branches of the tree at some levels below the root node?
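Yes, cutting the tree to get a flat K-cluster partition is exactly how dendrograms are typically used. A sketch with SciPy on toy two-blob data (the data and K = 2 are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)),    # blob 1
               rng.normal(5, 0.1, (5, 2))])   # blob 2

Z = linkage(X, method="ward")                    # the dendrogram, as a linkage matrix
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into K = 2 flat clusters
print(len(set(labels)))  # -> 2
```

`criterion="maxclust"` chooses the cut height that yields at most K clusters; you can instead cut at an explicit height with `criterion="distance"`.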
I have a dendrogram represented in a format I don't understand: (K_5:1.000030e+00,((K_1:2.000000e-05,(K_2:1.000000e-05,K_3:1.000000e-05):1.000000e-05):1.000000e-05,K_4:3.000000e-05)0.806:1.000000e+00):0.000000e+00; I am not sure how to interpret the above. It is the output of hierarchical clustering. K_1, K_2, K_3, K_4, K_5 are the data points. I have other dendrograms represented in the following format (we start with one big cluster and split a cluster at each step):

[x_1,x_2,x_3,x_4,x_5]
[x_1,x_2][x_3,x_4,x_5]
[x_1,x_2][x_3,x_5][x_4]
[x_1][x_2][x_3,x_5][x_4]
[x_1][x_2][x_3][x_5][x_4]

I want a way to convert between these two representations.
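The first string is Newick format: nested parentheses give the tree topology, `K_1:2.000000e-05` is a leaf with its branch length, and the bare `0.806` is a support value on an internal node. The second representation is the sequence of partitions obtained by cutting the same tree at successively lower heights, splitting one cluster per step. A sketch of producing that sequence from a SciPy linkage matrix (toy 1-D data, not your tree; you would first need a Newick parser to get from the string to a tree):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

X = np.array([[0.0], [0.1], [1.0], [1.1], [5.0]])
Z = linkage(X, method="average")

# One partition per number of clusters: the top-down split sequence.
n = X.shape[0]
for k in range(1, n + 1):
    labels = cut_tree(Z, n_clusters=k).ravel()
    parts = [sorted(np.where(labels == c)[0].tolist()) for c in sorted(set(labels))]
    print(parts)
```

Each printed line corresponds to one row of the second representation, from the single root cluster down to all singletons.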
Here is a common problem in health-care modeling. Did I just invent a new algorithm, or has someone already thought of this? The goal is to find the most homogeneous partition of patients by medical costs using ICD9 codes. There are 13,000 individual codes in the data set, so using the full code results in many codes having only a few observations. ICD9 codes are in a nested hierarchical structure. For instance, all infectious diseases are 001-139, one particular disease …
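The nesting suggests the usual roll-up trick: map each code to a coarser level of the hierarchy so that sparse codes pool observations. A sketch (the chapter list below is partial and illustrative, not the full ICD9 table):

```python
# Roll numeric ICD9 codes up to their chapter range.
CHAPTERS = [
    (1, 139, "infectious"),
    (140, 239, "neoplasms"),
    (240, 279, "endocrine"),
]

def icd9_chapter(code: str) -> str:
    """Map a numeric ICD9 code (e.g. '003.1') to its chapter label."""
    major = int(code.split(".")[0])
    for lo, hi, name in CHAPTERS:
        if lo <= major <= hi:
            return name
    return "other"

print(icd9_chapter("003.1"))  # -> infectious
print(icd9_chapter("185"))    # -> neoplasms
```

Searching over *which* level of the hierarchy to cut each branch at, to minimize within-group cost variance, is essentially what tree-based partitioning methods do, so it is worth checking the literature on hierarchical feature aggregation before concluding the algorithm is new.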
Right now, I am working on implementing a clustering algorithm for millions of data entries regarding the users of a mobile game. A lot of the features I plan on using are unique to this game (data that can only be analyzed if one knows the game well), and thus I believe it is best for my data that I come up with a new function to generate the distance matrix that I plan on using in the …
I have data with high-cardinality categorical attributes. For example, main_column, sub_column1, and sub_column2 are 3 hierarchical attributes. If I create dummy variables from these columns, the column count increases to 1000. How can I handle this kind of hierarchical attribute for a classification problem? Thanks!
I have a large table where each record or row represents a single salesperson. There are 50 columns or dimensions, where each column represents one of 50 products potentially sold by any given salesperson, with one final column representing their total compensation as a percentile of their salesperson peers. The values within each column range from 0 to 100, reflecting the salesperson's percentile performance in sales for that product, and, in the final column, their percentile in total …
My goal is to implement an assistant for crawling web data for users who don't understand anything about HTML or the DOM. I will show a web page to the user, and the user has to select what data on the page he is interested in (or what data he is not interested in). Example: if the user clicks on a cell inside a table, it is very likely he wants to extract all elements inside that column. He might only be …
I'm trying to solve a fairly basic problem in NLP efficiently. What tool or software package would you use to identify the words, or groups of words, that are part of a given ontology within free text? Let's imagine the inputs are the following dummy ontology: And this publication's abstract: This study evaluates the addition of metformin to standard of care in locally advanced and metastatic prostate cancer, half the patients will receive metformin in combination with standard treatment, …
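As a baseline, plain dictionary matching already covers this: scan the text for every (possibly multi-word) term in the ontology. The mini-ontology below is hypothetical; dedicated tools (e.g. spaCy's `PhraseMatcher`) do the same thing more robustly and at scale.

```python
import re

ontology = {"metformin", "prostate cancer", "standard of care"}
abstract = ("This study evaluates the addition of metformin to standard of "
            "care in locally advanced and metastatic prostate cancer")

# Match each term on word boundaries, case-insensitively.
found = sorted(term for term in ontology
               if re.search(r"\b" + re.escape(term) + r"\b", abstract.lower()))
print(found)  # -> ['metformin', 'prostate cancer', 'standard of care']
```

For large ontologies, an Aho-Corasick automaton (or a trie-based matcher) avoids rescanning the text once per term.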
I have a set of datapoints from the unit interval (i.e. a 1-dimensional dataset with numerical values). I receive some additional datapoints online, and moreover the values of some datapoints might change dynamically. I'm looking for an ideal clustering algorithm that can handle these issues efficiently. I know sequential k-means clustering copes with the addition of new instances, and I suppose with a minor modification it can work with dynamic instance values (i.e. first removing the modified instance from the respective cluster, …
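The remove-then-reinsert modification described above can be sketched directly: centroids are maintained as running means, so a point can be subtracted from its cluster and re-added at its new value. (The class and the initial centers below are a hypothetical illustration, not a library API.)

```python
class SeqKMeans:
    """Sequential k-means on 1-D values with add and update operations."""

    def __init__(self, centers):
        self.centers = list(centers)          # current centroid positions
        self.counts = [0] * len(centers)      # points assigned to each centroid

    def _nearest(self, x):
        return min(range(len(self.centers)), key=lambda i: abs(x - self.centers[i]))

    def add(self, x):
        i = self._nearest(x)
        self.counts[i] += 1
        self.centers[i] += (x - self.centers[i]) / self.counts[i]  # running mean
        return i

    def remove(self, x, i):
        # Undo x's contribution to cluster i's running mean.
        self.counts[i] -= 1
        if self.counts[i]:
            self.centers[i] -= (x - self.centers[i]) / self.counts[i]

    def update(self, old, i, new):
        self.remove(old, i)          # take the point out of its old cluster...
        return self.add(new)         # ...then reinsert it at the new value

km = SeqKMeans([0.2, 0.8])
i = km.add(0.25)
i = km.update(0.25, i, 0.9)          # the point's value changed dynamically
print(round(km.centers[1], 3))       # -> 0.9
```

One caveat of this scheme: centroids only ever drift, so after many updates the assignment can become stale compared to a full re-clustering; periodic batch refits are a common fix.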