Using PCA to cluster multidimensional data (RFM variables)

I am performing k-means clustering on RFM variables (Recency, Frequency, Monetary). The RFM variables are in the form of quantiles (1-4). I ran PCA to obtain the principal components, then used the elbow method to find the optimal number of clusters, which I pass to the k-means algorithm. Could anyone tell me whether this is a correct method? Further, when I plot the clusters, the axes range from -3 to 3 and I …
Category: Data Science

What is the best way to identify the proper hierarchy of this data?

I worked on a hierarchical clustering algorithm to determine which items are most similar and which attributes are most important. I have two tables. Table 1 contains a bunch of item codes and their attributes (brand, flavor, sales, and so on). It looks something like: Item_code | Brand | Flavor | Caloric_content | ... | sales; 006891313 | Coke | Original | 0 | ....; 002349823 | Fanta | Orange | 200 | ... The other …
Category: Data Science

Advice on dealing with very large datasets - HDF5, Python

I've recently started working on an application for visualization of really big datasets. While reading online it became apparent that most people use HDF5 for storing big, multi-dimensional datasets, as it offers the versatility to allow many dimensions, has no file size limits and is transferable between OSs. My question is how best to deal with very large files. I am working with datasets that have 3 dimensions, all of which have a large number of components (example size: 62,500 x 500,000 …
Category: Data Science

How to train/test/validate hierarchical classifiers?

I am writing an algorithm that detects activities based on wearable data. I would like to try a hierarchical approach (Local Classifier Per Parent Node structure). In the first level, I determine the intensity of the activity (1 classifier), and in the second level I determine the activity label (3 classifiers). However, I am struggling with how to approach the training/testing/validation of such a structure. What I did now is: Split data into 2 …
Category: Data Science

Looking for an algorithm to perform classification on multivariate grouped time series

I will be grateful for any help. I have multivariate time series, each of which has a unique ID. There is also a variable giving the trend type of each ID from the point of view of a single variable that we consider important. The problem is that I need to understand how the behaviour (or trends) of the other variables (time series within an ID) affects the assignment of the ID to a specific trend category. I …
Category: Data Science

How to cluster/group these data points (using K-Means or Hierarchical clustering)

I have genes from different species: Gene A, Gene B, Gene C, ... Gene Z. Some genes are similar to each other: A & G are 96% similar; C & H are 92% similar; G & B are 89% similar; G & T are 85% similar; ...; K & F are 52% similar. I want to classify these genes into groups of species: Species A, B, T, G are the same species; Species C, H, N, R, …
Category: Data Science
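
One way to read the gene-grouping question above is as single-linkage grouping on a similarity graph: two genes land in the same group if a chain of sufficiently similar pairs connects them. The following is a minimal Python sketch of that idea, not the asker's method. The function name `group_by_similarity` and the 80% cutoff are assumptions for illustration; only the five similarity pairs come from the excerpt.

```python
def group_by_similarity(pairs, threshold):
    """Group items linked by a chain of pairs with similarity >= threshold."""
    parent = {}

    def find(x):
        # Union-find lookup with path compression.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, similarity in pairs:
        root_a, root_b = find(a), find(b)  # also registers both genes
        if similarity >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in groups.values())


# Pairs quoted in the question (percent similarity).
pairs = [("A", "G", 96), ("C", "H", 92), ("G", "B", 89), ("G", "T", 85), ("K", "F", 52)]

# The 80% cutoff is a hypothetical choice, not taken from the question.
groups = group_by_similarity(pairs, threshold=80)
print(groups)
```

With this cutoff, A, B, G and T end up in one group, which matches the grouping the question asks for. A lower cutoff would also merge K and F.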

Input Features of a Hierarchical Structure

I have input features of a hierarchical structure. Each feature consists of a header element and 0 to n subfeatures of the same structure. Also, there is no upper limit for n and n can be different from feature to feature. It should also be possible to establish relationships between features with a different number of subfeatures. How can I format this data so that it can be used to train different (machine) learning algorithms? Example of one input feature …
Category: Data Science

Use dummy variables to create a rank variable. R

I have a series of multiple-response (dummy) variables describing causes for canceled visits. A visit can have multiple reasons for the cancellation. My goal is to create a single mutually exclusive variable using the dummy variables in a hierarchical way. For example, in my sample data below the rank of my variables is as follows: Medical, NoID and Refuse. E.g., if a visit was cancelled due to medical and lack-of-ID reasons, I would like to recode …
Category: Data Science
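
The recoding described above amounts to picking, for each visit, the first flagged reason in rank order. The question is about R; the sketch below shows the same logic in Python as an illustration only. The rank order Medical, NoID, Refuse is from the excerpt, while the sample rows, the `primary_reason` helper and the "None" fallback label are assumptions.

```python
# Rank order stated in the question, highest priority first.
RANK = ["Medical", "NoID", "Refuse"]


def primary_reason(row, rank=RANK, default="None"):
    """Return the highest-ranked reason whose dummy variable equals 1."""
    for name in rank:
        if row.get(name, 0) == 1:
            return name
    return default  # no reason flagged (hypothetical fallback label)


# Hypothetical sample data: one dict of dummy variables per visit.
visits = [
    {"Medical": 1, "NoID": 1, "Refuse": 0},
    {"Medical": 0, "NoID": 1, "Refuse": 1},
    {"Medical": 0, "NoID": 0, "Refuse": 1},
    {"Medical": 0, "NoID": 0, "Refuse": 0},
]

reasons = [primary_reason(visit) for visit in visits]
print(reasons)
```

Because the loop stops at the first match, each visit gets exactly one label, so the resulting variable is mutually exclusive by construction.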

Different representations of dendrograms

I have a dendrogram represented in a format I don't understand: (K_5:1.000030e+00,((K_1:2.000000e-05,(K_2:1.000000e-05,K_3:1.000000e-05):1.000000e-05):1.000000e-05,K_4:3.000000e-05)0.806:1.000000e+00):0.000000e+00; I am not sure how to interpret the above. It is an output of hierarchical clustering. K_1, K_2, K_3, K_4, K_5 are the data points. I have other dendrograms represented in the following format: [x_1,x_2,x_3,x_4,x_5] (we start with one big cluster and split a cluster at each step) [x_1,x_2][x_3,x_4,x_5] [x_1,x_2][x_3,x_5][x_4] [x_1][x_2][x_3,x_5][x_4] [x_1][x_2][x_3][x_5][x_4] I want a way to convert between these two representations.
Category: Data Science
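
The first representation in the question above looks like Newick format: nested parentheses for clusters, `name:length` for branches, and an optional label after a closing parenthesis (the `0.806` here, often a support value). The sketch below parses the string from the question and lists successive top-down partitions. It is a minimal reading of the conversion, not a full Newick implementation. The split order (always split the cluster closest to the root by cumulative branch length) is an assumption, and the function names are made up for illustration.

```python
NEWICK = ("(K_5:1.000030e+00,((K_1:2.000000e-05,(K_2:1.000000e-05,"
          "K_3:1.000000e-05):1.000000e-05):1.000000e-05,K_4:3.000000e-05)"
          "0.806:1.000000e+00):0.000000e+00;")


def parse_newick(text):
    """Parse a ';'-terminated Newick string into (label, length, children) tuples."""
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # skip ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        label = text[start:pos]
        length = 0.0
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (label, length, children)

    return node()


def leaves(tree):
    """Return the leaf labels under a node, left to right."""
    label, _, children = tree
    if not children:
        return [label]
    return [leaf for child in children for leaf in leaves(child)]


def top_down_partitions(root):
    """Start with one cluster and split the cluster nearest the root at each step."""
    frontier = [(0.0, root)]  # (distance from root, node)
    partitions = [[leaves(root)]]
    while any(n[2] for _, n in frontier):
        i = min((i for i, (_, n) in enumerate(frontier) if n[2]),
                key=lambda i: frontier[i][0])
        dist, (_, _, children) = frontier[i]
        frontier[i:i + 1] = [(dist + c[1], c) for c in children]
        partitions.append([leaves(n) for _, n in frontier])
    return partitions


partitions = top_down_partitions(parse_newick(NEWICK))
for step in partitions:
    print("".join("[" + ",".join(cluster) + "]" for cluster in step))
```

Going the other way (from a list of partitions to Newick) would need branch lengths, which the bracketed representation does not carry, so that direction can only recover the tree shape.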

Finding the best "depth" of ICD9 codes with pseudo-hierarchical clustering

Here is a common problem in health care modeling. Did I just invent a new algorithm or has someone already thought of this? The goal is to find the most homogeneous partition of patients by medical costs using ICD9 codes. There are 13,000 individual codes in the data set, so using the full code results in many only having a few observations. ICD9 codes are in a nested hierarchical structure. For instance, all infectious diseases are 001-139, one particular disease …
Category: Data Science

Question About Coming Up With Own Function for Distance Matrix (For Clustering)

I am currently working on implementing a clustering algorithm on millions of data entries about the users of a mobile game. A lot of the features I plan on using are unique to this game (data that can only be analyzed if one knows the game well), and thus I believe that it is best for my data that I come up with a new function to generate the distance matrix that I plan on using in the …
Category: Data Science

Best classification technique for following kind of data set

I have a large table where each record or row represents a single salesperson, and there are 50 columns or dimensions where each column represents one of 50 products potentially sold by any given salesperson, with one final column representing their total compensation as a percentile of their salesperson peers. The values within each column range from 0 to 100, reflective of the salesperson's percentile performance in sales for that product, and then in the final column, percentile in total …
Category: Data Science

Machine learning for predicting HTML Elements on a web page?

My goal is to implement an assistant for crawling web data for users who don't understand anything about HTML or the DOM. I will show a web page to the user, and the user has to select what data on the page he is interested in (or what data he is not interested in). Example: If the user clicks on a cell inside a table, it is very likely he wants to extract all elements inside that column. He might only be …
Category: Data Science

Using an ontology to recognize named entities in free text

I'm trying to solve a fairly basic problem in NLP efficiently. What tool or software package would you use to identify the words, or groups of words, that are part of a given ontology within a free text? Let's imagine the inputs are the following dummy ontology: And this publication's abstract: This study evaluates the addition of metformin to standard of care in locally advanced and metastatic prostate cancer, half the patients will receive metformin in combination with standard treatment, …
Category: Data Science

Efficient dynamic clustering

I have a set of datapoints from the unit interval (i.e. 1-dimensional dataset with numerical values). I receive some additional datapoints online, and moreover the value of some datapoints might change dynamically. I'm looking for an ideal clustering algorithm which can handle these issues efficiently. I know sequential k-means clustering copes with the addition of new instances, and I suppose with minor modification it can work with dynamic instance values (i.e. first taking the modified instance from the respective cluster, …
Category: Data Science
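
The sequential k-means idea mentioned above can be sketched for 1-dimensional data with running-mean centroid updates. Since the excerpt is cut off, the handling of a changed value below (remove the old value from its cluster, then re-insert the new one) is only one plausible reading, and the class name `OnlineKMeans1D` and the seed centroids are assumptions.

```python
class OnlineKMeans1D:
    """Sequential k-means on scalars with incremental centroid updates."""

    def __init__(self, seeds):
        self.centroids = list(seeds)  # seeds are replaced by the first point assigned
        self.counts = [0] * len(self.centroids)

    def nearest(self, x):
        return min(range(len(self.centroids)),
                   key=lambda j: abs(x - self.centroids[j]))

    def add(self, x):
        """Assign x to the nearest centroid and update its running mean."""
        j = self.nearest(x)
        self.counts[j] += 1
        self.centroids[j] += (x - self.centroids[j]) / self.counts[j]
        return j

    def remove(self, x, j):
        """Take x out of cluster j, reversing the running-mean update."""
        n = self.counts[j]
        if n <= 1:
            self.counts[j] = 0  # cluster is now empty; keep the last centroid as a seed
            return
        self.centroids[j] = (self.centroids[j] * n - x) / (n - 1)
        self.counts[j] = n - 1

    def update(self, old, new, j):
        """Handle a changed value: remove the old value, then insert the new one."""
        self.remove(old, j)
        return self.add(new)


model = OnlineKMeans1D(seeds=[0.25, 0.75])  # hypothetical seeds on the unit interval
model.add(0.1)
cluster_of_02 = model.add(0.2)
model.add(0.9)
new_cluster = model.update(0.2, 0.8, cluster_of_02)
print(model.centroids, model.counts)
```

Each operation costs O(k), but other points are never reassigned when a centroid moves, so an occasional full k-means pass may still be needed to keep the clustering tight.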

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.