Convert nominal to numeric variables?

I am trying to develop an algorithm with sklearn and TensorFlow to predict which car can be offered to each customer.

To do that, I have a database with the answers of 1000 customers to a survey.

Some examples of questions/[answers] are:

  1. Color/[Green,Red,Blue]
  2. NumberOfPax/[2,4,5,6,7]
  3. HorsePower/[Integer]
  4. InsuranceIncluded/[Yes,No,Don't know]

As you can see, the answers to all questions are predefined, and where an answer could be open I validate the value to be an integer or use a radio button.

The purpose of this design is that, despite the categorical variables, I can easily use sklearn to cluster the data.

Would it be a good approach to translate these categories to numerical values as an internal step and then cluster on those codes?

For example: Yes -- 0; No -- 1; Don't know -- 2

Then sklearn will cluster with all variables as numerical values.

I have considered this option because I believe that sklearn cannot cluster nominal data.

What do you think about this approach?

Topic numerical scikit-learn classification categorical-data machine-learning

Category Data Science


Clustering on categories is not something sklearn can do by default. Assigning sequential values to categories like that certainly won't help: clustering tends to work on distance, so by assigning 0, 1, 2 to Yes, No, Don't know, you are suggesting that Yes is 'closer' to No than it is to Don't know.
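To make the distance problem concrete, here is a small sketch using the coding from the question. The simple matching measure (0 if equal, 1 otherwise) is included only for contrast; it is an assumption of this example, not something sklearn applies for you.

```python
from itertools import combinations

# Integer coding proposed in the question; the order is arbitrary.
coding = {"Yes": 0, "No": 1, "Don't know": 2}


def numeric_distance(a, b):
    """Distance a numeric clustering algorithm would see after integer coding."""
    return abs(coding[a] - coding[b])


def matching_distance(a, b):
    """Simple matching: categories are either the same (0) or different (1)."""
    return 0 if a == b else 1


pairs = list(combinations(coding, 2))
numeric = {pair: numeric_distance(*pair) for pair in pairs}
matching = {pair: matching_distance(*pair) for pair in pairs}

for pair in pairs:
    print(pair, "numeric:", numeric[pair], "matching:", matching[pair])
```

With the integer coding, Yes and Don't know end up twice as far apart as Yes and No, purely because of the order in which the codes were handed out. Under simple matching, every pair of different answers is equally far apart.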

I highly recommend having a look at k-modes, a clustering algorithm for categorical data. Essentially, it optimises according to how common each set of categories is within the cluster (modes, rather than the means of values). E.g. it may find green people carriers (5/6/7 pax) in one cluster and red sports cars (2/4 pax) in another.
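The idea can be sketched in a few lines of plain Python. This is a toy illustration, not the implementation of any particular library: the helper names (`mismatches`, `column_modes`, `k_modes`), the made-up survey rows and the hand-picked starting modes are all assumptions of this example.

```python
from collections import Counter


def mismatches(a, b):
    """Number of attributes on which two rows disagree."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most common category in each column of a group of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(rows, initial_modes, max_iter=10):
    """Alternate between assigning rows to the nearest mode and updating modes."""
    modes = list(initial_modes)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)), key=lambda k: mismatches(row, modes[k]))
            for row in rows
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(modes)):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:  # keep the old mode if a cluster is empty
                modes[k] = column_modes(members)
    return labels, modes


# Hypothetical answers: (Color, NumberOfPax), with pax treated as a category.
cars = [
    ("Green", "7"),
    ("Green", "6"),
    ("Green", "7"),
    ("Red", "2"),
    ("Red", "2"),
    ("Red", "4"),
]

labels, modes = k_modes(cars, initial_modes=[cars[0], cars[3]])
print(labels)
print(modes)
```

On this toy data the green 6/7-seaters and the red 2/4-seaters end up in separate clusters, each described by its most common colour and seat count rather than by an average.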

There is a Python library for it, which also has links to some papers describing the algorithm. There is also k-prototypes, which clusters combined numerical and categorical data.
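For mixed data, the usual description of k-prototypes is a dissimilarity that adds a squared Euclidean term for the numerical attributes to a weighted mismatch count for the categorical ones. The sketch below only illustrates that combination; the function name, the `gamma` weight value and the pre-scaled horsepower numbers are assumptions made for the example.

```python
def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0):
    """Squared Euclidean distance on numeric parts plus gamma * categorical mismatches."""
    numeric_part = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    categorical_part = sum(x != y for x, y in zip(a_cat, b_cat))
    return numeric_part + gamma * categorical_part


# Hypothetical customers: ([scaled HorsePower], (Color, InsuranceIncluded)).
car_a = ([1.0], ("Green", "Yes"))
car_b = ([0.5], ("Red", "Yes"))

d = mixed_dissimilarity(car_a[0], car_a[1], car_b[0], car_b[1], gamma=0.5)
print(d)
```

The weight `gamma` controls how much one categorical mismatch counts relative to the numerical differences, so the numerical columns generally need to be on a comparable scale before the two parts are added.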
