Using scikit-learn Clustering and/or Random-Forest Classification on String Data with Multiple Sub-Classifications

Question

Lawrence

May 18, 2022, 00:28

I have a set of data with some numerical features and some string data. The string data is essentially a set of classes that are not inherently related. For example:

Sample_1,0.4,1.2,kitchen;living_room;bathroom
Sample_2,0.8,1.0,bedroom;living_room
Sample_3,0.5,0.9,None

I want to implement a classification method that uses these string sub-classes as a feature; however, I don't want them to be numerically related, and I don't want comparisons to be based directly on the string itself. Additionally, samples with no data in this column should not be treated as inherently related to one another.

Is there a way to implement these features as classes in a way that doesn't rely on a distance metric? I originally wanted to try converting the classes directly to numerical data, but I am worried that class 1 would then arbitrarily be considered more closely related to class 2 than to class 43.

Topic machine-learning-model multiclass-classification scikit-learn python machine-learning

Category Data Science

sandyp · Accepted Answer · May 18, 2022, 00:28

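You can use something called "dummy encoding" (closely related to one-hot encoding): each distinct class becomes its own 0/1 column, so no ordering or distance between class labels is implied, and a sample with no classes is simply all zeros.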
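As a rough illustration of dummy encoding on the sample rows from the question, here is a minimal sketch using pandas `Series.str.get_dummies`. The column names `x1`, `x2`, and `rooms` are made up for the example, and it assumes the literal `None` in the third row means "no data" rather than a class of its own.

```python
import pandas as pd

# Toy data mirroring the three samples in the question.
# Column names (x1, x2, rooms) are assumptions for illustration only.
df = pd.DataFrame(
    {
        "x1": [0.4, 0.8, 0.5],
        "x2": [1.2, 1.0, 0.9],
        "rooms": ["kitchen;living_room;bathroom", "bedroom;living_room", None],
    },
    index=["Sample_1", "Sample_2", "Sample_3"],
)

# Split each string on ';' and create one 0/1 column per distinct label.
# A missing value yields an all-zero row, so samples without data do not
# end up sharing a common "None" category.
room_dummies = df["rooms"].str.get_dummies(sep=";")

# Combine the numeric features with the indicator columns.
features = pd.concat([df[["x1", "x2"]], room_dummies], axis=1)
print(features)
```

The resulting `features` table is fully numeric, so it should be usable as input to a tree-based model such as scikit-learn's `RandomForestClassifier`; because each label is a separate binary column, a split on one label says nothing about any other label.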
