Using scikit-learn Clustering and/or Random Forest Classification on String Data with Multiple Sub-Classifications

I have a set of data with some numerical features and some string data. The string data is essentially a set of classes that are not inherently related. For example:

Sample_1,0.4,1.2,kitchen;living_room;bathroom
Sample_2,0.8,1.0,bedroom;living_room
Sample_3,0.5,0.9,None

I want to implement a classification method that uses these string subclasses as a feature; however, I don't want them to be numerically related, and I don't want comparisons to be based on the string itself. Additionally, samples with no data in this column should not be treated as related to each other just because the value is missing.

Is there a way to implement these features as classes in a way that doesn't rely on a distance metric? I originally wanted to convert the classes directly to integers, but I am worried that the arbitrary numbering would make class 1 appear more closely related to class 2 than to class 43.

Topic machine-learning-model multiclass-classification scikit-learn python machine-learning

Category Data Science


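You can use something called "dummy encoding" (also known as one-hot encoding). Each distinct class becomes its own 0/1 column, so no ordering or distance between classes is implied. Since a sample can belong to several classes at once, it simply gets a 1 in each matching column, and a sample with no classes gets all zeros.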
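As a rough sketch of what that could look like for the data in the question, assuming the classes are separated by `;` and that missing entries appear as `None`: the column names `f1`, `f2` and `rooms` are made up for illustration, and pandas' `Series.str.get_dummies` is just one way to do the split (scikit-learn's `MultiLabelBinarizer` is another).

```python
import pandas as pd

# Toy data mirroring the question; the column names are hypothetical.
df = pd.DataFrame(
    {
        "sample": ["Sample_1", "Sample_2", "Sample_3"],
        "f1": [0.4, 0.8, 0.5],
        "f2": [1.2, 1.0, 0.9],
        "rooms": ["kitchen;living_room;bathroom", "bedroom;living_room", None],
    }
)

# Treat missing values (and the literal string "None") as "no classes".
rooms = df["rooms"].fillna("").replace("None", "")

# One 0/1 column per distinct class; a sample may have several 1s.
dummies = rooms.str.get_dummies(sep=";")

# Numeric features plus the indicator columns form the feature matrix.
X = pd.concat([df[["f1", "f2"]], dummies], axis=1)
print(X)
```

The resulting `X` can then be passed to an estimator such as `RandomForestClassifier`. Tree-based models split on one column at a time, so the 0/1 indicators do not introduce any notion of distance between classes.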
