How to deal with name strings in large data sets for ML?

Question


Danny Abstemio

June 1, 2022, 23:04

My data set contains multiple columns with first name, last name, etc. I want to use a classifier model such as Isolation Forest later.

Word embedding techniques are mostly used for longer text sequences, not for single-word strings as in this case, so I don't think they would work well here. Label encoding and label binarization don't seem suitable for names either: label binarization produces too many columns because there are so many distinct values, and label encoding gives integers that allow no meaningful comparison between names.

Are there other approaches for using or transforming name information so that it works with ML algorithms?

Topic preprocessing classifier encoding nlp python

Category Data Science

spectre · Accepted Answer · December 21, 2021, 17:21

If you want to encode high-cardinality features, be careful with one-hot encoding. Using OneHotEncoder on a high-cardinality feature adds one column per distinct value, which increases your dimensionality to a large extent.

Instead, there are other encoders that avoid this issue. But beware, not all encoders work with all types of categorical data. Read the documentation carefully before using them!
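As one illustration of an encoder that keeps the dimensionality fixed, here is a minimal sketch of the hashing trick applied to character n-grams of a name. This is my own choice of example, not necessarily the encoder the answer had in mind. The function name `hash_name` and the parameters `n_buckets` and `ngram` are hypothetical, and the sketch uses only the standard library.

```python
import hashlib

def hash_name(name, n_buckets=16, ngram=2):
    """Map a name to a fixed-length count vector of hashed character n-grams.

    Hypothetical helper for illustration; n_buckets and ngram are
    assumptions you would tune for your own data.
    """
    # Normalize and add boundary markers so prefixes and suffixes count.
    text = f"^{name.strip().lower()}$"
    vec = [0] * n_buckets
    for i in range(len(text) - ngram + 1):
        gram = text[i:i + ngram]
        # md5 gives a hash that is stable across runs, unlike built-in hash().
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_buckets
        vec[bucket] += 1
    return vec

names = ["Anna", "Anne", "Zbigniew"]
vectors = [hash_name(n) for n in names]
```

Every name maps to a vector of length `n_buckets`, no matter how many distinct names the column contains, and similarly spelled names share most of their n-grams. The price is hash collisions: unrelated n-grams can land in the same bucket, so a larger `n_buckets` trades dimensionality for fewer collisions.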

Victor Oliveira · Accepted Answer · March 6, 2019, 12:21

Your problem is essentially that you have high cardinality in your features, right? The right choice depends on your problem, but you can look into mean encoding. Essentially, you replace each name with the mean of the target variable for that name. However, this is highly prone to overfitting, so you should take care.
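A minimal sketch of mean encoding, assuming a numeric target is available for every training row. The function names and the `smoothing` parameter are hypothetical. The smoothing step, which pulls the mean of rare names toward the global mean, is one common way to reduce the overfitting mentioned above, not the only one.

```python
from collections import defaultdict

def fit_mean_encoding(names, targets, smoothing=10.0):
    """Learn a smoothed target mean per name from training data only."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for name, y in zip(names, targets):
        sums[name] += y
        counts[name] += 1
    # Names with few rows are shrunk toward the global mean.
    encoding = {
        name: (sums[name] + smoothing * global_mean) / (counts[name] + smoothing)
        for name in counts
    }
    return encoding, global_mean

def apply_mean_encoding(names, encoding, global_mean):
    """Encode names; unseen names fall back to the global mean."""
    return [encoding.get(name, global_mean) for name in names]

# Toy data, invented purely for illustration.
train_names = ["Anna", "Anna", "Ben", "Cara"]
train_targets = [1, 0, 1, 0]
encoding, global_mean = fit_mean_encoding(train_names, train_targets, smoothing=2.0)
encoded = apply_mean_encoding(["Anna", "Ben", "Dmitri"], encoding, global_mean)
```

Fit the encoding on the training split only and then apply it to validation and test data. Computing the means on rows that are also used for evaluation leaks the target into the feature.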


However, depending on your application, I would also consider removing sensitive information such as names. Always ask yourself whether a feature actually makes sense.

I hope this helps. If you have any questions, leave a comment.
