What is the best way to engineer features that can hold more than one value?

I want to parse this data and keep it in a pandas DataFrame for further analysis. For example, I have people's profile data, which consists of:

Name, Age, Gender, Company, Degree

Name, Age and Gender are easy to store because each holds a single value, but Company can hold one value or several: someone may have worked at Google, at Microsoft, or at both.

The same applies to Degree: people can have a single degree or several.

Right now I keep them as comma-separated values, so someone with more than one company gets the value "Google, Microsoft". When I encode them with, say, sklearn's LabelEncoder, I get different codes, such as Google = 1, Microsoft = 2 and "Google, Microsoft" = 3.

I guess this is not very accurate: as the data grows, the number of combinations will explode. Also, if I want to find everyone who worked at Google, I might not get the correct answer, because code 1 and code 3 will never match.
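The behaviour described above can be reproduced with a minimal sketch, assuming scikit-learn is installed and using made-up sample values. Note that LabelEncoder assigns zero-based codes in sorted order, so the exact integers differ from the ones quoted above, but the problem is the same: the combined string gets its own unrelated code.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values, for illustration only.
companies = ["Google", "Microsoft", "Google, Microsoft"]

# Each distinct string, including the combined one, gets its own integer code.
codes = LabelEncoder().fit_transform(companies)
code_of = {name: int(code) for name, code in zip(companies, codes)}

# The code for "Google, Microsoft" shares nothing with the code for "Google",
# so filtering on Google's code misses people who worked at both companies.
matches_google = [name for name in companies if code_of[name] == code_of["Google"]]
print(code_of)
print(matches_google)
```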

Is there a better way to handle such data?

Topic feature-engineering machine-learning

Category Data Science


Which method is most suitable depends on the kind of ML problem you're facing.

Have you tried one-hot encoding? It should answer your question: each person is represented as a binary vector filled with 0s except at the coordinates of the companies they worked at, which hold a 1 (with several 1s per row, this is often called multi-hot encoding). It's a fairly brute-force approach, and the drawback is that it increases the dimension of your feature vectors by one column per distinct company, but it's a good starting point.
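Here is a minimal sketch of this encoding for the multi-valued Company column, assuming pandas and made-up sample data (the names and values are illustrative only). It uses `Series.str.get_dummies`, which splits each string on a separator and creates one 0/1 column per distinct value.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Company": ["Google", "Microsoft", "Google, Microsoft"],
})

# Normalise spacing around commas, then build one 0/1 column per company.
company_flags = (
    df["Company"]
    .str.replace(r"\s*,\s*", ",", regex=True)
    .str.get_dummies(sep=",")
)

# Replace the raw string column with the indicator columns.
encoded = pd.concat([df.drop(columns="Company"), company_flags], axis=1)

# Everyone who worked at Google, alone or alongside other companies.
google_people = encoded.loc[encoded["Google"] == 1, "Name"].tolist()
print(encoded)
print(google_people)
```

The same approach should work for the Degree column. If you would rather stay within scikit-learn, `MultiLabelBinarizer` produces the same kind of indicator matrix from lists of values.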
