Using scikit-learn FeatureHasher

I have a huge data set with a column named 'mail_id'. The values come in a rather cryptic format, as shown below:

mail_id
DQ/4I+GIOz2ZoIiK0Lg0AkwnI35XotghgUK/MYc101I=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
EHNBRbi6i9KO6cMHsuDPFjZVp2cY3RH+BiOKwPwzLQs=
K0y/NW59TJkYc5y0HUwDeAXrewYT0JQlkcozz0s2V5Q=
UGATDXARg7jMEInKH7oXgty2nwxnwD2l0OW/8Nsa0MI=
qE9zgWiITYA97RUiN4X/t9IVWLViLz+lKijaYegyBiQ=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
4+EEK8RbNYwuFCHznY9XSRCV4Yek60bHVgnP3jtjjzk=

After doing a lot of analysis on my data, I have found that I cannot drop this feature from my model, so I have to convert it to something meaningful. Can anyone please explain how to do this efficiently?

Topic hashing-trick feature-engineering scikit-learn feature-extraction machine-learning

Category Data Science


I'd say there are pros and cons to using FeatureHasher for this purpose. If you really want to use it, just instantiate it like this:

In [1]:
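from sklearn.feature_extraction import FeatureHasher
# mail_id: an iterable of id strings, e.g. the 'mail_id' column
h = FeatureHasher(n_features=5, input_type='string')
# with input_type='string' each sample must be an iterable of strings,
# so wrap every id in its own one-element list
f = h.transform([[m] for m in mail_id])
f.toarray()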

Out[1]:
array([[ 1.,  0.,  0.,  0.,  0.],
      [ 0., -1.,  0.,  0.,  0.],
      [ 1.,  0.,  0.,  0.,  0.],
      [ 0.,  0., -1.,  0.,  0.],
      [ 0., -1.,  0.,  0.,  0.]])

So, once you have instantiated it, just .transform each incoming mail_id and use the result in downstream applications (online learning, for instance). Obviously, n_features is a knob to tune. But this has its flip side: the cardinality of mail ids is a priori high, so unless you have a very limited number of users, you will need an enormous n_features to keep collisions rare.
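
To get a rough feel for how large n_features has to be, here is a minimal back-of-the-envelope sketch. It assumes every distinct id is hashed uniformly and independently into one of n_features buckets, which is an idealisation and not a statement about the exact hash function scikit-learn uses. The helper name expected_collisions and the figure of 100,000 distinct ids are made up for illustration.

```python
def expected_collisions(n_ids: int, n_features: int) -> float:
    """Expected number of ids that share a bucket with an earlier id,
    assuming uniform, independent hashing (an idealised assumption)."""
    # Expected number of buckets that receive at least one id.
    occupied = n_features * (1.0 - (1.0 - 1.0 / n_features) ** n_ids)
    # Every id beyond the first one in a bucket counts as a collision.
    return n_ids - occupied


# Hypothetical cardinality: 100,000 distinct mail ids.
for n_features in (2**10, 2**16, 2**20, 2**24):
    c = expected_collisions(100_000, n_features)
    print(f"n_features={n_features:>10,d}  expected collisions ~ {c:,.0f}")
```

Under that assumption the number of collisions only becomes small once n_features is far larger than the number of distinct ids, which is the flip side described above.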

A better approach would be to take logs where your ids co-occur and learn an item2vec-style model. This will give a much denser (and more meaningful) representation of mail_ids than FeatureHasher would.

Also, take a look at this.
