Entity embeddings of email addresses

I have a set of email addresses, e.g. [email protected], [email protected], [email protected], [email protected], ...

Is it possible to apply ML or mathematics to generate a category (as NER does) from the ID (the part before the @)? The problem with a straightforward application of NER is that the email addresses are not proper English.

Topic named-entity-recognition nlp machine-learning

Category Data Science


Named entity recognition (NER) categorizes proper nouns within an extended context, typically a sentence. If you only have email addresses, NER techniques will not work.

The problem could be framed as multi-class classification: predicting one of several labels from a collection of features. The labels are {Person, Company, Place}. The features are derived from the parts of the email address (i.e., local-part and domain).
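As a rough sketch of what such features might look like, the function below splits an address into local-part and domain and derives a few simple values. The function name `email_features`, the chosen features, and the `example.com` domain are illustrative assumptions, not something given in the question.

```python
def email_features(address):
    """Split an address into local-part and domain and derive simple features."""
    if "@" not in address:
        raise ValueError("not an email address: " + address)
    # Split on the last "@" so the domain is everything after it.
    local_part, _, domain = address.rpartition("@")
    return {
        "local_part": local_part.lower(),
        "domain": domain.lower(),
        "local_length": len(local_part),
        "digit_count": sum(ch.isdigit() for ch in local_part),
        "has_separator": any(ch in "._-" for ch in local_part),
    }


# Hypothetical address; the domain is a placeholder.
features = email_features("AgraTextile@example.com")
print(features)
```

A dictionary like this could be fed to any standard multi-class classifier after encoding the string-valued fields.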

One difficult issue is generalization. Can a model learn that a certain series of letters indicates a person rather than a company? Probably not to a high level of performance. One way to improve the model is to add more and better features, for example the length of the email or the number of recipients.


It is possible, but you would need a lot of training data to reach a good result, because there is a wide variety of family and company names.

Fortunately, there could be an efficient way to get a good classification.

My advice is to handle human-name recognition on one side and company-name recognition on the other, and then apply ML.

For human names, there are plenty of datasets of family names and first names that you can look for in the fields (e.g., Gupta is recognized in "guptamols" => Name).

For company names, you can use dictionaries in English or any other language to detect a lot of names (e.g., textile is recognized in AgraTextile).
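The two lookups above can be sketched as a simple substring match against word lists. The lists here are tiny, made-up placeholders; in practice they would come from a name dataset and a dictionary, and the rule for ambiguous matches is only one possible choice.

```python
# Hypothetical, tiny lexicons for illustration only.
KNOWN_NAMES = {"gupta", "sharma", "peter"}
COMPANY_WORDS = {"textile", "traders", "industries"}


def weak_label(local_part):
    """Return "Person", "Company", or None when the match is missing or ambiguous."""
    text = local_part.lower()
    has_name = any(name in text for name in KNOWN_NAMES)
    has_company_word = any(word in text for word in COMPANY_WORDS)
    if has_name and not has_company_word:
        return "Person"
    if has_company_word and not has_name:
        return "Company"
    return None  # leave unlabelled rather than guess


for local in ["guptamols", "AgraTextile", "xyz123"]:
    print(local, "->", weak_label(local))
```

Plain substring matching will produce false positives with short names, so a minimum name length or token-boundary check may be worth adding.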

Once you have done this safe classification, you will have a lot of valuable labelled data from which an NLP model could learn patterns in order to classify the remaining unknown data. A model like BERT is one option; I would recommend a byte-by-byte embedding, as there could be special characters in company names.

Note: Such models give a probability for each class, which can be used to limit the risk of wrong classification.
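One way to use those probabilities, sketched under the assumption that the model returns a mapping from label to probability, is to accept a prediction only above a confidence threshold. The function name `decide`, the 0.8 threshold, and the "Unknown" fallback are illustrative choices.

```python
def decide(probabilities, threshold=0.8):
    """Return the most probable label, or "Unknown" if it is not confident enough."""
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    return label if confidence >= threshold else "Unknown"


# Hypothetical model outputs.
print(decide({"Person": 0.90, "Company": 0.05, "Place": 0.05}))
print(decide({"Person": 0.50, "Company": 0.40, "Place": 0.10}))
```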


Well, it's possible, but it wouldn't work: NER models rely on cues in the text close to the entity. For example, a model finds X to be a location in the sentence "Peter went to X by train" because "to go to" is likely to be followed by a location (and "by train" makes it even more likely). The problem is that an email address carries no such context information about the category.

I think regular classification would be more likely to work, but the main question is how to represent the strings so that the model can distinguish the categories. It could be done with character n-grams or maybe character embeddings, but it is not certain that this would work well.
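For the character n-gram option, a minimal sketch of the representation could look like this. The function name `char_ngrams` and the `^`/`$` boundary markers are assumptions made for illustration; the resulting counts would still need to be vectorized and passed to a classifier.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, with ^ and $ marking the string boundaries."""
    padded = "^" + text.lower() + "$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


a = char_ngrams("gupta")
b = char_ngrams("guptamols")
# N-grams shared by both strings hint at a common substring.
print(sorted(set(a) & set(b)))
```

Strings that share many n-grams end up with similar feature vectors, which is what would let a classifier generalize from known names to unseen variants.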
