How to train a machine learning model for named entity recognition

I cannot find any sources on the architectures of machine learning models used for NER. I vaguely know it is a multiclass classification problem, but how do we format the input for such a multiclass classifier? I know the input must be an annotated corpus, but how do we feed those (word, entity label) pairs into the classifier? How do you feature-engineer such a corpus for an ML model? More generally, how can you train a custom NER model from scratch with machine learning?

TIA.

Topic named-entity-recognition nlp machine-learning

Category Data Science


There are actually many libraries for training NER models.

  • It's useful to know that this type of task is called sequence labeling: the model predicts a label for every word, taking into account the other words close to the target word.
  • The standard method is Conditional Random Fields (CRF). There are various libraries, see for example this answer.
  • Traditionally, the annotated data is written in a format called BIO (sometimes IOB), which stands for Begin, Inside, Outside: each token is labeled as beginning an entity, continuing one, or lying outside any entity (see a very short example). The features for a token can include its context words through custom patterns (see the documentation of the libraries for details).
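
To make the points above concrete, here is a minimal sketch of how an annotated sentence could be turned into BIO labels and per-token features. The sentence, the span annotations, the helper names (`spans_to_bio`, `token_features`) and the particular features are all made up for illustration; real libraries have their own templates and conventions.

```python
from typing import Dict, List, Tuple

# Hypothetical annotated sentence: tokens plus entity spans
# given as (start index, end index exclusive, entity type).
tokens = ["Barack", "Obama", "visited", "New", "York", "."]
spans = [(0, 2, "PER"), (3, 5, "LOC")]


def spans_to_bio(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans into one BIO label per token."""
    labels = ["O"] * n_tokens  # everything is Outside by default
    for start, end, ent_type in spans:
        labels[start] = f"B-{ent_type}"  # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{ent_type}"  # remaining tokens of the entity
    return labels


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for token i, including its neighbours."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }
    if i > 0:
        feats["prev.lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next.lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats


labels = spans_to_bio(len(tokens), spans)
X = [token_features(tokens, i) for i in range(len(tokens))]

# Print the sentence in the usual one-token-per-line BIO layout.
for tok, lab in zip(tokens, labels):
    print(f"{tok}\t{lab}")
```

One sentence thus becomes a list of feature dicts (`X`) and a parallel list of labels. A training set is a list of such sentences, which is the kind of input that CRF libraries such as sklearn-crfsuite accept; check the documentation of the library you choose for the exact format.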
