Does training NER with NLTK on a custom (non-English) corpus require StanfordNER?

Question

Mico S

February 16, 2022, 18:00

I have searched for how to train an NER model on a custom corpus using Python's NLTK library, but all of the answers point to chapter 7 of the NLTK book. Honestly, I am still confused about the correct workflow for training on a data set structured like the one below:

Eddy N B-PER
Bonte N I-PER
is V O
woordvoerder N O
van Prep O
diezelfde Pron O
Hogeschool N B-ORG
. Punc O

I have some questions:

1. Many articles about training on a custom corpus with NLTK also use the StanfordNER library. Is that required, or can it be done with NLTK alone?
2. Should grammar patterns be included when applying this to other languages? What does the workflow look like?

If you have one, please share example code that trains on a custom corpus and outputs both the POS tag and the NER label, using data structured like the sample above. Thank you.

Topic nltk named-entity-recognition nlp

Category Data Science

Erwan · Accepted Answer · January 11, 2021, 22:52

It's true that the NLTK book isn't very clear about this. Traditionally, NER models are trained with Conditional Random Fields (CRFs), so I searched for "nltk crf" and found this SO question, which points to this detailed example for NER.

To answer your questions:

1. NLTK itself doesn't appear to provide its own CRF implementation; the linked example relies on an interface to CRFSuite (as mentioned in the SO question). It's probably possible to use other interfaces as well, such as StanfordNER.
2. The tricky part is defining your own features, i.e. the conditions that the model uses as features at every step in the sequence. This is how you can encode any specific "grammar rule" you want the model to use.
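To make the second point more concrete, here is a minimal sketch of what a feature function could look like. It assumes each sentence is a list of (word, POS) pairs taken from the first two columns of the data in the question. The function name `token_features` and the particular features are illustrative choices of mine, not something taken from the linked example.

```python
def token_features(sentence, i):
    """Return a list of string features for the token at position i.

    `sentence` is assumed to be a list of (word, pos) pairs,
    e.g. ("Hogeschool", "N"). The feature set is only an illustration.
    """
    word, pos = sentence[i]
    features = [
        "bias",
        "word.lower=" + word.lower(),
        "suffix3=" + word[-3:],
        "pos=" + pos,
    ]
    if word[:1].isupper():
        features.append("is_capitalized")
    if word.isdigit():
        features.append("is_digit")

    # Context features: previous and next token, or a sentence-boundary marker.
    if i > 0:
        prev_word, prev_pos = sentence[i - 1]
        features.append("prev.word.lower=" + prev_word.lower())
        features.append("prev.pos=" + prev_pos)
    else:
        features.append("BOS")  # beginning of sentence
    if i < len(sentence) - 1:
        next_word, next_pos = sentence[i + 1]
        features.append("next.word.lower=" + next_word.lower())
        features.append("next.pos=" + next_pos)
    else:
        features.append("EOS")  # end of sentence
    return features


sentence = [("Eddy", "N"), ("Bonte", "N"), ("is", "V"), ("woordvoerder", "N")]
print(token_features(sentence, 0))
```

Language-specific knowledge goes into this function. For Dutch, for instance, you might add typical prefixes or suffixes, or a lookup in a list of known names, instead of writing explicit grammar patterns.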

The linked example looks complete, but I haven't tested it.
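For the data layout in the question, a rough sketch of the overall flow could look like the code below. It rests on several assumptions: one token per line with three whitespace-separated columns (word, POS tag, entity label), a blank line between sentences, and `nltk` plus `python-crfsuite` installed for the training step. `nltk.tag.CRFTagger` is a thin wrapper around python-crfsuite, so CRFSuite still does the actual work. The helper names `read_sentences` and `train_ner_tagger` and the model file name are hypothetical, and I have not run the training step on a real corpus.

```python
def read_sentences(lines):
    """Parse 'word POS label' lines into sentences.

    A blank line is assumed to end a sentence. Each sentence is a list
    of (word, pos, label) tuples.
    """
    sentences, current = [], []
    for line in lines:
        parts = line.split()
        if not parts:  # blank line: close the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if len(parts) != 3:
            raise ValueError("expected 3 columns, got: %r" % line)
        current.append(tuple(parts))
    if current:  # last sentence without a trailing blank line
        sentences.append(current)
    return sentences


def train_ner_tagger(sentences, model_path="ner.crf.model"):
    """Train NLTK's CRFTagger to predict the entity label of each word.

    Needs `pip install nltk python-crfsuite`; imported lazily so the
    parsing code above works without them.
    """
    from nltk.tag import CRFTagger

    # CRFTagger expects a list of sentences, each a list of (token, label) pairs.
    train_data = [[(word, label) for word, _pos, label in sent] for sent in sentences]
    tagger = CRFTagger()  # uses NLTK's default feature function
    tagger.train(train_data, model_path)
    return tagger


SAMPLE = """\
Eddy N B-PER
Bonte N I-PER
is V O
woordvoerder N O
van Prep O
diezelfde Pron O
Hogeschool N B-ORG
. Punc O
"""

sentences = read_sentences(SAMPLE.splitlines())
print(len(sentences), sentences[0][0])

# With the libraries installed and a realistically sized corpus:
# tagger = train_ner_tagger(sentences)
# print(tagger.tag(["Eddy", "Bonte", "is", "woordvoerder"]))
```

A POS tagger can be trained the same way by using the second column as the label instead of the third. `CRFTagger` also accepts a `feature_func` argument if the default features don't suit your language.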
