How to go about training a NER model to extract book citations from free-form text?

I'm doing a project where I wish to create a graph visualization of free-form citations (not academic-style citations) across all my e-books. For example, David Foster Wallace's essays cite many other books by different authors. For that, I need to be able to detect and extract book titles and author names from my e-books.

I've selected some examples from my e-books of what I wish my NER model would tag as books (in bold):

(...) or even the parodistic version of Pater to be found in W. H. Mallock's **The New Republic** (...)

Plato words the same conception beautifully in **the Republic**: (...)

I also wish to tag authors, but I suppose this could be done out of the box with spaCy or another NLP library, using a pre-trained PERSON tag.
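For reference, this is roughly what the out-of-the-box route looks like: a minimal sketch using spaCy's pre-trained English pipeline, which also ships a WORK_OF_ART label (though its recall on book titles tends to be limited). It assumes the small model en_core_web_sm has been downloaded; the sentence is one of the examples above.

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("or even the parodistic version of Pater to be found "
        "in W. H. Mallock's The New Republic")

doc = nlp(text)
for ent in doc.ents:
    # PERSON covers author mentions; WORK_OF_ART sometimes covers titles
    if ent.label_ in ("PERSON", "WORK_OF_ART"):
        print(ent.text, ent.label_)
```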

So, my question is about the best approach to go about creating this NER model.

  • I could create lots and lots of training samples from my books (e.g. in the annotation format sketched after this list) and train a new NER model. (very time-consuming)

  • Or, if there is a dataset or a public model with a BOOK tag or something like WORK_OF_ART, I could bootstrap my own dataset.
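To make the first option concrete, here is a minimal sketch of preparing hand-labelled examples in spaCy v3's training format. The BOOK label, the sample sentence, and the file name train.spacy are placeholders, not a fixed convention.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# One hand-annotated example; offsets are computed from the title string
text = "Plato words the same conception beautifully in the Republic:"
title = "the Republic"
start = text.find(title)
TRAIN_DATA = [(text, [(start, start + len(title), "BOOK")])]

doc_bin = DocBin()
for sample_text, entities in TRAIN_DATA:
    doc = nlp.make_doc(sample_text)
    spans = [doc.char_span(s, e, label=label) for s, e, label in entities]
    doc.ents = [span for span in spans if span is not None]  # drop misaligned spans
    doc_bin.add(doc)

# Serialized corpus that `python -m spacy train` can consume
# (a config file and a dev set are also needed for actual training)
doc_bin.to_disk("train.spacy")
```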

What do you think about these approaches?

Topic: information-extraction, spacy, named-entity-recognition, nlp

Category: Data Science


Interesting task :)

I think that even with a good amount of training data, it will be difficult for a regular NER model to perform well on new book titles and authors:

  • The books may contain names of people who are not authors.
  • Book titles are difficult to identify as such in general. For example, "the Republic" might or might not refer to the book, and if the only clue the model can use is the capitalization, it is probably going to make some errors.

To be clear, I think it could work to some extent, but it would probably make quite a lot of errors.

On the other hand, you could obtain a database of books, for instance from Wikipedia (there might be better resources), and use it in two ways:

  1. Directly identify the books/authors in the documents by simple string matching (see the sketch after this list). I would imagine that even if the coverage of the resource is not perfect, this method would easily catch the majority of occurrences.
  2. In case the above method is not sufficient, it also provides you with good training data from which you could train a NER model in order to collect titles which don't exist in the database. Note that there might be issues due to unknown books being labelled as negative in the training data, so ideally you would have to go through the training data manually and annotate the remaining cases.
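To illustrate point 1, here is a rough sketch of dictionary-based matching with spaCy's PhraseMatcher. The title list is a small placeholder for whatever book database you end up obtaining, and the BOOK match key is arbitrary.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Placeholder gazetteer: in practice this would come from Wikipedia,
# Wikidata, Open Library, or another catalogue of book titles.
book_titles = ["The New Republic", "The Republic", "Infinite Jest"]

# attr="LOWER" makes the matching case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("BOOK", [nlp.make_doc(title) for title in book_titles])

text = ("or even the parodistic version of Pater to be found in "
        "W. H. Mallock's The New Republic")
doc = nlp(text)

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print(span.text, "->", nlp.vocab.strings[match_id])
```

The resulting match spans can also be converted into the annotation format shown earlier, which would give you the bootstrapped training data mentioned in point 2.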
