How do I train an NER model to extract free-form book citations?
I'm working on a project to build a graph visualization of free-form citations (not academic-style citations) across all my e-books. For example, David Foster Wallace's essays cite many other books by different authors. To do that, I need to detect and extract book titles and author names from my own e-books.
Here are some examples from my e-books. The spans I would like my NER model to tag as books are "The New Republic" in the first and "the Republic" in the second:
(...) or even the parodistic version of Pater to be found in W. H. Mallock’s The New Republic (...)
Plato words the same conception beautifully in the Republic: (...)
I also want to tag authors, but I suppose this could be done out of the box with spaCy or another NLP library, using a pre-trained PERSON label.
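To make the out-of-the-box idea concrete, here is a minimal sketch of how I imagine filtering a pre-trained pipeline's entities by label. The helper name `extract_entities` is my own, and the usage in the docstring assumes spaCy with the `en_core_web_sm` model installed; I have not checked how well its PERSON predictions hold up on literary prose.

```python
from typing import Callable, List, Set, Tuple


def extract_entities(nlp: Callable, text: str, wanted: Set[str]) -> List[Tuple[str, str]]:
    """Return (entity text, label) pairs whose label is in `wanted`.

    `nlp` is any spaCy-style pipeline: calling it on a string returns a
    document whose `.ents` items expose `.text` and `.label_`.

    Assumed usage (requires spaCy and a downloaded English model):
        import spacy
        nlp = spacy.load("en_core_web_sm")
        extract_entities(nlp, "Plato words the same conception ...", {"PERSON"})
    """
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in wanted]
```

Passing the pipeline in as an argument keeps the helper independent of which model ends up being used.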
So my question is: what is the best approach to creating this NER model? I see two options:
I could annotate a large number of training samples from my books and train a new NER model (very time-consuming).
Or, if there is a public dataset or model with a BOOK label, or something like WORK_OF_ART, I could use it to bootstrap my own dataset.
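Whichever option I pick, I would end up with character-offset annotations. Below is a minimal sketch of how I imagine producing them from a list of known titles, in the `(text, {"entities": [(start, end, label)]})` shape that, as far as I know, spaCy's `Example.from_dict` accepts. The helper `make_example` and the `BOOK` label are my own naming, and plain string matching is only an assumption for getting started; it would not resolve overlapping titles.

```python
from typing import Dict, List, Tuple

Annotation = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def make_example(text: str, titles: List[str], label: str = "BOOK") -> Annotation:
    """Mark every exact occurrence of each known title in `text` with `label`."""
    entities = []
    for title in titles:
        if not title:
            continue  # an empty title would match everywhere
        start = text.find(title)
        while start != -1:
            entities.append((start, start + len(title), label))
            start = text.find(title, start + len(title))
    return text, {"entities": sorted(entities)}


# The two sample sentences from this question, annotated by title lookup.
examples = [
    make_example(
        "(...) or even the parodistic version of Pater to be found in "
        "W. H. Mallock’s The New Republic (...)",
        ["The New Republic"],
    ),
    make_example(
        "Plato words the same conception beautifully in the Republic: (...)",
        ["Republic"],
    ),
]
```

Candidate spans proposed by a pre-trained WORK_OF_ART label could be reviewed by hand and stored in the same shape.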
What do you think of these approaches?
Topic information-extraction spacy named-entity-recognition nlp
Category Data Science