Extracting events with attributes from unstructured text
I am scraping the websites of organisations (mostly retailers) and I want to use NLP to extract information from their unstructured text. The first thing I want to do is identify COVID-related events in the text, for example “The shop will be closed from the 3rd of March” or “Unfortunately we have to close permanently.” The lexicon is rather limited, involving perhaps a few dozen (or a few hundred at most) phrases and expressions.
I am very familiar with regular expressions, and I think a rule-based approach could extract some events and their attributes (e.g., dates), particularly with a small lexicon. However, the limitations of rules are obvious (it is easy to miss expressions with small variations), so I would also like to use some ML approaches. I am familiar with ML approaches like sentiment analysis and topic modelling, but they seem to be designed for classification problems rather than for extracting specific attributes and data points from text. I also know that NER would work well for getting dates and place names, for example, but not for events (e.g., the closure of shop x on date y).
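For illustration, a rule-based baseline of the kind described above might look like the following minimal sketch. The trigger lexicon, the date pattern, the `extract_closures` function and the output fields are all hypothetical choices made for this example, not an established scheme, and it uses only Python's standard `re` module.

```python
import re

# Hypothetical mini-lexicon of closure triggers; a real one would be larger.
TRIGGER = r"(?:closed?|closing|shut(?:ting)?(?:\s+down)?)"

# Hypothetical date pattern covering forms like "the 3rd of March" or "3 March".
MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")
DATE = (rf"(?:the\s+)?(?P<day>\d{{1,2}})(?:st|nd|rd|th)?\s+"
        rf"(?:of\s+)?(?P<month>{MONTHS})")

# A trigger, optionally followed by "permanently", optionally followed by
# a preposition and a date. Anything between trigger and date breaks the match
# of the date part, which is exactly the brittleness of hand-written rules.
CLOSURE = re.compile(
    rf"\b(?P<trigger>{TRIGGER})\b"
    rf"(?:\s+(?P<permanent>permanently))?"
    rf"(?:\s+(?:from|on|until)\s+{DATE})?",
    re.IGNORECASE,
)

def extract_closures(text):
    """Return one dict per closure trigger found, with any attached attributes."""
    events = []
    for m in CLOSURE.finditer(text):
        events.append({
            "type": "closure",
            "trigger": m.group("trigger").lower(),
            "permanent": m.group("permanent") is not None,
            "day": int(m.group("day")) if m.group("day") else None,
            "month": m.group("month"),
        })
    return events

if __name__ == "__main__":
    for sentence in [
        "The shop will be closed from the 3rd of March",
        "Unfortunately we have to close permanently.",
        "We are shutting our doors on 3 March",
    ]:
        print(sentence, "->", extract_closures(sentence))
```

With these assumed patterns, the first two sentences yield a closure event with its date or its "permanent" flag, while the third is detected as a closure but loses its date because "our doors" sits between the trigger and the preposition. That is the kind of small variation that makes a purely rule-based approach easy to break.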
Are there smarter ways to do this kind of NLP that go beyond manually defining many regular expressions? Perhaps some form of lexical pattern learning from annotated examples?
Topic information-extraction text-mining nlp
Category Data Science