Extracting events with attributes from unstructured text

Question

Strabonio

28 February 2021, 18:00

I am scraping websites of organisations (mostly retailers) and I want to use NLP to extract information from the websites’ unstructured text. The first thing I want to do is to identify covid-related events in the text, for example “The shop will be closed from the 3rd of March” or “Unfortunately we have to close permanently.” The lexicon is rather limited, involving perhaps a few dozen (or hundreds at most) phrases/expressions.

I am very familiar with regular expressions, and I think a rule-based approach could extract some events and their attributes (e.g., dates), particularly with a small lexicon. However, the limitations of rules are obvious (it is easy to miss expressions with small variations), so I would also like to use some ML approaches. I am familiar with ML approaches like sentiment analysis and topic modelling, but they seem to be designed for classification problems rather than for this kind of extraction of specific attributes and data points from text. I also know NER, which would work well for getting dates and place names, for example, but not for events (e.g., closure of shop x on date y).
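To make the rule-based idea concrete, here is a minimal sketch of the kind of rule I have in mind. The lexicon, the date format and the function name `extract_closures` are illustrative assumptions, not a tested production rule set.

```python
import re

# Hypothetical mini-lexicon of closure triggers; a real one would be larger.
CLOSURE = r"(?:closed?|closing|shut(?:ting)?(?: down)?)"
MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")
# Assumed date shape: optional "the", day number, optional suffix/"of", month name.
DATE = rf"(?:the )?\d{{1,2}}(?:st|nd|rd|th)?(?: of)? {MONTHS}"

PATTERN = re.compile(
    rf"\b(?P<trigger>{CLOSURE})\b"
    rf"(?:\s+(?P<permanent>permanently))?"
    rf"(?:\s+(?:from|on|until)\s+(?P<date>{DATE}))?",
    re.IGNORECASE,
)

def extract_closures(text):
    """Return one dict per closure trigger found, with optional attributes."""
    events = []
    for m in PATTERN.finditer(text):
        events.append({
            "trigger": m.group("trigger"),
            "permanent": m.group("permanent") is not None,
            "date": m.group("date"),  # None when no date follows the trigger
        })
    return events

for sentence in (
    "The shop will be closed from the 3rd of March",
    "Unfortunately we have to close permanently.",
):
    print(extract_closures(sentence))
```

A rule like this handles the two example sentences, but a small variation such as "we will not reopen" is already missed, which is exactly the limitation I want to get around.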

Are there smarter ways to do this kind of NLP that go beyond manually defining many regular expressions? Perhaps some form of lexical pattern learning from annotated examples?

Topic information-extraction text-mining nlp

Category Data Science

Erwan · Accepted Answer · 27 January 2021, 10:16

I think the closest standard NLP task would be relationship extraction. In general it is quite a complex task, involving NER, syntactic analysis and semantic role labeling.

Note that various works use the term "event extraction" (for example this), but as far as I know there is no clear definition of the task. It often refers to placing events on a timeline, which is quite different from your goal, though possibly related.

A basic approach would be to treat the problem as a sequence labeling task like NER: given some annotated "events" in a training corpus, the model might be able to learn the patterns and detect any new "event" in a text.
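As a rough illustration of this framing, the sketch below turns annotated spans into per-token BIO tags, the usual training format for NER-style taggers, and decodes tags back into spans. The label names (`CLOSURE`, `DATE`), the token indices and the helper names `to_bio` and `from_bio` are assumptions made up for the example. The model itself (a CRF or a fine-tuned transformer tagger, for instance) is not shown.

```python
def to_bio(tokens, spans):
    """Convert (start, end_exclusive, label) token spans to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

def from_bio(tokens, tags):
    """Decode BIO tags (e.g. a model's predictions) into (label, text) pairs."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            # A new span starts here; flush any span in progress.
            if current:
                spans.append((label, " ".join(current)))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)
        else:
            if current:
                spans.append((label, " ".join(current)))
            current, label = [], None
    if current:
        spans.append((label, " ".join(current)))
    return spans

tokens = "The shop will be closed from the 3rd of March".split()
# Hypothetical manual annotation of one training sentence.
spans = [(4, 5, "CLOSURE"), (6, 10, "DATE")]
tags = to_bio(tokens, spans)
print(list(zip(tokens, tags)))
print(from_bio(tokens, tags))
```

A tagger trained on many such sentences predicts one tag per token for new text, and `from_bio` recovers the event trigger and its attributes from those predictions. Linking a date to the right closure when a sentence contains several events would still need an extra step.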
