Best strategy for extracting specific structured data from unstructured sentences

Question

Daan

February 20, 2021, 00:45

Given a list of sentences like this:

4 to 5 hours over a period of 16 weeks
1st session: 2.0-2.5 hours 2nd session: 1.5-2.0 hours
Approximately 5-6 visits over the course of 5 months. Visit 1, 3, 5: about 1.5 hours. Visit 2, 4: short
15 visits over a period of approximately 74 weeks.
You will come to the organization about 12 times, over a period of a little more than three years. Each visit will take from 3-6 hours.

What tools or strategy should I use if I want a model to output the following data for the sentences above?

Number of sessions	Total duration (h)	Total timespan (weeks)
Unknown	4-5	16
2	3.5-4.5	Unknown
5-6	4.5	20
15	Unknown	74
12	36-72	156

I'm an ML beginner and wonder whether this is achievable with TensorFlow or GPT. For further learning on my own: what specific terminology should I google for? Is this NER, text extraction, or more like text classification?

Topic openai-gpt tensorflow

Category Data Science

Erwan · Accepted Answer · February 20, 2021, 00:45

The task is a specific case of named entity recognition (NER). Technically, NER is a sequence labeling task, which is itself a special case of classification: every token in the sentence is assigned a label.
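To make "sequence labeling" a bit more concrete, here is a minimal sketch of what the per-token output of such a model could look like for one of the example sentences, and how it can be grouped back into values. The label names (SESSIONS, DURATION, TIMESPAN) and the BIO tagging scheme are my own assumptions for illustration, and the tags are written by hand rather than predicted by a real model.

```python
# Hypothetical label set for this task (not from any library):
# SESSIONS, DURATION, TIMESPAN. "O" marks tokens outside any entity.

def decode_bio(tokens, tags):
    """Group per-token BIO tags into (label, text) entities."""
    entities = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_label is not None:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) ends the open entity.
            if current_label is not None:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label is not None:
        entities.append((current_label, " ".join(current_tokens)))
    return entities


# Hand-written tags standing in for a model's predictions.
tokens = ["15", "visits", "over", "a", "period", "of",
          "approximately", "74", "weeks", "."]
tags = ["B-SESSIONS", "O", "O", "O", "O", "O",
        "O", "B-TIMESPAN", "I-TIMESPAN", "O"]

entities = decode_bio(tokens, tags)
print(entities)
```

Note that this only gives you the labeled text spans. Turning "74 weeks" into the number 74, or "5 months" into 20 weeks, would still need a separate normalization step.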

I think you would have two main options:

Apply a pre-trained NER model: most of them recognize time entities, though not always very accurately. Since such a model isn't adapted to your data, it wouldn't distinguish between your three types of values (number of sessions, total duration, total timespan). Advantage: no need for training data.
Train your own NER model: that's the ideal scenario in terms of performance, assuming you have (or can obtain) a good amount of annotated data for training.
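For the second option, the annotated data usually starts out as character spans marked in each sentence, which then get converted to one label per token. Below is a rough sketch of that conversion, assuming the same kind of hypothetical label names (SESSIONS, TIMESPAN) and a simple regex tokenizer. The helpers `find_span` and `spans_to_bio` are made up for this example; a real setup would use the tokenizer and data format of whichever NER library you pick.

```python
import re


def find_span(text, phrase, label):
    """Locate a phrase in the text and return (start, end, label), end exclusive."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def spans_to_bio(text, spans):
    """Turn character-span annotations into (token, tag) pairs.

    Tokenization is a simple regex (words, decimals, punctuation),
    which is an assumption for illustration only.
    """
    pairs = []
    for match in re.finditer(r"\w+(?:\.\d+)?|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if it lies fully inside it.
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        pairs.append((match.group(), tag))
    return pairs


text = "15 visits over a period of approximately 74 weeks."
spans = [
    find_span(text, "15", "SESSIONS"),
    find_span(text, "74 weeks", "TIMESPAN"),
]

pairs = spans_to_bio(text, spans)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

A few hundred sentences annotated this way is the kind of training data this option needs; how many are enough depends on how varied your sentences are.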
