Best strategy for extracting specific structured data from unstructured sentences

Given a list of sentences like this:

  • 4 to 5 hours over a period of 16 weeks
  • 1st session: 2.0-2.5 hours 2nd session: 1.5-2.0 hours
  • Approximately 5-6 visits over the course of 5 months. Visit 1, 3, 5: about 1.5 hours. Visit 2, 4: short
  • 15 visits over a period of approximately 74 weeks.
  • You will come to the organization about 12 times, over a period of a little more than three years. Each visit will take from 3-6 hours.

What tools or strategy should I use to have a model produce the following data for the sentences above?

Number of sessions   Total duration (h)   Total timespan (w)
Unknown              4-5                  16
2                    3.5-4.5              Unknown
5-6                  4.5                  20
15                   Unknown              74
12                   36-72                156

I'm an ML beginner and wonder whether this is achievable with TensorFlow or GPT. For further learning on my own: what specific terminology should I google? Is this NER, text extraction, or more like text classification?

Topic openai-gpt tensorflow

Category Data Science


The task is a specific case of named entity recognition (NER). Technically, NER is a sequence labeling task, which is itself a special case of classification: every token in the sentence gets a label.
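To make "sequence labeling" concrete, here is a minimal sketch of how one of the sentences might look once every token has a label, using the common BIO scheme (B = beginning of an entity, I = inside, O = outside). The label names SESSIONS and TIMESPAN and the helper `decode_bio` are my own assumptions for illustration, not part of any particular library.

```python
# Hypothetical labels: the question only names the three output columns,
# so SESSIONS / TIMESPAN are assumed names for illustration.
tokens = ["15", "visits", "over", "a", "period", "of",
          "approximately", "74", "weeks", "."]
tags = ["B-SESSIONS", "O", "O", "O", "O", "O",
        "O", "B-TIMESPAN", "I-TIMESPAN", "O"]


def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (label, text) entity spans."""
    entities = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if any.
            if current_label is not None:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity currently being built.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) ends the current entity.
            if current_label is not None:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label is not None:
        entities.append((current_label, " ".join(current_tokens)))
    return entities


print(decode_bio(tokens, tags))
```

A sequence labeling model predicts the `tags` list; a decoding step like this one then turns the tags back into the values you want in your table.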

I think you would have two main options:

  • Apply a pre-trained NER model: most of them handle time entities, though not always very accurately. A pre-trained model isn't adapted to your data, so it wouldn't distinguish between your three types of values. Advantage: no training data needed.
  • Train your own NER model: that's the ideal scenario in terms of performance, assuming you have (or can obtain) a good amount of annotated training data.
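For the second option, the annotated data usually amounts to sentences plus labeled spans. Below is a rough sketch of what that could look like with character offsets, a common layout for NER training data (the exact format depends on the library you pick). The label set and the `annotate` helper are hypothetical names I made up to mirror the three columns in the question.

```python
# Hypothetical label set mirroring the three columns in the question.
LABELS = {"SESSIONS", "DURATION", "TIMESPAN"}


def annotate(text, marked):
    """Turn (substring, label) pairs into (start, end, label) character spans.

    Substrings are located left to right with str.find, resuming after the
    previous match, so they must be listed in order of appearance.
    """
    spans, cursor = [], 0
    for substring, label in marked:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        start = text.find(substring, cursor)
        if start == -1:
            raise ValueError(f"{substring!r} not found after position {cursor}")
        end = start + len(substring)
        spans.append((start, end, label))
        cursor = end
    return text, {"entities": spans}


train_data = [
    annotate("4 to 5 hours over a period of 16 weeks",
             [("4 to 5 hours", "DURATION"), ("16 weeks", "TIMESPAN")]),
    annotate("15 visits over a period of approximately 74 weeks.",
             [("15", "SESSIONS"), ("74 weeks", "TIMESPAN")]),
]

for text, annotation in train_data:
    for start, end, label in annotation["entities"]:
        print(label, repr(text[start:end]))
```

Note that a model trained on such data would only give you the raw spans ("16 weeks", "4 to 5 hours"); turning them into normalized numbers such as hours or weeks would still need a separate post-processing step.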
