Which model would be ideal for extracting a specific substring from a larger string?

I have a corpus of documents in which some lines contain information like this:

wt 210 1b 14.4 oz (98 kg)

or

weight: 219 lb (99 kg), height: 5' 1.9 (157 cm)

The format of this information varies from document to document. I need to extract the substring (or value) corresponding to the weight, and only the weight. Here are my questions about the problem:

  1. I have some regexes that can extract the weight value, so I can use them to label the lines. However, I do not know how to use a string as the target variable (y). Should I convert it to a TF-IDF vector? Wouldn't that make the target very high-dimensional?

  2. My first intuition is to use an extractive summarizer trained on many such lines. Is there a better way to handle this?
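
To make the labeling step in point 1 concrete, here is a minimal sketch of what such a regex might look like. The pattern, the name `WEIGHT_RE`, and the helper `label_weight_span` are hypothetical and only cover the two sample lines above; my real regexes differ per document. It records the label as character offsets into the line, which is one possible way to represent the target without turning the string into a vector. That representation is an assumption on my part, not something I have settled on.

```python
import re

# Hypothetical pattern: a weight keyword, a number, and a unit.
# "1b" is accepted as a noisy variant of "lb" because it occurs in the sample line.
WEIGHT_RE = re.compile(
    r"\b(?:wt|weight)\b\s*:?\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>lbs?|1b|kg)\b",
    re.IGNORECASE,
)


def label_weight_span(line):
    """Return (start, end, text) for the weight substring in `line`, or None."""
    match = WEIGHT_RE.search(line)
    if match is None:
        return None
    # The label is a pair of character offsets, not the string itself.
    start = match.start("value")
    end = match.end("unit")
    return start, end, line[start:end]


samples = [
    "wt 210 1b 14.4 oz (98 kg)",
    "weight: 219 lb (99 kg), height: 5' 1.9 (157 cm)",
]
for sample in samples:
    print(label_weight_span(sample))
```

With labels in this form, each line would carry a (start, end) pair instead of a free-text target, and lines without a match would simply have no label.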

Thank you.

Topic automatic-summarization tfidf nlp

Category Data Science
