Which model is best suited to extracting a specific substring from a larger string?
I have a corpus of documents in which some lines contain information like this:
wt 210 1b 14.4 oz (98 kg)
or
weight: 219 lb (99 kg), height: 5' 1.9 (157 cm)
The format of this information varies from document to document. I need only the value, or the substring, corresponding to weight. Here are my questions about the problem:
I have some regexes that can extract the weight value for labeling the lines. However, I do not know how to supply a string as the target (y). Should I convert it to a TF-IDF vector? Won't that make the target very high-dimensional?
My first intuition is to use an extractive summarizer trained on many such lines. Is there a better way to handle this?
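To make the regex-based labeling mentioned above concrete, here is a minimal sketch. It assumes the weight is introduced by "wt" or "weight" and given in pounds. The pattern, the name `label_weight_span`, and the choice to record the label as character offsets instead of a raw string are all illustrative assumptions, not the regexes actually in use. It also ignores the ounces and kilogram parts of the sample lines.

```python
import re

# Hypothetical pattern: "wt" or "weight", an optional colon, a number,
# then a pound unit. "1b" is accepted alongside "lb" because that
# spelling appears in the first sample line.
WEIGHT_RE = re.compile(
    r"\b(?:wt|weight)\b\s*:?\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>lbs?|1b)\b",
    re.IGNORECASE,
)


def label_weight_span(line):
    """Return (start, end, text) for the weight substring, or None if absent.

    The offsets index into `line`, so the label is a pair of integers
    rather than a free-form string.
    """
    match = WEIGHT_RE.search(line)
    if match is None:
        return None
    start, end = match.start("value"), match.end("unit")
    return start, end, line[start:end]


samples = [
    "wt 210 1b 14.4 oz (98 kg)",
    "weight: 219 lb (99 kg), height: 5' 1.9 (157 cm)",
    "height: 5' 1.9 (157 cm)",
]

for line in samples:
    print(label_weight_span(line))
# (3, 9, '210 1b')
# (8, 14, '219 lb')
# None
```

Under these assumptions each labeled line yields a start and end position, which keeps the target low-dimensional. Whether that representation suits the eventual model is part of what I am asking.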
Thank you.
Topic automatic-summarization tfidf nlp
Category Data Science