Extract a numeric attribute from partially unstructured text for each word of a vocabulary
Given a vocabulary
v = {'sales', 'units', 'parts', 'operators', 'revenue'}
and strings such as
s1 = 'total of 1138 units, repaired 7710 parts, sales increased 588 (+34), decreasing of operator 413 (-14)'
s2 = 'part 7710 (repaired), units are 1138, revenue 1212, operators variation is -14, salles increment +34 (588 total)'
I have to associate each key of v
with the corresponding number from s1
and s2
(for sales
and operators
I need the variations (numbers with a sign in the front), i.e. +34
and -14
), that is for s1
we have to obtain
key | attribute |
---|---|
sales | +34 |
units | 1138 |
parts | 7710 |
operators | -14 |
revenue | none |
for s2
is the same table except for 1212
instead of none
.
Notice that:
- there is some sort of structure in the text data since each string contains some commas
,
dividing the string into different parts, each of them containing a word of the vocabulary and one number (two in the case ofsales
andoperators
). - keys contained in the strings may be written badly since they are manually typed, i.e. in
s1
there isoperator
instead ofoperators
, ins2
there arepart
instead ofparts
andsalles
instead ofsales
I wrote a simple python script (using mainly regex) doing the job in most cases, and now I'd like to try with a machine learning algorithm to learn how it works and compare the results. I have many manually labelled strings (i.e. string + table) which I could use to train a neural network, but since I'm a novice I don't know where to start.
Which model is most suitable for this task? NER? BERT? I searched for examples on keras site, here and on google to see if somebody had already treatened this kind of tasks, but didn't find one, and I think it's because I don't know which terms to use in the search query. I tried with something like data mining unstructured data, example of supervised NER and keras example text extraction, but they all are kind of vague.
Is this a problem of multiclass classification? Text mining? Features generation? I think is not NLP since the algorithm doesn't have to learn the meaning of a sentence.
Topic text-mining neural-network
Category Data Science