CRFsuite/Wapiti: How do I create the intermediary data needed to run a training?
Last week I asked for software to train a model that categorizes chunks of a string, and two packages were suggested. I'm now struggling to make use of either of them.
It seems that in machine learning (or at least, with CRF?), you can't just train on the training data directly, but you have to go through an intermediary step first.¹
From the CRFsuite tutorial:
The next step is to preprocess the training and testing data to extract attributes that express the characteristics of words (items) in the data. CRFsuite internally generates features from attributes in a data set. In general, this is the most important process for machine-learning approaches because a feature design greatly affects the labeling accuracy.
Wapiti doesn't seem to need such an attribute file. I think that's because it uses patterns instead, which seem somewhat more sophisticated than CRFsuite's intermediary-format files.
To provide an example: I have a large number (many tens of thousands) of strings such as these three:
Michael went to his room.
Did you know Jessica's mom used to be with the military?
Amanda! Come back inside! We'll have dinner soon!
From these, a smaller number (a few thousand) of labelled training and test examples have been created manually, such as this block (for the three examples above):
T Michael
K went
K to
K his
K room
S .
K Did
K you
K know
T Jessica's
K mom
K used
K to
K be
K with
K the
K military
S ?
T Amanda
S !
K Come
K back
K inside
S !
K We'll
K have
K dinner
K soon
S !
(T for names, K for non-names, S for punctuation, N for numbers.)
How do I figure out what the attributes should be, to be able to create an equivalent to the script used in the CRFsuite tutorial?
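To make the question concrete, here is a minimal sketch of what I imagine such a preprocessing step could look like for the first labelled sentence. It assumes, as far as I understand the tutorial, that CRFsuite expects one item per line (the label followed by tab-separated attributes) and a blank line between sequences. The function names (`escape`, `token_attributes`, `to_crfsuite_lines`) and the attribute names (`cap`, `digit`, `punct`) are my own inventions for illustration, not anything CRFsuite prescribes; the `w[0]`, `__BOS__` and `__EOS__` naming only imitates the style I have seen in the tutorial's output.

```python
def escape(value):
    # Assumption: CRFsuite treats "name:value" as an attribute with a weight,
    # so a literal colon or backslash in an attribute name must be escaped.
    return value.replace("\\", "\\\\").replace(":", "\\:")


def token_attributes(tokens, i):
    """Return hypothetical attribute strings for the token at position i."""
    word = tokens[i]
    attrs = [
        "w[0]=" + escape(word.lower()),                         # the token itself
        "cap=" + str(word[0].isupper()),                        # starts with a capital?
        "digit=" + str(any(c.isdigit() for c in word)),         # contains a digit?
        "punct=" + str(not any(c.isalnum() for c in word)),     # punctuation only?
    ]
    # Neighbouring tokens give the model some context.
    if i > 0:
        attrs.append("w[-1]=" + escape(tokens[i - 1].lower()))
    else:
        attrs.append("__BOS__")  # beginning of sequence
    if i < len(tokens) - 1:
        attrs.append("w[1]=" + escape(tokens[i + 1].lower()))
    else:
        attrs.append("__EOS__")  # end of sequence
    return attrs


def to_crfsuite_lines(labelled):
    """Turn [(label, token), ...] for one sentence into tab-separated lines."""
    tokens = [tok for _, tok in labelled]
    return [
        "\t".join([label] + token_attributes(tokens, i))
        for i, (label, _) in enumerate(labelled)
    ]


sentence = [("T", "Michael"), ("K", "went"), ("K", "to"),
            ("K", "his"), ("K", "room"), ("S", ".")]
for line in to_crfsuite_lines(sentence):
    print(line)
print()  # blank line marks the end of the sequence
```

Whether capitalization, digits and neighbouring words are actually the right attributes for telling T, K, S and N apart is exactly what I am unsure about.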
¹: With regard to that intermediary step, the terminology used by Naoaki Okazaki is not clear to me. Features and Attributes seem to be used interchangeably and seem to refer to something invisible contained in the data. Labels might be the categories into which the tokens are put, and then there are also Observations.