CRFsuite/Wapiti: How do I create the intermediary data needed for training?

After asking for software recommendations last week and being pointed to two tools (for training a model to categorize chunks of a string), I'm now struggling to make use of either of them.

It seems that in machine learning (or at least, with CRF?), you can't just train on the training data directly, but you have to go through an intermediary step first.¹

From the CRFsuite tutorial:

The next step is to preprocess the training and testing data to extract attributes that express the characteristics of words (items) in the data. CRFsuite internally generates features from attributes in a data set. In general, this is the most important process for machine-learning approaches because a feature design greatly affects the labeling accuracy.

Wapiti doesn't seem to need such an attribute file; I think that's because it uses patterns instead, which seem somewhat more sophisticated than CRFsuite's intermediary-format files.

To provide an example: given a large number (many tens of thousands) of strings such as these three:

  • Michael went to his room.

  • Did you know Jessica's mom used to be with the military?

  • Amanda! Come back inside! We'll have dinner soon!

From these, a smaller number (a few thousand) of labelled training and test examples have been created manually, such as these blocks (one for each example above):

T Michael
K went
K to
K his
K room
S .

K Did
K you
K know
T Jessica's
K mom
K used
K to
K be
K with
K the
K military
S ?

T Amanda
S !
K Come
K back
K inside
S !
K We'll
K have
K dinner
K soon
S .

(T for names, K for non-names, S for punctuation, N for numbers.)

How do I figure out what the attributes should be, so that I can create an equivalent of the chunking.py script used in the CRFsuite tutorial?


¹: With regard to that intermediary step, the terminology used by Naoaki Okazaki is not clear to me. Features and Attributes are used interchangeably and seem to refer to something invisible contained in the data. Labels might be the categories in which to put the tokens, and then there's also Observations.



It's true that this is a bit of a complex process, but it's worth understanding in order to get the best out of the model.

"Feature" and "attribute" (and probably observation but I'm not 100% sure) are the same thing. The features are the ones directly used by the model (as opposed to the raw input data). For every input word a vector of binary features is generated based on the input data following the custom "patterns" defined in the configuration file. Note that I'm using the word "data" because the input data doesn't have to be only the text, it can optionally include additional information as columns, for example POS tags (as obtained by a POS tagger) and syntactic dependencies (as obtained by a dependency parser).

This kind of information is often very useful for the model: if the model can only use the text, then the default binary features are just a basic one-hot encoding of the words. This means that the model can only use conditions of the form word == x or word != x. To see why this is not enough: the word "12345" is different from "12346" in exactly the same way that ";" is different from "paleontology", i.e. the model cannot capture the fact that "12345" and "12346" are both numbers.
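For instance, a couple of extra attributes per token are enough to let the model generalise over numbers, capitalised names and suffixes. This is just an illustrative extension of the token_attributes sketch above, with arbitrary attribute names:

def token_attributes(token):
    """Richer attributes: "12345" and "12346" now share is_digit=yes,
    and "Michael" and "Amanda" share is_title=yes."""
    return [
        "w=" + token,
        "wl=" + token.lower(),
        "is_digit=" + ("yes" if token.isdigit() else "no"),
        "is_title=" + ("yes" if token.istitle() else "no"),
        "prefix3=" + token[:3],
        "suffix3=" + token[-3:],
    ]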

Additionally, the patterns allow the model to use "neighbour features" as well, which is why the notation is a bit complex. The idea is that the label may depend not only on the features of the current word but also on the features of the previous word, or the one before that. In other words, this allows the model to take the context of the sequence into account.
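In the CRFsuite text format you get this effect by emitting, for each position, attributes computed from the surrounding tokens too; Wapiti's pattern file does the same job declaratively (which, if I remember the pattern syntax correctly, is what its %x[offset,column] markers are for). A rough sketch of that windowing step, building on the functions above and again with made-up attribute names:

def sequence_attributes(tokens):
    """For each position, combine the token's own attributes with
    attributes taken from its neighbours (a -1/+1 window here)."""
    out = []
    for i, token in enumerate(tokens):
        attrs = token_attributes(token)
        if i > 0:
            attrs.append("w[-1]=" + tokens[i - 1])
        else:
            attrs.append("__BOS__")        # beginning of the sequence
        if i < len(tokens) - 1:
            attrs.append("w[+1]=" + tokens[i + 1])
        else:
            attrs.append("__EOS__")        # end of the sequence
        out.append(attrs)
    return out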

Finally, it's usually also possible to define dependencies between labels. For example, there might be some sequences of labels which cannot happen, and this information can help the model determine the correct label for the current word by taking into account the previous/next label in the sequence.
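CRFsuite models these label-to-label dependencies itself as first-order transition features, while in Wapiti this is what the bigram lines of the pattern file are for. If you use the python-crfsuite bindings, you can look at what was learned about transitions once a model exists; this is only an inspection sketch and assumes a model file named model.crfsuite, trained as in the next snippet:

import pycrfsuite

# Assumes a model has already been trained and saved (see the training sketch below).
tagger = pycrfsuite.Tagger()
tagger.open("model.crfsuite")

# info().transitions maps (label_from, label_to) to a learned weight;
# a strongly negative weight means the model considers that label pair unlikely.
for (from_label, to_label), weight in sorted(
        tagger.info().transitions.items(), key=lambda kv: kv[1]):
    print("%s -> %s : %.3f" % (from_label, to_label, weight))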

OK, that's a very short summary. Now, how do you decide which patterns to use? Well, the most common option is to try a few configurations, then test and tune them manually. It's also possible to automate this process, but it's rarely worth the effort imho.
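If you end up using the python-crfsuite bindings rather than the command-line tools, that trial-and-error loop looks roughly like this. It's a sketch under the assumption that sequence_attributes from above produces your features; the parameter values are placeholders to tune, not recommendations.

import pycrfsuite

def train_and_score(train_seqs, test_seqs, model_path="model.crfsuite"):
    """train_seqs / test_seqs: lists of (tokens, labels) pairs,
    e.g. (["Michael", "went", "to", "his", "room", "."],
          ["T", "K", "K", "K", "K", "S"])."""
    trainer = pycrfsuite.Trainer(verbose=False)
    for tokens, labels in train_seqs:
        trainer.append(sequence_attributes(tokens), labels)
    trainer.set_params({
        "c1": 0.1,                          # L1 regularisation
        "c2": 0.01,                         # L2 regularisation
        "max_iterations": 100,
        "feature.possible_transitions": True,
    })
    trainer.train(model_path)

    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    correct = total = 0
    for tokens, labels in test_seqs:
        predicted = tagger.tag(sequence_attributes(tokens))
        correct += sum(p == g for p, g in zip(predicted, labels))
        total += len(labels)
    return correct / total                  # per-token accuracy

You would then vary token_attributes and the window size, rerun, and keep whichever configuration scores best on the held-out data.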
