How to do feature engineering for email cleaning / text extraction?
I have a large batch of email data that I want to analyse. In order to do that, I need to first prepare the data, as the messages are quite often >80% noise. Generally speaking, my dataset's structure is nowhere near that of the ENRON dataset. I need to get rid of signatures, headers and, most importantly, automatically appended legal / security disclaimers.
I have been doing some research and so far I've seen two supervised learning approaches to this problem: one uses a multilabel sequential learner over a stream of lines; the other uses multiple binary SVMs to find the lines that open or close a block of text of a particular type (signature, header, etc.).
I am confused by the way feature engineering is done for such problems. The papers I've read suggest a set of features that mixes pattern matching with general text properties (e.g. line length, starting character). It is not obvious how the authors arrived at these particular rules for encoding their data. How do I ensure that the features I identify generalise well across my data and do not introduce heavy bias during classification?
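To make the question concrete, here is a minimal sketch of the kind of per-line encoding I mean, mixing pattern matches with simple text properties such as line length and starting character. The regular expressions, the keyword list and the function names (`line_features`, `email_features`) are my own placeholders for illustration, not the feature sets from the papers:

```python
import re

# Hypothetical patterns and keywords, chosen for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
HEADER_RE = re.compile(r"^(from|to|cc|bcc|subject|sent|date):", re.IGNORECASE)
DISCLAIMER_WORDS = ("confidential", "intended recipient", "privileged")


def line_features(line: str) -> dict:
    """Encode a single email line as a dictionary of simple features."""
    stripped = line.strip()
    lowered = stripped.lower()
    return {
        # General text properties
        "length": len(stripped),
        "upper_ratio": (
            sum(c.isupper() for c in stripped) / len(stripped) if stripped else 0.0
        ),
        # Starting-character features
        "starts_with_quote": stripped.startswith(">"),
        "starts_with_dashes": stripped.startswith("--"),
        # Pattern-matching features
        "looks_like_header": bool(HEADER_RE.match(stripped)),
        "has_email": bool(EMAIL_RE.search(stripped)),
        "has_disclaimer_word": any(w in lowered for w in DISCLAIMER_WORDS),
    }


def email_features(text: str) -> list:
    """Encode every line of a message, adding its relative position."""
    lines = text.splitlines()
    n = max(len(lines), 1)
    rows = []
    for i, line in enumerate(lines):
        feats = line_features(line)
        feats["relative_position"] = i / n  # 0.0 for the first line
        rows.append(feats)
    return rows


if __name__ == "__main__":
    sample = "From: a@b.com\nHello there\n-- \nThis message is CONFIDENTIAL"
    for row in email_features(sample):
        print(row)
```

Each row could then be fed to a per-line classifier or a sequence model. What I cannot judge is whether hand-picked rules like these would transfer to messages from senders or mail clients that are not in my training sample.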
Are there some general principles I should follow when trying to come up with a set of features or is it completely dataset-dependent?
Topic feature-engineering supervised-learning feature-selection
Category Data Science