How to do feature engineering for email cleaning / text extraction?

I have a large batch of email data that I want to analyse. To do that, I first need to prepare the data, as the messages are quite often >80% noise. Generally speaking, my dataset's structure is nowhere near that of the Enron dataset. I need to get rid of signatures, headers and, most importantly, automatically appended legal / security disclaimers.

I have been doing some research, and so far I've seen two supervised-learning approaches to this problem: one uses a multi-label sequential learner on a stream of lines; the other uses multiple binary SVMs to find lines that open or close a block of text of a particular type (signature, header, etc.).

I am confused by the way feature engineering is done in such problems. The papers I've read suggest a set of features that mixes pattern matching with some general text processing (e.g. line length, starting character). It is not obvious how the authors arrived at these particular rules for encoding their data. How do I ensure that the features I identify generalise well and do not introduce heavy bias during classification?
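For reference, line-level features of the kind I've seen look roughly like this (my own sketch of the genre, not taken from any specific paper):

```python
import re

def line_features(line, prev_line=""):
    """Illustrative line-level features; every feature here is a guess, not from a paper."""
    s = line.strip()
    return {
        "length": len(s),
        "is_blank": int(s == ""),
        "starts_with_quote": int(s.startswith(">")),              # quoted reply
        "starts_with_dashes": int(bool(re.match(r"-{2,}", s))),   # signature delimiter
        "upper_ratio": sum(c.isupper() for c in s) / max(len(s), 1),
        "has_phone_number": int(bool(re.search(r"\+?\d[\d\s().-]{7,}\d", s))),
        "has_email_address": int(bool(re.search(r"\S+@\S+\.\S+", s))),
        "disclaimer_keyword": int(bool(re.search(
            r"\b(confidential|disclaimer|intended recipient)\b", s, re.I))),
        "prev_is_blank": int(prev_line.strip() == ""),
    }
```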

Are there some general principles I should follow when trying to come up with a set of features or is it completely dataset-dependent?

Tags: feature-engineering, supervised-learning, feature-selection

Category: Data Science


Any text that is appended automatically is largely deterministic, so it can be removed with rule-based logic; the natural tool for this is regular expressions (regex). You can write a small set of regex patterns that captures most of the "noise" in your email dataset.
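As a minimal sketch, assuming the boilerplate is appended at the bottom of the message (the marker patterns below are illustrative and would need tuning against your own corpus):

```python
import re

# Hypothetical boilerplate markers; tune these against your own data.
BOILERPLATE_START = re.compile(
    r"^(--\s*$"                                   # conventional signature delimiter
    r"|_{10,}"                                    # long underscore rule before footers
    r"|This (e-?mail|message)\b.*\bconfidential"  # typical legal disclaimer opener
    r"|DISCLAIMER)",
    re.IGNORECASE,
)

def strip_trailing_boilerplate(body):
    """Drop everything from the first boilerplate marker to the end of the message."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if BOILERPLATE_START.match(line.strip()):
            return "\n".join(lines[:i]).rstrip()
    return body
```

If footers are injected mid-thread (e.g. after every quoted reply), you would apply the same test per quoted block rather than once per message.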

The specific patterns to filter out are domain- and problem-specific. One way to think about them is as a collection of stop words, i.e., commonly occurring text that carries minimal predictive value.
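Continuing the stop-word analogy, you can maintain the patterns as an explicit list and drop any line that matches one; every entry below is a placeholder you would replace with patterns mined from your own dataset:

```python
import re

# Stop patterns in the spirit of a stop-word list; all entries are placeholders.
STOP_PATTERNS = [
    re.compile(r"^sent from my \w+", re.IGNORECASE),    # mobile client footer
    re.compile(r"^(best|kind) regards\b", re.IGNORECASE),
    re.compile(r"please consider the environment", re.IGNORECASE),
]

def drop_stop_lines(body):
    """Remove individual lines that match any stop pattern."""
    return "\n".join(
        line for line in body.splitlines()
        if not any(p.search(line) for p in STOP_PATTERNS)
    )
```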
