Does feature engineering require absolute accuracy?

Sometimes when I'm studying a dataset, the text fields are particularly challenging to handle. For whatever features I want to derive from them, I apply heuristics that approximate certain text patterns so I can extract the features (think of these heuristics as self-invented regexes). I'm concerned about the soundness of this approach: do people in practice also extract features with approximate heuristics only (which may leave some rows wrongly represented, but works fine for most rows), or do we need to insist on exact rules so that no rows are mis-featured? Can the model tolerate our approximate heuristics?

One example: given a field of product names, I try to extract the brand name from it. Most rows start the product name with the brand, but some don't. If I blindly apply the rule that the first x words are the brand, I may mistakenly extract generic words (e.g. handbags, cosmetics, etc.) that are not brand names at all! Do we usually tolerate such inaccuracy in feature engineering?
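
To make this concrete, here is a rough sketch of the kind of heuristic I mean (the stoplist `GENERIC_WORDS` and the function name are my own inventions for illustration, not from any library): it takes the first token as the brand, but refuses to guess when that token looks generic, so mis-featured rows can at least be flagged rather than silently mislabeled.

```python
# Assumed stoplist of generic product words that are clearly not brands.
GENERIC_WORDS = {"handbags", "cosmetics", "shoes", "accessories"}

def extract_brand(product_name: str, max_words: int = 1):
    """Guess the brand as the first `max_words` tokens of the product name.

    Returns None when the leading token looks generic, so uncertain rows
    are flagged instead of wrongly featured.
    """
    tokens = product_name.lower().split()
    if not tokens or tokens[0] in GENERIC_WORDS:
        return None  # heuristic refuses to guess
    return " ".join(tokens[:max_words])

print(extract_brand("Gucci leather handbag"))  # -> "gucci"
print(extract_brand("Handbags for women"))     # -> None (flagged)
```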

Topics: representation, feature-engineering

Category: Data Science


Your eventual model will treat the features you give it as ground truth, so errors in them will affect it. (The adage "garbage in, garbage out" comes to mind.) However, if enough of the engineering is correct, the model will only be slightly degraded, since it will be driven by the (hopefully) much more common correctly engineered rows.

In the end, I think you should treat this like any other source of uncertainty: balance the accuracy your final model needs against the cost of developing better feature engineering.
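
For example, one cheap way to weigh that trade-off is to hand-label a small random sample and estimate how often the heuristic is right; if the error rate is small relative to your accuracy needs, it is probably not worth further engineering. A minimal sketch (the helper names and sample data here are mine, purely for illustration):

```python
import random

# Toy heuristic standing in for whatever rule you actually use (assumed).
def first_word_brand(name: str) -> str:
    return name.lower().split()[0]

def audit_heuristic(rows, heuristic, n_sample=100, seed=0):
    """Estimate the heuristic's accuracy on a hand-labeled random sample.

    `rows` is a list of (raw_text, hand_assigned_label) pairs; returns
    the fraction of sampled rows the heuristic gets right.
    """
    random.seed(seed)
    sample = random.sample(rows, min(n_sample, len(rows)))
    correct = sum(heuristic(text) == label for text, label in sample)
    return correct / len(sample)

# Made-up hand-labeled data for illustration.
labeled = [
    ("Gucci leather handbag", "gucci"),
    ("Prada wallet", "prada"),
    ("Handbags for women", None),  # the heuristic will get this one wrong
]
print(f"estimated accuracy: {audit_heuristic(labeled, first_word_brand):.0%}")
```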
