Does feature engineering require absolute accuracy?

Sometimes when I'm studying a dataset, the text fields are particularly challenging to handle. For whatever features I want to derive from them, I apply heuristics that approximate certain text patterns so I can extract the features (think of these heuristics as self-invented regexes). I'm concerned about the soundness of this approach: do people in practice also extract features with approximate heuristics only (which may leave some rows wrongly represented, but works fine for most rows), or do we need to insist on exact rules so that no rows are mis-featured? Can the model tolerate our approximate heuristics?

One example: given a field of product names, I try to extract the brand name from it. Most rows start the product name with the brand, but some don't. If I blindly apply the rule that the first x words are the brand, I may mistakenly extract generic words (e.g. handbags, cosmetics, etc.) that are not brand names at all! Do we usually tolerate such inaccuracy in feature engineering?
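
To make this concrete, here is a rough sketch of the kind of heuristic I mean (the stoplist `GENERIC_WORDS` and the function name are my own inventions for illustration, not from any library): it takes the first token as the brand, but refuses to guess when that token looks generic, so mis-featured rows can at least be flagged rather than silently mislabeled.

```python
# Assumed stoplist of generic product words that are clearly not brands.
GENERIC_WORDS = {"handbags", "cosmetics", "shoes", "accessories"}

def extract_brand(product_name: str, max_words: int = 1):
    """Guess the brand as the first `max_words` tokens of the product name.

    Returns None when the leading token looks generic, so uncertain rows
    are flagged instead of wrongly featured.
    """
    tokens = product_name.lower().split()
    if not tokens or tokens[0] in GENERIC_WORDS:
        return None  # heuristic refuses to guess
    return " ".join(tokens[:max_words])

print(extract_brand("Gucci leather handbag"))  # -> "gucci"
print(extract_brand("Handbags for women"))     # -> None (flagged)
```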

Topics: representation, feature-engineering

Category: Data Science


Your eventual model will treat the features you give it as ground truth, so errors in them will affect it. (The adage "garbage in, garbage out" comes to mind.) However, if enough of the engineering is correct, the model will only be slightly degraded, since it will be driven by the (hopefully) much more common correctly engineered rows.

In the end, I think you should treat this like any other source of uncertainty: balance the accuracy your final model needs against the cost of developing better feature engineering.
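
For example, one cheap way to weigh that trade-off is to hand-label a small random sample and estimate how often the heuristic is right; if the error rate is small relative to your accuracy needs, it is probably not worth further engineering. A minimal sketch (the helper names and sample data here are mine, purely for illustration):

```python
import random

# Toy heuristic standing in for whatever rule you actually use (assumed).
def first_word_brand(name: str) -> str:
    return name.lower().split()[0]

def audit_heuristic(rows, heuristic, n_sample=100, seed=0):
    """Estimate the heuristic's accuracy on a hand-labeled random sample.

    `rows` is a list of (raw_text, hand_assigned_label) pairs; returns
    the fraction of sampled rows the heuristic gets right.
    """
    random.seed(seed)
    sample = random.sample(rows, min(n_sample, len(rows)))
    correct = sum(heuristic(text) == label for text, label in sample)
    return correct / len(sample)

# Made-up hand-labeled data for illustration.
labeled = [
    ("Gucci leather handbag", "gucci"),
    ("Prada wallet", "prada"),
    ("Handbags for women", None),  # the heuristic will get this one wrong
]
print(f"estimated accuracy: {audit_heuristic(labeled, first_word_brand):.0%}")
```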
