Does feature engineering require absolute accuracy?
Sometimes when I'm studying a dataset, the text fields are particularly challenging to handle. For whatever features I want to derive from a text field, I try to apply some heuristic that approximates certain text patterns so I can extract features (think of these heuristics as self-invented regexes...). I'm concerned about the soundness of such heuristics: do people in practice also extract features with approximate heuristics only (ones that may leave some rows wrongly represented but generally work fine for most rows), or do we need to insist on deriving exact heuristics/rules so that no rows are mis-featured? Can the model tolerate our approximate heuristics?
One example: given a field of product names, I try to extract the brand name from it. Although most product names start with the brand name, some don't. If I blindly apply the rule that the first x words are the brand name, I may mistakenly extract generic words (e.g. handbags, cosmetics, etc.) that are not brand names at all! Do we usually tolerate such inaccuracy in feature engineering?
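To make the question concrete, here is a minimal sketch of the kind of heuristic I mean. All names here (the function, the blocklist of generic words) are hypothetical illustrations, not an established method: take the first word as the brand, but flag rows as unknown when the first word looks generic rather than mis-labeling them.

```python
import re

# Hypothetical blocklist of generic first words (illustrative, not exhaustive).
GENERIC_WORDS = {"handbags", "cosmetics", "shoes", "accessories"}

def extract_brand(product_name):
    """Naive heuristic: take the first word of the product name as the brand,
    unless it is a known generic word, in which case return None (unknown)."""
    words = re.findall(r"[A-Za-z0-9&'-]+", product_name)
    if not words:
        return None
    first = words[0]
    if first.lower() in GENERIC_WORDS:
        return None  # better to mark as missing than to mis-feature the row
    return first

print(extract_brand("Gucci leather handbag"))  # → Gucci
print(extract_brand("Handbags for women"))     # → None
```

Returning None for suspicious rows at least lets the model treat them as "brand unknown" instead of silently learning from wrong labels, though the blocklist will never catch every exception.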
Topic: representation, feature-engineering
Category: Data Science