Several independent variables based on the same underlying data
I have a dataset containing, among others, two feature variables that are derived from the same underlying data (i.e. they share mutual information) but convey different information. How should such cases be handled?
Since they will logically be highly correlated, it would seem to make sense to use only one of them, preferably the one that conveys more information. But:
- Is this the correct approach, or do we actually lose valuable information by leaving the other one out?
- If including both is the correct approach, is there anything else that needs to be done and/or checked to avoid messing up the model (since they will be highly correlated)?
Example 1:
- Let's say we have a feature which can be a pair of any numbers from `1` to `3`, e.g. `(1,1)`, `(3,2)`, `(2,1)`, etc.
- And we also have another feature which tells us how many ones (i.e., `1`) are in the previous feature, so for the previous cases this would correspond to `2`, `0`, `1`, etc.
- Although this second feature does not provide any new information that is not already present in the first feature per se (i.e. it can be deduced from the first feature), it does have some special meaning, i.e. let's say that the number of ones is expected to influence the results (dependent variable).
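To make Example 1 concrete, here is a minimal Python sketch of how the second feature follows from the first. The name `count_ones` is only an illustrative helper, not something taken from the actual data, and I am assuming the pairs are ordered.

```python
from itertools import product


def count_ones(pair):
    """Second feature (illustrative helper): how many entries of the pair equal 1."""
    return sum(1 for value in pair if value == 1)


# All possible values of the first feature, assuming ordered pairs over {1, 2, 3}.
pairs = list(product([1, 2, 3], repeat=2))

# The second feature is a deterministic function of the first one.
ones_by_pair = {pair: count_ones(pair) for pair in pairs}

for pair in [(1, 1), (3, 2), (2, 1)]:
    print(pair, "->", ones_by_pair[pair])
```

Several different pairs map to the same count, so the second feature is a coarser summary of the first rather than a copy of it.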
Example 2:
- One variable is a discrete/integer value, and the other one is `0` if the value of the first feature is below some specific value, and `1` if it is equal or higher.
- Just as in Example 1, the second feature has some special meaning.
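Example 2 can be sketched the same way. The threshold `5` and the value range `0..9` below are made-up assumptions purely for illustration, and `pearson` is a small hand-written helper so that the snippet needs no external libraries. It computes the indicator and the Pearson correlation between the two features, which is the kind of dependence I am worried about.

```python
from math import sqrt

THRESHOLD = 5  # hypothetical cut-off, chosen only for illustration


def indicator(value, threshold=THRESHOLD):
    """Second feature: 0 if the value is below the threshold, 1 otherwise."""
    return 0 if value < threshold else 1


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# First feature: made-up integer values 0..9; second feature derived from it.
values = list(range(10))
flags = [indicator(v) for v in values]

r = pearson(values, flags)
print("correlation between value and indicator:", round(r, 3))
```

On this toy data the correlation is high but not perfect, so the two features are strongly related without being interchangeable.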