Difference between feature interactions and confounding variables
Let me define the problem space.
I am working on a binary classification problem. I am trying to build a causal model as well as a predictive model.
My aim is to find a list of significant features (based on the causal model) and use it to build a predictive model. I did refer to the suggestions provided in this post, and they were very useful, but I have a few more questions because of my limited experience in the ML field.
I understood from the literature that there are two ways to adjust/control for confounders: one is during the study design phase and the other is during the modelling/analysis phase.
As I am working on a retrospective data analysis, I can only adjust for confounders during the analysis phase.
We know that a feature like age is a confounder in a typical example such as "gender causes heart disease".
1) So during the analysis phase, we include age as a variable in our model. Similarly, all the potential confounders that we can think of are put in the model as features. For example, X_train will have all the columns/features that I consider potential confounders, and it is then fed to the model (logistic regression). Am I right so far?
2) Does this mean our LR model is adjusted for confounders? How would you do confounding adjustment during the logistic regression modelling phase? If we include all potential confounders in our model and the coefficient of an already existing variable (gender) changes by 10% or so, I understand that age is a confounder, but does this also mean that the LR model is adjusted for confounders?
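To make (2) concrete, here is a minimal sketch of the check I have in mind, in plain Python with no libraries. The grouped counts are made up for illustration, and fit_logistic and solve are my own hypothetical helpers (a small Newton-Raphson fit), not functions from any package. It fits the model once with gender only and once with gender plus age group, then compares the gender coefficient.

```python
import math

# Hypothetical grouped counts: (gender, older, cases, non_cases).
# gender: 1 = male, 0 = female; older: 1 = older age group, 0 = younger.
GROUPS = [
    (0, 0, 10, 190),
    (0, 1, 30, 70),
    (1, 0, 10, 95),
    (1, 1, 120, 140),
]


def solve(a, b):
    """Solve the small linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * pv for v, pv in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_logistic(rows, n_iter=25):
    """Fit logistic regression by Newton-Raphson on grouped data.

    rows: list of (x, cases, non_cases), where x includes the intercept term.
    """
    p = len(rows[0][0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p                      # score vector
        info = [[0.0] * p for _ in range(p)]  # information matrix X'WX
        for x, cases, non_cases in rows:
            n = cases + non_cases
            eta = sum(b * xi for b, xi in zip(beta, x))
            mu = 1.0 / (1.0 + math.exp(-eta))
            for i in range(p):
                grad[i] += x[i] * (cases - n * mu)
                for j in range(p):
                    info[i][j] += x[i] * x[j] * n * mu * (1.0 - mu)
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    return beta


# Crude model: intercept + gender. Adjusted model: intercept + gender + age group.
crude = fit_logistic([([1, g], c, nc) for g, _, c, nc in GROUPS])
adjusted = fit_logistic([([1, g, o], c, nc) for g, o, c, nc in GROUPS])

crude_coef, adjusted_coef = crude[1], adjusted[1]
pct_change = 100 * abs(crude_coef - adjusted_coef) / abs(crude_coef)

print(f"gender coefficient, crude:    {crude_coef:.3f}")
print(f"gender coefficient, adjusted: {adjusted_coef:.3f}")
print(f"change in estimate:           {pct_change:.1f}%")
```

With these made-up counts the gender coefficient changes by well over 10% once the age group is included. On real data I would presumably fit the same two models with a library rather than with hand-written helpers.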
3) Then why is it said that logistic regression doesn't consider feature interaction? Is feature interaction different from confounding? I understand that a feature interaction is usually denoted as gender*age, but does this mean both variables work together to influence the outcome? Doesn't a confounder mean the same thing?
4) What is the use of having interaction variables? I mean, if gender*age impacts the outcome, can I conclude that gender (individually) and age (individually) also impact the outcome?
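To illustrate what I am asking in (3) and (4), here is a small sketch with made-up counts; odds_ratio and interaction_coefficient are hypothetical helpers of mine. With a binary gender and a binary age group, the gender*age coefficient of a logistic model containing gender, age group and their product equals the log of the ratio of the two stratum-specific odds ratios, so it is zero when the gender effect is the same in both age groups.

```python
import math


def odds_ratio(male_cases, male_non, female_cases, female_non):
    """Odds ratio of the outcome for males versus females in one 2x2 table."""
    return (male_cases * female_non) / (male_non * female_cases)


def interaction_coefficient(strata):
    """Log of (odds ratio in the old stratum / odds ratio in the young stratum)."""
    return math.log(odds_ratio(*strata["old"]) / odds_ratio(*strata["young"]))


# Hypothetical counts per age stratum:
# (male cases, male non-cases, female cases, female non-cases)
same_effect = {"young": (10, 95, 10, 190), "old": (120, 140, 30, 70)}
different_effect = {"young": (10, 95, 10, 190), "old": (160, 80, 20, 80)}

for name, strata in [("same_effect", same_effect), ("different_effect", different_effect)]:
    or_young = odds_ratio(*strata["young"])
    or_old = odds_ratio(*strata["old"])
    print(name, or_young, or_old, interaction_coefficient(strata))
```

In the first made-up dataset the gender odds ratio is 2 in both age groups, so the interaction coefficient is 0 even though age is associated with both gender and the outcome. In the second it is 2 among the young and 8 among the old, which is what I understand an interaction (effect modification) to look like.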
5) I see that people usually create 2x2 tables, called strata, for a stratified analysis, compute the risk ratio within each stratum, and compare it with the crude risk ratio. But how can we do this for all the variables that I consider confounders in my dataset? I know we can use tools like SPSS, STATA etc., but is that the only way to do it? Can't we do this using multivariate regression instead?
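For reference, this is the kind of stratified calculation I mean in (5), sketched for a single confounder (age group) with made-up counts. The function names are my own, and the pooled estimate uses the Mantel-Haenszel risk ratio formula as I understand it.

```python
# Hypothetical counts per age stratum:
# (male cases, male total, female cases, female total)
STRATA = {
    "young": (10, 100, 10, 200),
    "old": (120, 200, 30, 100),
}


def risk_ratio(a, n1, c, n0):
    """Risk in the exposed (a / n1) divided by risk in the unexposed (c / n0)."""
    return (a / n1) / (c / n0)


def crude_risk_ratio(strata):
    """Collapse all strata into one 2x2 table, ignoring age."""
    a, n1, c, n0 = (sum(s[i] for s in strata.values()) for i in range(4))
    return risk_ratio(a, n1, c, n0)


def mantel_haenszel_risk_ratio(strata):
    """Pooled risk ratio: sum(a * n0 / N) / sum(c * n1 / N) over the strata."""
    num = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in strata.values())
    den = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in strata.values())
    return num / den


crude_rr = crude_risk_ratio(STRATA)
adjusted_rr = mantel_haenszel_risk_ratio(STRATA)

for name, counts in STRATA.items():
    print(f"{name}: stratum risk ratio = {risk_ratio(*counts):.2f}")
print(f"crude risk ratio           = {crude_rr:.2f}")
print(f"Mantel-Haenszel risk ratio = {adjusted_rr:.2f}")
```

This works for one or two categorical confounders, but the number of strata multiplies with every variable I add, which is why I am asking whether regression can replace it.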
6) Is it mandatory that all our continuous variables be converted into some categorical variable for analysis/confounder adjustment?
7) Any simple examples/explanations would be helpful, as I couldn't find any tutorial on adjusting for confounding during logistic regression and finding significant variables. I have been referring to this; though it's useful, some links are broken. A lot of questions arise because I am neither a stats nor a biostats person. I usually build models using classic ML algorithms and am now trying to learn all this.