Difference between feature interactions and confounding variables

Let me define the problem space.

I am working on a binary classification problem. I am trying to build both a causal model and a predictive model.

My aim is to find a list of significant features (based on the causal model) and use them to build a predictive model. I did refer to the suggestions provided in this post and they were very useful, but I have a few more questions due to my limited background in the ML field.

I understood from the literature that there are two ways to adjust/control for confounders: one is during the study design phase and the other is during the modelling/analysis phase.

As I am working on a retrospective data analysis, I can only adjust for confounders during the analysis phase.

We know that in a typical example like "gender causes heart disease", a feature like age is a confounder.

1) So during the analysis phase, we include age as a variable in our model. Similarly, all the potential confounders we can think of are put in the model as features. For example, X_train will have all the columns/features that I think of as potential confounders, and it is then fed to the model (logistic regression). Am I right so far?

2) Does this mean our LR model is adjusted for confounders? How would you do confounding adjustment during the logistic regression modelling phase? If we include all potential confounders in our model and the coefficient of an already existing variable (gender) changes by 10% or so, I understand that age is a confounder, but does this also mean that the LR model is adjusted for confounders?

3) Then why is it said that logistic regression doesn't consider feature interactions? Is a feature interaction different from confounding? I understand that a feature interaction is usually denoted as gender*age, but does this mean both variables work together to influence the outcome? Doesn't a confounder mean the same?

4) What's the usefulness of having interaction variables? I mean, if gender*age impacts the outcome, can I conclude that gender (individually) and age (individually) impact the outcome?

5) I see that people usually create 2x2 tables called strata for stratified analysis, compute the risk ratio, and compare it with the crude risk ratio. But how can we do this for all the variables I think of as confounders in my dataset? I know we can use tools like SPSS, Stata, etc., but is that the only way? Can't we do it using multivariable regression?

6) Is it mandatory that all our continuous variables be converted into categorical variables for analysis/confounder adjustment?

7) Any simple examples/explanations would be helpful, as I couldn't find any tutorial on adjusting for confounding during logistic regression and finding significant variables. I have been referring to this; though it's useful, some links are broken. A lot of questions arise because I am neither a stats nor a biostats person. I usually build models using classic ML algorithms and am now trying to learn all this.

Topic causalimpact deep-learning logistic-regression statistics machine-learning

Category Data Science


Some remarks (as far as I understand your questions):

  1. In a causal model you need to reflect the "data generating process" (DGP). The DGP is a theoretical construct: you need to come up with an idea of what is relevant to your research question, i.e. which $X$ explain $y$ in a causal way. You can also include variables which are not so important. In general, underspecification (excluding important variables) is a real problem; overspecification can also be a problem, but its consequences are less severe.
  2. I think the wording is unclear here. In some fields, "adjusting for confounders" means to "control" for all relevant variables (a.k.a. confounders). First thing to do: get the wording right and be clear about what you mean when you say "confounders".
  3. Where does this claim ("feature interaction") come from? In general, logit is a model that is linear in the way you include variables, so if you are thinking about interactions of two or more $x$ in your model, then no: logit does not consider interactions unless you specify your model to include interaction terms.
  4. An easy way of thinking about interactions is the interaction of a continuous variable $x_1$ with a dummy variable (1 or 0) $x_2$. In the model $$y=\beta_0+\beta_1 x_1$$ you only have one intercept ($\beta_0$) and one slope ($\beta_1$). Adding $x_2$ as a main effect, $$y=\beta_0+\beta_1 x_1 + \beta_2 x_2,$$ gives the same slope for $x_1$ but a different intercept ($\beta_0+\beta_2$) in case $x_2=1$. To get an actual interaction you add the product term: $$y=\beta_0+\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2,$$ so the slope of $x_1$ becomes $\beta_1+\beta_3$ when $x_2=1$; the effect of $x_1$ now depends on $x_2$. Another way to think about interactions is to add "squared" terms, because a squared term is just the interaction $x_1 \cdot x_1$.
  5. I don't understand this: it needs a reference! I guess you are speaking about the inclusion of indicator variables (a.k.a. dummies) as interactions?
  6. No, you can use any numbers as $x$.
  7. Look here and read this book.
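To make points 1 and 2 concrete: "adjusting" in a regression simply means including the confounder as a column in the design matrix. Below is a minimal sketch on simulated data (the variable names gender/age and all effect sizes are made up for illustration) with a hand-rolled Newton-Raphson logit, showing how the gender coefficient shrinks once age is included:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
age = rng.normal(size=n)                                # confounder (standardised)
gender = (age + rng.normal(size=n) > 0).astype(float)   # exposure, correlated with age
logit = -1.0 + 2.0 * age                                # age drives the outcome; gender has NO true effect
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

def fit_logit(X, y, iters=25):
    """Plain Newton-Raphson maximum likelihood for logistic regression."""
    X = np.column_stack([np.ones(len(X)), X])           # prepend intercept column
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        grad = X.T @ (y - p)                            # score
        H = (X * (p * (1 - p))[:, None]).T @ X          # observed information
        b += np.linalg.solve(H, grad)
    return b

b_crude = fit_logit(gender[:, None], y)                      # gender only
b_adj = fit_logit(np.column_stack([gender, age]), y)         # gender + age (adjusted)

print("crude gender coef:   ", b_crude[1])   # clearly nonzero (confounded)
print("adjusted gender coef:", b_adj[1])     # near zero once age is in the model
```

The large relative change in the gender coefficient is exactly the informal "10% change" confounding check mentioned in the question, and the model with age included is the "adjusted" model.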
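To illustrate points 3 and 4 (interaction terms): an interaction is just an extra product column in the design matrix, and it lets the slope of one variable differ across levels of another. A small sketch with noiseless simulated data (all coefficients invented for illustration), using ordinary least squares so the recovery is exact:

```python
import numpy as np

x1 = np.linspace(-2, 2, 50)          # continuous variable, e.g. age
x2 = np.tile([0.0, 1.0], 25)         # dummy variable, e.g. gender
# noiseless outcome with a true interaction: the slope of x1 depends on x2
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 4.0 * x1 * x2

X_main = np.column_stack([np.ones_like(x1), x1, x2])   # main effects only
X_int = np.column_stack([X_main, x1 * x2])             # + interaction column

b_main, *_ = np.linalg.lstsq(X_main, y, rcond=None)
b_int, *_ = np.linalg.lstsq(X_int, y, rcond=None)

print(b_int)                          # recovers [1, 2, 3, 4] exactly
print("slope when x2=0:", b_int[1])           # 2
print("slope when x2=1:", b_int[1] + b_int[3])  # 2 + 4 = 6
```

Without the interaction column (`X_main`), the model is forced to use a single slope for both groups, which is what "logit/linear models don't consider interactions unless you add them" means.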
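Regarding point 5 (stratified 2x2 tables): you don't need SPSS or Stata for the crude-vs-stratified risk ratio comparison; it is elementary arithmetic. A toy example with invented counts, constructed so that the stratum-specific risk ratios are 1.0 while the crude (collapsed) risk ratio is 1.5, i.e. the stratifying variable confounds the crude estimate:

```python
# per stratum: (exposed_cases, exposed_total, unexposed_cases, unexposed_total)
strata = {
    "young": (10, 100, 20, 200),   # risk 0.10 vs 0.10 -> RR = 1.0
    "old":   (80, 200, 40, 100),   # risk 0.40 vs 0.40 -> RR = 1.0
}

def risk_ratio(a, n1, c, n0):
    """Risk ratio from a 2x2 table: (a/n1) / (c/n0)."""
    return (a / n1) / (c / n0)

# crude (collapsed) table ignores the stratifying variable
a = sum(s[0] for s in strata.values())
n1 = sum(s[1] for s in strata.values())
c = sum(s[2] for s in strata.values())
n0 = sum(s[3] for s in strata.values())

crude_rr = risk_ratio(a, n1, c, n0)
stratum_rrs = {name: risk_ratio(*counts) for name, counts in strata.items()}

print("crude RR:", crude_rr)          # 1.5 -- appears to show an effect
print("stratified RRs:", stratum_rrs) # 1.0 in every stratum -- no effect
```

Doing this by hand for many confounders at once becomes unwieldy, which is precisely why multivariable regression (including the confounders as covariates, as in the answer above) is the usual alternative to manual stratification.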
