Logistic regression or density estimation for binary dependent variable and binary (or categorical) features

I have a binary dependent variable $t$ and categorical features. We can simplify to binary features, since the categorical variables can be one-hot encoded. (In practice the one-hot encoding induces collinearity among the binary features, so for simplicity let's just assume we have $D$ binary features.) The goal is to estimate the probability of $t=1$.

In principle, I can use logistic regression. But, given the categorical nature of the inputs, the $D$ binary features define a contingency table of $2^D$ cells. So I could instead simply estimate the proportion of $t=1$ samples in each cell (for example by maximum likelihood estimation), as in the sketch below.
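
For concreteness, here is a minimal sketch of the cell-proportion estimator I have in mind, on hypothetical simulated data (the coefficients, the sample size and $D=3$ are arbitrary choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
D = 3
X = rng.integers(0, 2, size=(1000, D))         # D binary features
logit = -0.5 + X @ np.array([1.0, -0.8, 0.4])  # made-up true log odds
t = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # binary target

cols = [f"x{i}" for i in range(D)]
df = pd.DataFrame(X, columns=cols).assign(t=t)

# ML estimate of p(t=1) per cell: the observed proportion of t=1 in that cell.
cell_probs = df.groupby(cols)["t"].mean()
print(cell_probs)  # one estimate per occupied cell (at most 2^D rows)
```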

I think this should be similar to logistic regression, in that both approaches assume a binomial likelihood. However, logistic regression additionally assumes that the log odds are a linear function of the input variables (an assumption the density estimation procedure does not make). I suspect this assumption is not critical here, given the binary nature of the inputs.

So the question is,

  1. Are the two approaches different?
  2. If yes, in what aspect are they different?

One difference, of course, is that the estimation method for logistic regression is iterative, so in some cases there might be convergence issues. One would be tempted to add that as $D$ increases many cells in the table will be (nearly) empty, but I think logistic regression would suffer in those cases as well.

As additional questions (connected to the first one):

  1. Is there anything wrong with my line of thought?
  2. Which of the two approaches should perform better?



Background

I think the general model behind both approaches mentioned in the question is a fully generative one:

$ p(C_1|x) = \frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1)+p(x|C_2)p(C_2)}$

This model can be rewritten as:

$p(C_1|x) = \frac{1}{1+\frac{p(x|C_2)p(C_2)}{p(x|C_1)p(C_1)}} = \frac{1}{1+\exp\left[-\ln\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}\right]} = \sigma(a(x))$,

where $a(x)= \ln\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}$ and $\sigma(a)=\frac{1}{1+\exp(-a)}$ is the logistic sigmoid function.
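
As a quick numeric sanity check (my own addition, with made-up likelihoods and priors for a single point $x$), the posterior computed directly via Bayes' rule coincides with $\sigma(a(x))$:

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

p_x_C1, p_x_C2 = 0.30, 0.10  # p(x|C1), p(x|C2): arbitrary values
p_C1, p_C2 = 0.40, 0.60      # p(C1), p(C2)

posterior = p_x_C1 * p_C1 / (p_x_C1 * p_C1 + p_x_C2 * p_C2)
a = np.log((p_x_C1 * p_C1) / (p_x_C2 * p_C2))
assert np.isclose(posterior, sigma(a))
print(posterior)  # ~0.667 either way
```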

Depending on how we model the log ratio of class conditional probabilities and class priors (the argument of the sigmoid) we recover different models.

If we assume $p(x|C_i)$ belongs to the exponential family, with a scale parameter shared across classes, then $a(x)$ is a linear function of $x$:

$p(C_1|x)=\sigma(w_0+w^Tx)$

For example, if $p(x|C_i)$ is assumed Gaussian with equal covariance across classes, we get a linear discriminant. If the equal-covariance assumption is relaxed, we recover a quadratic discriminant (for which $a(x)$ is no longer a linear function of the input but, you guessed it, a quadratic one).
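
To illustrate these two Gaussian special cases, here is a minimal sketch assuming scikit-learn (the means, covariance and sample sizes are arbitrary):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)
cov = [[1.0, 0.3], [0.3, 1.0]]
X = np.vstack([
    rng.multivariate_normal([0, 0], cov, size=200),  # class 0
    rng.multivariate_normal([2, 1], cov, size=200),  # class 1
])
y = np.repeat([0, 1], 200)

lda = LinearDiscriminantAnalysis().fit(X, y)     # shared covariance: a(x) linear
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class covariance: a(x) quadratic
print(lda.predict_proba(X[:2]))
print(qda.predict_proba(X[:2]))
```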

If we don't make further assumptions (beyond exponential family with shared scale) but directly estimate the coefficients in $a(x) = w_0+w^T x$, we get logistic regression (this is thus a kind of degenerate case, since the class-conditional densities are never actually estimated and the model is no longer generative).
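
A minimal sketch of this direct estimation, assuming scikit-learn and the same kind of simulated binary data as above (penalty=None requests plain maximum likelihood; versions before 1.2 spell it penalty='none'):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
D = 3
X = rng.integers(0, 2, size=(1000, D))
logit = -0.5 + X @ np.array([1.0, -0.8, 0.4])
t = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# penalty=None gives plain maximum likelihood (the default is L2-penalized).
model = LogisticRegression(penalty=None).fit(X, t)
print(model.intercept_, model.coef_)  # estimates of w0 and w
```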

Now to the case in question: if the input data are binary, $p(x|C_i)$ is a multivariate Bernoulli. Without further assumptions, each class-conditional distribution is a full table over the $2^D$ cells, in each of which we have to estimate a Bernoulli parameter $\theta$ (one cell is redundant due to the sum-to-one constraint, so there are $2^D-1$ free parameters per class). This corresponds to the "density estimation" approach outlined in the question, as the sketch below illustrates.
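
Here is the sketch referred to above: the class priors and the full per-class cell distributions are estimated from (simulated) data and combined via Bayes' rule, and the resulting posterior coincides with the raw per-cell proportion of $t=1$. Note it omits the smoothing or fallback a nearly empty cell would require:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
D = 3
X = rng.integers(0, 2, size=(1000, D))
logit = -0.5 + X @ np.array([1.0, -0.8, 0.4])
t = rng.binomial(1, 1 / (1 + np.exp(-logit)))

cols = [f"x{i}" for i in range(D)]
df = pd.DataFrame(X, columns=cols).assign(t=t)

prior = df["t"].value_counts(normalize=True)  # p(C_k)
lik = pd.crosstab([df[c] for c in cols], df["t"],
                  normalize="columns")        # p(cell | C_k), per class

cell = (1, 0, 1)  # an example cell
num = lik.loc[cell, 1] * prior[1]
den = num + lik.loc[cell, 0] * prior[0]
print(num / den)                               # generative posterior
print(df.groupby(cols)["t"].mean().loc[cell])  # raw cell proportion: identical
```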

Notice that if we instead assume the features are independent given the class, we recover the naive Bayes model, for which $a(x)$ is again a linear function of the input (the underlying model has $2D$ Bernoulli parameters, $D$ per class, which combine into $D$ weights plus a bias):

$p(C_1|x)=\sigma\left(\sum_{i=1}^{D}\left[x_i \ln\frac{\theta_{i1}}{\theta_{i2}} + (1-x_i)\ln\frac{1-\theta_{i1}}{1-\theta_{i2}}\right] + \ln\frac{p(C_1)}{p(C_2)}\right)$, where $\theta_{ik}=p(x_i=1|C_k)$.
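
A minimal naive Bayes sketch on the same kind of simulated data, computing the linear sigmoid argument from the per-feature Bernoulli estimates (no smoothing is applied, so a feature that is constant within a class would break the logarithms):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
X = rng.integers(0, 2, size=(1000, D))
logit = -0.5 + X @ np.array([1.0, -0.8, 0.4])
t = rng.binomial(1, 1 / (1 + np.exp(-logit)))

theta1 = X[t == 1].mean(axis=0)  # p(x_i = 1 | C_1), D parameters
theta2 = X[t == 0].mean(axis=0)  # p(x_i = 1 | C_2), D parameters
p1 = t.mean()                    # p(C_1)

def posterior(x):
    # a(x): linear in x, as in the equation above
    a = (x * np.log(theta1 / theta2)
         + (1 - x) * np.log((1 - theta1) / (1 - theta2))).sum()
    a += np.log(p1 / (1 - p1))
    return 1 / (1 + np.exp(-a))

print(posterior(np.array([1, 0, 1])))
```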

Conclusions

  • Both approaches outlined in the question are instances of the fully generative model above.
  • Logistic regression assumes that $p(x|C_i)$ belongs to the exponential family with a scale parameter shared across classes, and directly estimates the coefficients of the corresponding linear predictor.
  • The density estimation with binary features assumes a Bernoulli distribution but does not assume independence of the features or a shared scale parameter; the predictor is therefore not restricted to be linear.
  • As for which approach should perform better under which conditions, it is hard to tell in general. Given its larger number of parameters, the density estimation approach needs more data than logistic regression to estimate the same effect sizes with comparable precision, and in practice, as $D$ grows, the probability that some cells are nearly empty becomes substantial. On the other hand, if the linearity assumption is strongly violated in the dataset, the density approach should perform better than logistic regression (especially if sample size is not an issue). This gives a general idea, but it should be tested on the specific dataset of interest; a rough simulation template follows this list.
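
Here is the rough simulation template referred to above (my own sketch; the true log odds are linear by construction, which favors logistic regression, so adapt it to the data at hand before drawing conclusions):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
D, n_train, n_test = 6, 500, 5000
w = rng.normal(size=D)

def simulate(n):
    X = rng.integers(0, 2, size=(n, D))
    p = 1 / (1 + np.exp(-(X @ w - w.sum() / 2)))  # linear true log odds
    return X, rng.binomial(1, p), p

Xtr, ttr, _ = simulate(n_train)
Xte, _, pte = simulate(n_test)

# Single pooled model: logistic regression by plain maximum likelihood.
lr = LogisticRegression(penalty=None).fit(Xtr, ttr)
p_lr = lr.predict_proba(Xte)[:, 1]

# Per-cell proportions; empty cells fall back to the overall base rate.
cols = [f"x{i}" for i in range(D)]
cell = pd.DataFrame(Xtr, columns=cols).assign(t=ttr).groupby(cols)["t"].mean()
keys = [tuple(row) for row in Xte]
p_cell = cell.reindex(keys).fillna(ttr.mean()).to_numpy()

print("MSE vs true p, logistic:", np.mean((p_lr - pte) ** 2))
print("MSE vs true p, per-cell:", np.mean((p_cell - pte) ** 2))
```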

References

The description above is based on:

  • Bishop, C. M., Pattern Recognition and Machine Learning, chap. 4
  • Murphy, K. P., Machine Learning: A Probabilistic Perspective, chap. 3

One of the primary differences between fitting a logistic regression model and estimating a proportion for each combination is the difference between fitting a single overall model and fitting many separate individual models. Logistic regression yields a single model that attempts to accommodate the outcomes of all combinations. Estimating a separate proportion for each combination fits each cell more closely, but is more complex and can lead to overfitting, as the example below illustrates.
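
A tiny hypothetical example of this trade-off: the pooled logistic regression still produces a prediction for a feature combination that never occurs in the training data, whereas the per-combination proportion is simply undefined there:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0], [0, 1], [1, 0], [0, 0], [0, 1], [1, 0]])
t = np.array([0, 1, 1, 0, 0, 1])
# The combination (1, 1) never occurs in the training data above.

model = LogisticRegression().fit(X, t)
print(model.predict_proba([[1, 1]])[:, 1])  # the pooled model extrapolates

# The per-cell estimator has no samples in cell (1, 1): no estimate at all.
```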

Choosing between the modeling options depends on the goal of the project.
