Is automatic feature detection feasible?
I am searching for pointers to algorithms for feature detection.
EDIT: All the answers helped me a lot; I cannot decide which one to accept. Thanks, guys!
What I did:
For discrete variables (i.e. $D_i, E$ are finite sets) $X_i : \Omega \to D_i$ and a given data table $$ \begin{pmatrix} X_1 & \dots & X_n & X_{n+1} \\ x_1^{(1)} & \dots & x_n^{(1)} & x_{n+1}^{(1)} \\ \vdots & & \vdots & \vdots \\ x_1^{(m)} & \dots & x_n^{(m)} & x_{n+1}^{(m)} \end{pmatrix} $$ (the last variable will be the 'outcome'; that's why I stress it with a special index) and $X, Y$ being some of the $X_1, \dots, X_{n+1}$ (so if $X=X_a, Y=X_b$ then $D=D_a, E=D_b$), compute
$$H(X) = - \sum_{d \in D} P[X=d] \log P[X=d]$$
$$H(Y \mid X) = - \sum_{d \in D} P[X=d] \sum_{e \in E} P[Y=e \mid X=d] \log P[Y=e \mid X=d]$$
where we estimate $$P[X_a=d] \approx \frac{|\{j \in \{1, \dots, m\} : x_a^{(j)} = d\}|}{m}$$ and analogously $$P[X_a=d \cap X_b=e] \approx \frac{|\{j \in \{1, \dots, m\} : x_a^{(j)} = d ~\text{and}~ x_b^{(j)}=e\}|}{m}$$ and then $$I(Y;X) = \frac{H(Y) - H(Y \mid X)}{\log(\min(|D|, |E|))}$$ which is to be interpreted as the influence of $Y$ on $X$ (or vice versa; it's symmetric).
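A minimal Python sketch of how these estimates can be computed (the function names are my own, not from any library; columns are plain 1-D arrays of discrete values):

```python
import numpy as np
from collections import Counter

def entropy(column):
    """Empirical H(X) = -sum_d P[X=d] log P[X=d] from a 1-D array of discrete values."""
    m = len(column)
    probs = np.array([count / m for count in Counter(column).values()])
    return float(-np.sum(probs * np.log(probs)))

def conditional_entropy(y, x):
    """Empirical H(Y|X) = sum_d P[X=d] * H(Y restricted to the rows where X=d)."""
    x, y = np.asarray(x), np.asarray(y)
    m = len(x)
    return sum((x == value).sum() / m * entropy(y[x == value]) for value in np.unique(x))

def normalized_mi(y, x):
    """I(Y;X) = (H(Y) - H(Y|X)) / log(min(|D|, |E|)), as defined above."""
    denom = np.log(min(len(np.unique(x)), len(np.unique(y))))
    if denom == 0.0:  # one of the variables is constant, so the normalization is undefined
        return 0.0
    return (entropy(y) - conditional_entropy(y, x)) / denom
```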
EDIT: A little late now but still:
This is wrong:
Exercise for you: show that if $X=Y$ then $I(X;Y)=1$.
This is correct: Exercise for you: show that if $X=Y$ then $I(X;X)=H(X)/\log(|D|)$, and if $X$ is additionally uniformly distributed then $I(X;X)=1$.
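For completeness, the short derivation (with the convention $0 \cdot \log 0 = 0$): since $P[X=e \mid X=d]$ is $1$ for $e=d$ and $0$ otherwise, every inner sum in $H(X \mid X)$ vanishes, so $H(X \mid X) = 0$ and $$I(X;X) = \frac{H(X) - H(X \mid X)}{\log(\min(|D|,|D|))} = \frac{H(X)}{\log(|D|)}.$$ If $X$ is moreover uniformly distributed, then $H(X) = \log(|D|)$ and hence $I(X;X) = 1$.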
For selecting features, start with the available set $\{X_1, \dots, X_n\}$ and an 'already selected' list $= ()$ [this is an ordered list!]. We select features step by step, always taking the one that maximizes $$\text{goodness}(X) = I(X; X_{n+1}) - \beta \sum_{X_i ~\text{already selected}} I(X; X_i)$$ for a value $\beta$ to be determined (the authors suggest $\beta = 0.5$). That is, goodness = influence on the outcome minus the redundancy introduced by selecting this variable. After running this procedure, keep the first 'few' features and throw away the ones with lower rank (whatever 'few' means exactly; I still have to play with that a bit). This is what is described in this paper.
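A rough sketch of that greedy ranking, reusing the `normalized_mi` helper from above (the `features` dict is just a placeholder for however you store your columns):

```python
def rank_features(features, outcome, beta=0.5):
    """Greedily order features by
    goodness(X) = I(X; outcome) - beta * sum of I(X; X_i) over already selected X_i.

    features: dict mapping feature name -> 1-D array of discrete values
    outcome:  1-D array with the outcome column (same length)
    """
    remaining = dict(features)
    selected = []  # ordered list of (name, goodness at the time of selection)

    # relevance to the outcome never changes, so compute it once per feature
    relevance = {name: normalized_mi(outcome, col) for name, col in remaining.items()}

    while remaining:
        def goodness(name):
            redundancy = sum(normalized_mi(features[name], features[sel]) for sel, _ in selected)
            return relevance[name] - beta * redundancy

        best = max(remaining, key=goodness)
        selected.append((best, goodness(best)))
        del remaining[best]

    return selected  # keep only the first 'few' entries
```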
For computing $I$ for continuous variables one needs to bin them in some way. More concretely, the inventors of MIC suggest taking the maximal value over binnings of $X$ into $n_X$ bins and $Y$ into $n_Y$ bins with $n_X \cdot n_Y \leq m^{0.6}$, i.e. compute $$ \text{MIC}(X;Y) = \max_{n_X \cdot n_Y \leq m^{0.6}} \left( \frac{I_{n_X, n_Y}(X;Y)}{\log(\min(n_X, n_Y))} \right)$$
where $I_{n_X, n_Y}(X;Y)$ means: compute the mutual information $H(Y) - H(Y \mid X)$ exactly as for discrete variables, treating $X$ as a discrete random variable after binning it into $n_X$ bins, and analogously for $Y$ (the division by $\log(\min(n_X, n_Y))$ already appears in the MIC formula, so it is not applied twice).
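And a rough sketch of that, reusing the entropy helpers from the discrete case above. This is only an approximation: it tries equal-width bins only, whereas the actual MIC estimator also optimizes where the bin boundaries are placed:

```python
import numpy as np

def binned_mi(x, y, n_x, n_y):
    """Mutual information H(Y) - H(Y|X) after equal-width binning of x and y."""
    x_binned = np.digitize(x, np.histogram_bin_edges(x, bins=n_x)[1:-1])
    y_binned = np.digitize(y, np.histogram_bin_edges(y, bins=n_y)[1:-1])
    return entropy(y_binned) - conditional_entropy(y_binned, x_binned)

def mic(x, y, exponent=0.6):
    """Maximize binned_mi / log(min(n_x, n_y)) over all grids with n_x * n_y <= m**exponent."""
    budget = len(x) ** exponent
    best = 0.0
    for n_x in range(2, int(budget) + 1):
        for n_y in range(2, int(budget / n_x) + 1):
            best = max(best, binned_mi(x, y, n_x, n_y) / np.log(min(n_x, n_y)))
    return best
```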
===
ORIGINAL QUESTION
More precisely: I have a classification problem for one boolean variable; let's call this variable outcome.
I have lots of data and lots of features (~150 or so). These features are not totally 'meaningless' as in image prediction (where every x and y coordinate is a feature); rather, they are of the form gender, age, etc.
What I have done so far: from these 150 features, I guessed the ones that 'seem' to have some importance for the outcome. Still, I am unsure which features to select and how to measure their importance before starting the actual learning algorithm (which involves yet more selection, like PCA).
For example, for a feature f taking only finitely many values x_1, ..., x_n, my very naive approach would be to compute some relation between P(outcome==TRUE | f==x_1), ..., P(outcome==TRUE | f==x_n) and P(outcome==TRUE) (i.e. the feature is important when I can deduce more information about the outcome from it than without any knowledge about the feature).
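To make that concrete, a tiny sketch assuming the data sits in a pandas DataFrame `df` with a boolean column `outcome` (the column names are placeholders):

```python
import pandas as pd

def naive_importance(df, feature, target="outcome"):
    """For each value x of the feature, compare P(outcome==True | feature==x)
    with the base rate P(outcome==True); return the largest absolute gap."""
    base_rate = df[target].mean()              # P(outcome == True)
    cond = df.groupby(feature)[target].mean()  # P(outcome == True | feature == x)
    return float((cond - base_rate).abs().max())
```

A value close to 0 would mean that, by this naive criterion, the feature tells me almost nothing about the outcome.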
Concrete question(s): Is that a good idea? Which relation to take? What to do with continuous variables?
I'm sure that I'm not the first one ever wondering about this. I've read about (parts of) algorithms that do this selection in a sort-of automated way. Can somebody point me into the right direction (references, names of algorithms to look for, ...)?
Topic: featurization, feature-selection
Category: Data Science