Is automatic feature detection feasible?
I am searching for pointers to algorithms for feature detection.
EDIT: All the answers helped me a lot; I cannot decide which one to accept. Thanks, guys!
What I did:
For discrete variables (i.e. $D_i, E$ are finite sets) $X_i : \Omega \to D_i$ and a given data table $$ \begin{pmatrix} X_1 & \dots & X_n & X_{n+1} \\ x_1^{(1)} & \dots & x_n^{(1)} & x_{n+1}^{(1)} \\ \vdots & & \vdots & \vdots \\ x_1^{(m)} & \dots & x_n^{(m)} & x_{n+1}^{(m)} \end{pmatrix} $$ (the last variable will be the 'outcome'; that's why I stress it with a special index) and $X, Y$ being some of the $X_1, \dots, X_{n+1}$ (so if $X=X_a, Y=X_b$ then $D=D_a, E=D_b$), compute
$$H(X) = - \sum_{d \in D} P[X=d] \log P[X=d]$$
$$H(Y \mid X) = - \sum_{d \in D} P[X=d] \sum_{e \in E} P[Y=e \mid X=d] \log P[Y=e \mid X=d]$$
where we estimate $$P[X_a=d] \approx \frac{|\{j \in \{1, \dots, m\} : x_a^{(j)} = d\}|}{m}$$ and analogously $$P[X_a=d \cap X_b=e] \approx \frac{|\{j \in \{1, \dots, m\} : x_a^{(j)} = d ~\text{and}~ x_b^{(j)}=e\}|}{m}$$ and then $$I(Y;X) = \frac{H(Y) - H(Y \mid X)}{\log(\min(|D|, |E|))}$$ which is to be interpreted as the influence of $Y$ on $X$ (or vice versa; it's symmetric).
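A minimal Python sketch of how these estimates can be computed (the function names are my own, not from any library; columns are plain 1-D arrays of discrete values):

```python
import numpy as np
from collections import Counter

def entropy(column):
    """Empirical H(X) = -sum_d P[X=d] log P[X=d] from a 1-D array of discrete values."""
    m = len(column)
    probs = np.array([count / m for count in Counter(column).values()])
    return float(-np.sum(probs * np.log(probs)))

def conditional_entropy(y, x):
    """Empirical H(Y|X) = sum_d P[X=d] * H(Y restricted to the rows where X=d)."""
    x, y = np.asarray(x), np.asarray(y)
    m = len(x)
    return sum((x == value).sum() / m * entropy(y[x == value]) for value in np.unique(x))

def normalized_mi(y, x):
    """I(Y;X) = (H(Y) - H(Y|X)) / log(min(|D|, |E|)), as defined above."""
    denom = np.log(min(len(np.unique(x)), len(np.unique(y))))
    if denom == 0.0:  # one of the variables is constant, so the normalization is undefined
        return 0.0
    return (entropy(y) - conditional_entropy(y, x)) / denom
```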
EDIT: A little late now but still:
This is wrong:
Exercise for you: show that if $X=Y$ then $I(X;Y)=1$.
This is correct: Exercise for you: show that if $X=Y$ then $I(X;X)=H(X)/\log(|D|)$, and if $X$ is additionally uniformly distributed then $I(X;X)=1$.
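For completeness, the short derivation (with the convention $0 \cdot \log 0 = 0$): since $P[X=e \mid X=d]$ is $1$ for $e=d$ and $0$ otherwise, every inner sum in $H(X \mid X)$ vanishes, so $H(X \mid X) = 0$ and $$I(X;X) = \frac{H(X) - H(X \mid X)}{\log(\min(|D|,|D|))} = \frac{H(X)}{\log(|D|)}.$$ If $X$ is moreover uniformly distributed, then $H(X) = \log(|D|)$ and hence $I(X;X) = 1$.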
For selecting features, start with the available set $\{X_1, \dots, X_n\}$ and an 'already selected' list $= ()$ [this is an ordered list!]. We select features step by step, always taking the one that maximizes $$\text{goodness}(X) = I(X; X_{n+1}) - \beta \sum_{X_i ~\text{already selected}} I(X; X_i)$$ for a value $\beta$ to be determined (the authors suggest $\beta = 0.5$). That is, goodness = influence on the outcome minus the redundancy introduced by selecting this variable. After running this procedure, keep the first 'few' features and throw away the ones with lower rank (whatever 'few' means exactly; I still have to play with that a bit). This is what is described in this paper.
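A rough sketch of that greedy ranking, reusing the `normalized_mi` helper from above (the `features` dict is just a placeholder for however you store your columns):

```python
def rank_features(features, outcome, beta=0.5):
    """Greedily order features by
    goodness(X) = I(X; outcome) - beta * sum of I(X; X_i) over already selected X_i.

    features: dict mapping feature name -> 1-D array of discrete values
    outcome:  1-D array with the outcome column (same length)
    """
    remaining = dict(features)
    selected = []  # ordered list of (name, goodness at the time of selection)

    # relevance to the outcome never changes, so compute it once per feature
    relevance = {name: normalized_mi(outcome, col) for name, col in remaining.items()}

    while remaining:
        def goodness(name):
            redundancy = sum(normalized_mi(features[name], features[sel]) for sel, _ in selected)
            return relevance[name] - beta * redundancy

        best = max(remaining, key=goodness)
        selected.append((best, goodness(best)))
        del remaining[best]

    return selected  # keep only the first 'few' entries
```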
For computing $I$ for continuous variables one needs to bin them in some way. More concretely, the inventors of MIC suggest taking the maximal value over binnings of $X$ into $n_X$ bins and $Y$ into $n_Y$ bins with $n_X \cdot n_Y \leq m^{0.6}$, i.e. compute $$ \text{MIC}(X;Y) = \max_{n_X \cdot n_Y \leq m^{0.6}} \left( \frac{I_{n_X, n_Y}(X;Y)}{\log(\min(n_X, n_Y))} \right)$$
where $I_{n_X, n_Y}(X;Y)$ means: compute the mutual information $H(Y) - H(Y \mid X)$ exactly as for discrete variables, treating $X$ as a discrete random variable after binning it into $n_X$ bins, and analogously for $Y$ (the division by $\log(\min(n_X, n_Y))$ already appears in the MIC formula, so it is not applied twice).
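And a rough sketch of that, reusing the entropy helpers from the discrete case above. This is only an approximation: it tries equal-width bins only, whereas the actual MIC estimator also optimizes where the bin boundaries are placed:

```python
import numpy as np

def binned_mi(x, y, n_x, n_y):
    """Mutual information H(Y) - H(Y|X) after equal-width binning of x and y."""
    x_binned = np.digitize(x, np.histogram_bin_edges(x, bins=n_x)[1:-1])
    y_binned = np.digitize(y, np.histogram_bin_edges(y, bins=n_y)[1:-1])
    return entropy(y_binned) - conditional_entropy(y_binned, x_binned)

def mic(x, y, exponent=0.6):
    """Maximize binned_mi / log(min(n_x, n_y)) over all grids with n_x * n_y <= m**exponent."""
    budget = len(x) ** exponent
    best = 0.0
    for n_x in range(2, int(budget) + 1):
        for n_y in range(2, int(budget / n_x) + 1):
            best = max(best, binned_mi(x, y, n_x, n_y) / np.log(min(n_x, n_y)))
    return best
```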
===
ORIGINAL QUESTION
More precisely: I have a classification problem for one boolean variable; let's call this variable outcome.
I have lots of data and lots of features (~150 or so). These features are not totally 'meaningless' as in image prediction (where every x and y coordinate is a feature); rather, they are of the form gender, age, etc.
What I have done so far: from these 150 features, I guessed the ones that 'seem' to have some importance for the outcome. Still, I am unsure which features to select and how to measure their importance before starting the actual learning algorithm (which involves yet more selection, like PCA).
For example, for a feature f taking only finitely many values x_1, ..., x_n, my very naive approach would be to compute some relation between P(outcome==TRUE | f==x_1), ..., P(outcome==TRUE | f==x_n) and P(outcome==TRUE) (i.e. the feature is important when I can deduce more information about the outcome from it than without any knowledge about the feature).
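To make that concrete, a tiny sketch assuming the data sits in a pandas DataFrame `df` with a boolean column `outcome` (the column names are placeholders):

```python
import pandas as pd

def naive_importance(df, feature, target="outcome"):
    """For each value x of the feature, compare P(outcome==True | feature==x)
    with the base rate P(outcome==True); return the largest absolute gap."""
    base_rate = df[target].mean()              # P(outcome == True)
    cond = df.groupby(feature)[target].mean()  # P(outcome == True | feature == x)
    return float((cond - base_rate).abs().max())
```

A value close to 0 would mean that, by this naive criterion, the feature tells me almost nothing about the outcome.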
Concrete question(s): Is that a good idea? Which relation to take? What to do with continuous variables?
I'm sure that I'm not the first one ever wondering about this. I've read about (parts of) algorithms that do this selection in a sort-of automated way. Can somebody point me into the right direction (references, names of algorithms to look for, ...)?
Topic: featurization, feature-selection
Category: Data Science