Active learning with mixture model cluster assignments - am I injecting bias here?

Suppose I have a dataset of people's phone numbers and heights, and I'm interested in learning the parameters $p_{girl}$, $p_{boy}=1-p_{girl}$, $\mu_{boy}$, $\mu_{girl}$, and an overall $\sigma$ governing the distribution of people's heights. I don't have boy/girl labels yet, but if I really want one, I can call the phone number and ask whether the person is a boy or a girl.

Procedure:

  1. Fit a Gaussian mixture model to heights via EM. Assign the greater of the $\mu$s to be $\mu_{boy}$.
  2. Pick a datapoint according to some scheme (more on this in a second) and call the associated phone number to get a boy/girl label.
  3. Re-fit a Gaussian mixture model to heights, but in the E step, clamp the membership probabilities to 1 or 0 for the points whose labels I've received. Somehow enforce the constraint that $\mu_{girl} \le \mu_{boy}$ (this may not be trivial).
  4. Repeat steps 2–3 until tired of calling people.
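
Steps 1 and 3 can be sketched as a single semi-supervised EM routine: unlabeled points get soft responsibilities as usual, labeled points are clamped to hard 0/1 memberships, and the ordering constraint is handled by swapping components whenever $\mu_{girl} > \mu_{boy}$. The function name, the `-1` convention for "no label yet", and the pooled-$\sigma$ M step are my own illustrative choices, not anything canonical:

```python
import numpy as np

def semisupervised_gmm(heights, labels, n_iter=100, tol=1e-6):
    """Two-component 1-D GMM via EM, with a shared sigma.

    labels: same length as heights; 0 = girl, 1 = boy, -1 = unlabeled.
    Labeled points get hard 0/1 responsibilities in the E step.
    """
    x = np.asarray(heights, dtype=float)
    n = len(x)
    # init: put the two means at the lower/upper quartiles
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    sigma = x.std()
    p = np.array([0.5, 0.5])
    ll_old = -np.inf
    for _ in range(n_iter):
        # E step: soft responsibilities r[i, k], computed in log space
        log_pdf = (-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2
                   - np.log(sigma) - 0.5 * np.log(2 * np.pi))
        log_joint = np.log(p)[None, :] + log_pdf
        log_norm = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
        r = np.exp(log_joint - log_norm[:, None])
        # clamp labeled points to hard memberships
        r[labels == 0] = [1.0, 0.0]   # girl -> component 0
        r[labels == 1] = [0.0, 1.0]   # boy  -> component 1
        # M step: mixture weights, means, pooled (shared) sigma
        nk = r.sum(axis=0)
        p = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu[None, :]) ** 2).sum() / n)
        # identifiability: keep mu_girl <= mu_boy by swapping components
        if mu[0] > mu[1]:
            mu, p = mu[::-1], p[::-1]
        ll = log_norm.sum()
        if abs(ll - ll_old) < tol:
            break
        ll_old = ll
    return p, mu, sigma
```

Swapping components after the M step is the cheap way to keep the "boy" component on the high side; a properly constrained M step would only be needed if the incoming labels could themselves drag $\mu_{boy}$ below $\mu_{girl}$.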

This seems to make sense to me, but only at an intuitive level. I'm worried that if my label querying scheme is not random, I'll be biasing my results.

  1. If I pick the point with the greatest entropy (i.e., the point whose cluster membership is most uncertain), I'll call a bunch of people who have height 5'4.5", and probably get a 50/50 breakdown of girl/boy. After a couple of points, I think this could end up with two clusters sitting on top of each other.
  2. If my label querying scheme is to pick the points furthest from the decision boundary, it should lead to a nice separation, and it would probably help enforce $\mu_{boy} > \mu_{girl}$ if that constraint in step 3 turns out to be hard to enforce. But after a couple of points, I'll be calling the least informative people. Also, perhaps my results will imply greater separation than there actually is.
  3. Picking points randomly seems so inefficient.
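
The three schemes differ only in how they rank the posterior membership entropy of the still-unlabeled points. A hypothetical helper (the name `pick_query` and the scheme strings are made up for illustration, assuming the same `labels == -1` convention as above) might look like:

```python
import numpy as np

def pick_query(x, mu, sigma, p, labels, scheme="uncertain"):
    """Choose the index of the next person to phone.

    scheme="uncertain": highest membership entropy, i.e. the
        5'4.5" callers of scheme 1.
    scheme="confident": lowest entropy, i.e. furthest from the
        decision boundary, as in scheme 2.
    scheme="random": the baseline of scheme 3.
    """
    unlabeled = np.flatnonzero(labels == -1)
    # posterior membership probabilities for each unlabeled point
    pdf = np.exp(-0.5 * ((x[unlabeled, None] - mu[None, :]) / sigma) ** 2)
    post = p[None, :] * pdf
    post /= post.sum(axis=1, keepdims=True)
    # entropy of the 2-way membership posterior
    ent = -(post * np.log(np.clip(post, 1e-12, 1.0))).sum(axis=1)
    if scheme == "uncertain":
        return unlabeled[np.argmax(ent)]
    if scheme == "confident":
        return unlabeled[np.argmin(ent)]
    return np.random.default_rng().choice(unlabeled)
```

For example, with $\mu = (64, 70)$ and heights $(60, 67, 75)$, "uncertain" picks the 67" person sitting halfway between the means, while "confident" picks whichever extreme point has the most lopsided posterior.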

I'm not confident that the degenerate situations in schemes 1 and 2 will actually unfold, but my undergrad-level instinct for why a nonrandom sampling scheme could inject bias is that this smells just like missing-data imputation where the data are not missing at random.

Am I in trouble if I query labels by some scheme, or even arbitrarily? Or am I stuck with calling people randomly?

Topic: bias, missing-data, expectation-maximization, active-learning, clustering

Category: Data Science
