Incorrect example of applying Bayes' theorem
I have been reading The Data Science Design Manual by Steven S. Skiena and came across an example of applying Bayes' theorem that confused me, and I suspect it might be wrong. The example is the following:
$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$
Suppose A is the event that person x is actually a terrorist, and B is the result of a feature-based classifier that decides if x looks like a terrorist. When trained/evaluated on a data set of 1,000 people, half of whom were terrorists, the classifier achieved an enviable accuracy of, say, 90%. The classifier now says that Skiena looks like a terrorist. What is the probability that Skiena really is a terrorist? The key insight here is that the prior probability of “x is a terrorist” is really, really low. If there are a hundred terrorists operating in the United States, then $P(A) = 100/300{,}000{,}000 = 3.33 \times 10^{-7}$. The probability of the terrorist detector saying yes, $P(B) = 0.5$, while the probability of the detector being right when it says yes $P(B|A) = 0.9$. Multiplying this out gives a still very tiny probability that I am a bad guy,
$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{(0.9)(3.33 \times 10^{-7})}{0.5} = 6 \times 10^{-7} $$
However, $P(B) = 0.5$ doesn't seem correct to me. $P(B)$ is supposed to be the probability of the terrorist detector saying yes when run on a person randomly selected from the United States population (e.g. Skiena). If I understand correctly, the $0.5$ used by the author is the fraction of terrorists in the classifier's evaluation data set, which is not the same thing, for two reasons:
- This is a sample that is not randomly selected to be equivalent to some population (the one Skiena is selected from), but specifically selected to contain the aforementioned ratio of terrorists.
- This ratio is not the ratio of people in the evaluation dataset that look like terrorists (i.e. the probability the classifier would say yes for a random person in that sample), but the ratio of actual terrorists in the sample.
My understanding is that to calculate $P(B)$ properly, one would have to draw a random sample from the United States population (assuming this is the population Skiena is drawn from), run the classifier on that sample, and calculate the fraction of people for whom the classifier said yes.
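To make my reasoning concrete, here is a minimal sketch of how I would compute $P(B)$ with the law of total probability instead of sampling. It rests on an assumption the book does not state: that "90% accuracy" means the classifier says yes for 90% of terrorists and no for 90% of non-terrorists. The function and variable names are my own.

```python
def posterior_given_yes(prior, sensitivity, specificity):
    """Return (P(B), P(A|B)) for a yes/no classifier.

    prior       -- P(A), fraction of the population that is in class A
    sensitivity -- P(B|A), assumed rate of "yes" among true members of A
    specificity -- P(not B|not A), assumed rate of "no" among non-members
    """
    p_yes_given_a = sensitivity
    p_yes_given_not_a = 1.0 - specificity
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    p_yes = p_yes_given_a * prior + p_yes_given_not_a * (1.0 - prior)
    # Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
    return p_yes, p_yes_given_a * prior / p_yes


# Prior from the book: 100 terrorists among 300 million people.
prior_us = 100 / 300_000_000
p_yes_us, post_us = posterior_given_yes(prior_us, 0.9, 0.9)

# Same classifier on the balanced evaluation set (half are terrorists).
p_yes_eval, post_eval = posterior_given_yes(0.5, 0.9, 0.9)

print(f"US population:  P(B) = {p_yes_us:.7f}, P(A|B) = {post_us:.3e}")
print(f"Evaluation set: P(B) = {p_yes_eval:.7f}, P(A|B) = {post_eval:.3f}")
```

Under that assumption, $P(B)$ for the general population comes out close to $0.1$ rather than $0.5$, and $P(A|B)$ comes out at about $3 \times 10^{-6}$, which is still tiny but not the book's figure. $P(B) = 0.5$ only appears when the prior is the evaluation set's 50%.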
Is my thinking correct or am I missing something?
Topic bayesian probability data statistics
Category Data Science