Association Mining - is buying Independent?

I have a problem: I can't solve this exercise. What is the best way to solve it? What are the approaches for this kind of exercise?

The following table summarizes transactions in a supermarket where customers bought tomatoes and/or mozzarella cheese or neither.

Is buying mozzarella independent of buying tomatoes in the data given above? If they are not independent, explain whether they are positively or negatively correlated, i.e. does buying one of them increase or decrease the probability of buying the other?

As you can see, I calculated lift(Moz => Tom) = 1,33 and lift(Moz => NoTom) = 0,5.

So I think they are not independent, and that they are positively correlated.



You are correct. They are not independent and they are positively correlated.

Let $A$ and $B$ (or $X$ and $Y$) be two events for stating general theorems and $M$ and $T$ be the events "customer purchases mozzarella" and "customer purchases tomatoes" in this specific example. We will use $\wedge$ to mean "and" so that $M \wedge T$ is the event "customer purchases mozzarella and tomatoes".

The simplest way to check independence is directly from the definition. $A$ and $B$ are independent if $P(A \wedge B) = P(A)P(B)$. From the data table we have $$P(M \wedge T) = 2000 / 5000 = 0,4$$ $$P(M) = 2500 / 5000 = 0,5$$ $$P(T) = 3000 / 5000 = 0,6$$ $$P(M)P(T) = 0,5 \times 0,6 = 0,3$$

Since 0,4 does not equal 0,3, we conclude that the events are not independent.
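The check above can be sketched in a few lines of Python. This is only an illustration, not part of the exercise: the function name `is_independent` is made up here, and exact fractions are used so that the equality test is not disturbed by floating-point rounding.

```python
from fractions import Fraction

# Counts quoted in the calculation above.
n_total = 5000   # all transactions
n_moz = 2500     # transactions containing mozzarella
n_tom = 3000     # transactions containing tomatoes
n_both = 2000    # transactions containing both


def is_independent(n_both, n_a, n_b, n_total):
    """Return True if P(A and B) == P(A) * P(B), using exact fractions."""
    p_both = Fraction(n_both, n_total)
    p_a = Fraction(n_a, n_total)
    p_b = Fraction(n_b, n_total)
    return p_both == p_a * p_b


independent = is_independent(n_both, n_moz, n_tom, n_total)
print(independent)  # False: 2/5 is not equal to 1/2 * 3/5 = 3/10
```

Exact equality is a strict criterion that suits a textbook exercise; with real, noisy data one would normally use a statistical test instead.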

I have to say that I've never seen the term "lift" before. But its definition is perfectly reasonable and we could calculate independence in those terms. We can rearrange the definition of independence to give $$P(A \wedge B) = P(A)P(B)$$ $$\frac{P(A \wedge B)}{P(A)P(B)} = 1$$ $$\text{lift} = 1$$ to see that two events are independent if the associated lift is exactly 1. Your notes correctly show that lift can be rewritten as $$\frac{\frac{P(A \wedge B)}{P(A)}}{P(B)}$$

Here we have a lift of $$0,4 / (0,5 \times 0,6) = 1,33...$$ or $$0,8 / 0,6 = 1,33...$$ as you correctly calculated in your notes. Since the lift is not exactly 1 we conclude that the two events are not independent.
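As a quick sketch, both lift values from the question can be reproduced from the raw counts. The helper name `lift` and the variable names are illustrative choices; the counts for "no tomatoes" are derived by subtraction from the totals used above.

```python
def lift(n_both, n_a, n_b, n_total):
    """lift(A => B) = P(A and B) / (P(A) * P(B)), computed from raw counts."""
    return (n_both / n_total) / ((n_a / n_total) * (n_b / n_total))


n_total, n_moz, n_tom, n_both = 5000, 2500, 3000, 2000

lift_moz_tom = lift(n_both, n_moz, n_tom, n_total)

# Mozzarella without tomatoes: 2500 - 2000 = 500 baskets.
# Baskets without tomatoes:    5000 - 3000 = 2000 baskets.
lift_moz_notom = lift(n_moz - n_both, n_moz, n_total - n_tom, n_total)

print(round(lift_moz_tom, 2), round(lift_moz_notom, 2))  # 1.33 0.5
```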

I would have approached the question of correlation via conditional probabilities, as follows. Two events $A$ and $B$ are positively correlated if $P(A | B) > P(A)$. That is to say, $A$ and $B$ are positively correlated if the probability of $A$ given $B$ is greater than the unconditional probability of $A$. In other words, if we discover that event $B$ has occurred then the chances that event $A$ has occurred increase.

The conditional probability is defined as $$P(A | B) = \frac{P(A \wedge B)}{P(B)}$$ So in this example $P(M | T) = P(M \wedge T) / P(T) = 0,4 / 0,6 = 0,66...$ whereas $P(M) = 0,5$, so we conclude that $M$ and $T$ are positively correlated.

But hang on! Surely correlation is supposed to be symmetric. So let's test whether $P(T | M) > P(T)$. We have $P(T | M) = 0,4 / 0,5 = 0,8$ and $P(T) = 0,6$ so, yes, they are positively correlated in this direction too.
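Both directions of the conditional-probability test can be checked with a short sketch like the following. The helper `conditional` is a name chosen here for illustration, and the counts are the ones used earlier in this answer.

```python
def conditional(n_both, n_given):
    """P(A | B) = P(A and B) / P(B), which reduces to count(A and B) / count(B)."""
    return n_both / n_given


n_total, n_moz, n_tom, n_both = 5000, 2500, 3000, 2000

p_m = n_moz / n_total                      # P(M) = 0.5
p_t = n_tom / n_total                      # P(T) = 0.6
p_m_given_t = conditional(n_both, n_tom)   # 2000 / 3000
p_t_given_m = conditional(n_both, n_moz)   # 2000 / 2500

# Positive correlation in the sense used above: conditioning raises the probability.
print(p_m_given_t > p_m, p_t_given_m > p_t)  # True True
```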

A lucky escape? We can be more smug than that. Let's rewrite our definition of correlation $$P(M | T) > P(M)$$ $$\frac{P(M | T)}{P(M)} > 1$$ $$\frac{\frac{P(M \wedge T)}{P(T)}}{P(M)} > 1$$ $$\frac{P(M \wedge T)}{P(T)P(M)} > 1$$

If we approach correlation the other way, we get the same result:

$$P(T | M) > P(T)$$ $$\frac{P(T | M)}{P(T)} > 1$$ $$\frac{\frac{P(T \wedge M)}{P(M)}}{P(T)} > 1$$ $$\frac{P(T \wedge M)}{P(M)P(T)} > 1$$ which is the same as above because both the and operator and multiplication are commutative.

As I'm sure you've noticed, $$\frac{P(M \wedge T)}{P(T)P(M)} > 1$$ is simply asking whether lift is above 1. That's probably the approach that your instructor was expecting you to take.

A couple of additional points to fully close the loop with your notes. You correctly calculated lift(Moz => NoTom) = 0,5 but that information was not required to answer the question because you had already calculated lift(Moz => Tom). Support({X} -> {Y}) looks very much like a definition of $P(X \wedge Y)$ and Confidence({X} -> {Y}) looks like $P(Y | X)$.

I have two final notes to place this all in a broader context.

First, my definition of correlation fits well with your definition of correlation as "does buying one of them increase or decrease the probability of buying the other". But our definitions are not completely standard and the word "correlation" is usually associated with some specific correlation measure -- most often Pearson's correlation coefficient but also things like Kendall's rank correlation coefficient. Having said that, our definition is totally reasonable and I'm fairly confident that we could prove that, for example, Kendall's rank correlation coefficient is positive if and only if our condition for positive correlation above holds.
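For the Pearson case, at least, the link can be sketched directly. Write $1_A$ and $1_B$ for the 0/1 indicator variables of the two events (this notation is introduced here and is not in your notes), and assume both $P(A)$ and $P(B)$ lie strictly between 0 and 1. Then $$\operatorname{Cov}(1_A, 1_B) = E[1_A 1_B] - E[1_A]E[1_B] = P(A \wedge B) - P(A)P(B)$$ Pearson's coefficient is this covariance divided by the product of two positive standard deviations, so it is positive exactly when $P(A \wedge B) > P(A)P(B)$, that is, exactly when the lift is above 1.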

Second, inspired by the definition of lift in probabilistic terms in your notes, I've conflated historical frequencies given in the table with probabilities. This is standard practice in much of data science and certainly in exercises but it is not completely uncontroversial. Probabilities are about the future and you have data about the past. The extent to which you can infer future probabilities from past data is philosophically unresolved. But that is not something to worry about in this situation.


Given you have two categorical variables and the associated contingency table, one option is to calculate the joint and marginal probabilities:

$$ P = \frac{\text{count}}{\text{total}}$$

| Probabilities | Tomatoes | No Tomatoes | Row total |
|---|---|---|---|
| Mozzarella | 0.4 | 0.1 | 0.5 |
| No Mozzarella | 0.2 | 0.3 | 0.5 |
| Column total | 0.6 | 0.4 | 1.0 |
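A minimal sketch of how such a probability table can be built from a contingency table is shown below. The cell counts are an assumption: they are reconstructed as probability times 5000, the total number of transactions used elsewhere in this thread, and the dictionary layout is just one convenient choice.

```python
# Cell counts reconstructed as probability * 5000 (assumed total).
counts = {
    ("Mozzarella", "Tomatoes"): 2000,
    ("Mozzarella", "No Tomatoes"): 500,
    ("No Mozzarella", "Tomatoes"): 1000,
    ("No Mozzarella", "No Tomatoes"): 1500,
}
total = sum(counts.values())

# Joint probabilities: P = count / total for every cell.
joint = {cell: n / total for cell, n in counts.items()}

# Marginal probabilities: sum the joint probabilities along rows and columns.
row_marginal = {}
col_marginal = {}
for (row, col), p in joint.items():
    row_marginal[row] = row_marginal.get(row, 0.0) + p
    col_marginal[col] = col_marginal.get(col, 0.0) + p

for row in ("Mozzarella", "No Mozzarella"):
    print(row, joint[(row, "Tomatoes")], joint[(row, "No Tomatoes")])
```

The row marginals, the column marginals, and the joint cells should each sum to 1.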

The probabilities can then be used to answer questions about the data, for example: if a person has bought tomatoes, is that person more likely to buy mozzarella?

A person who has bought tomatoes is about 2.7 times as likely to buy mozzarella as a person who has not bought tomatoes: P(Mozzarella | Tomatoes) = 0.4 / 0.6 ≈ 0.67, versus P(Mozzarella | No Tomatoes) = 0.1 / 0.4 = 0.25. There is a strong positive relationship between the two purchases.

There are many other possible statistical analyses that can be conducted. Examples include odds ratio, phi coefficient, tetrachoric correlation coefficient, and hypothesis testing with Chi-squared.
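Three of these measures can be sketched by hand for a 2x2 table, as below. The cell counts are again assumed to be the table probabilities times 5000, and the chi-squared statistic is computed without a continuity correction, using the identity that for a 2x2 table it equals n times phi squared.

```python
import math

# 2x2 cell counts (assumed: table probabilities * 5000).
a = 2000  # mozzarella and tomatoes
b = 500   # mozzarella, no tomatoes
c = 1000  # no mozzarella, tomatoes
d = 1500  # neither
n = a + b + c + d

# Odds ratio: odds of mozzarella among tomato buyers vs. non-buyers.
odds_ratio = (a * d) / (b * c)

# Phi coefficient: Pearson correlation of the two 0/1 variables.
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Pearson chi-squared statistic (1 degree of freedom, no continuity correction).
chi2 = n * phi ** 2

print(odds_ratio, round(phi, 3), round(chi2, 1))
```

An odds ratio above 1 and a positive phi both point to a positive association, and a chi-squared value this far above the 5% critical value of about 3.84 (for 1 degree of freedom) would reject independence.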


Market Basket Analysis is the way to go about it.

Market Basket Analysis is a technique that measures the strength of association between pairs of products purchased together and identifies patterns of co-occurrence. A co-occurrence is when two or more things take place together.

Market Basket Analysis creates If-Then scenario rules, for example, if item A is purchased then item B is likely to be purchased. The rules are probabilistic in nature or, in other words, they are derived from the frequencies of co-occurrence in the observations. Frequency is the proportion of baskets that contain the items of interest. The rules can be used in pricing strategies, product placement, and various types of cross-selling strategies.

Here is the calculation:

Support({Moz} -> {Tom}) = transactions containing both Moz and Tom / total number of transactions

= 2000 / 5000 = 0.4

Confidence({Moz} -> {Tom}) = transactions containing both Moz and Tom / transactions containing Moz

= 2000 / 2500 = 0.8

Lift({Moz} -> {Tom}) = Confidence({Moz} -> {Tom}) / fraction of transactions containing Tom

= 0.8 / (3000 / 5000)

= 0.8 / 0.6 = 1.33

For your table this association rule is strong. Note, however, that confidence is not symmetric (support and lift are), so please repeat the calculation for {Tom} -> {Moz}; a noticeably smaller confidence in that direction would indicate a weaker rule.
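The three rule metrics can be computed for both directions with a small sketch like this one. The function name `rule_metrics` is hypothetical, and the counts are the ones used in the calculation above. With these definitions, support and lift come out the same in both directions and only confidence differs.

```python
def rule_metrics(n_both, n_antecedent, n_consequent, n_total):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift


n_total, n_moz, n_tom, n_both = 5000, 2500, 3000, 2000

moz_to_tom = rule_metrics(n_both, n_moz, n_tom, n_total)
tom_to_moz = rule_metrics(n_both, n_tom, n_moz, n_total)

for name, metrics in (("Moz -> Tom", moz_to_tom), ("Tom -> Moz", tom_to_moz)):
    support, confidence, lift = metrics
    print(name, round(support, 2), round(confidence, 2), round(lift, 2))
```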
