Market Basket Analysis - Data Modelling

Imagine that I have the following dataset:

Customer_ID    Product_Desc
1              Jeans
1              T-Shirt
1              Food
2              Jeans
2              Food
2              Nightdress
2              T-Shirt
2              Hat
3              Jeans
3              Food
4              Food
4              Water
5              Water
5              Food
5              Beer

I need to model consumer behaviour and predict which products are associated with each other. To do that, I think a good strategy would be to build the relationships first and then count the occurrences (I don't know if anyone has a better idea).

The first step is to derive these relationships (one basket per customer):

Jeans-T-Shirt-Food
Jeans-Food-Nightdress-T-Shirt-Hat
Jeans-Food
Food-Water
Water-Food-Beer

How can I do this? With Apache Pig or with Spark?

Many thanks!!!



Let's start with your problem definition: "build the relationships first and then count the occurrences".

That is, roughly, the basic strategy that market basket analysis algorithms use. However, algorithms like Apriori or FPGrowth are specifically designed to analyze such datasets (at scale) and infer the inherent association rules between items across all baskets. My recommendation would be to use one of these to mine the relationships between purchased items instead of reinventing them, especially because you would otherwise face many of the hard problems these algorithms already solve (namely the combinatorial explosion of the search space when generating combinations of basket items).

You can use any of several libraries or languages for this (R, Python, etc.). Doing it in Spark is pretty simple using MLlib; your workflow would be something like:

1) choose an algorithm, e.g. FPGrowth;
2) prepare your data to fit the format required by FPGrowth (each transaction should be an array of the items in one basket);
3) run FPGrowth and output its frequent itemsets.

There's a good example of this at Spark's website:

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// sc is the SparkContext (available by default in spark-shell)
val data = sc.textFile("data/mllib/sample_fpgrowth.txt")

// prepare the data for FPGrowth: one transaction per line,
// items separated by spaces
val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))

// create and run the model
val fpg = new FPGrowth()
  .setMinSupport(0.2)      // keep itemsets appearing in >= 20% of transactions
  .setNumPartitions(10)
val model = fpg.run(transactions)

// output the frequent itemsets (items frequently bought together)
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
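
If you want explicit association rules (e.g. "customers who bought Jeans also bought Food") rather than just the frequent itemsets, the same MLlib model can generate them via generateAssociationRules; a minimal sketch, where the 0.6 minimum confidence is only an illustrative value:

// rules of the form antecedent => consequent, kept if confidence >= 0.6
val minConfidence = 0.6
model.generateAssociationRules(minConfidence).collect().foreach { rule =>
  println(
    rule.antecedent.mkString("[", ",", "]")
      + " => " + rule.consequent.mkString("[", ",", "]")
      + ", " + rule.confidence)
}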

You may also use groupByKey or combineByKey in Spark to build the per-customer baskets yourself before handing them to FPGrowth; see the sketch below.
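
For instance, a minimal sketch that turns rows like the ones in the question into one transaction per customer, assuming the hypothetical file customer_products.csv contains one comma-separated Customer_ID,Product_Desc pair per line:

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// hypothetical input: one "Customer_ID,Product_Desc" pair per line
val rows = sc.textFile("customer_products.csv")

// group products by customer to form one basket per customer,
// deduplicating items because FPGrowth requires unique items per transaction
val transactions: RDD[Array[String]] = rows
  .map { line =>
    val Array(customer, product) = line.split(',').map(_.trim)
    (customer, product)
  }
  .groupByKey()
  .map { case (_, products) => products.toArray.distinct }

// the resulting baskets can be fed to FPGrowth as in the example above
val model = new FPGrowth().setMinSupport(0.2).run(transactions)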
