R summarise with condition

I have customer data with the products they purchased and the purchase date.

I want to extract a result that shows each customer and the first two fruits they purchased.

My actual set has 90000 rows with 9000 unique customers. I have tried groupby and summarise functions but I would like to be able to use summarise with condition like we use select with a where clause. Thanks for your suggestions

Topic dplyr r

Category Data Science


Here a solution using data.table

First order the data.table by customer and date

Then group by customer and select the frist two fruits

> df[order(customer,date)][,.(fruit1=fruit[1],fruit2=fruit[2]),by=customer] 
   customer fruit1 fruit2
1:        A orange banana
2:        B  apple  apple
3:        C banana banana

Sample data

> df <- data.table(
+ customer = c('A','A','C','C','B','B','C','B','A'),
+ fruit = c('orange','apple','banana','orange','apple','banana','banana','apple','banana'),
+ date = c(as.Date('2018-05-04'),as.Date('2018-07-09'),as.Date('2018-01-02'),as.Date('2018-01-03'),as.Date('2018-01-02'),
+ as.Date('2018-04-05'),as.Date('2018-01-02'),as.Date('2018-01-06'),as.Date('2018-06-01'))
+ )
> df
   customer  fruit       date
1:        A orange 2018-05-04
2:        A  apple 2018-07-09
3:        C banana 2018-01-02
4:        C orange 2018-01-03
5:        B  apple 2018-01-02
6:        B banana 2018-04-05
7:        C banana 2018-01-02
8:        B  apple 2018-01-06
9:        A banana 2018-06-01

If you want a dplyr solution, you can try this:

yourdata %>% 
      mutate(date = paste(date, "-2018", sep = ""), # add year to date
             date = as.Date(date, format = "%d-%b-%Y")) %>% # save date in date format
      arrange(date) %>% # sort by date
      group_by(customer) %>%
      slice(1:2) %>% # keep only first two rows (fruits) per customer
      mutate(date = c("fruit1", "fruit2")) # change date variable to fruit1/fruit2
      spread(key = date, value = fruit) %>% # spread data

A shorter code version (condensing the mutate-part):

yourdata %>% 
      mutate(date = as.Date(paste(date, "-2018", sep = ""),
                            format = "%d-%b-%Y")) %>%
      arrange(date) %>% # sort by date
      group_by(customer) %>%
      slice(1:2) %>% # keep only first two rows (fruits) per customer
      mutate(date = c("fruit1", "fruit2")) %>% # change date variable to fruit1/fruit2
      spread(key = date, value = fruit) # spread data

Here is an example on the iris dataset

t(sapply(by(iris$Sepal.Length,iris$Species,function(x){x[1:2]}),as.numeric))

Where species is your customer, and Sepal.Length is your fruit.

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.