Mutate with dynamic column names dplyr

Hi I have this dataset (It has many more columns)

   media   brand    radio        tv            cinema  
   chr   dbl    dbl       dbl            dbl      
  radio      0      0             0                0     
  tv         0      0             0                0     
  cinema     0      0             0                0    
  tv         0      0             0                0   
  radio      0      0             0                0     
  tv         0      0             0                0   

I want to obtain the following(Assign a 1 to each column based on the value of media column):

        media   brand    radio        tv            cinema  
       chr   dbl    dbl       dbl            dbl      
      radio      0      1             0                0     
      tv         0      0             1                0     
      cinema     0      0             0                1    
      tv         0      0             1                0   
      radio      0      1             0                0     
      tv         0      0             1                0   

Do you have any idea on how to do it?As I have many more columns, I'm trying to find a solution using dynamic variable assignment. Thanks

Topic dplyr preprocessing r machine-learning

Category Data Science


Using mutating and ifelse you get the expected output:

# data:
df = tibble(
            media = c('radio','tv','cinema','tv','radio','tv'),
            brand = rep(0,6),
            radio = rep(0,6),
            tv = rep(0,6),
            cinema = rep(0,6)
            )
> df
# A tibble: 6 x 5
  media  brand radio    tv cinema
  <chr>  <dbl> <dbl> <dbl>  <dbl>
1 radio      0     0     0      0
2 tv         0     0     0      0
3 cinema     0     0     0      0
4 tv         0     0     0      0
5 radio      0     0     0      0
6 tv         0     0     0      0

df %>%
  mutate(radio = 1 * (media == 'radio'),
            tv = 1 * (media == 'tv'),
        cinema = 1 * (media == 'cinema')
        )

# A tibble: 6 x 5
  media  brand radio    tv cinema
  <chr>  <dbl> <dbl> <dbl>  <dbl>
1 radio      0     1     0      0
2 tv         0     0     1      0
3 cinema     0     0     0      1
4 tv         0     0     1      0
5 radio      0     1     0      0
6 tv         0     0     1      0

What you are trying to do is called one hot encoding or dummy encoding. Even as this may be possible to accomplish in dplyr (together with tidyr), I would recommend using the function one_hot() from the library mltools. For an explanation, see the following code snippets:

# set up sample data
df <- data.frame(
  id = 1:4,
  media = factor(c("radio", NA, "tv", "tv"), levels=c("radio", "tv", "cinema"))
)

One Hot Encoding with tidyr:

The following code snippet shows how you would accomplish one hot encoding with dplyr and tidyr (copied from this comment):

library(dplyr)
library(tidyr)

tib <- as_tibble(df)
tib %>% 
  mutate(i = 1) %>% 
  spread(media, i, fill = 0)

However, this code will give you the following table:

# A tibble: 4 x 4
 id    radio   tv   `<NA>`
<int>  <dbl>  <dbl>  <dbl>
  1      1      0      0
  2      0      0      1
  3      0      1      0
  4      0      1      0

Note that there is no cinema column present, even though we defined a respective level in our sample data-frame. Also, the spread()-function created an own column for values with NA's, treating them as a separate category.

One Hot Encoding with mltools (recommended):

library(data.table)
library(mltools)

dt <- data.table(df)
one_hot(dt)

This code will give you the following table:

id  color_radio  color_tv  color_cinema
1          1          0            0
2         NA         NA           NA
3          0          1            0
4          0          1            0

As you can see, the column cinema is now present even though our sample data-frame did not contain any observations of it. Also, there isn't an NA-column present anymore. The NA from the second row is instead visible in all columns (which makes way more sense, since you don't know, what category this observation IS, so you cannot know what category it IS NOT).

In addition, this code is not only easier to understand it also runs about two to three times as fast.

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.