Filter for top 10 highest values of group in dataset (in R)

Question

n.baes

April 25, 2022 04:07

Context: I am trying to find the top 10 highest values of count in my data frame conditional on them falling within the years 1970-1979. My data frame looks as below:


id    lemma    year    count    
1     word1    1970    737 
2     word2    1971    767
3     word3    1972    988

etc...

Attempt:

# 1970s
df_n_maxcount_1970s <- df_n %>% filter(year < 1980) %>% slice_max(count, n = 40)

# 1980s
df_n_maxcount_1980s <- df_n %>% filter(year %in% 1980:1989) %>% slice_max(count, n = 40)

This has worked reasonably well, but it involves manual work, and in 1990 I had to increase n to 200 because there were many duplicates (the same word appeared many times, so searching with n=10 did not return 10 unique words).

Question: Can I automate the code so that I end up with one data frame arranged as below? (Of course, word1 in 1970 might not be the same as word1 in 1980, and there would be 10 rows for each decade, holding the top 10 words arranged by count.) Or, failing that, 5 separate data frames with the top 10 words by count per decade?


decade lemma count    
1970   word1 100
1970   word2 99
1970   word3 98
1980   word1 100
1990   word1 100
2000   word1 100
2010   word1 100

Topic data-wrangling r

Category Data Science

Oxbowerce · Accepted Answer · August 24, 2021 09:28

This can be done with dplyr by combining group_by with aggregation functions: derive a decade column, sum the counts of each word within a decade, then keep the top 10 words per decade. Something like this should work:

library(dplyr)

df_n <- data.frame(
  id = c(1, 2, 3),
  lemma = c("word1", "word2", "word3"),
  year = c(1970, 1971, 1972), 
  count = c(737, 767, 988)
)

df_n %>%
    # create column that specifies the decade
    mutate(decade = year - year %% 10) %>%
    group_by(decade, lemma) %>%
    # add up counts for duplicate words within groups specified above
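    summarise(count = sum(count), .groups = "drop") %>%
    group_by(decade) %>%
    # select top 10 records based on count within groups specified above;
    # slice_max supersedes top_n and returns rows sorted by count (descending)
    slice_max(count, n = 10)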
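
For the second part of the question (separate data frames per decade), the same pipeline can be followed by a split on the decade column. The sketch below is only an illustration: the toy data, the names toy, top_per_decade and by_decade, and the choice of n = 2 (to keep the example small) are assumptions rather than anything from the original post. Setting with_ties = FALSE is also an assumption about the desired behaviour; it caps each decade at exactly n rows when counts are tied.

```r
library(dplyr)

# Hypothetical toy data: two decades, with "word1" repeated within the 1970s
toy <- data.frame(
  lemma = c("word1", "word1", "word2", "word3", "word1", "word2"),
  year  = c(1970, 1975, 1971, 1972, 1980, 1985),
  count = c(100, 50, 120, 90, 30, 60)
)

top_per_decade <- toy %>%
  # floor each year to the start of its decade
  mutate(decade = year - year %% 10) %>%
  group_by(decade, lemma) %>%
  # collapse duplicate words within a decade into a single row
  summarise(count = sum(count), .groups = "drop") %>%
  group_by(decade) %>%
  # use n = 10 for the real data; with_ties = FALSE keeps at most n rows per decade
  slice_max(count, n = 2, with_ties = FALSE) %>%
  ungroup()

# One data frame per decade, as a named list ("1970", "1980", ...)
by_decade <- split(top_per_decade, top_per_decade$decade)
```

Individual decades can then be pulled out with by_decade[["1970"]], and so on. Keeping everything in one data frame (top_per_decade) is usually easier to work with, so the split is only worth doing if separate objects are really needed.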
