Group_by 2 variables and pivot_wider distribution based on 2 others

Question

Group_by 2 variables and pivot_wider distribution based on 2 others

DataGuy23

2022年3月17日 10:00

Performing some calculations on a dataframe and stuck trying to calculate a few percentages. Trying to append 3 additional columns added for %POS/NEG/NEU. E.g., the sum of amount col for all observations w/ POS Direction in both Drew A/total sum of all amounts for Drew **

Name     Rating   Amount    Price    Rate   Type    Direction
Drew     A        455       99.54    4.5    white   POS
Drew     A        655       88.44    5.3    white   NEG
Drew     B        454       54.43    3.4    blue    NEU
Drew     B        654       33.54    5.4    blue    POS
Drew     C        754       54.43    4.3    green   POS
Jon      A        454       65.23    3.4    blue    NEG
Jon      B        954       86.34    4.3    blue    NEG
Jon      B        545       34.54    4.4    green   NEG
Jon      C        454       65.45    3.4    green   POS
Jon      C        544       65.55    4.4    blue    NEU
Nick     A        675       54.33    3.4    white   POS
Nick     A        565       65.33    3.4    white   POS
Nick     B        343       54.44    6.4    blue    POS
Nick     C        656       65.33    4.3    green   NEG
Nick     C        655       94.44    3.3    green   NEU

To Get the Following Output Calculation for POS/NEG/NEU columns

Name   Rating   sum  count percent wm_price wm_rate mode_type POS NEG NEU
Drew    A                                                    .3735
Drew    B                                                    .3728
Drew    C
Jon     A
Jon     B
Jon     C
Nick    A
Nick    B
Nick    C

Here's what i got so far but im stuck implementing the pivot_wider for calculating/appending the % direction (POS/NEG/NEU) for each rating category of Each Name Any feedback appreciated!

df - df %% group_by(Name, Rating) %% summarize(sum_rating = sum(Amount), count = n(), wm_Price = weighted.mean(Price,Amount), wm_Rate = weighted.mean(Rate,Amount), mode_Type = mode(Type)) %% mutate(pct_rating = sum_rating/sum(sum_rating)) %% pivot_wider(names_from = Direction, values_from = Amount/sum Amount)

Topic groupby dplyr data-wrangling data-cleaning r

Category Data Science

Ben Norris · Accepted Answer · 2020年6月1日 18:18

I cannot get pivot_wider to work here either, but I can replicate your described operation without it.

df %>% group_by(Name) %>%                                      # Group only by Name first
  mutate(Total = sum(Amount)) %>%                              # Total Amount by Name
  group_by(Name, Rating) %>%                                   # Now perform your calculations 
  summarize(sum_rating = sum(Amount),
        count = n(),
        wm_Price = weighted.mean(Price, Amount),
        wm_Rate = weighted.mean(Rate, Amount),
        mode_Type = mode(Type),
        POS = sum(Amount * (Direction == "POS")) / max(Total), # This is % POS
        NEU = sum(Amount * (Direction == "NEU")) / max(Total), # This is % NEU
        NEG = sum(Amount * (Direction == "NEG")) / max(Total)  # This is % NEG
        ) %>%
   mutate(pct_rating = sum_rating / sum(sum_rating))

Here is the output

# A tibble: 9 x 11
# Groups:   Name [3]
  Name  Rating sum_rating count wm_Price wm_Rate mode_Type   POS   NEU   NEG
  <fct> <fct>       <int> <int>    <dbl>   <dbl> <fct>     <dbl> <dbl> <dbl>
1 Drew  A            1110     2     93.0    4.97 white     0.153 0     0.220
2 Drew  B            1108     2     42.1    4.58 blue      0.220 0.153 0    
3 Drew  C             754     1     54.4    4.3  green     0.254 0     0    
4 Jon   A             454     1     65.2    3.4  blue      0     0     0.154
5 Jon   B            1499     2     67.5    4.34 blue      0     0     0.508
6 Jon   C             998     2     65.5    3.95 green     0.154 0.184 0    
7 Nick  A            1240     2     59.3    3.4  white     0.428 0     0    
8 Nick  B             343     1     54.4    6.4  blue      0.119 0     0    
9 Nick  C            1311     2     79.9    3.80 green     0     0.226 0.227
# ... with 1 more variable: pct_rating <dbl>

This method works because r can do math on logical values, treating them as TRUE = 1 and FALSE = 0. Thus, if you want to count the number of occurrences of POS in the Direction column, you could sum a logical vector:

sum(Direction == 'POS')

If you want to know the percent of the rows that contain POS, then use the mean:

mean(Direction == 'POS')

You can even weight that mean by another vector:

weighted.mean(Direction == 'POS', Amount)

The results of my calculation do not match your sample output, so I want to check to make sure I understand what you want the column POS to contain for each row. I understand your desired calculation to be as follows: the Amount of POS for Drew and A (455) divided by the total Amount for Drew (455 + 655 + 454 + 654 + 754 = 2972). Thus, 455 / 2972 = 0.153. I cannot replicate the 0.3735 value you have in your sample output.

Group_by 2 variables and pivot_wider distribution based on 2 others

About