How to find median/average values between data frames with slightly different columns?

Question

How to find median/average values between data frames with slightly different columns?

PhilipTsv

2021年12月14日 18:48

I am trying to combat run-to-run variance of the data I collect by combining the data from different runs and finding the mean/average. The problem is that in each run there is a chance that some of the features may not appear:

   x    y    z
0  0    2    2 
1  0    1    3 
2  5    3    0
3  1    1    0
4  0    2    0

   x    y    d
0  1    0    2 
1  1    1    3 
2  0    4    2  
3  0    2    0
4  0    2    1  

   z    y    
0  0    2
1  0    1 
2  0    2 
3  1    0
4  3    0

As you can see from this example, the rows are always consistent, but some runs might provide less columns than the rest. Therefore in a theoretical dataframe where all the columns are averaged, in some columns the values would have to be divided by a lower number than others (in this case the values in the y column will have to be divided by 3, but in the x column - by 2).

Bonus question: Is there a way make this row-specific: do the same thing, but not take into account the 0s, since in my case that indicates no data, so it might interfere with the results (y for row 0 has one zero, so the average will be $(2+2)\over 2$, whereas in row 1 it would be $(1+1+1)\over3$.

Topic groupby data-analysis dataframe pandas data-cleaning

Category Data Science

Oxbowerce · Accepted Answer · 2021年12月14日 18:48

Assuming that you have the data stored in separate dataframes, you can use a combination of pandas.concat and pandas.DataFrame.groupby to achieve what you are looking for:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    "x": [0,0,5,1,0],
    "y": [2,1,3,1,2],
    "z": [2,3,0,0,0]
})

df2 = pd.DataFrame({
    "x": [1,1,0,0,0],
    "y": [0,1,4,2,2],
    "z": [2,3,2,0,1]
})

df3 = pd.DataFrame({
    "y": [2,1,2,0,0],
    "z": [0,0,0,1,3]
})

df = (
    # combine dataframes into a single dataframe
    pd.concat([df1, df2, df3])
    # replace 0 values with nan to exclude them from mean calculation
    .replace(0, np.nan)
    .reset_index()
    # group by the row within the original dataframe
    .groupby("index")
    # calculate the mean
    .mean()
)

index	x	y	z
0	1	2	2
1	1	1	3
2	5	3	2
3	1	1.5	1
4	nan	2	2

How to find median/average values between data frames with slightly different columns?

About