Optimization of pandas row iteration and summation

I'm wondering if anyone can provide some input on improving the speed of a pandas calculation.

What I am trying to obtain is a summation of rows in one table (the player table) based on each row of a second table (the UUID table). Functionally, each row of the UUID table needs to sum the player-table rows whose IDs appear in its Active list, and the resulting row should get the UUID as its index.

My initial thought was to loop row by row and calculate the results, but that has proven quite slow, and I suspect it is not the optimal way to accomplish this. With the version below, my estimated total time for the full dataset would be around 66 minutes; running on a subsample of 10,000 rows takes around 20 seconds.

Would anyone have a better solution to calculating these results?

Thanks in advance!


UUID Table

(Table image not reproduced; this is a subset of the whole table.)

shape = (2060590, 2)


Player ID Table

(Table image not reproduced; this is a subset of the whole table.)

shape = (39, 8)


Final Table

(Image of the expected output; not reproduced.)

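Since the table images aren't reproduced here, hypothetical toy stand-ins for the two inputs might look like this (column names and values invented purely for illustration; the real player table has 8 stat columns and 39 rows):

import pandas as pd

# Hypothetical toy stand-ins for the real tables (values invented for illustration)
uuid_df = pd.DataFrame({
    "UUID": ["a1b2", "c3d4", "e5f6"],      # unique row identifier
    "Active": [[10, 20], [20, 30], [10]],  # list of player IDs to sum per row
})

points_table_df = pd.DataFrame(
    {"pts": [5.0, 7.0, 2.0], "reb": [1.0, 3.0, 4.0]},  # per-player stat columns
    index=[10, 20, 30],                                 # player IDs as the index
)
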
Code

# executes in ~20 seconds on a 10,000-row sample
import pandas as pd

df = None
for i in uuid_df[["UUID", "Active"]].sample(10000).itertuples(index=False):
    # Get UUID for row
    _uuid = i[0]
    # Get list of "Active items" (these are the ones that will be summed)
    _active = i[1]

    # Create new frame by selecting everything from points table where the ID is in the Active List.
    # Sum selected values, convert to a dataframe with the UUID as index, and transpose
    _dff = points_table_df.loc[points_table_df.index.isin(_active)].sum().to_frame(_uuid).T

    # Check if first dataframe, if not concat to existing one
    if df is None:
        df = _dff
    else:
        df = pd.concat([df, _dff])
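
One incremental improvement worth noting even within the loop-based approach: calling pd.concat on every iteration copies the accumulated frame each time, which is quadratic. A sketch of the usual pandas idiom, collecting the per-row frames in a list and concatenating once (same frames as above):

frames = []
for i in uuid_df[["UUID", "Active"]].sample(10000).itertuples(index=False):
    _uuid, _active = i[0], i[1]
    # one single-row frame per UUID, concatenated in a single call at the end
    frames.append(points_table_df.loc[points_table_df.index.isin(_active)].sum().to_frame(_uuid).T)
df = pd.concat(frames)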



This could actually be done quickly and intuitively using linear algebra.

Consider your Active column as a label-binarized array (this can be done with MultiLabelBinarizer): you would get an array of shape (2060590, 39) containing 0s and 1s. Rearrange its columns to match the row order of your player table (or the other way around, whichever is easier), so that the first column of the new matrix corresponds to the first player in the player table. Finally, apply a matrix multiplication, and you're done.

This is an example using a generated sample, but hopefully you get the idea.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

sample_active = pd.Series([[100,50,150,200],
                           [100,50,150],
                           [100,50],
                           [100]])
sample_df = pd.DataFrame()

sample_df['id'] = ['fadfsadsa', 'dsafsadf', 'dfsafsda', 'dasfasdfsaf']
sample_df['active'] = sample_active
## sample_df should look close to your original df

classes = [50,100,150,200]

player_df = pd.DataFrame({cl : np.random.uniform(0,1,size=5) for cl in classes}).T
player_df.columns = ['A','B','C','D','E'] 

# apply the multilabel binarizer: one 0/1 column per player ID
sample_transformed = mlb.fit_transform(sample_active.values)

# matrix multiply to get the required answer; selecting with
# .loc[mlb.classes_] reorders the player rows to match the binarized columns
output = sample_transformed.dot(player_df.loc[mlb.classes_])


new_df = pd.concat([sample_df['id'], pd.DataFrame(output)], axis = 1)
new_df.columns = ['id'] + list(player_df.columns)

For your case, I think this should work:

mlb = MultiLabelBinarizer()
active_transformed = mlb.fit_transform(uuid_df['Active'])
output = active_transformed.dot(points_table_df.loc[mlb.classes_])

# wrap the raw ndarray in a DataFrame aligned to uuid_df's index,
# since pd.concat cannot take a bare ndarray
df = pd.concat([uuid_df[['UUID']], pd.DataFrame(output, index=uuid_df.index)], axis=1)
df.columns = ['UUID'] + list(points_table_df.columns)

Try it!
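
If memory is a concern at this scale (the dense binarized matrix is 2060590 × 39 of 0/1 values), MultiLabelBinarizer also supports sparse_output=True. A sketch of the same computation with a sparse intermediate, assuming the same uuid_df and points_table_df as above:

mlb = MultiLabelBinarizer(sparse_output=True)
active_transformed = mlb.fit_transform(uuid_df['Active'])  # scipy.sparse matrix
# sparse .dot(dense ndarray) returns a plain ndarray
output = active_transformed.dot(points_table_df.loc[mlb.classes_].to_numpy())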


If I understand your DataFrame schemas correctly, then this should do it. All these operations are vectorized, so they should be much faster than iterating over the dataframe.

# "explode" the values in the `Active` column to get a dataframe of (UUID, player_id) pairs
uuid_df = uuid_df.explode('Active').rename(columns={
    'Active': 'player_id'
})

# inner join with the player stats dataframe
# (this assumes `player_id` is the index of `points_table_df`; note that
# explode leaves `player_id` as object dtype, so it may need an astype
# to match the index dtype before merging)
joined_df = uuid_df.merge(points_table_df, left_on='player_id', right_index=True)

# group by the UUID and sum to get aggregate stats; groupby already
# promotes UUID to the index, so no separate set_index step is needed
# (drop the helper player_id column first so it isn't aggregated)
stats_df = joined_df.drop(columns='player_id').groupby('UUID').sum()
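
As a quick sanity check, here is the pipeline run end to end on tiny invented frames (names and values hypothetical):

import pandas as pd

# Tiny invented inputs, just to sanity-check the explode/merge/groupby pipeline
uuid_df = pd.DataFrame({
    "UUID": ["a1", "b2"],
    "Active": [[10, 20], [20]],
})
points_table_df = pd.DataFrame({"pts": [5.0, 7.0]}, index=[10, 20])

exploded = uuid_df.explode('Active').rename(columns={'Active': 'player_id'})
exploded['player_id'] = exploded['player_id'].astype(int)  # explode yields object dtype
joined = exploded.merge(points_table_df, left_on='player_id', right_index=True)
stats_df = joined.drop(columns='player_id').groupby('UUID').sum()
print(stats_df)
# expected: a1 -> 12.0, b2 -> 7.0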
