How to generate test set with no data-leakage using multiple columns

I am developing a fraud detection algorithm. Among other things, my dataset contains the phone number, email address and a few other fields that should uniquely identify a user (let's call them unique fields). To prevent data leakage between my training and test sets, I want the test set to contain only completely new users: no user in the test set should share any unique field value with any user in the training set. I am having trouble building this test set. Basically, I am looking for a way to generate a unique ID from the values of several columns, such that two rows get the same unique ID if any of their unique fields match. Do you have any solution in mind? The answer could be in SQL, Pandas or any other Python library; I can adapt.

The only solution I can think of is to start from a basic train_test_split and iteratively remove from the test set any row that has fields matching the training set, but it is cumbersome and less elegant than generating a unique ID.

Topic data-leakage training sql scikit-learn pandas

Category Data Science

I ended up iterating over my dataframe in random order and assigning each row to the train or test set depending on its unique identifiers. Rows whose identifiers match both sets are left aside. It happens to be fast enough for my use case (it takes 1 minute for my 10M-row dataframe with 4 identifiers).

import random
from tqdm import tqdm

def train_test_split_identifiers(df, identifier_cols, target_test_size):
    # Assumes df has a unique index, since rows are selected with df.loc below.
    train_idx = []
    train_values = {identifier_col: set() for identifier_col in identifier_cols}
    test_idx = []
    test_values = {identifier_col: set() for identifier_col in identifier_cols}
    aside_idx = []

    # Visit rows in random order. Only the identifier columns are kept, so
    # row[0] is the index label and row[i + 1] is the i-th identifier.
    shuffled = df.sample(frac=1.0)[identifier_cols]
    for row in tqdm(shuffled.itertuples(), total=len(df)):

        in_train = False
        in_test = False

        # Check whether any identifier of this row was already seen in train or test
        for i, identifier_col in enumerate(identifier_cols):
            if row[i + 1] in train_values[identifier_col]:
                in_train = True
            elif row[i + 1] in test_values[identifier_col]:
                in_test = True

        if not in_train and not in_test:
            # New user: assign at random according to the target test size
            if random.random() < target_test_size:
                test_idx.append(row[0])
                for i, identifier_col in enumerate(identifier_cols):
                    test_values[identifier_col].add(row[i + 1])
            else:
                train_idx.append(row[0])
                for i, identifier_col in enumerate(identifier_cols):
                    train_values[identifier_col].add(row[i + 1])
        elif in_train and not in_test:
            train_idx.append(row[0])
            for i, identifier_col in enumerate(identifier_cols):
                train_values[identifier_col].add(row[i + 1])
        elif in_test and not in_train:
            test_idx.append(row[0])
            for i, identifier_col in enumerate(identifier_cols):
                test_values[identifier_col].add(row[i + 1])
        else:
            # Linked to both sets: keeping the row on either side would leak
            aside_idx.append(row[0])

    assert len(df) == len(test_idx + train_idx + aside_idx)

    train = df.loc[train_idx]
    test = df.loc[test_idx]

    print(f'Train size = {round(100 * len(train_idx) / len(df), 2)} %')
    print(f'Test size = {round(100 * len(test_idx) / len(df), 2)} %')
    print(f'Left aside = {round(100 * len(aside_idx) / len(df), 2)} %')

    for identifier_col in identifier_cols:
        assert len(set(train[identifier_col]).intersection(test[identifier_col])) == 0, 'Data leakage detected'

    return train, test

EDIT: I came up with a much better solution using graphs. It is much faster and does not leave any users aside. Basically, you create a true user ID with this method (the 'unique_id' column below) and then run train_test_split on it.

import networkx as nx

def get_unique_ids(df, id_col, compare_cols):

    print('Creating links between pairs of users')
    links = set()

    for col in compare_cols:
        # Drop missing values first: merge matches NaN keys with each other,
        # which would link every user whose value in this column is missing.
        pairs = df[[id_col, col]].dropna(subset=[col]).drop_duplicates()
        df_self_merged = pairs.merge(pairs, on=col)
        different = df_self_merged[id_col + '_x'] != df_self_merged[id_col + '_y']
        links.update(
            df_self_merged.loc[different, [id_col + '_x', id_col + '_y']]
            .itertuples(index=False, name=None)
        )

    print('Building graph from links')
    G = nx.Graph()
    G.add_edges_from(links)

    print('Adding users that have no links')
    G.add_nodes_from(set(df[id_col]) - set(G.nodes))

    print('Assigning a new unique ID to connected users')
    user_to_cluster = {}
    for i, cluster in enumerate(nx.connected_components(G)):
        for user in cluster:
            user_to_cluster[user] = i

    # map keeps the index of df, so the result aligns when assigned back to df
    return df[id_col].map(user_to_cluster)

df['unique_id'] = get_unique_ids(df, 'user_id', ['email', 'phone_number', 'card_fingerprint'])
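
Once every row carries a unique_id, the split can be done on the IDs rather than on the rows, so that all rows of a linked group land on the same side. The following is only a minimal sketch of that step, not the exact code used above: the helper name split_by_group, the toy dataframe and the seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def split_by_group(df, group_col, test_size, seed=0):
    """Send every group (all rows sharing group_col) entirely to train or test."""
    groups = df[group_col].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(groups)  # shuffled copy of the group IDs
    # test_size is a fraction of groups, not of rows
    n_test = max(1, round(test_size * len(shuffled)))
    is_test = df[group_col].isin(shuffled[:n_test])
    return df[~is_test], df[is_test]


# Toy data (hypothetical): 10 rows belonging to 7 linked groups
df = pd.DataFrame({
    "user_id": list(range(10)),
    "unique_id": [0, 0, 1, 2, 2, 3, 4, 4, 5, 6],
})
train, test = split_by_group(df, "unique_id", test_size=0.3)

# No group appears on both sides
assert set(train["unique_id"]).isdisjoint(test["unique_id"])
```

Because groups differ in size, the share of rows in the test set will only approximate test_size. scikit-learn's GroupShuffleSplit provides a similar group-aware split if you prefer a library routine.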

A simple solution would be to create a pseudo-ID column that is the concatenation of all unique identifier columns (e.g. mail@com9825403). You then sample the unique values of that column into test and train.
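
As a rough illustration of the concatenation idea, here is a small sketch. The column names email and phone_number, the "|" separator and the toy values are assumptions, not taken from the real dataset.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com"],
    "phone_number": ["111", "111", "222", "111"],
})

# Concatenate the identifier columns; the separator avoids accidental
# collisions such as "ab" + "c" versus "a" + "bc".
df["pseudo_id"] = df["email"].astype(str) + "|" + df["phone_number"].astype(str)

# Sample whole pseudo-IDs into the test set; the rest goes to train
unique_ids = df["pseudo_id"].drop_duplicates()
test_ids = unique_ids.sample(n=1, random_state=0)
is_test = df["pseudo_id"].isin(test_ids)
train, test = df[~is_test], df[is_test]

# Rows 0 and 3 share a phone number but do not share a pseudo-ID
assert df.loc[0, "pseudo_id"] != df.loc[3, "pseudo_id"]
```

Note what the last assertion shows: a concatenated ID groups rows only when all identifier columns match, so rows that share a single field (here, the phone number) can still end up on opposite sides of the split.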

