Test data relevance to a model (covariate shift)

I am trying to design an algorithm that calculates the relevance of test data to a trained model. This can be done by checking whether the predictor variables have different distributions in the train and test data (covariate shift).

Main idea: if there is a covariate shift, then after mixing the train and test sets we will be able to classify the origin of each data point (train or test) with good accuracy.

I define the 'relevance weight' RW_train of a test sample x_test as the probability that x_test belongs to the train set:

RW_train(x_test) = P(x_test ∈ train)

To test this idea I generate train and test data (as in this example) with a clear covariate shift. I further split the train set into train and validation sets that share the same distribution; only the test set has a covariate shift.

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score as AUC

n_samples_X = 5000
n_samples_TEST = 500

x = 11*np.random.random(n_samples_X) - 6.0
y = x**2 + 10*np.random.random(n_samples_X) - 5
all_train_df = pd.DataFrame({'x':x, 'y':y})
train_df, valid_df = train_test_split(all_train_df, test_size=0.1, random_state=0)

x = 2*np.random.random(n_samples_TEST) - 6.0
y = x**2 + 10*np.random.random(n_samples_TEST) - 5
test_df = pd.DataFrame({'x':x, 'y':y})

This is what the train, test, and validation data look like:
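
The plot can be generated with something like the following (a minimal sketch using only the data frames defined above):

# Scatter plot of the three sets; the shifted test set covers
# only the left part of the x range
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(train_df['x'], train_df['y'], s=5, alpha=0.3, label='train')
ax.scatter(valid_df['x'], valid_df['y'], s=5, alpha=0.5, label='validation')
ax.scatter(test_df['x'], test_df['y'], s=5, alpha=0.5, label='test (covariate shift)')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend()
plt.show()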

This function trains the origin classifier and calculates individual test weights:

def test_weights(test, train, target):
    """Train a train-vs-test origin classifier and return the AUC,
    per-sample test relevance weights and out-of-fold predictions."""
    
    # Work on copies so the callers' data frames are not modified
    test = test.copy()
    train = train.copy()
    
    # Adding a column to identify whether a row comes from train or not
    test['is_train'] = 0
    train['is_train'] = 1
    
    # Combining test and train data
    df_combine = pd.concat([train, test], axis=0, ignore_index=True)
    
    # Dropping the 'target' column as it is not present in the test set
    df_combine = df_combine.drop(target, axis=1)
    
    # Labels
    y = df_combine['is_train'].values 
    
    # Covariates or our independent variables
    x = df_combine.drop('is_train', axis=1).values 
    tst, trn = test.values, train.values  # trn's length is used below to locate the test rows
    
    # Predict the labels for each row in the combined dataset. 
    m = RandomForestClassifier(n_jobs=-1, max_depth=5, min_samples_leaf = 5)
    predictions = np.zeros((y.shape[0], 2))  # empty array for the out-of-fold predictions
    
    # Stratified fold to ensure that percentage for each class is preserved 
    # and we cover the whole data once. 
    # For each row the classifier will calculate the probability of it belonging to train.
    skf = StratifiedKFold(n_splits=20, shuffle=True, random_state=100)
    for train_idx, test_idx in skf.split(x, y):
        X_train, X_test = x[train_idx], x[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
 
        m.fit(X_train, y_train)
        predictions[test_idx] = m.predict_proba(X_test) #calculating the probability
        
    # High AUC (greater than 0.8) implies strong covariate shift between train and test.
    auc = AUC(y, predictions[:, 1])
    
    predictions_test = predictions[len(trn):]
    # Test relevance weights equal to the probability 
    # of test sample to be in train set      
    test_relevance_weights = predictions_test[:,1]

    return auc, test_relevance_weights, predictions

For test data with covariate shift:

auc, test_relevance_weights, probs = test_weights(test_df, train_df, 'y')

I get AUC = 0.908712, which indicates that the train and test sets are well separated, as expected for a test set with covariate shift.

In this case the distribution of test relevance weights is concentrated around 0.65, contrary to my expectation that most weights would be below 0.5:

sns.distplot(test_relevance_weights, kde=False)
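
For reference, the two origin classes are imbalanced in the combined set (4500 train rows vs. 500 test rows), so the classifier's base rate for the train class is about 0.9 rather than 0.5. A minimal sketch to put the raw weights next to this prior:

# Compare the relevance weights against the train-class base rate
# implied by the train/test size imbalance
prior_train = len(train_df) / (len(train_df) + len(test_df))
print(f'train-class prior: {prior_train:.3f}')            # ~0.9 here
print(f'mean relevance weight: {test_relevance_weights.mean():.3f}')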

For the validation set, which has no covariate shift:

auc, test_relevance_weights, predictions = test_weights(valid_df, train_df, 'y')

AUC = 0.4825437, as expected for two samples from the same distribution with no covariate shift.

And the distribution of relevance weights now has most weights around 0.9, also as expected (plotted with the same call as above):
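
sns.distplot(test_relevance_weights, kde=False)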

Questions:

  1. Why does the test set with covariate shift still contain samples with high relevance (probability) to the train set?
  2. How can this algorithm be improved to calculate individual test sample relevance to the train set?
  3. Are there other algorithms for this task?

