Scikit-learn with a custom scoring function using a 'feature'

Question

Scikit-learn with a custom scoring function using a 'feature'

ML_fan

2022年2月10日 05:24

0

I am trying to use a new metric called 'SERA' (Squared Error Relevance Area) as a custom scoring function for imbalanced regression as mentioned in this paper. https://link.springer.com/article/10.1007/s10994-020-05900-9

Here is what the paper tells in brief. To calculate SERA a feature known as 'relevance' defined by the user is required for each feature-label pair. Relevance varies from 0 to 1. 0 for not relevant and 1 for highly relevant.

This is the procedure for the calculation of SERA. Relevance is varied from 0 to 1 in small steps. For each value of relevance (phi) (e.g. 0.45) a subset of the training dataset is selected where the relevance is greater or equal to that value (e.g. 0.45). And for that selected training subset sum of squared errors is calculated i.e. sum(y_true - y_pred)**2 which is known as squared error relevance (SER). Then a plot us created for SER vs phi and area under the curve is calculated i.e. SERA.

Here is the code I have written in python for sklearn using make_scorer.

import pandas as pd
from scipy.integrate import simps
from sklearn.metrics import make_scorer

def calc_sera(y_true, y_pred, x_relevance=None):

    # creating a list from 0 to 1 with 0.001 interval
    start_range = 0
    end_range = 1
    interval_size = 0.001

    list_1 = [round(val * interval_size, 3) for val in range(1, 1000)]
    list_1.append(start_range)
    list_1.append(end_range)
    epsilon = sorted(list_1, key=lambda x: float(x))

    # Initiating lists to store relevance(phi) and squared-error relevance (ser)
    relevance = []
    ser = []

    # Converting the dataframe to a numpy array
    rel_arr = x_relevance.to_numpy()
    # selecting a phi value
    for phi in epsilon:
        relevance.append(phi)
        error_squared_sum = 0
        for i in rel_arr:
            # Getting the subset of the training data
            if i = phi:
                # Error calculation
                error_squared_sum += (y_true - y_pred)**2
        ser.append(error_squared_sum)

    # squared-error relevance area (sera)
    # numerical integration using simps(y, x)

    sera = simps(ser, relevance)

    return sera

score = make_scorer(calc_sera, x_relevance=catalyst_explorer.static_dataset['Relevance'], greater_is_better=False)

I ran this code but I get errors like this.

VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray return array(a, dtype, copy=False, order=order) job exception: scoring must return a number, got [.....] (class 'numpy.ndarray') instead. (scorer=score)

Can anyone please check my code to see whether it is right and try to help me fix the problem. I am not sure that I am entering the relevance values correctly to the function. Thank you.

Topic scoring regression scikit-learn python machine-learning

Category Data Science

Scikit-learn with a custom scoring function using a 'feature'

About