Multilabel classification for a learning to rank application

Question

Animesh Pandey

August 22, 2019, 18:02

I am looking for suggestions on a learning-to-rank method for a search engine. I created a dataset with the following columns:

query_dependent_score, independent_score, (query_dependent_score*independent_score), classification_label

query_dependent_score is the TF-IDF score, i.e. the similarity between the query and a document.

independent_score is the viewing time of the document.

There are going to be 3 classes:

0 (not relevant),
1 (kind of relevant),
2 (most relevant)

I have a total of 750 queries, and I collected the top 10 results for each, so I have 7500 data points in total.

I have been thinking of estimating a relevance function like:

w0 + w1*query_dependent_score + w2*independent_score + w3*(query_dependent_score*independent_score)

I can see that this looks like a classification problem, but I would like to know whether this is the right way to approach it.

I referred to the question "Machine learning technique to calculate weighted average weights?" for some ideas.

Following is the code that I have written:

import numpy as np
from sklearn.linear_model import LogisticRegression

DATASET_PATH = "..."

# Read four columns: the three features followed by the relevance grade.
search_data = np.genfromtxt(DATASET_PATH, delimiter=',', skip_header=1, usecols=(1, 2, 3, 4))
document_signals = search_data[:, :3]  # This has 3 features.
document_grades = search_data[:, 3]  # 1-D array of labels (0, 1 or 2)

# Hold out the last 20% of the rows as a test set.
total_rows = search_data.shape[0]
split_point = int(total_rows * 0.8)

training_data_X, test_data_X = document_signals[:split_point], document_signals[split_point:]
training_data_y, test_data_y = document_grades[:split_point], document_grades[split_point:]

clf = LogisticRegression(multi_class="multinomial", solver="lbfgs")

clf.fit(X=training_data_X, y=training_data_y)

print(clf.classes_)  # [0, 1, 2]
print(clf.coef_)  # This is a 3 x 3 matrix?
print(clf.intercept_)  # An array of 3 elements?

Based on sklearn's documentation, coef_ should give me the values of w1, w2 and w3, and intercept_ should give me the value of w0.

But I get a matrix and an array for those weights, and I am not sure how to obtain the weights for the relevance function from them.

I am looking into learning to rank for the first time, so any suggestions are welcome.

Topic learning-to-rank scikit-learn

Category Data Science

Ben Reiniger · Accepted Answer · July 22, 2019, 17:38

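In multinomial mode, the docs specify that coef_ and intercept_ have exactly the shapes you are seeing: one row of coef_ and one entry of intercept_ per target class. Row k of coef_ holds w1, w2 and w3 for class k, and intercept_[k] is that class's w0, so the model learns one linear function per class rather than a single relevance function. The three linear scores are passed through a softmax to produce class probabilities (in ovr mode, each class gets its own binary logistic regression and the resulting probabilities are simply normalized).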
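
To make the shapes concrete, here is a minimal sketch on synthetic data (the features and grades below are invented for illustration and are not the asker's dataset). It assumes a recent scikit-learn, where LogisticRegression with the lbfgs solver fits a multinomial model by default for more than two classes, and it checks that a softmax over the per-class linear scores reproduces predict_proba.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the three features (assumption: not the real data).
rng = np.random.default_rng(0)
base = rng.random((300, 2))
X = np.column_stack([base, base[:, 0] * base[:, 1]])

# Synthetic grades 0, 1, 2 obtained by thresholding a simple score.
y = np.digitize(base[:, 0] + base[:, 1], [0.7, 1.3])

# With lbfgs and three classes, recent scikit-learn fits a multinomial model.
clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)

print(clf.coef_.shape)       # one row of weights (w1, w2, w3) per class
print(clf.intercept_.shape)  # one intercept (w0) per class

# Per-class linear scores, then a numerically stable softmax over the classes.
scores = X @ clf.coef_.T + clf.intercept_
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)

print(np.allclose(probs, clf.predict_proba(X)))
```

Each row of coef_ together with the matching entry of intercept_ is one linear function of the form w0 + w1*x1 + w2*x2 + w3*x3, so there are three such functions here, not one.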

As to the broader question, since your three output classes are ordered, you might benefit from using that information. Either perform plain regression on the grades (which assumes that the numeric values 0, 1, 2 are meaningful) or use ordinal regression.
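
As a rough illustration of the plain-regression option, the sketch below fits scikit-learn's LinearRegression to numeric grades, which yields a single set of weights w0..w3 matching the relevance function in the question. The data is synthetic and the variable names are hypothetical; it treats the grades 0, 1, 2 as numbers, which is exactly the assumption mentioned above, and it is not an ordinal regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 queries with 10 documents each (assumption, for illustration).
rng = np.random.default_rng(0)
n_queries, per_query = 50, 10
q_score = rng.random(n_queries * per_query)  # stands in for query_dependent_score
i_score = rng.random(n_queries * per_query)  # stands in for independent_score
X = np.column_stack([q_score, i_score, q_score * i_score])

# Synthetic grades 0, 1, 2 treated as plain numbers.
y = np.digitize(q_score + i_score, [0.7, 1.3]).astype(float)

reg = LinearRegression().fit(X, y)
w0 = reg.intercept_          # single intercept
w1, w2, w3 = reg.coef_       # single weight per feature

# The learned relevance function, evaluated for every document.
relevance = w0 + w1 * X[:, 0] + w2 * X[:, 1] + w3 * X[:, 2]

# Rank the 10 documents of the first query from most to least relevant.
first_query = relevance[:per_query]
ranking = np.argsort(-first_query)
print(ranking)
```

Sorting each query's documents by the predicted score gives the ranking; only the order within a query matters, not the absolute values.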
