Multiclass classification for a learning-to-rank application

I am looking for suggestions on a learning-to-rank method for search engines. I created a dataset with the following columns:

query_dependent_score, independent_score, (query_dependent_score*independent_score), classification_label

query_dependent_score is the TF-IDF score, i.e. the similarity between the query and a document.

independent_score is the viewing time of the document.

There are going to be 3 classes:

  • 0 (not relevant),
  • 1 (kind of relevant),
  • 2 (most relevant)

I have a total of 750 queries and I collected the top 10 results for each, so I have 7500 data points in total.

I have been thinking of estimating a relevance function like:

w0 + w1*query_dependent_score + w2*independent_score + w3*(query_dependent_score*independent_score)

I can see that this looks like a classification problem, but I would like to know whether this is the right way to approach it.

I referred to "Machine learning technique to calculate weighted average weights?" for some ideas.

Following is the code that I have written:

from sklearn.linear_model import LogisticRegression
import numpy as np

DATASET_PATH = "..."

search_data = np.genfromtxt(DATASET_PATH, delimiter=',', skip_header=1, usecols=(1, 2, 3, 4))
document_grades = search_data[:, 3:4]
document_signals = search_data[:, :3]  # This has 3 features.

total_rows = np.shape(search_data)[0]
split_point = int(total_rows * 0.8)

training_data_X, test_data_X = document_signals[:split_point, :], document_signals[split_point:, :]
training_data_y, test_data_y = document_grades[:split_point, :], document_grades[split_point:, :]

clf = LogisticRegression(multi_class="multinomial", solver="lbfgs")

clf.fit(X=training_data_X, y=training_data_y.ravel())

print(clf.classes_)  # [0, 1, 2]
print(clf.coef_)  # This is a 3 x 3 matrix?
print(clf.intercept_)  # An array of 3 elements?

Based on sklearn's documentation, coef_ should give me the values of w1, w2 and w3, and intercept_ should give me the value of w0.

Instead I get a matrix and an array for those weights, and I am not sure how to get the weights for the relevance function from them.

I am looking into learning to rank for the first time, so any suggestions are welcome.



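In multinomial mode, the docs specify that coef_ and intercept_ come out as you are seeing them: one set of weights per target class, so coef_ has shape (n_classes, n_features) and intercept_ has shape (n_classes,). The underlying model computes one linear score per class, and the three scores are passed through a softmax (or, with multi_class="ovr", three one-vs-rest logistic regressions are fitted and their outputs are simply normalized).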
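To make the shapes concrete, here is a minimal sketch of how the per-class weights combine into probabilities. It runs on synthetic data, not the asker's dataset: the feature values, the thresholds used to generate grades, and the final "expected grade" score are all assumptions made for illustration. It also assumes a recent scikit-learn, where LogisticRegression with the default lbfgs solver fits a multinomial model for three classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real dataset (all values are made up).
rng = np.random.default_rng(0)
tfidf = rng.random(300)        # hypothetical query_dependent_score
view_time = rng.random(300)    # hypothetical independent_score
X = np.column_stack([tfidf, view_time, tfidf * view_time])

# Hypothetical grades 0/1/2 obtained by thresholding a noisy latent score.
latent = 2 * tfidf + view_time + rng.normal(scale=0.1, size=300)
y = np.digitize(latent, [1.0, 2.0])

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

print(clf.coef_.shape)       # (3, 3): one row of weights per class
print(clf.intercept_.shape)  # (3,): one intercept per class

# Rebuild predict_proba by hand: one linear score per class, then softmax.
scores = X @ clf.coef_.T + clf.intercept_
scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
manual_probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(np.allclose(manual_probs, clf.predict_proba(X)))

# One possible single ranking score (an assumption, not part of sklearn):
# the probability-weighted average of the class labels.
expected_grade = manual_probs @ clf.classes_
```

So there is no single (w0, w1, w2, w3) in this model; each class has its own. If a single score per document is needed for ranking, something like expected_grade above is one option.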

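As for the broader question: since your three output classes are ordered, you might benefit from using that information. Either perform plain regression on the labels (which assumes the numeric values 0, 1, 2 are meaningful, i.e. that the gaps between grades are comparable) or use "ordinal regression." A plain regression also yields a single set of weights, which matches the form of your relevance function.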
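As a rough sketch of the plain-regression option, the snippet below fits sklearn's LinearRegression to the numeric grades and reads off one w0 and one (w1, w2, w3). The data is synthetic and the "first 10 rows are one query" grouping is hypothetical; this illustrates the idea under those assumptions and is not a tuned ranking model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real dataset (all values are made up).
rng = np.random.default_rng(0)
tfidf = rng.random(300)        # hypothetical query_dependent_score
view_time = rng.random(300)    # hypothetical independent_score
X = np.column_stack([tfidf, view_time, tfidf * view_time])
latent = 2 * tfidf + view_time + rng.normal(scale=0.1, size=300)
y = np.digitize(latent, [1.0, 2.0])  # hypothetical grades 0, 1, 2

# Treat the grades as numbers and fit a single linear relevance function.
reg = LinearRegression()
reg.fit(X, y)

w0 = reg.intercept_        # scalar intercept
w1, w2, w3 = reg.coef_     # one weight per feature
print(w0, w1, w2, w3)

# Score the documents of one (hypothetical) query and sort best-first.
query_docs = X[:10]
doc_scores = w0 + query_docs @ reg.coef_
ranking = np.argsort(-doc_scores)  # indices of documents, highest score first
print(ranking)
```

For ranking, only the order of doc_scores within a query matters, so w0 does not affect the result. A proper ordinal regression model would avoid the equal-gap assumption, at the cost of a less direct mapping onto the relevance function.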
