Why is it not necessary to set the test group when using 'rank:pairwise' in xgboost?

I'm new to learning-to-rank. I'm trying to understand the learning-to-rank example provided by xgboost. The core code in rank.py is as follows:

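import xgboost as xgb
from xgboost import DMatrix

# x_*, y_* and group_* are assumed to be loaded earlier in rank.py
train_dmatrix = DMatrix(x_train, y_train)
valid_dmatrix = DMatrix(x_valid, y_valid)
test_dmatrix = DMatrix(x_test)

# Group sizes tell the ranking objective which rows belong to the same query
train_dmatrix.set_group(group_train)
valid_dmatrix.set_group(group_valid)

params = {'objective': 'rank:pairwise', 'eta': 0.1, 'gamma': 1.0,
          'min_child_weight': 0.1, 'max_depth': 6}
xgb_model = xgb.train(params, train_dmatrix, num_boost_round=4,
                      evals=[(valid_dmatrix, 'validation')])
# No group is set on the test matrix before predicting
pred = xgb_model.predict(test_dmatrix)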

Group data is set for both the training and validation sets, but prediction on the test set does not use group data. I also looked at some explanations of the model output, such as "What is the output of XGboost using 'rank:pairwise'?".

In learning to rank, we are trying to predict a relative score for each document with respect to a specific query.

My understanding is that if the test set has no group data, no query is specified. How, then, does the model output a relative score for a specific query?

I also tried adding test_dmatrix.set_group(group_test). The outputs with and without it are in good agreement, for example:

[ 1.3535978  -2.9462705   0.86084974 ... -0.23594362  0.712791
 -1.633297  ]

So my questions are as follows:

  1. Why is it not necessary to set the test group when using 'rank:pairwise' in xgboost?

  2. How can I get the label for a specified query group from the predicted scores?

Can anybody explain it to me? Thanks in advance.



The output is a score that can be used to rank the samples. In this kind of ranking problem, you only care about the ordering of samples within the same group (which you can think of as the results for a given query).

But on the test set, that grouping can safely be left to you. (Indeed, you could just as well run the prediction for each group separately. You can think of the output in your case as assuming that the whole test set comes from a single query.) For scoring on the test set, that is, computing an evaluation metric, the specified groups may matter, but not for just making predictions.
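As a minimal sketch of what "left to you" can look like in practice: given the flat array of predicted scores and the group sizes of the test set, you can turn the scores into per-query ranks yourself. The helper name rank_within_groups and the example numbers are hypothetical, and the sketch assumes the rows are ordered group by group, in the same layout as the list passed to set_group.

```python
import numpy as np

def rank_within_groups(scores, group_sizes):
    """Return, for each row, its 1-based rank inside its own query group.

    scores      : 1-D array of model scores, rows ordered group by group
    group_sizes : number of rows in each consecutive group
    """
    scores = np.asarray(scores, dtype=float)
    if sum(group_sizes) != len(scores):
        raise ValueError("group sizes must sum to the number of scores")
    ranks = np.empty(len(scores), dtype=int)
    start = 0
    for size in group_sizes:
        block = scores[start:start + size]
        # argsort of the negated scores lists row positions from best to worst
        order = np.argsort(-block, kind="stable")
        ranks[start + order] = np.arange(1, size + 1)
        start += size
    return ranks

# Hypothetical scores for two queries with 3 and 2 documents
pred = np.array([1.35, -2.95, 0.86, -0.24, 0.71])
group_test = [3, 2]
ranks = rank_within_groups(pred, group_test)
print(ranks)
```

A higher score means a better position, so rank 1 is the top document for its query. Scores from different groups are never compared with each other here.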

For training, the group data is needed so that the algorithm only compares samples within the same group and does not try to calibrate the rankings across groups.
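To illustrate why the groups matter during training, here is a rough sketch of how a pairwise objective could enumerate its training pairs: only rows from the same group are ever compared. This is an illustration of the idea only, not xgboost's actual implementation (which may, for example, sample pairs rather than enumerate them all). The function name within_group_pairs and the labels are hypothetical.

```python
from itertools import combinations

def within_group_pairs(labels, group_sizes):
    """List (better, worse) row-index pairs, comparing rows of the same group only."""
    pairs = []
    start = 0
    for size in group_sizes:
        rows = range(start, start + size)
        for i, j in combinations(rows, 2):
            if labels[i] > labels[j]:
                pairs.append((i, j))
            elif labels[j] > labels[i]:
                pairs.append((j, i))
            # equal labels carry no preference, so no pair is produced
        start += size
    return pairs

# Hypothetical relevance labels for two queries with 3 and 2 documents
labels = [2, 0, 1, 1, 0]
group_sizes = [3, 2]
pairs = within_group_pairs(labels, group_sizes)
print(pairs)
```

Without the group sizes, rows 0 and 3 (labels 2 and 1) would also form a pair, even though they answer different queries and their labels are not comparable.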

See also:
How fit pairwise ranking models in xgBoost?
https://github.com/dmlc/xgboost/blob/master/doc/tutorials/input_format.rst#group-input-format


I will try to answer your questions:

  1. Splitting the data into training and test sets is common practice in machine learning and data science. The purpose of the split is to present some cases (the training set) so that the algorithm can learn a model without simply memorizing them (overfitting). The model is fit to the training cases and is then applied to the cases in the test dataset, which shows whether the solution generalizes. The case of rank:pairwise is the same: you fit the model on your training dataset and apply it to the test dataset (for which you don't know the output).

Once the model has been applied to the test dataset, you get a predicted solution, which you compare with the true solution ($Y$) of the test dataset. Comparing the two shows the real capability of your model.

  2. How can I get the label for a specified query group from the predicted scores?

The call

pred = xgb_model.predict(test_dmatrix)

should give you the label you are looking for. What's wrong with this code?

Note: It is good, though not so common, practice to use a train/test split to verify that your model doesn't overfit and, once that is verified, to refit the model on the whole dataset.
