Computing probabilities in Plackett-Luce model

I am trying to implement a Plackett-Luce model for learning to rank from click data. Specifically, I am following the paper: Doubly-Robust Estimation for Correcting Position-Bias in Click Feedback for Unbiased Learning to Rank.

The objective function is a reward function similar to the one used in reinforcement learning:

$R(\pi) = \sum_{d} R_d \sum_{k} w_k \, \pi(k \vert d)$

Here $R_d$ is the reward for document $d$, $\pi(k \vert d)$ is the probability of document $d$ being placed at position $k$ for a given query $q$, and $w_k$ is the weight of position $k$.
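For concreteness, here is a minimal numpy sketch of that reward as a matrix computation (the variable names and toy numbers are mine, not from the paper; `pi[d, k]` stores $\pi(k \vert d)$):

```python
import numpy as np

# Toy example: 4 documents, 3 display positions (all values hypothetical).
R_d = np.array([1.0, 0.0, 0.5, 0.0])   # per-document rewards R_d
w = np.array([1.0, 0.5, 0.25])         # position weights w_k
pi = np.full((4, 3), 0.25)             # pi[d, k] = pi(k | d); each column sums to 1

# R(pi) = sum_d R_d * sum_k w_k * pi(k | d)
reward = R_d @ (pi @ w)
print(reward)  # a single scalar reward for this ranking distribution
```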

Similar to reinforcement learning, an importance-sampled variant of the loss function is used, given by

$\hat{R}_{\mathrm{IPS}}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \sum_{d} \frac{c_{i}(d)}{\rho(d)} \sum_{k} w_k \, \pi(k \vert d)$

where $c_{i}(d)$ is the click indicator and $\rho(d)$ is the logging propensity, generated from the logging policy $\pi_0$:

$\rho(d) = \sum_{k} w_k \, \pi_0(k \vert d)$
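As a sketch of how this estimator can be evaluated (the clicks, propensities, and shapes below are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_docs = 1000, 4

w = np.array([1.0, 0.5, 0.25])                    # position weights w_k
pi = np.full((n_docs, 3), 0.25)                   # pi[d, k] = pi(k | d)
clicks = rng.binomial(1, 0.1, size=(N, n_docs))   # c_i(d): click indicators
rho = np.array([0.6, 0.3, 0.2, 0.1])              # logging propensities rho(d)

# (1/N) * sum_i sum_d  c_i(d) / rho(d) * sum_k w_k * pi(k | d)
exposure = pi @ w                       # sum_k w_k * pi(k | d), per document
ips_reward = np.mean((clicks / rho) @ exposure)
print(ips_reward)
```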

Now the problem is to estimate $\pi(k \vert d)$ empirically, from the logged data.

I estimate it from the logged data as

$\hat{\pi}(k \vert d) = \frac{1}{N} \sum_{i=1}^{N} I \left[ \operatorname{rank}(d \vert y_i) = k \right].$

Here $y_i$ is the $i$-th ranking sampled for a given query $q$; I am essentially counting the number of times document $d$ was shown at position $k$.
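Concretely, this is a per-position count over the logged rankings; a minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def estimate_pi(rankings, n_docs, n_pos):
    """Empirical pi(k|d): fraction of logged rankings with doc d at position k.

    rankings: (N, n_pos) int array; rankings[i, k] is the document shown
    at position k in the i-th logged ranking y_i.
    """
    N = rankings.shape[0]
    pi_hat = np.zeros((n_docs, n_pos))
    for k in range(n_pos):
        # For each document, count how often it occupied position k.
        counts = np.bincount(rankings[:, k], minlength=n_docs)
        pi_hat[:, k] = counts / N
    return pi_hat

# Hypothetical logged data: 5 rankings of 4 documents over 3 positions.
rankings = np.array([[0, 1, 2], [0, 2, 1], [1, 0, 3], [0, 1, 2], [2, 0, 1]])
print(estimate_pi(rankings, n_docs=4, n_pos=3))  # zero rows show the sparsity issue
```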

When I use this estimate to optimize the counterfactual ranking loss, it does not work very well. Moreover, the estimate comes out as exactly zero for any document that never appeared at position $k$ in the logged rankings for some queries.

I do have access to the 'true' $\pi(k \vert d)$ from the logging policy model (a small MLP). However, I am not sure how to use it, together with the logged data, to estimate this probability.
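For reference, this is how I can query placement probabilities from the logging policy (a Monte Carlo sketch; it assumes the MLP outputs are the log-weights of a Plackett-Luce model, and uses the Gumbel-top-k trick to sample rankings):

```python
import numpy as np

def pl_placement_probs(scores, n_pos, n_samples=10_000, seed=0):
    """Monte Carlo estimate of pi(k|d) under a Plackett-Luce policy.

    Sampling a PL ranking over log-weights `scores` is equivalent to
    sorting `scores + Gumbel noise` in decreasing order (Gumbel-top-k),
    so placement probabilities can be estimated to arbitrary precision.
    """
    rng = np.random.default_rng(seed)
    n_docs = scores.shape[0]
    pi_hat = np.zeros((n_docs, n_pos))
    for _ in range(n_samples):
        gumbel = rng.gumbel(size=n_docs)
        top = np.argsort(-(scores + gumbel))[:n_pos]   # sampled ranking y
        pi_hat[top, np.arange(n_pos)] += 1.0           # doc top[k] sat at position k
    return pi_hat / n_samples

# Hypothetical MLP scores (log-weights) for 4 candidate documents.
scores = np.array([2.0, 1.0, 0.5, -1.0])
print(pl_placement_probs(scores, n_pos=3))
```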
