Statistical machine translation word alignment for FR-ENG and ENG-FR: what is p(e) and p(f)?

I'm currently trying to implement this paper, but am struggling to understand some of the math. I'm pretty sure I understand how to implement the E-step, but I'm confused about how to compute the M-step. It says just before section 3.1 that $p_1(x, z; \theta_1) = p(e)p(a, f|e; \theta_1)$, and then the same for $p_2$ but with $e$ and $f$ swapped. The second part of this makes sense to me, but what is $p(e)$ or $p(f)$? From my understanding, $e, f$ are sentences in the bi-text. So how would we compute the probability of a sentence?

It says earlier that $p(e)$ and $p(f)$ are arbitrary distributions that don't affect the optimization problem, but then how do we compute $p_1(x, z; \theta_1)$?

Thanks!

Topic hidden-markov-model probability expectation-maximization machine-translation machine-learning

Category Data Science


You are right that $p(e)$ is the probability of the English sentence. You estimate the probability of a sentence with a language model.

This kind of machine translation model is known as the noisy channel model. The noisy channel model says that given a French sentence $f$, its best English translation is

$$e^* = \arg\max_{e\in E} p(e)p(f|e)$$
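In practice the argmax is taken over candidate translations scored in log space. Here is a minimal sketch of that decision rule; the candidate sentences and their scores are made up purely for illustration:

```python
# Hypothetical candidate translations of one French sentence, each with a
# log language-model score log p(e) and a log translation-model score
# log p(f|e). The numbers are invented for illustration only.
candidates = {
    "the house is small": {"log_lm": -4.1, "log_tm": -2.0},
    "house the small is": {"log_lm": -9.7, "log_tm": -1.8},
}

# Noisy channel: e* = argmax_e p(e) p(f|e), computed as a sum of logs.
best = max(candidates, key=lambda e: candidates[e]["log_lm"] + candidates[e]["log_tm"])
```

Note how the word-salad candidate loses despite a slightly better translation-model score: the language model $p(e)$ is what penalizes unnatural English.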

In this equation, $p(e)$ is the language model. Back in the era of the IBM models (which are built on the noisy channel approach), it was usually an n-gram language model; assuming bigrams, $$p(e_1e_2\dots e_n)=p(e_1|\langle s\rangle)\,p(e_2|e_1)\,p(e_3|e_2)\cdots p(\langle/s\rangle|e_n)$$
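A minimal sketch of that bigram formula, using maximum-likelihood estimates from a toy corpus (the corpus and function name are made up for illustration; no smoothing is applied, so unseen words would break it):

```python
from collections import Counter

# Toy corpus; <s> and </s> are sentence-boundary markers.
corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

# MLE counts: bigram counts, and unigram counts of every bigram history
# (</s> is excluded because it never starts a bigram).
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
unigrams = Counter(w for sent in corpus for w in sent[:-1])

def p_bigram(sentence):
    """p(e_1...e_n) = p(e_1|<s>) p(e_2|e_1) ... p(</s>|e_n) under MLE."""
    words = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p
```

For example, `p_bigram(["the", "cat", "sat"])` multiplies $p(\text{the}|\langle s\rangle)\,p(\text{cat}|\text{the})\,p(\text{sat}|\text{cat})\,p(\langle/s\rangle|\text{sat})$.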

And $p(f|e)$ is the translation model, whose parameters you estimate with the EM algorithm. Inside the EM algorithm you never update the language-model parameters, so yes, $p(e)$ and $p(f)$ don't affect the optimization problem.
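To make that concrete, here is a sketch of the EM loop for the simplest translation model, IBM Model 1 (not necessarily the exact model in your paper, and the toy bitext is invented). Notice that $p(e)$ never appears in either step:

```python
from collections import defaultdict

def ibm1_em(bitext, iterations=10):
    """Estimate IBM Model 1 translation probabilities t(f|e) with EM."""
    f_vocab = {f for (_, fs) in bitext for f in fs}
    # Uniform initialization of t(f|e).
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for es, fs in bitext:
            for f in fs:
                # E-step: posterior probability that f aligns to each e.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-normalize the expected counts.
        for (f, e) in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t

bitext = [(["the", "house"], ["la", "maison"]),
          (["the", "book"], ["le", "livre"]),
          (["a", "book"], ["un", "livre"])]
t = ibm1_em(bitext)
```

After a few iterations, mass concentrates on co-occurring pairs such as $t(\text{livre}|\text{book})$, even though the sentence-level distributions $p(e)$ and $p(f)$ were never touched.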
