PMI between a lemma and its surface form
I was wondering whether it's possible to compute some sort of pointwise mutual information between a lemma and its surface form.
First, if we assume:
p('to go') = count('to go') / sum(all lemmas)
p('went') = count('went') / sum(all words)
A first snag here: since every word comes with exactly one lemma, we have the condition that
sum(all lemmas) == sum(all words)
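For concreteness, here is a minimal sketch of how I compute these marginals, assuming the corpus comes as a list of (surface, lemma) token pairs (the pairs below are made up):

```python
from collections import Counter

# made-up toy corpus: one (surface, lemma) pair per token
tokens = [("went", "to go"), ("goes", "to go"), ("went", "to go"),
          ("going", "to go"), ("ate", "to eat"), ("eats", "to eat")]

total = len(tokens)  # sum(all words) == sum(all lemmas): one lemma per word

surface_counts = Counter(s for s, _ in tokens)
lemma_counts = Counter(l for _, l in tokens)

p_surface = {w: c / total for w, c in surface_counts.items()}  # p('went') = 2/6
p_lemma = {l: c / total for l, c in lemma_counts.items()}      # p('to go') = 4/6
```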
The joint probability is also a little hard to normalize:
# count of "went" being lemmatized to "to go"
p('went', 'to go') = count('went'-'to go') / sum(all words)
The catch here is that, since "went" always lemmatizes to "to go", we have the special condition
count('went'-'to go') == count('went')
The reverse is not true, though, since "to go" can be realized as different surface forms ("go", "goes", "going", "gone", "went").
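A quick sanity check on the same toy pairs (self-contained sketch, made-up data): the joint count of ('went', 'to go') is exactly the count of 'went' because the lemmatizer is deterministic, while 'to go' also collects counts from other surface forms.

```python
from collections import Counter

tokens = [("went", "to go"), ("goes", "to go"), ("went", "to go"),
          ("going", "to go"), ("ate", "to eat"), ("eats", "to eat")]

surface_counts = Counter(s for s, _ in tokens)
lemma_counts = Counter(l for _, l in tokens)
joint_counts = Counter(tokens)  # counts of (surface, lemma) pairs

# deterministic lemmatization: count('went'-'to go') == count('went')
assert joint_counts[("went", "to go")] == surface_counts["went"]

# but not the reverse: 'to go' is realized by several surface forms
assert lemma_counts["to go"] > joint_counts[("went", "to go")]
```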
If we put it all together, things get awkward:
p('went', 'to go') = p('went') = count('went') / sum(all words)
p('to go') = count('to go') / sum(all words)
PMI('went', 'to go')
= p('went', 'to go') / (p('to go') * p('went'))
= p('went') / (p('to go') * p('went'))
= 1 / p('to go')
But that would assign the same information value to every surface form of a given lemma.
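To make that concrete, here is a small numeric check (toy counts, using the ratio exactly as written above, without a log): every surface form of 'to go' ends up with the same score.

```python
# toy counts: surface forms of 'to go' and their frequencies
counts = {"went": 2, "goes": 1, "going": 1}
total = 6                            # total tokens in the toy corpus
lemma_count = sum(counts.values())   # count('to go') = 4
p_lemma = lemma_count / total

for surface, c in counts.items():
    p_surface = c / total
    p_joint = c / total              # joint count == surface count
    pmi_ratio = p_joint / (p_lemma * p_surface)
    print(surface, pmi_ratio)        # 1 / p('to go') = 1.5 for every surface form
```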
I think I've made some mistakes in how I'm accounting for the joint or independent probabilities. Could anyone advise how to approach an information score for a lemma and its surface form?
Or is the information score between a lemma and its surface form simply uninformative, so that the inverse probability of the lemma is the natural conclusion?
One possible fix is to change the normalization of the joint probability so that:
p('went', 'to go') = count('went') / count('to go')
Then the PMI equation would be:
PMI('went', 'to go')
= p('went', 'to go') / (p('to go') * p('went'))
= (p('went') / p('to go')) / (p('to go') * p('went'))
= 1 / p('to go')**2
Still, the final mutual information disregards the surface probability =(
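The same toy counts confirm this: with the modified normalization, every surface form of 'to go' again gets an identical score, now 1 / p('to go')**2, regardless of how frequent the surface form itself is.

```python
counts = {"went": 2, "goes": 1, "going": 1}   # surface forms of 'to go'
total = 6
lemma_count = sum(counts.values())            # count('to go') = 4
p_lemma = lemma_count / total

for surface, c in counts.items():
    p_surface = c / total
    p_joint = c / lemma_count                 # count(surface) / count('to go')
    pmi_ratio = p_joint / (p_lemma * p_surface)
    print(surface, pmi_ratio)                 # 1 / p('to go')**2 = 2.25 for every surface form
```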