PMI between a lemma and its surface form
I was wondering whether it's possible to compute some sort of pointwise mutual information between a lemma and its surface form.
First, if we assume:
p('to go') = count('to go') / sum(all lemmas)
p('went') = count('went') / sum(all words)
A first snag here: since every word comes with exactly one lemma, we have the condition that
sum(all lemmas) == sum(all words)
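For concreteness, here is a minimal sketch of how I compute these marginals, assuming the corpus comes as a list of (surface, lemma) token pairs (the pairs below are made up):

```python
from collections import Counter

# made-up toy corpus: one (surface, lemma) pair per token
tokens = [("went", "to go"), ("goes", "to go"), ("went", "to go"),
          ("going", "to go"), ("ate", "to eat"), ("eats", "to eat")]

total = len(tokens)  # sum(all words) == sum(all lemmas): one lemma per word

surface_counts = Counter(s for s, _ in tokens)
lemma_counts = Counter(l for _, l in tokens)

p_surface = {w: c / total for w, c in surface_counts.items()}  # p('went') = 2/6
p_lemma = {l: c / total for l, c in lemma_counts.items()}      # p('to go') = 4/6
```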
The joint probability is also a little hard to normalize:
# count of "went" being lemmatized to "to go"
p('went', 'to go') = count('went'-'to go') / sum(all words)
The catch here is that, since "went" always lemmatizes to "to go", we have the special condition
count('went'-'to go') == count('went')
The reverse is not true, though, since "to go" can be realized as different surface forms ("go", "goes", "going", "gone", "went").
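A quick sanity check on the same toy pairs (self-contained sketch, made-up data): the joint count of ('went', 'to go') is exactly the count of 'went' because the lemmatizer is deterministic, while 'to go' also collects counts from other surface forms.

```python
from collections import Counter

tokens = [("went", "to go"), ("goes", "to go"), ("went", "to go"),
          ("going", "to go"), ("ate", "to eat"), ("eats", "to eat")]

surface_counts = Counter(s for s, _ in tokens)
lemma_counts = Counter(l for _, l in tokens)
joint_counts = Counter(tokens)  # counts of (surface, lemma) pairs

# deterministic lemmatization: count('went'-'to go') == count('went')
assert joint_counts[("went", "to go")] == surface_counts["went"]

# but not the reverse: 'to go' is realized by several surface forms
assert lemma_counts["to go"] > joint_counts[("went", "to go")]
```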
If we put it all together, things get awkward:
p('went', 'to go') = p('went') = count('went') / sum(all words)
p('to go') = count('to go') / sum(all words)
PMI('went', 'to go')
= p('went', 'to go') / (p('to go') * p('went'))
= p('went') / (p('to go') * p('went'))
= 1 / p('to go')
But that would assign the same information value to every surface form of a given lemma.
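To make that concrete, here is a small numeric check (toy counts, using the ratio exactly as written above, without a log): every surface form of 'to go' ends up with the same score.

```python
# toy counts: surface forms of 'to go' and their frequencies
counts = {"went": 2, "goes": 1, "going": 1}
total = 6                            # total tokens in the toy corpus
lemma_count = sum(counts.values())   # count('to go') = 4
p_lemma = lemma_count / total

for surface, c in counts.items():
    p_surface = c / total
    p_joint = c / total              # joint count == surface count
    pmi_ratio = p_joint / (p_lemma * p_surface)
    print(surface, pmi_ratio)        # 1 / p('to go') = 1.5 for every surface form
```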
I think I've made some mistakes in how I'm accounting for the joint or independent probabilities. Could anyone advise how to approach an information score for a lemma and its surface form?
Or is the information score between a lemma and its surface form simply uninformative, so that the inverse probability of the lemma is the natural conclusion?
One possible fix is to change the normalization of the joint probability so that:
p('went', 'to go') = count('went') / count('to go')
Then the PMI equation would be:
PMI('went', 'to go')
= p('went', 'to go') / (p('to go') * p('went'))
= (p('went') / p('to go')) / (p('to go') * p('went'))
= 1 / p('to go')**2
Still, the final mutual information disregards the surface probability =(
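The same toy counts confirm this: with the modified normalization, every surface form of 'to go' again gets an identical score, now 1 / p('to go')**2, regardless of how frequent the surface form itself is.

```python
counts = {"went": 2, "goes": 1, "going": 1}   # surface forms of 'to go'
total = 6
lemma_count = sum(counts.values())            # count('to go') = 4
p_lemma = lemma_count / total

for surface, c in counts.items():
    p_surface = c / total
    p_joint = c / lemma_count                 # count(surface) / count('to go')
    pmi_ratio = p_joint / (p_lemma * p_surface)
    print(surface, pmi_ratio)                 # 1 / p('to go')**2 = 2.25 for every surface form
```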