In smoothing of an n-gram model in NLP, why don't we consider start- and end-of-sentence tokens?
When learning Add-1 smoothing, I found that we add 1 for each word in our vocabulary, but do not count the start-of-sentence and end-of-sentence tokens as two extra words in that vocabulary. Let me give an example to explain.
Example:
Assume we have a corpus of three sentences:

- John read Moby Dick
- Mary read a different book
- She read a book by Cher
After training our bigram model on this corpus of three sentences, we want to evaluate the probability of the sentence 'John read a book', i.e. to find $P(John\; read\; a\; book)$.
To differentiate 'John' appearing anywhere in a sentence from its appearance at the beginning, and likewise 'book' appearing at the end, we instead compute $P(\langle s\rangle\; John\; read\; a\; book\; \langle /s\rangle)$ after introducing two more tokens, $\langle s\rangle$ and $\langle /s\rangle$, indicating the start and the end of a sentence respectively.
Finally, we arrive at
$P(\langle s\rangle\; John\; read\; a\; book\; \langle /s\rangle) = P(John|\langle s\rangle)\,P(read|John)\,P(a|read)\,P(book|a)\,P(\langle /s\rangle|book)=\frac{1}{3}\cdot\frac{1}{1}\cdot\frac{2}{3}\cdot\frac{1}{2}\cdot\frac{1}{2}$
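To make the counting concrete, here is a minimal Python sketch of the unsmoothed bigram estimate above. The token strings `<s>`/`</s>` and the helper name `p_mle` are my own choices for illustration, not anything prescribed by the question or a particular library.

```python
from collections import Counter

# The three training sentences from the example.
corpus = [
    "John read Moby Dick",
    "Mary read a different book",
    "She read a book by Cher",
]

# Pad each sentence with start/end tokens and collect unigram and bigram counts.
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """Unsmoothed estimate: count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# P(<s> John read a book </s>) = 1/3 * 1/1 * 2/3 * 1/2 * 1/2 = 1/18
test = ["<s>", "John", "read", "a", "book", "</s>"]
prob = 1.0
for prev, word in zip(test, test[1:]):
    prob *= p_mle(word, prev)
print(prob)  # ~0.0556
```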
My Question: Now, to find $P(Cher\; read\; a\; book)$ using Add-1 smoothing (Laplace smoothing), shouldn't we account for the word 'Cher' appearing first in a sentence, a bigram never seen in training? And for that, shouldn't we add $\langle s\rangle$ and $\langle /s\rangle$ to our vocabulary? With this, our calculation becomes:
$P(Cher|\langle s\rangle)\,P(read|Cher)\,P(a|read)\,P(book|a)\,P(\langle /s\rangle|book)=\frac{0+1}{3+13}\cdot\frac{0+1}{1+13}\cdot\frac{2+1}{3+13}\cdot\frac{1+1}{2+13}\cdot\frac{1+1}{2+13}$
The 13 added to each denominator is the size of the vocabulary, which has 11 unique English words from our 3-sentence corpus plus the 2 tokens $\langle s\rangle$ and $\langle /s\rangle$. In a few places, however, I see 11 added to the denominator instead of 13. I am wondering what I am missing here.
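For comparison, here is a self-contained sketch of the Add-1 (Laplace) version that evaluates the same sentence with both vocabulary sizes in question, V = 13 (including `<s>` and `</s>`) and V = 11 (ordinary words only). Again, the names are illustrative assumptions, not from any specific textbook or library.

```python
from collections import Counter

# Same corpus and counts as above, repeated so this sketch runs on its own.
corpus = [
    "John read Moby Dick",
    "Mary read a different book",
    "She read a book by Cher",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_laplace(word, prev, V):
    """Add-1 estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

test = ["<s>", "Cher", "read", "a", "book", "</s>"]
for V in (13, 11):  # 13 = 11 words + <s> + </s>; 11 = words only
    prob = 1.0
    for prev, word in zip(test, test[1:]):
        prob *= p_laplace(word, prev, V)
    print(f"V={V}: P = {prob:.3e}")
```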
Tags: stanford-nlp, ngrams, language-model, nlp
Category: Data Science