In smoothing of an n-gram model in NLP, why don't we consider start- and end-of-sentence tokens?
When learning Add-1 smoothing, I found that we add 1 for each word in our vocabulary, but do not count the start-of-sentence and end-of-sentence tokens as two extra words in that vocabulary. Let me give an example to explain.
Example:
Assume we have a corpus of three sentences:

- John read Moby Dick
- Mary read a different book
- She read a book by Cher
After training our bigram model on this corpus of three sentences, we want to evaluate the probability of the sentence 'John read a book', i.e. to find $P(John\; read\; a\; book)$.
To differentiate 'John' appearing anywhere in a sentence from its appearance at the beginning, and likewise 'book' appearing at the end, we instead compute $P(\langle s\rangle\; John\; read\; a\; book\; \langle /s\rangle)$ after introducing two more tokens, $\langle s\rangle$ and $\langle /s\rangle$, indicating the start and the end of a sentence respectively.
Finally, we arrive at
$P(\langle s\rangle\; John\; read\; a\; book\; \langle /s\rangle) = P(John|\langle s\rangle)\,P(read|John)\,P(a|read)\,P(book|a)\,P(\langle /s\rangle|book)=\frac{1}{3}\cdot\frac{1}{1}\cdot\frac{2}{3}\cdot\frac{1}{2}\cdot\frac{1}{2}$
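To make the counting concrete, here is a minimal Python sketch of the unsmoothed bigram estimate above. The token strings `<s>`/`</s>` and the helper name `p_mle` are my own choices for illustration, not anything prescribed by the question or a particular library.

```python
from collections import Counter

# The three training sentences from the example.
corpus = [
    "John read Moby Dick",
    "Mary read a different book",
    "She read a book by Cher",
]

# Pad each sentence with start/end tokens and collect unigram and bigram counts.
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """Unsmoothed estimate: count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# P(<s> John read a book </s>) = 1/3 * 1/1 * 2/3 * 1/2 * 1/2 = 1/18
test = ["<s>", "John", "read", "a", "book", "</s>"]
prob = 1.0
for prev, word in zip(test, test[1:]):
    prob *= p_mle(word, prev)
print(prob)  # ~0.0556
```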
My Question: Now, to find $P(Cher\; read\; a\; book)$ using Add-1 smoothing (Laplace smoothing), shouldn't we account for the word 'Cher' appearing first in a sentence, a bigram never seen in training? And for that, shouldn't we add $\langle s\rangle$ and $\langle /s\rangle$ to our vocabulary? With this, our calculation becomes:
$P(Cher|\langle s\rangle)\,P(read|Cher)\,P(a|read)\,P(book|a)\,P(\langle /s\rangle|book)=\frac{0+1}{3+13}\cdot\frac{0+1}{1+13}\cdot\frac{2+1}{3+13}\cdot\frac{1+1}{2+13}\cdot\frac{1+1}{2+13}$
The 13 added to each denominator is the size of the vocabulary, which has 11 unique English words from our 3-sentence corpus plus the 2 tokens $\langle s\rangle$ and $\langle /s\rangle$. In a few places, however, I see 11 added to the denominator instead of 13. I am wondering what I am missing here.
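For comparison, here is a self-contained sketch of the Add-1 (Laplace) version that evaluates the same sentence with both vocabulary sizes in question, V = 13 (including `<s>` and `</s>`) and V = 11 (ordinary words only). Again, the names are illustrative assumptions, not from any specific textbook or library.

```python
from collections import Counter

# Same corpus and counts as above, repeated so this sketch runs on its own.
corpus = [
    "John read Moby Dick",
    "Mary read a different book",
    "She read a book by Cher",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_laplace(word, prev, V):
    """Add-1 estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

test = ["<s>", "Cher", "read", "a", "book", "</s>"]
for V in (13, 11):  # 13 = 11 words + <s> + </s>; 11 = words only
    prob = 1.0
    for prev, word in zip(test, test[1:]):
        prob *= p_laplace(word, prev, V)
    print(f"V={V}: P = {prob:.3e}")
```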
Tags: stanford-nlp, ngrams, language-model, nlp
Category: Data Science