One-hot vectors for a fixed vocabulary

Given a vocabulary with $|V| = 4$, for example $V = \{\text{I, want, this, cat}\}$.

What does the bag-of-words representation with one-hot encoding over this vocabulary look like for the following example sentences?

  1. You are the dog here
  2. I am fifty
  3. Cat cat cat

I suppose it would look like this:

  1. $V_1 = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \end{pmatrix}$

  2. $V_2 = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \\ \end{pmatrix}$

  3. $V_3=\begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \\ \end{pmatrix}$

But what exactly is the point of this representation? Does it show the weakness of one-hot encoding with a fixed vocabulary, or did I miss something?


You can see what happens in R with quanteda: build a document-feature matrix (dfm) over the fixed vocabulary, then project a new sentence onto that vocabulary with dfm_match().

library(quanteda)

## Fixed vocabulary, taken from the sentence "I want this cat"
mytext <- c(oldtext = "I want this cat")
dtm_old <- dfm(tokens(mytext))
dtm_old

## A new sentence that shares no words with the vocabulary
newtext <- c(newtext = "You are the dog here")
dtm_new <- dfm(tokens(newtext))
dtm_new

## Keep only the fixed vocabulary: out-of-vocabulary words are dropped,
## and vocabulary words absent from the sentence get a zero count
dtm_matched <- dfm_match(dtm_new, featnames(dtm_old))
dtm_matched

$V_1$ (from "You are the dog here"):

Document-feature matrix of: 1 document, 4 features (100.0% sparse).
         features
docs      i want this cat
  newtext 0    0    0   0

$V_2$ (repeating the same steps with "I am fifty"):

Document-feature matrix of: 1 document, 4 features (75.0% sparse).
         features
docs      i want this cat
  newtext 1    0    0   0

$V_3$ (with "Cat cat cat"):

Document-feature matrix of: 1 document, 4 features (75.0% sparse).
         features
docs      i want this cat
  newtext 0    0    0   3
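
For reference, the same matching can be done for all three example sentences in one call. A minimal sketch (the object names vocab and sents are my own, not from the code above):

library(quanteda)

## Fixed vocabulary taken from "I want this cat"
vocab <- featnames(dfm(tokens("I want this cat")))

## All three example sentences at once (tokens are lowercased by default,
## so "Cat" matches "cat")
sents <- c(s1 = "You are the dog here",
           s2 = "I am fifty",
           s3 = "Cat cat cat")

## A 3 x 4 document-feature matrix whose rows are V1, V2, V3
dfm_match(dfm(tokens(sents)), vocab)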

Of course, when using a true "one-hot" (binary) vectorizer, the entry for "cat" in $V_3$ would be 1 instead of the count 3.
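
If you want that binary version, quanteda can produce it directly with dfm_weight() and scheme = "boolean", which converts all counts to 0/1 indicators. A minimal sketch, applied to a matched dfm like the ones above:

## Turn counts into 0/1 indicators (true one-hot bag of words);
## for "Cat cat cat" the count 3 for "cat" becomes 1
dfm_weight(dtm_matched, scheme = "boolean")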
