One-hot vectors for a fixed vocabulary

Given a vocabulary with $|V| = 4$, for example $V = \{\text{I, want, this, cat}\}$.

What does the bag-of-words representation with one-hot encoding over this vocabulary look like for the following example sentences?

  1. You are the dog here
  2. I am fifty
  3. Cat cat cat

I suppose it would look like this:

  1. $V_1 = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \end{pmatrix}$

  2. $V_2 = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \\ \end{pmatrix}$

  3. $V_3=\begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \\ \end{pmatrix}$

But what exactly is the point of this representation? Does it show the weakness of one-hot encoding with a fixed vocabulary, or did I miss something?


You can see what happens in R with quanteda: build a document-feature matrix (dfm) over the fixed vocabulary, then project a new sentence onto that vocabulary with dfm_match().

library(quanteda)

## Fixed vocabulary, taken from the sentence "I want this cat"
mytext <- c(oldtext = "I want this cat")
dtm_old <- dfm(tokens(mytext))
dtm_old

## A new sentence that shares no words with the vocabulary
newtext <- c(newtext = "You are the dog here")
dtm_new <- dfm(tokens(newtext))
dtm_new

## Keep only the fixed vocabulary: out-of-vocabulary words are dropped,
## and vocabulary words absent from the sentence get a zero count
dtm_matched <- dfm_match(dtm_new, featnames(dtm_old))
dtm_matched

$V_1$ (from "You are the dog here"):

Document-feature matrix of: 1 document, 4 features (100.0% sparse).
         features
docs      i want this cat
  newtext 0    0    0   0

$V_2$ (repeating the same steps with "I am fifty"):

Document-feature matrix of: 1 document, 4 features (75.0% sparse).
         features
docs      i want this cat
  newtext 1    0    0   0

$V_3$ (with "Cat cat cat"):

Document-feature matrix of: 1 document, 4 features (75.0% sparse).
         features
docs      i want this cat
  newtext 0    0    0   3
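
For reference, the same matching can be done for all three example sentences in one call. A minimal sketch (the object names vocab and sents are my own, not from the code above):

library(quanteda)

## Fixed vocabulary taken from "I want this cat"
vocab <- featnames(dfm(tokens("I want this cat")))

## All three example sentences at once (tokens are lowercased by default,
## so "Cat" matches "cat")
sents <- c(s1 = "You are the dog here",
           s2 = "I am fifty",
           s3 = "Cat cat cat")

## A 3 x 4 document-feature matrix whose rows are V1, V2, V3
dfm_match(dfm(tokens(sents)), vocab)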

Of course, when using a true "one-hot" (binary) vectorizer, the entry for "cat" in $V_3$ would be 1 instead of the count 3.
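
If you want that binary version, quanteda can produce it directly with dfm_weight() and scheme = "boolean", which converts all counts to 0/1 indicators. A minimal sketch, applied to a matched dfm like the ones above:

## Turn counts into 0/1 indicators (true one-hot bag of words);
## for "Cat cat cat" the count 3 for "cat" becomes 1
dfm_weight(dtm_matched, scheme = "boolean")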
