What is the difference between a Categorical Column and a Dense Column?

In TensorFlow, there are nine different feature columns, arranged into three groups: categorical, dense and hybrid.

From reading the guide, I understand that categorical columns are used to represent discrete input data with a numerical value. The guide gives the example of a categorical column called a categorical identity column:

ID   Represented using one-hot encoding
0    [1, 0, 0, 0]
1    [0, 1, 0, 0]
2    [0, 0, 1, 0]
3    [0, 0, 0, 1]

But there is also a dense column called an indicator column, which 'wraps'(?) a categorical column to produce something that looks almost identical:

Category (from category column)   Represented as...
0                                 [1, 0, 0, 0]
1                                 [0, 1, 0, 0]
2                                 [0, 0, 1, 0]
3                                 [0, 0, 0, 1]

So both 'categorical' and 'dense' columns seem to be able to represent discrete data, so that is not what distinguishes one from the other.

My question is: in principle, what is the difference between a 'categorical column' and a 'dense column'?


I have read this answer that explains the difference between indicator columns and categorical identity columns, but I am looking for a more generic answer distinguishing categorical and dense columns.

Topic estimators tensorflow machine-learning

Category Data Science


Sparse vs Dense

The 'categorical column' is a sparse column. Sparse and dense columns (or matrices) are, in a way, opposites of each other.

Sparse columns contain mostly zeros, whereas dense columns have mostly non-zero entries. This matters because the way they are stored and processed can differ.

Sparse

If we take your sparse example:

ID   Represented using one-hot encoding
0    [1, 0, 0, 0]
1    [0, 1, 0, 0]
2    [0, 0, 1, 0]
3    [0, 0, 0, 1]

What information needs to be stored to reconstruct this matrix? All you need is:

  • The number of columns and rows
  • The position and value of every non-zero entry

So, by having:

n_columns = 4
n_rows = 4
non_zero_entries = {
    (0, 0): 1,
    (1, 1): 1,
    (2, 2): 1,
    (3, 3): 1
}

you can reconstruct it. This means you don't need to store all 16 values to rebuild the matrix.
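A sketch of that reconstruction in plain Python (the variable names follow the listing above):

```python
# Sparse description of the one-hot matrix: shape plus non-zero entries only.
n_columns = 4
n_rows = 4
non_zero_entries = {
    (0, 0): 1,
    (1, 1): 1,
    (2, 2): 1,
    (3, 3): 1,
}

# Start from an all-zero matrix and fill in only the stored entries.
matrix = [[0] * n_columns for _ in range(n_rows)]
for (row, col), value in non_zero_entries.items():
    matrix[row][col] = value

# matrix is now the full one-hot table:
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

Four stored entries plus two dimensions stand in for all 16 values.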

In TensorFlow, the 'categorical column' matrix has the special property that it is a one-hot encoding, meaning that one and only one entry in each row is non-zero. However, the sparse principle doesn't require the matrix to be one-hot.

For example:

matrix = [
    [0, 0,   0, 0],
    [0, 2,   0, 1],
    [0, 0, 0.5, 0] 
]

can be reconstructed with:

n_columns = 4
n_rows = 3
non_zero_entries = {
    (1, 1): 2,
    (1, 3): 1,
    (2, 2): 0.5
}

Imagine a matrix of size $100 \times 100$. There are $10,000$ values in it. Now suppose only $1$ entry is non-zero (for whatever reason). Then this large $10,000$-value matrix can be represented by:

n_columns = 100
n_rows = 100
non_zero_entries = {
    (x, y): 1
}
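TensorFlow itself has a type for exactly this triple of indices, values and shape: `tf.sparse.SparseTensor`. A minimal sketch of the $100 \times 100$ example, where the position `(5, 7)` is an arbitrary choice standing in for the `(x, y)` above:

```python
import tensorflow as tf

# A 100 x 100 matrix with a single non-zero entry; only the triple of
# indices, values and shape is stored.
sparse = tf.sparse.SparseTensor(
    indices=[[5, 7]], values=[1.0], dense_shape=[100, 100])

# The full matrix is only materialised on demand.
dense = tf.sparse.to_dense(sparse)
# dense has shape (100, 100), with a single 1.0 at (5, 7) and zeros elsewhere
```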

Dense

On the other hand, dense columns are not stored by recording only the non-zero entries; the zeros are stored as well.

While more memory-intensive, they are computationally cheaper because they don't need to be reconstructed. Also, the zeros are often likely to change over time, and since a value is already stored for them, no new memory needs to be allocated when they do.
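A quick way to see the trade-off, assuming NumPy is available: a dense array reserves space for every entry, zeros included, so its footprint depends only on its shape and dtype, and updating a zero in place needs no new allocation.

```python
import numpy as np

# Dense storage: all 10,000 entries occupy memory, zeros included.
dense = np.zeros((100, 100), dtype=np.float64)

dense[5, 7] = 1.0     # updating a zero reuses the slot already allocated

print(dense.nbytes)   # 80000 bytes, regardless of how many zeros it holds
```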

Usability

Not all algorithms can process a sparse matrix. Matrix operations often require the whole vector to be present. Depending on the implementation, the assumption that absent elements are zero may not hold, so an explicit zero needs to be stored and passed along.

As you mentioned in your question:

But you also have a dense column called indicator column, which 'wraps'(?) a categorical column to produce something that looks almost identical

This is true. What they represent can be exactly the same. However, how they are stored and the efficiency of the implementations differ.
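A minimal sketch of that wrapping in code; the feature key `"pet_id"` and the input values are made up, and this relies on the `tf.feature_column` API and `DenseFeatures` layer, which are deprecated in recent TensorFlow releases but still illustrate the point:

```python
import tensorflow as tf

# A categorical (sparse) column over 4 discrete IDs.
pet_id = tf.feature_column.categorical_column_with_identity(
    key="pet_id", num_buckets=4)

# The indicator column 'wraps' it into a dense one-hot representation
# that dense layers can consume directly.
pet_id_onehot = tf.feature_column.indicator_column(pet_id)

layer = tf.keras.layers.DenseFeatures([pet_id_onehot])
dense_out = layer({"pet_id": tf.constant([0, 1, 3])})
# dense_out is an ordinary dense tensor:
# [[1, 0, 0, 0],
#  [0, 1, 0, 0],
#  [0, 0, 0, 1]]
```

The categorical column carries the sparse IDs; only the indicator wrapper materialises them as explicit dense vectors.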
