What is a good approach for embedding both textual and spatial features for document classification?

I am working on a document classifier that should take the document's structure into account as well as its text. My plan is to get the word embeddings along with the word coordinates, combine the two kinds of features somehow, and pass them through a Graph Convolutional Network (GCN) to produce a graph embedding that I can then use to train a classifier. I was referencing this paper, where data extraction is done by first computing a text embedding and an image embedding and then combining them with element-wise addition. What would be a good way to combine the text features with the positional features (x, y, height and width)?
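For illustration, here is a minimal sketch (assuming PyTorch; the class and variable names are my own) of two common ways to fuse a word embedding with normalized (x, y, height, width) features: simple concatenation, or projecting the geometry to the embedding dimension and adding it element-wise, analogous to the text + image fusion described in the question.

```python
# Minimal sketch (PyTorch assumed): fuse a word embedding with
# normalized bounding-box features (x, y, height, width).
import torch
import torch.nn as nn

class TextSpatialFusion(nn.Module):
    def __init__(self, embed_dim=300, mode="concat"):
        super().__init__()
        self.mode = mode
        # Project the 4 geometric features up to the text-embedding
        # dimension so element-wise addition is possible.
        self.spatial_proj = nn.Linear(4, embed_dim)

    def forward(self, word_emb, boxes):
        # word_emb: (num_words, embed_dim) word embeddings
        # boxes:    (num_words, 4) with (x, y, h, w) scaled to [0, 1]
        if self.mode == "concat":
            # Option 1: concatenation -> (num_words, embed_dim + 4)
            return torch.cat([word_emb, boxes], dim=-1)
        # Option 2: project geometry and add element-wise
        return word_emb + self.spatial_proj(boxes)

# Usage: node features for a document graph with 12 words.
words = torch.randn(12, 300)   # placeholder word embeddings
boxes = torch.rand(12, 4)      # normalized (x, y, h, w) per word
node_features = TextSpatialFusion(300, mode="add")(words, boxes)
print(node_features.shape)     # torch.Size([12, 300])
```

The resulting vectors can then serve as node features for the GCN, with the adjacency built for example from spatial proximity of the word boxes.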

Topic: graph-neural-network, text-classification, feature-engineering, word-embeddings, classification

Category: Data Science


When the text is available as a scanned image:

  1. Divide the page image into a small grid of cells.
  2. Assign each cell a row/column index (i, j).
  3. Append two more values to each word vector: the row and column index of the cell the word falls in (see the sketch after this list).
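
A hedged sketch of that grid scheme (the function name, grid size, and box layout (x, y, h, w) in pixels are assumptions, not from the answer):

```python
import numpy as np

def append_grid_position(word_vectors, boxes, page_w, page_h,
                         n_rows=10, n_cols=10):
    """Append the (row, col) grid index of each word to its embedding.

    word_vectors: (num_words, d) array of word embeddings
    boxes:        (num_words, 4) array of (x, y, h, w) in pixels
    page_w/page_h: page dimensions in pixels
    """
    # Use the box centre to decide which grid cell the word falls in.
    cx = boxes[:, 0] + boxes[:, 3] / 2.0   # x + w/2
    cy = boxes[:, 1] + boxes[:, 2] / 2.0   # y + h/2
    col = np.clip((cx / page_w * n_cols).astype(int), 0, n_cols - 1)
    row = np.clip((cy / page_h * n_rows).astype(int), 0, n_rows - 1)
    # Append the two cell indices (i, j) to every word vector.
    return np.concatenate([word_vectors, row[:, None], col[:, None]], axis=1)

# Example: 5 words with 300-d embeddings on a 1000x800 page.
vecs = np.random.randn(5, 300)
boxes = np.array([[10, 20, 30, 80],
                  [400, 20, 30, 60],
                  [10, 300, 30, 80],
                  [700, 650, 30, 90],
                  [350, 760, 30, 70]], dtype=float)
out = append_grid_position(vecs, boxes, page_w=1000, page_h=800)
print(out.shape)   # (5, 302)
```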

When the text is available as HTML: compute embeddings over the entire HTML DOM tree of the document, covering both the tags and the actual text. This way the HTML tags carry the spatial/positional information.
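
As a rough sketch of that idea (the flattening scheme and class name are my own, using only Python's standard-library `html.parser`): flatten the DOM into one sequence of tag tokens and word tokens, then learn embeddings over that sequence so structural tags like `<h1>` or `<table>` act as positional markers.

```python
from html.parser import HTMLParser

class DomFlattener(HTMLParser):
    """Flatten an HTML document into a token sequence of tags and words,
    so structural tags act as positional markers alongside the text."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        self.tokens.extend(data.split())

html_doc = "<html><body><h1>Invoice</h1><p>Total due: 42 USD</p></body></html>"
parser = DomFlattener()
parser.feed(html_doc)
print(parser.tokens)
# ['<html>', '<body>', '<h1>', 'Invoice', '</h1>', '<p>', 'Total',
#  'due:', '42', 'USD', '</p>', '</body>', '</html>']
```

Tags and words then share a single vocabulary, so any embedding or sequence model trained on these tokens picks up the document structure along with the text.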
