What is a good approach for embedding both textual and spatial features for document classification?
I am working on a document classifier that should take the document's structure into account as well as its text. My plan is to get the word embeddings along with the word coordinates, combine the two kinds of features somehow, and pass them through a Graph Convolutional Network (GCN) to produce a graph embedding, which I can then use to train a classifier.

I was referencing this paper, where data extraction is done by first obtaining a text embedding and an image embedding and then combining them via elementwise addition. I was wondering what a good way would be to combine the text features with the positional features (x, y, height, and width).
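To make the question concrete, here is a minimal sketch of the two fusion options I am considering, in NumPy. All names and dimensions (300-d embeddings, boxes normalized to [0, 1]) are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

embed_dim, spatial_dim, n_words = 300, 4, 5

# Per-word text embeddings and bounding boxes (x, y, height, width),
# with coordinates assumed normalized to [0, 1] by page size.
word_emb = rng.normal(size=(n_words, embed_dim))
boxes = rng.uniform(size=(n_words, spatial_dim))

# Option 1: project the 4-d box vector up to the embedding size,
# then add elementwise (mirroring the paper's text + image addition).
W_spatial = rng.normal(size=(spatial_dim, embed_dim)) * 0.01
fused_add = word_emb + boxes @ W_spatial

# Option 2: concatenate text and box features and project back down,
# letting a learned projection decide how to mix the two modalities.
W_concat = rng.normal(size=(embed_dim + spatial_dim, embed_dim)) * 0.01
fused_cat = np.concatenate([word_emb, boxes], axis=-1) @ W_concat

print(fused_add.shape, fused_cat.shape)  # (5, 300) (5, 300)
```

Either fused matrix would then serve as the node-feature input to the GCN. Is one of these clearly preferable, or is there a better-established approach?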