Document understanding - sentence length prediction
The subject is taken almost verbatim from this paper: https://arxiv.org/pdf/2108.02923.pdf. One of the tasks is to tell, within a document, whether two words are part of the same phrase. For example, if the KEY is Name and the VALUE against it is MAD MAX, can we identify MAD and MAX as belonging to the same entity? They show up as separate entities because the ROI algorithm (RCNN and its family) will more or less draw bounding boxes around MAD and MAX separately. To solve this problem, the authors of the above paper take both the textual embedding of each word and its visual embedding (generated by passing the cropped contour through a pre-trained vision network) and pre-train their architecture by asking it to predict the length of the sentence seen in the image.
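Just to make that pre-training idea concrete, here is a minimal sketch (in PyTorch) of fusing a word's textual embedding with a visual embedding from a pre-trained backbone and predicting sentence length as a classification target. The module name, dimensions (`text_dim`, `visual_dim`, `max_len`) and the choice of ResNet-18 are my own assumptions for illustration, not the paper's actual architecture:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SentenceLengthPretrainHead(nn.Module):
    """Hypothetical sketch: fuse textual + visual embeddings of a word
    and predict the length of the sentence the word belongs to."""
    def __init__(self, text_dim=768, visual_dim=512, max_len=64):
        super().__init__()
        # Pre-trained visual encoder with its classifier layer removed.
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.visual_encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.fuse = nn.Linear(text_dim + visual_dim, 512)
        # Sentence length treated as a classification target over 1..max_len.
        self.length_head = nn.Linear(512, max_len)

    def forward(self, text_emb, word_crops):
        # text_emb: (batch, text_dim) word embeddings from a language model
        # word_crops: (batch, 3, H, W) cropped word images
        vis = self.visual_encoder(word_crops).flatten(1)           # (batch, visual_dim)
        fused = torch.relu(self.fuse(torch.cat([text_emb, vis], dim=1)))
        return self.length_head(fused)                             # logits over lengths
```

Training would simply be `nn.CrossEntropyLoss` between these logits and the true sentence length observed in the image.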
This should work quite well for data the model has seen, BUT it will have issues when it encounters newer document formats, where the language model is going to give rather poor results. To make the approach entirely independent of the textual embeddings, I was thinking of using ONLY visual features, and in a different fashion. I would request the audience here to opine on the same.
The top of the image is supposed to show a document format X that has certain KEYs (DATE, NAME, ...) and VALUEs. Also note that unless the distance between pixels is really low, each contour drawn on the words would mostly capture one word; for example, STEVEN and SEGAL would definitely occur as two separate contours. The algorithm would start forming pairs of contours and stitch them together. Once this has been done for the current line and its neighbourhood (say, one line above and one below), it would pass these stitched images through a pre-trained model to extract visual features. These features would be passed to a multi-head attention block, followed by a few dense layers; the final dense layer would apply a sigmoid and generate a multi-label output (the GT would likewise be a multi-label binary vector). For example, looking at the image above, the GT vector would be 0, 0, 0, 1, since only STEVEN SEGAL form one entity; the rest do not. A rough sketch of this pipeline is given below.
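Here is a minimal PyTorch sketch of what I have in mind for the part after the contour pairs have been stitched: a pre-trained backbone extracts a feature vector per stitched pair image, a multi-head attention block lets the pairs from the current line and its neighbourhood attend to each other, and a small dense head with a sigmoid produces one score per pair (the multi-label output). All names and dimensions (`feat_dim`, `num_heads`, the ResNet-18 backbone) are assumptions I'm making for illustration:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PairGroupingHead(nn.Module):
    """Hypothetical sketch: score each stitched contour-pair image with the
    probability that its two words belong to the same entity."""
    def __init__(self, feat_dim=512, num_heads=4):
        super().__init__()
        # Pre-trained visual feature extractor (classifier removed).
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.visual_encoder = nn.Sequential(*list(backbone.children())[:-1])
        # Pairs from the current line and its neighbourhood attend to each other.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),   # one logit per contour pair
        )

    def forward(self, pair_images):
        # pair_images: (batch, num_pairs, 3, H, W) stitched contour-pair crops
        b, n = pair_images.shape[:2]
        feats = self.visual_encoder(pair_images.flatten(0, 1)).flatten(1)  # (b*n, feat_dim)
        feats = feats.view(b, n, -1)
        attended, _ = self.attn(feats, feats, feats)        # (b, n, feat_dim)
        logits = self.head(attended).squeeze(-1)            # (b, n)
        return torch.sigmoid(logits)                        # multi-label scores

# Training would use nn.BCELoss on these scores (or BCEWithLogitsLoss on the raw
# logits) against the multi-label GT vector, e.g. [0, 0, 0, 1] for the case where
# only the STEVEN + SEGAL pair belongs to the same entity.
```

Would this kind of purely visual pipeline be a reasonable way to learn the grouping, or am I missing something obvious?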