Difference between the architectures of semantic and instance segmentation
My question is about the difference between the architectures of semantic segmentation and instance segmentation models. As far as I understand, a semantic segmentation model performs pixel-wise classification, so it ends in a layer that outputs, for each pixel, one score per label (class). The part that confuses me is how instance segmentation models distinguish between different instances of the same class. What does their architecture look like?
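To make my understanding of the semantic part concrete, here is a minimal NumPy sketch of what I mean by pixel-wise classification. The function name `semantic_head`, the shapes, and the random features standing in for a real backbone are all my own assumptions, not taken from any particular model.

```python
import numpy as np

def semantic_head(features, weights, bias):
    """Per-pixel linear classifier: (H, W, C) features -> (H, W) class ids."""
    # The same weights are applied at every pixel, which is equivalent
    # to a 1x1 convolution with one output channel per class.
    logits = features @ weights + bias  # shape (H, W, num_classes)
    return logits.argmax(axis=-1)       # most likely class for each pixel

# Random features stand in for the output of an encoder-decoder network.
rng = np.random.default_rng(0)
height, width, channels, num_classes = 4, 6, 8, 3
features = rng.normal(size=(height, width, channels))
weights = rng.normal(size=(channels, num_classes))
bias = np.zeros(num_classes)

label_map = semantic_head(features, weights, bias)
print(label_map.shape)  # (4, 6)
```

The output holds only a class id per pixel, so two separate objects of the same class receive the same label, which is exactly the limitation I am asking about.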
For context, I work on NLP and information extraction from documents. I am currently trying to implement the model described in the paper Chargrid: Towards Understanding 2D Documents, which performs both instance and semantic segmentation, and I could not understand the architecture. The paper states that there are different fields, such as invoice number, amount, and vendor name. There are also line-items, and there might be several of them in one document. To differentiate between individual line-items, the authors introduce a bounding-box regression branch in the decoder in addition to the semantic segmentation branch. I do not understand how bounding-box regression helps to identify individual items.
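The only way I can picture it is that each predicted box is used to cut the line-item pixels of the semantic mask into separate groups. The sketch below shows that guess with hand-written boxes; the class id `LINE_ITEM`, the function `split_by_boxes`, and the `(top, left, bottom, right)` box format are hypothetical, and I do not know whether this matches what the paper actually does.

```python
import numpy as np

LINE_ITEM = 1  # hypothetical class id for "line-item" pixels

def split_by_boxes(label_map, boxes, class_id=LINE_ITEM):
    """Give each box its own instance id for the class pixels it contains."""
    instance_map = np.zeros_like(label_map)  # 0 means "no instance"
    for instance_id, (top, left, bottom, right) in enumerate(boxes, start=1):
        in_class = label_map[top:bottom, left:right] == class_id
        region = instance_map[top:bottom, left:right]  # view into instance_map
        # Only claim pixels that no earlier box has already claimed.
        region[in_class & (region == 0)] = instance_id
    return instance_map

# Two line-items that the semantic mask alone cannot tell apart.
label_map = np.zeros((6, 5), dtype=int)
label_map[1:3, 1:4] = LINE_ITEM
label_map[4:6, 1:4] = LINE_ITEM

# Boxes written by hand here; in a model they would come from the box branch.
boxes = [(1, 0, 3, 5), (4, 0, 6, 5)]
instance_map = split_by_boxes(label_map, boxes)
print(instance_map)
```

Under this assumption the semantic branch says what each pixel is, and the boxes say which line-item it belongs to. Is that the right way to read the architecture?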