How to arrange ground truth for anchor-box representation in object detection

I am working through the CharGrid and BERTGrid papers and have a question about the bounding box regression decoder. The CharGrid paper states that this branch has two outputs: one with 2Na channels and one with 4Na channels, where Na is the number of anchor boxes per pixel. The first output says whether an anchor contains an object or not, and the second gives the four bounding box coordinates. I follow the paper up to this point.

Now suppose Na is 2, so this branch predicts two boxes per pixel. The shapes of the two outputs would then be (B, 4, H, W) and (B, 8, H, W), where B is the batch size and H and W are the height and width of the document after resizing to a fixed size.

How do I compare these outputs with the ground truth, since a loss function has to guide the network? The ground-truth tensors will always have shapes (B, 2, H, W) and (B, 4, H, W), because in the ground truth there is only one bounding box per pixel, whereas the network predicts two boxes per pixel. How can I continue from here? I know this is an implementation detail, but it is the part that really confuses me. Thanks.
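For concreteness, here is a minimal sketch of the two decoder heads described above, assuming Na = 2 and a 1×1 convolution on top of a decoder feature map; the channel count, feature-map size, and layer choice are illustrative assumptions, not taken from the CharGrid paper:

```python
import torch
import torch.nn as nn

class BoxDecoderHeads(nn.Module):
    """Sketch of the two outputs of the bbox regression branch."""

    def __init__(self, in_channels, num_anchors):
        super().__init__()
        # Objectness head: 2 logits (object / no object) per anchor -> 2*Na channels.
        self.cls_head = nn.Conv2d(in_channels, 2 * num_anchors, kernel_size=1)
        # Box head: 4 coordinate offsets per anchor -> 4*Na channels.
        self.box_head = nn.Conv2d(in_channels, 4 * num_anchors, kernel_size=1)

    def forward(self, x):
        return self.cls_head(x), self.box_head(x)

Na = 2                        # anchors per pixel, as in the example above
B, C, H, W = 1, 64, 128, 128  # illustrative sizes, not from the paper
features = torch.randn(B, C, H, W)

cls_out, box_out = BoxDecoderHeads(C, Na)(features)
print(cls_out.shape)  # torch.Size([1, 4, 128, 128])  -> (B, 2*Na, H, W)
print(box_out.shape)  # torch.Size([1, 8, 128, 128])  -> (B, 4*Na, H, W)
```

For what it's worth, in standard Faster R-CNN-style training the ground-truth tensors are usually laid out the same way as the outputs: each of the Na anchors at a pixel gets its own objectness label and regression target (assigned by IoU matching against the annotated boxes), rather than storing a single box per pixel.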

Topic: information-extraction, bert, faster-rcnn, object-detection, nlp

Category: Data Science
