What is the difference in computational cost at inference time between object detection and semantic segmentation?

I am aware that the YOLO family (v1-v5) consists of real-time object detection models with reasonably good prediction performance, and that UNet and its variants are efficient semantic segmentation models that are also fast and predict well.

I cannot find any resources comparing the inference speed of these two approaches. Intuitively, semantic segmentation, which classifies every pixel in the image, seems like a clearly harder problem than object detection, which only draws bounding boxes around objects. The sketch below shows where this intuition of mine comes from.
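
To make that intuition concrete, here is a back-of-the-envelope comparison of raw output sizes. The numbers are my own toy choices (a 512x512 image with 21 classes for segmentation, and the original YOLOv1 head layout of a 7x7 grid with 2 boxes per cell and 20 classes for detection):

```python
# Back-of-the-envelope output sizes (my own toy numbers, not from a paper).
H, W, C = 512, 512, 21                 # image resolution and number of classes
seg_outputs = H * W * C                # segmentation: one class score per pixel
S, B = 7, 2                            # YOLOv1-style grid and boxes per cell
det_outputs = S * S * (B * 5 + 20)     # YOLOv1 head: 7x7x(2*5+20) values
print(f"segmentation output values: {seg_outputs:,}")   # 5,505,024
print(f"detection output values:    {det_outputs:,}")   # 1,470
```

Of course, the output layer is only a small part of the network, and both architectures spend most of their compute in a convolutional backbone, which is exactly why I am unsure whether this intuition actually translates into a real difference in inference cost.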

Does anyone have good resources for this comparison, or a solid explanation of why one approach is computationally more demanding than the other? For concreteness, the sketch below shows the kind of timing measurement I have in mind.
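
This is a minimal timing sketch of my own, assuming PyTorch and torchvision are available. Since torchvision ships neither YOLO nor UNet, its FCN and Faster R-CNN models stand in purely to illustrate the measurement; the actual models would be swapped in for a real comparison:

```python
# Rough per-image latency harness (a sketch of mine, not a published benchmark).
import time
import torch
import torchvision

@torch.no_grad()
def mean_latency_ms(model, run_once, warmup=5, runs=20):
    model.eval()
    for _ in range(warmup):            # warm-up iterations stabilise timings
        run_once(model)
    start = time.perf_counter()        # on GPU, add torch.cuda.synchronize()
    for _ in range(runs):
        run_once(model)
    return (time.perf_counter() - start) / runs * 1000

img = torch.rand(1, 3, 512, 512)       # dummy input image

seg = torchvision.models.segmentation.fcn_resnet50(weights=None)
det = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, weights_backbone=None)

print(f"segmentation: {mean_latency_ms(seg, lambda m: m(img)):.1f} ms")
print(f"detection:    {mean_latency_ms(det, lambda m: m([img[0]])):.1f} ms")
```

I realise that wall-clock numbers like these depend heavily on hardware, resolution, and batch size, which is partly why I am hoping for a principled explanation rather than a single benchmark.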

Tags: semantic-segmentation, object-detection, convolutional-neural-network, computer-vision, efficiency
