Understanding 'scale_boxes' in YOLO Algorithm of CNN

I'm studying Andrew Ng's Convolutional Neural Networks course and am in Week 3, which deals with object detection using the YOLO algorithm. I don't understand one section of the programming assignment that uses a function called 'scale_boxes'. This is how the course materials describe the function:

*There're a few ways of representing boxes, such as via their corners or via their midpoint and height/width. YOLO converts between a few such formats at different times, using the following functions (which we have provided):

boxes = yolo_boxes_to_corners(box_xy, box_wh) which converts the yolo box coordinates (x,y,w,h) to box corners' coordinates (x1, y1, x2, y2) to fit the input of yolo_filter_boxes

boxes = scale_boxes(boxes, image_shape) YOLO's network was trained to run on 608x608 images. If you are testing this data on a different size image--for example, the car detection dataset had 720x1280 images--this step rescales the boxes so that they can be plotted on top of the original 720x1280 image.*

The function scale_boxes itself is defined as:

from keras import backend as K  # K is the Keras backend

def scale_boxes(boxes, image_shape):
    """Scales the predicted boxes in order to be drawable on the image."""
    height = image_shape[0]
    width = image_shape[1]
    image_dims = K.stack([height, width, height, width])
    image_dims = K.reshape(image_dims, [1, 4])
    boxes = boxes * image_dims
    return boxes

It is used in the following function, 'yolo_eval':

def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
    """
    Converts the output of YOLO encoding (a lot of boxes) to your predicted boxes along with their scores, box coordinates and classes.
    
    Arguments:
    yolo_outputs -- output of the encoding model (for image_shape of (608, 608, 3)), contains 4 tensors:
                    box_confidence: tensor of shape (None, 19, 19, 5, 1)
                    box_xy: tensor of shape (None, 19, 19, 5, 2)
                    box_wh: tensor of shape (None, 19, 19, 5, 2)
                    box_class_probs: tensor of shape (None, 19, 19, 5, 80)
    image_shape -- tensor of shape (2,) containing the input shape, in this notebook we use (608., 608.) (has to be float32 dtype)
    max_boxes -- integer, maximum number of predicted boxes you'd like
    score_threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
    iou_threshold -- real value, intersection over union threshold used for NMS filtering
    
    Returns:
    scores -- tensor of shape (None, ), predicted score for each box
    boxes -- tensor of shape (None, 4), predicted box coordinates
    classes -- tensor of shape (None,), predicted class for each box
    """
    
    ### START CODE HERE ### 
    
    # Retrieve outputs of the YOLO model (≈1 line)
    box_confidence, box_xy, box_wh, box_class_probs = yolo_outputs

    # Convert boxes to be ready for filtering functions (convert boxes box_xy and box_wh to corner coordinates)
    boxes = yolo_boxes_to_corners(box_xy, box_wh)

    # Use one of the functions you've implemented to perform Score-filtering with a threshold of score_threshold (≈1 line)
    scores, boxes, classes = yolo_filter_boxes(box_confidence,boxes,box_class_probs,score_threshold)

    # Scale boxes back to original image shape.
    boxes = scale_boxes(boxes, image_shape)
   
    # Use one of the functions you've implemented to perform Non-max suppression with 
    # maximum number of boxes set to max_boxes and a threshold of iou_threshold (≈1 line)
    scores, boxes, classes = yolo_non_max_suppression(scores,boxes,classes,max_boxes,iou_threshold)
    
    ### END CODE HERE ###
    
    return scores, boxes, classes

I don't understand the need for the function 'scale_boxes'. There don't seem to be any answers on this in the discussion forums either, which is why I'm posting the question here.

Can someone please explain in detail what exactly this function does and why it is required?



YOLO's network was trained to run on 608x608 images. If you are testing this data on a different size image--for example, the car detection dataset had 720x1280 images--this step rescales the boxes so that they can be plotted on top of the original 720x1280 image.

Because you are using a pre-trained model, your image is resized to the size the model was trained on (608x608), whether you resize it yourself or the model does it in the background.

Bounding box values are simply coordinates on the image, so they change when the size of the image changes. Imagine the same face on a big image and on a tiny image.


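So YOLO returns box coordinates relative to the image the network saw, not in pixels of your original image. If you draw them directly on the original image, the box will not cover the full object, so you rescale it to the original image's size. In the code from the question, scale_boxes multiplies every box by [height, width, height, width], which suggests the incoming coordinates are fractions of the image size and that the multiplication turns them into pixel positions on the original 720x1280 image.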
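A minimal NumPy sketch of that arithmetic may make it concrete. It assumes the boxes arrive as fractions of the image size in (y1, x1, y2, x2) order, which is what the [height, width, height, width] multiplier in scale_boxes implies; the name scale_boxes_np and the sample box are made up for illustration and are not part of the course code.

```python
import numpy as np

def scale_boxes_np(boxes, image_shape):
    """NumPy stand-in for scale_boxes (hypothetical helper, not course code)."""
    height, width = image_shape
    # Shape (1, 4) so it broadcasts across every row of boxes, shape (N, 4).
    image_dims = np.array([[height, width, height, width]], dtype=np.float32)
    return boxes * image_dims

# One made-up box given as fractions of the image: (y1, x1, y2, x2).
boxes = np.array([[0.25, 0.5, 0.75, 1.0]], dtype=np.float32)

scaled = scale_boxes_np(boxes, (720.0, 1280.0))
print(scaled)  # values: 180, 640, 540, 1280 (pixels on the 720x1280 image)
```

The only work being done is an element-wise multiplication; the reshape to (1, 4) in the original function plays the same role as the extra brackets here, letting one row of dimensions apply to every box.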

You can achieve the same result by resizing your original image to the size YOLO was trained on. Then you do not need to scale the bounding box at all; you can simply draw the same box on the resized image.
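As a rough illustration of the two options, the sketch below converts one box for both canvases and checks that they differ only by the ratio of the two image sizes. It assumes the boxes are fractions of the image size in (y1, x1, y2, x2) order (which is what the multiplication by image_shape in scale_boxes suggests); the helper to_pixels and the sample values are hypothetical.

```python
import numpy as np

def to_pixels(boxes, image_shape):
    """Convert fractional (y1, x1, y2, x2) boxes to pixels (hypothetical helper)."""
    height, width = image_shape
    return boxes * np.array([height, width, height, width], dtype=np.float64)

boxes = np.array([[0.25, 0.5, 0.75, 1.0]])   # made-up fractional box

on_original = to_pixels(boxes, (720, 1280))  # draw on the original image
on_resized = to_pixels(boxes, (608, 608))    # draw on the 608x608 resized image

# Ratio of the two image sizes (height ratio, width ratio, repeated).
ratio = np.array([720 / 608, 1280 / 608, 720 / 608, 1280 / 608])
assert np.allclose(on_resized * ratio, on_original)
```

Either way the box covers the same part of the picture; the choice is only about which image you want to draw on.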
