Get Label Statistics of Image Dataset

I have a labeled image dataset where the images are in subfolders and there is one Pascal VOC XML file per image containing the labels.

I would like to compute stats like: how many images have exactly two labels?

Or: what is the average size of the labeled rectangles?

Ideally I would also like statistics on image resolution, file size, etc., but mostly on the labels.

This is probably an easy question (many papers include that information), but I did not find such a function in labelImg or Sloth. How can I do that?




Adapted from this Stack Overflow answer:

https://stackoverflow.com/a/53832130/1727543

import xml.etree.ElementTree as ET
import os

def read_content(xml_file: str):
    """Parse one Pascal VOC XML file and collect its bounding boxes."""
    tree = ET.parse(xml_file)
    root = tree.getroot()

    # Read the filename once, before the loop, so it is defined even when
    # the annotation contains no <object> elements.
    filename = root.findtext('filename')

    list_with_all_boxes = []

    # Each <object> element is one label; its <bndbox> holds the rectangle.
    for boxes in root.iter('object'):

        ymin, xmin, ymax, xmax = None, None, None, None

        for box in boxes.findall("bndbox"):
            ymin = int(box.find("ymin").text)
            xmin = int(box.find("xmin").text)
            ymax = int(box.find("ymax").text)
            xmax = int(box.find("xmax").text)

        list_with_single_boxes = [xmin, ymin, xmax, ymax]
        list_with_all_boxes.append(list_with_single_boxes)

    return {'filename': filename, 'box_list': list_with_all_boxes, 'box_len': len(list_with_all_boxes)}

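result = []
# os.walk also descends into subfolders. Join each name with its folder,
# because a bare file name only resolves from the current working directory.
for folder, _, files in os.walk("path_to_Annotation_xml_folder"):
    for file in files:
        if file.endswith(".xml"):
            result.append(read_content(os.path.join(folder, file)))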
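
The `result` list built above can then be summarized to answer the two questions directly. The following is only a minimal sketch: the helper name `label_stats` and the `sample` data are made up for illustration, and it assumes every box holds four integer coordinates in the order `[xmin, ymin, xmax, ymax]`, as returned by `read_content`.

```python
from collections import Counter
from statistics import mean

def label_stats(result):
    """Summarize label counts and box sizes for a list of read_content() dicts."""
    # Histogram: number of labels -> number of images with that many labels
    labels_per_image = Counter(r['box_len'] for r in result)

    widths, heights, areas = [], [], []
    for r in result:
        for xmin, ymin, xmax, ymax in r['box_list']:
            w, h = xmax - xmin, ymax - ymin
            widths.append(w)
            heights.append(h)
            areas.append(w * h)

    return {
        'images': len(result),
        'labels_per_image': dict(labels_per_image),
        'images_with_two_labels': labels_per_image[2],
        # mean() raises on empty input, so guard against a dataset without boxes
        'mean_box_width': mean(widths) if widths else None,
        'mean_box_height': mean(heights) if heights else None,
        'mean_box_area': mean(areas) if areas else None,
    }

# Hypothetical sample in the same shape that read_content() returns
sample = [
    {'filename': 'a.jpg', 'box_list': [[10, 20, 110, 70], [0, 0, 50, 50]], 'box_len': 2},
    {'filename': 'b.jpg', 'box_list': [[5, 5, 25, 45]], 'box_len': 1},
    {'filename': 'c.jpg', 'box_list': [], 'box_len': 0},
]
stats = label_stats(sample)
print(stats)
```

On the real data, call `label_stats(result)` instead of passing `sample`.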

You can extract more features with 'ElementTree' to extend your analysis.
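
For instance, the image resolution and the class names can be read from the same XML. This is a sketch only: it assumes the usual Pascal VOC layout, with a `<size>` element containing `<width>` and `<height>`, and a `<name>` inside each `<object>`. The helper `read_extra` and the inline `SAMPLE_XML` are hypothetical and exist only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotation in the usual Pascal VOC layout
SAMPLE_XML = """<annotation>
  <filename>a.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>cat</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>70</ymax></bndbox>
  </object>
  <object>
    <name>dog</name>
    <bndbox><xmin>0</xmin><ymin>0</ymin><xmax>50</xmax><ymax>50</ymax></bndbox>
  </object>
</annotation>"""

def read_extra(root):
    """Return image resolution and per-class label counts for one annotation."""
    size = root.find('size')
    # Assumption: when <size> is present, it contains <width> and <height>
    width = int(size.findtext('width')) if size is not None else None
    height = int(size.findtext('height')) if size is not None else None
    class_names = [obj.findtext('name') for obj in root.iter('object')]
    return {
        'filename': root.findtext('filename'),
        'width': width,
        'height': height,
        'class_counts': Counter(class_names),
    }

extra = read_extra(ET.fromstring(SAMPLE_XML))
print(extra['width'], extra['height'], dict(extra['class_counts']))
```

For files on disk, pass `ET.parse(path).getroot()` instead of `ET.fromstring(SAMPLE_XML)`. The file size of an image is not stored in the XML, but `os.path.getsize(image_path)` returns it in bytes.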
