Transfer learning on YOLOv5 for character and shape detection

The task is to detect rotated alphanumeric characters embedded on colored shapes. We will have an aerial view of the object (from a UAS: Unmanned Aerial System), something of this sort:

(One uppercase letter or digit per image.) We have to report 5 features: shape, shape color, alphanumeric, alphanumeric color, and alphanumeric orientation.

Right now, I am focusing on just detecting the alphanumeric and the shape.

Using OpenCV, I have created a sample image by embedding a (shape + alphanumeric) image on an aerial-view image (the shapes and characters have been rotated by a random angle), something of this kind:

Now, I plan to use a pre-trained YOLOv5 model for detecting the alphanumeric and the shape. Basically, I want to perform transfer learning, i.e. fine-tune it for detecting characters and shapes.

I have a script ready that creates the dataset for this purpose. Right now I have one image, but by running a few for loops I can create many combinations of the (shape + character + aerial view) image to build a dataset. However, I have a few questions before I proceed:

1. What, roughly, should be the ideal size of the dataset for performing transfer learning of this sort? A YOLOv5 tutorial here: https://docs.ultralytics.com/tutorials/train-custom-datasets/ uses just 128 images (coco128) for re-training the pre-trained model. Is there an adverse effect of using a large number of images for fine-tuning? Right now, I plan to use about 1000 images, although the script is capable of creating many more. Also, we need to consider the fact that the network needs to detect both shapes and characters.

2. If the answer to 1 is that we need small datasets, then to what extent should I consider rotating the texts and shapes? With fewer images, I fear that the network will not gain the ability to learn what truly is a $9$ or an $F$. Right now, I have sampled rotation angles from a Gaussian distribution with mean $= 0$ and standard deviation $= 50$ degrees.

3. How exactly should I approach creating the labels in the YOLO format? There will be exactly 2 labels per training image. There are some tools, such as Roboflow, that can create labels in the YOLO format, but from what I've seen, we will need to annotate the images manually. If the dataset is ~1000 images, then I can consider doing it manually, but not for sizes much larger than that. Isn't there a more efficient way of creating labels for large datasets?

4. Are there any pre-processing steps that should be kept in mind before starting the training, such as changing the input shape or resizing the images to a particular size? I do plan to blur the entire thing (GaussianBlur) once it is created.
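For reference, the angle sampling described in question 2 can be sketched as follows (numpy only; the wrapping into $[-180, 180)$ is my addition, not part of the original script):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_rotation_angles(n, mean=0.0, std=50.0):
    """Sample n rotation angles (degrees) from N(mean, std)
    and wrap them into the interval [-180, 180)."""
    angles = rng.normal(mean, std, size=n)
    return ((angles + 180.0) % 360.0) - 180.0

angles = sample_rotation_angles(1000)
# With std = 50, roughly 95% of angles fall within +/- 100 degrees,
# so heavily rotated (near-upside-down) characters stay rare.
```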

Tags: transfer-learning, yolo, training, convolutional-neural-network, deep-learning

Category: Data Science


First of all, YOLOv5 is not the best alternative for character and shape detection. The reason is that your samples are well defined, and you can achieve the same goal using basic classifiers.

  1. Dataset size: the official YOLOv5 guidance recommends ≥ 1500 images per class and ≥ 10000 instances (labeled objects) per class. I agree with the first answer on this point.
  2. You can use basic data augmentation techniques such as geometric transformations, photometric transformations, random occlusions, etc. Check this: https://github.com/ErikValle/Data-Augmentation-for-YOLOv5
  3. Check this: https://github.com/tzutalin/labelImg, then follow its annotation tutorial.
  4. It's up to you. Pre-processing is recommended, but not required.
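As a minimal sketch of one photometric transformation from point 2 (my own illustration, not taken from the linked repository), brightness/contrast jitter on a uint8 image could look like this:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def jitter_brightness_contrast(image, max_brightness=30, contrast_range=(0.8, 1.2)):
    """Randomly scale contrast and shift brightness of a uint8 image,
    clipping the result back into the valid [0, 255] range."""
    alpha = rng.uniform(*contrast_range)                  # contrast factor
    beta = rng.uniform(-max_brightness, max_brightness)   # brightness shift
    out = image.astype(np.float32) * alpha + beta
    return np.clip(out, 0, 255).astype(np.uint8)

img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
aug = jitter_brightness_contrast(img)
```

Note that geometric augmentations (flips in particular) must be used with care here, since mirroring changes the identity of many characters.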

YOLO is an object detection algorithm; considering your use case of recognising alphanumeric characters, it would be ideal to go for OCR (optical character recognition), which works great for printed and handwritten characters. It is also worth considering text detectors like EAST or CRAFT.

I would suggest going for keras-ocr, which is a packaged version of the CRAFT text detector and a Keras CRNN recognition model. Another option is Tesseract, an optical character recognition engine that supports various operating systems.

Now coming to your questions at hand.

  1. Size of dataset for training: a minimum of 1500 images per class is recommended, as per the official YOLOv5 documentation (tips for best training results). However, this will vary based on your chosen model.

    They have mentioned YOLOv5 provides best results for datasets with:

    • Images per class: ≥ 1500 images per class
    • Instances per class: ≥ 10000 instances (labeled objects) per class total
    • Image variety: images from different times of day, seasons, weather, lighting conditions, angles, and sources (scraped online, collected locally, different cameras), etc.
    • Label consistency: all instances of all classes in all images must be labelled; partial labelling will not work.
    • Label accuracy: labels must closely enclose each object. No space should exist between an object and its bounding box, and no object should be missing a label.
    • Background images: images with no objects, added to a dataset to reduce false positives (FP). About 0-10% background images are recommended to help reduce FPs (COCO has 1000 background images for reference, 1% of the total).
  2. Rotation of Text and shapes

    A ratio of ≥ 40% rotated alphanumeric images is reasonable to consider. There are a couple of things to keep in mind: fonts must remain distinguishable under rotation, and the following pairs must still be correctly identified. Also keep in mind how various fonts affect the recognition of these combinations of letters and numbers when rotated:

    • (W, M)
    • (6, 9)
    • (P, d)
    • (L, 7)
    • (I, 1)
    • (Z, N)
    • (0, O)
    • (V, ^, <, >) if you are also considering symbols
  3. Label Creation

    There are many open-source labeling tools: LabelImg, Labelbox, ImageTagger, LabelMe.

    These are some of the top commercial labelling platforms: SuperAnnotate, Appen, Amazon SageMaker Ground Truth, V7, Dataloop, Hive Data, and Innotescus Video and Image Annotation Platform.

  4. PreProcessing steps:

    Image preprocessing for OCR of handwritten characters typically follows the steps below:

    • image binarization
    • noise clearing or filtering algorithms
    • text line detection
    • character detection

    This research paper explains various transformation processes:

    • image enhancement (reducing the noise and detecting the useful objects),
    • binarization (excluding information redundancy),
    • allocation of dot-matrix fields, and
    • discrete smoothing of the binary image, which helps to eliminate some noise (blurred boundaries, obliterated corners, separate points).
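A dependency-free sketch of the binarization step above (plain numpy with a fixed global threshold; real pipelines often use Otsu's method instead, e.g. via OpenCV, which I omit here):

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an HxWx3 RGB image to grayscale using standard luma weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def binarize(gray, threshold=128):
    """Global-threshold binarization: pixels above threshold become 255."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

# toy image: top half white, bottom half black
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb[:2] = 255
binary = binarize(to_grayscale(rgb))
```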


1. What roughly, should be the ideal size of that dataset for performing transfer learning of this sort?

It depends on your live scenario. If your live examples are expected to vary a lot, it is recommended to train on a dataset with large diversity, not just a large dataset of similar images.

Identify a few pointers to help estimate the dataset size:

  1. Can the shape and character colors be the same?
  2. Can the character sometimes be inside the shape and sometimes outside it?
  3. Can characters be laterally inverted (mirror images)?
  4. Can the size of characters or shapes differ across images?
  5. Can some images have no characters or no shapes?

For each of the above questions, answering YES increases the problem complexity and will require a more diverse collection of input images.

1000 images covering multiple different expected live-case scenarios should work fine since you are using transfer learning. Just ensure you have a diverse set of images.

2. To what extent should I consider rotating the texts and shapes?

Rotating images is a good idea and should be implemented. Again, base the rotation range on your live scenario.

3. How exactly should I approach creating the labels in the YOLO format?

You are creating the images yourself, right? Your code knows the position and size of the shape and character before superimposing them; use those to generate the labels automatically.
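A minimal sketch of that idea (the image size, paste position, and class id below are made-up example values): since your script knows where each object is pasted, the YOLO label line (`class x_center y_center width height`, all normalized to [0, 1]) can be computed directly:

```python
def yolo_label(class_id, x, y, w, h, img_w, img_h):
    """Build a YOLO-format label line from a pasted object's
    top-left corner (x, y) and size (w, h) in pixels."""
    xc = (x + w / 2) / img_w
    yc = (y + h / 2) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w / img_w:.6f} {h / img_h:.6f}"

# e.g. a shape pasted at (100, 150) with size 200x100 on a 640x640 image
line = yolo_label(0, 100, 150, 200, 100, 640, 640)
```

Writing one such line per pasted object (one for the shape, one for the character) into a `.txt` file alongside each image gives you the labels with no manual annotation at all.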

4. Are there any pre-processing steps that should be kept in mind before starting the training?

Yes, and it's very important.

A few steps I can think of, given the context provided in the question:

  1. Greyscaling
  2. Normalization
  3. Standard sizing
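The three steps above can be sketched as follows (plain numpy, with nearest-neighbour resizing for illustration; in practice you would resize with OpenCV or let the YOLOv5 dataloader handle it):

```python
import numpy as np

def preprocess(rgb, size=64):
    """Greyscale, resize to (size, size) via nearest neighbour,
    and normalize pixel values into [0, 1]."""
    gray = rgb @ np.array([0.299, 0.587, 0.114])   # 1. greyscaling
    h, w = gray.shape
    rows = np.arange(size) * h // size             # nearest-neighbour row picks
    cols = np.arange(size) * w // size             # nearest-neighbour col picks
    resized = gray[rows][:, cols]                  # 3. standard sizing
    return resized / 255.0                         # 2. normalization

img = np.full((128, 96, 3), 255, dtype=np.uint8)
out = preprocess(img)
```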

Just an additional suggestion:

  • Single-model solutions rarely achieve great results; create a pipeline of models.
  • Use Canny edge detection and feed the result into a simple model to identify the character.
  • Then you can use your model to identify the shape only.
  • Also, have a look at the MNIST dataset; I think rotating its digits and superimposing them on shapes will help with your dataset worries.
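To illustrate the edge-detection step, here is a dependency-free sketch using Sobel gradients (a simplified stand-in for full Canny, which additionally performs non-maximum suppression and hysteresis thresholding):

```python
import numpy as np

def sobel_edges(gray):
    """Approximate edge magnitude with 3x3 Sobel kernels."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = gray.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    # accumulate the valid-mode convolution one kernel tap at a time
    for i in range(3):
        for j in range(3):
            patch = gray[i:i + h - 2, j:j + w - 2]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return np.hypot(gx, gy)

# toy image with a vertical black/white boundary
img = np.zeros((10, 10))
img[:, 5:] = 255.0
edges = sobel_edges(img)
```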
