Transfer learning on YOLOv5 for character and shape detection
The task is to detect rotated alphanumeric characters embedded on colored shapes. We will have an aerial view of the object (from a UAS: Unmanned Aerial System), something of this sort:
(One uppercase letter/digit per image.) We have to report 5 features: shape, shape color, alphanumeric, alphanumeric color, and alphanumeric orientation.
Right now, I am focusing on just detecting the alphanumeric and the shape.
Using OpenCV, I have created a sample image by embedding a (shape + alphanumeric) image onto an aerial-view image (the shapes and characters have been rotated by a random angle), something of this kind:
Now, I plan to use a pre-trained YOLOv5 model for detecting both the alphanumeric and the shape. Basically, I want to perform transfer learning, i.e. fine-tune it for detecting characters and shapes.
I have a script ready that creates the dataset for this purpose. Right now I have one image, but by running a few for loops, I can create many combinations of (shape + character + aerial view) images to build a dataset. However, I have a few questions to ask before I proceed:
1. What, roughly, is the ideal size of the dataset for transfer learning of this sort? A YOLOv5 tutorial here: https://docs.ultralytics.com/tutorials/train-custom-datasets/ uses just 128 images (coco128) to train the pre-trained model again. Is there an adverse effect of using a large number of images for fine-tuning? Right now, I plan to use about 1000 images, although the script is capable of creating many more. Also, we need to consider that the network needs to detect both shapes and characters.
2. If the answer to 1 is that we need small datasets, then to what extent should I rotate the texts and shapes? With fewer images, I fear that the network will not learn what a $9$ or an $F$ truly is. Right now, I sample the rotation angle from a Gaussian distribution with mean = 0 and standard deviation = 50.
3. How exactly should I approach creating the labels in the YOLO format? There will be exactly 2 labels per training image. There are tools such as Roboflow that can create labels in the YOLO format, but from what I've seen, we would need to annotate the images manually. If the dataset is ~1000 images, I can consider doing it manually, but not for sizes much larger than that. Isn't there a more efficient way of creating labels for large datasets?
4. Are there any pre-processing steps that should be kept in mind before starting the training, such as changing the input shape or resizing the images to a particular size? I do plan to blur the entire image (GaussianBlur) once it's created.