Dataset for OCR from aerial view using YOLOv5s

The task is to detect rotated alphanumeric characters embedded on colored shapes. We will have an aerial view of the object (from a UAS: Unmanned Aerial System). Something of this sort:

(One uppercase letter or digit per image.) We have to report 5 features: shape, shape colour, alphanumeric character, alphanumeric colour, and alphanumeric orientation.

Right now, I am focusing on just detecting the alphanumeric and the shape.

Approach 1: I used a pre-trained EAST model for text detection, along with Tesseract for text recognition. The script I wrote works fairly well for words, but it doesn't perform well at all for single characters:

# import the necessary packages
from imutils.object_detection import non_max_suppression
from matplotlib import pyplot as plt
import pytesseract
import numpy as np
import argparse
import time
import cv2

# Set the path to the Tesseract binary
pytesseract.pytesseract.tesseract_cmd = "C:\\Program Files\\Tesseract-OCR\\tesseract.exe"

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", type=str,
    help="path to input image")
ap.add_argument("-east", "--east", type=str,
    help="path to input EAST text detector")
ap.add_argument("-c", "--min-confidence", type=float, default=0.5,
    help="minimum probability required to inspect a region")
ap.add_argument("-w", "--width", type=int, default=320,
    help="resized image width (should be multiple of 32)")
ap.add_argument("-e", "--height", type=int, default=320,
    help="resized image height (should be multiple of 32)")
args = vars(ap.parse_args())

# load the input image and grab the image dimensions
image = cv2.imread(args["image"])
orig = image.copy()
(H, W) = image.shape[:2]
# set the new width and height and then determine the ratio in change
# for both the width and height
(newW, newH) = (args["width"], args["height"])
rW = W / float(newW)
rH = H / float(newH)
# resize the image and grab the new image dimensions
image = cv2.resize(image, (newW, newH))
(H, W) = image.shape[:2]

# define the two output layer names for the EAST detector model that
# we are interested -- the first is the output probabilities and the
# second can be used to derive the bounding box coordinates of text
layerNames = [
    "feature_fusion/Conv_7/Sigmoid",
    "feature_fusion/concat_3"]

# load the pre-trained EAST text detector
print("[INFO] loading EAST text detector...")
net = cv2.dnn.readNet(args["east"])
# construct a blob from the image and then perform a forward pass of
# the model to obtain the two output layer sets
blob = cv2.dnn.blobFromImage(image, 1.0, (W, H),
    (123.68, 116.78, 103.94), swapRB=True, crop=False)
start = time.time()
net.setInput(blob)
(scores, geometry) = net.forward(layerNames)
end = time.time()
# show timing information on text prediction
print("[INFO] text detection took {:.6f} seconds".format(end - start))

# grab the number of rows and columns from the scores volume, then
# initialize our set of bounding box rectangles and corresponding
# confidence scores
(numRows, numCols) = scores.shape[2:4]
rects = []
confidences = []
# loop over the number of rows
for y in range(0, numRows):
    # extract the scores (probabilities), followed by the geometrical
    # data used to derive potential bounding box coordinates that
    # surround text
    scoresData = scores[0, 0, y]
    xData0 = geometry[0, 0, y]
    xData1 = geometry[0, 1, y]
    xData2 = geometry[0, 2, y]
    xData3 = geometry[0, 3, y]
    anglesData = geometry[0, 4, y]

# loop over the number of columns
    for x in range(0, numCols):
        # if our score does not have sufficient probability, ignore it
        if scoresData[x] < args["min_confidence"]:
            continue
        # compute the offset factor as our resulting feature maps will
        # be 4x smaller than the input image
        (offsetX, offsetY) = (x * 4.0, y * 4.0)
        # extract the rotation angle for the prediction and then
        # compute the sin and cosine
        angle = anglesData[x]
        cos = np.cos(angle)
        sin = np.sin(angle)
        # use the geometry volume to derive the width and height of
        # the bounding box
        h = xData0[x] + xData2[x]
        w = xData1[x] + xData3[x]
        # compute both the starting and ending (x, y)-coordinates for
        # the text prediction bounding box
        endX = int(offsetX + (cos * xData1[x]) + (sin * xData2[x]))
        endY = int(offsetY - (sin * xData1[x]) + (cos * xData2[x]))
        startX = int(endX - w)
        startY = int(endY - h)
        # add the bounding box coordinates and probability score to
        # our respective lists
        rects.append((startX, startY, endX, endY))
        confidences.append(scoresData[x])
# apply non-maxima suppression to suppress weak, overlapping bounding
# boxes
boxes = non_max_suppression(np.array(rects), probs=confidences)
# loop over the bounding boxes
results=[]

for (startX, startY, endX, endY) in boxes:
    # scale the bounding box coordinates based on the respective
    # ratios
    startX = max(0, int(startX * rW))
    startY = max(0, int(startY * rH))
    endX = int(endX * rW)
    endY = int(endY * rH)
    #extract the region of interest
    print("DEBUG:", startY, endY, startX, endX)
    r = orig[startY:endY, startX:endX]

    # configuration setting to convert image to string
    # configuration = ("-l eng --oem 1 --psm 8")   # psm 8: treat the ROI as a single word
    configuration = ("-l eng --oem 1 --psm 10")    # psm 10: treat the ROI as a single character
    # recognize the text from the image of the bounding box
    text = pytesseract.image_to_string(r, config=configuration)
    print(text)

    # append bbox coordinate and associated text to the list of results 
    results.append(((startX, startY, endX, endY), text))


#Display the image with bounding box and recognized text
orig_image = orig.copy()

# Moving over the results and display on the image
for ((start_X, start_Y, end_X, end_Y), text) in results:
    # display the text detected by Tesseract
    print("{}\n".format(text))

    # strip out non-ASCII characters so the text can be drawn on the image
    text = "".join([x if ord(x) < 128 else "" for x in text]).strip()
    print(text)

    
    cv2.rectangle(orig_image, (start_X, start_Y), (end_X, end_Y),
        (0, 0, 255), 2)
    cv2.putText(orig_image, text, (start_X, start_Y - 30),
        cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)

plt.imshow(orig_image)
plt.title('Output')
plt.show()

Results:

However, on images containing single characters, the script doesn't work well. The EAST model fails to draw a bounding box around the character. I also don't think this will work with characters (or even text) oriented at some random angle. I guess a better fix is to use a better-trained model, or to overfit the EAST model on a character-specific dataset... I got the pretrained model (frozen_east_text_detection.pb) from here: https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz?dl=1
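One idea I have not tried yet: EAST already predicts a rotation angle per box (the anglesData the script extracts is only used for the box corners), so in principle each crop could be deskewed with that angle before it is passed to Tesseract. A minimal sketch of that idea, assuming r is a cropped ROI and angle is the radian value taken from the geometry map for that box (both names are placeholders here):

import numpy as np
import cv2

def deskew_crop(r, angle):
    # rotate the ROI by the EAST-predicted angle (radians) so the character is upright;
    # OpenCV's rotation matrix expects degrees
    (h, w) = r.shape[:2]
    M = cv2.getRotationMatrix2D((w // 2, h // 2), -np.degrees(angle), 1.0)
    return cv2.warpAffine(r, M, (w, h),
        flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)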

Approach 2: Use a pre-trained YOLOv5s model and fine-tune it on a dataset containing images of the kind I mentioned at the start: https://github.com/ultralytics/yolov5/wiki/Train-Custom-Data

The first step is to create a dataset containing the kind of images I mentioned in the beginning. I already have a set of images providing just an aerial view, something like this:

Similarly, say I have other sets of images: one set containing colored geometric shapes, and one set containing letters and digits.

How can I use OpenCV to embed the characters on the shapes, and then embed the whole thing onto the aerial view images (creating random combinations)?
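Something along these lines is what I am imagining for the embedding step. It is only a rough sketch: it assumes the shape crops sit on a plain white background, the character crops are white glyphs on a plain black background, and all file paths and the paste helper are placeholders I made up. Rotation, scaling and colour augmentation would still need to be added on top.

import random
import cv2

def paste(dst, src, x, y, mask):
    """Copy the masked pixels of src onto dst with src's top-left corner at (x, y)."""
    h, w = src.shape[:2]
    roi = dst[y:y + h, x:x + w]          # view into dst, so writes modify dst in place
    roi[mask > 0] = src[mask > 0]
    return dst

# load one shape crop, one character crop and one aerial background
# (all paths are placeholders; adjust to your folder layout)
shape = cv2.imread("shapes/red_circle.png")    # shape on a plain white background
char = cv2.imread("chars/A_white.png")         # white glyph on a plain black background
aerial = cv2.imread("aerial/field_001.jpg")

# mask of the shape pixels (non-white), computed before the character is added
shape_mask = 255 - cv2.inRange(shape, (250, 250, 250), (255, 255, 255))

# put the character roughly in the middle of the shape
char_mask = cv2.inRange(char, (1, 1, 1), (255, 255, 255))   # non-black pixels
sx = (shape.shape[1] - char.shape[1]) // 2
sy = (shape.shape[0] - char.shape[0]) // 2
paste(shape, char, sx, sy, char_mask)

# drop the composited shape at a random position on the aerial image
H, W = aerial.shape[:2]
h, w = shape.shape[:2]
x = random.randint(0, W - w)
y = random.randint(0, H - h)
paste(aerial, shape, x, y, shape_mask)

# pixel bounding boxes (x1, y1, x2, y2) to turn into YOLO labels later
shape_box = (x, y, x + w, y + h)
char_box = (x + sx, y + sy, x + sx + char.shape[1], y + sy + char.shape[0])

cv2.imwrite("dataset/images/field_001.jpg", aerial)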

Another thing required for training the YOLO model is a label text file for each image that contains the following 5 parameters: (c, x, y, w, h), where:

  • c is the class index (say 0 for A, 1 for B....)
  • x,y are the coordinates of the center of the bounding box (normalized)
  • w,h are the width and height of the bounding box (normalized).

So while embedding, we need to write this data to a text file (2 lines for each image: one for the alphanumeric character, one for the shape).
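The label half could then look something like this (a minimal sketch: shape_box, char_box and the class indices are the placeholders from the compositing sketch above):

def to_yolo_line(cls, box, img_w, img_h):
    """Convert a pixel (x1, y1, x2, y2) box into a 'c x y w h' YOLO label line."""
    x1, y1, x2, y2 = box
    xc = ((x1 + x2) / 2) / img_w   # normalized centre x
    yc = ((y1 + y2) / 2) / img_h   # normalized centre y
    w = (x2 - x1) / img_w          # normalized width
    h = (y2 - y1) / img_h          # normalized height
    return f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# e.g. class 0 for the letter 'A' and class 36 for 'circle' (indices are placeholders)
H, W = aerial.shape[:2]
lines = [to_yolo_line(0, char_box, W, H),
         to_yolo_line(36, shape_box, W, H)]

# one .txt per image, with the same base name as the image it describes
with open("dataset/labels/field_001.txt", "w") as f:
    f.write("\n".join(lines) + "\n")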

There are some websites like Roboflow that can create these labels for you, but as far as I know the bounding boxes need to be drawn manually there... so I don't think that's going to be a viable approach.

How can we approach writing this Python script that embeds the images together and creates the appropriate labels?

Topic opencv yolo ocr dataset

Category Data Science
