I'm trying to understand the architecture of the ViT paper, and noticed they use a CLASS token like in BERT. To the best of my understanding, this token is used to gather information about the entire image and is then solely used to predict the class of the image. My question is: why does this token exist as an input to all the transformer blocks and get treated the same as the word/patch tokens? Treating the class token …
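For concreteness, here is a minimal PyTorch sketch of the mechanism in question (not the actual ViT code; sizes and layer choices are only illustrative): the class token is prepended to the patch embeddings, flows through every block like any other token, and only its final state feeds the classifier.

    import torch
    import torch.nn as nn

    batch, num_patches, embed_dim = 8, 196, 768                 # ViT-Base-like sizes, for illustration
    patch_tokens = torch.randn(batch, num_patches, embed_dim)   # output of the patch-embedding layer
    cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))      # one learnable token, shared across images

    # Prepend the class token; from here on it is treated like any other token.
    tokens = torch.cat([cls_token.expand(batch, -1, -1), patch_tokens], dim=1)   # (8, 197, 768)

    layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=12)       # stand-in for the ViT blocks
    out = encoder(tokens)

    # Only the class token's final state is used for classification.
    logits = nn.Linear(embed_dim, 1000)(out[:, 0])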
I am training a neural network with some convolution layers for multi-class image classification. I am using Keras to build and train the model. I am using 1600 images for all categories for training. I have used softmax as the final-layer activation function. The model predicts well on all true categories, with high softmax probability. But when I test the model on new or unknown data, it still predicts with high softmax probability. How can I reduce that? Should I make …
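A minimal sketch of one common mitigation, with hypothetical names `model` (the trained Keras model) and `x` (a preprocessed batch): instead of always taking the argmax, reject predictions whose top softmax probability falls below a threshold tuned on held-out data that includes unknown images.

    import numpy as np

    probs = model.predict(x)                   # softmax outputs, shape (n_samples, n_classes)
    confidence = probs.max(axis=1)             # top class probability per sample
    predicted = probs.argmax(axis=1)

    threshold = 0.9                            # tune on a held-out set that includes unknown images
    predicted[confidence < threshold] = -1     # -1 = "unknown / rejected"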
I am building a content-based image retrieval system. I basically extract feature maps of size 1024x1x1 using any backbone. I then apply PCA on the extracted features in order to reduce dimensions, using either nb_components=300 or nb_components=400. I achieved these performances (dim_pca means no PCA applied). Is there any explanation of why k=300 works better than k=400? If I understand correctly, k=400 is supposed to explain more variance than k=300? Is it my mistake or …
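For reference, a minimal scikit-learn sketch of the PCA step (the `features` array, one 1024-d descriptor per database image, is assumed to exist); printing explained_variance_ratio_.sum() for k=300 vs k=400 at least shows how much variance each choice keeps, which is a separate question from retrieval accuracy.

    from sklearn.decomposition import PCA
    from sklearn.preprocessing import normalize

    features = normalize(features)                      # L2-normalise descriptors before PCA
    pca = PCA(n_components=300, whiten=True)            # compare n_components=300 vs 400 here
    reduced = normalize(pca.fit_transform(features))    # re-normalise for cosine/dot-product search

    print(pca.explained_variance_ratio_.sum())          # fraction of variance retained by this k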
I have been scratching my head for a while. What I have is a scanned PDF document with text and a watermarked logo in the background, as in the image below. I want to run OCR over this, which becomes very difficult because of the logo. Everything I have found so far is for coloured images, where a contrast difference can be exploited. I've hit a wall solving the same for a B&W image as shown. Would love any …
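A minimal OpenCV sketch of one thing worth trying, assuming the watermark is noticeably lighter than the printed text ('scan.png' and 'scan_clean.png' are placeholder paths): threshold the page so the light logo is pushed to white before OCR.

    import cv2

    img = cv2.imread('scan.png', cv2.IMREAD_GRAYSCALE)

    # Otsu picks a global threshold; if the logo is lighter than the text it is pushed to white.
    _, clean = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Alternative for unevenly lit scans:
    # clean = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    #                               cv2.THRESH_BINARY, 31, 15)

    cv2.imwrite('scan_clean.png', clean)       # feed this image to the OCR engine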
I am currently training a few custom models that require about 12 GB of GPU memory at most. My setup has about 96 GB of GPU memory, and Python/Jupyter still manages to hog all of it, to the point that I get the "Resource exhausted" error thrown at me. I have been stuck on this peculiar issue for a while, so any help will be appreciated. Now, when loading a VGG-based model similar to this:

    from keras.applications.vgg16 import VGG16 …
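By default TensorFlow/Keras reserves essentially all visible GPU memory at start-up, regardless of what the model needs. A minimal sketch of turning that off, assuming TF 2.x (in TF 1.x / standalone Keras the equivalent is a session ConfigProto with gpu_options.allow_growth=True):

    import tensorflow as tf

    # Must run before any model or tensor touches the GPU.
    for gpu in tf.config.list_physical_devices('GPU'):
        # Allocate GPU memory on demand instead of grabbing the whole card up front.
        tf.config.experimental.set_memory_growth(gpu, True)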
I know what content-based image retrieval is. I have read this and this, as one of them says: "given a query images, get a rank list that are most similar to the query image, based on the content of the query image." But my question is how the "similar" images are determined. Assume we are working on the Oxford5k dataset. The dataset contains 5k images in 17 classes. So, when I feed one of the images as a query, …
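The usual operational definition: every image is mapped to a feature vector, the ranking is by distance (often cosine) in that space, and the dataset's relevance labels are only used afterwards to score the ranking (e.g. with mAP). A minimal sketch, assuming `db_features` (N x d) and `query_feature` (d,) already exist:

    import numpy as np

    def rank_by_cosine(query_feature, db_features):
        q = query_feature / np.linalg.norm(query_feature)
        db = db_features / np.linalg.norm(db_features, axis=1, keepdims=True)
        scores = db @ q                    # cosine similarity to every database image
        return np.argsort(-scores)         # database indices, most similar first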
To explain my question further: I am implementing 2 models, one for action recognition and the second for weapon recognition. If there is a situation where a person is punching or kicking someone while carrying a weapon, my system should be able to detect the action and the weapon simultaneously, if that person is carrying any weapon in hand. This can be useful for security purposes. So I want to combine these 2 models so that it …
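A minimal sketch of the simplest combination, with hypothetical `action_model` and `weapon_model` objects assumed: run both models on the same frame (or clip) and merge their outputs, rather than merging the networks themselves.

    def analyse_frame(frame, action_model, weapon_model):
        # Each model is assumed to expose its own predict() for a single frame/clip.
        action = action_model.predict(frame)       # e.g. "punching", "kicking", "normal"
        weapons = weapon_model.predict(frame)      # e.g. list of weapon boxes + labels
        return {'action': action, 'weapons': weapons}

Whether the two can share a backbone is a separate (multi-task) design question.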
I want to train a model for object detection. How do I have to label the training data? Is it enough to label the class/content of each box in the image, or do I have to add the box position as well? Thank you
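For context on what detection annotations contain: both the class and the box position are needed for every object. As an illustration (made-up numbers), the YOLO-style format is one .txt file per image with one line per object, "class_id x_center y_center width height", all normalised to the image size; for cat.jpg the file cat.txt might look like:

    0 0.512 0.430 0.356 0.620
    2 0.120 0.775 0.180 0.240

Pascal VOC and COCO store the same information as corner coordinates or pixel x/y/width/height instead.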
I am using the CelebA dataset to train my CNN face landmark detection model. Here is my model:

    from keras import layers, models

    class LandmarkModel:
        def __init__(self, inp_shape):
            self.model = models.Sequential()
            self.model.add(layers.Conv2D(16, (3, 3), activation='relu', input_shape=inp_shape))  # l1
            self.model.add(layers.Conv2D(32, (3, 3), activation='relu'))
            self.model.add(layers.MaxPooling2D((2, 2)))
            self.model.add(layers.Conv2D(64, (3, 3), activation='relu'))
            self.model.add(layers.Flatten())
            self.model.add(layers.Dense(512))
            self.model.add(layers.Dense(10))  # 10 outputs = 5 (x, y) landmark coordinates

        def getModel(self):
            return self.model

I have trained my model on around 5k-6k images with a loss of 0.1. When I use an image from the dataset that is outside of the training sample, I get a correct prediction. But when I use my own clicked …
I have a yolov3 model for object detection on 9 classes. What is the difference between computing metrics (such as mAP) on a validation set and on a test set (unseen data)? What is usually done in the literature, and why?
I would like to create an application that adds image filters (Snapchat-style) to photos of cats or chairs (just for the sake of this question). In order to do that properly, I thought of using Active Shape Modelling algorithms to have a model to apply the filters to. I trained an object detection model to identify those items in an image (yolov5), so I now have a bounding box around each item, but I still don't know its exact shape …
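One option for getting an approximate shape out of each box (a sketch, not necessarily the Active Shape Model route): run a class-agnostic segmentation step inside the detected rectangle, e.g. OpenCV's GrabCut. Here `img` is the BGR image and `box` is one yolov5 detection as (x, y, w, h), both assumed to exist:

    import cv2
    import numpy as np

    def box_to_mask(img, box, iters=5):
        mask = np.zeros(img.shape[:2], np.uint8)
        bgd_model = np.zeros((1, 65), np.float64)
        fgd_model = np.zeros((1, 65), np.float64)
        # Everything outside the rectangle is treated as certain background.
        cv2.grabCut(img, mask, box, bgd_model, fgd_model, iters, cv2.GC_INIT_WITH_RECT)
        # Keep pixels labelled (probable) foreground: this is the object's rough shape.
        return np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)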
I need to detect the rotation of a cable (in degrees) about the x-axis, relative to its original state, with high precision [detection of 0.2 degrees of rotation or more]. Detailed description: I have a cable that is set in its original state. The system has rotated the cable about the x-axis, and I want to know by how many degrees it has been rotated from its original state. Example: the following images show a specific cable at different rotation angles [0, 0.4, 0.6, 0.8]: …
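Assuming the rotation shows up as an in-plane rotation between a reference photo and the current photo, here is a minimal OpenCV registration sketch (grayscale images `ref` and `cur` are assumed to be loaded); whether it actually reaches 0.2-degree precision depends heavily on texture, resolution and lighting:

    import cv2
    import numpy as np

    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(ref, None)
    k2, d2 = orb.detectAndCompute(cur, None)

    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches])
    dst = np.float32([k2[m.trainIdx].pt for m in matches])

    # Robustly fit rotation + scale + translation between the two views.
    M, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    angle_deg = np.degrees(np.arctan2(M[1, 0], M[0, 0]))
    print(angle_deg)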
To illustrate the above title: suppose you have a PDF document, basically scanned from a hardcopy, and there is a set of fixed questions to answer from the document itself. For example, the document contains a land contract, and the fixed questions are "Who is the seller?" and "What is the price of the asset?"; the document mentions these answers maybe 2-3 times, and as a human it's a simple task. How do I automate this?
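A common baseline to sketch (pytesseract for OCR plus an extractive question-answering model from Hugging Face; the page path is a placeholder and deepset/roberta-base-squad2 is just one possible checkpoint): OCR the scanned pages into text, then run each fixed question against that text.

    import pytesseract
    from PIL import Image
    from transformers import pipeline

    # 1) OCR one scanned page into plain text.
    text = pytesseract.image_to_string(Image.open('contract_page1.png'))

    # 2) Ask each fixed question against the OCR'd text.
    qa = pipeline('question-answering', model='deepset/roberta-base-squad2')
    for question in ['Who is the seller?', 'What is the price of the asset?']:
        result = qa(question=question, context=text)
        print(question, '->', result['answer'], f"(score {result['score']:.2f})")

OCR noise is usually the weak link, so cleaning the scan first matters as much as the QA model.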
I am confused about which CNNs are generally used inside autoencoder architectures for learning image representations. Is it more common to use a large existing network like ResNet or VGG, or do most people write their own smaller networks? What are the pros and cons of each? If people are using a large network like ResNet or VGG, does the decoder mirror the same steps taken by the encoder, or can a more simple decoding network be used? I am …
I have a task where I need to plot only the training loss, and not the validation loss, from the plot_losses function in the fastai library (the learner object has the Recorder class), but I am not able to implement this properly. I am using fastai v1 for this purpose due to project restrictions. Here is the GitHub code for it:

    class Recorder(LearnerCallback):
        "A `LearnerCallback` that records epoch, loss, opt and metric data during training."
        def plot_losses(self, skip_start:int=0, …
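In fastai v1 the per-batch training losses are kept on the recorder itself, so one workaround (assuming a fitted `learn` object) is to skip plot_losses entirely and plot them directly with matplotlib:

    import matplotlib.pyplot as plt

    # learn.recorder.losses holds one training-loss tensor per batch in fastai v1.
    train_losses = [float(l) for l in learn.recorder.losses]
    plt.plot(train_losses)
    plt.xlabel('batch')
    plt.ylabel('training loss')
    plt.show()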
I read that a confusion matrix is used with image classification, but if I need to draw one for image captioning, how would I use it? Can I draw it in the model evaluation phase, for example? If yes, how can I start?
If you train your YOLO model only on grayscale images to detect cars, would it also be able to recognise a car in a colour image? If so, can I assume that YOLO considers only object shape, not colour? Kindly clarify.
Generally speaking, for training a machine learning model, the size of the training data set should be bigger than the number of predictors. For a neural network, or even a deep learning model, the number of parameters is usually in the tens of thousands or even millions. It seems that in practice, the size of the training data set, i.e., the number of images, is usually less than the number of parameters. How can this be explained? I know we can claim that the pre-trained …
I'm facing an interesting problem involving medical images. We set out to test the hypothesis that certain objects in an image affect the diagnosis of a patient. I would love to hear any comments regarding my pipeline, but this is my current approach: segment the image in order to obtain the objects' regions, using an off-the-shelf ResNet and labelled data obtained from manual annotation of the images at hand. Now that I have the segmented …
I have datasets of brain MR images with tumours; the tumours have already been selected manually by a physicist using ImageJ. I have read about segmentation, but I still don't understand how features are extracted from a segmented image. Should the images contain only the tumour on a black background, as shown in the images below, so that feature extraction is performed on the whole image? Or are features extracted only on the region of interest using …
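Either way it usually comes down to the same operation: the manual selection gives a binary mask, and features are computed only from the pixels inside that mask, whether or not the rest of the image is blacked out. A minimal numpy sketch, with `image` and `mask` (1 inside the tumour, 0 outside) assumed to be aligned 2-D arrays:

    import numpy as np

    tumour_pixels = image[mask > 0]                    # intensities inside the region of interest only
    features = {
        'mean_intensity': float(tumour_pixels.mean()),
        'std_intensity': float(tumour_pixels.std()),
        'area_px': int(np.count_nonzero(mask)),        # a simple shape feature
    }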