Difference between text-based image retrieval and natural language object retrieval
I am building a model that locates an object in a scene (a 2D image or a 3D scene) given a natural language query. I came across this paper on natural language object retrieval, which says that the task differs from text-based image retrieval because it requires an understanding of the objects in the image, their spatial configurations, etc. I am not able to see the difference between these two tasks. Could you please explain it with an example?
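To make the question concrete, here is a toy sketch of one possible reading of the two task interfaces. Everything in it is an assumption for illustration only: the hand-made annotations, the `Obj` class, and the functions `retrieve_images` and `retrieve_object` are hypothetical names that do not come from the paper, and simple keyword matching stands in for a learned model.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


# Hypothetical, hand-annotated toy data (not from any real dataset).
IMAGES = {
    "kitchen.jpg": [
        Obj("cup", (40, 60, 80, 100)),
        Obj("cup", (300, 60, 340, 100)),
        Obj("table", (0, 90, 400, 200)),
    ],
    "street.jpg": [
        Obj("car", (10, 120, 200, 220)),
        Obj("person", (220, 80, 260, 220)),
    ],
}


def retrieve_images(query, images):
    """Text-based image retrieval: rank WHOLE images against the query."""
    words = set(query.lower().split())
    # Score each image by how many of its object labels appear in the query.
    scores = {
        name: sum(o.label in words for o in objs)
        for name, objs in images.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def retrieve_object(query, objs):
    """Natural language object retrieval: return ONE box inside one image."""
    words = query.lower().split()
    candidates = [o for o in objs if o.label in words]
    if not candidates:
        return None
    # Toy spatial reasoning: resolve "left"/"right" among the candidates.
    if "left" in words:
        return min(candidates, key=lambda o: o.box[0]).box
    if "right" in words:
        return max(candidates, key=lambda o: o.box[2]).box
    return candidates[0].box


query = "the cup on the right"
ranking = retrieve_images(query, IMAGES)
box = retrieve_object(query, IMAGES["kitchen.jpg"])
print(ranking)
print(box)
```

Under this reading, the first function returns a ranking over whole images, while the second returns a single box inside one given image and has to use the word "right" to choose between the two cups.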
Topic 3d-object-detection object-detection nlp machine-learning
Category Data Science