Difference between text-based image retrieval and natural language object retrieval

Question


Sid

April 20, 2022, 15:01

I am working on creating a model that locates an object in the scene (2D image or 3D scene) using a natural language query. I came across this paper on natural language object retrieval that mentions that this task is different from text-based image retrieval in the sense that natural language object retrieval requires an understanding of objects in the image, spatial configurations, etc. I am not able to see the difference between these two tasks. Could you please explain it with an example?

Topic 3d-object-detection object-detection nlp machine-learning

Category Data Science

Erwan · Accepted Answer · October 10, 2020, 19:17

Disclaimer: I can only answer for the NLP part, since I'm no expert in image processing.

I assume that text-based image retrieval is the task of finding the image (or the part of an image) that corresponds to a short text which exclusively describes the object. In practice this means that every content word in the text (i.e. excluding grammatical words like determiners) refers directly to the object: "a bike", "a black cat", "the red car", etc. For an ML process this means there is no structure to analyze in the text: every word can be associated directly with a characteristic of the image.
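To illustrate what I mean by "nothing to analyze", here is a minimal sketch in which retrieval is just word overlap between the query and a set of tags per image. The file names, the tags and the `retrieve` helper are hypothetical and only there for illustration; a real system would use learned image and text features rather than hand-written tags.

```python
# Determiners are dropped; everything else counts as a content word.
STOPWORDS = {"a", "an", "the"}

# Hypothetical index: image file name -> words describing its content.
IMAGE_TAGS = {
    "img_001.jpg": {"bike"},
    "img_002.jpg": {"black", "cat"},
    "img_003.jpg": {"red", "car"},
}


def content_words(text):
    """Return the set of content words in a short query."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def retrieve(query):
    """Return the image whose tags share the most words with the query."""
    words = content_words(query)
    return max(IMAGE_TAGS, key=lambda name: len(words & IMAGE_TAGS[name]))


print(retrieve("a black cat"))
print(retrieve("the red car"))
```

Word order and grammar play no role here: the query is treated as an unordered set of words.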

By contrast, natural language object retrieval involves analyzing the text. For instance, "the cat on the left of the picture" is different from "the picture on the left of the cat", even though the words are the same. There can also be different ways to refer to the same object: "the book at the left of the shelf" may be the same as "the leftmost book" or "the book next to the green book". There are usually many ways to express the same meaning in language, and that makes the task much more complex.

Additionally, I would assume that mapping positional descriptions to image characteristics can be tricky: "the man behind the tree" or "the second bridge" in a 2D image requires the model to "understand" depth. In a picture with two dogs, "the small dog" requires the model to "understand" the size relation between objects. Humans intuitively know how to interpret these sentences, but for a machine, Natural Language Understanding hasn't been solved yet (and it might never be).
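A small sketch of both points, under stated assumptions. The first part shows that the two "cat"/"picture" sentences are indistinguishable once word order is discarded. The second part assumes an object detector has already produced labelled bounding boxes; the `DETECTIONS` list and the `leftmost` and `smallest` helpers are made up for illustration. Note that box area in pixels is only a crude stand-in for "small", since apparent size also depends on depth.

```python
from collections import Counter


def bag_of_words(text):
    """Count the words of a text, ignoring their order."""
    return Counter(text.lower().split())


a = "the cat on the left of the picture"
b = "the picture on the left of the cat"
same_bag = bag_of_words(a) == bag_of_words(b)    # identical word counts
same_sequence = a.split() == b.split()           # but a different word order

# Hypothetical detector output: (label, x_min, y_min, x_max, y_max) in pixels.
DETECTIONS = [
    ("book", 40, 10, 60, 90),
    ("book", 10, 10, 30, 90),
    ("dog", 100, 50, 180, 120),
    ("dog", 200, 80, 230, 110),
]


def leftmost(label, detections):
    """Resolve 'the leftmost <label>' as the box with the smallest x_min."""
    candidates = [d for d in detections if d[0] == label]
    return min(candidates, key=lambda d: d[1])


def smallest(label, detections):
    """Resolve 'the small <label>' as the box with the smallest pixel area."""
    candidates = [d for d in detections if d[0] == label]
    return min(candidates, key=lambda d: (d[3] - d[1]) * (d[4] - d[2]))


print(same_bag, same_sequence)
print(leftmost("book", DETECTIONS))
print(smallest("dog", DETECTIONS))
```

Even in this toy version, the hard part is left out: deciding from the text that "the book at the left of the shelf" should call `leftmost` at all is exactly the language understanding problem described above.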
