How is "similarity" measured in image retrieval?

I know what content-based image retrieval is. I have read this and this; one of them says: "given a query images, get a rank list that are most similar to the query image, based on the content of the query image." But my question is how the "similar" images are determined. Assume we are working on the Oxford5k dataset. The dataset contains 5k images in 17 classes. So, when I feed one of the images as a query, …
Category: Data Science

How can I train a model to modify a vector by rewarding the model based on the modified vector's nearest neighbors?

I am experimenting with a document retrieval system in which documents are represented as vectors. When queries come in, they are converted to vectors by the same method used for the documents. The query vector's k nearest neighbors are retrieved as the results. Each query has a known answer string. To improve performance, I am now looking to create a model that modifies the query vector. What I was looking to do was use a model …
Category: Data Science

How to determine the "total number of relevant documents" in the calculation of Recall in Precision and Recall if it's not known? Can it be estimated?

On Wikipedia there is a practical example of calculating Precision and Recall: When a search engine returns 30 pages, only 20 of which are relevant, while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3, which tells us how valid the results are, while its recall is 20/60 = 1/3, which tells us how complete the results are. I absolutely don't understand how one can use Precision and Recall in a real-life scenario of total number …
Category: Data Science
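
The arithmetic in the quoted Wikipedia example can be reproduced with a short sketch. The function name and its parameters are illustrative assumptions, not part of the original question. Note that recall needs the count of relevant pages that were not returned, which is exactly the quantity the question says is often unknown in practice.

```python
def precision_recall(returned_relevant, returned_total, missed_relevant):
    """Return (precision, recall) from raw counts.

    returned_relevant: relevant pages among those returned
    returned_total:    all pages returned by the search engine
    missed_relevant:   relevant pages the engine failed to return
    """
    precision = returned_relevant / returned_total
    recall = returned_relevant / (returned_relevant + missed_relevant)
    return precision, recall


# Numbers from the quoted example: 30 returned, 20 relevant, 40 relevant missed.
precision, recall = precision_recall(
    returned_relevant=20, returned_total=30, missed_relevant=40
)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
# prints "precision = 0.667, recall = 0.333"
```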

Extracting information from bills, tax statements, etc: What ML model to use?

I have a bunch of documents such as bank statements, utility bills, personal expenditure invoices, etc. The range of document types is very broad. Some of these files are saved as pictures, others as PDFs. So far, my tactic has been to OCR all the documents and then use some regexes to extract information (I would like to extract dates, quantities/amounts and entities). However, this hasn't worked out great so far... Thus, I was wondering what other possibilities there were in …
Category: Data Science

Information Extraction/Semantic Search for long, unstructured documents

I am stuck with a particular task of information extraction. I have a few hundred, long (5-35 pages) pdf, doc and docx project documents from which I seek to extract specific information and store them in a structured database. The ultimate goal is to extract and store information in a way that we can query those and any new incoming documents for fast and reliable information. For instance, I want to query a combination of entities from the knowledge base …
Category: Data Science

Is there a Mean Average Recall for Item Retrieval/ Recommendation Systems?

Mean Average Precision for Information retrieval is computed using Average Precision @ k (AP@k). AP@k is measured by first computing Precision @ k (P@k) and then averaging the P@k only for the k's where the document in position k is relevant. I still don't understand why the remaining P@k's are not used, but that is not my question. My question is: is there an equivalent Mean Average Recall (MAR) and Average Recall @ k (AR@k)? Recall @ k (R@k) is …
Category: Data Science
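
A minimal sketch of AP@k as the excerpt describes it: compute P@i for each position i up to k, then average only the values at positions where the item is relevant. The function names and the example relevance list are assumptions for illustration. Some libraries normalize by the total number of relevant documents instead, so treat this as one convention rather than the definitive one.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k items that are relevant (relevance is a 0/1 list)."""
    return sum(relevance[:k]) / k


def average_precision_at_k(relevance, k):
    """Average P@i over the positions i <= k whose item is relevant."""
    cutoff = min(k, len(relevance))
    values = [
        precision_at_k(relevance, i)
        for i in range(1, cutoff + 1)
        if relevance[i - 1]
    ]
    return sum(values) / len(values) if values else 0.0


# Hypothetical ranked list: items at positions 1 and 3 are relevant.
relevance = [1, 0, 1, 0, 0]
ap = average_precision_at_k(relevance, k=5)
print(round(ap, 3))  # P@1 = 1 and P@3 = 2/3, so AP@5 = 5/6
```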

Origin of the Boolean Model of Information Retrieval

Simple question, but I can't really find the answer to that: Who "invented" Boolean Retrieval? Of course, I assume that the concept grew over time, but is there a paper or publication that mentions/defines the Boolean Model as a whole for the first time? On Wikipedia, the book by Lancaster and Fayen (1973) is cited, but I couldn't find any definition in there, either.
Category: Data Science

How to implement Semantic Search in R or Python

I have a task to provide semantic search capabilities. For example, if I have a dataset of resumes and I search for "machine learning", then it should return all resumes that have data-science-related skills even if the "machine learning" keyword is missing. How do we search the data through its meaning and related keywords? I have also checked many algorithms, such as LSA, LDA, and LSI, but cannot find a resource that gives an implementation of the above.
Category: Data Science

Dissimilarity Matrix of non-metric proximity data

We currently have a coding exercise where we are asked to implement Constant Shift Embedding (Paper). This in itself is not a big problem. For the algorithm, all you need is a symmetric non-zero diagonal dissimilarity matrix of some non-metric proximity data. With the algorithm you can then embed the information into a vector space, and therefore you can use commonly known denoising and dimensionality reduction methods to improve the results of, for example, k-means clustering. Given the E-Mail communications …
Category: Data Science

Document ranking on a web scraped dataset without any labelled data

I want to create a document ranking model which returns similar rows in the dataset for a sample query. The text in this corpus is standard English but without any labels (i.e., no query-related documents structure). Is it possible to use a pretrained model trained on a large corpus (like BERT or word2vec) and use it directly on the scraped dataset without any evaluation and get decent results? If not this, is training a model on the MS MARCO dataset …
Category: Data Science

Getting answers to bullets (numbered items) from text via NLP

This is related to information extraction. In real world data, documents are written in bullets/numbered items form. For example, How to create a website: - Get A DNS - Get a Hosting - Deploy wordpress or some site ... above is sample of a structured data. Take another example where content is semi structured, While sandeep was going to home there was a road on the way he saw a - Car - 2 wheeler - cart and he carefully …
Category: Data Science

Difference between the architectures of semantic and instance segmentation

My question is about the difference between the architectures of semantic segmentation and instance segmentation models. As far as I understand, a semantic segmentation model performs pixel-wise classification and therefore has a dense layer at the end whose output dimension is the number of labels (classes). The part that confuses me is how instance segmentation models distinguish between instances of the same class. What does their architecture look like? Actually, I am studying NLP and …
Category: Data Science

How to extract details (educational details, exp details etc.) from a resume?

I am trying to build a resume parser which can extract details such as Name, Address, Education details (degree name, college name, university name, course duration), and Experience details (designation, company name, company location, work duration) from any kind of resume. I tried to train a custom NER model using spaCy. For that I created annotations from resumes which have entities as follows: Degree -> Degree name, College -> College name, University -> University name, Degree_date -> Degree date. Similarly created …
Category: Data Science

Find the business vertical of a website just by its URL, or cluster similar websites by their URLs

I have been exploring the problem of using just the website URL to tag or cluster websites by their business domain. For example: amazon.com => e-commerce, bbc.co.uk => news, Adidas.com => sports apparel. I have read through some research papers which try to cluster using different unsupervised clustering algorithms, like CLUE (link here). One way to think is to create a repository of labeled websites and then create a model to tag similar websites using this …
Category: Data Science

How do I verify and test a machine learning model against reality over time?

As software engineers, we are familiar with the concept of testing (unit, integration, e2e). Tests give us a level of confidence about the code and about changes in our code. It looks like for ML the "code" is the data that was used for the model. And unfortunately, data is not as deterministic as source code. If I consider that data is a kind of code for ML: What techniques and tools can be used for verifying / testing the data? My expectation is …
Category: Data Science

Approximate maximum dot product between a vector and set of vectors using only a single vector representation for the latter

If we have a vector $q$ and a set of vectors $D = \{d_1, d_2, ..., d_l\}$ is there a way to create functions $QF$ and $DF$ such that $QF(q)^TDF(D) \approx \max_i(q^Td_i)$ ? Use case: I want to build an information retrieval system in which documents are represented by an arbitrary but small ($<100$) number of vectors and the query is represented by a single vector. Ideally, I would like to sort the documents based on $\max_i(q^Td_i)$ but storing all …
Category: Data Science
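
For reference, the exact quantity the question wants to approximate, $\max_i(q^Td_i)$, can be sketched as below. This is only the brute-force baseline that stores every document vector, not the single-vector approximation $QF(q)^TDF(D)$ being asked about. The document IDs, vectors, and function names are hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def max_dot_score(q, doc_vectors):
    """Exact max_i q^T d_i over all vectors of one document."""
    return max(dot(q, d) for d in doc_vectors)


def rank_documents(q, documents):
    """Sort document IDs by their exact max-dot-product score, best first."""
    scores = {doc_id: max_dot_score(q, vecs) for doc_id, vecs in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical documents, each represented by a small set of vectors.
documents = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],
    "doc_b": [[0.5, 0.5]],
    "doc_c": [[-1.0, 0.0], [0.2, 0.1]],
}
q = [2.0, 1.0]
ranking = rank_documents(q, documents)
print(ranking)  # ['doc_a', 'doc_b', 'doc_c']
```

A baseline like this is mainly useful for checking how closely any candidate pair of functions reproduces the exact ordering.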

How is a textual search engine able to recognize subwords from words?

I am interested to know how information retrieval systems are able to consider relevant subwords from a main search word when performing a keyword search. For example, the word wristband can either be considered as is, or as wrist band. When word tokenized, they appear as [wristband] and [wrist, band] respectively. If I am querying with wristband, the wrist and band will be ignored in the count vector. Yet, I find common search engines that are able to retrieve results …
Category: Data Science

Search / Multiple Choice System evaluation

I have a DB with N items. My system can output an ID for the item or say N/A (not found). What are different ways to evaluate the performance of such a system, and what are the characteristics/tradeoffs of these? PS. Earlier I came up with a definition:

Ground truth ↓ / Prediction →    ID1                            N/A
ID2                              TP if ID1 == ID2, else FP      FN
N/A                              FP                             TN

Would be curious to get some thoughts / feedback on this definition and whether we …
Category: Data Science
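
The definition in the PS above can be written out as a small sketch that labels each (ground truth, prediction) pair and tallies the outcomes. Using `None` for N/A, the function names, and the sample pairs are all assumptions made for illustration.

```python
from collections import Counter

NA = None  # stands for "N/A (not found)"


def outcome(truth, prediction):
    """Label one (ground truth, prediction) pair following the PS definition."""
    if truth is NA:
        # Nothing should be found: any returned ID is a false positive.
        return "TN" if prediction is NA else "FP"
    if prediction is NA:
        # An item exists but the system reported N/A.
        return "FN"
    # An ID was returned: correct only if it matches the ground truth.
    return "TP" if prediction == truth else "FP"


def tally(pairs):
    """Count outcomes over an iterable of (truth, prediction) pairs."""
    return Counter(outcome(t, p) for t, p in pairs)


# Hypothetical evaluation set.
pairs = [("a", "a"), ("a", "b"), ("b", NA), (NA, "c"), (NA, NA)]
counts = tally(pairs)
print(dict(counts))
```

Note that under this definition a wrong ID for an existing item counts as a false positive only, so it does not reduce recall; whether that is desirable is one of the tradeoffs the question raises.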

Domain scoring based on ranking

I am a computer science student working on a small information retrieval project. I have a dictionary with a domain as the key and its ranking as the value. Based on that ranking, I need to score every domain. I was thinking of using 1/ranking, but the disparity is too high. For example, the first domain will have a score of 1 (1/1) and the domain ranked 10th will have a score of 0.1, which does not make sense for this. …
Category: Data Science
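
To make the disparity concrete, the sketch below compares the 1/ranking score from the question with a log-damped variant, 1/log2(ranking + 1). The log-damped formula is an assumption introduced here as one possible alternative, not something the question proposes, and the domain names are hypothetical.

```python
import math


def reciprocal_score(rank):
    """Score from the question: 1 / ranking."""
    return 1.0 / rank


def log_damped_score(rank):
    """Assumed alternative: 1 / log2(ranking + 1), which decays more slowly."""
    return 1.0 / math.log2(rank + 1)


# Hypothetical dictionary of domain -> ranking.
rankings = {"example-a.test": 1, "example-b.test": 10}
scores = {
    domain: (reciprocal_score(rank), log_damped_score(rank))
    for domain, rank in rankings.items()
}
for domain, (reciprocal, damped) in scores.items():
    print(f"{domain}: 1/rank = {reciprocal:.3f}, log-damped = {damped:.3f}")
```

Both functions give the top-ranked domain a score of 1, but the log-damped score for the 10th domain stays well above 0.1.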

How to stem plural words properly?

I'm looking for a way to avoid removing the ending s when s isn't a suffix. In order to do that, I first check if a word exists in my index. If it does, I don't remove the ending s, but if it doesn't, I go on and remove the ending s and add it to the index. But the problem is what to do when starting to build the index. Imagine we encounter books, I remove s and add book …
Category: Data Science
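
The rule described in the excerpt can be sketched as follows, assuming the index is a plain set of strings. The function name and the sample words are illustrative. Starting from an empty index, the sketch also shows the cold-start problem the question points at: a word such as "bus" loses its final s because nothing in the index protects it yet.

```python
def index_word(word, index):
    """Apply the rule from the question and return the form that is stored.

    If the word is already in the index, keep it unchanged.
    Otherwise strip a trailing "s" (if any) and add the result to the index.
    """
    if word in index:
        return word
    stem = word[:-1] if word.endswith("s") else word
    index.add(stem)
    return stem


index = set()  # empty index: the "starting to build" situation
stored = [index_word(w, index) for w in ["books", "book", "bus"]]
print(stored)  # ['book', 'book', 'bu']
```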
