information-retrieval

How is "similarity" measured in image retrieval?

David

May 30, 2022 23:00

I know what content-based image retrieval is. I have read this and this, and one of them says: "given a query images, get a rank list that are most similar to the query image, based on the content of the query image. " But my question is how the "similar" images are determined. Assume we are working on the Oxford5k dataset. The dataset contains 5k images in 17 classes. So, when I feed one of the images as a query, …

Topic: computer-vision information-retrieval machine-learning

Category: Data Science

How can I train a model to modify a vector by rewarding the model based on the modified vector's nearest neighbors?

RossDeVito

May 15, 2022 20:28

I am experimenting with a document retrieval system in which documents are represented as vectors. When queries come in, they are turned into vectors by the same method used for the documents. The query vector's k nearest neighbors are retrieved as the results. Each query has a known answer string. In order to improve performance, I am now looking to create a model that modifies the query vector. What I was looking to do was use a model …

Topic: vector-space-models training reinforcement-learning information-retrieval machine-learning

Category: Data Science
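
The retrieval step described in the question above (query and documents embedded by the same method, then the query's k nearest neighbors returned) can be sketched as follows. This is only an illustration: the document IDs, the two-dimensional vectors, and the choice of Euclidean distance are assumptions, since the question does not say which metric or embedding is used.

```python
import math

def k_nearest(query, doc_vectors, k):
    """Return the IDs of the k documents closest to the query vector."""
    # Sort document IDs by Euclidean distance to the query (assumed metric).
    ranked = sorted(doc_vectors, key=lambda doc_id: math.dist(query, doc_vectors[doc_id]))
    return ranked[:k]

# Hypothetical toy index: document ID -> embedding vector.
doc_vectors = {
    "d1": [0.0, 0.0],
    "d2": [1.0, 1.0],
    "d3": [0.2, 0.1],
}
results = k_nearest([0.1, 0.1], doc_vectors, k=2)
print(results)
```

A model that modifies the query vector would sit in front of `k_nearest`, and its reward could be computed from whether the returned IDs contain the known answer.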

How to determine the "total number of relevant documents" when calculating Recall if it's not known? Can it be estimated?

Banik

April 28, 2022 15:17

On Wikipedia there is a practical example of calculating Precision and Recall: When a search engine returns 30 pages, only 20 of which are relevant, while failing to return 40 additional relevant pages, its precision is 20/30 = 2/3, which tells us how valid the results are, while its recall is 20/60 = 1/3, which tells us how complete the results are. I absolutely don't understand how one can use the Precision and Recall in a real-life scenario of total number …

Topic: learning-to-rank ranking evaluation information-retrieval recommender-system

Category: Data Science
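
The Wikipedia example quoted in the question above can be checked with a few lines of arithmetic. The function and parameter names are made up for illustration; the numbers (30 returned, 20 relevant, 40 relevant pages missed) come from the quoted example.

```python
def precision_recall(returned_relevant, returned_total, missed_relevant):
    """Precision and recall from counts of returned and missed relevant items."""
    precision = returned_relevant / returned_total
    # Recall needs the total number of relevant items, returned or not.
    recall = returned_relevant / (returned_relevant + missed_relevant)
    return precision, recall

precision, recall = precision_recall(returned_relevant=20, returned_total=30, missed_relevant=40)
print(precision, recall)
```

The `missed_relevant` argument is exactly the quantity the question is about: it is given in the textbook example but is generally unknown for a real search engine.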

Extracting information from bills, tax statements, etc.: What ML model to use?

An old man in the sea.

April 8, 2022 18:05

I have a bunch of documents such as bank statements, utilities bills, personal expenditure invoices, etc. The range of document types is very broad. Some of these files are saved as pictures, others as PDFs. So far, my tactic has been to OCR all the documents, and then use some regexes to extract information (I would like to extract dates, quantities/amounts and entities). However, this hasn't worked out great so far... Thus, I was wondering what other possibilities there were in …

Topic: information-extraction nlp information-retrieval

Category: Data Science

Information Extraction/Semantic Search for long, unstructured documents

XsLiar

March 29, 2022 16:06

I am stuck with a particular task of information extraction. I have a few hundred long (5-35 pages) PDF, DOC and DOCX project documents from which I seek to extract specific information and store it in a structured database. The ultimate goal is to extract and store information in a way that we can query those and any new incoming documents for fast and reliable information. For instance, I want to query a combination of entities from the knowledge base …

Topic: named-entity-recognition text-mining nlp information-retrieval

Category: Data Science

Is there a Mean Average Recall for Item Retrieval / Recommendation Systems?

VaM

March 17, 2022 12:49

Mean Average Precision for Information retrieval is computed using Average Precision @ k (AP@k). AP@k is measured by first computing Precision @ k (P@k) and then averaging the P@k only for the k's where the document in position k is relevant. I still don't understand why the remaining P@k's are not used, but that is not my question. My question is: is there an equivalent Mean Average Recall (MAR) and Average Recall @ k (AR@k)? Recall @ k (R@k) is …

Topic: model-evaluations evaluation information-retrieval recommender-system

Category: Data Science
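
The AP@k computation described in the question above (compute P@k, then average it only over the positions holding a relevant document) can be sketched as below. The relevance list is a made-up example, and the normalization follows the question's wording; some definitions instead divide by the total number of relevant documents.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant (relevance is a 0/1 list)."""
    return sum(relevance[:k]) / k

def average_precision_at_k(relevance, k):
    """Average P@i over the positions i <= k where the result at i is relevant."""
    cutoff = min(k, len(relevance))
    hits = [precision_at_k(relevance, i) for i in range(1, cutoff + 1) if relevance[i - 1]]
    return sum(hits) / len(hits) if hits else 0.0

# Hypothetical ranked list: relevant results at positions 1 and 3.
relevance = [1, 0, 1, 0, 0]
ap = average_precision_at_k(relevance, k=5)
print(ap)
```

Here P@1 = 1 and P@3 = 2/3 are the only terms that enter the average; Mean Average Precision is then the mean of this value over all queries.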

Origin of the Boolean Model of Information Retrieval

TiMauzi

March 4, 2022 15:00

Simple question, but I can't really find the answer to that: Who "invented" Boolean Retrieval? Of course, I assume that the concept grew over time, but is there a paper or publication that mentions/defines the Boolean Model as a whole for the first time? On Wikipedia, the book by Lancaster and Fayen (1973) is cited, but I couldn't find any definition in there, either.

Topic: history information-retrieval definitions

Category: Data Science

How to implement Semantic Search in R or Python

Yash Kanojia

February 24, 2022 11:07

I have a task to provide semantic search capabilities. For example, if I have a dataset of resumes and I search for "machine learning", then it should return all resumes which have data science-related skills despite missing the "machine learning" keyword. How do we search the data through its meaning and related keywords, I wonder? I have checked many algorithms such as LSA, LDA and LSI, but cannot find a resource which gives an implementation of the above.

Topic: similar-documents deep-learning information-retrieval machine-learning

Category: Data Science

Dissimilarity Matrix of non-metric proximity data

ninji

February 9, 2022 03:06

We currently have a coding exercise, where we are asked to implement Constant Shift Embedding (Paper). This in itself is not a big problem. For the algorithm, all you need is a symmetric non-zero diagonal dissimilarity matrix of some non-metric proximity data. With the algorithm you can then embed the information into a vector space and therefore you can use commonly known denoising and dimensionality reduction methods to improve the results of for example k-means clustering. Given the E-Mail communications …

Topic: similarity information-retrieval

Category: Data Science

Document ranking on a web-scraped dataset without any labelled data

sarva

February 4, 2022 05:04

I want to create a document ranking model which returns similar rows in the dataset for a sample query. The text in this corpus is standard English but without any labels (i.e. no query-related documents structure). Is it possible to use a pretrained model trained on a large corpus (like BERT or word2vec) and use it directly on the scraped dataset without any evaluation and get decent results? If not this, is training a model on the MS MARCO dataset …

Topic: bert similar-documents text-mining nlp information-retrieval

Category: Data Science

Getting answers to bullets (numbered items) from text via NLP

Sandeep Bhutani

January 31, 2022 13:01

This is related to information extraction. In real-world data, documents are written in bullet/numbered-item form. For example, How to create a website: - Get A DNS - Get a Hosting - Deploy WordPress or some site ... The above is a sample of structured data. Take another example where the content is semi-structured: While sandeep was going to home there was a road on the way he saw a - Car - 2 wheeler - cart and he carefully …

Topic: nlp information-retrieval

Category: Data Science

Difference between the architectures of semantic and instance segmentation

The Exile

November 25, 2021 16:46

My question is about the difference between the architectures of semantic segmentation and instance segmentation models. So, as far as I understand, a semantic segmentation model performs pixel-wise classification and, therefore, it has a dense layer at the end where the output dimension is the number of labels (classes). The part that confuses me is how instance segmentation models distinguish between instances of the same class. What does their architecture look like? Actually, I am studying NLP and …

Topic: information-extraction computer-vision deep-learning nlp information-retrieval

Category: Data Science

How to extract details (educational details, exp details etc.) from a resume?

SRJ577

November 11, 2021 13:00

I am trying to build a resume parser which can extract details such as Name, Address, Education details (degree name, college name, university name, course duration), Experience details (designation, company name, company location, work duration) from any kind of resume. I tried to train a custom NER model using spaCy. For that I created annotations from resumes which have entities as follows: Degree -> Degree name, College -> College name, University -> University name, Degree_date -> Degree date. Similarly created …

Topic: spacy nlp python information-retrieval

Category: Data Science

Find the business vertical of a website just by its URL, or cluster similar websites by their URLs

think-maths

September 23, 2021 10:44

I have been exploring the problem of using just the website URL to tag or cluster websites by their business domain. For example: amazon.com => e-commerce, bbc.co.uk => news, Adidas.com => sports apparel. I have read through some research papers which try to cluster using different unsupervised clustering algorithms like CLUE (link here). One way to think is to create a repository of labeled websites and then create a model to tag similar websites using this …

Topic: web-scraping python-3.x unsupervised-learning information-retrieval clustering

Category: Data Science

How do I verify and test a machine learning model against reality over time?

BogdanSnisar

June 28, 2021 11:07

As software engineers we are familiar with the concept of testing (unit, integration, e2e). Tests give us a level of confidence about the code and changes in our code. It looks like for ML the "code" is the data that was used for the model. And unfortunately data is not as deterministic as source code. If I consider that data is a kind of code for ML: What techniques and tools can be used for verifying / testing the data? My expectation is …

Topic: mlops cross-validation information-retrieval data-mining machine-learning

Category: Data Science

Approximate maximum dot product between a vector and set of vectors using only a single vector representation for the latter

Curious Ion

June 11, 2021 12:35

If we have a vector $q$ and a set of vectors $D = \{d_1, d_2, ..., d_l\}$ is there a way to create functions $QF$ and $DF$ such that $QF(q)^TDF(D) \approx \max_i(q^Td_i)$ ? Use case: I want to build an information retrieval system in which documents are represented by an arbitrary but small ($<100$) number of vectors and the query is represented by a single vector. Ideally, I would like to sort the documents based on $\max_i(q^Td_i)$ but storing all …

Topic: vector-space-models information-retrieval

Category: Data Science
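
For reference, the exact quantity that the question above wants to approximate, ranking documents by $\max_i(q^Td_i)$ with all document vectors stored, can be sketched as below. The document IDs and vectors are hypothetical, and this is only the brute-force baseline, not the single-vector approximation ($QF$, $DF$) the question asks for.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def max_dot(q, doc_vectors):
    """Exact score of one document: max over its vectors d_i of q^T d_i."""
    return max(dot(q, d) for d in doc_vectors)

def rank_documents(q, documents):
    """Sort document IDs by their exact max-dot-product score, best first."""
    scores = {doc_id: max_dot(q, vecs) for doc_id, vecs in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical corpus: each document is a small set of vectors.
documents = {
    "a": [[0.2, 0.9], [0.5, 0.1]],
    "b": [[0.9, 0.0]],
    "c": [[0.1, 0.1], [0.3, 0.8]],
}
ranking = rank_documents([1.0, 0.0], documents)
print(ranking)
```

Any candidate pair of functions can be evaluated by comparing the ranking it induces against this baseline.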

How is a textual search engine able to recognize subwords from words?

Darrel

June 11, 2021 06:45

I am interested to know how information retrieval systems are able to consider relevant subwords from a main search word when performing a keyword search. For example, the word wristband can either be considered as is, or as wrist band. When word tokenized, they appear as [wristband] and [wrist, band] respectively. If I am querying with wristband, the wrist and band will be ignored in the count vector. Yet, I find common search engines that are able to retrieve results …

Topic: nlp information-retrieval

Category: Data Science

Search / Multiple Choice System evaluation

kazmikh

June 9, 2021 00:16

I have a DB with N items. My system can output an ID for the item or say N/A (not found). What are different ways to evaluate the performance of such a system, and what are the characteristics/tradeoffs of these? PS. Earlier I came up with a definition (ground truth ↓, prediction →): ground truth ID2, prediction ID1: TP if ID1 == ID2 else FP; ground truth ID2, prediction N/A: FN; ground truth N/A, prediction ID1: FP; ground truth N/A, prediction N/A: TN. Would be curious to get some thoughts / feedback on this definition and whether we …

Topic: metric evaluation information-retrieval search

Category: Data Science

Domain scoring based on ranking

pierre

April 2, 2021 15:07

I am a computer science student working on a small information retrieval project. I have a dictionary with a domain as a key and its ranking as value. Based on that ranking, I need to score every domain. I was thinking of using 1/ranking but the disparity is too high. For example the first domain will have a score of 1 (1/1) and the domain ranked 10th will have a score of 0.1 which does not make sense for this. …

Topic: dataset information-retrieval

Category: Data Science
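
The 1/ranking scheme mentioned in the question above can be reproduced in a couple of lines, which makes the disparity it describes easy to see. The domain names and ranks are placeholders invented for illustration.

```python
def reciprocal_rank_scores(rankings):
    """Map each domain to 1/rank, as proposed in the question."""
    return {domain: 1 / rank for domain, rank in rankings.items()}

# Hypothetical dictionary: domain -> ranking (1 is best).
rankings = {"first.example": 1, "second.example": 2, "tenth.example": 10}
scores = reciprocal_rank_scores(rankings)
print(scores)
```

The score halves between rank 1 and rank 2 and drops to 0.1 by rank 10, which is the steep fall-off the question objects to.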

How to stem plural words properly?

Mahdi Ghajary

March 7, 2021 14:06

I'm looking for a way to avoid removing the ending s when s isn't a suffix. In order to do that, I first check if a word exists in my index; if it does, I don't remove the ending s, but if it doesn't, I go on and remove the ending s and add it to the index. But the problem is what to do when starting to build the index. Imagine we encounter books, I remove s and add book …

Topic: indexing nlp information-retrieval

Category: Data Science
