How can I improve the recall of a certain class in a multiclass-classification result

I am working on a multiclass classification task that assigns medical web-search queries to hospital departments. My classifier is based on fastText. For most departments the results are good enough, say a recall of 0.8 for Nephrology. However, for just one department, Dermatology, the recall is pretty low, around 0.5. Unfortunately, this label has the most samples in the test data. How can I improve the recall of one class while maintaining the performance of …
Category: Data Science
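A minimal sketch of two levers that are often tried for this: inspect the per-class numbers, then lower the acceptance threshold for the weak class. Everything here is illustrative; `y_true`, `proba`, the class names, and the threshold value are placeholders for your own test data and model outputs.

```python
import numpy as np
from sklearn.metrics import classification_report

# Placeholder inputs: gold labels and an (n_samples, n_classes) array of
# predicted probabilities collected from the classifier.
classes = np.array(["cardiology", "dermatology", "nephrology"])  # illustrative
y_pred = classes[proba.argmax(axis=1)]
print(classification_report(y_true, y_pred))  # per-class precision/recall

# Lower the acceptance threshold for the weak class so borderline queries
# fall into Dermatology more often. Tune on a validation split, not the test
# set, and watch the precision of the other departments.
derm = np.where(classes == "dermatology")[0][0]
THRESH = 0.35  # hypothetical value
y_adj = np.where(proba[:, derm] >= THRESH, "dermatology", y_pred)
print(classification_report(y_true, y_adj))
```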

Learning to Rank with Unlabelled Dataset

I have a folder of about 60k PDF documents that I would like to learn to rank based on queries, to surface the most relevant results. The goal is to surface and rank relevant documents, very much like a search engine. I understand that Learning to Rank is a supervised algorithm that requires features generated from query-document pairs. However, the problem is that none of the documents are labelled. How many queries should I have to even begin training the model?
Category: Data Science
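One approach people use when no human labels exist is weak supervision: score query-document pairs with a cheap unsupervised ranker such as BM25 and treat its top results as pseudo-relevant training data for the Learning to Rank model. A rough sketch under that assumption; `docs` (text already extracted from the PDFs) and `queries` are placeholders.

```python
from rank_bm25 import BM25Okapi

# Placeholder corpus: plain text extracted from each PDF.
tokenized = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized)

# Build (query, doc_id, graded relevance) triples as pseudo-labels for LTR.
pseudo_labels = []
for q in queries:
    scores = bm25.get_scores(q.lower().split())
    top = scores.argsort()[::-1][:10]                   # ten highest-scoring documents
    for rank, doc_id in enumerate(top):
        pseudo_labels.append((q, int(doc_id), 10 - rank))  # crude graded label
```

The number of queries then becomes an empirical question: hold some queries out and see whether validation ranking metrics keep improving as more queries are added.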

Best way to vectorise names and addresses for similarity searching?

I have a large dataset of around 9 million people with names and addresses. Given quirks of the process used to get the data, it is highly likely that a person is in the dataset more than once, with subtle differences between each record. I want to identify a person and their 'similar' personas, with some sort of confidence metric for the alternative records identified. My initial thought on an approach is to vectorise each name and address as a …
Category: Data Science
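A common baseline for this kind of fuzzy record linkage is character n-gram TF-IDF vectors plus cosine nearest-neighbour search. Sketch only; `records` stands in for a list of "name, address" strings and the parameters are chosen arbitrarily.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vec.fit_transform(records)        # sparse matrix, one row per record

nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(X)
dist, idx = nn.kneighbors(X[:100])    # candidate duplicates for the first 100 rows
confidence = 1.0 - dist               # cosine similarity as a rough confidence score
```

At 9 million records an exact search like this gets slow, so people usually block on a coarse key (e.g. postcode) first or switch to an approximate index such as NMSLIB's HNSW, which comes up elsewhere on this page.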

Best methods to choose between different searching models?

My question here is in regards to best practices and current methods for selecting search models on the fly based on a user's query. Let's say I have four searching models, each optimized for its respective query type: Model A: embedding-based, used for sentence queries about scientific topics; Model B: embedding-based, used for sentence queries about general news topics; Model C: TF*IDF-based, used for keyword queries about scientific topics; Model D: TF*IDF-based, used for keyword queries about general news topics. When …
Category: Data Science
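Purely as an illustration of one way to do the dispatch, a hypothetical two-stage router: a small text classifier decides the topic, a heuristic decides keyword vs. sentence, and the pair picks the model. `topic_clf`, the model names, and the length heuristic are all made up.

```python
def route(query: str, topic_clf) -> str:
    """Return the name of the search model to use for this query."""
    topic = topic_clf.predict([query])[0]              # e.g. "science" or "news"
    style = "sentence" if len(query.split()) > 4 else "keyword"
    table = {
        ("science", "sentence"): "model_a",
        ("news", "sentence"): "model_b",
        ("science", "keyword"): "model_c",
        ("news", "keyword"): "model_d",
    }
    return table[(topic, style)]
```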

Is this a potentially acceptable way to compare Google result quantities?

I’ve recently been trying to compare the internet presence of a species to trend data I have collected. After reading a stack of papers on hit count estimates, I’m well aware that the number of results is, at best, an estimate. My question is, how far off would it be to compare terms that produce drastic differences in the number of results? For instance: “White-throated jay” OR “scientific name” yields roughly 7-14,00 results depending on the day/amount of time the query …
Category: Data Science

Why do we calculate the vector of a document by averaging the vectors of all the words?

I am trying to build a search engine to query a folder of documents. Tutorials online suggest obtaining the vector of a document by averaging the vectors of all its words, then comparing that to the vector of the query for similarity. How does the average of all the word vectors in a document retain the information of the individual words? Would it be better if I retrieved words similar to the query and checked if these words …
Category: Data Science
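For reference, this is roughly what the averaging approach in those tutorials looks like; `w2v` is assumed to be any word-to-vector mapping (e.g. loaded gensim KeyedVectors), and `query`/`documents` are placeholders.

```python
import numpy as np

def doc_vector(text, w2v, dim=300):
    """Average the vectors of the words we have embeddings for."""
    vecs = [w2v[w] for w in text.lower().split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

qv = doc_vector(query, w2v)
scores = [cosine(qv, doc_vector(d, w2v)) for d in documents]  # rank documents by score
```

The average does discard word order and per-word detail; it only preserves the rough topical direction of the document in the embedding space, which is why it works as a coarse similarity signal.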

Could you generate search queries to poison data analysis by a search engine?

A simple problem with search engines is that you have to trust that they will not build a profile of the search queries you submit. (Without Tor or e.g. homomorphic encryption, that is.) Suppose we put together a search-engine server with a use policy that permits constant queries being sent by paid customers. The search engine's client transmits, at some frequency, generated search queries (e.g. Markov-chain, ML-generated, random dictionary words, sourced from news, whatever; up to you) in order to …
Category: Data Science

About the Natural Questions (NQ) benchmark in NLP

I recently learned that there is a benchmark called NQ. https://ai.google.com/research/NaturalQuestions/visualization Unlike other QA benchmarks, where the relevant document is provided with the query, here the model has to find the information from a corpus of millions of documents by itself. For example, if the question is "when are hops added to the brewing process?", another QA benchmark would provide only one document about brewing, while NQ provides the whole Wikipedia text and the model has to find the most relevant document and the answer. When I tried all the examples in the …
Category: Data Science

What is the difference between Okapi BM25 and NMSLIB?

I was trying to make a search system and then I got to know about Okapi BM25, which is a ranking function like tf-idf. You can make an index of your corpus and later retrieve documents similar to your query. I imported the Python library rank_bm25, created a search system, and the results were satisfying. Then I saw something called the Non-Metric Space Library (NMSLIB). I understood that it's a similarity-search library, much like the kNN algorithm. I saw an example …
Category: Data Science
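They sit at different layers of a search system: BM25 scores documents for a query directly from term statistics, while NMSLIB only finds nearest neighbours of whatever vectors you hand it, so it needs an embedding or TF-IDF step first. A hedged sketch of both, with `corpus` (list of strings), `doc_vectors`, and `query_vector` (float32 numpy arrays) as placeholders:

```python
from rank_bm25 import BM25Okapi
import nmslib

# 1) BM25: lexical ranking straight from term statistics, no vectors involved.
bm25 = BM25Okapi([d.lower().split() for d in corpus])
scores = bm25.get_scores("query terms".split())

# 2) NMSLIB: approximate nearest-neighbour search over precomputed vectors.
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(doc_vectors)
index.createIndex({"post": 2})
ids, dists = index.knnQuery(query_vector, k=10)
```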

What is the formula and log base for idf?

To calculate tf-idf we compute tf * idf, where tf = the number of times the word occurs in the document. What is the formula for idf, and which log base is used? Is it:
(1) log(number of documents / number of documents containing the word)
(2) log((1 + number of documents) / (1 + number of documents containing the word))
(3) 1 + log(number of documents / number of documents containing the word)
(4) 1 + log((1 + number of documents) / (1 + number of documents containing the word))
Category: Data Science
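For what it's worth, there is no single canonical formula; both the variant and the log base differ between implementations. The classic textbook definition is variant (1), while scikit-learn's TfidfVectorizer uses natural log and, with its default smooth_idf=True, variant (4), falling back to variant (3) when smooth_idf=False. A small sketch of the three side by side:

```python
import numpy as np

n, df = 1000, 10                               # corpus size, docs containing the term

idf_textbook = np.log(n / df)                  # (1): log base is a convention (often 10 or e)
idf_sklearn = np.log((1 + n) / (1 + df)) + 1   # (4): scikit-learn default (smooth_idf=True)
idf_unsmoothed = np.log(n / df) + 1            # (3): scikit-learn with smooth_idf=False
```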

Is Elasticsearch recommended if the attribute being searched is not a huge text document?

We are currently developing a system on the MEAN stack with MongoDB at the backend. We have employee names and IDs in our system, and our client wants a pretty good (read: Google-like) search to look up employees' records. He needs our system to recommend employees even if he has misspelled the name, etc. One of the suggestions from our development lead was that we should use Elasticsearch, but from what I have seen, Elasticsearch …
Category: Data Science
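Elasticsearch is commonly used for exactly this kind of short-field, typo-tolerant lookup; the tolerance comes from fuzzy matching rather than from the documents being long. A hedged sketch with the official Python client (7.x-style call; the index name and field are hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="employees",                   # hypothetical index of employee records
    body={
        "query": {
            "match": {
                "name": {
                    "query": "jonh smiht",   # misspelled input
                    "fuzziness": "AUTO",     # tolerate edit-distance typos
                }
            }
        }
    },
)
hits = [h["_source"] for h in resp["hits"]["hits"]]
```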

Measuring quality of answers from QnA systems

I have a question-answering system that uses a Seq2Seq-style architecture; actually, it is a transformer architecture. When a question is asked, it gives the start position and end position of the answer along with their logits. The answer is formed by choosing the best-scoring span, and the final probability is calculated by summing the start and end logits. Now the problem is that I have multiple answers, and many times the good answer is in 2nd or 3rd place (after sorting …
Category: Data Science
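A sketch of the usual span-selection step, assuming `start_logits` and `end_logits` are 1-D numpy arrays for one passage: score every valid span (start before end, bounded length) by the sum of its logits and keep the top-k candidates rather than only the argmax, so they can be re-ranked or calibrated afterwards.

```python
import numpy as np

def top_spans(start_logits, end_logits, k=5, max_len=30):
    """Return the k highest-scoring (score, start, end) answer spans."""
    spans = []
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            spans.append((float(start_logits[s] + end_logits[e]), s, e))
    spans.sort(reverse=True)
    return spans[:k]
```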

An exhaustive, representative test database for a phrase search algorithm

For a phrase-searching algorithm, imagine the goal is to search for a name phrase and return matched results based on a pre-defined threshold. For example, searching for "Jon Smith" could return "Jon Smith", "Jonathan Smith", "Jonathan David Smith", "Jonathan Smith-Mikel", "Jonathan 'Smith' Mikel", etc. The plan is to manually choose N test cases and put them in a benchmark database. I have concerns about this plan because the test cases are unlikely to be exhaustive. I know there …
Category: Data Science
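One way to reduce the hand-picking bias is to generate variant and negative pairs programmatically and score them with the same similarity measure the matcher uses. Purely illustrative; difflib's ratio is only a stand-in for the actual phrase-matching score and the candidate list is made up.

```python
from difflib import SequenceMatcher

query = "Jon Smith"
candidates = ["Jon Smith", "Jonathan Smith", "Jonathan David Smith",
              "Jonathan Smith-Mikel", "John Smythe", "Joan Smits"]

for cand in candidates:
    score = SequenceMatcher(None, query.lower(), cand.lower()).ratio()
    print(f"{cand:25s} {score:.2f}")   # compare each score against the threshold
```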

Scalable tools to build a kNN graph over sparse data

I'm looking for scalable tools to build a kNN graph over sparse data points. Both the dimension and the number of data points can be up to millions. What I have tried already:
- sklearn.neighbors.kneighbors_graph: does brute-force search for sparse data, giving quadratic time
- flann: only supports dense arrays
- pysparnn: the running time is not very satisfactory (maybe because it's written in Python)
- knn search in mlpack: only supports dense data
- scipy.spatial.KDTree: converts the sparse data to dense
- SparseLSH: …
Category: Data Science
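For completeness, NMSLIB (which comes up elsewhere on this page) has a sparse cosine space and an HNSW index that might belong on that list. A hedged sketch, with `X` standing in for a scipy.sparse CSR matrix of the data points and the index parameters chosen arbitrarily:

```python
import nmslib

index = nmslib.init(method="hnsw",
                    space="cosinesimil_sparse",
                    data_type=nmslib.DataType.SPARSE_VECTOR)
index.addDataPointBatch(X)                          # X: scipy.sparse CSR matrix
index.createIndex({"M": 16, "efConstruction": 100})

# kNN graph: query the index with the same points, in parallel.
neighbours = index.knnQueryBatch(X, k=10, num_threads=4)
```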

Can Google really bring back billions of results in the blink of an eye (almost)?

I was always fascinated by Google's search ability, a great achievement by Google and other search engine providers, but more so a collective human talent and ability that makes me appreciate our amazing mind and our potential to innovate. I use Google search daily, and I am sometimes disappointed when a very few words give no results, which I accept to an extent. One of these instances led me to do a further "mini" investigation/test on Google's …
Category: Data Science

Search Query Sample Size Determination for validation set

While designing a search system that searches in N identifiable categories, how many search queries does one need in each category to validate the target metric (DCG) scores accurately (balanced variance and bias)? Does this number depend on N, on the corpus size, or on both? Please add any publications if possible. I would also like to understand whether effect size and Bayesian effective sample size play some role here. Given a set of search queries Q for retrieving documents from …
Category: Data Science
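One empirical way to get a feel for the number, separate from any formal power analysis: bootstrap the per-query DCG scores you already have and watch how the confidence interval of the category mean shrinks as the number of queries grows. Sketch only; `dcg_scores` is a placeholder 1-D array of per-query DCG values for one category.

```python
import numpy as np

rng = np.random.default_rng(0)

def ci_width(scores, n_queries, n_boot=2000):
    """Width of a bootstrap 95% CI for the mean DCG at a given query count."""
    means = [rng.choice(scores, size=n_queries, replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [2.5, 97.5])
    return hi - lo

for n in (25, 50, 100, 200):
    print(n, round(ci_width(dcg_scores, n), 3))  # stop increasing n once the width is acceptable
```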

Best method for similarity searching on 10,000 data points with 8,000 features each in Python?

As mentioned in the title, I am attempting to search through 10,000 vectors with 8,000 features each, all in Python. Currently I have the vectors saved in their own directories as pickled numpy arrays. The features were pulled from this deep neural network. I am new to this space, but I have heard about M-trees, R-trees, inverted tables, and hashing. Is any one of them better for such a large number of features? This implementation needs to be done quite …
Category: Data Science
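For scale: 10,000 x 8,000 float32 values is only about 320 MB, so exact (brute-force) search is usually fast enough here before reaching for trees or hashing. A sketch assuming the pickled arrays have been loaded into a single matrix `X`, one vector per row:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = X.astype(np.float32)                              # (10_000, 8_000) matrix
nn = NearestNeighbors(n_neighbors=10, metric="cosine").fit(X)
dist, idx = nn.kneighbors(X[:5])                      # neighbours of the first 5 vectors
```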
