I am working on a multiclass classification task that assigns medical-related web search queries to hospital departments. My classifier is based on fastText. For most classes the result is good enough, say a recall of 0.8 for Nephrology. However, for just one department, Dermatology, the recall is pretty low, around 0.5. Unfortunately, this label has the most samples in the test data. How can I improve the recall of one class while maintaining the performance of …
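One thing that is often tried in this situation is to rebalance or upweight the low-recall class at training time. Below is a minimal sketch with the fasttext Python package, assuming the training data is in the usual "__label__X query text" format; the file names, the oversampling factor, and the example query are made up for illustration.

    # Sketch: oversample the low-recall class in the training file, then retrain.
    # "train.txt", the FACTOR value and the example query are hypothetical.
    import random
    import fasttext

    TARGET = "__label__Dermatology"   # the class whose recall we want to raise
    FACTOR = 3                        # duplicate its training lines this many times

    with open("train.txt") as f:
        lines = f.readlines()

    boosted = []
    for line in lines:
        boosted.append(line)
        if line.startswith(TARGET):
            boosted.extend([line] * (FACTOR - 1))
    random.shuffle(boosted)

    with open("train_boosted.txt", "w") as f:
        f.writelines(boosted)

    model = fasttext.train_supervised(input="train_boosted.txt", epoch=25, wordNgrams=2)
    print(model.predict("itchy red rash on forearm", k=3))   # top-3 labels with probabilities

The trade-off is that raising Dermatology's recall this way can pull down its precision and nudge the other classes, so it is worth re-checking the full confusion matrix after each change.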
I have a folder of about 60k PDF documents that I would like to learn to rank against queries, to surface the most relevant results. The goal is to surface and rank relevant documents, very much like a search engine. I understand that Learning to Rank is a supervised algorithm that requires features generated from query-document pairs. However, the problem is that none of them are labelled. How many queries should I have to even begin training the model?
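For what it is worth, the query-document features themselves are usually cheap to compute once labels exist; the open question really is the labels. A rough sketch of the kind of features an LTR model consumes (the feature choices, names, and counts here are illustrative, not a prescribed set):

    # Sketch: hand-rolled query-document features for an LTR model.
    import math
    from collections import Counter

    def qd_features(query, doc, n_docs, doc_freq):
        """Return a small, illustrative feature vector for one (query, document) pair."""
        q_terms = query.lower().split()
        d_terms = doc.lower().split()
        tf = Counter(d_terms)
        overlap = sum(1 for t in q_terms if t in tf)                        # raw term overlap
        tfidf = sum(tf[t] * math.log((1 + n_docs) / (1 + doc_freq.get(t, 0)))
                    for t in q_terms)                                       # tf-idf style score
        return [overlap, overlap / max(len(q_terms), 1), tfidf, len(d_terms)]

    print(qd_features("ranking pdf documents", "how to rank pdf documents by relevance",
                      n_docs=60_000, doc_freq={"pdf": 5_000, "documents": 20_000, "ranking": 800}))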
I have a large dataset of around 9 million people with names and addresses. Given quirks of the process used to obtain the data, it is highly likely that a person appears in the dataset more than once, with subtle differences between records. I want to identify a person and their 'similar' personas, with some sort of confidence metric for the alternative records identified. My initial thought on an approach is to vectorise each name and address as a …
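To make the vectorisation idea concrete, here is a small sketch (the records are invented and scikit-learn is an assumed dependency) that uses character n-gram TF-IDF over the combined name + address string and treats cosine similarity to the nearest neighbour as a rough confidence; at 9 million records you would add blocking or an approximate nearest-neighbour index on top of this.

    # Sketch: character n-gram TF-IDF over "name, address" strings, cosine similarity as confidence.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    records = [                                    # hypothetical records
        "john a smith, 12 high street, leeds",
        "jon smith, 12 high st, leeds",
        "mary jones, 4 oak lane, york",
    ]
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    X = vec.fit_transform(records)

    nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
    dist, idx = nn.kneighbors(X)
    for i in range(len(records)):
        j = idx[i, 1]                              # closest record other than itself
        print(records[i], "<->", records[j], "| similarity:", round(1 - dist[i, 1], 2))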
My question here is about best practices and current methods for selecting search models on the fly based on a user's query. Let's say I have four search models, each optimized for its respective type:
Model A: Embedding-based, used for sentence queries about scientific topics
Model B: Embedding-based, used for sentence queries about general news topics
Model C: TF*IDF-based, used for keyword queries about scientific topics
Model D: TF*IDF-based, used for keyword queries about general news topics
When …
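Purely to illustrate the "select on the fly" part, one common pattern is a lightweight router in front of the four models, using either hand-written heuristics or a small classifier over the query; the heuristics and toy lexicon below are invented placeholders, not a recommendation.

    # Sketch: route a query to one of the four models using two cheap signals.
    def is_sentence(query):
        return len(query.split()) >= 5                                # crude sentence-vs-keyword test

    def is_scientific(query):
        science_terms = {"protein", "quantum", "genome", "enzyme"}    # toy topic lexicon
        return any(t in query.lower() for t in science_terms)

    def route(query):
        if is_sentence(query):
            return "Model A" if is_scientific(query) else "Model B"
        return "Model C" if is_scientific(query) else "Model D"

    print(route("how does the enzyme fold the protein"))   # -> Model A
    print(route("genome sequencing"))                      # -> Model C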
I’ve recently been trying to compare the internet presence of a species to trend data I have collected. After reading a stack of papers on hit count estimates, I’m well aware that the number of results is, at best, an estimate. My question is, how far off would it be to compare terms that produce drastic differences in the number of results? For instance: “White-throated jay” OR “scientific name” yields roughly 7-14,00 results depending on the day/amount of time the query …
I am trying to build a search engine to query a folder of documents. Tutorials online suggest obtaining the vector of a document by averaging the vectors of all its words, then comparing its similarity to the vector of the query. May I know how the average of all the word vectors in a document retains the information of the individual words? Would it be better if I retrieved words similar to the query and checked whether these words …
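As a concrete illustration of what averaging does (and what it loses), here is a toy sketch with made-up 3-dimensional embeddings; a real pipeline would load pretrained vectors such as word2vec or GloVe instead.

    # Sketch: document vector = mean of word vectors; query-document similarity = cosine.
    # The embeddings are invented 3-d toys, only meant to show the mechanics.
    import numpy as np

    emb = {
        "cat":   np.array([0.9, 0.1, 0.0]),
        "dog":   np.array([0.8, 0.2, 0.0]),
        "stock": np.array([0.0, 0.1, 0.9]),
    }

    def doc_vector(tokens):
        vecs = [emb[t] for t in tokens if t in emb]
        return np.mean(vecs, axis=0) if vecs else np.zeros(3)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    doc   = doc_vector("the cat chased the dog".split())
    query = doc_vector("dog".split())
    print(cosine(doc, query))   # high: the average keeps the document's overall "topic direction"

The average keeps a coarse topic direction but discards word order and dilutes individual words, which is exactly why it can feel lossy.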
A simple problem with search engines is that you have to trust that they will not build a profile of the search queries you submit. (Without Tor or e.g. homomorphic encryption, that is.) Suppose we put together a search engine server with a use policy that permits paid customers to send a constant stream of queries. The search engine's client transmits, at some frequency, generated search queries (e.g. Markov, ML-generated, random dictionary words, sourced from news, whatever; up to you) in order to …
I recently learned that there is a benchmark called NQ. https://ai.google.com/research/NaturalQuestions/visualization Unlike other QA benchmarks, where the relevant document is provided with the query, here the model has to find the information in a corpus of millions of documents by itself. For example, if the question is "when are hops added to the brewing process?", another QA benchmark would provide only one document, about brewing, while NQ provides the whole Wikipedia text and the model has to find the most relevant document and the answer. When I tried all the examples in the …
I was trying to build a search system and then I got to know about Okapi BM25, which is a ranking function like tf-idf. You can make an index of your corpus and later retrieve documents similar to your query. I imported the Python library rank_bm25, created a search system, and the results were satisfying. Then I saw something called the Non-Metric Space Library (NMSLIB). I understood that it's a similarity search library, much like the kNN algorithm. I saw an example …
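For readers who want the concrete shape of the rank_bm25 part, this is roughly the pattern; the tiny corpus is made up, and the calls mirror the library's documented BM25Okapi usage.

    # Sketch: index a toy corpus with rank_bm25 and score a query against it.
    from rank_bm25 import BM25Okapi

    corpus = [
        "hops are added during the boil",
        "yeast ferments the sugars into alcohol",
        "barley is malted before brewing",
    ]
    tokenized = [doc.split() for doc in corpus]
    bm25 = BM25Okapi(tokenized)

    query = "when are hops added".split()
    print(bm25.get_scores(query))                 # one BM25 score per document
    print(bm25.get_top_n(query, corpus, n=1))     # best-matching document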
To calculate tf-idf we compute tf * idf, where tf = number of times the word occurs in the document. What is the formula for idf, and what log base is used?
1) log(number of documents / number of documents containing the word)
2) log((1 + number of documents) / (1 + number of documents containing the word))
3) 1 + log(number of documents / number of documents containing the word)
4) 1 + log((1 + number of documents) / (1 + number of documents containing the word))
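All four variants appear in practice: the plain log(N/df) is the textbook definition, and libraries add the +1 terms as smoothing so that unseen or ubiquitous words do not blow up or zero out. A quick side-by-side with made-up counts (natural log here; the base only rescales every score by the same constant):

    # Sketch: compute the four idf variants for one word, with invented counts.
    import math

    N = 10        # number of documents in the corpus
    df = 3        # number of documents containing the word

    variants = {
        "log(N/df)":              math.log(N / df),
        "log((1+N)/(1+df))":      math.log((1 + N) / (1 + df)),
        "1 + log(N/df)":          1 + math.log(N / df),
        "1 + log((1+N)/(1+df))":  1 + math.log((1 + N) / (1 + df)),  # scikit-learn's smooth_idf default
    }
    for name, value in variants.items():
        print(f"{name:26s} {value:.3f}")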
Video platforms like YouTube, Netflix, and Amazon Prime have excellent search systems: given a search string, find the most relevant videos. Which Machine Learning / Deep Learning techniques are used for this? Any pointers would be of great help.
We are currently developing a system on the MEAN stack with MongoDB at the backend. We have employees' names and IDs in our system, and our client wants pretty good (read: Google-like) search to look up employees' records. He needs our system to suggest employees even if he has misspelled the name, etc. One of the suggestions from our development lead was that we should use Elasticsearch, but from what I have seen, Elasticsearch …
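In case it helps the evaluation, this is the kind of typo-tolerant query Elasticsearch supports out of the box; the sketch assumes a local cluster, an already-populated "employees" index with a "name" field, and a recent elasticsearch-py client.

    # Sketch: fuzzy name lookup with Elasticsearch's match query (index, field and query are assumed).
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    resp = es.search(
        index="employees",
        query={"match": {"name": {"query": "jhon smiht",      # misspelled on purpose
                                  "fuzziness": "AUTO"}}},      # tolerate small edit distances
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_source"]["name"], hit["_score"])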
I have a question answering system that uses a Seq2Seq-style architecture; actually, it is a transformer architecture. When a question is asked, it gives the start position and end position of the answer along with their logits. The answer is formed by choosing the best logit span, and the final probability is calculated by summing the start and end logits. Now the problem is that I have multiple answers, and many times the good answer is in 2nd or 3rd place (after sorting …
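For reference, the span scoring being described is usually implemented like this: every valid (start, end) pair gets the score start_logit + end_logit, and the top k spans are kept as candidates rather than only the single best one. A sketch with invented logits:

    # Sketch: rank all valid answer spans by start_logit + end_logit and keep the top k.
    # The logits are made up; max_len bounds the allowed span length.
    import numpy as np

    start_logits = np.array([2.0, 0.5, 3.1, 0.2])
    end_logits   = np.array([0.1, 2.5, 0.4, 3.0])
    max_len, top_k = 3, 3

    spans = []
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            spans.append((s, e, float(start_logits[s] + end_logits[e])))

    spans.sort(key=lambda t: t[2], reverse=True)
    print(spans[:top_k])    # the 2nd/3rd spans here are the candidates worth re-ranking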
For a phrase searching algorithm, imagine the goal is to search for a name phrase and return matched results based on a pre-defined threshold. For example, searching for "Jon Smith" could return "Jon Smith", "Jonathan Smith", "Jonathan David Smith", "Jonathan Smith-Mikel", "Jonathan 'Smith' Mikel", etc. The plan is to manually choose N test cases and put them in a benchmark database. I have concerns about this plan because the test cases are unlikely to be exhaustive. I know there …
I'm looking for scalable tools to build a kNN graph over sparse data points. The dimension and the number of data points can both be up to millions. What I have tried already:
- sklearn.neighbors.kneighbors_graph: does brute-force search for sparse data, giving quadratic time
- flann: only supports dense arrays
- pysparnn: the running time is not very satisfactory (maybe because it's written in Python)
- knn search in mlpack: only supports dense data
- scipy.spatial.KDTree: converts the sparse data to dense
- SparseLSH: …
I have always been fascinated by Google's search ability: a great achievement by Google and the other search engine providers, but even more a collective human talent that makes me appreciate our amazing minds and our potential to innovate. I use Google search daily and am sometimes disappointed by the very few words that give no results, which I accept to an extent. One of these instances led me to do a further "mini" investigation/test of Google's …
While designing a search system that searches over N identifiable categories, how many search queries does one need in each category to validate the target metric (DCG) scores accurately (balancing variance and bias)? Does this number depend on N, on the corpus size, or on both? Please point to any relevant publications. I would also like to understand whether effect size and Bayesian effective sample sizes play a role here. Given a set of search queries Q for retrieving documents from …
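For concreteness, here is one common formulation of the DCG@k being validated (graded relevance with exponential gain and a log2 position discount); the relevance labels in the example are made up.

    # Sketch: DCG@k for one query, given graded relevance labels of the ranked results.
    import math

    def dcg_at_k(rels, k):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ranked_relevance = [3, 2, 3, 0, 1]      # judged relevance of the top 5 results (invented)
    print(dcg_at_k(ranked_relevance, k=5))

The per-query DCG values are the samples whose variance ultimately determines how many queries per category are needed for a stable estimate.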
I want to build a large searchable database of documents (news articles), such that when adding a new article I can quickly find the X most similar articles to it. What is the right tech/algorithm/Python framework to approach this?
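As a baseline sketch of one workable stack (scikit-learn, with placeholder articles): build a TF-IDF index over the existing articles, then do a nearest-neighbour lookup for each new one. At larger scale an ANN index such as FAISS, or a search engine such as Elasticsearch, usually replaces the brute-force lookup.

    # Sketch: TF-IDF index over existing articles, nearest-neighbour lookup for a new one.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    articles = [                                    # placeholder articles
        "central bank raises interest rates",
        "new vaccine shows strong trial results",
        "rates held steady amid inflation fears",
    ]
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(articles)
    nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)

    new_article = "interest rates expected to rise again"
    dist, idx = nn.kneighbors(vec.transform([new_article]))
    print([articles[i] for i in idx[0]])            # the X most similar existing articles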
As mentioned in the title, I am attempting to search through 10,000 vectors with 8,000 features each, all in Python. Currently I have the vectors saved in their own directories as pickled NumPy arrays. The features were pulled from this deep neural network. I am new to this space, but I have heard about M-Trees, R-Trees, inverted tables, and hashing. Is any one of them better for such a large number of features? This implementation needs to be done quite …
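Before any of the tree or hashing structures, it may be worth checking whether brute force is already fast enough at this size: 10,000 × 8,000 float32 values is about 320 MB, and a single matrix-vector product scores every vector. A sketch with random data standing in for the pickled arrays:

    # Sketch: brute-force cosine search over a 10,000 x 8,000 matrix
    # (random data as a stand-in for the pickled feature vectors).
    import numpy as np

    X = np.random.rand(10_000, 8_000).astype(np.float32)
    X /= np.linalg.norm(X, axis=1, keepdims=True)          # normalise rows once

    query = np.random.rand(8_000).astype(np.float32)
    query /= np.linalg.norm(query)

    scores = X @ query                                     # cosine similarity for every vector
    top10 = np.argsort(-scores)[:10]
    print(top10, scores[top10])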