How can I improve the recall of a certain class in a multiclass-classification result

I am working on a multiclass classification task that assigns medical web-search queries to hospital departments. My classifier is based on fastText. For most departments the results are good enough, say a recall of 0.8 for Nephrology. However, for just one department, Dermatology, the recall is pretty low, around 0.5. Unfortunately, this label has the most samples in the test data. How can I improve the recall of one class while maintaining the performance of …
Category: Data Science
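A minimal sketch of two levers that are often tried for this: inspect the per-class numbers, then lower the acceptance threshold for the weak class. Everything here is illustrative; `y_true`, `proba`, the class names, and the threshold value are placeholders for your own test data and model outputs.

```python
import numpy as np
from sklearn.metrics import classification_report

# Placeholder inputs: gold labels and an (n_samples, n_classes) array of
# predicted probabilities collected from the classifier.
classes = np.array(["cardiology", "dermatology", "nephrology"])  # illustrative
y_pred = classes[proba.argmax(axis=1)]
print(classification_report(y_true, y_pred))  # per-class precision/recall

# Lower the acceptance threshold for the weak class so borderline queries
# fall into Dermatology more often. Tune on a validation split, not the test
# set, and watch the precision of the other departments.
derm = np.where(classes == "dermatology")[0][0]
THRESH = 0.35  # hypothetical value
y_adj = np.where(proba[:, derm] >= THRESH, "dermatology", y_pred)
print(classification_report(y_true, y_adj))
```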

Learning to Rank with Unlabelled Dataset

I have a folder of about 60k PDF documents that I would like to learn to rank based on queries, to surface the most relevant results. The goal is to surface and rank relevant documents, very much like a search engine. I understand that Learning to Rank is a supervised algorithm that requires features generated from query-document pairs. However, the problem is that none of the documents are labelled. How many queries should I have to even begin training the model?
Category: Data Science
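One approach people use when no human labels exist is weak supervision: score query-document pairs with a cheap unsupervised ranker such as BM25 and treat its top results as pseudo-relevant training data for the Learning to Rank model. A rough sketch under that assumption; `docs` (text already extracted from the PDFs) and `queries` are placeholders.

```python
from rank_bm25 import BM25Okapi

# Placeholder corpus: plain text extracted from each PDF.
tokenized = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized)

# Build (query, doc_id, graded relevance) triples as pseudo-labels for LTR.
pseudo_labels = []
for q in queries:
    scores = bm25.get_scores(q.lower().split())
    top = scores.argsort()[::-1][:10]                   # ten highest-scoring documents
    for rank, doc_id in enumerate(top):
        pseudo_labels.append((q, int(doc_id), 10 - rank))  # crude graded label
```

The number of queries then becomes an empirical question: hold some queries out and see whether validation ranking metrics keep improving as more queries are added.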

Best way to vectorise names and addresses for similarity searching?

I have a large dataset of around 9 million people with names and addresses. Given quirks of the process used to get the data, it is highly likely that a person is in the dataset more than once, with subtle differences between each record. I want to identify a person and their 'similar' personas, with some sort of confidence metric for the alternative records identified. My initial thought on an approach is to vectorise each name and address as a …
Category: Data Science
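A common baseline for this kind of fuzzy record linkage is character n-gram TF-IDF vectors plus cosine nearest-neighbour search. Sketch only; `records` stands in for a list of "name, address" strings and the parameters are chosen arbitrarily.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vec.fit_transform(records)        # sparse matrix, one row per record

nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(X)
dist, idx = nn.kneighbors(X[:100])    # candidate duplicates for the first 100 rows
confidence = 1.0 - dist               # cosine similarity as a rough confidence score
```

At 9 million records an exact search like this gets slow, so people usually block on a coarse key (e.g. postcode) first or switch to an approximate index such as NMSLIB's HNSW, which comes up elsewhere on this page.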

Best methods to choose between different searching models?

My question here is in regards to best practices and current methods for selecting search models on the fly based on a user's query. Let's say I have four searching models, each optimized for its respective query type: Model A: embedding-based, used for sentence queries about scientific topics; Model B: embedding-based, used for sentence queries about general news topics; Model C: TF*IDF-based, used for keyword queries about scientific topics; Model D: TF*IDF-based, used for keyword queries about general news topics. When …
Category: Data Science
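Purely as an illustration of one way to do the dispatch, a hypothetical two-stage router: a small text classifier decides the topic, a heuristic decides keyword vs. sentence, and the pair picks the model. `topic_clf`, the model names, and the length heuristic are all made up.

```python
def route(query: str, topic_clf) -> str:
    """Return the name of the search model to use for this query."""
    topic = topic_clf.predict([query])[0]              # e.g. "science" or "news"
    style = "sentence" if len(query.split()) > 4 else "keyword"
    table = {
        ("science", "sentence"): "model_a",
        ("news", "sentence"): "model_b",
        ("science", "keyword"): "model_c",
        ("news", "keyword"): "model_d",
    }
    return table[(topic, style)]
```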

Is this a potentially acceptable way to compare Google result quantities?

I’ve recently been trying to compare the internet presence of a species to trend data I have collected. After reading a stack of papers on hit count estimates, I’m well aware that the number of results is, at best, an estimate. My question is, how far off would it be to compare terms that produce drastic differences in the number of results? For instance: “White-throated jay” OR “scientific name” yields roughly 7-14,00 results depending on the day/amount of time the query …
Category: Data Science

Why do we calculate the vector of a document by averaging the vectors of all the words?

I am trying to build a search engine to query a folder of documents. Tutorials online suggest obtaining the vector of a document by averaging the vectors of all its words, then comparing that to the vector of the query for similarity. How does the average of all the word vectors in a document retain the information of the individual words? Would it be better if I retrieved words similar to the query and checked if these words …
Category: Data Science
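For reference, this is roughly what the averaging approach in those tutorials looks like; `w2v` is assumed to be any word-to-vector mapping (e.g. loaded gensim KeyedVectors), and `query`/`documents` are placeholders.

```python
import numpy as np

def doc_vector(text, w2v, dim=300):
    """Average the vectors of the words we have embeddings for."""
    vecs = [w2v[w] for w in text.lower().split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

qv = doc_vector(query, w2v)
scores = [cosine(qv, doc_vector(d, w2v)) for d in documents]  # rank documents by score
```

The average does discard word order and per-word detail; it only preserves the rough topical direction of the document in the embedding space, which is why it works as a coarse similarity signal.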

Could you generate search queries to poison data analysis by a search engine?

A simple problem with search engines is that you have to trust that they will not build a profile of the search queries you submit. (Without Tor or e.g. homomorphic encryption, that is.) Suppose we put together a search-engine server with a use policy that permits constant queries being sent by paid customers. The search engine's client transmits, at some frequency, generated search queries (e.g. Markov-chain, ML-generated, random dictionary words, sourced from news, whatever; up to you) in order to …
Category: Data Science

About the Natural Questions (NQ) benchmark in NLP

I recently learned that there is a benchmark called NQ. https://ai.google.com/research/NaturalQuestions/visualization Unlike other QA benchmarks, where the relevant document is provided with the query, here the model has to find the information from a corpus of millions of documents by itself. For example, if the question is "when are hops added to the brewing process?", another QA benchmark would provide only one document about brewing, while NQ provides the whole Wikipedia text and the model has to find the most relevant document and the answer. When I tried all the examples in the …
Category: Data Science

What is the difference between Okapi BM25 and NMSLIB?

I was trying to make a search system and then I got to know about Okapi BM25, which is a ranking function like tf-idf. You can make an index of your corpus and later retrieve documents similar to your query. I imported the Python library rank_bm25, created a search system, and the results were satisfying. Then I saw something called the Non-Metric Space Library (NMSLIB). I understood that it's a similarity-search library, much like the kNN algorithm. I saw an example …
Category: Data Science
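They sit at different layers of a search system: BM25 scores documents for a query directly from term statistics, while NMSLIB only finds nearest neighbours of whatever vectors you hand it, so it needs an embedding or TF-IDF step first. A hedged sketch of both, with `corpus` (list of strings), `doc_vectors`, and `query_vector` (float32 numpy arrays) as placeholders:

```python
from rank_bm25 import BM25Okapi
import nmslib

# 1) BM25: lexical ranking straight from term statistics, no vectors involved.
bm25 = BM25Okapi([d.lower().split() for d in corpus])
scores = bm25.get_scores("query terms".split())

# 2) NMSLIB: approximate nearest-neighbour search over precomputed vectors.
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(doc_vectors)
index.createIndex({"post": 2})
ids, dists = index.knnQuery(query_vector, k=10)
```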

What is the formula and log base for idf?

To calculate tf-idf we compute tf * idf, where tf = the number of times the word occurs in the document. What is the formula for idf, and which log base is used? Is it:
(1) log(number of documents / number of documents containing the word)
(2) log((1 + number of documents) / (1 + number of documents containing the word))
(3) 1 + log(number of documents / number of documents containing the word)
(4) 1 + log((1 + number of documents) / (1 + number of documents containing the word))
Category: Data Science
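For what it's worth, there is no single canonical formula; both the variant and the log base differ between implementations. The classic textbook definition is variant (1), while scikit-learn's TfidfVectorizer uses natural log and, with its default smooth_idf=True, variant (4), falling back to variant (3) when smooth_idf=False. A small sketch of the three side by side:

```python
import numpy as np

n, df = 1000, 10                               # corpus size, docs containing the term

idf_textbook = np.log(n / df)                  # (1): log base is a convention (often 10 or e)
idf_sklearn = np.log((1 + n) / (1 + df)) + 1   # (4): scikit-learn default (smooth_idf=True)
idf_unsmoothed = np.log(n / df) + 1            # (3): scikit-learn with smooth_idf=False
```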

Is Elasticsearch recommended if the attribute being searched is not a huge text document?

We are currently developing a system on the MEAN stack with MongoDB at the backend. We have employee names and IDs in our system, and our client wants a pretty good (read: Google-like) search to look up employees' records. He needs our system to recommend employees even if he has misspelled the name, etc. One of the suggestions from our development lead was that we should use Elasticsearch, but from what I have seen, Elasticsearch …
Category: Data Science
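Elasticsearch is commonly used for exactly this kind of short-field, typo-tolerant lookup; the tolerance comes from fuzzy matching rather than from the documents being long. A hedged sketch with the official Python client (7.x-style call; the index name and field are hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(
    index="employees",                   # hypothetical index of employee records
    body={
        "query": {
            "match": {
                "name": {
                    "query": "jonh smiht",   # misspelled input
                    "fuzziness": "AUTO",     # tolerate edit-distance typos
                }
            }
        }
    },
)
hits = [h["_source"] for h in resp["hits"]["hits"]]
```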

Measuring quality of answers from QnA systems

I have a question-answering system that uses a Seq2Seq-style architecture; actually, it is a transformer architecture. When a question is asked, it gives the start position and end position of the answer along with their logits. The answer is formed by choosing the best-scoring span, and the final probability is calculated by summing the start and end logits. Now the problem is that I have multiple answers, and many times the good answer is in 2nd or 3rd place (after sorting …
Category: Data Science
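A sketch of the usual span-selection step, assuming `start_logits` and `end_logits` are 1-D numpy arrays for one passage: score every valid span (start before end, bounded length) by the sum of its logits and keep the top-k candidates rather than only the argmax, so they can be re-ranked or calibrated afterwards.

```python
import numpy as np

def top_spans(start_logits, end_logits, k=5, max_len=30):
    """Return the k highest-scoring (score, start, end) answer spans."""
    spans = []
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            spans.append((float(start_logits[s] + end_logits[e]), s, e))
    spans.sort(reverse=True)
    return spans[:k]
```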

An exhaustive, representative test database for a phrase search algorithm

For a phrase-searching algorithm, imagine the goal is to search for a name phrase and return matched results based on a pre-defined threshold. For example, searching for "Jon Smith" could return "Jon Smith", "Jonathan Smith", "Jonathan David Smith", "Jonathan Smith-Mikel", "Jonathan 'Smith' Mikel", etc. The plan is to manually choose N test cases and put them in a benchmark database. I have concerns about this plan because the test cases are unlikely to be exhaustive. I know there …
Category: Data Science
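One way to reduce the hand-picking bias is to generate variant and negative pairs programmatically and score them with the same similarity measure the matcher uses. Purely illustrative; difflib's ratio is only a stand-in for the actual phrase-matching score and the candidate list is made up.

```python
from difflib import SequenceMatcher

query = "Jon Smith"
candidates = ["Jon Smith", "Jonathan Smith", "Jonathan David Smith",
              "Jonathan Smith-Mikel", "John Smythe", "Joan Smits"]

for cand in candidates:
    score = SequenceMatcher(None, query.lower(), cand.lower()).ratio()
    print(f"{cand:25s} {score:.2f}")   # compare each score against the threshold
```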

Scalable tools to build a kNN graph over sparse data

I'm looking for scalable tools to build a kNN graph over sparse data points. Both the dimension and the number of data points can be up to millions. What I have tried already:
- sklearn.neighbors.kneighbors_graph: does brute-force search for sparse data, giving quadratic time
- flann: only supports dense arrays
- pysparnn: the running time is not very satisfactory (maybe because it's written in Python)
- knn search in mlpack: only supports dense data
- scipy.spatial.KDTree: converts the sparse data to dense
- SparseLSH: …
Category: Data Science
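For completeness, NMSLIB (which comes up elsewhere on this page) has a sparse cosine space and an HNSW index that might belong on that list. A hedged sketch, with `X` standing in for a scipy.sparse CSR matrix of the data points and the index parameters chosen arbitrarily:

```python
import nmslib

index = nmslib.init(method="hnsw",
                    space="cosinesimil_sparse",
                    data_type=nmslib.DataType.SPARSE_VECTOR)
index.addDataPointBatch(X)                          # X: scipy.sparse CSR matrix
index.createIndex({"M": 16, "efConstruction": 100})

# kNN graph: query the index with the same points, in parallel.
neighbours = index.knnQueryBatch(X, k=10, num_threads=4)
```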

Can Google really bring back billions of results in the blink of an eye (almost)?

I was always fascinated by Google's search ability, a great achievement by Google and other search engine providers, but more so a collective human talent and ability that makes me appreciate our amazing mind and our potential to innovate. I use Google search daily, and I am sometimes disappointed when a very few words give no results, which I accept to an extent. One of these instances led me to do a further "mini" investigation/test on Google's …
Category: Data Science

Search Query Sample Size Determination for validation set

While designing a search system that searches in N identifiable categories, how many search queries does one need in each category to validate the target metric (DCG) scores accurately (balanced variance and bias)? Does this number depend on N, on the corpus size, or on both? Please add any publications if possible. I would also like to understand whether effect size and Bayesian effective sample size play some role here. Given a set of search queries Q for retrieving documents from …
Category: Data Science
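One empirical way to get a feel for the number, separate from any formal power analysis: bootstrap the per-query DCG scores you already have and watch how the confidence interval of the category mean shrinks as the number of queries grows. Sketch only; `dcg_scores` is a placeholder 1-D array of per-query DCG values for one category.

```python
import numpy as np

rng = np.random.default_rng(0)

def ci_width(scores, n_queries, n_boot=2000):
    """Width of a bootstrap 95% CI for the mean DCG at a given query count."""
    means = [rng.choice(scores, size=n_queries, replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [2.5, 97.5])
    return hi - lo

for n in (25, 50, 100, 200):
    print(n, round(ci_width(dcg_scores, n), 3))  # stop increasing n once the width is acceptable
```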

Best method for similarity searching on 10,000 data points with 8,000 features each in Python?

As mentioned in the title, I am attempting to search through 10,000 vectors with 8,000 features each, all in Python. Currently I have the vectors saved in their own directories as pickled numpy arrays. The features were pulled from this deep neural network. I am new to this space, but I have heard about M-trees, R-trees, inverted tables, and hashing. Is any one of them better for such a large number of features? This implementation needs to be done quite …
Category: Data Science
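For scale: 10,000 x 8,000 float32 values is only about 320 MB, so exact (brute-force) search is usually fast enough here before reaching for trees or hashing. A sketch assuming the pickled arrays have been loaded into a single matrix `X`, one vector per row:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = X.astype(np.float32)                              # (10_000, 8_000) matrix
nn = NearestNeighbors(n_neighbors=10, metric="cosine").fit(X)
dist, idx = nn.kneighbors(X[:5])                      # neighbours of the first 5 vectors
```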
