I'm given a large collection of documents on which I should perform various kinds of analysis. Since the documents are to be used as the foundation of a final product, I thought about building a graph out of this text corpus, with each document corresponding to a node. One way to build the graph would be to use a model such as USE to first compute text embeddings, and then form a link between two nodes (texts) whose similarity is beyond …
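A minimal sketch of that thresholded-similarity graph, assuming TensorFlow Hub's Universal Sentence Encoder and NetworkX (the document texts and the 0.7 threshold are placeholders to tune on the actual corpus):

```python
import numpy as np
import networkx as nx
import tensorflow_hub as hub

# Load the Universal Sentence Encoder (USE) from TF Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

documents = ["first document ...", "second document ...", "third document ..."]
embeddings = embed(documents).numpy()                        # shape (n_docs, 512)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = embeddings @ embeddings.T                       # cosine similarity matrix

THRESHOLD = 0.7                                              # placeholder value
graph = nx.Graph()
graph.add_nodes_from(range(len(documents)))
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        if similarity[i, j] >= THRESHOLD:
            graph.add_edge(i, j, weight=float(similarity[i, j]))
```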
Given a corpus of product descriptions (say, vacuum cleaners), I'm looking for a way to group the documents that are all of the same type (where a type can be cordless vacuum, shampooer, carpet cleaner, industrial vacuum, etc.). The approach I'm exploring is to use NER. I'm labeling a set of these documents with tags such as (KIND, BRAND, MODEL). The theory is that I'd then run new documents through the model, and the tokens corresponding to those tags would …
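As a rough sketch of the inference side of that idea, assuming a spaCy pipeline that has already been fine-tuned with the custom KIND / BRAND / MODEL labels (the model path "custom_ner_model" and the example texts are hypothetical):

```python
from collections import defaultdict
import spacy

# Hypothetical path to a spaCy pipeline fine-tuned with KIND / BRAND / MODEL labels
nlp = spacy.load("custom_ner_model")

documents = ["Acme X200 cordless vacuum with 40-minute runtime ...",
             "TurboClean carpet cleaner for deep stains ..."]

groups = defaultdict(list)
for text in documents:
    doc = nlp(text)
    kinds = [ent.text.lower() for ent in doc.ents if ent.label_ == "KIND"]
    # Use the first KIND entity (if any) as the document's type
    key = kinds[0] if kinds else "unknown"
    groups[key].append(text)
```

Grouping on the KIND entity text (possibly after some normalization) would then give the type buckets.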
There is a problem we are trying to solve where we want to do semantic search on our data, i.e., we have domain-specific data (for example, sentences talking about automobiles). Our data is just a collection of sentences, and given a query phrase we want to get back the sentences that are similar to the phrase, have a part that is similar to the phrase, or have a contextually similar meaning. Let …
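One way this is often sketched, assuming the sentence-transformers library and its general-purpose all-MiniLM-L6-v2 model (not tuned for the automobile domain; the sentences are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["The sedan has a turbocharged four-cylinder engine.",
             "Brake pads should be replaced every 50,000 km.",
             "The dealership offers free oil changes for a year."]
corpus_emb = model.encode(sentences, convert_to_tensor=True)

query = "engine performance"
query_emb = model.encode(query, convert_to_tensor=True)

# Rank all sentences by cosine similarity to the query phrase
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(round(hit["score"], 3), sentences[hit["corpus_id"]])
```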
How to do template matching without OpenCV? I have order invoices belonging to Amazon, eBay, Flipkart, and SnapDeal, and I want to extract a small amount of information from each invoice. Since fields like the order number, customer name, and order details are at different positions in these 4 templates, I first need to classify which of the 4 templates the input image belongs to, and after identifying the template I can do my next work …
To measure the similarity between two documents, one can use, e.g., TF-IDF with cosine similarity. Suppose that after calculating the similarity scores of Doc A against a list of documents (Doc B, Doc C, ...), we got:

Document Pair      Similarity Score
Doc A vs. Doc B    0.45
Doc A vs. Doc C    0.30
Doc A vs. ...      ...

Of course, Doc B seems to be the closest one to Doc A in terms of similarity. But what if users, as humans, think Doc …
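For reference, a minimal sketch of how such scores are usually computed with scikit-learn (the document texts are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "text of Doc A ..."
candidates = {"Doc B": "text of Doc B ...", "Doc C": "text of Doc C ..."}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([doc_a] + list(candidates.values()))

# First row is Doc A; remaining rows are the candidate documents
scores = cosine_similarity(matrix[0], matrix[1:])[0]
for name, score in zip(candidates, scores):
    print(f"Doc A vs. {name}: {score:.2f}")
```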
I have a task to provide semantic search capabilities. For example, if I have a dataset of resumes and I search for "machine learning", it should return all resumes with data-science-related skills even if the exact "machine learning" keyword is missing. How do we search the data by its meaning and related keywords? I have looked at algorithms such as LSA, LDA, and LSI, but cannot find a resource that shows how to implement them.
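For a concrete starting point, here is a rough sketch of LSA/LSI-based retrieval with gensim (LsiModel is gensim's LSA implementation; the resume texts and num_topics value are placeholders):

```python
from gensim import corpora, models, similarities

resumes = ["built regression and clustering models in scikit-learn",
           "deep learning with pytorch and tensorflow",
           "managed accounts payable and invoicing"]
tokenized = [r.lower().split() for r in resumes]

dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# LSI (gensim's LSA) projects bag-of-words vectors into a latent semantic space
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[bow_corpus], num_features=lsi.num_topics)

query = "machine learning"
query_lsi = lsi[dictionary.doc2bow(query.lower().split())]
for doc_id, score in sorted(enumerate(index[query_lsi]), key=lambda x: -x[1]):
    print(round(float(score), 3), resumes[doc_id])
```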
I have a corpus of 23000 documents that need to be classified into 5 different categories. I do not have any labeled data available, just free-form text documents and the label names (yes, one-word labels, not topics). So I followed a 2-step approach: (1) synthetically generate labeled data (using a rule-based labeling approach; the recall is obviously very low, with only ~1/8 of the documents getting labeled), and (2) somehow use this labeled data to identify labels for the other documents. I have attempted the following approaches for …
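A hedged sketch of one way to do step 2: train a simple TF-IDF + logistic regression classifier on the rule-labeled subset and accept only its confident predictions on the rest (the example texts, category names, and the 0.8 confidence threshold are all placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder data: the ~1/8 rule-labeled documents and the remaining unlabeled ones
labeled_texts = ["refund requested for a damaged item",
                 "battery drains too quickly on this device"]
labels = ["returns", "product_issue"]          # hypothetical category names
unlabeled_texts = ["the charger stopped working after a week",
                   "I want my money back for this order"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(labeled_texts, labels)

proba = clf.predict_proba(unlabeled_texts)
pred = clf.classes_[proba.argmax(axis=1)]
confident = proba.max(axis=1) >= 0.8           # only keep confident predictions
for text, label, keep in zip(unlabeled_texts, pred, confident):
    if keep:
        print(label, "->", text)
```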
I have 100 sentences that I want to cluster based on similarity. I've used doc2vec to vectorize the sentences into 20-dimensional vectors and applied k-means to cluster them. I haven't got the desired results yet. I've read that doc2vec performs well only on large datasets. I want to know whether increasing the length of each data sample would compensate for the low number of samples and help the model train better. For example, if my sentences are originally "making …
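For context, a minimal sketch of the doc2vec + k-means pipeline being described, assuming gensim and scikit-learn (the sentences and hyperparameters are placeholders; with only 100 sentences the epochs and min_count settings would need experimentation):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

sentences = ["making a chocolate cake", "baking bread at home", "changing a car tire"]
tagged = [TaggedDocument(words=s.lower().split(), tags=[i])
          for i, s in enumerate(sentences)]

# 20-dimensional vectors as in the question; epochs/min_count are guesses
model = Doc2Vec(tagged, vector_size=20, min_count=1, epochs=100)
vectors = [model.dv[i] for i in range(len(sentences))]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)                          # cluster id per sentence
```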
I want to create a document ranking model that returns similar rows in the dataset for a sample query. The text in this corpus is standard English but has no labels (i.e., no query-to-relevant-document structure). Is it possible to use a pretrained model trained on a large corpus (like BERT or word2vec) directly on the scraped dataset, without any evaluation, and get decent results? If not, is training a model on the MS MARCO dataset …
I know how to classify images using a CNN, but I have a problem where a single PDF file contains multiple types of scanned documents on different pages. Some document types span multiple pages inside the PDF. I have to classify and return which documents are present and the page numbers on which they appear in the PDF. If a scanned document spans multiple pages, I should return the range of page numbers, like "1 - 10". …
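The per-page classification can stay a standard CNN; the extra step is grouping consecutive pages with the same predicted class into ranges. A small self-contained sketch of that grouping, assuming one predicted label per page is already available (the labels below are made up):

```python
def pages_to_ranges(page_labels):
    """Turn per-page predictions into (label, 'start - end') ranges."""
    ranges = []
    start = 0
    for i in range(1, len(page_labels) + 1):
        if i == len(page_labels) or page_labels[i] != page_labels[start]:
            first, last = start + 1, i                     # 1-based page numbers
            span = str(first) if first == last else f"{first} - {last}"
            ranges.append((page_labels[start], span))
            start = i
    return ranges

# e.g. pages 1-2 are an invoice, pages 3-5 a contract, page 6 an invoice again
print(pages_to_ranges(["invoice", "invoice", "contract", "contract", "contract", "invoice"]))
# [('invoice', '1 - 2'), ('contract', '3 - 5'), ('invoice', '6')]
```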
I'm storing sentences in Elasticsearch as a dense_vector field and use BERT for the embeddings, so each vector has 768 dimensions. Elasticsearch offers similarity function options such as Euclidean, Manhattan, and cosine similarity. I have tried them, and both Manhattan and cosine give me very similar, good results, and now I don't know which one I should choose.
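For concreteness, here is roughly how the two options can be expressed as script_score queries, assuming the elasticsearch-py 8.x client, a dense_vector field named "embedding", and an index named "sentences" (the field and index names are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")     # assumed local cluster
query_vector = [0.1] * 768                      # placeholder BERT sentence embedding

def search_by_script(script_source):
    return es.search(index="sentences", query={
        "script_score": {
            "query": {"match_all": {}},
            "script": {"source": script_source,
                       "params": {"query_vector": query_vector}},
        }
    })

# Cosine similarity, shifted by 1 so scores stay non-negative
cosine_hits = search_by_script("cosineSimilarity(params.query_vector, 'embedding') + 1.0")
# Manhattan (L1) distance turned into a similarity score
manhattan_hits = search_by_script("1 / (1 + l1norm(params.query_vector, 'embedding'))")
```

One practical difference worth noting: cosine similarity ignores vector magnitude, while Manhattan (L1) distance does not, so if the BERT vectors are not length-normalized the two can diverge more than the small ranking differences observed so far.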
I followed gensim's Core Tutorial and built an LSA classification, topic modeling, and document similarity model for the newsgroups dataset. My code is available here. I need help with the 3 areas below. Topic classification: I get only 50% accuracy with the KNN algorithm. Topic modeling: the words highlighted for each of the 20 topics don't stand out. Document similarity: I wrote a small test and found that document similarity also doesn't produce great results. I am going to follow it up …
I have a huge dataset (>10M) of text files that I am trying to de-duplicate, not only in terms of trivial duplicates but also "near-duplicates", given some similarity threshold. I know that an LSH (locality-sensitive hashing) algorithm would be a good option, but I don't know how to tackle the last phase of the processing. Currently, I have the following steps: (1) generate signatures for all of the text files, (2) compute the hashes (perform the LSH), (3) group documents from the same bucket & hash …
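One way to sketch that last phase: treat documents that share a bucket only as candidate pairs, verify each pair with an exact Jaccard similarity on its shingle sets, and union the confirmed pairs into duplicate groups (the 0.8 threshold is a placeholder):

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def find(parent, x):
    # Union-find with path halving
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def dedupe(buckets, shingles, threshold=0.8):
    """buckets: {bucket_id: [doc_id, ...]}; shingles: {doc_id: set_of_shingles}."""
    parent = {d: d for d in shingles}
    for docs in buckets.values():
        for a, b in combinations(docs, 2):              # candidate pairs only
            if jaccard(shingles[a], shingles[b]) >= threshold:
                parent[find(parent, a)] = find(parent, b)
    groups = {}
    for d in shingles:
        groups.setdefault(find(parent, d), []).append(d)
    return list(groups.values())                        # each list is a duplicate cluster
```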
I'm trying to determine document similarity using Doc2Vec on a large collection of legal opinions, which can contain highly jargon-heavy language and phrases (e.g. en banc, de novo, etc.). I'm wondering if anyone has thoughts on what criteria, if any, I should consider for how to treat compound words/phrases in Doc2Vec for the purpose of calculating similarity. Were I just using tf-idf or something more straightforward, I'd consider going through each phrase and combining the words manually during …
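One alternative to manual merging is to let a collocation model join frequent multi-word expressions into single tokens before Doc2Vec training. A rough sketch with gensim's Phrases (the min_count/threshold values and tiny corpus are placeholders; rarer legal phrases would only be joined if they co-occur often enough in the real corpus):

```python
from gensim.models.phrases import Phrases, Phraser
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

opinions = ["the court sitting en banc reviewed the ruling de novo",
            "the panel reviewed the question de novo on appeal"]
tokenized = [o.lower().split() for o in opinions]

# Learn frequent bigrams and join them into single tokens, e.g. "de novo" -> "de_novo"
bigrams = Phraser(Phrases(tokenized, min_count=1, threshold=1))
phrased = [bigrams[tokens] for tokens in tokenized]
print(phrased)

# Feed the phrase-merged tokens into Doc2Vec as usual
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(phrased)]
model = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=50)
```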
I have two different sets of documents, S1 and S2, with 30 text documents each. Using some text representation method such as tf-idf and a distance measure such as cosine similarity, I want to match similar documents across the two sets S1 and S2. For example, D1 from S1 is similar (say 0.36) to D28 from S2. My problem is that TfidfVectorizer() creates an array of shape (30, 5000) for S1 and (30, 4500) for S2, with 30 rows for each …
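The usual fix is to fit a single vectorizer on both sets so the two matrices share the same columns, and only then compare them. A small sketch with scikit-learn (the document lists are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

s1 = ["document text ..."] * 30   # placeholder for the 30 documents in S1
s2 = ["document text ..."] * 30   # placeholder for the 30 documents in S2

vectorizer = TfidfVectorizer()
vectorizer.fit(s1 + s2)           # one shared vocabulary for both sets
m1 = vectorizer.transform(s1)     # shape (30, V)
m2 = vectorizer.transform(s2)     # shape (30, V), same V

sim = cosine_similarity(m1, m2)   # sim[i, j] = similarity of S1 doc i to S2 doc j
best_match = sim.argmax(axis=1)   # most similar S2 document for each S1 document
```

Alternatively, fit only on S1 and just transform S2 with the same vectorizer if S2 should not influence the vocabulary.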
I have a set of N documents with lengths ranging from 0 to more than 20,000 characters. I want to calculate a similarity score between 0 and 1 for all pairs of documents, where a higher number indicates higher similarity. Assume below that deploying a supervised model is infeasible due to resource constraints that are not necessarily data-science related (gathering labels is expensive, infrastructure for supervised models cannot be approved for whatever reason, etc.). Approaches I have considered: tf-idf …
I'm looking for some advice on a data wrangling problem I'm trying to solve. I've spent a solid week trying different approaches, and nothing seems quite right. Just FYI, this is my first big (for me, anyway) data science project, so I'm really in need of some wisdom on the best way to approach it. Essentially, I have a set (200+) of docx files that are semi-structured. By semi-structured I mean the information I want is organized into …
I'm trying to figure out the best way to group customers based on the checkout items in their shopping carts. I have each basket and what's in it, but am at a complete loss on how to group all the similar baskets. I have a group of users that I believe shouldn't be counted in my overall metrics (or at least should be acknowledged separately). These users create a new account, place 4-5 items in their cart, and check out. Then a new …
I am trying to develop an NLP/CNN algorithm to detect documents with sensitive information, such as passports and licenses, and distinguish them from other documents like resumes, emails, forms, or advertisements. I consider this a document classification problem and looked for open-source datasets with documents from different categories/classes. I found the RVL-CDIP and Tobacco3482 datasets, with classes such as email, form, letter, news, resume, and scientific. However, the dataset looks like it comes from an old collection …
I have developed a content-based recommendation system and it is working fine. The input is a set of documents = {d1, d2, d3, ..., dn}, and the output is the top N similar documents for a given document, e.g. output = {d10, d11, d1, d8, ...}. I eyeballed the results and found them satisfactory; the question I have is how to measure the performance and accuracy of the system. I did some research and found that recall, precision, and F1-score are used to evaluate recommendation systems that predict user ratings. …
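The same precision/recall ideas can be applied to top-N recommendation if a small set of hand-labeled relevant documents is available for some query documents: compute precision@N and recall@N per query and average them. A minimal sketch of that computation (the relevance judgments below are made up for illustration):

```python
def precision_recall_at_n(recommended, relevant, n):
    """recommended: ranked list of doc ids; relevant: set of hand-labeled relevant ids."""
    top_n = recommended[:n]
    hits = len(set(top_n) & relevant)
    precision = hits / n
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# System output for one query document vs. a small hand-labeled ground truth
recommended = ["d10", "d11", "d1", "d8", "d3"]
relevant = {"d1", "d8", "d40"}
print(precision_recall_at_n(recommended, relevant, n=5))   # (0.4, 0.666...)
```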