I'm given a large collection of documents on which I should perform various kinds of analysis. Since the documents are to be used as the foundation of a final product, I thought about building a graph out of this text corpus, with each document corresponding to a node. One way to build the graph would be to use a model such as USE to first compute text embeddings, and then form a link between two nodes (texts) whose similarity is beyond …
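A minimal sketch of that thresholded-similarity graph, assuming TensorFlow Hub's Universal Sentence Encoder and NetworkX (the document texts and the 0.7 threshold are placeholders to tune on the actual corpus):

```python
import numpy as np
import networkx as nx
import tensorflow_hub as hub

# Load the Universal Sentence Encoder (USE) from TF Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

documents = ["first document ...", "second document ...", "third document ..."]
embeddings = embed(documents).numpy()                        # shape (n_docs, 512)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = embeddings @ embeddings.T                       # cosine similarity matrix

THRESHOLD = 0.7                                              # placeholder value
graph = nx.Graph()
graph.add_nodes_from(range(len(documents)))
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        if similarity[i, j] >= THRESHOLD:
            graph.add_edge(i, j, weight=float(similarity[i, j]))
```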
Given a corpus of product descriptions (say, vacuum cleaners), I'm looking for a way to group the documents that are all of the same type (where a type can be cordless vacuum, shampooer, carpet cleaner, industrial vacuum, etc.). The approach I'm exploring is to use NER. I'm labeling a set of these documents with tags such as (KIND, BRAND, MODEL). The theory is that I'd then run new documents through the model, and the tokens corresponding to those tags would …
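As a rough sketch of the inference side of that idea, assuming a spaCy pipeline that has already been fine-tuned with the custom KIND / BRAND / MODEL labels (the model path "custom_ner_model" and the example texts are hypothetical):

```python
from collections import defaultdict
import spacy

# Hypothetical path to a spaCy pipeline fine-tuned with KIND / BRAND / MODEL labels
nlp = spacy.load("custom_ner_model")

documents = ["Acme X200 cordless vacuum with 40-minute runtime ...",
             "TurboClean carpet cleaner for deep stains ..."]

groups = defaultdict(list)
for text in documents:
    doc = nlp(text)
    kinds = [ent.text.lower() for ent in doc.ents if ent.label_ == "KIND"]
    # Use the first KIND entity (if any) as the document's type
    key = kinds[0] if kinds else "unknown"
    groups[key].append(text)
```

Grouping on the KIND entity text (possibly after some normalization) would then give the type buckets.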
There is a problem we are trying to solve where we want to do semantic search on our data, i.e., we have domain-specific data (for example, sentences talking about automobiles). Our data is just a collection of sentences, and given a query phrase we want to get back the sentences that are similar to the phrase, have a part that is similar to the phrase, or have a contextually similar meaning. Let …
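One way this is often sketched, assuming the sentence-transformers library and its general-purpose all-MiniLM-L6-v2 model (not tuned for the automobile domain; the sentences are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["The sedan has a turbocharged four-cylinder engine.",
             "Brake pads should be replaced every 50,000 km.",
             "The dealership offers free oil changes for a year."]
corpus_emb = model.encode(sentences, convert_to_tensor=True)

query = "engine performance"
query_emb = model.encode(query, convert_to_tensor=True)

# Rank all sentences by cosine similarity to the query phrase
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(round(hit["score"], 3), sentences[hit["corpus_id"]])
```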
How to do template matching without OpenCV? I have order invoices belonging to Amazon, eBay, Flipkart, and SnapDeal, and I want to extract a small amount of information from each invoice. Since fields like the order number, customer name, and order details are at different positions in these 4 templates, I first need to classify which of the 4 templates the input image belongs to, and after identifying the template I can do my next work …
To measure the similarity between two documents, one can use, e.g., TF-IDF with cosine similarity. Suppose that after calculating the similarity scores of Doc A against a list of documents (Doc B, Doc C, ...), we got:

Document Pair      Similarity Score
Doc A vs. Doc B    0.45
Doc A vs. Doc C    0.30
Doc A vs. ...      ...

Of course, Doc B seems to be the closest one to Doc A in terms of similarity. But what if users, as humans, think Doc …
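For reference, a minimal sketch of how such scores are usually computed with scikit-learn (the document texts are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "text of Doc A ..."
candidates = {"Doc B": "text of Doc B ...", "Doc C": "text of Doc C ..."}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([doc_a] + list(candidates.values()))

# First row is Doc A; remaining rows are the candidate documents
scores = cosine_similarity(matrix[0], matrix[1:])[0]
for name, score in zip(candidates, scores):
    print(f"Doc A vs. {name}: {score:.2f}")
```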
I have a task to provide semantic search capabilities. For example, if I have a dataset of resumes and I search for "machine learning", it should return all resumes with data-science-related skills even if the exact "machine learning" keyword is missing. How do we search the data by its meaning and related keywords? I have looked at algorithms such as LSA, LDA, and LSI, but cannot find a resource that shows how to implement them.
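For a concrete starting point, here is a rough sketch of LSA/LSI-based retrieval with gensim (LsiModel is gensim's LSA implementation; the resume texts and num_topics value are placeholders):

```python
from gensim import corpora, models, similarities

resumes = ["built regression and clustering models in scikit-learn",
           "deep learning with pytorch and tensorflow",
           "managed accounts payable and invoicing"]
tokenized = [r.lower().split() for r in resumes]

dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# LSI (gensim's LSA) projects bag-of-words vectors into a latent semantic space
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[bow_corpus], num_features=lsi.num_topics)

query = "machine learning"
query_lsi = lsi[dictionary.doc2bow(query.lower().split())]
for doc_id, score in sorted(enumerate(index[query_lsi]), key=lambda x: -x[1]):
    print(round(float(score), 3), resumes[doc_id])
```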
I have a corpus of 23000 documents that need to be classified into 5 different categories. I do not have any labeled data available, just free-form text documents and the label names (yes, one-word labels, not topics). So I followed a 2-step approach: (1) synthetically generate labeled data (using a rule-based labeling approach; the recall is obviously very low, with only ~1/8 of the documents getting labeled), and (2) somehow use this labeled data to identify labels for the other documents. I have attempted the following approaches for …
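A hedged sketch of one way to do step 2: train a simple TF-IDF + logistic regression classifier on the rule-labeled subset and accept only its confident predictions on the rest (the example texts, category names, and the 0.8 confidence threshold are all placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder data: the ~1/8 rule-labeled documents and the remaining unlabeled ones
labeled_texts = ["refund requested for a damaged item",
                 "battery drains too quickly on this device"]
labels = ["returns", "product_issue"]          # hypothetical category names
unlabeled_texts = ["the charger stopped working after a week",
                   "I want my money back for this order"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(labeled_texts, labels)

proba = clf.predict_proba(unlabeled_texts)
pred = clf.classes_[proba.argmax(axis=1)]
confident = proba.max(axis=1) >= 0.8           # only keep confident predictions
for text, label, keep in zip(unlabeled_texts, pred, confident):
    if keep:
        print(label, "->", text)
```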
I have 100 sentences that I want to cluster based on similarity. I've used doc2vec to vectorize the sentences into 20-dimensional vectors and applied k-means to cluster them. I haven't got the desired results yet. I've read that doc2vec performs well only on large datasets. I want to know whether increasing the length of each data sample would compensate for the low number of samples and help the model train better. For example, if my sentences are originally "making …
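For context, a minimal sketch of the doc2vec + k-means pipeline being described, assuming gensim and scikit-learn (the sentences and hyperparameters are placeholders; with only 100 sentences the epochs and min_count settings would need experimentation):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

sentences = ["making a chocolate cake", "baking bread at home", "changing a car tire"]
tagged = [TaggedDocument(words=s.lower().split(), tags=[i])
          for i, s in enumerate(sentences)]

# 20-dimensional vectors as in the question; epochs/min_count are guesses
model = Doc2Vec(tagged, vector_size=20, min_count=1, epochs=100)
vectors = [model.dv[i] for i in range(len(sentences))]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)                          # cluster id per sentence
```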
I want to create a document ranking model that returns similar rows in the dataset for a sample query. The text in this corpus is standard English but has no labels (i.e., no query-to-relevant-document structure). Is it possible to use a pretrained model trained on a large corpus (like BERT or word2vec) directly on the scraped dataset, without any evaluation, and get decent results? If not, is training a model on the MS MARCO dataset …
I know how to classify images using a CNN, but I have a problem where a single PDF file contains multiple types of scanned documents on different pages. Some document types span multiple pages inside the PDF. I have to classify and return which documents are present and the page numbers on which they appear in the PDF. If a scanned document spans multiple pages, I should return the range of page numbers, like "1 - 10". …
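The per-page classification can stay a standard CNN; the extra step is grouping consecutive pages with the same predicted class into ranges. A small self-contained sketch of that grouping, assuming one predicted label per page is already available (the labels below are made up):

```python
def pages_to_ranges(page_labels):
    """Turn per-page predictions into (label, 'start - end') ranges."""
    ranges = []
    start = 0
    for i in range(1, len(page_labels) + 1):
        if i == len(page_labels) or page_labels[i] != page_labels[start]:
            first, last = start + 1, i                     # 1-based page numbers
            span = str(first) if first == last else f"{first} - {last}"
            ranges.append((page_labels[start], span))
            start = i
    return ranges

# e.g. pages 1-2 are an invoice, pages 3-5 a contract, page 6 an invoice again
print(pages_to_ranges(["invoice", "invoice", "contract", "contract", "contract", "invoice"]))
# [('invoice', '1 - 2'), ('contract', '3 - 5'), ('invoice', '6')]
```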
I'm storing sentences in Elasticsearch as a dense_vector field and use BERT for the embeddings, so each vector has 768 dimensions. Elasticsearch offers similarity function options such as Euclidean, Manhattan, and cosine similarity. I have tried them, and both Manhattan and cosine give me very similar, good results, and now I don't know which one I should choose.
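For concreteness, here is roughly how the two options can be expressed as script_score queries, assuming the elasticsearch-py 8.x client, a dense_vector field named "embedding", and an index named "sentences" (the field and index names are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")     # assumed local cluster
query_vector = [0.1] * 768                      # placeholder BERT sentence embedding

def search_by_script(script_source):
    return es.search(index="sentences", query={
        "script_score": {
            "query": {"match_all": {}},
            "script": {"source": script_source,
                       "params": {"query_vector": query_vector}},
        }
    })

# Cosine similarity, shifted by 1 so scores stay non-negative
cosine_hits = search_by_script("cosineSimilarity(params.query_vector, 'embedding') + 1.0")
# Manhattan (L1) distance turned into a similarity score
manhattan_hits = search_by_script("1 / (1 + l1norm(params.query_vector, 'embedding'))")
```

One practical difference worth noting: cosine similarity ignores vector magnitude, while Manhattan (L1) distance does not, so if the BERT vectors are not length-normalized the two can diverge more than the small ranking differences observed so far.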
I followed gensim's Core Tutorial and built an LSA classification, topic modeling, and document similarity model for the newsgroups dataset. My code is available here. I need help with the 3 areas below. Topic classification: I get only 50% accuracy with the KNN algorithm. Topic modeling: the words highlighted for each of the 20 topics don't stand out. Document similarity: I wrote a small test and found that document similarity also doesn't produce great results. I am going to follow it up …
I have a huge dataset (>10M) of text files that I am trying to de-duplicate, not only in terms of trivial duplicates but also "near-duplicates", given some similarity threshold. I know that an LSH (locality-sensitive hashing) algorithm would be a good option, but I don't know how to tackle the last phase of the processing. Currently, I have the following steps: (1) generate signatures for all of the text files, (2) compute the hashes (perform the LSH), (3) group documents from the same bucket & hash …
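One way to sketch that last phase: treat documents that share a bucket only as candidate pairs, verify each pair with an exact Jaccard similarity on its shingle sets, and union the confirmed pairs into duplicate groups (the 0.8 threshold is a placeholder):

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def find(parent, x):
    # Union-find with path halving
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def dedupe(buckets, shingles, threshold=0.8):
    """buckets: {bucket_id: [doc_id, ...]}; shingles: {doc_id: set_of_shingles}."""
    parent = {d: d for d in shingles}
    for docs in buckets.values():
        for a, b in combinations(docs, 2):              # candidate pairs only
            if jaccard(shingles[a], shingles[b]) >= threshold:
                parent[find(parent, a)] = find(parent, b)
    groups = {}
    for d in shingles:
        groups.setdefault(find(parent, d), []).append(d)
    return list(groups.values())                        # each list is a duplicate cluster
```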
I'm trying to determine document similarity using Doc2Vec on a large collection of legal opinions, which can contain highly jargon-heavy language and phrases (e.g. en banc, de novo, etc.). I'm wondering if anyone has thoughts on what criteria, if any, I should consider for how to treat compound words/phrases in Doc2Vec for the purpose of calculating similarity. Were I just using tf-idf or something more straightforward, I'd consider going through each phrase and combining the words manually during …
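One alternative to manual merging is to let a collocation model join frequent multi-word expressions into single tokens before Doc2Vec training. A rough sketch with gensim's Phrases (the min_count/threshold values and tiny corpus are placeholders; rarer legal phrases would only be joined if they co-occur often enough in the real corpus):

```python
from gensim.models.phrases import Phrases, Phraser
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

opinions = ["the court sitting en banc reviewed the ruling de novo",
            "the panel reviewed the question de novo on appeal"]
tokenized = [o.lower().split() for o in opinions]

# Learn frequent bigrams and join them into single tokens, e.g. "de novo" -> "de_novo"
bigrams = Phraser(Phrases(tokenized, min_count=1, threshold=1))
phrased = [bigrams[tokens] for tokens in tokenized]
print(phrased)

# Feed the phrase-merged tokens into Doc2Vec as usual
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(phrased)]
model = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=50)
```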
I have two different sets of documents, S1 and S2, with 30 text documents each. Using some text representation method such as tf-idf and a distance measure such as cosine similarity, I want to match similar documents across the two sets S1 and S2. For example, D1 from S1 is similar (say 0.36) to D28 from S2. My problem is that TfidfVectorizer() creates an array of shape (30, 5000) for S1 and (30, 4500) for S2, with 30 rows for each …
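The usual fix is to fit a single vectorizer on both sets so the two matrices share the same columns, and only then compare them. A small sketch with scikit-learn (the document lists are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

s1 = ["document text ..."] * 30   # placeholder for the 30 documents in S1
s2 = ["document text ..."] * 30   # placeholder for the 30 documents in S2

vectorizer = TfidfVectorizer()
vectorizer.fit(s1 + s2)           # one shared vocabulary for both sets
m1 = vectorizer.transform(s1)     # shape (30, V)
m2 = vectorizer.transform(s2)     # shape (30, V), same V

sim = cosine_similarity(m1, m2)   # sim[i, j] = similarity of S1 doc i to S2 doc j
best_match = sim.argmax(axis=1)   # most similar S2 document for each S1 document
```

Alternatively, fit only on S1 and just transform S2 with the same vectorizer if S2 should not influence the vocabulary.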
I have a set of N documents with lengths ranging from 0 to more than 20,000 characters. I want to calculate a similarity score between 0 and 1 for all pairs of documents, where a higher number indicates higher similarity. Assume below that deploying a supervised model is infeasible due to resource constraints that are not necessarily data-science related (gathering labels is expensive, infrastructure for supervised models cannot be approved for whatever reason, etc.). Approaches I have considered: tf-idf …
I'm looking for some advice on a data wrangling problem I'm trying to solve. I've spent a solid week trying different approaches, and nothing seems quite right. Just FYI, this is my first big (for me, anyway) data science project, so I'm really in need of some wisdom on the best way to approach it. Essentially, I have a set (200+) of docx files that are semi-structured. By semi-structured I mean the information I want is organized into …
I'm trying to figure out the best way to group customers based on the checkout items in their shopping carts. I have each basket and what's in it, but am at a complete loss on how to group all the similar baskets. I have a group of users that I believe shouldn't be counted in my overall metrics (or at least should be acknowledged separately). These users create a new account, place 4-5 items in their cart, and check out. Then a new …
I am trying to develop an NLP/CNN algorithm to detect documents with sensitive information, such as passports and licenses, and distinguish them from other documents like resumes, emails, forms, or advertisements. I consider this a document classification problem and looked for open-source datasets with documents from different categories/classes. I found the RVL-CDIP and Tobacco3482 datasets, with classes such as email, form, letter, news, resume, and scientific. However, the dataset looks like it comes from an old collection …
I have developed a content-based recommendation system and it is working fine. The input is a set of documents = {d1, d2, d3, ..., dn}, and the output is the top N similar documents for a given document, e.g. output = {d10, d11, d1, d8, ...}. I eyeballed the results and found them satisfactory; the question I have is how to measure the performance and accuracy of the system. I did some research and found that recall, precision, and F1-score are used to evaluate recommendation systems that predict user ratings. …
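The same precision/recall ideas can be applied to top-N recommendation if a small set of hand-labeled relevant documents is available for some query documents: compute precision@N and recall@N per query and average them. A minimal sketch of that computation (the relevance judgments below are made up for illustration):

```python
def precision_recall_at_n(recommended, relevant, n):
    """recommended: ranked list of doc ids; relevant: set of hand-labeled relevant ids."""
    top_n = recommended[:n]
    hits = len(set(top_n) & relevant)
    precision = hits / n
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# System output for one query document vs. a small hand-labeled ground truth
recommended = ["d10", "d11", "d1", "d8", "d3"]
relevant = {"d1", "d8", "d40"}
print(precision_recall_at_n(recommended, relevant, n=5))   # (0.4, 0.666...)
```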