tfidf

Vectorize One line text data

Payal Bhatia

June 3, 2022, 03:08

How to vectorize one-line text data? I have used TF-IDF with bigrams and trigrams, but I am not able to get good results. I have purchase order (PO) descriptions, which are one-liners, that I need to classify. It is a multi-class, imbalanced problem, and I have a small training set of around 700 PO descriptions. There are 7 classes, and the class distribution is roughly exponential, with one class dominating. My take is that TF IDF should not …

Topic: tfidf nlp

Category: Data Science

How do I use TF*IDF scores for my machine learning model?

Apollo

May 30, 2022, 02:03

I have applied TF*IDF on the 'Ad-topic line' column of my dataset. For every ad-topic line, I get the same output: Firstly, I am unable to make sense of the output. The TF*IDF values are mentioned to the right, but what exactly are the numbers in brackets? I plan to use these for my logistic regression model for classification. How exactly do I feed these values to the algorithm?

Topic: tfidf feature-extraction machine-learning

Category: Data Science

How to combine nlp and numeric data for a linear regression problem

davidm

May 29, 2022, 23:02

I'm very new to data science (this is my hello world project), and I have a data set made up of a combination of review text and numerical data such as number of tables. There is also a column for reviews which is a float (avg of all user reviews for that restaurant). So a row of data could be like: { rating: 3.765, review: `Food was great, staff was friendly`, tables: 30, staff: 15, parking: 20 ... } So …

Topic: tfidf linear-regression scikit-learn nlp

Category: Data Science

Assigning a new document to a cluster based on keywords extracted and tf-idf

Kami

May 27, 2022, 05:05

I have about 40 clusters of documents defined by a combination of k-means clustering algorithm and hand curation. For example, some of the clusters given by k-means are too noisy so they have been further subdivided. Now I want to assign new documents to these clusters. I found that it is possible to extract keywords using tf-idf based methods as mentioned here. My approach is to extract key terms from each of these clusters using tf-idf based method and I …

Topic: tfidf similarity clustering

Category: Data Science

'list' object has no attribute 'lower' TfidfVectorizer

Tanvi Punjani

May 15, 2022, 16:03

I have a dataframe with two text columns, which I converted to a list. I separated the train and test data as well. But while building a baseline model, TfidfVectorizer throws the error 'list' object has no attribute 'lower'. Here is the code:

X['ItemDescription'] = X['ItemDescription'].str.lower()
X['DiagnosisOne'] = X['DiagnosisOne'].str.lower()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Convert abstract text lines into lists
train_items = X_train.reset_index().values.tolist()
test_items = X_test.reset_index().values.tolist()
from sklearn.preprocessing import LabelEncoder
label_encoder = …

Topic: tfidf multiclass-classification nlp

Category: Data Science
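
The error in the excerpt above is consistent with each item being a list of column values rather than a single string, since calling `.lower()` on a list fails in exactly this way. The following is a minimal stdlib-only sketch, not the asker's actual data: the rows are hypothetical stand-ins for what `reset_index().values.tolist()` would return, and joining the text columns into one string per row is one possible fix among several.

```python
# Hypothetical rows shaped like [index, ItemDescription, DiagnosisOne].
rows = [
    [0, "amoxicillin 250mg", "ear infection"],
    [1, "x-ray", "fractured leg"],
]

# A list has no .lower() method, which reproduces the reported message.
try:
    rows[0].lower()
except AttributeError as err:
    message = str(err)

# One possible fix: drop the index and join the text columns into one string,
# so that every document handed to a vectorizer is a plain str.
documents = [" ".join(str(field) for field in row[1:]) for row in rows]

print(message)
print(documents)
```

Each element of `documents` is now a string, which is the input type a text vectorizer expects for a single document.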

Classification using texts as features

sgduran91

May 14, 2022, 17:04

I want to build a classification model to match customers and products. I have a description of each product, and a description of each customer, and the label : customer *i* buy/did not buy product *j*. Each sample/row is a pair (customer, product), so Feature 1 is customer's description, Feature 2 is product's description, and the target variable y is: "y = 1 : customer buys product", "y = 0 otherwise". The goal is to predict for new arriving products …

Topic: text-classification tfidf scikit-learn nlp machine-learning

Category: Data Science

KDE on TF-IDF - sensitive bandwidth

Adam

May 9, 2022, 12:03

I am clustering text based on TF-IDF features and DBSCAN (density based), and trying to rank points based on their 'belonging' to the cluster. Since my clustering is density based and my points can spread very randomly, I found Kernel Density Estimators relevant. However, the KDE scores are very sensitive to the choice of the bandwidth hyper-parameter, which I could not pre-estimate. Most bandwidth values end up with either an infinite score for points outside the cluster, or zero for the …

Topic: tfidf scikit-learn clustering

Category: Data Science

How best to embed large and noisy documents

dendog

May 7, 2022, 20:03

I have a large corpus of documents (web pages) collected from various sites, around 10k-30k chars each. I am processing them to extract as much relevant text as possible, but the results are never perfect. Right now I am creating a doc for each page, processing it with TF-IDF, and then creating a dense feature vector using UMAP. My final goal is to really pick out the differences in the articles, for similarity analysis, clustering and classification - however at this …

Topic: tfidf word-embeddings nlp python machine-learning

Category: Data Science

Optimal clusters for K-means not clear - any ideas?

Sandy Lee

May 5, 2022, 19:24

I have a toy dataset of 10,000 strings of people's names, addresses and birthdays. As a quirk of the data collection process it is highly likely there are duplicate people caused by typos and I am trying to cluster them using K-means. I know there are easier ways of doing this, but the reason I am doing it like this is out of curiosity. In order to vectorize each person I am concatenating the strings as follows: [name][address][birthday] and then …

Topic: tfidf scikit-learn nlp k-means clustering

Category: Data Science

How to have a fixed no of features for input layer of a neural network when using TF-IDF

Yeshan Santhush

May 1, 2022, 23:00

So basically my question is, hypothetically: let's say I have a column containing 2000 rows of text, and when I apply TF-IDF, I get 27 features, as shown below. Once I do that, I could set the number of neurons in my neural network's input layer to 27, as shown below, and train the model with the TF-IDF features. Now, hypothetically speaking, if I'm trying to test this model with one string (a short string), and when we apply …

Topic: tfidf deep-learning neural-network nlp machine-learning

Category: Data Science
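
A common way to keep the input width fixed is to learn the vocabulary once on the training texts and reuse it for every later string, dropping words that were never seen. The sketch below is a hand-rolled illustration using only the standard library. `TinyTfidf` is a hypothetical class, not a library API, and the example texts are made up.

```python
import math
from collections import Counter


class TinyTfidf:
    """Hand-rolled TF-IDF for illustration only; not a library API."""

    def fit(self, texts):
        tokenized = [text.lower().split() for text in texts]
        vocab = sorted({tok for doc in tokenized for tok in doc})
        # The column index of every known word is fixed here, once.
        self.index = {tok: i for i, tok in enumerate(vocab)}
        n_docs = len(tokenized)
        doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
        # Smoothed idf, so the denominator is never zero.
        self.idf = {
            tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab
        }
        return self

    def transform(self, texts):
        rows = []
        for text in texts:
            row = [0.0] * len(self.index)  # always the training width
            for tok, count in Counter(text.lower().split()).items():
                if tok in self.index:  # words unseen in training are dropped
                    row[self.index[tok]] = count * self.idf[tok]
            rows.append(row)
        return rows


train_texts = ["the food was great", "the staff was rude", "great staff"]
model = TinyTfidf().fit(train_texts)
train_vectors = model.transform(train_texts)
test_vectors = model.transform(["great pizza"])  # "pizza" is not in the vocabulary

print(len(train_vectors[0]), len(test_vectors[0]))
```

The same fit-once, transform-many pattern applies with scikit-learn's `TfidfVectorizer`: call `fit` (or `fit_transform`) on the training texts and only `transform` on new strings, so a single short test string still maps to a vector of the training width.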

Naive Bayes TfidfVectorizer predicts everything to one class

Justas Vasiljevas

May 1, 2022, 14:47

I'm trying to run a multinomial naive Bayes classifier on various balanced data sets and compare 2 different vectorizers: TfidfVectorizer and CountVectorizer. I have 3 classes: NEG, NEU and POS. I have 10000 documents. The NEG class has 2474, NEU 5894 and POS 1632. Out of that I have made 3 differently balanced data sets like this:

text counts:

Dataset               NEU    NEG    POS    Total number
NEU balance dataset   5894   2474   1632   10000
NEG balance dataset   2474   2474   1632   6580
POS balance dataset   1632   1632   …

Topic: text-classification tfidf naive-bayes-classifier classification python

Category: Data Science

How to match a corpus with a string of words using a TF-IDF matrix?

sangstar

April 26, 2022, 17:00

I am trying to match strings of words with the website whose bullet points contain the most similar text. My idea is to gather the documents from each bullet point into one corpus per website that I would like to match a string of words with, discard stop words, and then lemmatize everything. Then, for each string of text, I create a TF-IDF sparse matrix, with each row the text from a …

Topic: text-classification tfidf nlp

Category: Data Science

NLP Basic input doubt

mewbie

April 11, 2022, 12:11

I have a basic question about NLP. When we consider traditional models like decision trees, the order of the feature columns matters: the first column is fixed to some particular attribute. So with TF-IDF, each word has a fixed index and the model can learn. But in the case of an LSTM, sentences can be jumbled. For example: "There is heavy rain", "Heavy rain is there". In these two sentences, the word "heavy" occurs in different places. …

Topic: lstm tfidf deep-learning nlp machine-learning

Category: Data Science
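
The contrast in the excerpt above can be made concrete with a small stdlib-only sketch. It assumes a hypothetical four-word vocabulary and plain whitespace tokenization. A bag-of-words vector gives each word a fixed column, so both sentences map to the same vector, while a token-id sequence of the kind fed to an LSTM keeps the word order.

```python
from collections import Counter

# Hypothetical fixed vocabulary; each word owns one column.
vocabulary = ["heavy", "is", "rain", "there"]
token_ids = {word: i for i, word in enumerate(vocabulary)}


def bag_of_words(sentence):
    """Count vector with one fixed column per vocabulary word."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


def id_sequence(sentence):
    """Ordered token ids, i.e. the kind of input a sequence model reads."""
    return [token_ids[word] for word in sentence.lower().split()]


first = "There is heavy rain"
second = "Heavy rain is there"

bow_first, bow_second = bag_of_words(first), bag_of_words(second)
seq_first, seq_second = id_sequence(first), id_sequence(second)

print(bow_first == bow_second)  # same words, same columns
print(seq_first == seq_second)  # same words, different order
```

In the sequence case the model does not rely on a word sitting at a fixed position. It sees the same id for "heavy" wherever it appears and has to learn from the order.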

How to justify logarithmically scaled frequency for tf in tf-idf?

Fred Chang

April 2, 2022, 22:00

I am studying tf-idf (term frequency - inverse document frequency). The original logic for tf was straightforward: count of term t / number of total terms in the document. However, I came across the log scaled frequency: log(1 + count of term t in the document). Please refer to Wikipedia. It does not include the number of total terms in a document. For example, say, document 1 has 10 words in total and one of them is "happy". Using the …

Topic: logarithmic tfidf nlp

Category: Data Science
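
The two definitions in the excerpt above can be compared side by side. This is a minimal sketch using the formulas as quoted (count divided by document length, and log(1 + count)). Only the first document (10 words, one of them "happy") comes from the question. The other two are hypothetical values added for contrast.

```python
import math


def raw_tf(count, doc_length):
    """Term count divided by the total number of terms in the document."""
    return count / doc_length


def log_tf(count):
    """Log-scaled frequency: log(1 + count); document length is not used."""
    return math.log(1 + count)


# (count of "happy", total terms). doc2 and doc3 are made-up examples.
docs = {"doc1": (1, 10), "doc2": (10, 100), "doc3": (50, 100)}

for name, (count, length) in docs.items():
    print(name, round(raw_tf(count, length), 3), round(log_tf(count), 3))
```

The log-scaled variant grows sublinearly: ten occurrences score well under ten times one occurrence. Because it ignores document length, it is usually paired with a separate length normalization of the final vector when document lengths differ a lot.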

Why does using a standard scaler on my tf idf matrix make it perform better?

Synikk

March 21, 2022, 04:08

I have a TF-IDF matrix transformed on a list of tweets from a data set I am using. I have a pipeline where I instantiate a StandardScaler, followed by an SVM with a linear kernel and auto gamma as the classifier. Pretty much as done here in the examples section. With the pipeline, the classifier reaches an F1 score of 87. Without the pipeline, it scores a dismal 53. Why is this? I thought TF-IDF values were already …

Topic: tfidf classification svm

Category: Data Science

Document Similarity with User Preference

JoyfulPanda

March 17, 2022, 02:30

To measure the similarity between two documents, one can use, e.g., TF-IDF with cosine similarity. Suppose that after calculating the similarity scores of Doc A against a list of documents (Doc B, Doc C, ...), we got:

Document Pair      Similarity Score
Doc A vs. Doc B    0.45
Doc A vs. Doc C    0.30
Doc A vs. ...      ...

Of course, Doc B seems to be the closest one to Doc A in terms of similarity. But what if users, as humans, think Doc …

Topic: semantic-similarity similar-documents cosine-distance tfidf similarity

Category: Data Science
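
For reference, the baseline ranking step described in the excerpt above (before any user preference enters) can be sketched with the standard library alone. The vectors below are hypothetical TF-IDF weights over a four-term vocabulary and are not meant to reproduce the 0.45 and 0.30 scores from the question.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)


# Hypothetical TF-IDF vectors; the values are made up for illustration.
doc_a = [0.5, 0.0, 0.8, 0.1]
candidates = {
    "Doc B": [0.4, 0.1, 0.7, 0.0],
    "Doc C": [0.0, 0.9, 0.1, 0.3],
}

scores = {name: cosine_similarity(doc_a, vec) for name, vec in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

print(ranked)
```

Any user-preference signal would then have to be combined with, or replace, the values in `scores` before sorting. How to do that is the open question being asked.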

What are the exact differences between Word Embedding and Word Vectorization?

Nahid

March 13, 2022, 22:31

I am learning NLP. I have tried to figure out the exact difference between Word Embedding and Word Vectorization. However, some articles seem to use these terms interchangeably, but I think there must be some difference. For vectorization, I came across these vectorizers: CountVectorizer, HashingVectorizer, TfidfVectorizer. While trying to understand word embeddings, I found these tools: Bag of words, Word2Vec. Would you please briefly summarize the differences between, and the algorithms behind, Word Embeddings …

Topic: text-classification tfidf word2vec word-embeddings nlp

Category: Data Science

How to Combine tfidf with LSTM in keras?

AQEEL ALTAF

March 10, 2022, 20:01

I am classifying emails as spam or ham using an LSTM and some modified forms of it (adding a convolutional layer at the end). To convert documents into vectors I am using the keras.text_to_sequences function. Now I want to use TF-IDF with the LSTM. Can anyone tell me, or share code showing, how to do it? Please also advise whether this is possible and whether it is a good approach. If you are wondering why I would like to do this there are …

Topic: keras tfidf lda nlp

Category: Data Science

How to decide which method to use TFIDF, or BOW

asmgx

March 8, 2022, 12:05

With a huge NLP dataset, classification takes a very long time, so trying each feature extraction method separately is time-consuming and inefficient. Is there a way to tell which method (TF-IDF or bag of words) is more likely to give the highest F1 score? I tried testing them on a smaller subset (1000 records). It was fast, but the best method on a smaller subset is not necessarily the best on the complete …

Topic: bag-of-words tfidf feature-extraction nlp

Category: Data Science

Input 0 of layer max_pooling1d_3 is incompatible with the layer Error

Yeshan Santhush

March 6, 2022, 08:35

OK, so basically, I have some TF-IDF features and some additional features, like word count and sentiment, on my data. To my knowledge, when we use a convolutional layer, the data needs to be converted to dimensional vectors. Below is how I convert them:

X_train_reshaped = X_train.reshape(X_train.shape[0], 3, 10, 1)
y_train_reshaped = y_train.reshape(y_train.shape[0], 1, 1, 1)

Below is the shape. This is how X_train_reshaped is shown. Now, below is the model that I have declared:

model = Sequential()
model.add(Conv1D(filters=3, kernel_size=1, activation='relu', …

Topic: cnn tfidf deep-learning neural-network machine-learning

Category: Data Science
