Vectorize One line text data

How do I vectorize one-line text data? I have used tf-idf, including bigrams and trigrams, but I am not able to get good results. I have purchase order (PO) descriptions, which are one-liners, and I need to classify them. The data is multi-class and imbalanced, and I have a small training set of around 700 PO descriptions. There are 7 classes and the class distribution is roughly exponential, with one class dominating. My take is that TF IDF should not …
Topic: tfidf nlp
Category: Data Science

How do I use TF*IDF scores for my machine learning model?

I have applied TF*IDF on the 'Ad-topic line' column of my dataset. For every ad-topic line, I get the same output: Firstly, I am unable to make sense of the output. The TF*IDF values are mentioned to the right, but what exactly are the numbers in brackets? I plan to use these for my logistic regression model for classification. How exactly do I feed these values to the algorithm?
Category: Data Science
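
A minimal sketch of one way to read such output and feed it to a classifier, assuming it came from scikit-learn's TfidfVectorizer (the question does not say which library was used). When a scikit-learn TF-IDF matrix is printed, each non-zero entry appears as (row, column) followed by its value: the row is the document index and the column is the index of a term in the vocabulary. The ad lines and labels below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical ad-topic lines and click labels, invented for this sketch.
ad_lines = [
    "cloud storage for small business",
    "cheap flights to europe",
    "business cloud backup service",
    "last minute europe flights deal",
]
clicked = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(ad_lines)  # sparse matrix: one row per line, one column per term

# Translate (row, column) pairs back into (document index, term, weight).
index_to_term = {j: term for term, j in vectorizer.vocabulary_.items()}
coo = X.tocoo()
entries = [(int(i), index_to_term[int(j)], float(v))
           for i, j, v in zip(coo.row, coo.col, coo.data)]

# The sparse matrix can be passed to the classifier as it is.
model = LogisticRegression()
model.fit(X, clicked)

# New text must go through the same fitted vectorizer before prediction.
predictions = model.predict(vectorizer.transform(["europe flights"]))
```

Here `entries` holds the same information as the printed matrix, with column numbers replaced by the words they stand for.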

How to combine nlp and numeric data for a linear regression problem

I'm very new to data science (this is my hello world project), and I have a data set made up of a combination of review text and numerical data such as number of tables. There is also a column for reviews which is a float (avg of all user reviews for that restaurant). So a row of data could be like: { rating: 3.765, review: `Food was great, staff was friendly`, tables: 30, staff: 15, parking: 20 ... } So …
Category: Data Science

Assigning a new document to a cluster based on keywords extracted and tf-idf

I have about 40 clusters of documents defined by a combination of the k-means clustering algorithm and hand curation. For example, some of the clusters given by k-means are too noisy, so they have been further subdivided. Now I want to assign new documents to these clusters. I found that it is possible to extract keywords using tf-idf based methods as mentioned here. My approach is to extract key terms from each of these clusters using a tf-idf based method and I …
Category: Data Science

'list' object has no attribute 'lower' TfidfVectorizer

I have a dataframe with two text columns and I converted them to a list. I separated the train and test data as well. But while making a base model, TfidfVectorizer throws the error 'list' object has no attribute 'lower'. Here is the code:
X['ItemDescription'] = X['ItemDescription'].str.lower()
X['DiagnosisOne'] = X['DiagnosisOne'].str.lower()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Convert abstract text lines into lists
train_items = X_train.reset_index().values.tolist()
test_items = X_test.reset_index().values.tolist()
from sklearn.preprocessing import LabelEncoder
label_encoder = …
Category: Data Science
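
A hedged sketch of what may be going on, assuming the rows handed to TfidfVectorizer are lists (as `reset_index().values.tolist()` produces) rather than plain strings. TfidfVectorizer lowercases each document by calling `.lower()` on it, so every document has to be a single string. The rows below are invented, and joining the two text columns with a space is only one possible way to build one string per row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented rows shaped like df.reset_index().values.tolist():
# [index, ItemDescription, DiagnosisOne]
rows = [
    [0, "broken arm cast", "fracture"],
    [1, "seasonal flu shot", "influenza"],
]

# Passing a list of lists reproduces the error from the question.
error_message = None
try:
    TfidfVectorizer().fit_transform(rows)
except AttributeError as exc:
    error_message = str(exc)

# One string per row: drop the index and join the text columns.
docs = [" ".join(str(cell) for cell in row[1:]) for row in rows]
X = TfidfVectorizer().fit_transform(docs)
```

If the two columns should keep separate vocabularies, each column could instead be vectorized on its own and the resulting matrices stacked side by side.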

Classification using texts as features

I want to build a classification model to match customers and products. I have a description of each product and a description of each customer, and the label: customer *i* bought/did not buy product *j*. Each sample/row is a pair (customer, product), so Feature 1 is the customer's description, Feature 2 is the product's description, and the target variable y is: "y = 1 : customer buys product", "y = 0 otherwise". The goal is to predict for new arriving products …
Category: Data Science

KDE on TF-IDF - sensitive bandwidth

I am clustering text based on TF-IDF features and DBSCAN (density based), and trying to rank points based on their 'belonging' to the cluster. Since my clustering is density based and my points can spread very randomly, I found Kernel Density Estimators relevant. However, the KDE scores are very sensitive to the choice of the bandwidth hyper-parameter, which I could not pre-estimate. Most bandwidth values end up with either an infinite score for points outside the cluster, or zero for the …
Category: Data Science

How best to embed large and noisy documents

I have a large corpus of documents (web pages) collected from various sites, around 10k-30k chars each. I am processing them to extract as much relevant text as possible, but the results are never perfect. Right now I am creating a doc for each page, processing it with TFIDF and then creating a dense feature vector using UMAP. My final goal is to really pick out the differences in the articles, for similarity analysis, clustering and classification - however at this …
Category: Data Science

Optimal clusters for K-means not clear - any ideas?

I have a toy dataset of 10,000 strings of people's names, addresses and birthdays. As a quirk of the data collection process it is highly likely there are duplicate people caused by typos and I am trying to cluster them using K-means. I know there are easier ways of doing this, but the reason I am doing it like this is out of curiosity. In order to vectorize each person I am concatenating the strings as follows: [name][address][birthday] and then …
Category: Data Science

How to have a fixed no of features for input layer of a neural network when using TF-IDF

So basically my question is, hypothetically, let's say I have a column containing 2000 rows of text, and when I apply tf-idf, I get 27 features as shown below. Once I do that, I could set the number of neurons in my neural network's input layer to 27, as shown below, and I train the model with the tf-idf features. Now, hypothetically speaking, if I'm trying to test this model with one string (a short string), and when we apply …
Category: Data Science
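
One common way to keep the input size fixed, sketched here under the assumption that scikit-learn's TfidfVectorizer is used (the question does not name a library): fit the vectorizer once on the training texts, then call only `transform` on new strings. The vocabulary learned at fit time fixes the number of columns, and words not seen during training are ignored. The texts below are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented training texts standing in for the 2000 rows in the question.
train_texts = [
    "the food was great",
    "the staff was friendly",
    "parking was hard to find",
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary
n_features = X_train.shape[1]                    # input layer size for the network

# A short test string: reuse the fitted vectorizer, do not fit again.
X_new = vectorizer.transform(["great pizza"])

# Same number of columns as in training; "pizza" is unseen and contributes nothing.
same_width = X_new.shape[1] == n_features
```

Calling `fit_transform` again on the test string would build a new, smaller vocabulary, which is the usual cause of a mismatch with the input layer.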

Naive Bayes TfidfVectorizer predicts everything to one class

I'm trying to run a Multinomial Naive Bayes classifier on variously balanced data sets and to compare 2 different vectorizers: TfidfVectorizer and CountVectorizer. I have 3 classes: NEG, NEU and POS. I have 10000 documents. The NEG class has 2474, NEU 5894 and POS 1632. Out of that I have made 3 differently balanced data sets like this:
Text counts:
                      NEU   NEG   POS   Total number
NEU balance dataset   5894  2474  1632  10000
NEG balance dataset   2474  2474  1632  6580
POS balance dataset   1632  1632  …
Category: Data Science

How to match a corpus with a string of words using a TF-IDF matrix?

I am trying to match each string of words with the website whose bullet-point text is most similar to it. The way I thought of doing it is to collect the documents from each bullet point into one corpus per website, discard stop words, and then lemmatize everything. Then, for each string of text, I create a TF-IDF sparse matrix, with each row the text from a …
Category: Data Science

NLP Basic input doubt

I have a basic doubt about NLP. When we consider traditional models like decision trees, the feature column order is important: the first column is fixed to some particular attribute. So if I have TF-IDF, each word will have a fixed index and the model can learn. But in the case of an LSTM, sentences can be jumbled. For example: "There is heavy rain", "Heavy rain is there". In the above 2 sentences, the word "heavy" occurs in different places. …
Category: Data Science

How to justify logarithmically scaled frequency for tf in tf-idf?

I am studying tf-idf (term frequency - inverse document frequency). The original logic for tf was straightforward: count of term t / number of total terms in the document. However, I came across the log scaled frequency: log(1 + count of term t in the document). Please refer to Wikipedia. It does not include the number of total terms in a document. For example, say, document 1 has 10 words in total and one of them is "happy". Using the …
Category: Data Science
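
To make the two definitions in the question concrete, here is a small sketch comparing the relative frequency (count / total terms) with the log-scaled variant log(1 + count). The 100-word documents are invented, and the natural logarithm is an assumption, since the question does not state the base.

```python
import math

def relative_tf(term, tokens):
    """Count of the term divided by the total number of terms."""
    return tokens.count(term) / len(tokens)

def log_tf(term, tokens):
    """Log-scaled term frequency: log(1 + raw count)."""
    return math.log(1 + tokens.count(term))

doc_short = ["happy"] + ["filler"] * 9          # 10 words, "happy" once
doc_long = ["happy"] + ["filler"] * 99          # 100 words, "happy" once
doc_repeat = ["happy"] * 10 + ["filler"] * 90   # 100 words, "happy" ten times

for name, doc in [("short", doc_short), ("long", doc_long), ("repeat", doc_repeat)]:
    print(name, relative_tf("happy", doc), round(log_tf("happy", doc), 3))
```

The log-scaled value depends only on the raw count, so `doc_short` and `doc_long` get the same score, and ten occurrences score far less than ten times one occurrence. Document length is then usually handled in a separate step, for example by normalizing the final vector.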

Why does using a standard scaler on my tf idf matrix make it perform better?

I have a TF-IDF matrix computed from a list of tweets in a data set I am using. I have a pipeline where I instantiate a StandardScaler and then have my SVM, with a linear kernel and auto gamma, as the classifier. Pretty much as done here in the examples section. With the pipeline, the classifier gets an F1 score of 87. Without the pipeline, it scores a dismal 53. Why is this? I thought TF-IDF values were already …
Category: Data Science

Document Similarity with User Preference

To measure the similarity between two documents, one can use, e.g. TF-IDF/Cosine Similarity. Supposing that after calculating the similarity scores of Doc A against a list of Documents (Doc B, Doc C,...), we got:
Document Pair       Similarity Score
Doc A vs. Doc B     0.45
Doc A vs. Doc C     0.30
Doc A vs. ...       ...
Of course, Doc B seems to be the closest one, in terms of similarity, for Doc A. But what if Users, as humans, think Doc …
Category: Data Science

What are the exact differences between Word Embedding and Word Vectorization?

I am learning NLP. I have tried to figure out the exact difference between Word Embedding and Word Vectorization. However, it seems that some articles use these terms interchangeably, and I think there must be some difference. For vectorization, I came across these vectorizers: CountVectorizer, HashingVectorizer, TfidfVectorizer. Moreover, while trying to understand word embedding, I found these tools: Bag of words, Word2Vec. Would you please briefly summarize the differences between, and the algorithms of, Word Embeddings …
Category: Data Science

How to Combine tfidf with LSTM in keras?

I am classifying emails as spam or ham using an LSTM and some modified forms of it (adding a convolutional layer at the end). For converting documents into vectors I am using the keras.text_to_sequences function. But now I want to use TfIdf with the LSTM. Can anyone tell me how to do it, or share the code? Please also tell me whether it is possible and whether it is a good approach. If you are wondering why I would like to do this there are …
Topic: keras tfidf lda nlp
Category: Data Science

How to decide which method to use: TFIDF or BOW

With a huge NLP dataset, classification takes a very long time, so trying each feature extraction method separately is time consuming and inefficient. Is there a way to tell which method (TFIDF or Bag Of Words) is more likely to give the highest F1 score? I tried testing them on a smaller subset (1000 records). It was fast, but the best method on the smaller subset does not mean it is the best in complete …
Category: Data Science

Input 0 of layer max_pooling1d_3 is incompatible with the layer Error

OK, so basically, I have some TF-IDF features and some additional features like word count and sentiment on my data. Now, to my knowledge, when we use a convolutional layer, the data needs to be reshaped into multi-dimensional arrays. This is how I am converting them:
X_train_reshaped = X_train.reshape(X_train.shape[0], 3, 10, 1)
y_train_reshaped = y_train.reshape(y_train.shape[0], 1, 1, 1)
Below is the shape. This is how X_train_reshaped is shown. Now, below is the model that I have declared:
model = Sequential()
model.add(Conv1D(filters=3, kernel_size=1, activation='relu', …
Category: Data Science
