I'm looking for something similar to this: https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py, but instead of positive and negative examples, I have positive examples and a bunch of unlabeled data that will contain some positive examples but is mostly negative. I'm planning on using this in a pipeline to transform text data into a vector, then feeding it into a classifier using https://pulearn.github.io/pulearn/doc/pulearn/. The issue is that I'm not sure of the best way to build the preprocessing stage where I transform the raw text data into …
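A minimal sketch of what that preprocessing stage could look like, assuming TF-IDF features feeding pulearn's ElkanotoPuClassifier; the placeholder texts, the choice of base estimator, and the label convention (1 for positive, -1 for unlabeled) are assumptions to verify against pulearn's documentation:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from pulearn import ElkanotoPuClassifier

# Placeholder data: a few labeled-positive texts plus unlabeled texts.
texts = [f"known positive document {i}" for i in range(5)] + \
        [f"unlabeled document {i}" for i in range(5)]
y = np.array([1] * 5 + [-1] * 5)  # 1 = positive, -1 = unlabeled (check pulearn's convention)

# Preprocessing stage: raw text -> TF-IDF vectors, fitted once on all available text.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()  # densified; the PU wrapper may not accept sparse input

# PU classifier wrapping a base estimator that exposes predict_proba.
pu_clf = ElkanotoPuClassifier(estimator=LogisticRegression(), hold_out_ratio=0.2)
pu_clf.fit(X, y)

# New documents must go through the same fitted vectorizer before prediction.
X_new = vectorizer.transform(["some unseen document"]).toarray()
print(pu_clf.predict(X_new))
```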
I'm doing text classification to identify 'attacks' in Wikipedia comments, using a simple bag-of-words model and a linear SVM classifier. Because of class imbalance, I'm using the F1 score as my error measure. I'm wondering whether the tokens in my training data should also include words that exist only in the test data, or does it not matter? I was under the impression that it shouldn't matter, since the counts for these features would be zero …
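For reference, scikit-learn's vectorizers build the vocabulary only from the data they are fitted on, so words that appear only at test time are simply dropped at transform time rather than showing up as zero-count columns. A small sketch (the example comments are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["you are an idiot", "thanks for the helpful edit"]  # placeholder comments
test_docs = ["what a moronic revert"]                             # contains words never seen in training

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # vocabulary built from training data only
X_test = vectorizer.transform(test_docs)        # unseen words ("moronic", "revert") are silently dropped

print(vectorizer.get_feature_names_out())
print(X_test.toarray())  # columns correspond only to training-time tokens
```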
In a huge NLP dataset it takes a very long time to classify my data, so trying each feature extraction method separately is time-consuming and not efficient. Is there a way to tell which method (TF-IDF or bag of words) is more likely to give the highest F1 score? I tried testing them on a smaller subset (1,000 records); it was fast, but the best method on a smaller subset does not mean it is the best on the complete …
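For concreteness, the side-by-side comparison described (on a subset of the data) usually looks something like the sketch below; the LinearSVC classifier and 3-fold cross-validation are illustrative assumptions, not part of the question:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def compare_features(texts, labels, cv=3):
    """Cross-validate bag-of-words vs TF-IDF with the same downstream classifier."""
    for name, vect in [("bow", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
        pipe = make_pipeline(vect, LinearSVC())
        # scoring="f1" assumes binary labels, as in the question
        scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1")
        print(name, scores.mean())
```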
I'm doing preprocessing on a text dataset. It contains certain numerics, such as: dates (1st July), years (2019), tentative values (3-5 years / 10+ advantages), unique values (room no 31 / user rank 45), and percentages (100%). Is it recommended to discard these numerics before creating a vectorizer (BoW/TF-IDF) for any model (classification/regression) development? Any quick help on this is much appreciated. Thank you.
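If you want to keep a trace of the numerics rather than discarding them entirely, one possible option is to map them to placeholder tokens before vectorizing; the regexes and token names below are only illustrative assumptions:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def normalize_numbers(text):
    """Replace digit sequences (dates, years, ranks, percentages) with placeholder tokens.
    Note: supplying a custom preprocessor bypasses the built-in lowercasing, so lowercase here."""
    text = text.lower()
    text = re.sub(r"\d+%", " percenttoken ", text)
    text = re.sub(r"\d+", " numtoken ", text)
    return text

vectorizer = TfidfVectorizer(preprocessor=normalize_numbers)
X = vectorizer.fit_transform(["1st July 2019", "3-5 years, 10+ advantages", "room no 31, 100%"])
print(vectorizer.get_feature_names_out())
```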
I have a dataset with a corpus of 20K documents. Each document is a single short sentence. I need to classify each sentence into 0/1 classes, as well as be able to point to exactly which words are responsible for that decision. To make it more concrete, one of the classes is "Unclear vs Clear": the user makes a request and we try to guess whether the request is clear enough to be understood and processed by someone else. Then we want to show …
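One common way to get both a 0/1 prediction and per-word attributions is a linear model over bag-of-words counts, where each word's contribution is its count times its learned coefficient. A minimal sketch with made-up sentences and hypothetical labels (1 = unclear, 0 = clear):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up sentences with hypothetical labels: 1 = unclear, 0 = clear.
texts = [
    "please fix the thing soon",
    "replace the broken handle on the front door",
    "do it asap",
    "ship order 1234 to the warehouse in Austin",
]
labels = [1, 0, 1, 0]

vect = CountVectorizer()
X = vect.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

def explain(sentence):
    """Per-word contribution = word count in the sentence * learned coefficient."""
    x = vect.transform([sentence]).toarray()[0]
    contrib = x * clf.coef_[0]
    words = vect.get_feature_names_out()
    present = x > 0
    return sorted(zip(words[present], contrib[present]), key=lambda t: -t[1])

print(explain("please fix the thing"))
```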
I have some sentences and I want to see whether or not they contain words that are technical terms. I was thinking of working with Wikipedia texts: finding the most common words in a certain article, and if those words are rare among most of the other articles, then they are most likely technical terms. Does this make sense? I tried it using 3 specialized texts from my computer, from different areas, and the results were quite bad. I got …
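What is described ("frequent in this article, rare in most other articles") is essentially what IDF weighting captures, so one way to test the idea is to fit TF-IDF on a reference corpus and inspect the top-weighted terms of the target article; the variable names and the top-20 cutoff below are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# reference_articles: a large set of general articles; target_article: the specialized text.
reference_articles = ["...general article text...", "...more general text..."]  # placeholders
target_article = "...specialized article text..."                               # placeholder

# TF-IDF already encodes "frequent here, rare elsewhere": fit IDF on the reference corpus,
# then score the target article and look at its top-weighted terms.
vect = TfidfVectorizer(stop_words="english")
vect.fit(reference_articles)
scores = vect.transform([target_article]).toarray()[0]
top = sorted(zip(vect.get_feature_names_out(), scores), key=lambda t: -t[1])[:20]
print(top)
```

One caveat with this setup: terms that never occur in the reference corpus are dropped from the vocabulary entirely, even though those are often the most technical ones.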
Suppose I have a corpus with documents: corpus = [ "The sky looks lovely today", "The fat cat hit the poor dog", "God created all men equal", "He wrestled the creature to the ground", "The king did not treat his subjects fairly", ] Which I've preprocessed, and want to generate context-word pairs, following this article. The writer notes: The preceding output should give you some more perspective of how X forms our context words and we are trying to predict …
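As a generic illustration of the context/target pairing (not necessarily the article's exact preprocessing, which works with integer word ids), a window-based sketch over that corpus looks like this:

```python
corpus = [
    "The sky looks lovely today",
    "The fat cat hit the poor dog",
    "God created all men equal",
    "He wrestled the creature to the ground",
    "The king did not treat his subjects fairly",
]

def context_target_pairs(sentence, window=2):
    """Yield (context_words, target_word) pairs for a CBOW-style setup."""
    tokens = sentence.lower().split()
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target

for context, target in context_target_pairs(corpus[0]):
    print(context, "->", target)
```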
For a university project, I chose to do sentiment analysis on a Google Play store reviews dataset. I obtained decent results classifying the data using the bag-of-words (BoW) model and an ADALINE classifier. I would like to improve my model by incorporating bigrams relevant to the topic (Negative or Positive) into my feature set. I found this paper, which uses KL divergence to measure the relevance of unigrams/bigrams relative to a topic. The only problem is that I …
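Without the paper's exact definition to hand, one common formulation scores each n-gram by its contribution to KL(P(term | topic) || P(term)). The sketch below uses that formulation with add-one smoothing, so treat it as an assumption rather than the paper's method:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def kl_term_scores(texts, labels, target_label, ngram_range=(2, 2)):
    """Score n-grams by their contribution to KL(P(term | topic) || P(term)),
    with add-one smoothing. One common formulation, not necessarily the paper's."""
    vect = CountVectorizer(ngram_range=ngram_range)
    X = vect.fit_transform(texts)
    mask = np.array(labels) == target_label
    counts_all = np.asarray(X.sum(axis=0)).ravel()
    counts_topic = np.asarray(X[mask].sum(axis=0)).ravel()
    p_all = (counts_all + 1) / (counts_all.sum() + counts_all.size)
    p_topic = (counts_topic + 1) / (counts_topic.sum() + counts_topic.size)
    scores = p_topic * np.log(p_topic / p_all)
    return sorted(zip(vect.get_feature_names_out(), scores), key=lambda t: -t[1])
```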
I know that there are methods that help in selecting features, such as mutual information, information gain, etc. But for datasets with thousands of records and thousands of features, it is time-consuming to train the model on both BoW and TF-IDF to decide which method is better. Is there a way to decide which method to choose without having to spend all this time?
I want to compare one sentence to some other sentences using the bag-of-words model. Suppose that my comparison sentence is: "I am playing football", and there are three more sentences that I want to compare it with: 1. and I am playing Cricket 2. Why do you play Cricket 3. I love playing Cricket when I am at school. Now, if I compare my comparison sentence to the above three sentences by counting words, …
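A small sketch of that comparison with scikit-learn, using cosine similarity over the raw counts so that longer sentences are not favoured simply for containing more words (the choice of cosine similarity is an assumption, since the question only mentions counting words):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "I am playing football"
candidates = [
    "and I am playing Cricket",
    "Why do you play Cricket",
    "I love playing Cricket when I am at school",
]

vect = CountVectorizer()                      # raw word counts (bag of words)
X = vect.fit_transform([query] + candidates)  # row 0 is the query sentence
sims = cosine_similarity(X[0], X[1:])         # compare the query against the other three
print(sims)
```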
I implemented a spam classifier using Bernoulli Naive Bayes, Logistic Regression, and SVM. The algorithms are trained on the entire Enron spam email dataset using the bag-of-words (BoW) approach, and prediction is done on the UCI SMS Spam Collection dataset. I have 3 questions: During test time, while creating the term-frequency matrix, what if none of the words from my training BoW vocabulary are found in some of my test emails/SMSes? Then wouldn't the document vectors be zero vectors for those data points? How …
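Regarding the first question, a vectorizer fitted on the training corpus simply ignores unseen test words, so an SMS that shares no tokens with the training vocabulary does map to an all-zero row. A quick check (the example texts are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train_emails = ["win a free prize now", "meeting agenda attached"]  # placeholder email texts
test_sms = ["txt 4 luv"]                                            # shares no tokens with training

vect = CountVectorizer()
X_train = vect.fit_transform(train_emails)
X_test = vect.transform(test_sms)

zero_rows = np.asarray(X_test.sum(axis=1)).ravel() == 0
print(zero_rows)  # True -> the SMS maps to an all-zero vector under the training vocabulary
```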
In my recent studies of machine learning NLP tasks I found this very nice tutorial on building your first text classifier: https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a The point is that I always believed you have to choose between using bag-of-words, word embeddings, or TF-IDF, but in this tutorial the author uses bag-of-words (CountVectorizer) and then applies TF-IDF over the features generated by bag-of-words: text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())]) Is that a valid technique? Why …
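For reference, chaining CountVectorizer and TfidfTransformer with default settings is exactly what TfidfVectorizer does in a single step, which can be checked directly:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ["the cat sat", "the dog barked", "the cat and the dog"]

# CountVectorizer followed by TfidfTransformer ...
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# ... is equivalent to TfidfVectorizer with default settings.
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray()))  # True
```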
Given a vocabulary with $|V|=4$, for example $V = \{\text{I, want, this, cat}\}$: what does the bag-of-words representation with this vocabulary and one-hot encoding look like for the example sentences "You are the dog here", "I am fifty", "Cat cat cat"? I suppose it would look like this: $V_1 = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}$ $V_2 = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}$ $V_3=\begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 …
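A quick way to sanity-check the vectors, assuming presence/absence (one-hot style) rather than raw counts, is scikit-learn's CountVectorizer with a fixed vocabulary; the relaxed token_pattern is only there so the single-character word "I" is kept:

```python
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ["i", "want", "this", "cat"]  # fixed vocabulary V, |V| = 4
sentences = ["You are the dog here", "I am fifty", "Cat cat cat"]

# binary=True gives presence/absence; binary=False would give raw counts instead.
vect = CountVectorizer(vocabulary=vocabulary, binary=True, token_pattern=r"(?u)\b\w+\b")
print(vect.fit_transform(sentences).toarray())
# [[0 0 0 0]
#  [1 0 0 0]
#  [0 0 0 1]]
```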
https://www.google.com/search?q=homophones+example Are there machine learning algorithms for forming homophones from an input word? Examples of homophones: accessary, accessory; ad, add; air, heir; all, awl; allowed, aloud; alms, arms. Input: ad Output: ad, add Are there machine learning algorithms for forming homophones from an input word for Indian regional languages such as Hindi, Gujarati, Bengali, etc., as well as other languages such as French, German, Italian, Spanish, Dutch, etc.?
https://www.google.com/search?q=jumbled+words Can machine learning algorithms take an input dataset of jumbled words and form the correct words from them?
I'm doing preprocessing on an English text dataset and I encounter hyphenated words like 'well-known'. When creating the vectors (BoW/TF-IDF) for model input, would it be more useful to remove the hyphen and treat it as the single word 'wellknown', to split it into the two words 'well' and 'known', or to use all three words 'well', 'known', and 'wellknown'? Any quick help on this would be much appreciated. Thank you.
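If you want to try the third option (keeping 'well', 'known' and 'wellknown'), one way is a custom analyzer; the regex and the choice to emit all three variants below are illustrative assumptions:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def analyzer(text):
    """Emit 'well', 'known' and 'wellknown' for 'well-known' (the third option in the
    question); drop whichever variants you do not want to keep."""
    tokens = []
    for tok in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text.lower()):
        if "-" in tok:
            parts = tok.split("-")
            tokens.extend(parts)            # 'well', 'known'
            tokens.append("".join(parts))   # 'wellknown'
        else:
            tokens.append(tok)
    return tokens

vect = CountVectorizer(analyzer=analyzer)
print(vect.fit_transform(["A well-known author"]).toarray())
print(vect.get_feature_names_out())
```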
TF-IDF discounts words that appear in a lot of documents in the corpus. I am constructing an anomaly-detection text classification algorithm that is trained only on valid documents; later I use a one-class SVM to detect outliers. Interestingly enough, TF-IDF performs worse than a simple count vectorizer. At first I was confused, but later it made sense to me: TF-IDF discounts exactly the attributes that are most indicative of a valid document. Therefore I was thinking of a new approach …
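For concreteness, the comparison described (train on valid documents only, then flag outliers) might be sketched like this; the nu value and the helper name are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import OneClassSVM

def fit_and_flag(valid_docs, docs_to_screen, vectorizer):
    """Train on valid documents only, then flag outliers (+1 = inlier, -1 = outlier)."""
    pipe = make_pipeline(vectorizer, OneClassSVM(nu=0.1))  # nu is an assumed value
    pipe.fit(valid_docs)
    return pipe.predict(docs_to_screen)

# The comparison described in the question:
# fit_and_flag(valid_docs, docs_to_screen, CountVectorizer())
# fit_and_flag(valid_docs, docs_to_screen, TfidfVectorizer())
```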
I'm working with a bag of words in R: library(tm); corpus = VCorpus(textsource); dtm = DocumentTermMatrix(corpus); dtm = as.matrix(dtm). I use the matrix dtm to train a lasso model. Now I want to predict on new (unseen) text. The problem is that I need to generate a new dtm (for prediction) with the same matrix columns as the original dtm used for model training. Essentially, I need to populate the original dtm (as used for training) with the new text. Example: "original …
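The question is in R, but the underlying idea (fix the vocabulary at training time and build the prediction-time matrix against that same vocabulary) can be sketched in scikit-learn terms; in tm, the analogous step is to pass the training terms as a dictionary when building the new DocumentTermMatrix, which is worth verifying against the tm documentation.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["original training text about cats", "more training text about dogs"]  # placeholders
new_docs = ["unseen text about cats and parrots"]

vect = CountVectorizer()
X_train = vect.fit_transform(train_docs)  # columns = training vocabulary

# Reuse the fitted vectorizer (i.e. the training vocabulary) for the new text,
# so the prediction matrix has exactly the same columns in the same order.
X_new = vect.transform(new_docs)
print(X_train.shape[1] == X_new.shape[1])  # True
```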