What is the shape of the vector after it passes through the TfidfVectorizer fit_transform() method?

I am trying to understand what happens inside the IDF part of the TF-IDF vectorizer. The official scikit-learn page says that the shape is (4, 9) for a corpus of 4 documents having 9 unique features. I get the Term Frequency (TF) part: it makes sense to me that for every unique feature (9) and every document (4) we calculate each term's frequency, so we get a matrix of shape (4, 9). But what does not make sense to me is the IDF …
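For reference, a minimal sketch that reproduces that shape (the four-document corpus below is the one from the scikit-learn documentation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)      # sparse matrix, one row per document
print(X.shape)                            # (4, 9): 4 documents x 9 unique terms
print(vectorizer.get_feature_names_out()) # the 9 features (columns)
```

The IDF step does not change the shape: it only rescales each of the 9 columns by a per-term weight, so the output stays (4, 9).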
Category: Data Science

How to predict the sentiment of the entities from the tweet?

I have a JSON file (tweets.json) that contains tweets (sentences) along with the name of the author. Objective 1: Get the most frequent entities from the tweets. Objective 2: Find out the sentiment/polarity of each author towards each of the entities. Sample Input: Assume we have only 3 tweets: Tweet1 by Author1: Pink Pearl Apples are tasty but Empire Apples are not. Tweet2 by Author2: Empire Apples are very tasty. Tweet3 by Author3: Pink Pearl Apples are not tasty. Sample …
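A hedged sketch of both objectives with spaCy NER and TextBlob polarity (the JSON layout with "author" and "text" keys is an assumption, and attributing a whole tweet's polarity to every entity in it is crude, as Tweet 1's mixed sentiment shows):

```python
import json
from collections import Counter, defaultdict

import spacy
from textblob import TextBlob

nlp = spacy.load("en_core_web_sm")
tweets = json.load(open("tweets.json"))  # assumed: [{"author": ..., "text": ...}, ...]

entity_counts = Counter()
polarity_by_author_entity = defaultdict(list)

for tweet in tweets:
    doc = nlp(tweet["text"])
    polarity = TextBlob(tweet["text"]).sentiment.polarity  # -1 .. 1
    for ent in doc.ents:
        entity_counts[ent.text] += 1
        polarity_by_author_entity[(tweet["author"], ent.text)].append(polarity)

print(entity_counts.most_common(10))  # Objective 1: most frequent entities
for (author, entity), scores in polarity_by_author_entity.items():
    print(author, entity, sum(scores) / len(scores))  # Objective 2 (rough)
```

For entity-level rather than tweet-level polarity, an aspect-based sentiment model would be the more precise tool.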
Category: Data Science

Class token in ViT and BERT

I'm trying to understand the architecture of the ViT paper, and noticed they use a CLASS token as in BERT. To the best of my understanding, this token is used to gather information about the entire input, and is then solely used to predict the class of the image. My question is: why does this token exist as an input to all the transformer blocks and get treated the same as the word/patch tokens? Treating the class token …
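As context for the question, a minimal PyTorch sketch (not the reference implementation; dimensions and layer counts are placeholders) of how ViT prepends the learnable class token so it flows through every block like an ordinary patch token, with only its final state feeding the classifier:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, num_patches, dim, num_classes):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable [CLS]
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=6,
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patches):                  # (batch, num_patches, dim)
        cls = self.cls_token.expand(patches.size(0), -1, -1)
        x = torch.cat([cls, patches], dim=1)     # [CLS] enters every block
        x = self.encoder(x + self.pos_embed)
        return self.head(x[:, 0])                # classify from [CLS] alone
```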
Category: Data Science

How to calculate lexical cohesion and semantic informativeness for a given dataset?

In 'Automatic construction of lexicons, taxonomies, ontologies, and other knowledge structures' the authors mention: There are two slightly different classes of measure: lexical cohesion (sometimes called ‘unithood’ or ‘phraseness’), which quantifies the expectation of co-occurrence of words in a phrase (e.g., back-of-the-book index is significantly more cohesive than term name); and semantic informativeness (sometimes called ‘termhood’), which highlights phrases that are representative of a given document or domain. However, the review does not include ways to calculate/derive these measures. …
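One standard, concrete measure of lexical cohesion ('unithood'/'phraseness') is pointwise mutual information over bigrams, which NLTK's collocation tools implement; a sketch, assuming a plain-text corpus file:

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = nltk.word_tokenize(open("corpus.txt").read().lower())
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # drop rare bigrams, which inflate PMI

# PMI rewards word pairs that co-occur far more often than chance predicts
for bigram, score in finder.score_ngrams(BigramAssocMeasures().pmi)[:20]:
    print(bigram, score)
```

Semantic informativeness ('termhood') is often approximated by contrasting a phrase's frequency in the target domain against a general reference corpus, e.g. with a TF-IDF-style ratio.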
Category: Data Science

Train a spaCy model for semantic similarity

I'm attempting to train a spaCy model for the purposes of computing semantic similarity but I'm not getting the results I would anticipate. I have created two text files that contain many sentences that use a new term, "PROJ123456". For example, "PROJ123456 is on track." I've added each to a DocBin and saved them to disk as train.spacy and dev.spacy. I'm then running: python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy The config.cfg file contains: [paths] train …
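For context, a minimal sketch of the DocBin step described above (the file names and the blank English pipeline are assumptions):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

def to_docbin(lines, path):
    db = DocBin()
    for line in lines:
        db.add(nlp.make_doc(line.strip()))
    db.to_disk(path)

to_docbin(open("train.txt"), "./train.spacy")
to_docbin(open("dev.txt"), "./dev.spacy")
```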
Category: Data Science

What to do if your adversarial validation shows different distributions for an NLP problem?

I was trying to figure out whether the test set from a competition is similar to the train set. This was in an NLP competition in which I had two columns, tweet and type, and I needed to predict the type of crime the tweet was reporting. So I decided to check whether the train set is too different from the test set. This is what I've done so far: # drop the target column from the training data …
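The usual adversarial-validation recipe, as a sketch (file and column names assumed): label train rows 0 and test rows 1, fit a classifier on TF-IDF features of the tweets, and check ROC AUC; an AUC near 0.5 means the sets are hard to tell apart, while an AUC near 1.0 signals a distribution shift.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")   # assumed file names
test = pd.read_csv("test.csv")

text = pd.concat([train["tweet"], test["tweet"]])
is_test = np.r_[np.zeros(len(train)), np.ones(len(test))]  # adversarial label

X = TfidfVectorizer(max_features=20000).fit_transform(text)
auc = cross_val_score(LogisticRegression(max_iter=1000), X, is_test,
                      cv=5, scoring="roc_auc").mean()
print(auc)
```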
Category: Data Science

Naive Bayes Text Classifier Confidence Score

I am experimenting with building a text classifier using Naive Bayes which has been pretty successful on my test data. One thing I am looking to incorporate is handling text that does not fit into any of the predefined categories that I trained the model on. Does anyone have some thoughts on how to do this? I was thinking of trying to calculate the confidence score for each document, and if it is below 80% confidence, for example, it should label the data …
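A sketch of that thresholding idea with scikit-learn (training data assumed to exist; note that Naive Bayes probabilities tend to be poorly calibrated, so the 80% cutoff should be tuned on held-out data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)        # assumed: labeled training texts

probs = clf.predict_proba(new_texts)      # per-class probabilities
confidence = probs.max(axis=1)
labels = clf.classes_[probs.argmax(axis=1)]
labels = np.where(confidence < 0.80, "unknown", labels)  # reject low confidence
```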
Category: Data Science

Dataset with Multiple Choice Questions for fine-tuning

I hope it's allowed to ask here, but I am looking for a dataset (the format is not that important) that is similar to SQuAD, but also contains false answers to the questions. I want to use it to fine-tune GPT-3, and all I find is either MC questions based on a text but with no distractors, or classical quizzes that have no context before each question. I have a code that generates distractors, and I can just plug …
Category: Data Science

How to use text as input for a neural network (regression problem)? How many likes/claps an article will get

I am trying to predict the number of likes an article or a post will get using an NN. I have a dataframe with ~70,000 rows and 2 columns: "text" (predictor, strings of text) and "likes" (target, a continuous int variable). I've been reading about the approaches taken in NLP problems, but I feel somewhat lost as to what the input to the NN should look like. Here is what I've done so far: Text cleaning: removing …
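One common shape for such a model, sketched with Keras (every hyperparameter below is a placeholder): vectorize the raw strings, embed, pool, and end in a single linear unit trained with a regression loss.

```python
import tensorflow as tf
from tensorflow.keras import layers

# assumed: texts (list of ~70,000 strings) and likes (int array) are loaded
vectorizer = layers.TextVectorization(max_tokens=20000, output_sequence_length=200)
vectorizer.adapt(texts)

model = tf.keras.Sequential([
    vectorizer,                            # raw strings in, token ids out
    layers.Embedding(20000, 64),
    layers.GlobalAveragePooling1D(),       # fixed-size vector per article
    layers.Dense(64, activation="relu"),
    layers.Dense(1),                       # single linear output: predicted likes
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(tf.constant(texts), likes, epochs=5, validation_split=0.1)
```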
Category: Data Science

Using BERT for coreference resolution, what's the loss function?

I'm working my way around using BERT for coreference resolution. I'm following the highly-cited paper BERT for Coreference Resolution: Baselines and Analysis (https://arxiv.org/pdf/1908.09091.pdf). I have the following questions; the details can't be found easily in the paper, so I hope you can help me out. What's the input? Is it antecedents + paragraph? What's the output? Clusters of <mention, antecedent> pairs? More importantly, what's the loss function? For comparison, another highly-cited paper by Clark et al. using reinforcement learning is very clear about …
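For what it's worth, the span-ranking models this paper builds on (Lee et al., 2017) train with the marginal log-likelihood of all correct antecedents; below is a sketch of that loss under assumed tensor shapes, not a claim about the paper's exact code:

```python
import torch

def coref_loss(antecedent_scores, gold_mask):
    # antecedent_scores: (num_mentions, num_candidates); the first column is
    # typically a dummy "no antecedent". gold_mask: True where the candidate
    # is a correct antecedent (or the dummy, for non-anaphoric mentions).
    log_probs = torch.log_softmax(antecedent_scores, dim=-1)
    gold_log_probs = log_probs.masked_fill(~gold_mask, float("-inf"))
    # marginalize over all correct antecedents, then sum over mentions
    return -torch.logsumexp(gold_log_probs, dim=-1).sum()
```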
Topic: bert nlp
Category: Data Science

Vectorize one-line text data

How to vectorize one-line text data? I have used TF-IDF including bigrams and trigrams but I am not able to get good results. I have purchase order descriptions which are one-liners and which I need to classify. It is multi-class, imbalanced data, and I have a small dataset to train on: around 700 PO descriptions. The number of classes is 7 and the class distribution is roughly exponential; one class dominates. My take is that TF-IDF should not …
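One setup that often helps with short, imbalanced text, sketched under assumptions (character n-grams to squeeze more signal out of one-liners, balanced class weights against the dominating class):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# assumed: descriptions (700 one-line strings) and labels (7 classes)
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LinearSVC(class_weight="balanced"),
)
print(cross_val_score(clf, descriptions, labels, cv=5, scoring="f1_macro").mean())
```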
Topic: tfidf nlp
Category: Data Science

How do I test one-shot model performance against flawed categories?

I'm in the process of reworking the ASAM database. Excerpted, it looks like this: 4155 PIRATES BULK CARRIER GULF OF ADEN: Bulk carrier fired upon 3 Aug 09 at 1500 UTC while underway in position 13-46.5N 050-42.3E. Ten heavily armed pirates in two boats fired upon the vessel underway. The pirates failed to board the vessel due to evasive action taken by the master. All crew and ship properties are safe (IMB). 4156 PIRATES CARGO SHIP NIGERIA: Vessel (SATURNAS) boarded, …
Category: Data Science

How to perform Grid Search on NLP CRF model

I am trying to perform hyperparameter tuning on an sklearn_crfsuite.CRF model. When I try to execute the code below, it doesn't raise any exception, but it probably fails to fit. As a result, if I try to get the best estimator from the grid search, it doesn't work. %%time # define fixed parameters and parameters to search crf = sklearn_crfsuite.CRF( algorithm='lbfgs', max_iterations=100, all_possible_transitions=True ) params_space = { "c1": [0, 0.05, 0.1, 0.25, 0.5, 1], "c2": [0, 0.05, 0.1, 0.25, 0.5, 1] } # use the same metric for evaluation f1_scorer …
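A runnable version of that search, following the sklearn-crfsuite tutorial pattern (X_train, y_train, and the labels list are assumed to exist; note that CRF only partially implements the scikit-learn estimator API, which with newer scikit-learn versions may be why the fit fails silently):

```python
import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    max_iterations=100,
    all_possible_transitions=True,
)
params_space = {
    "c1": [0, 0.05, 0.1, 0.25, 0.5, 1],
    "c2": [0, 0.05, 0.1, 0.25, 0.5, 1],
}
# flatten the per-token sequence predictions before scoring
f1_scorer = make_scorer(metrics.flat_f1_score, average="weighted", labels=labels)

gs = GridSearchCV(crf, params_space, cv=3, scoring=f1_scorer, verbose=1)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)
```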
Category: Data Science

Using BERT instead of word2vec to extract most similar words to a given word

I am fairly new to BERT, and I want to test two approaches for getting "the most similar words" to a given word, to use in Snorkel labeling functions for weak supervision. The first approach was to use word2vec with the pre-trained word embeddings "word2vec-google-news-300" to find the most similar words: @labeling_function() def lf_find_good_synonyms(x): good_synonyms = word_vectors.most_similar("good", topn=25) ##Similar words are extracted here good_list = syn_list(good_synonyms) ##syn_list just returns the stemmed similar word return POSITIVE if any(word in x.stemmed for …
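The word2vec half of that comparison, as a runnable sketch with gensim (the Snorkel machinery and syn_list are from the question and not reproduced here):

```python
import gensim.downloader as api

word_vectors = api.load("word2vec-google-news-300")  # ~1.6 GB download on first use
good_synonyms = word_vectors.most_similar("good", topn=25)
print(good_synonyms[:5])  # (word, cosine similarity) pairs
```

For the BERT side, keep in mind that BERT produces contextual vectors, so "most similar words" requires either its static input-embedding matrix or averaging token vectors over many contexts.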
Category: Data Science

How to train custom word2vec embeddings to find related articles?

I am a beginner in machine learning. My project is to make an AI-based search engine that shows related articles when you search on a website. For this I decided to train my own embeddings. I found two methods: one is to train the network to find the next word (i.e. inputs=[the quick, the quick brown, the quick brown fox] and outputs=[brown, fox, lazy]); the other is to train with the nearest words (i.e. [brown,fox], [brown,quick], [brown,quick]). Which method should I use, and after training, how should I …
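Both schemes are built into gensim's Word2Vec (roughly, sg=0 for CBOW-style context prediction and sg=1 for skip-gram); a minimal sketch on a toy corpus:

```python
from gensim.models import Word2Vec

# assumed: one tokenized article (or sentence) per inner list
corpus = [
    ["the", "quick", "brown", "fox", "jumps"],
    ["the", "lazy", "dog", "sleeps"],
]
model = Word2Vec(corpus, vector_size=100, window=5, sg=1, min_count=1)
model.save("articles.w2v")
print(model.wv.most_similar("fox", topn=5))
```

A common next step for related-article search is to average each article's word vectors into one document vector and rank articles by cosine similarity.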
Category: Data Science

NLP Deep Learning Project (or Paper)

Kind of trying to find a good repo/course/paper or something that can get me up to speed on an NLP problem. Example: classify email, or something else. Also something relevant to the latest state of the art (transformers, multi-head attention, etc.). Thank you
Category: Data Science

Inference from text data without label or Target

I have a use case where I have text data entered by an approver while approving a loan. I have to make some inferences as to what could be the reasons for approval, using NLP. How should I go about it? It's a non-English language. Can clustering of the text help? Is it possible to cluster non-English text using Python libraries?
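Clustering non-English text is possible with Python libraries; one hedged sketch pairs a multilingual sentence-embedding model with k-means (the model name and the number of clusters are assumptions to adapt):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# assumed: comments = list of approver remarks in the source language
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(comments)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(embeddings)
for text, cluster in zip(comments[:10], kmeans.labels_[:10]):
    print(cluster, text)  # inspect clusters to infer recurring approval reasons
```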
Category: Data Science

How are natural language generation algorithms given a target

I've started learning about NLP and NLG and I'm fascinated! I've been blown away by the things I've seen from NLP, but I have a few questions about NLG. All my questions boil down to this: given a network or Markov chain, how does one specify what you want the system to talk about? To explain this a little: if I ask my 5-year-old nephew to tell me something, he'll talk about his toys, or what's on TV …
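For the Markov-chain case specifically, the "target" is usually just the seed state: generation starts from the word you want the system to talk about. A toy sketch:

```python
import random
from collections import defaultdict

def build_chain(tokens):
    chain = defaultdict(list)
    for a, b in zip(tokens, tokens[1:]):
        chain[a].append(b)      # successor list doubles as a probability table
    return chain

def generate(chain, seed, length=15):
    word, out = seed, [seed]
    for _ in range(length):
        if word not in chain:
            break
        word = random.choice(chain[word])
        out.append(word)
    return " ".join(out)

tokens = open("corpus.txt").read().split()  # assumed training text
print(generate(build_chain(tokens), seed="toys"))  # steer the topic via the seed
```

Neural generators are steered analogously through the conditioning context: a prompt, a topic token, or an encoder input.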
Topic: nlg nlp
Category: Data Science

How to deal with name strings in large data sets for ML?

My data set contains multiple columns with first name, last name, etc. I want to use a classifier model such as Isolation Forest later. Some word-embedding techniques are used preferably for longer text sequences, not for single-word strings as in this case, so I think these techniques wouldn't work correctly here. Additionally, label encoding or label binarization may not be suitable ways to work with names, because of the many different values on the one side …
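One option in this direction, sketched under assumptions: hash character n-grams of each name into a fixed-width vector, which avoids learning a per-name code and feeds directly into Isolation Forest:

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import HashingVectorizer

# assumed: names = df["first_name"] + " " + df["last_name"]
vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 3),
                               n_features=256)
X = vectorizer.transform(names)    # fixed width, no fitted vocabulary

iso = IsolationForest(random_state=0).fit(X)
scores = iso.decision_function(X)  # lower = more anomalous name strings
```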
Category: Data Science
