What is the shape of the vector after it passes through the TfidfVectorizer fit_transform() method?

I am trying to understand what happens inside the IDF part of the TF-IDF vectorizer. The official scikit-learn page says that the shape is (4, 9) for a corpus of 4 documents having 9 unique features. I get the Term Frequency (TF) part: it makes sense to me that for every unique feature (9) and every document (4) we calculate each term's frequency, so we get a matrix of shape (4, 9). But what does not make sense to me is the IDF …
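For reference, a minimal sketch that reproduces that shape (the four-document corpus below is the one from the scikit-learn documentation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)      # sparse matrix, one row per document
print(X.shape)                            # (4, 9): 4 documents x 9 unique terms
print(vectorizer.get_feature_names_out()) # the 9 features (columns)
```

The IDF step does not change the shape: it only rescales each of the 9 columns by a per-term weight, so the output stays (4, 9).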
Category: Data Science

How to predict the sentiment of the entities from the tweet?

I have a JSON file (tweets.json) that contains tweets (sentences) along with the name of the author. Objective 1: Get the most frequent entities from the tweets. Objective 2: Find out the sentiment/polarity of each author towards each of the entities. Sample Input: Assume we have only 3 tweets: Tweet1 by Author1: Pink Pearl Apples are tasty but Empire Apples are not. Tweet2 by Author2: Empire Apples are very tasty. Tweet3 by Author3: Pink Pearl Apples are not tasty. Sample …
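A hedged sketch of both objectives with spaCy NER and TextBlob polarity (the JSON layout with "author" and "text" keys is an assumption, and attributing a whole tweet's polarity to every entity in it is crude, as Tweet 1's mixed sentiment shows):

```python
import json
from collections import Counter, defaultdict

import spacy
from textblob import TextBlob

nlp = spacy.load("en_core_web_sm")
tweets = json.load(open("tweets.json"))  # assumed: [{"author": ..., "text": ...}, ...]

entity_counts = Counter()
polarity_by_author_entity = defaultdict(list)

for tweet in tweets:
    doc = nlp(tweet["text"])
    polarity = TextBlob(tweet["text"]).sentiment.polarity  # -1 .. 1
    for ent in doc.ents:
        entity_counts[ent.text] += 1
        polarity_by_author_entity[(tweet["author"], ent.text)].append(polarity)

print(entity_counts.most_common(10))  # Objective 1: most frequent entities
for (author, entity), scores in polarity_by_author_entity.items():
    print(author, entity, sum(scores) / len(scores))  # Objective 2 (rough)
```

For entity-level rather than tweet-level polarity, an aspect-based sentiment model would be the more precise tool.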
Category: Data Science

Class token in ViT and BERT

I'm trying to understand the architecture of the ViT paper, and noticed they use a CLASS token as in BERT. To the best of my understanding, this token is used to gather information about the entire input, and is then solely used to predict the class of the image. My question is: why does this token exist as an input to all the transformer blocks and get treated the same as the word/patch tokens? Treating the class token …
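As context for the question, a minimal PyTorch sketch (not the reference implementation; dimensions and layer counts are placeholders) of how ViT prepends the learnable class token so it flows through every block like an ordinary patch token, with only its final state feeding the classifier:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, num_patches, dim, num_classes):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable [CLS]
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=6,
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patches):                  # (batch, num_patches, dim)
        cls = self.cls_token.expand(patches.size(0), -1, -1)
        x = torch.cat([cls, patches], dim=1)     # [CLS] enters every block
        x = self.encoder(x + self.pos_embed)
        return self.head(x[:, 0])                # classify from [CLS] alone
```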
Category: Data Science

How to calculate lexical cohesion and semantic informativeness for a given dataset?

In 'Automatic construction of lexicons, taxonomies, ontologies, and other knowledge structures' the authors mention: There are two slightly different classes of measure: lexical cohesion (sometimes called ‘unithood’ or ‘phraseness’), which quantifies the expectation of co-occurrence of words in a phrase (e.g., back-of-the-book index is significantly more cohesive than term name); and semantic informativeness (sometimes called ‘termhood’), which highlights phrases that are representative of a given document or domain. However, the review does not include ways to calculate/derive these measures. …
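One standard, concrete measure of lexical cohesion ('unithood'/'phraseness') is pointwise mutual information over bigrams, which NLTK's collocation tools implement; a sketch, assuming a plain-text corpus file:

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = nltk.word_tokenize(open("corpus.txt").read().lower())
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # drop rare bigrams, which inflate PMI

# PMI rewards word pairs that co-occur far more often than chance predicts
for bigram, score in finder.score_ngrams(BigramAssocMeasures().pmi)[:20]:
    print(bigram, score)
```

Semantic informativeness ('termhood') is often approximated by contrasting a phrase's frequency in the target domain against a general reference corpus, e.g. with a TF-IDF-style ratio.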
Category: Data Science

Train a spaCy model for semantic similarity

I'm attempting to train a spaCy model for the purposes of computing semantic similarity but I'm not getting the results I would anticipate. I have created two text files that contain many sentences that use a new term, "PROJ123456". For example, "PROJ123456 is on track." I've added each to a DocBin and saved them to disk as train.spacy and dev.spacy. I'm then running: python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy The config.cfg file contains: [paths] train …
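For context, a minimal sketch of the DocBin step described above (the file names and the blank English pipeline are assumptions):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

def to_docbin(lines, path):
    db = DocBin()
    for line in lines:
        db.add(nlp.make_doc(line.strip()))
    db.to_disk(path)

to_docbin(open("train.txt"), "./train.spacy")
to_docbin(open("dev.txt"), "./dev.spacy")
```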
Category: Data Science

What to do if your adversarial validation shows different distributions for an NLP problem?

I was trying to figure out whether the test set from a competition is similar to the train set. This was in an NLP competition in which I had two columns, tweet and type, and I needed to predict the type of crime the tweet was reporting. So I decided to check whether the train set is too different from the test set. This is what I've done so far: # drop the target column from the training data …
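The usual adversarial-validation recipe, as a sketch (file and column names assumed): label train rows 0 and test rows 1, fit a classifier on TF-IDF features of the tweets, and check ROC AUC; an AUC near 0.5 means the sets are hard to tell apart, while an AUC near 1.0 signals a distribution shift.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")   # assumed file names
test = pd.read_csv("test.csv")

text = pd.concat([train["tweet"], test["tweet"]])
is_test = np.r_[np.zeros(len(train)), np.ones(len(test))]  # adversarial label

X = TfidfVectorizer(max_features=20000).fit_transform(text)
auc = cross_val_score(LogisticRegression(max_iter=1000), X, is_test,
                      cv=5, scoring="roc_auc").mean()
print(auc)
```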
Category: Data Science

Naive Bayes Text Classifier Confidence Score

I am experimenting with building a text classifier using Naive Bayes which has been pretty successful on my test data. One thing I am looking to incorporate is handling text that does not fit into any of the predefined categories that I trained the model on. Does anyone have some thoughts on how to do this? I was thinking of trying to calculate the confidence score for each document, and if it is below 80% confidence, for example, it should label the data …
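A sketch of that thresholding idea with scikit-learn (training data assumed to exist; note that Naive Bayes probabilities tend to be poorly calibrated, so the 80% cutoff should be tuned on held-out data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)        # assumed: labeled training texts

probs = clf.predict_proba(new_texts)      # per-class probabilities
confidence = probs.max(axis=1)
labels = clf.classes_[probs.argmax(axis=1)]
labels = np.where(confidence < 0.80, "unknown", labels)  # reject low confidence
```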
Category: Data Science

Dataset with Multiple Choice Questions for fine-tuning

I hope it's allowed to ask here, but I am looking for a dataset (the format is not that important) that is similar to SQuAD, but also contains false answers to the questions. I want to use it to fine-tune GPT-3, and all I find is either MC questions based on a text but with no distractors, or classical quizzes that have no context before each question. I have a code that generates distractors, and I can just plug …
Category: Data Science

How to use text as input for a neural network (regression problem)? How many likes/claps an article will get

I am trying to predict the number of likes an article or a post will get using an NN. I have a dataframe with ~70,000 rows and 2 columns: "text" (predictor, strings of text) and "likes" (target, a continuous int variable). I've been reading about the approaches taken in NLP problems, but I feel somewhat lost as to what the input to the NN should look like. Here is what I've done so far: Text cleaning: removing …
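One common shape for such a model, sketched with Keras (every hyperparameter below is a placeholder): vectorize the raw strings, embed, pool, and end in a single linear unit trained with a regression loss.

```python
import tensorflow as tf
from tensorflow.keras import layers

# assumed: texts (list of ~70,000 strings) and likes (int array) are loaded
vectorizer = layers.TextVectorization(max_tokens=20000, output_sequence_length=200)
vectorizer.adapt(texts)

model = tf.keras.Sequential([
    vectorizer,                            # raw strings in, token ids out
    layers.Embedding(20000, 64),
    layers.GlobalAveragePooling1D(),       # fixed-size vector per article
    layers.Dense(64, activation="relu"),
    layers.Dense(1),                       # single linear output: predicted likes
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(tf.constant(texts), likes, epochs=5, validation_split=0.1)
```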
Category: Data Science

Using BERT for coreference resolution, what's the loss function?

I'm working my way around using BERT for coreference resolution. I'm following the highly-cited paper BERT for Coreference Resolution: Baselines and Analysis (https://arxiv.org/pdf/1908.09091.pdf). I have the following questions; the details can't be found easily in the paper, so I hope you can help me out. What's the input? Is it antecedents + paragraph? What's the output? Clusters of <mention, antecedent> pairs? More importantly, what's the loss function? For comparison, another highly-cited paper by Clark et al. using reinforcement learning is very clear about …
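For what it's worth, the span-ranking models this paper builds on (Lee et al., 2017) train with the marginal log-likelihood of all correct antecedents; below is a sketch of that loss under assumed tensor shapes, not a claim about the paper's exact code:

```python
import torch

def coref_loss(antecedent_scores, gold_mask):
    # antecedent_scores: (num_mentions, num_candidates); the first column is
    # typically a dummy "no antecedent". gold_mask: True where the candidate
    # is a correct antecedent (or the dummy, for non-anaphoric mentions).
    log_probs = torch.log_softmax(antecedent_scores, dim=-1)
    gold_log_probs = log_probs.masked_fill(~gold_mask, float("-inf"))
    # marginalize over all correct antecedents, then sum over mentions
    return -torch.logsumexp(gold_log_probs, dim=-1).sum()
```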
Topic: bert nlp
Category: Data Science

Vectorize one-line text data

How to vectorize one-line text data? I have used TF-IDF including bigrams and trigrams but I am not able to get good results. I have purchase order descriptions which are one-liners and which I need to classify. It is multi-class, imbalanced data, and I have a small dataset to train on: around 700 PO descriptions. The number of classes is 7 and the class distribution is roughly exponential; one class dominates. My take is that TF-IDF should not …
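One setup that often helps with short, imbalanced text, sketched under assumptions (character n-grams to squeeze more signal out of one-liners, balanced class weights against the dominating class):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# assumed: descriptions (700 one-line strings) and labels (7 classes)
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LinearSVC(class_weight="balanced"),
)
print(cross_val_score(clf, descriptions, labels, cv=5, scoring="f1_macro").mean())
```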
Topic: tfidf nlp
Category: Data Science

How do I test one-shot model performance against flawed categories?

I'm in the process of reworking the ASAM database. Excerpted, it looks like this: 4155 PIRATES BULK CARRIER GULF OF ADEN: Bulk carrier fired upon 3 Aug 09 at 1500 UTC while underway in position 13-46.5N 050-42.3E. Ten heavily armed pirates in two boats fired upon the vessel underway. The pirates failed to board the vessel due to evasive action taken by the master. All crew and ship properties are safe (IMB). 4156 PIRATES CARGO SHIP NIGERIA: Vessel (SATURNAS) boarded, …
Category: Data Science

How to perform Grid Search on NLP CRF model

I am trying to perform hyperparameter tuning on an sklearn_crfsuite.CRF model. When I try to execute the code below, it doesn't raise any exception, but it probably fails to fit. As a result, if I try to get the best estimator from the grid search, it doesn't work. %%time # define fixed parameters and parameters to search crf = sklearn_crfsuite.CRF( algorithm='lbfgs', max_iterations=100, all_possible_transitions=True ) params_space = { "c1": [0, 0.05, 0.1, 0.25, 0.5, 1], "c2": [0, 0.05, 0.1, 0.25, 0.5, 1] } # use the same metric for evaluation f1_scorer …
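A runnable version of that search, following the sklearn-crfsuite tutorial pattern (X_train, y_train, and the labels list are assumed to exist; note that CRF only partially implements the scikit-learn estimator API, which with newer scikit-learn versions may be why the fit fails silently):

```python
import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    max_iterations=100,
    all_possible_transitions=True,
)
params_space = {
    "c1": [0, 0.05, 0.1, 0.25, 0.5, 1],
    "c2": [0, 0.05, 0.1, 0.25, 0.5, 1],
}
# flatten the per-token sequence predictions before scoring
f1_scorer = make_scorer(metrics.flat_f1_score, average="weighted", labels=labels)

gs = GridSearchCV(crf, params_space, cv=3, scoring=f1_scorer, verbose=1)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)
```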
Category: Data Science

Using BERT instead of word2vec to extract most similar words to a given word

I am fairly new to BERT, and I want to test two approaches for getting "the most similar words" to a given word, to use in Snorkel labeling functions for weak supervision. The first approach was to use word2vec with the pre-trained word embeddings "word2vec-google-news-300" to find the most similar words: @labeling_function() def lf_find_good_synonyms(x): good_synonyms = word_vectors.most_similar("good", topn=25) ##Similar words are extracted here good_list = syn_list(good_synonyms) ##syn_list just returns the stemmed similar word return POSITIVE if any(word in x.stemmed for …
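The word2vec half of that comparison, as a runnable sketch with gensim (the Snorkel machinery and syn_list are from the question and not reproduced here):

```python
import gensim.downloader as api

word_vectors = api.load("word2vec-google-news-300")  # ~1.6 GB download on first use
good_synonyms = word_vectors.most_similar("good", topn=25)
print(good_synonyms[:5])  # (word, cosine similarity) pairs
```

For the BERT side, keep in mind that BERT produces contextual vectors, so "most similar words" requires either its static input-embedding matrix or averaging token vectors over many contexts.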
Category: Data Science

How to train custom word2vec embeddings to find related articles?

I am a beginner in machine learning. My project is to make an AI-based search engine that shows related articles when you search on a website. For this I decided to train my own embeddings. I found two methods: one is to train the network to find the next word (i.e. inputs=[the quick, the quick brown, the quick brown fox] and outputs=[brown, fox, lazy]); the other is to train with the nearest words (i.e. [brown,fox], [brown,quick], [brown,quick]). Which method should I use, and after training, how should I …
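Both schemes are built into gensim's Word2Vec (roughly, sg=0 for CBOW-style context prediction and sg=1 for skip-gram); a minimal sketch on a toy corpus:

```python
from gensim.models import Word2Vec

# assumed: one tokenized article (or sentence) per inner list
corpus = [
    ["the", "quick", "brown", "fox", "jumps"],
    ["the", "lazy", "dog", "sleeps"],
]
model = Word2Vec(corpus, vector_size=100, window=5, sg=1, min_count=1)
model.save("articles.w2v")
print(model.wv.most_similar("fox", topn=5))
```

A common next step for related-article search is to average each article's word vectors into one document vector and rank articles by cosine similarity.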
Category: Data Science

NLP Deep Learning Project (or Paper)

Kind of trying to find a good repo/course/paper or something that can get me up to speed on an NLP problem. Example: classify email, or something else. Also something relevant to the latest state of the art (transformers, multi-head attention, etc.). Thank you
Category: Data Science

Inference from text data without label or Target

I have a use case where I have text data entered by an approver while approving a loan. I have to make some inferences as to what could be the reasons for approval, using NLP. How should I go about it? It's a non-English language. Can clustering of the text help? Is it possible to cluster non-English text using Python libraries?
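Clustering non-English text is possible with Python libraries; one hedged sketch pairs a multilingual sentence-embedding model with k-means (the model name and the number of clusters are assumptions to adapt):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# assumed: comments = list of approver remarks in the source language
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(comments)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(embeddings)
for text, cluster in zip(comments[:10], kmeans.labels_[:10]):
    print(cluster, text)  # inspect clusters to infer recurring approval reasons
```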
Category: Data Science

How are natural language generation algorithms given a target

I've started learning about NLP and NLG and I'm fascinated! I've been blown away by the things I've seen from NLP, but I have a few questions about NLG. All my questions boil down to this: given a network or Markov chain, how does one specify what you want the system to talk about? To explain this a little: if I ask my 5-year-old nephew to tell me something, he'll talk about his toys, or what's on TV …
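For the Markov-chain case specifically, the "target" is usually just the seed state: generation starts from the word you want the system to talk about. A toy sketch:

```python
import random
from collections import defaultdict

def build_chain(tokens):
    chain = defaultdict(list)
    for a, b in zip(tokens, tokens[1:]):
        chain[a].append(b)      # successor list doubles as a probability table
    return chain

def generate(chain, seed, length=15):
    word, out = seed, [seed]
    for _ in range(length):
        if word not in chain:
            break
        word = random.choice(chain[word])
        out.append(word)
    return " ".join(out)

tokens = open("corpus.txt").read().split()  # assumed training text
print(generate(build_chain(tokens), seed="toys"))  # steer the topic via the seed
```

Neural generators are steered analogously through the conditioning context: a prompt, a topic token, or an encoder input.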
Topic: nlg nlp
Category: Data Science

How to deal with name strings in large data sets for ML?

My data set contains multiple columns with first name, last name, etc. I want to use a classifier model such as Isolation Forest later. Some word-embedding techniques are used preferably for longer text sequences, not for single-word strings as in this case, so I think these techniques wouldn't work correctly here. Additionally, label encoding or label binarization may not be suitable ways to work with names, because of the many different values on the one side …
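One option in this direction, sketched under assumptions: hash character n-grams of each name into a fixed-width vector, which avoids learning a per-name code and feeds directly into Isolation Forest:

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import HashingVectorizer

# assumed: names = df["first_name"] + " " + df["last_name"]
vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 3),
                               n_features=256)
X = vectorizer.transform(names)    # fixed width, no fitted vocabulary

iso = IsolationForest(random_state=0).fit(X)
scores = iso.decision_function(X)  # lower = more anomalous name strings
```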
Category: Data Science
