How to predict the sentiment of the entities from a tweet?

I have a JSON file (tweets.json) that contains tweets (sentences) along with the name of the author. Objective 1: Get the most frequent entities from the tweets. Objective 2: Find out the sentiment/polarity of each author towards each of the entities. Sample Input: Assume we have only 3 tweets: Tweet1 by Author1: Pink Pearl Apples are tasty but Empire Apples are not. Tweet2 by Author2: Empire Apples are very tasty. Tweet3 by Author3: Pink Pearl Apples are not tasty. Sample …
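A minimal sketch of the two objectives on the sample tweets above, assuming entities have already been extracted (in practice an NER tool would produce them) and using a toy negation rule instead of a real sentiment model:

```python
from collections import Counter, defaultdict

# Hard-coded stand-ins for the JSON data and for NER output (illustrative only).
tweets = [
    ("Author1", "Pink Pearl Apples are tasty but Empire Apples are not."),
    ("Author2", "Empire Apples are very tasty."),
    ("Author3", "Pink Pearl Apples are not tasty."),
]
entities = ["Pink Pearl Apples", "Empire Apples"]

# Objective 1: most frequent entities across tweets.
counts = Counter()
for _, text in tweets:
    for ent in entities:
        if ent in text:
            counts[ent] += 1

# Objective 2: naive per-author polarity toward each entity.
# Toy rule: split on "but" so each clause carries one opinion, and call the
# clause negative if it contains "not". A real system would use a sentiment
# classifier scoped to the entity's clause.
polarity = defaultdict(dict)
for author, text in tweets:
    for clause in text.rstrip(".").split(" but "):
        for ent in entities:
            if ent in clause:
                polarity[author][ent] = "negative" if " not" in clause else "positive"

print(counts.most_common())
print(dict(polarity))
```

The split-on-conjunction step matters: without it, Tweet1 would assign the same polarity to both apple varieties.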
Category: Data Science

Trying to compress text with NLP

For a university project, I need to send text in Spanish via SMS. As these have a cost, I am trying to compress this text, in an admittedly inefficient way. The idea is to first generate two-character codes drawn from several alphabets (Finnish, Cyrillic, etc.), and to assign each code to a word of more than two characters (so that it can be said to be compressed). Then I take each word in a sentence and replace it with its associated …
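The scheme described above can be sketched as a word-to-code dictionary; the vocabulary and the alphabet here are made up for illustration:

```python
from itertools import product

# A few Cyrillic letters as the code alphabet (any non-colliding symbols work).
alphabet = "абвгд"
codes = ("".join(pair) for pair in product(alphabet, repeat=2))

# Words longer than two characters each get a unique two-character code.
vocab = ["hola", "mundo", "mensaje"]
codebook = {word: next(codes) for word in vocab}
decodebook = {code: word for word, code in codebook.items()}

def compress(text):
    # Replace known words with their codes; leave unknown words untouched.
    return " ".join(codebook.get(w, w) for w in text.split())

def decompress(text):
    return " ".join(decodebook.get(t, t) for t in text.split())

msg = "hola mundo"
packed = compress(msg)
assert decompress(packed) == msg
```

One practical caveat: SMS uses the GSM 7-bit alphabet with a 160-character limit, and any non-GSM character (Cyrillic included) forces the whole message into UCS-2, dropping the limit to 70 characters, so the choice of code alphabet can cancel the savings.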
Category: Data Science

In smoothing of n-gram model in NLP, why don't we consider start and end of sentence tokens?

When learning Add-1 smoothing, I found that we add 1 to the count of each word in our vocabulary, but do not count start-of-sentence and end-of-sentence as two words in the vocabulary. Let me give an example to explain. Example: Assume we have a corpus of three sentences: "John read Moby Dick", "Mary read a different book", and "She read a book by Cher". After training our bi-gram model on this corpus of three sentences, we need to evaluate the probability of …
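The convention in question can be made concrete on this exact corpus. A common choice is to count `</s>` in the vocabulary but not `<s>`, since `<s>` is only ever conditioned on, never predicted; the sketch below follows that convention:

```python
from collections import Counter

corpus = [
    "John read Moby Dick",
    "Mary read a different book",
    "She read a book by Cher",
]

BOS, EOS = "<s>", "</s>"
bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    tokens = [BOS] + sent.split() + [EOS]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

# V counts word types plus </s> but not <s>: <s> never appears as a
# *predicted* word, it only conditions the first bigram of a sentence.
vocab = set(unigrams) - {BOS}
V = len(vocab)

def p_add1(w, prev):
    # Add-1 (Laplace) smoothed bigram probability.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(V)                       # 11 word types + </s> = 12
print(p_add1("read", "John"))  # (1 + 1) / (1 + 12)
```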
Category: Data Science

Lemmatization Vs Stemming

I have been reading about both of these techniques for finding the root form of a word, but how do we choose one over the other? Is lemmatization always better than stemming?
Category: Data Science

GloVe dot product optimized for non-commutative data while the operation itself is commutative

To my current knowledge, GloVe word vector dot products are optimized so that w_i ⋅ w_j = log(P(i|j)), the probability being computed from a co-occurrence matrix. However, the dot product is a commutative operation, while the log probability isn't. Is this issue addressed in GloVe? Am I missing something?
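One relevant detail, shown numerically below: GloVe actually trains two separate vector sets (word vectors w and context vectors w̃) plus two bias terms, fitting w_i ⋅ w̃_j + b_i + b̃_j ≈ log X_ij, and the non-symmetric part of log P(i|j) (the -log X_i term) can be absorbed by the biases. With separate sets, the fitted score is not symmetric in (i, j) even though the dot product itself is commutative. A toy numpy illustration with random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 3
W  = rng.normal(size=(n, d))   # "word" vectors w_i
Wc = rng.normal(size=(n, d))   # separate "context" vectors w~_j
b  = rng.normal(size=n)        # word biases b_i
bc = rng.normal(size=n)        # context biases b~_j

def score(i, j):
    # GloVe's fitted quantity: w_i . w~_j + b_i + b~_j  (approx. log X_ij)
    return W[i] @ Wc[j] + b[i] + bc[j]

# Commutativity only holds within a single vector set; across the two sets
# score(i, j) != score(j, i) in general.
print(score(0, 1), score(1, 0))
```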
Category: Data Science

Extracting information with corresponding fields

I have a large pool of scanned county documents. I need to extract information like the document title, borrower name & address, lender name & address, etc. The text looks like this, e.g.: the deed of trust, between abc llc, a limited company, whose address is XXXXXX, herein called "borrower", and xyz, whose address is XXXXX, herein called "lender". I used named entity recognition to extract the names, and it works well, but how would I know which name is the borrower and which one is the lender? Can …
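Since the documents follow a boilerplate pattern, one simple baseline is to anchor on the role phrase ('herein called "..."') and capture the name that precedes it. A sketch on the example text (the addresses stay as placeholders, as in the question):

```python
import re

text = ('the deed of trust, between abc llc, a limited company, whose address '
        'is XXXXXX, herein called "borrower", and xyz, whose address is '
        'XXXXX, herein called "lender".')

# Lazy groups capture the name up to its first comma; the role keyword after
# each name disambiguates borrower vs. lender.
pattern = (r'between\s+(.+?),.*?herein called "borrower",\s*and\s+(.+?),'
           r'.*?herein called "lender"')
m = re.search(pattern, text)
borrower, lender = m.group(1), m.group(2)
print(borrower, "/", lender)
```

A more robust variant of the same idea: run NER first, then assign each recognized name the role keyword that follows it in the text, so the regex only has to find 'herein called "X"' rather than the names themselves.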
Category: Data Science

How can I extract the reason for the legal compensation from a court report?

I'm working on a project (court-related). At a certain point, I have to extract the reason for the legal compensation. For instance, take these sentences from a court report: "Order mister X to pay EUR 5000 for compensation for unpaid wages" and "To cover damages, mister X must pay EUR 4000 to mister Y". I want to make an algorithm that can extract the motive of the legal compensation from such a sentence. For the first sentence, Order mister …
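As a baseline before anything learned, the two phrasings in the example can be covered by a couple of patterns; the pattern list would grow with the corpus, and a dependency parse or a trained extractor would generalize better:

```python
import re

sentences = [
    "Order mister X to pay EUR 5000 for compensation for unpaid wages",
    "To cover damages, mister X must pay EUR 4000 to mister Y",
]

# Each pattern names an amount and a motive group; order in the sentence differs.
patterns = [
    r"pay EUR (?P<amount>\d+) for compensation for (?P<motive>.+)$",
    r"^To cover (?P<motive>[\w ]+), .*pay EUR (?P<amount>\d+)",
]

def extract(sentence):
    for pat in patterns:
        m = re.search(pat, sentence)
        if m:
            return m.group("amount"), m.group("motive")
    return None

print([extract(s) for s in sentences])
```

Matches from a rule set like this can also serve as weak labels to bootstrap a sequence-labeling model once the corpus is large enough.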
Category: Data Science

How do I get word embeddings for out-of-vocabulary words using a transformer model?

When I tried to get the word embeddings of a sentence using bio_clinical BERT, for a sentence of 8 words I got 11 token IDs (plus start and end) because "embeddings" is an out-of-vocabulary word/token that gets split into em, bed, ding, s. I would like to know whether there are any aggregation strategies available that make sense, apart from taking the mean of these vectors.

    from transformers import AutoTokenizer, AutoModel

    # download and load model
    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT") …
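The common alternatives to mean pooling can be sketched without the model itself; the vectors below are random stand-ins for the hidden states at the four sub-token positions of "embeddings" (shapes match BERT-base's 768 dimensions):

```python
import numpy as np

# Hypothetical contextual vectors for the 4 sub-tokens "embeddings" was split
# into; in practice these come from the model's last hidden state at the
# corresponding positions.
rng = np.random.default_rng(0)
subtoken_vecs = rng.normal(size=(4, 768))

pools = {
    "mean":  subtoken_vecs.mean(axis=0),   # the strategy you already use
    "sum":   subtoken_vecs.sum(axis=0),    # preserves magnitude information
    "max":   subtoken_vecs.max(axis=0),    # element-wise max pooling
    "first": subtoken_vecs[0],             # first-subword convention, as used
                                           # for token classification in BERT
}
for name, vec in pools.items():
    assert vec.shape == (768,)
```

Mean and sum differ only by a constant factor per word, but that factor varies across words with different sub-token counts, which can matter for downstream cosine similarities; the first-subword convention is the one the original BERT paper used for NER.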
Category: Data Science

How to solve model overfitting in text classification?

This is my CNN model; I am doing text classification on mental health social media data, and the model is overfitting: the validation loss is much greater than the training loss. The dataset has three columns (Text, Title, label) and 7 classes: depression 256140, Anxiety 85916, bipolar 41262, mentalhealth 39161, BPD 37996, schizophrenia 17388, autism 7110. I am providing my model and its history. For this particular issue I need an interpretation of what the model is doing and a solution. Here is my …
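One thing the class counts above suggest, independent of the model architecture: the data is heavily imbalanced (depression outnumbers autism 36:1), which inflates validation loss on minority classes. A common mitigation is inverse-frequency class weights, passed to e.g. Keras's `Model.fit(class_weight=...)`; the formula below is the usual "balanced" heuristic:

```python
# Class counts copied from the question.
counts = {
    "depression": 256140, "Anxiety": 85916, "bipolar": 41262,
    "mentalhealth": 39161, "BPD": 37996, "schizophrenia": 17388, "autism": 7110,
}

total, k = sum(counts.values()), len(counts)

# weight(c) = total / (num_classes * count(c)): rare classes get large weights,
# and the weighted sample count per class is equal (total / k) for every class.
class_weight = {label: total / (k * n) for label, n in counts.items()}
print(class_weight)
```

This addresses only the imbalance; the overfitting itself would additionally call for dropout, weight decay, early stopping, or a smaller model, which can't be judged without the model code.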
Category: Data Science

Reversing a dependency tree into the original sentence

I'm wondering if it is possible to convert a parse tree such as (ROOT (S (NP (PRP$ My) (NN dog)) (ADVP (RB also)) (VP (VBZ likes) (S (VP (VBG eating) (NP (NN sausage))))))) back into "My dog also likes eating sausage." with Stanford CoreNLP or otherwise.
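For a bracketed (Penn Treebank style) tree like this one, the surface sentence is simply the terminals read left to right, so a small parser-free sketch suffices; NLTK's `Tree.fromstring(...).leaves()` does the same thing:

```python
def leaves(treebank):
    """Return the terminal tokens of a bracketed parse, left to right."""
    toks = treebank.replace("(", " ( ").replace(")", " ) ").split()
    # A terminal is a non-bracket token whose predecessor is not "(":
    # tokens directly after "(" are category labels like NP or VBZ.
    return " ".join(t for prev, t in zip(["("] + toks, toks)
                    if t not in "()" and prev != "(")

tree = ('(ROOT (S (NP (PRP$ My) (NN dog)) (ADVP (RB also)) '
        '(VP (VBZ likes) (S (VP (VBG eating) (NP (NN sausage)))))))')
print(leaves(tree))  # My dog also likes eating sausage
```

Note this yields tokens, not prose: for trees whose leaves include punctuation or contractions ("do n't"), a detokenization step is still needed to recover the original string exactly.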
Category: Data Science

Closed Domain Question Answering which doesn't answer Questions

I've been exploring closed-domain question answering implementations trained on the SQuAD 2.0 dataset. Ideally, such a model should not answer questions whose answers the context text corpus doesn't contain. But while implementing such models using the Haystack repo or the FARM repo, I'm finding that they always answer these questions even when they shouldn't. Is there any implementation available that takes into account the fact that it shouldn't answer a question when it doesn't find a suitable answer? References: …
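SQuAD 2.0 models typically do support abstaining: they score a "null" (no-answer) prediction alongside the best span, and the reader abstains when the null score beats the span score by a tunable margin. A minimal sketch of that decision rule (the names and threshold semantics here are illustrative, not any library's API):

```python
def pick_answer(best_span, span_score, null_score, threshold=0.0):
    """Return the span answer, or None to abstain (no suitable answer)."""
    # Abstain when the no-answer score exceeds the span score by more than
    # the tuned margin; raising the threshold makes the model answer less.
    if null_score - span_score > threshold:
        return None
    return best_span

print(pick_answer("Paris", span_score=0.2, null_score=0.9))  # None
print(pick_answer("Paris", span_score=0.9, null_score=0.1))  # Paris
```

If the readers you tried always answer, it is worth checking whether the no-answer option is enabled and how its margin is set; Haystack's reader exposes a related knob (a no-answer boost), so consult its documentation for the exact parameter.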
Category: Data Science

How to extract numerical information from text descriptions

I have an attribute that is the description of an operation (i.e. the description of a building consent), and I need to translate this into a mathematical operation. I need to find out the number of new dwellings that are going to be built, and I have to ignore any other operation. I am not sure how to tackle this problem. I can use regex and do lots of searches, but there should be a smarter way (is there?) using machine learning/text …
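A regex baseline is still a reasonable first step, and its matches can later serve as weak labels for a learned extractor. A sketch on made-up consent descriptions (the wording is hypothetical, not from your data):

```python
import re

# Hypothetical consent descriptions; only the new-dwelling count matters,
# other operations (demolition, tree removal, alterations) are ignored.
descriptions = [
    "Construct 4 new dwellings and demolish existing garage",
    "Alterations to existing dwelling",
    "Erect 2 dwellings; remove 1 tree",
]

def new_dwellings(text):
    # A number immediately followed by "dwelling(s)", optionally with "new".
    m = re.search(r"\b(\d+)\s+(?:new\s+)?dwellings?\b", text, re.I)
    return int(m.group(1)) if m else 0

print([new_dwellings(d) for d in descriptions])  # [4, 0, 2]
```

The machine-learning version of this task is usually framed as relation or slot extraction: tag number spans and dwelling mentions, then classify whether they are linked, training on examples the regex (plus manual review) produces.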
Category: Data Science

How does Stanford CRF encode NER string features?

Most features created by the NERFeatureFactory are strings, e.g. from usePrev, useNext, useNGrams, etc. From my understanding, that's too many tokens to fit in a dictionary or to use embeddings. I don't see how the UNKNOWN embedding would bring any value, given that most features are not known words. I've been looking at the code on GitHub but haven't figured it out yet. Example: "I love New York!" > love > love-I-W-PW, love-New-W-NW, #lo#, #ov#, #ve#, etc.
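The key point, illustrated below, is that a linear-chain CRF like Stanford's does not embed these strings at all: each distinct feature string becomes one sparse binary indicator dimension with its own learned weight, so the "vocabulary" is just a feature-to-index map built at training time. A minimal sketch of that encoding (feature names mimic the example above):

```python
# Feature strings are interned into a growing index; each gets a dimension.
feature_index = {}

def featurize(features):
    """Map a list of feature strings to a sparse binary vector {index: 1.0}."""
    vec = {}
    for f in features:
        idx = feature_index.setdefault(f, len(feature_index))
        vec[idx] = 1.0
    return vec

# Features fired for the token "love" in "I love New York!"
x = featurize(["W=love", "PW=I", "NW=New", "#lo#", "#ov#", "#ve#"])
print(len(feature_index), x)
```

Feature strings unseen at training time simply have no weight and contribute nothing at test time, which is why no UNKNOWN handling is needed; some implementations additionally hash features into a fixed-size space to bound memory.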
Category: Data Science

How do I go for NLP based on phrases instead of sentences?

I have a list of words in this format:

chem, chemistry
chemi, chemistry
chm, chemistry
chmstry, chemistry

Here, the first column is a shortened form of the word in the second column. I need to apply NLP (in Python 3) so that when a model is trained on this dataset and I give 'chmty' as input, it gives 'chemistry' as output. I don't want string similarity techniques; I want to build an NLP model.
Category: Data Science

How to stay up to date in NLP and use the best approaches?

There are many fast advancements in the NLP field (BERT, RoBERTa, ALBERT, XLNet), and no one can check the news or papers daily. Is there any way or site that keeps track of all these new developments and possibly provides a link to the code? For example, if someone needs to use text summarization, the suggested approach would be X, and so on.
Category: Data Science

Adding additional classes in Stanford NLP NER or spaCy

For the Stanford NER 3-class model, Location, Person, and Organization recognizers are available. Is it possible to add additional classes to this model? For example: Sports as one class to tag sports names. If not, is there any model where I can add additional classes? Note: I didn't exactly mean adding "Sports" as a class; I was wondering whether there is a possibility to add a custom class to that model. If it's not possible in Stanford, is it possible …
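For the spaCy side of the question, custom classes are added by training (or fine-tuning) the NER component on annotated examples. A sketch of the `(text, {"entities": [(start, end, label)]})` annotation shape used by spaCy-style training examples, with a custom SPORT label; the texts and offsets are made up, and the sanity check below is the kind of validation worth running before training:

```python
# Illustrative training examples for a custom SPORT entity class.
train = [
    ("I watched cricket yesterday", {"entities": [(10, 17, "SPORT")]}),
    ("She plays tennis and chess", {"entities": [(10, 16, "SPORT")]}),
]

# Sanity-check that each (start, end) span covers the intended surface string:
# off-by-one offsets are the most common cause of silent training failures.
spans = []
for text, ann in train:
    for start, end, label in ann["entities"]:
        spans.append((text[start:end], label))
print(spans)
```

Stanford NER supports the same idea through its own route: training a new CRF model from a TSV of token/label pairs with `edu.stanford.nlp.ie.crf.CRFClassifier` and a properties file, rather than extending the shipped 3-class model in place.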
Category: Data Science

Entity Recognition in Stanford NLP using Python

I am using Stanford CoreNLP from Python. I have taken the code from here. This is the code:

    from stanfordcorenlp import StanfordCoreNLP
    import logging
    import json

    class StanfordNLP:
        def __init__(self, host='http://localhost', port=9000):
            self.nlp = StanfordCoreNLP(host, port=port, timeout=30000, quiet=True,
                                       logging_level=logging.DEBUG)
            self.props = {
                'annotators': 'tokenize,ssplit,pos,lemma,ner,parse,depparse,dcoref,relation,sentiment',
                'pipelineLanguage': 'en',
                'outputFormat': 'json'
            }

        def word_tokenize(self, sentence):
            return self.nlp.word_tokenize(sentence)

        def pos(self, sentence):
            return self.nlp.pos_tag(sentence)

        def ner(self, sentence):
            return self.nlp.ner(sentence)

        def parse(self, sentence):
            return self.nlp.parse(sentence)

        def dependency_parse(self, sentence):
            return self.nlp.dependency_parse(sentence)

        def annotate(self, sentence):
            return …
Category: Data Science

Restrict Date parser in certain cases

Sorry if the title isn't self-explanatory; here is a detailed version. I created a date parser to parse dates from resumes. The ultimate goal is to find how many years of work experience a candidate has, based on the resume. The parser can catch dates in all formats, like: MM/DD/YY - MM/DD/YY, MM/DD/YYYY - MM/DD/YYYY, Apr 09 - Jul 11, 03/09 - 07/11, 2007 - 2010, etc. The way the parser works is that it first extracts all the …
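The restriction being asked about usually comes down to context: only date ranges that occur under an experience section should count. A toy sketch of that filter with year-only ranges on a made-up resume (real resumes would need the full set of date formats and fuzzier section detection):

```python
import re

# Hypothetical resume text; section headers gate which ranges are counted.
resume = """
Education
2007 - 2010 BSc Computer Science
Experience
2011 - 2015 Data Analyst
2016 - 2019 Data Scientist
"""

def experience_years(text):
    in_experience, total = False, 0
    for line in text.splitlines():
        header = line.strip().lower()
        if header in {"education", "experience"}:
            # Crude section tracking: ranges only count inside Experience.
            in_experience = header == "experience"
            continue
        m = re.search(r"\b(\d{4})\s*-\s*(\d{4})\b", line)
        if m and in_experience:
            total += int(m.group(2)) - int(m.group(1))
    return total

print(experience_years(resume))  # 4 + 3 = 7
```

This also ignores a second subtlety (overlapping ranges should be merged before summing), which is worth handling once the basic section filter works.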
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.