I have a JSON file (tweets.json) that contains tweets (sentences) along with the name of the author.

Objective 1: Get the most frequent entities from the tweets.
Objective 2: Find out the sentiment/polarity of each author towards each of the entities.

Sample Input: Assume we have only 3 tweets:
Tweet1 by Author1: Pink Pearl Apples are tasty but Empire Apples are not.
Tweet2 by Author2: Empire Apples are very tasty.
Tweet3 by Author3: Pink Pearl Apples are not tasty.
Sample …
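A minimal sketch of one possible approach, assuming spaCy for entity extraction and TextBlob for sentence-level polarity; the "author"/"text" field names in tweets.json are assumptions, and for product-like names such as "Pink Pearl Apples" you may need noun chunks instead of named entities:

    import json
    from collections import Counter, defaultdict

    import spacy
    from textblob import TextBlob

    nlp = spacy.load("en_core_web_sm")

    # assumed structure: [{"author": "Author1", "text": "Pink Pearl Apples are ..."}, ...]
    with open("tweets.json") as f:
        tweets = json.load(f)

    entity_counts = Counter()
    author_entity_polarity = defaultdict(list)

    for tweet in tweets:
        doc = nlp(tweet["text"])
        for ent in doc.ents:  # entities found by spaCy
            entity_counts[ent.text] += 1
            # crude per-entity sentiment: polarity of the sentence containing the entity
            polarity = TextBlob(ent.sent.text).sentiment.polarity
            author_entity_polarity[(tweet["author"], ent.text)].append(polarity)

    print(entity_counts.most_common(5))  # Objective 1
    for (author, entity), scores in author_entity_polarity.items():
        print(author, entity, sum(scores) / len(scores))  # Objective 2

Note that sentence-level polarity gives both apple varieties the same score for Tweet1, so a finer-grained (aspect-based) sentiment method would be needed to separate them.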
For a university project, I need to send text in Spanish via SMS. As these have a cost, I am trying to compress this text, albeit in an inefficient way. The idea is to first generate a permutation of two-character codes drawn from many alphabets (Finnish, Cyrillic, etc.), and to assign each code a word of more than two characters (so that the replacement actually compresses it). Then I take each word in a sentence and assign it its associated …
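A minimal sketch of the codebook idea described above; the codebook contents are made-up examples, and in practice it would be generated from the most frequent words of the corpus:

    # hypothetical codebook: frequent Spanish words -> two-character codes from other alphabets
    codebook = {
        "porque": "бш",
        "también": "дφ",
        "entonces": "юλ",
    }
    decodebook = {code: word for word, code in codebook.items()}

    def compress(sentence):
        # replace each word that has an associated code; leave the rest untouched
        return " ".join(codebook.get(w, w) for w in sentence.split())

    def decompress(sentence):
        # assumes the codes never collide with real words in the text
        return " ".join(decodebook.get(w, w) for w in sentence.split())

    msg = compress("no fui porque también llovió")
    print(msg, "->", decompress(msg))

One thing worth checking: any character outside the GSM-7 alphabet (Cyrillic, Greek, etc.) forces the SMS into UCS-2 encoding, which lowers the per-message limit from 160 to 70 characters, so the saving from shorter words may be partly offset.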
What is sentence embedding? How would you do sentence embedding for a sentence like: "How old are you?" How do you use word embedding to create a sentence embedding?
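A minimal sketch of the simplest word-embedding-based sentence embedding, which is just mean pooling of the word vectors; spaCy's en_core_web_md is an assumption here, any pretrained word vectors would do:

    import numpy as np
    import spacy

    nlp = spacy.load("en_core_web_md")  # a model that ships with word vectors

    doc = nlp("How old are you?")
    # sentence embedding = average of the word vectors of the tokens
    sentence_vector = np.mean([token.vector for token in doc], axis=0)
    print(sentence_vector.shape)  # e.g. (300,)

(doc.vector gives the same average directly; dedicated models such as sentence-transformers usually produce better sentence embeddings than plain averaging.)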
When learning Add-1 smoothing, I found that somehow we are adding 1 to each word in our vocabulary, but not considering start-of-sentence and end-of-sentence as two words in the vocabulary. Let me give an example to explain. Example: Assume we have a corpus of three sentences: "John read Moby Dick", "Mary read a different book", and "She read a book by Cher". After training our bi-gram model on this corpus of three sentences, we need to evaluate the probability of …
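For concreteness, a small sketch of add-1 (Laplace) smoothed bigram probabilities over the corpus from the question. Whether <s> and </s> are counted in V is exactly the choice being asked about; the code below shows one possible convention (counting every observed token type, markers included), not the only one:

    from collections import Counter

    sentences = [
        "John read Moby Dick",
        "Mary read a different book",
        "She read a book by Cher",
    ]

    unigrams = Counter()
    bigrams = Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    # one convention: V includes every observed token type, here including <s> and </s>
    V = len(unigrams)

    def p_add1(w_prev, w):
        # add-1 smoothed bigram probability P(w | w_prev)
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

    print(p_add1("<s>", "John"))  # P(John | <s>)
    print(p_add1("read", "a"))    # P(a | read)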
I have been reading about both these techniques for finding the root of a word, but how do we decide which one to prefer? Is "Lemmatization" always better than "Stemming"?
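A quick sketch comparing the two with NLTK, just to make the difference concrete (the word list and POS tags are arbitrary examples):

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet")  # resource needed by the lemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["studies", "running", "meeting"]:
        print(word,
              "| stem:", stemmer.stem(word),
              "| lemma:", lemmatizer.lemmatize(word, pos="v"))

The stem is often not a real word (e.g. "studi"), while the lemma is a dictionary form, but the lemmatizer needs the right part of speech and is slower, so neither is always better.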
To my current knowledge, GloVe word vectors are optimized so that their dot product satisfies w_i · w_j = log P(i|j), with the probability computed from a co-occurrence matrix. However, the dot product is a commutative operation, whilst the log probability isn't. Is this issue being addressed in GloVe? Am I missing something?
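For reference, the objective actually minimized in the GloVe paper uses two sets of vectors (word vectors w and separate context vectors \tilde{w}) plus bias terms, rather than a single symmetric dot product (written from memory, so check the paper for the exact notation):

    J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

The bias b_i absorbs the log of the marginal count of word i, and \tilde{b}_j is added to restore symmetry, which is how the paper reconciles the symmetric dot product with the asymmetric conditional probability.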
I have a large pool of scanned county documents. I need to extract information like the document title, borrower name & address, lender name & address, etc. The text is like this, e.g.: the deed of trust, between abc llc, a limited company, whose address is XXXXXX, herein called "borrower", and xyz, whose address is XXXXX, herein called "lender". I used a named entity recognition method to extract the names, and it works well, but how would I know which name is the borrower and which one is the lender? Can …
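A minimal rule-based sketch for the role question: once NER has found the names, the surrounding boilerplate ('herein called "borrower"') can be matched with a regex to decide which name plays which role. The pattern below is tuned to the example wording only, so treat it as an assumption:

    import re

    text = ('the deed of trust, between abc llc, a limited company, whose address is XXXXXX, '
            'herein called "borrower", and xyz, whose address is XXXXX, herein called "lender".')

    # capture the party name that precedes each 'herein called "<role>"' clause
    pattern = re.compile(
        r'\b(?:between|and)\s+(?P<name>.+?),.*?herein called\s+"(?P<role>\w+)"',
        re.IGNORECASE,
    )

    for m in pattern.finditer(text):
        print(m.group("role"), "->", m.group("name"))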
I'm working on a project (court-related). At a certain point, I have to extract the reason for the legal compensation. For instance, take these sentences (from a court report): "Order mister X to pay EUR 5000 for compensation for unpaid wages" and "To cover damages, mister X must pay EUR 4000 to mister Y". I want to make an algorithm that can extract the motive of the legal compensation from such a sentence. For the first sentence Order mister …
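A rough sketch of one way to start: match the amount, then take the phrase introduced by "for compensation for" / "To cover" as the motive. The patterns are fitted to the two example sentences only and are an assumption, not a general solution:

    import re

    sentences = [
        "Order mister X to pay EUR 5000 for compensation for unpaid wages",
        "To cover damages, mister X must pay EUR 4000 to mister Y",
    ]

    # order matters: try the more specific pattern first
    patterns = [
        r"pay\s+EUR\s+\d+\s+for compensation for\s+(?P<motive>.+)$",
        r"^To cover\s+(?P<motive>[^,]+),",
    ]

    for s in sentences:
        for p in patterns:
            m = re.search(p, s)
            if m:
                print(s, "->", m.group("motive"))
                break

For anything less formulaic, a dependency parse (e.g. with spaCy) of the complement of "pay"/"cover" would be a more robust starting point than raw regexes.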
When I try to get word embeddings for a sentence using Bio_ClinicalBERT, for a sentence of 8 words I get 11 token IDs (+ start and end tokens), because "embeddings" is an out-of-vocabulary word/token that gets split into em, bed, ding, s. I would like to know if there are any aggregation strategies available that make sense, apart from taking the mean of these vectors.

    from transformers import AutoTokenizer, AutoModel

    # download and load model
    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT") …
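One common aggregation besides a plain mean is to pool the subword vectors back to word level using the fast tokenizer's word alignment (assuming a fast tokenizer is available for this checkpoint); mean pooling per word is sketched below, but max pooling or taking only the first subtoken are also used:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

    sentence = "patient embeddings are useful"
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (num_subtokens, hidden_size)

    word_ids = enc.word_ids()  # maps each subtoken to a word index (None for [CLS]/[SEP])
    word_vectors = {}
    for idx, wid in enumerate(word_ids):
        if wid is not None:
            word_vectors.setdefault(wid, []).append(hidden[idx])

    # mean-pool the subtoken vectors of each word
    pooled = [torch.stack(vecs).mean(dim=0) for _, vecs in sorted(word_vectors.items())]
    print(len(pooled), pooled[0].shape)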
This is my CNN model. I am doing text classification on mental health social media data, and the model is overfitting, as the validation loss is much greater than the training loss. There are three columns (Text, Title, label) and 7 classes in the dataset:

depression: 256140
Anxiety: 85916
bipolar: 41262
mentalhealth: 39161
BPD: 37996
schizophrenia: 17388
autism: 7110

I am providing my model and its history. For this particular issue I need an interpretation of the model's behaviour and a solution. Here is my …
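Given the heavy class imbalance (depression alone has roughly 36x the samples of autism), one standard thing to try alongside stronger regularization is class weighting. A sketch with scikit-learn, using the counts from the question; passing the resulting dictionary to Keras via model.fit(..., class_weight=...) is the assumed usage:

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    # class counts from the question
    counts = {
        "depression": 256140, "Anxiety": 85916, "bipolar": 41262,
        "mentalhealth": 39161, "BPD": 37996, "schizophrenia": 17388, "autism": 7110,
    }

    # expand the counts into one integer label per example
    labels = np.concatenate([np.full(n, i) for i, n in enumerate(counts.values())])
    weights = compute_class_weight(class_weight="balanced",
                                   classes=np.unique(labels), y=labels)
    class_weight = dict(zip(range(len(counts)), weights))
    print(class_weight)

    # then: model.fit(X_train, y_train, validation_data=..., class_weight=class_weight)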
I'm wondering if it is possible to convert a parse tree such as (ROOT (S (NP (PRP$ My) (NN dog)) (ADVP (RB also)) (VP (VBZ likes) (S (VP (VBG eating) (NP (NN sausage))))) back into "My dog also likes eating sausage." with Stanford CoreNLP or otherwise.
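If the bracketed parse is available as a string, NLTK can read it back in, and the sentence is just the tree's leaves; a small sketch (the closing brackets missing from the snippet above are restored here):

    from nltk import Tree

    parse = ("(ROOT (S (NP (PRP$ My) (NN dog)) (ADVP (RB also)) "
             "(VP (VBZ likes) (S (VP (VBG eating) (NP (NN sausage))))) (. .)))")
    tree = Tree.fromstring(parse)
    print(" ".join(tree.leaves()))  # My dog also likes eating sausage .

Detokenizing (re-attaching the final period, handling quotes, etc.) needs an extra step, e.g. NLTK's TreebankWordDetokenizer.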
In my current NLP work, I am extracting triples using the triple extraction functions in the Stanford NLP and spaCy libraries. I am looking for a good method to evaluate how good the extraction has been. Any suggestions?
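Without a gold standard there is no automatic score, so a common route is to hand-annotate a sample of sentences with the expected triples and compute precision/recall/F1 of the extractor against them. A minimal sketch (the gold and predicted triples below are placeholders):

    def triple_scores(gold, predicted):
        """Exact-match precision/recall/F1 over sets of (subject, relation, object) triples."""
        gold, predicted = set(gold), set(predicted)
        tp = len(gold & predicted)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    gold = [("dog", "likes", "sausage"), ("Mary", "read", "book")]
    predicted = [("dog", "likes", "sausage"), ("Mary", "read", "a book")]
    print(triple_scores(gold, predicted))  # exact match is strict; partial-credit variants are common too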
I've been exploring closed-domain question answering implementations trained on the SQuAD 2.0 dataset. Ideally, such a model should not answer questions whose answers are not contained in the context text corpus. But while implementing such models using the Haystack repo or the FARM repo, I'm finding that they always answer these questions even when they shouldn't. Is there any implementation available that takes into account the fact that it shouldn't answer a question when it doesn't find a suitable answer? References: …
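For what it's worth, the plain transformers question-answering pipeline exposes a flag for SQuAD 2.0-style "no answer" handling; a sketch (deepset/roberta-base-squad2 is just one SQuAD 2.0 checkpoint, and how reliably the null answer fires still depends on the model):

    from transformers import pipeline

    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

    context = "The Eiffel Tower is located in Paris and was completed in 1889."
    result = qa(question="Who painted the Mona Lisa?",
                context=context,
                handle_impossible_answer=True)  # allow an empty answer when nothing fits
    print(result)  # an empty answer string signals that no answer was found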
I have an attribute that is the description of an operation (i.e., the description of a building consent), and I need to translate this into a mathematical operation. I need to find out the number of new dwellings that are going to be built, and I have to ignore any other operation. I am not sure how to tackle this problem. I can do regex and do lots of searches, but there should be a smarter way (is there?) using machine learning/text …
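As a baseline before reaching for machine learning, a regex that pulls the dwelling count out of the consent description can go a long way; a sketch (the descriptions below are invented examples, not real consent texts):

    import re

    descriptions = [
        "Construct 4 new dwellings and demolish existing garage",
        "Erect three dwellings with associated carparking",
        "Alterations to existing dwelling",  # no new dwellings -> should be ignored
    ]

    WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

    def new_dwellings(text):
        m = re.search(r"\b(\d+|one|two|three|four|five)\s+(?:new\s+)?dwellings\b",
                      text, re.IGNORECASE)
        if not m:
            return 0
        value = m.group(1).lower()
        return int(value) if value.isdigit() else WORD_NUMBERS[value]

    for d in descriptions:
        print(d, "->", new_dwellings(d))

If the phrasing is too varied for patterns like this, the same idea (extracted counts, keywords) can instead provide labels or features for a supervised text classifier.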
Most features created by the NERFeatureFactory are strings, e.g. from usePrev, useNext, useNGrams, etc. From my understanding, that's too many tokens to fit in a dictionary or to use embeddings. I don't see how the UNKNOWN embedding would bring any value, given that most features are not known words. I've been looking at the code on GitHub but haven't figured it out yet. Example: for "I love New York!", the token "love" produces features such as love-I-W-PW, love-New-W-NW, #lo#, #ov#, #ve#, etc.
I have a list of words in this format:

chem, chemistry
chemi, chemistry
chm, chemistry
chmstry, chemistry

Here, the first column is an abbreviated form of the actual word in the second column. I need to apply NLP (in Python 3) so that when a model is trained on this dataset and I give 'chmty' as input, it will give 'chemistry' as output. I don't want string similarity techniques; I want to build an NLP model.
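One model-based option (rather than a pure string-similarity lookup) is to treat it as classification over character n-grams: vectorize the abbreviation with a character-level TF-IDF and train a classifier whose classes are the full words. A sketch with scikit-learn; the "biology" rows are invented just so there is more than one class, and a real dataset would be needed for this to generalize:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    training_pairs = [
        ("chem", "chemistry"), ("chemi", "chemistry"),
        ("chm", "chemistry"), ("chmstry", "chemistry"),
        ("bio", "biology"), ("biol", "biology"), ("blgy", "biology"),
    ]
    abbreviations, full_words = zip(*training_pairs)

    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-gram features
        LogisticRegression(max_iter=1000),
    )
    model.fit(abbreviations, full_words)

    print(model.predict(["chmty"]))  # should map to 'chemistry'

A character-level sequence-to-sequence model is the heavier-weight alternative if abbreviations of unseen words also need to be expanded.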
There are many fast advancements in the NLP field (BERT, RoBERTa, ALBERT, XLNet, ...), and no one can check the news or papers daily. Is there any way or site that keeps track of all these new developments and possibly provides a link to the code? For example, if someone needs to use text summarization, the suggested approach would be X, and so on.
For the Stanford NER 3-class model, Location, Person, and Organization recognizers are available. Is it possible to add additional classes to this model? For example, Sports as one class to tag sport names. Or, if not, is there any model to which I can add additional classes? Note: I didn't exactly mean adding "sports" as a class; I was wondering whether there is a possibility of adding a custom class to that model. If it's not possible in Stanford, is it possible …
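As far as I know, you can't simply append a class to the pretrained 3-class model; a new class generally means training a model on data annotated with that class, either with Stanford NER's CRFClassifier training or with another toolkit. A rough sketch of the latter with spaCy 3, where the SPORT label and the two training sentences are made up (a real model would need far more annotated data):

    import spacy
    from spacy.training import Example

    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")
    ner.add_label("SPORT")

    TRAIN_DATA = [
        ("I played cricket yesterday", {"entities": [(9, 16, "SPORT")]}),
        ("She loves watching football", {"entities": [(19, 27, "SPORT")]}),
    ]

    optimizer = nlp.initialize()
    for epoch in range(20):
        losses = {}
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, losses=losses)

    doc = nlp("We watched cricket all afternoon")
    print([(ent.text, ent.label_) for ent in doc.ents])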
Sorry if the title isn't self-explanatory; here is a detailed version. I created a date parser to extract dates from resumes. The ultimate goal is to find how many years of work experience a candidate has, based on the resume. The parser can catch dates in all formats, like:

MM/DD/YY - MM/DD/YY
MM/DD/YYYY - MM/DD/YYYY
Apr 09 - Jul 11
03/09 - 07/11
2007 - 2010
etc.

The way the parser works is that it first extracts all the …
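Once the ranges are extracted, the "years of experience" part is mostly interval arithmetic: normalize each range to start/end dates, merge overlaps so parallel or duplicated jobs are not double-counted, and sum the durations. A sketch assuming the parser has already normalized the extracted ranges into date pairs (the example values mirror the formats listed above):

    from datetime import date

    # (start, end) pairs as already normalized by the date parser (assumption)
    ranges = [
        (date(2009, 4, 1), date(2011, 7, 1)),  # Apr 09 - Jul 11
        (date(2009, 3, 1), date(2011, 7, 1)),  # 03/09 - 07/11 (overlaps the range above)
        (date(2007, 1, 1), date(2010, 1, 1)),  # 2007 - 2010   (also overlaps)
    ]

    # merge overlapping employment periods
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    total_days = sum((end - start).days for start, end in merged)
    print(round(total_days / 365.25, 1), "years of experience")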