How to predict the sentiment of entities from a tweet?

I have a JSON file (tweets.json) that contains tweets (sentences) along with the author's name. Objective 1: extract the most frequent entities from the tweets. Objective 2: determine each author's sentiment/polarity towards each of those entities. Sample input: assume we have only 3 tweets. Tweet1 by Author1: "Pink Pearl Apples are tasty but Empire Apples are not." Tweet2 by Author2: "Empire Apples are very tasty." Tweet3 by Author3: "Pink Pearl Apples are not tasty." Sample …
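A rough sketch of one possible pipeline, assuming spaCy with the en_core_web_sm model installed; the tweets are hard-coded from the sample, noun chunks stand in for entities, and the one-word lexicon plus negation check is a toy heuristic, not a real sentiment model:

```python
# Sketch: spaCy noun chunks as "entities", plus a toy negation-aware
# polarity heuristic (a real sentiment model should replace this).
from collections import Counter, defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

tweets = [  # toy stand-in for tweets.json
    ("Author1", "Pink Pearl Apples are tasty but Empire Apples are not."),
    ("Author2", "Empire Apples are very tasty."),
    ("Author3", "Pink Pearl Apples are not tasty."),
]
POSITIVE = {"tasty"}  # illustrative one-word lexicon

entity_counts = Counter()
polarity = defaultdict(int)

for author, text in tweets:
    doc = nlp(text)
    for chunk in doc.noun_chunks:  # rough stand-in for entities
        entity_counts[chunk.text] += 1
        # look at the few tokens following the entity mention
        window = [t.lower_ for t in doc[chunk.end:chunk.end + 4]]
        if "not" in window:  # toy negation handling
            polarity[(author, chunk.text)] -= 1
        elif any(w in POSITIVE for w in window):
            polarity[(author, chunk.text)] += 1

print(entity_counts.most_common(2))
for key, score in polarity.items():
    print(key, "positive" if score > 0 else "negative")
```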
Category: Data Science

When using padding in sequence models, is Keras validation accuracy valid/reliable?

I have a group of non-zero sequences of different lengths, and I am using a Keras LSTM to model them. I use the Keras Tokenizer to tokenize (token indices start from 1). To make the sequences the same length, I use padding. An example of padding: # [0,0,0,0,0,10,3] # [0,0,0,0,10,3,4] # [0,0,0,10,3,4,5] # [10,3,4,5,6,9,8] To evaluate whether the model can generalize, I use a validation set with a 70/30 split. At the end of each epoch …
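For reference, a minimal sketch of that setup (names and sizes illustrative); the point most relevant to validation accuracy is that mask_zero=True makes downstream layers skip the 0-padded timesteps:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, models

texts = ["a b", "a b c", "a b c d"]      # toy sequences
tokenizer = Tokenizer()                   # token indices start at 1
tokenizer.fit_on_texts(texts)
seqs = tokenizer.texts_to_sequences(texts)
padded = pad_sequences(seqs, maxlen=7)    # pre-padding with 0 by default

model = models.Sequential([
    # mask_zero=True tells downstream layers to ignore the 0 padding steps
    layers.Embedding(input_dim=len(tokenizer.word_index) + 1,
                     output_dim=16, mask_zero=True),
    layers.LSTM(32),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(padded, labels, validation_split=0.3) would give the 70/30 split
```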
Category: Data Science

Optimal input setup for character-level text classification RNN

I want to classify 500-character text samples as to whether they look like natural language, using a character-level RNN. I'm unsure of the best way to feed the input to the RNN. Here are two approaches I've thought of: provide all 500 characters (one per time step) to the RNN and predict a binary class, $\{0,1\}$; or provide shorter overlapping segments (e.g. 10 characters) and predict the next (e.g. the 11th) character, converting this to classification by taking the …
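A minimal sketch of the first option, assuming integer-encoded characters with 0 reserved for padding (all names and sizes illustrative):

```python
import torch
import torch.nn as nn

class CharClassifier(nn.Module):
    """Reads 500 integer-encoded characters, outputs P(natural language)."""
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):               # x: (batch, 500) long tensor
        emb = self.embed(x)             # (batch, 500, embed_dim)
        _, h = self.rnn(emb)            # h: (1, batch, hidden_dim)
        return torch.sigmoid(self.head(h[-1]))  # (batch, 1)

model = CharClassifier(vocab_size=128)
probs = model(torch.randint(1, 128, (4, 500)))  # 4 dummy samples
```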
Category: Data Science

Clarification on "predict the next character given the previous 100 characters"

I am studying Justin Johnson's lecture on RNNs (lecture recording: https://www.youtube.com/watch?v=dUzLD91Sj-o&list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r&index=12&t=3177s). One of the examples is character-level language modeling: predicting the next character given the previous characters. At 33:03 in the video linked above, Justin discusses training an RNN that processes the works of William Shakespeare and tries to predict the next character given the previous 100 characters. What does "given the previous 100 characters" mean? The lecture slides (https://web.eecs.umich.edu/~justincj/slides/eecs498/498_FA2019_lecture12.pdf) contain the following figures: It …
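One common reading is that training pairs are fixed-length windows: the input is characters t..t+99 and the target is the same window shifted by one, so every position predicts the character after it. A toy sketch (the corpus string is a stand-in for the Shakespeare text):

```python
# Build (input, target) pairs: each 100-character window is paired with the
# same window shifted by one, so position t predicts character t+1.
corpus = "To be, or not to be, that is the question. " * 200  # toy stand-in
stoi = {c: i for i, c in enumerate(sorted(set(corpus)))}

SEQ_LEN = 100
inputs, targets = [], []
for i in range(0, len(corpus) - SEQ_LEN - 1, SEQ_LEN):
    chunk = corpus[i : i + SEQ_LEN + 1]            # 101 characters
    inputs.append([stoi[c] for c in chunk[:-1]])   # characters 0..99
    targets.append([stoi[c] for c in chunk[1:]])   # characters 1..100
```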
Category: Data Science

Transfer learning between Language Model and classification

Following this fast.ai lecture, I am trying to understand the mechanism of transfer learning in NLP from a general language model (LM) to a classification problem. What exactly is taken from the language-model training? Is it just the word embeddings? Or is it also the weights of the LSTM cell? The architecture of the neural net should be quite different: in an LM you output a prediction after every sequence step, whereas in a classification problem you would …
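For intuition, here is a sketch in PyTorch (module names are illustrative, not fast.ai's actual API) in which both the embedding and the LSTM weights are transferred and only the output head is replaced:

```python
import torch.nn as nn

class LanguageModel(nn.Module):
    def __init__(self, vocab, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.Linear(dim, vocab)   # predicts next token per step

class Classifier(nn.Module):
    def __init__(self, vocab, n_classes, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, n_classes)  # reads the final hidden state

lm = LanguageModel(vocab=10000)
clf = Classifier(vocab=10000, n_classes=2)
# transfer everything except the LM's per-step decoder
clf.embed.load_state_dict(lm.embed.state_dict())
clf.lstm.load_state_dict(lm.lstm.state_dict())
```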
Category: Data Science

In n-gram model smoothing in NLP, why don't we count start- and end-of-sentence tokens?

When learning add-1 smoothing, I found that we add 1 for each word in the vocabulary, but do not count the start-of-sentence and end-of-sentence markers as two extra words in that vocabulary. Let me give an example to explain. Example: assume we have a corpus of three sentences: "John read Moby Dick", "Mary read a different book", and "She read a book by Cher". After training our bi-gram model on this corpus of three sentences, we need to evaluate the probability of …
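For concreteness, a sketch using the corpus from the question; whether `<s>` and `</s>` count as vocabulary items only changes the V added to the denominator:

```python
from collections import Counter

sentences = [["John", "read", "Moby", "Dick"],
             ["Mary", "read", "a", "different", "book"],
             ["She", "read", "a", "book", "by", "Cher"]]

padded = [["<s>"] + s + ["</s>"] for s in sentences]
unigrams = Counter(w for s in padded for w in s)
bigrams = Counter((a, b) for s in padded for a, b in zip(s, s[1:]))

V_without = len({w for s in sentences for w in s})  # 11 word types
V_with = V_without + 2                               # + <s> and </s>

def p_add1(w2, w1, V):
    """Add-1 smoothed bigram probability P(w2 | w1)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_add1("read", "John", V_without))  # boundary tokens not in V
print(p_add1("read", "John", V_with))     # boundary tokens counted in V
```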
Category: Data Science

Importance of random initialisation vs. number of hidden units

A question crossed my mind not so long ago: I am running language-model experiments with an RNN (always with the same network topology: 50 hidden units, and 10M "direct connections" that emulate n-gram models) on different fractions of a 9M-word corpus (10, 25, 50, 75, 100%). I noticed that while perplexity generally decreases as the training data becomes more abundant, sometimes it does not. Latest example: 143, 118, 109, 106, 112. My first thought was network initialization, so I …
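One way to test the initialization hypothesis is to repeat each corpus fraction over several seeds and look at the spread; a sketch where train_and_eval is a placeholder for the real training run:

```python
import random
import statistics

def train_and_eval(fraction, seed):
    """Placeholder: train the RNN LM on `fraction` of the corpus with the
    given seed and return validation perplexity."""
    random.seed(seed)
    return 100 + 50 * random.random()  # stand-in value

for fraction in (0.10, 0.25, 0.50, 0.75, 1.00):
    ppls = [train_and_eval(fraction, seed) for seed in range(5)]
    print(f"{fraction:.0%}: mean={statistics.mean(ppls):.1f} "
          f"sd={statistics.stdev(ppls):.1f}")
```

If the standard deviation at a given fraction is comparable to the jump from 106 to 112, initialization noise alone could explain the non-monotonic curve.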
Category: Data Science

How to improve my imbalanced-data NLP model?

I want to predict a probability for each patient's health and get the 10 most ill patients in a hospital. I have each patient's condition notes, medical notes, diagnosis notes, and lab notes for each day. Current approach: vectorize all the notes using spaCy's scispacy model and sum the vectors grouped by patient ID and day (200 columns); normalize these to unit vectors (200 columns); apply a moving-average function to the vectors grouped …
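A sketch of that pipeline with pandas (column names are assumptions; en_core_sci_md is a scispacy model with 200-dimensional word vectors and is assumed installed):

```python
import numpy as np
import pandas as pd
import spacy

nlp = spacy.load("en_core_sci_md")  # assumed scispacy model, 200-dim vectors

# columns assumed: patient_id, day, note
notes = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "day": [1, 2, 1],
    "note": ["chest pain", "pain subsiding", "routine checkup"],
})

# 1) vectorize each note, then sum vectors per (patient, day)
notes["vec"] = notes["note"].apply(lambda t: nlp(t).vector)
daily = notes.groupby(["patient_id", "day"])["vec"].apply(
    lambda vs: np.sum(np.stack(list(vs)), axis=0))

# 2) scale each daily vector to unit length
daily = daily.apply(lambda v: v / (np.linalg.norm(v) or 1.0))

# 3) per-patient moving average over a 3-day window (expand to columns first)
mat = pd.DataFrame(np.stack(daily.to_numpy()), index=daily.index)
smoothed = mat.groupby(level="patient_id").rolling(3, min_periods=1).mean()
```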
Category: Data Science

Question about computing language-modeling loss with multiple GPUs

When training BERT, GPT, or another language model, we use the mean of the cross entropy as the loss function (ignoring label smoothing). Here $|B|$ denotes the batch size and $\mathrm{len}_i$ the target length of the $i$-th sequence. $$L = \frac{\sum_{i=1}^{|B|}\sum_{j=1}^{\mathrm{len}_i}\mathrm{ce}(y_{ij},\hat{y}_{ij})}{\sum_{i=1}^{|B|}\mathrm{len}_i} \tag{1}$$ With multiple GPUs, the common forward process is: split the data across the GPUs; compute the loss on each GPU; reduce the losses (most of the time we simply take the mean of the per-GPU losses). Now if we combine those above together, …
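The subtlety can be shown with toy numbers: the mean of per-GPU mean losses equals Eq. (1) only when every GPU holds the same number of target tokens; reducing the loss sums and token counts separately recovers Eq. (1):

```python
# Toy example: per-token cross-entropy losses on two GPUs.
gpu0 = [1.0, 1.0]                # 2 target tokens
gpu1 = [4.0, 4.0, 4.0, 4.0]      # 4 target tokens

# Eq. (1): one global token-weighted mean
global_mean = (sum(gpu0) + sum(gpu1)) / (len(gpu0) + len(gpu1))   # 3.0

# Common multi-GPU reduction: mean of per-GPU means
naive = (sum(gpu0) / len(gpu0) + sum(gpu1) / len(gpu1)) / 2       # 2.5

# Fix: reduce the sums and the token counts separately, then divide
loss_sum = sum(gpu0) + sum(gpu1)
token_count = len(gpu0) + len(gpu1)
fixed = loss_sum / token_count                                     # 3.0

print(global_mean, naive, fixed)
```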
Category: Data Science

State-of-the-art Python packages that can evaluate language similarity

I am trying to evaluate the likelihood of generating a specific sentence out of a large set of sentences. To do this, I start with a simple approach: training a custom n-gram language model and calculating perplexity values for a list of sentences. I found that the package KenLM (https://www.aclweb.org/anthology/W11-2123/) is often used for this task. However, it is fairly old (published in 2011). On the other hand, I noticed that the two most famous state-of-the-art NLP packages, …
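For comparison with KenLM, per-sentence perplexity under a pretrained causal LM takes a few lines with Hugging Face transformers; a sketch using GPT-2 (the model choice is illustrative):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # passing labels makes the model return the mean token cross-entropy
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

print(perplexity("the cat sat on the mat"))
print(perplexity("mat the on sat cat the"))  # should score higher (worse)
```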
Category: Data Science

What are the differences between BNF and JSGF in NLP?

I wonder what the differences are between BNF (Backus-Naur Form) and JSGF (Java Speech Grammar Format). The former is a notation for context-free grammars taught in CS224, but I learned that the latter is also used. Could anyone tell me which one is better and what their differences are?
Category: Data Science

A multi-label text classification problem

I'm looking to solve a multi-label text classification problem, but I don't really know how to formulate it correctly so I can look it up. Here is my problem: say I have the document "I want to learn NLP. I can do that by reading NLP books or watching tutorials on the internet. That would help me find a job in NLP." I want to classify the sentences into 3 labels (for example): objective, method, and result. The …
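As a baseline formulation, each sentence can be treated as one example with a single class label; a scikit-learn sketch using the document from the question (labels are the ones proposed there):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "I want to learn NLP.",
    "I can do that by reading NLP books or watching tutorials on the internet.",
    "That would help me find a job in NLP.",
]
labels = ["objective", "method", "result"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(sentences, labels)
print(clf.predict(["I will study word embeddings to reach my goal."]))
```

If a sentence may carry several labels at once, the same pipeline can be wrapped in OneVsRestClassifier with binarized label sets, which is the usual multi-label formulation.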
Category: Data Science

Can Domain Adaptation improve the performance of Sentiment Analysis?

Does domain adaptation have any effect on sentiment-analysis results? I am going to train a BERT language model on texts from the health domain, then apply opinion mining to find which texts carry positive or negative sentiment. I have run this on pre-trained BERT and obtained some results; my question is whether domain adaptation will help increase the performance of my model.
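Domain adaptation here usually means continuing BERT's masked-language-model pretraining on the health texts before fine-tuning for sentiment. A sketch with Hugging Face transformers (the corpus and hyperparameters are placeholders):

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

texts = ["health-domain sentence one.", "health-domain sentence two."]
encodings = [tokenizer(t, truncation=True, max_length=128) for t in texts]

# the collator masks 15% of tokens on the fly for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-health", num_train_epochs=1),
    train_dataset=encodings,
    data_collator=collator,
)
trainer.train()
# afterwards, load "bert-health" into AutoModelForSequenceClassification
# and fine-tune on the labelled sentiment data as before
```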
Category: Data Science

A simple attention-based text prediction model from scratch using PyTorch

I first asked this question on Code Review SE, but a user recommended posting it here instead. I have created a simple self-attention-based text prediction model using PyTorch. (The attention formula used to build the attention layer was given as an image in the original post.) I want to validate whether the whole code is implemented correctly, particularly my custom implementation of the attention layer. Full code:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import random

random.seed(0)
torch.manual_seed(0)

# Sample text …
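For comparison, a minimal scaled dot-product self-attention layer (a generic sketch, not the poster's exact code):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Computes softmax(QK^T / sqrt(d)) V over a (batch, seq, dim) input."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (batch, seq, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        weights = F.softmax(scores, dim=-1)      # (batch, seq, seq)
        return weights @ v                       # (batch, seq, dim)

attn = SelfAttention(dim=16)
out = attn(torch.randn(2, 5, 16))
```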
Category: Data Science

Understanding Kneser-Ney Formula for implementation

I am trying to implement this formula in Python $$ P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\left(c_{KN}(w_{i-n+1}^{i}) - d,\, 0\right)}{c_{KN}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1}) $$ where $$ c_{KN}(\cdot) = \begin{cases} \text{count}(\cdot) & \text{for the highest order} \\ \text{continuationcount}(\cdot) & \text{otherwise.} \end{cases} $$ Following this link here, I was able to understand how to implement the first half of the equation, namely $$\frac{\max\left(c_{KN}(w_{i-n+1}^{i}) - d,\, 0\right)}{c_{KN}(w_{i-n+1}^{i-1})},$$ but the second half, specifically the $\lambda(w_{i-n+1}^{i-1})$ term, …
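For a bigram model, the $\lambda$ term is the normalized discount mass: $\lambda(\text{context}) = \frac{d}{c(\text{context})} \cdot |\{w : c(\text{context}, w) > 0\}|$. A sketch (the count dictionary is illustrative; d is the discount):

```python
def kn_lambda(context, bigram_counts, d=0.75):
    """lambda(context) = (d / total count of context) * number of distinct
    words that follow the context (standard bigram Kneser-Ney)."""
    followers = {w2 for (w1, w2) in bigram_counts if w1 == context}
    total = sum(c for (w1, _), c in bigram_counts.items() if w1 == context)
    return (d / total) * len(followers)

bigram_counts = {("the", "cat"): 2, ("the", "dog"): 1, ("a", "cat"): 1}
print(kn_lambda("the", bigram_counts))  # (0.75 / 3) * 2 = 0.5
```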
Category: Data Science

Transformer model comparison for binary sentiment classification

I am comparing XLNet and BERT on binary sentiment classification tasks over two independent datasets: a Twitter dataset, where sentences are short, and the IMDB review dataset, where sentences are long. On the Twitter dataset, BERT matches and slightly outperforms XLNet, but XLNet outperforms BERT on the IMDB dataset. I understand that XLNet captures longer dependencies due to the Transformer-XL architecture and so outperforms BERT; but what additional reasons may exist for one to outperform the other for …
Category: Data Science

What is the difference between model hyperparameters and model parameters?

I have noticed that the terms model hyperparameter and model parameter are used interchangeably on the web without prior clarification. I think this is incorrect and needs explanation. Consider a machine learning model, an SVM/NN/NB-based classifier or image recognizer, anything that first springs to mind. What are the hyperparameters and parameters of the model? Please give examples.
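The distinction is easy to see in scikit-learn: hyperparameters are the constructor arguments chosen before training, while parameters are the fitted attributes (trailing underscore) learned from the data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)

# hyperparameters: chosen before training, not learned from the data
clf = SVC(C=1.0, kernel="rbf", gamma="scale")
print(clf.get_params()["C"])        # 1.0

clf.fit(X, y)
# parameters: learned from the data during fit
print(clf.support_vectors_.shape)   # the support vectors
print(clf.dual_coef_.shape)         # their learned coefficients
```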
Category: Data Science

Sequence-to-Sequence Transformer for Neural machine translation

I am using the tutorial in the Keras documentation here. I am new to deep learning. On a different dataset, the Menyo-20k dataset (about 10,071 total pairs: 7,051 training pairs, 1,510 validation pairs, 1,510 test pairs), the highest validation and test accuracy I have obtained is approximately 0.26. I tried the following: the SGD, Adam, and RMSprop optimizers; different learning rates; dropout rates of 0.4 and 0.1; different embedding dimensions and feed-forward network …
Category: Data Science

How do we pass data to an RNN?

Let's say we have $A_1, A_2, \ldots, A_m$ different articles in the corpus, each with words $W_1, W_2, \ldots, W_w$. We are training a language model on them. Do we follow Scheme 1: take the first batch of data as the first $S$ (the number of time steps) words $(S_1, S_2, \ldots, S_s)$ from each article (for simplicity, assume batch size $= m$); set the initial hidden state $H_0 = [0,0,\ldots,0]$; calculate the loss and gradient …
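Scheme 1 as described is standard truncated backpropagation through time: the hidden state is carried across consecutive chunks of the same articles but detached so gradients stop at chunk boundaries. A PyTorch sketch (all sizes illustrative):

```python
import torch
import torch.nn as nn

m, S, vocab, dim = 4, 35, 1000, 64            # illustrative sizes
data = torch.randint(0, vocab, (m, 10 * S))   # m articles as token ids

embed = nn.Embedding(vocab, dim)
rnn = nn.LSTM(dim, dim, batch_first=True)
head = nn.Linear(dim, vocab)
opt = torch.optim.Adam(list(embed.parameters()) + list(rnn.parameters())
                       + list(head.parameters()))
loss_fn = nn.CrossEntropyLoss()

hidden = None                                  # H0 = zeros on the first chunk
for t in range(0, data.size(1) - S, S):
    x, y = data[:, t:t + S], data[:, t + 1:t + S + 1]
    out, hidden = rnn(embed(x), hidden)
    loss = loss_fn(head(out).reshape(-1, vocab), y.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    # detach so gradients do not flow across chunk boundaries,
    # but the hidden state itself carries over (Scheme 1)
    hidden = tuple(h.detach() for h in hidden)
```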
Category: Data Science

Why not rule-based semantic role labelling?

I have recently developed an interest in automatic semantic role labelling. Most introductory texts (e.g. Jurafsky and Martin, 2008) present approaches based on supervised machine learning, often using FrameNet (Baker et al., 1998) and PropBank (Kingsbury & Palmer, 2002). Intuitively, however, I would imagine that the same problem could be tackled with a grammar-based parser. Why is this not the case? Or rather, why would the supervised solutions be preferred? Thanks in advance. References: Jurafsky, D., & Martin, J. H. …
Category: Data Science
