How to use text as input for a neural network in a regression problem? Predicting how many likes/claps an article will get

I am trying to predict the number of likes an article or a post will get using a NN.

I have a dataframe with ~70,000 rows and 2 columns: text (predictor - strings of text) and likes (target - continuous int variable). I've been reading about the approaches taken in NLP problems, but I feel somewhat lost as to what the input for the NN should look like.

Here is what I did so far (a rough sketch of the pipeline follows the list):

  1. Text cleaning: removing HTML tags, stop words, punctuation, etc.
  2. Lower-casing the text column
  3. Tokenization
  4. Lemmatization
  5. Stemming
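
Roughly, the pipeline looks like this (a minimal sketch assuming NLTK; the exact regexes and column names are just my setup):

```python
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Assumes nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)                  # 1. strip HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)              # 1. strip punctuation/digits
    tokens = word_tokenize(text.lower())                  # 2. lower-case, 3. tokenize
    tokens = [t for t in tokens if t not in stop_words]   # 1. remove stop words
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # 4. lemmatize
    return [stemmer.stem(t) for t in tokens]              # 5. stem

# df is the ~70k-row dataframe with the "text" and "likes" columns
df["clean_text"] = df["text"].apply(clean)
```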

I've assigned the results to a new column, so now I have a clean_text column with all of the above applied to it. However, I'm not sure how to proceed.

In most NLP problems, I have noticed that people use word embeddings, but from what I have understood, it's a method used when attempting to predict the next word in a text. Learning word embeddings creates vectors for words that are syntactically similar to each other, and I fail to see how that can be used to derive the weight/impact of each word on the target variable in my case.

In addition, when I tried to generate a word embedding model using the Gensim library, it resulted in more than 50k words, which I think will make it too difficult or even impossible to one-hot encode. Even then, I would have to one-hot encode each row and then pad all the rows to a similar length to feed the NN model. But the length of each row in the new clean_text column varies significantly, so this would result in very large one-hot encoded matrices that are mostly redundant.

Am I approaching this completely wrong? And what should I do?

Tags: deep-learning, neural-network, nlp, machine-learning



A key issue in NLP is encoding text into a numerical representation, and embeddings are used for exactly this purpose.

word embeddings [...] is a method used when attempting to predict the next word in a text

Not really. An embedding is a transformation of sparse data into a dense space, which in NLP is often used to take into account similarity among words. Predicting the next word (from the previous words and other information) is the job of a language model.

how can [embeddings] be used to derive the weight/impact of each word on the target variable in my case

Think about an alternative way to represent text numerically: one-hot encoding, where each word is a long vector (the size of your dictionary) made of zeros everywhere except at the index representing that word. The embedding representation, being dense, is much more informative. Consider the case of "similar" words (or, more specifically, words used in similar contexts): an embedding will represent "dog" and "wolf" as similar vectors, whereas in one-hot encoding they are as independent of each other as "dog" and "independence" (just dummy examples).
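
To make that concrete, here is a toy numeric illustration (the dense vectors below are invented for the example, not real embeddings): one-hot vectors are pairwise orthogonal, so every pair of words looks equally unrelated, while dense vectors can encode graded similarity.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot: "dog", "wolf", "independence" in a 3-word toy dictionary
dog_oh, wolf_oh, indep_oh = np.eye(3)
print(cosine(dog_oh, wolf_oh))   # 0.0
print(cosine(dog_oh, indep_oh))  # 0.0 -- every pair is equally dissimilar

# Dense toy embeddings (made-up numbers): similar words get similar vectors
dog = np.array([0.9, 0.8, 0.1])
wolf = np.array([0.8, 0.9, 0.2])
independence = np.array([-0.1, 0.2, 0.9])
print(cosine(dog, wolf))          # ~0.99 -- high similarity
print(cosine(dog, independence))  # ~0.14 -- low similarity
```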


from what I have understood, it's a method used when attempting to predict the next word in a text.

Not really. Word2vec is a technique to represent the (relative) meaning of words in a way that can be fed into an ML model. You can use word embeddings in a language model, i.e. for predicting the next word in a sequence, but that's just one of their possible uses. You can train a model with word embeddings for whatever other task, and word2vec is suitable in your case.
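
As a sketch of that last point, assuming gensim 4.x and scikit-learn (here `docs` stands for your tokenized clean_text column and `likes` for your target array; averaging word vectors is just one simple way to get a fixed-length vector per article):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.neural_network import MLPRegressor

# docs: list of token lists; likes: array of like counts
model = Word2Vec(sentences=docs, vector_size=100, window=5, min_count=2, workers=4)

def doc_vector(tokens):
    # Average the embeddings of the words the model knows; zero vector if none
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(d) for d in docs])
reg = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500).fit(X, likes)
```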


Learning word embeddings creates vectors for words that are similar to each other syntax-wise, and I fail to see how that can be used to derive the weight/impact of each word on the target variable in my case.

I don't know your problem well enough, but I'd say you need additional information. For example: information about the account, such as the number of followers, likes, shares/retweets, you name it.
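
If you do have such metadata, you can simply concatenate it with the text representation before feeding the network (a sketch; `followers` and `n_posts` are hypothetical per-article arrays, and `X_text` is the embedding matrix from the sketch above):

```python
import numpy as np

# Stack numeric metadata columns next to the per-article text vectors
X = np.hstack([X_text, followers.reshape(-1, 1), n_posts.reshape(-1, 1)])
```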


In addition, when I tried to generate a word embedding model using the Gensim library, it resulted in more than 50k words, which I think will make it too difficult or even impossible to one-hot encode.

Gensim requires a list of words/tokens (in the form of strings), and the word2vec model will take care of everything for you. You don't have to manually one-hot encode any words. Don't worry about that.
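
In other words, gensim's input is just the tokenized text, with no encoding on your side (a minimal sketch):

```python
from gensim.models import Word2Vec

# Each document is simply a list of token strings -- no one-hot encoding needed
sentences = [["dog", "barks", "loudly"], ["wolf", "howls", "at", "night"]]
model = Word2Vec(sentences=sentences, vector_size=50, min_count=1)

print(model.wv["dog"].shape)         # (50,) dense vector, built for you
print(model.wv.most_similar("dog"))  # similarity queries also come for free
```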


I would have to one-hot encode each row and then pad all the rows to a similar length to feed the NN model, but the length of each row in the new clean_text column varies significantly, so this would result in very large one-hot encoded matrices that are mostly redundant.

I don't really know what you did here, but I'm fairly sure it's not correct. You don't have to manually one-hot encode anything. More importantly, one-hot encoding entire rows doesn't make sense: why would you do that?
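
The idiomatic way to handle variable-length rows in a NN is to map words to integer indices, pad the integer sequences, and let an Embedding layer look up dense vectors, so no one-hot matrices ever get built. A sketch, assuming TensorFlow 2.x/Keras (`docs` and `likes` are the tokenized column and target from before; vocabulary cap and sequence length are arbitrary choices):

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, models

texts = [" ".join(t) for t in docs]  # rejoin token lists into strings
tok = Tokenizer(num_words=20000)     # cap the vocabulary at 20k words
tok.fit_on_texts(texts)
# Integer sequences padded to a common length -- not one-hot vectors
seqs = pad_sequences(tok.texts_to_sequences(texts), maxlen=300)

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=100),  # dense lookup per word
    layers.GlobalAveragePooling1D(),                    # fixed-size doc vector
    layers.Dense(32, activation="relu"),
    layers.Dense(1),                                    # regression: number of likes
])
model.compile(optimizer="adam", loss="mse")
model.fit(seqs, np.asarray(likes), epochs=5, validation_split=0.1)
```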
