How to use text as input for a neural network in a regression problem? Predicting how many likes/claps an article will get

I am trying to predict the number of likes an article or a post will get using a NN.

I have a dataframe with ~70,000 rows and 2 columns: text (predictor - strings of text) and likes (target - continuous int variable). I've been reading about the approaches taken in NLP problems, but I feel somewhat lost as to what the input for the NN should look like.

Here is what I did so far (a rough sketch of the pipeline follows the list):

  1. Text cleaning: removing HTML tags, stop words, punctuation, etc.
  2. Lower-casing the text column
  3. Tokenization
  4. Lemmatization
  5. Stemming
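
Roughly, the pipeline looks like this (a minimal sketch assuming NLTK; the exact regexes and column names are just my setup):

```python
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Assumes nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)                  # 1. strip HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)              # 1. strip punctuation/digits
    tokens = word_tokenize(text.lower())                  # 2. lower-case, 3. tokenize
    tokens = [t for t in tokens if t not in stop_words]   # 1. remove stop words
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # 4. lemmatize
    return [stemmer.stem(t) for t in tokens]              # 5. stem

# df is the ~70k-row dataframe with the "text" and "likes" columns
df["clean_text"] = df["text"].apply(clean)
```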

I've assigned the results to a new column, so now I have a clean_text column with all of the above applied to it. However, I'm not sure how to proceed.

In most NLP problems, I have noticed that people use word embeddings, but from what I have understood, it's a method used when attempting to predict the next word in a text. Learning word embeddings creates vectors for words that are syntactically similar to each other, and I fail to see how that can be used to derive the weight/impact of each word on the target variable in my case.

In addition, when I tried to generate a word embedding model using the Gensim library, it resulted in more than 50k words, which I think will make it too difficult or even impossible to one-hot encode. Even then, I would have to one-hot encode each row and then pad all the rows to a similar length to feed the NN model. But the length of each row in the new clean_text column varies significantly, so this would result in very large one-hot encoded matrices that are mostly redundant.

Am I approaching this completely wrong? And what should I do?

Tags: deep-learning, neural-network, nlp, machine-learning



A key issue in NLP is encoding text into a numerical representation, and embeddings are used for exactly this purpose.

word embeddings [...] is a method used when attempting to predict the next word in a text

Not really. An embedding is a transformation of sparse data into a dense space, which in NLP is often used to take into account similarity among words. Predicting the next word (from the previous words and other information) is the job of a language model.

how can [embeddings] be used to derive the weight/impact of each word on the target variable in my case

Think about an alternative way to represent text numerically: one-hot encoding, where each word is a long vector (the size of your dictionary) made of zeros everywhere except at the index representing that word. The embedding representation, being dense, is much more informative. Consider the case of "similar" words (or, more specifically, words used in similar contexts): an embedding will represent "dog" and "wolf" as similar vectors, whereas in one-hot encoding they are as independent of each other as "dog" and "independence" (just dummy examples).
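
To make that concrete, here is a toy numeric illustration (the dense vectors below are invented for the example, not real embeddings): one-hot vectors are pairwise orthogonal, so every pair of words looks equally unrelated, while dense vectors can encode graded similarity.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot: "dog", "wolf", "independence" in a 3-word toy dictionary
dog_oh, wolf_oh, indep_oh = np.eye(3)
print(cosine(dog_oh, wolf_oh))   # 0.0
print(cosine(dog_oh, indep_oh))  # 0.0 -- every pair is equally dissimilar

# Dense toy embeddings (made-up numbers): similar words get similar vectors
dog = np.array([0.9, 0.8, 0.1])
wolf = np.array([0.8, 0.9, 0.2])
independence = np.array([-0.1, 0.2, 0.9])
print(cosine(dog, wolf))          # ~0.99 -- high similarity
print(cosine(dog, independence))  # ~0.14 -- low similarity
```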


from what I have understood, it's a method used when attempting to predict the next word in a text.

Not really. Word2vec is a technique to represent the (relative) meaning of words in a way that can be fed into an ML model. You can use word embeddings in a language model, i.e. for predicting the next word in a sequence, but that's just one of their possible uses. You can train a model with word embeddings for whatever other task, and word2vec is suitable in your case.
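
As a sketch of that last point, assuming gensim 4.x and scikit-learn (here `docs` stands for your tokenized clean_text column and `likes` for your target array; averaging word vectors is just one simple way to get a fixed-length vector per article):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.neural_network import MLPRegressor

# docs: list of token lists; likes: array of like counts
model = Word2Vec(sentences=docs, vector_size=100, window=5, min_count=2, workers=4)

def doc_vector(tokens):
    # Average the embeddings of the words the model knows; zero vector if none
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(d) for d in docs])
reg = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500).fit(X, likes)
```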


Learning word embeddings creates vectors for words that are similar to each other syntax-wise, and I fail to see how that can be used to derive the weight/impact of each word on the target variable in my case.

I don't know your problem well enough, but I'd say you need additional information. For example: information about the account, such as the number of followers, likes, shares/retweets, you name it.
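
If you do have such metadata, you can simply concatenate it with the text representation before feeding the network (a sketch; `followers` and `n_posts` are hypothetical per-article arrays, and `X_text` is the embedding matrix from the sketch above):

```python
import numpy as np

# Stack numeric metadata columns next to the per-article text vectors
X = np.hstack([X_text, followers.reshape(-1, 1), n_posts.reshape(-1, 1)])
```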


In addition, when I tried to generate a word embedding model using the Gensim library, it resulted in more than 50k words, which I think will make it too difficult or even impossible to one-hot encode.

Gensim requires a list of words/tokens (in the form of strings), and the word2vec model will take care of everything for you. You don't have to manually one-hot encode any words. Don't worry about that.
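
In other words, gensim's input is just the tokenized text, with no encoding on your side (a minimal sketch):

```python
from gensim.models import Word2Vec

# Each document is simply a list of token strings -- no one-hot encoding needed
sentences = [["dog", "barks", "loudly"], ["wolf", "howls", "at", "night"]]
model = Word2Vec(sentences=sentences, vector_size=50, min_count=1)

print(model.wv["dog"].shape)         # (50,) dense vector, built for you
print(model.wv.most_similar("dog"))  # similarity queries also come for free
```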


I would have to one-hot encode each row and then pad all the rows to a similar length to feed the NN model, but the length of each row in the new clean_text column varies significantly, so this would result in very large one-hot encoded matrices that are mostly redundant.

I don't really know what you did here, but I'm fairly sure it's not correct. You don't have to manually one-hot encode anything. More importantly, one-hot encoding entire rows doesn't make sense: why would you do that?
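
The idiomatic way to handle variable-length rows in a NN is to map words to integer indices, pad the integer sequences, and let an Embedding layer look up dense vectors, so no one-hot matrices ever get built. A sketch, assuming TensorFlow 2.x/Keras (`docs` and `likes` are the tokenized column and target from before; vocabulary cap and sequence length are arbitrary choices):

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, models

texts = [" ".join(t) for t in docs]  # rejoin token lists into strings
tok = Tokenizer(num_words=20000)     # cap the vocabulary at 20k words
tok.fit_on_texts(texts)
# Integer sequences padded to a common length -- not one-hot vectors
seqs = pad_sequences(tok.texts_to_sequences(texts), maxlen=300)

model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=100),  # dense lookup per word
    layers.GlobalAveragePooling1D(),                    # fixed-size doc vector
    layers.Dense(32, activation="relu"),
    layers.Dense(1),                                    # regression: number of likes
])
model.compile(optimizer="adam", loss="mse")
model.fit(seqs, np.asarray(likes), epochs=5, validation_split=0.1)
```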
