How to have a fixed number of features for the input layer of a neural network when using TF-IDF

So basically, my question is this. Hypothetically, let's say:

I have a column containing 2000 rows of text, and when I apply TF-IDF, I get 27 features, as shown below.

Once I do that, I can set the number of neurons in my neural network's input layer to 27, as shown below, and train the model on the TF-IDF features.

Now, hypothetically speaking, if I test this model on a single short string and apply TF-IDF to it, I get 20 features. That does not equal the number of neurons in my input layer (27), which causes problems.

How can I tackle a problem like this? I have seen that TF-IDF lets you set a maximum number of features, but I thought it would be good to ask the community as well, in case there is a better way.

Is there a way to get a fixed number of features when applying TF-IDF, so that there won't be a problem when feeding them to the neural network?

Your help will be appreciated!

Topic tfidf deep-learning neural-network nlp machine-learning

Category Data Science


If you want to apply the same transformation to new data, call the transform method of the fitted tf-idf vectorizer; it always produces the same number of features as were learned during fitting. See the example below:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
text = [
    "This is a test string.",
    "This is another test string."
]
print("Shape of the transformed training sentences:", vectorizer.fit_transform(text).shape)
print("Shape of the new string to test the model on:", vectorizer.transform(["A string to test the model on."]).shape)

# Shape of the transformed training sentences: (2, 5)
# Shape of the new string to test the model on: (1, 5)

As you can see, the number of features (columns) stays the same (5) when applying the trained vectorizer on new data.
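If you additionally want the input dimension to be a fixed size you choose in advance (the question mentions capping the number of features), `TfidfVectorizer` accepts a `max_features` argument that keeps only the top terms ranked by term frequency across the corpus. A minimal sketch, where the corpus and the cap of 4 are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus; max_features caps the vocabulary size,
# so the network's input layer can be sized to exactly 4 neurons.
corpus = [
    "This is a test string.",
    "This is another test string.",
    "Yet another example sentence for the corpus.",
]

vectorizer = TfidfVectorizer(max_features=4)  # keep the 4 most frequent terms
X = vectorizer.fit_transform(corpus)
print("Training matrix shape:", X.shape)  # (3, 4)

# New text is still mapped into the same 4-dimensional space.
print("New text shape:", vectorizer.transform(["A brand new sentence."]).shape)  # (1, 4)
```

Either way, the key point is the same: fit the vectorizer once on the training data, reuse it via `transform` at test time, and the feature count never changes.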
