Natural language gender classification with a very small training set
The task involves determining the gender of the author of a Reddit post. Given a post and its title, I need a model that outputs a probability vector $[p_{male},p_{female}]$.
The difficulty is that the training set is very small: we have only 5000 labeled posts. In addition, the average sentence length exceeds 90, which makes it hard to extract features.
Because the dataset is so small, we are currently using non-deep-learning methods for this task: tf-idf to extract features and a regression model to generate the output.
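For concreteness, the current setup could look roughly like the sketch below. It assumes scikit-learn, that "regression" means logistic regression, and that the title and body are concatenated into one string; the toy posts, the label encoding (0 = male, 1 = female) and the hyperparameters are placeholders, not the real configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_tfidf_baseline():
    """TF-IDF features followed by logistic regression (assumed baseline)."""
    return make_pipeline(
        # Unigrams and bigrams; sublinear_tf dampens very frequent terms.
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        LogisticRegression(C=1.0, max_iter=1000),
    )


# Hypothetical toy data: each string is "title + body" of one post.
texts = [
    "first example title. first example body text",
    "second example title. second example body text",
    "third example title. another body with other words",
    "fourth example title. yet another body of a post",
]
labels = [0, 1, 0, 1]  # assumed encoding: 0 = male, 1 = female

model = build_tfidf_baseline()
model.fit(texts, labels)

# Columns follow model.classes_, here [0, 1], i.e. [p_male, p_female].
probs = model.predict_proba(["title and body of an unseen post"])
```

With only 5000 examples, the regularization strength `C` and the n-gram range are worth tuning with cross-validation before concluding that the baseline has hit its ceiling.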
However, the performance is not good, and I wonder whether we can improve it with NN-based feature extraction, for example by using a pretrained encoder to extract features and training only the regression model on top of them?
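A minimal sketch of that idea: keep the encoder frozen, turn each post into a fixed-size vector, and fit only a logistic regression on those vectors. The `sentence-transformers` package and the `all-MiniLM-L6-v2` checkpoint are assumptions chosen for illustration, not something the task prescribes. `toy_embed` is a hypothetical offline stand-in (not a pretrained model) so the snippet runs without downloading weights; the data and label encoding are placeholders as well.

```python
import zlib
from typing import Callable, List, Sequence

import numpy as np
from sklearn.linear_model import LogisticRegression

Embedder = Callable[[List[str]], np.ndarray]


def sentence_transformer_embed(texts: List[str]) -> np.ndarray:
    """Frozen pretrained encoder (assumes sentence-transformers is installed)."""
    # Imported lazily; the weights are downloaded on first use.
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    # Inputs longer than the model's maximum sequence length are truncated.
    return encoder.encode(texts)


def toy_embed(texts: List[str], dim: int = 32) -> np.ndarray:
    """Hypothetical stand-in: hashed bag of words, only for offline runs."""
    out = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for token in text.lower().split():
            out[i, zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return out


def fit_on_frozen_features(
    embed: Embedder, texts: Sequence[str], labels: Sequence[int]
) -> LogisticRegression:
    """Embed once with the fixed encoder, then train only the classifier."""
    features = np.asarray(embed(list(texts)))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, labels)
    return clf


# Hypothetical toy data; assumed encoding: 0 = male, 1 = female.
texts = [
    "first example title. first example body text",
    "second example title. second example body text",
    "third example title. another body with other words",
    "fourth example title. yet another body of a post",
]
labels = [0, 1, 0, 1]

# Swap toy_embed for sentence_transformer_embed to use the real encoder.
clf = fit_on_frozen_features(toy_embed, texts, labels)
probs = clf.predict_proba(toy_embed(["title and body of an unseen post"]))
```

Because the encoder is never updated, the only trained parameters are those of the logistic regression, which keeps the risk of overfitting on 5000 posts comparable to the tf-idf setup.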
Category Data Science