How to remove irrelevant text data from a large dataset

I am working on an ML project with data scraped from social media; the topic of the dataset is supposed to be depression under Covid-19. However, when I read some of the retrieved data, I noticed that although a small fraction of the texts (around 1-5%) mention covid-related keywords, their context is not actually about the pandemic: they tell a life story (from age 5 to 27) instead of describing how covid affects the writer's life.
The data I want are texts describing how covid makes depression worse, and so on.
Is there a general way to clean out this irrelevant data, whose context is not covid-related (i.e. outliers)?
Or is it okay to keep those texts in the dataset, since they only account for 1-5% of it?

Topic text nlp data-cleaning machine-learning

Category Data Science


You can use BERT to create embedding vectors that capture the context of each whole tweet. Once you have those vectors, try clustering them (e.g. K-Means or a Gaussian mixture model). You can then inspect the clusters and separate out the unwanted data.
