How to remove irrelevant text data from a large dataset

I am working on an ML project with data scraped from social media; the topic of the dataset is supposed to be depression under Covid-19. However, when I read some of the retrieved data, I noticed that although a small fraction of the texts (around 1-5%) mention covid-related keywords, their context is not actually about the pandemic: they tell a life story (from age 5 to 27) instead of describing how covid affects the writer's life.
The data I want are texts describing how covid makes depression worse, and so on.
Is there a general way to clean out this irrelevant data, whose context is not covid-related (i.e. outliers)?
Or is it okay to keep those texts in the dataset, since they only account for 1-5% of it?

Topic text nlp data-cleaning machine-learning

Category Data Science


You can use BERT to create embedding vectors that capture the context of each whole tweet. Once you have those vectors, try clustering them (e.g. K-Means or a Gaussian mixture model). You can then inspect the clusters and separate out the unwanted data.
