Is it a best practice to exclude retweets from the data set?

Question

Is it a best practice to exclude retweets from the data set?

user84037

May 30, 2022 21:06

I am going to build a machine learning algorithm to identify fake tweets. The data set contains a large number of retweets, which I think might be an issue. Given that the focus is the original tweet, do you think it is better to remove all the retweets?

Thank you.

Topic supervised-learning pandas python machine-learning

Category Data Science

Adept · Accepted Answer · August 10, 2020 09:40

To me, it depends on what you want to focus on. Do you want a model that deals with original posts that are fake news, plus an algorithm that traces a retweet back to its original before applying the model? Or do you just want a model that takes a single tweet, regardless of whether it is a retweet, and tries to guess whether it is fake?

In the first case, you should remove them: otherwise you'll have a lot of information about the people retweeting fake news, while you only want information about the original posters, and that will bias your model. In the second case, you should of course keep them, since that is exactly what your model aims to handle.
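For the first case, here is a minimal pandas sketch of separating retweets from originals. The column names `text` and `is_retweet` and the helper `split_retweets` are assumptions for illustration, not something from the question; adapt them to your data. If there is no explicit flag, the sketch falls back to the "RT @" prefix, which is only a heuristic and will miss quote tweets.

```python
import pandas as pd


def split_retweets(df, text_col="text", flag_col="is_retweet"):
    """Return (originals, retweets) as two separate DataFrames.

    Column names are hypothetical; rename them to match your data set.
    """
    if flag_col in df.columns:
        # Assumed: an explicit boolean retweet flag is available.
        is_rt = df[flag_col].astype(bool)
    else:
        # Fallback heuristic: classic retweets start with "RT @".
        is_rt = df[text_col].str.startswith("RT @", na=False)
    return df[~is_rt].copy(), df[is_rt].copy()


# Tiny made-up example, only to show the call.
tweets = pd.DataFrame(
    {
        "text": [
            "Breaking: something happened",
            "RT @someone: Breaking: something happened",
            "My own opinion on this",
        ]
    }
)
originals, retweets = split_retweets(tweets)
print(len(originals), len(retweets))
```

Keeping the retweets in their own frame instead of discarding them means you can still use them later if you change your mind about the focus.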

Uday T · Accepted Answer · November 13, 2019 05:45

There is a chance that a retweet has an entirely different context from the original tweet. It is also possible that some retweets that add a different opinion or comment gain more popularity than the original.

In these cases I don't think you can classify them as fake tweets.

You can classify tweets as fake when they are widely retweeted but with no context. One such example is retweets driven by a giveaway or charity.

If you can figure out how to separate the spam retweets from the original tweets, it would help you get a better analysis and more accurate results.

Michael Hearn · Accepted Answer · November 13, 2019 02:32

No, I do not believe so, and I can give a few reasons why.

If an entity wants to make waves on Twitter with false tweets, retweets are probably a part of the plan.
If you want to detect tweets generated by bots, statistical data on those tweets and retweets, such as timestamps, could be relevant to detecting whether a tweet was generated by a bot.
If you have a way of checking for retweets by bots, then removing all retweets would also remove that data.
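As a rough illustration of the timestamp point above, here is a sketch that summarizes retweet timing per original tweet with pandas. The column names `original_id` and `created_at` and the function `retweet_timing_features` are hypothetical, and the two features are just examples of what could be computed, not a tested bot detector.

```python
import pandas as pd


def retweet_timing_features(retweets, id_col="original_id", time_col="created_at"):
    """Per original tweet: number of retweets and median gap between them.

    Column names are assumptions; rename them to match your data set.
    """
    rt = retweets.copy()
    rt[time_col] = pd.to_datetime(rt[time_col])
    rt = rt.sort_values([id_col, time_col])
    # Seconds between consecutive retweets of the same original tweet.
    rt["gap_s"] = rt.groupby(id_col)[time_col].diff().dt.total_seconds()
    return rt.groupby(id_col).agg(
        n_retweets=(time_col, "count"),
        median_gap_s=("gap_s", "median"),
    )


# Made-up data: tweet "a" is retweeted every 2 seconds, "b" an hour apart.
retweet_log = pd.DataFrame(
    {
        "original_id": ["a", "a", "a", "b", "b"],
        "created_at": [
            "2019-11-13 00:00:00",
            "2019-11-13 00:00:02",
            "2019-11-13 00:00:04",
            "2019-11-13 00:00:00",
            "2019-11-13 01:00:00",
        ],
    }
)
features = retweet_timing_features(retweet_log)
print(features)
```

Very short, regular gaps might hint at automation, but whether that signal is useful depends on your data and labels.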

You should remove retweets if:

The project is focused on analyzing text to determine whether a tweet was written by a bot or not.
There is no labeled human or bot retweet data.
