T-test against normalised or standardised data gives different results
I am studying the problem of predicting the popularity of a tweet, and want to test the null hypothesis that there is no relationship between favorite_count and another set of variables, such as the number of friends a user has.
I am not sure whether to normalise or standardise the variables, because I am still deciding how to model popularity and I don't know how likes and friends are distributed across users (please advise).
So I tried both, and ran an independent t-test on each.
I get very different results:
import pandas as pd
from scipy.stats import ttest_ind
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# standardise: zero mean, unit variance per column
do_scaled = pd.DataFrame(StandardScaler().fit_transform(do[columns].values), columns=columns)
ttest_ind(do_scaled.favorite_count, do_scaled.user_favourites_count)
# Ttest_indResult(statistic=-1.682257624164912e-16, pvalue=0.9999999999999999)
# p-value is about 1: any association is likely due to pure chance
Here is a boxplot showing the distribution and outliers (StandardScaler):
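To see what the standardised result means, here is a minimal check with made-up numbers (not my tweet data): StandardScaler forces every column to mean 0 and standard deviation 1, so a two-sample t-test on two standardised columns always compares a mean of 0 against a mean of 0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical two-column data standing in for favorite_count and friends
X = np.array([[1.0, 100.0],
              [2.0, 250.0],
              [3.0, 400.0],
              [10.0, 900.0]])

Z = StandardScaler().fit_transform(X)
print(Z.mean(axis=0))  # both columns ~0 (up to floating-point error)
print(Z.std(axis=0))   # both columns exactly 1
```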
# normalise: rescale each column to the range [0, 1]
do_scaled = pd.DataFrame(MinMaxScaler().fit_transform(do[columns].values), columns=columns)
ttest_ind(do_scaled.favorite_count, do_scaled.user_favourites_count)
# Ttest_indResult(statistic=-5.999028611045575, pvalue=2.3988962933916377e-09)
# p-value is almost 0 (less than 1%): there is an association between predictor and response
Here is a boxplot showing the distribution and outliers (MinMaxScaler):
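The two scalers behave differently in exactly the way the t-test measures. A sketch with synthetic data (assumed distributions, not my real tweet columns) reproduces the contradiction: standardising equalises the two column means by construction, while min-max scaling only rescales to [0, 1] and leaves any mean difference in place.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.preprocessing import StandardScaler, MinMaxScaler

rng = np.random.default_rng(0)
# synthetic stand-ins: a skewed count-like column and a roughly symmetric one
X = np.column_stack([
    rng.exponential(scale=50, size=1000),   # stand-in for favorite_count
    rng.normal(500, 100, size=1000),        # stand-in for a friends-like count
])

std = StandardScaler().fit_transform(X)  # both means forced to 0
mm = MinMaxScaler().fit_transform(X)     # means still differ after rescaling

print(ttest_ind(std[:, 0], std[:, 1]))  # statistic ~0, p-value ~1
print(ttest_ind(mm[:, 0], mm[:, 1]))    # p-value can be very small
```

So the near-1 p-value after StandardScaler is an artefact of the scaling, not evidence about the data; neither version of this t-test actually tests for an association between the two variables (that would call for correlation or regression).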
I don't understand why I get opposite results and don't know how to interpret them. Can you please advise? How should I approach this problem?
Topic hypothesis-testing pvalue normalization twitter statistics
Category Data Science