I have four features, x1, x2, x3 and x4. Their correlations with y are similar to one another under Pearson, and likewise under Spearman. However, they are all roughly +0.15 under Pearson and roughly -0.6 under Spearman. Does this make sense? How can I interpret this result? All four features and the target are definitely related, and from a common-sense perspective the sign of the Spearman coefficient is the more plausible one.
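A minimal sketch of one way such a sign disagreement can arise, using synthetic data that is purely an assumption (it is not the asker's data): a mostly decreasing relationship plus a single extreme point. Pearson works on the raw values and can be dominated by that point, while Spearman only sees its rank.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic, hypothetical data: 19 points on a strictly decreasing line...
x = np.arange(1, 20, dtype=float)
y = 20.0 - x

# ...plus one extreme point that is large in both x and y.
x = np.append(x, 100.0)
y = np.append(y, 100.0)

pearson_r, _ = pearsonr(x, y)      # computed on raw values
spearman_rho, _ = spearmanr(x, y)  # computed on ranks

print(f"Pearson r    = {pearson_r:+.3f}")
print(f"Spearman rho = {spearman_rho:+.3f}")
```

On this toy data the two coefficients have opposite signs. Whether the same mechanism explains the +0.15 versus -0.6 in the question would need a look at scatter plots of the actual features.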
When interpreting the correlation between ranks, should I use the rho value (for Spearman's method), the tau value (for Kendall's tau), the W value (for Kendall's W), or should I take the p-value into consideration? And does having NaN values in the ranks affect the interpretation of the correlation?
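A small sketch of how rho, tau and their p-values can be obtained side by side with SciPy, and of what NaN does by default. The two arrays are made-up illustration values, and dropping incomplete pairs is only one possible way to handle missing data. Kendall's W is not covered here.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

# Hypothetical paired observations with missing values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, np.nan, 7.0])
y = np.array([2.0, 1.0, 4.0, 3.0, np.nan, 6.0, 7.0])

# With SciPy's default NaN handling, a NaN in the input propagates to the result.
rho_raw, _ = spearmanr(x, y)

# Keep only the pairs where both values are present.
mask = ~np.isnan(x) & ~np.isnan(y)
x_clean, y_clean = x[mask], y[mask]

rho, rho_p = spearmanr(x_clean, y_clean)   # effect size and its p-value
tau, tau_p = kendalltau(x_clean, y_clean)  # effect size and its p-value

print(f"raw rho = {rho_raw}")
print(f"rho = {rho:.3f} (p = {rho_p:.3f}), tau = {tau:.3f} (p = {tau_p:.3f})")
```

The coefficient describes the strength and direction of the monotonic association, while the p-value only addresses whether it is distinguishable from zero at the given sample size, so the two are usually reported together.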
I am new to data science and I have a Python data frame with Number of days, CountofJobs, and AmountEarned. What statistical method should I use to find a correlation between Days and AmountEarned?
NumberofDays  CountofJobs  AmountEarned
20            3            50000
22            18           10000
35            10           80000
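As a rough sketch, the three rows shown in the question can be fed to both a Pearson and a Spearman correlation with SciPy. Only the rows visible in the question are used, so the p-values are not meaningful at this sample size.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# The three rows shown in the question.
df = pd.DataFrame(
    {
        "NumberofDays": [20, 22, 35],
        "CountofJobs": [3, 18, 10],
        "AmountEarned": [50000, 10000, 80000],
    }
)

# Pearson measures linear association; Spearman measures monotonic association.
r, r_p = pearsonr(df["NumberofDays"], df["AmountEarned"])
rho, rho_p = spearmanr(df["NumberofDays"], df["AmountEarned"])

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Which coefficient is appropriate depends on whether a linear or merely monotonic relationship is expected, which is easier to judge from a scatter plot of the full data.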
Given the following dataframe
   age  job       salary
0  1    Doctor    100
1  2    Engineer  200
2  3    Lawyer    300
...
with age as numeric and job as categorical, I want to test the correlation with salary, for the purpose of selecting the features (age and/or job) for predicting the salary (a regression problem). Can I use the following APIs from sklearn (or another API), sklearn.feature_selection.f_regression or sklearn.feature_selection.mutual_info_regression, to test it? If yes, what's the right method and syntax to test the correlation? Following …
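A hedged sketch of how both scikit-learn functions might be called on data shaped like this. The data frame below is synthetic (the row count, salary formula and noise level are assumptions made for illustration), and one-hot encoding `job` with `pandas.get_dummies` is just one possible encoding choice.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression, mutual_info_regression

# Synthetic, hypothetical data with the same columns as the question.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame(
    {
        "age": rng.integers(20, 65, size=n),
        "job": rng.choice(["Doctor", "Engineer", "Lawyer"], size=n),
    }
)
job_effect = df["job"].map({"Doctor": 300.0, "Engineer": 200.0, "Lawyer": 250.0})
df["salary"] = 2.0 * df["age"] + job_effect + rng.normal(0, 10, size=n)

# One-hot encode the categorical column so both scorers receive numeric input.
X = pd.get_dummies(df[["age", "job"]], columns=["job"], dtype=float)
y = df["salary"]

# Univariate linear F-test: one F statistic and p-value per column.
f_stat, p_val = f_regression(X, y)

# Mutual information: mark the dummy columns as discrete features.
discrete = np.array([col.startswith("job_") for col in X.columns])
mi = mutual_info_regression(X, y, discrete_features=discrete, random_state=0)

print(pd.DataFrame({"F": f_stat, "p": p_val, "MI": mi}, index=X.columns))
```

Note that this scores each dummy column separately, so the result for `job` as a whole has to be read across its columns.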
Problem: I have a convolutional neural network model that takes in a video and outputs a continuous variable. I want to assess whether the performance of the model is associated with another continuous variable (age; not included in the model). Solution attempt: If this were a linear regression model, I think I could do a Spearman rank correlation test: basically, plot the absolute value of the residuals (true value - predicted value) against the nuisance variable (age), then calculate the Spearman …
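A minimal sketch of the residual-based check described above, on synthetic data. The arrays `age`, `y_true` and `y_pred` are hypothetical stand-ins for the held-out covariate, the ground truth and the model's predictions; nothing here depends on the model being a CNN or a linear regression.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical stand-ins for real evaluation data.
rng = np.random.default_rng(0)
n = 300
age = rng.uniform(20, 80, size=n)
y_true = rng.normal(50, 10, size=n)
# Assumed for illustration: the prediction error grows with age.
y_pred = y_true + rng.normal(0, 0.05 * age)

# Absolute residuals as a per-sample measure of model error.
abs_resid = np.abs(y_true - y_pred)

# Rank correlation between error magnitude and the nuisance variable.
rho, p_value = spearmanr(age, abs_resid)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.4f}")
```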
I have a dataset of process data for different equipment with many sensors. I would like to check the correlation of the different sensors to see if there is any strong correlation between some sensors and potentially reduce the size of my dataset. Within this process data there are many different processes of varying lengths and different equipment. For now I am asserting that the different equipment shouldn't make a difference and therefore I do not want to include this …
What are the characteristics of the three correlation coefficients, how do they compare with one another, and what assumptions does each make? Can somebody kindly take me through the concepts?
I am looking for a metric for comparing gene count tables. These are long columns of data (a few million genes by a few dozen samples), with all non-negative entries, about 90% of which are zeros. The goal is to compare the performance of several tools/algorithms that these tables originate from, by comparing the resulting tables among themselves or with the expected counts (in the case of simulated data). In principle, one compares on a sample-by-sample basis, but comparing different …
I have a data set with categorical and continuous/ordinal explanatory variables and a continuous target variable. I tried to filter features using one-way ANOVA for the categorical variables and Spearman's correlation coefficient for the continuous/ordinal variables, filtering on the p-value. I then also used mutual information regression to select features. The results from the two techniques do not match. Can someone please explain the discrepancy and which technique should be used when?
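One possible source of disagreement, sketched on synthetic data (an assumption, not the asker's data set): Spearman only detects monotonic relationships, whereas mutual information can also pick up non-monotonic dependence such as a U-shape.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_selection import mutual_info_regression

# Synthetic feature on a symmetric grid, with a purely U-shaped target.
x = np.arange(-100, 101) / 100.0
y = x ** 2

# Rank correlation: the decreasing and increasing halves cancel out.
rho, p_value = spearmanr(x, y)

# Mutual information: y is fully determined by x, so the estimate is large.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f}), MI = {mi:.3f}")
```

A feature like this would be discarded by a Spearman p-value filter yet ranked highly by mutual information, so the two filters need not agree.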
I have the following dataset. When I calculate the Spearman correlation coefficient with scipy.stats.spearmanr, it returns 0.718182.

    import pandas as pd
    import numpy as np
    from scipy.stats import spearmanr

    df = pd.DataFrame(
        [ [7,3], [6,5], [5,4], [3,2], [6,4], [8,9], [9,7] ],
        columns=['Set of A','Set of B'])
    correlation, pval = spearmanr(df)
    print(f'correlation={correlation:.6f}, p-value={pval:.6f}')

It returns this:

    correlation=0.718182, p-value=0.069096

However, when I tried to calculate it manually:

    df_rank = pd.DataFrame(
        [ [5,2], [3.5,4], [2,4], [1,1], [3.5,4], [6,7], [7,6] ],
        columns=['Rank of A','Rank …
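One way to cross-check a manual ranking, sketched with the data from the question: let `scipy.stats.rankdata` assign the ranks (its default gives tied values the average of their ranks) and take the Pearson correlation of those ranks, which is how the Spearman coefficient is defined when ties are present.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# The two columns from the question.
a = np.array([7, 6, 5, 3, 6, 8, 9])
b = np.array([3, 5, 4, 2, 4, 9, 7])

# Average ranks: tied values share the mean of the positions they occupy.
rank_a = rankdata(a)
rank_b = rankdata(b)

# Spearman's rho is the Pearson correlation of the ranks.
rho_from_ranks, _ = pearsonr(rank_a, rank_b)
rho_scipy, _ = spearmanr(a, b)

print("rank_a:", rank_a)
print("rank_b:", rank_b)
print(f"rho from ranks = {rho_from_ranks:.6f}, spearmanr = {rho_scipy:.6f}")
```

Comparing `rank_a` and `rank_b` with the hand-assigned ranks shows where, if anywhere, the two rankings differ.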