I am calculating the volatility (standard deviation) of returns of a portfolio of assets using the variance-covariance approach. Correlation coefficients and asset volatilities have been estimated from historical returns. Now what I'd like to do is compute the average correlation coefficient, that is, the single correlation coefficient common to all asset pairs that gives me the same overall portfolio volatility. I could of course take an iterative approach, but was wondering if there was something simpler / out of the box …
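For concreteness, here is a minimal sketch of what I mean (the weights, volatilities, and correlation matrix below are made up), including the algebraic rearrangement I suspect avoids iteration, since portfolio variance is linear in the off-diagonal correlations:

    import numpy as np

    # Hypothetical example inputs: weights, asset volatilities, correlation matrix
    w = np.array([0.4, 0.35, 0.25])          # portfolio weights
    vol = np.array([0.20, 0.15, 0.10])       # asset volatilities
    corr = np.array([[1.0, 0.3, 0.1],
                     [0.3, 1.0, 0.5],
                     [0.1, 0.5, 1.0]])

    # Portfolio variance from the full variance-covariance matrix
    cov = np.outer(vol, vol) * corr
    port_var = w @ cov @ w

    # Implied "average" correlation: solve
    #   port_var = sum_i w_i^2 s_i^2 + rho_avg * sum_{i != j} w_i w_j s_i s_j
    # for rho_avg, so no iteration is needed.
    own_var = np.sum(w**2 * vol**2)
    cross = np.sum(np.outer(w * vol, w * vol)) - own_var   # sum over i != j
    rho_avg = (port_var - own_var) / cross
    print(rho_avg)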
I am working on a regression problem, trying to predict a target variable with seven predictor variables. I have a tabular dataset of 1400 rows. Before delving into the machine learning to build a predictor, I did an EDA (exploratory data analysis) and obtained the correlation coefficients (Pearson r) below for my data. Note that I have included both the numerical predictor variables and the target variable. I am wondering about the following questions: We see that pv3 is …
At my office, I am stuck in a weird situation. I am asked to perform a regression algorithm on the data, in which the target variable is continuous, with values ranging between 0.6 and 0.9 at 8 digits of precision after the decimal. Although I know and have applied many linear and non-linear regression algorithms in the past, this case is somewhat different. There is one variable which, according to my BU, should have a positive and linear correlation …
I have four features x1, x2, x3, x4, and their correlations with y are all similar within each measure: roughly +0.15 in Pearson and roughly -0.6 in Spearman. Does this make sense? How can I interpret this result? All four features and the target are definitely related. From a common-sense perspective, the sign of the Spearman coefficient also seems the more accurate one.
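For what it's worth, here is a toy sketch (made-up numbers, not my actual features) showing how a single high-leverage point can make Pearson positive while Spearman stays clearly negative, which is the kind of pattern I suspect:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    # Toy data: a mostly decreasing relationship plus one extreme,
    # high-leverage point in the upper-right corner.
    x = np.arange(1, 21, dtype=float)
    y = 21.0 - x                      # monotone decreasing -> Spearman close to -1
    x = np.append(x, 1000.0)          # single extreme observation
    y = np.append(y, 1000.0)

    r_pearson, _ = pearsonr(x, y)
    r_spearman, _ = spearmanr(x, y)
    print(f"Pearson:  {r_pearson:+.2f}")   # positive, driven by the one leverage point
    print(f"Spearman: {r_spearman:+.2f}")  # still clearly negative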
I am new to Data Science. I have a Python data frame with NumberofDays, CountofJobs, and AmountEarned. What statistical method should I use to find the correlation between Days and AmountEarned?

NumberofDays  CountofJobs  AmountEarned
20            3            50000
22            18           10000
35            10           80000
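If it helps, here is a minimal sketch (using only the three rows shown above) of how the correlation could be computed with pandas/SciPy; whether Pearson or Spearman is the better choice is exactly what I'm unsure about:

    import pandas as pd
    from scipy.stats import pearsonr, spearmanr

    # The three-column frame from the question (tiny illustrative sample)
    df = pd.DataFrame({
        "NumberofDays": [20, 22, 35],
        "CountofJobs":  [3, 18, 10],
        "AmountEarned": [50000, 10000, 80000],
    })

    # Pearson measures linear association, Spearman monotonic association
    r, p = pearsonr(df["NumberofDays"], df["AmountEarned"])
    rho, p_s = spearmanr(df["NumberofDays"], df["AmountEarned"])
    print(f"Pearson r = {r:.2f} (p = {p:.2f}), Spearman rho = {rho:.2f} (p = {p_s:.2f})")

    # Or the full correlation matrix in one call
    print(df.corr(method="pearson"))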
Context: I'm currently building and comparing machine learning models to predict housing data. I have around 32000 data points and 42 features, and I'm predicting housing price. I'm comparing a Random Forest Regressor, a Decision Tree Regressor, and Linear Regression. I can tell there is some overfitting going on, as my initial values vs. cross-validated values are as follows:
RF: 10-fold R squared = 0.758, neg RMSE = -540.2 vs. unvalidated R squared of 0.877, RMSE of 505.6
DT: 10-fold …
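This is roughly how I'm producing the numbers above, sketched with a synthetic stand-in dataset rather than the real housing data:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in (my real set has ~32000 rows and 42 features)
    X, y = make_regression(n_samples=2000, n_features=42, noise=10.0, random_state=0)

    model = RandomForestRegressor(n_estimators=100, random_state=0)

    # 10-fold cross-validated scores vs. a single fit on the full data
    cv_r2 = cross_val_score(model, X, y, cv=10, scoring="r2")
    cv_neg_rmse = cross_val_score(model, X, y, cv=10, scoring="neg_root_mean_squared_error")
    print("10-fold R^2:", cv_r2.mean(), " 10-fold neg RMSE:", cv_neg_rmse.mean())

    model.fit(X, y)
    print("unvalidated (train) R^2:", model.score(X, y))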
I have a dataset with 22 independent variables, of which 15 are categorical and have already been label encoded, i.e. the dtype is int64 and the contents are in a range of 0 to n (where n is the number of distinct classes). I got the data in this form and did not have to encode it myself. Since the data has already been encoded, I can directly use Python's Pearson correlation to get the correlation matrix for all combinations (encoded-encoded, …
Given the following dataframe

   age  job       salary
0  1    Doctor    100
1  2    Engineer  200
2  3    Lawyer    300
...

with age as numeric and job as categorical, I want to test the correlation with salary, for the purpose of selecting the features (age and/or job) for predicting the salary (regression problem). Can I use the following APIs from sklearn (or another API) to test it: sklearn.feature_selection.f_regression, sklearn.feature_selection.mutual_info_regression? If yes, what's the right method and syntax to test the correlation? Following …
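To make the question concrete, this is the kind of call I had in mind, extending the toy frame above; the one-hot encoding of job is my own assumption and part of what I want checked:

    import pandas as pd
    from sklearn.feature_selection import f_regression, mutual_info_regression

    df = pd.DataFrame({
        "age":    [1, 2, 3, 4, 5, 6],
        "job":    ["Doctor", "Engineer", "Lawyer", "Doctor", "Engineer", "Lawyer"],
        "salary": [100, 200, 300, 110, 210, 310],
    })

    # f_regression expects numeric inputs, so the categorical column has to be
    # encoded first (one-hot here); whether that is the right treatment for "job"
    # is part of my question.
    X = pd.get_dummies(df[["age", "job"]], columns=["job"], dtype=float)
    y = df["salary"]

    F, pval = f_regression(X, y)                        # univariate linear F-test per column
    mi = mutual_info_regression(X, y, random_state=0)   # mutual information per column

    for col, f_stat, p, m in zip(X.columns, F, pval, mi):
        print(f"{col:15s}  F={f_stat:8.2f}  p={p:.3f}  MI={m:.3f}")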
I have a Pandas DataFrame with multiple columns (3000 or more) containing time series (dates as the index).

            | id1  id2  id3
-------------------------------
2021-01-06  |  27   29    5
2021-01-07  |  24   20    9
...
2021-01-08  |  21   13   14
2021-01-09  |  10    6   24
...

And I need to do rolling-window computations of the Pearson correlation on each pair of columns. I'm using multiprocessing and the regular pandas.DataFrame.corr() function, and it takes days to complete the calculation. Is it possible to …
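For context, here is a stripped-down sketch of the kind of computation involved, written (as one idea) with a single np.corrcoef call per window position instead of per-pair corr() calls; the sizes below are toy sizes, my real frame is much wider:

    import numpy as np
    import pandas as pd

    # Small stand-in frame (the real one has 3000+ columns and a date index)
    rng = np.random.default_rng(0)
    dates = pd.date_range("2021-01-06", periods=250, freq="D")
    df = pd.DataFrame(rng.normal(size=(250, 50)),
                      index=dates,
                      columns=[f"id{i}" for i in range(50)])

    window = 30
    values = df.to_numpy(dtype=float)

    # One np.corrcoef call per window position computes ALL pairwise correlations
    # at once via a single matrix product, rather than looping over column pairs.
    corr_by_date = {}
    for end in range(window, len(values) + 1):
        block = values[end - window:end]                 # shape (window, n_cols)
        corr_by_date[df.index[end - 1]] = np.corrcoef(block, rowvar=False)

    # e.g. rolling correlation between id0 and id1 at the last date
    print(corr_by_date[df.index[-1]][0, 1])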
I'm trying to calculate a correlation coefficient (or another suitable measure of association) between State and Action in a 'delayed action effect' environment. In this environment, the agent observes states, then it returns an action and a reward. But actions only take effect 'T' time steps after the states. So it is really hard to see how to approach this environment (because this is the first trial). Are there any good approaches in this situation?
We perform data analysis and build models. Say, for example, I built a regression model that has more than one predictor (multiple regression). We then check many things: normality, multicollinearity, etc. Specifically, to check for multicollinearity among numeric/continuous variables, we use VIF (Variance Inflation Factors) and the like. If we find that there is multicollinearity, we drop one of the highly correlated features. My question is: what can be done with categorical variables? I mean, if two categorical variables are correlated/associated, does …
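For concreteness, this is the continuous-variable check I'm referring to (made-up data; VIF via statsmodels). My question is what the analogue of this step is for categorical variables:

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    # Made-up numeric predictors, two of them deliberately collinear
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)   # highly correlated with x1
    x3 = rng.normal(size=200)
    X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

    # VIF for each predictor (skip the constant); large values flag multicollinearity
    for i, name in enumerate(X.columns):
        if name == "const":
            continue
        print(name, variance_inflation_factor(X.values, i))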
I have a dataset of process data for different equipment with many sensors. I would like to check the correlations between the different sensors, to see whether any of them are strongly correlated and to potentially reduce the size of my dataset. Within this process data there are many different processes of varying lengths and different equipment. For now I am asserting that the different equipment shouldn't make a difference and therefore I do not want to include this …
Does having a positive or negative correlation between features being clustered affect the agglomerative clustering result? I have three columns in my dataset, and I'm trying to figure out if I should cluster on all three features or use only a subset. The Pearson correlation coefficients are:
X & Z --> -0.07, p=0.14
X & Y --> -0.08, p=0.08
Z & Y --> 0.68, p<0.001
The Variance Inflation Factors are:
variables    VIF
Y            2.816716
X            3.552227
Z            6.232414
Should I …
What are the characteristics of the three correlation coefficients, how do they compare with one another, and what are their assumptions? Can somebody kindly walk me through the concepts?
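In case it helps to anchor an answer, here is a tiny sketch, assuming the three coefficients in question are Pearson, Spearman, and Kendall, on made-up data with a monotonic but non-linear relationship:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr, kendalltau

    # Toy data: y grows monotonically but non-linearly with x
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 5, size=100)
    y = np.exp(x) + rng.normal(scale=5.0, size=100)

    r_p, _ = pearsonr(x, y)      # linear association
    r_s, _ = spearmanr(x, y)     # monotonic association, rank-based
    r_k, _ = kendalltau(x, y)    # concordant vs. discordant pairs
    print(f"Pearson {r_p:.2f}, Spearman {r_s:.2f}, Kendall {r_k:.2f}")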
I am using the fourth-corner method in one of my papers (for those who need the name). The method was developed to test associations between variables in two datasets. In my case, the datasets contain traits of species (e.g. the trait Size with modalities 'small', 'medium', 'large'). The method recognizes the data type and then applies the appropriate statistic. The correct cases: if two variables are quantitative, fourthcorner calculates Pearson correlations; if two variables are qualitative (factorial), the method calculates a …
The independent variables in the dataset include categorical variables such as Gender (2 levels), Mode of Shipment (3 levels), and Product Importance (4 levels), and numerical variables such as Customer care calls, Discount Offered, and Package weight. How do I find the correlations between these variables? By converting the categorical variables into dummy variables and then using Pearson correlation? What if the dummy-variable categories also show correlations among themselves, such as between the Mode of Shipment categories Flight, Ship, and Road? …
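This is what I mean by the dummy-variable approach, in code (toy rows, not the real data); the Mode of Shipment dummies correlating with each other is exactly the behaviour I'm asking about:

    import pandas as pd

    # Made-up rows standing in for the shipping dataset
    df = pd.DataFrame({
        "Gender":              ["M", "F", "F", "M", "F", "M"],
        "Mode_of_Shipment":    ["Flight", "Ship", "Road", "Ship", "Flight", "Road"],
        "Product_Importance":  ["low", "medium", "high", "low", "high", "medium"],
        "Customer_care_calls": [3, 5, 2, 4, 6, 1],
        "Discount_Offered":    [10, 5, 20, 8, 15, 3],
        "Package_weight":      [1.2, 3.4, 2.2, 5.0, 0.8, 4.1],
    })

    # Dummy-encode the categoricals, then one Pearson correlation matrix over
    # everything, including dummy-vs-dummy pairs (the part I'm unsure about).
    encoded = pd.get_dummies(
        df,
        columns=["Gender", "Mode_of_Shipment", "Product_Importance"],
        dtype=float,
    )
    print(encoded.corr(method="pearson").round(2))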
I calculated the Pearson correlation coefficient between two signals that describe the state of a unit. During normal operation of the unit, both signals were fairly stable and fluctuated very little. At some point, a defect began to develop in the unit; in connection with this, the oscillations of the signals increased, and a growth trend in their absolute values also began to be observed. [Figure: signals describing the normal operation of the unit. Figure: signals describing emergency operation of …]
I am struggling to find a suitable way to calculate a correlation coefficient for categorical variables. Pearson's coefficient is not supported for categorical features. I want to find the features with the highest influence on the target variable. My objectives are: correlation between categorical and categorical variables, e.g. for a binary target (like the Titanic dataset), I want to find out what the influence of a category is on the target (like the influence of gender on survival (0/1)); and to capture some non …
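One candidate I've come across for the categorical-vs-categorical part is Cramér's V (chi-square based); here is a sketch with made-up Titanic-style rows, in case that is the right direction:

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    def cramers_v(x: pd.Series, y: pd.Series) -> float:
        """Basic Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
        table = pd.crosstab(x, y)
        chi2 = chi2_contingency(table)[0]
        n = table.to_numpy().sum()
        r, k = table.shape
        return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

    # Titanic-style toy example: association between gender and survival
    df = pd.DataFrame({
        "gender":   ["male", "female", "female", "male", "male", "female", "male", "female"],
        "survived": [0, 1, 1, 0, 0, 1, 1, 1],
    })
    print(cramers_v(df["gender"], df["survived"]))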
I would like to understand how to find an association between users, spam, and email age. My dataset looks as follows:

User         Spam  Age (yr)
porn_23      1     1
Mary_g       0     6
cricket_s54  0     4
rewuoiou     1     0
pure75       1     2
giogio35     0     10
viv3roe      1     1

I am looking at the correlation using Pearson. Is that right? I would like to determine the correlation between age and user: spam emails should be likely to come from users having recent email addresses …
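For concreteness, this is the kind of call I tried with the table above (point-biserial, i.e. Pearson with a 0/1 variable, which may or may not be the right choice here):

    import pandas as pd
    from scipy.stats import pointbiserialr

    # The rows from the question
    df = pd.DataFrame({
        "User": ["porn_23", "Mary_g", "cricket_s54", "rewuoiou", "pure75", "giogio35", "viv3roe"],
        "Spam": [1, 0, 0, 1, 1, 0, 1],
        "Age":  [1, 6, 4, 0, 2, 10, 1],
    })

    # Spam is binary and Age is numeric, so the point-biserial correlation
    # (equivalent to Pearson on a 0/1 variable) seemed like the natural fit.
    r, p = pointbiserialr(df["Spam"], df["Age"])
    print(f"point-biserial r = {r:.2f}, p = {p:.3f}")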
I have two datasets with which I want to do a Pearson correlation analysis. I have carried out the analysis, and the results make sense; however, I want to be sure it is valid given that the two datasets have values on different scales. The features in both datasets are exactly the same (the actual samples are of course different). The ranges of values are as follows: dataset1 = 3-20, dataset2 = 10-30. Now my understanding is that the Pearson correlation coefficient is not …
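To check my own understanding, here is a small sketch (made-up data on the two ranges) of the property I'm relying on, namely that Pearson is unchanged by a positive linear rescaling of either variable:

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)

    # Made-up feature roughly on the dataset1 scale (3-20), plus a correlated partner
    a = rng.uniform(3, 20, size=500)
    b = 2.0 * a + rng.normal(scale=3.0, size=500)

    # Rescale a to roughly the dataset2 range (10-30): a positive linear transformation
    a_rescaled = 10 + (a - 3) * (30 - 10) / (20 - 3)

    r1, _ = pearsonr(a, b)            # correlation on the original scale
    r2, _ = pearsonr(a_rescaled, b)   # identical: Pearson is invariant to linear rescaling
    print(r1, r2)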