Changes in the standard Heatmap plot - symmetric bar colors, show only diagonal values, and column names at x,y axis ticks

I have a heatmap image (correlation between all matrix columns) and I'm struggling to perform all the changes below within the same image: bar colors should be symmetric around zero (e.g., correlations of 1 and -1 should have the same color); change the correlation matrix to a diagonal matrix, since correlation values are symmetric, and show only the upper matrix triangle (mask out the lower triangle); show the correlation values in every cell of the diagonal matrix; x,y …
Category: Data Science
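
A minimal sketch of the masking and color-scale steps asked about in this question, assuming NumPy and matplotlib are available. The function names are hypothetical, and "symmetric around zero" is read here as a diverging color scale with limits fixed at -1 and +1; if +1 and -1 should literally share one color, plotting the absolute value of the correlations would be the alternative.

```python
import numpy as np


def upper_triangle_corr(data):
    """Correlation matrix of the columns of `data`, with the strict lower triangle masked."""
    corr = np.corrcoef(data, rowvar=False)
    # True marks cells to hide: everything strictly below the diagonal.
    hidden = np.tril(np.ones_like(corr, dtype=bool), k=-1)
    return np.ma.masked_array(corr, mask=hidden)


def plot_upper_triangle(data, names):
    """Draw the masked matrix with a zero-centred color scale and cell annotations."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    corr = upper_triangle_corr(data)
    fig, ax = plt.subplots()
    # Fixed limits of -1 and +1 on a diverging colormap centre the color bar on zero.
    image = ax.imshow(corr, cmap="RdBu_r", vmin=-1, vmax=1)
    # Column names as tick labels on both axes.
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names, rotation=90)
    ax.set_yticks(range(len(names)))
    ax.set_yticklabels(names)
    # Write the correlation value into every visible (upper-triangle) cell.
    for i in range(len(names)):
        for j in range(i, len(names)):
            ax.text(j, i, f"{corr[i, j]:.2f}", ha="center", va="center")
    fig.colorbar(image, ax=ax)
    return fig
```

With seaborn, the same idea is usually expressed by passing a boolean mask and fixed color limits to its heatmap function; the plain matplotlib version above only avoids the extra dependency.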

Does LightGBM handle multicollinearity?

I have a dataset after feature selection of around 6500 features and 10,000 data rows. I am using a LightGBM model. I want to know if I should check the feature set for multicollinearity. If two or more features are correlated, how does it affect the tree building and classification prediction? How does LightGBM deal with multicollinearity? Does it have any adverse effects?
Category: Data Science

Python: calculate the weighted average correlation coefficient

I am calculating the volatility (standard deviation) of returns of a portfolio of assets using the variance-covariance approach. Correlation coefficients and asset volatilities have been estimated from historical returns. Now what I'd like to do is compute the average correlation coefficient, that is, the common correlation coefficient between all asset pairs that gives me the same overall portfolio volatility. I could of course take an iterative approach, but was wondering if there was something simpler / out of the box …
Category: Data Science
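
A possible non-iterative route for the question above: under the variance-covariance approach the portfolio variance is linear in a common correlation, so it can be solved for directly as $\bar\rho = \frac{\sigma_p^2 - \sum_i w_i^2 \sigma_i^2}{(\sum_i w_i \sigma_i)^2 - \sum_i w_i^2 \sigma_i^2}$. The sketch below assumes NumPy; the function name and the symbols $w_i$ (weights), $\sigma_i$ (asset volatilities) and $\sigma_p$ (portfolio volatility) are illustrative choices, not taken from the question.

```python
import numpy as np


def implied_average_correlation(weights, vols, corr):
    """Common pairwise correlation that reproduces the portfolio variance."""
    w = np.asarray(weights, dtype=float)
    s = np.asarray(vols, dtype=float)
    rho = np.asarray(corr, dtype=float)
    x = w * s  # weighted volatilities
    portfolio_var = x @ rho @ x  # variance-covariance portfolio variance
    own_terms = np.sum(x ** 2)  # diagonal (i == j) contribution
    cross_terms = np.sum(x) ** 2 - own_terms  # sum over i != j of x_i * x_j
    return (portfolio_var - own_terms) / cross_terms
```

Equivalently, the result is the average of the pairwise correlations weighted by $w_i w_j \sigma_i \sigma_j$, which is why no iteration is needed.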

Temperature lag forecasting

I am working on a data science project on an industrial machine. This machine has two heating systems (fuel and electricity). It uses both at the same time, and I am trying to estimate the temperature value that occurs in the thermocouple as a result of this heating. However, this heating process takes place with some delay/lag. In other words, a one-unit change that I make in fuel and electrical heating is reflected in the thermocouple hours later. …
Category: Data Science
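
One simple baseline for a delayed response like the one described above is a linear model on lagged copies of the inputs. The sketch below assumes NumPy, evenly spaced samples, and candidate lags chosen by the user; the function names and the lag values in the usage note are hypothetical, and a real heating process may well need a nonlinear or dynamic model instead.

```python
import numpy as np


def lagged_design(inputs, lags):
    """Design matrix of lagged input columns plus an intercept; rows start at t = max(lags)."""
    inputs = np.asarray(inputs, dtype=float)
    if inputs.ndim == 1:
        inputs = inputs[:, None]
    n = inputs.shape[0]
    start = max(lags)
    # Column order: for each input column, one column per lag, in the order given.
    cols = [
        inputs[start - lag : n - lag, k]
        for k in range(inputs.shape[1])
        for lag in lags
    ]
    cols.append(np.ones(n - start))  # intercept term
    return np.column_stack(cols)


def fit_lagged_linear(inputs, target, lags):
    """Least-squares coefficients for target[t] as a linear function of inputs[t - lag]."""
    X = lagged_design(inputs, lags)
    y = np.asarray(target, dtype=float)[max(lags):]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef
```

For example, `fit_lagged_linear(np.column_stack([fuel, electricity]), temperature, lags=[3, 5])` would return two fuel coefficients, two electricity coefficients, and an intercept, in that order.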

How to find lagged cross correlation between time series?

I have 2 time series, $X$ and $Y$, and I'm trying to find the best lag range that correlates $X$ to $Y$ (find the amount(s) of lag of $X$ that best correlate to the target variable $Y$). For instance, if the best lag range is between $t = 8$ and $t = 10$, then the final equation would be $Y_t = \alpha_1 X_{t-8} + \alpha_2 X_{t-9} + \alpha_3 X_{t-10} + \alpha_4$. Since the value of $Y$ could depend not only …
Category: Data Science
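
A small sketch of the scan over lags described in this question, assuming NumPy, two series of equal length, and plain Pearson correlation as the measure. The function name is hypothetical; it only ranks individual lags and does not fit the $\alpha$ coefficients of the final equation.

```python
import numpy as np


def lagged_correlations(x, y, max_lag):
    """Pearson correlation between x[t - lag] and y[t] for lag = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    result = {}
    for lag in range(max_lag + 1):
        # Drop the last `lag` points of x and the first `lag` points of y
        # so that x[t - lag] lines up with y[t].
        x_shifted = x[: len(x) - lag]
        y_aligned = y[lag:]
        result[lag] = np.corrcoef(x_shifted, y_aligned)[0, 1]
    return result
```

Sorting the returned dictionary by absolute correlation gives candidate lags; neighbouring lags with similarly high values would suggest a lag range rather than a single lag.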

Finding the worst affected industry due to COVID in terms of unemployment

My goal is to find the industries worst affected by COVID-19 in terms of unemployment. In terms of the data I will use for this task, I have monthly county-level unemployment rate time series and business distribution data. The business distribution data contains the number of establishments in each county by industry (Manufacturing - 121, Accommodation and Food Services - 564, Construction - 32, etc.). The unemployment rate data gives the monthly unemployment rate in each county. From this data, what would your …
Category: Data Science

Detecting abundance of a certain periodic pattern in a time series?

I am really stumped at the moment about how to solve a particular problem. I have many time series like this: This represents the number of hours a person spends on a website each day throughout the year. Any days where they are not seen to be using the website have zero values, rather than missing values. What I really want to do is to calculate a metric telling me to what extent there is a consistent "1 hour per …
Category: Data Science

ML methods for vector correlation

I am dealing with a timeseries consisting of input flow sampled every 5 minutes over 441 days. My aim is to find any possible correlation from data coming from: the same day of the week; the same moment in time. I proceeded to sample according to weekdays and hours. Then I computed the 63x63 correlation matrix for each of the weekdays and a 441x441 for each hour, which in the second case is pretty impractical. I feel like this way …
Category: Data Science

How to set the same number of datapoints in the different ranges in correlation chart

I am a beginner in machine learning. How can I set the same number of datapoints in the different ranges of a correlation chart? Are there any techniques for doing that? Specifically, I want to set the same number of datapoints in each range (0-10; 10-20; 20-30; ...) in the image above. Thanks for any help.
Category: Data Science
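
One possible reading of this question is random undersampling so that every range keeps as many points as the least populated one. The sketch below assumes NumPy; the function name and the bin edges in the usage note are hypothetical, and since the chart itself is not shown, this may not match what the asker had in mind.

```python
import numpy as np


def balance_by_bins(values, bin_edges, seed=0):
    """Indices that keep the same number of points in every non-empty bin."""
    values = np.asarray(values)
    rng = np.random.default_rng(seed)
    # Assign each value to a bin; interior edges such as [10, 20] give bins 0, 1, 2.
    bin_ids = np.digitize(values, bin_edges)
    groups = [np.flatnonzero(bin_ids == b) for b in np.unique(bin_ids)]
    # Keep as many points per bin as the smallest non-empty bin contains.
    n_keep = min(len(g) for g in groups)
    kept = [rng.choice(g, size=n_keep, replace=False) for g in groups]
    return np.sort(np.concatenate(kept))
```

Calling `balance_by_bins(x, [10, 20])` returns row indices that can be applied to both plotted variables, so the paired points stay aligned after subsampling.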

Treating highly correlated features to the label feature

We work on a dataset with >1k features, where some elements are temporal/non-linear aggregations of other features. For example, one feature might be the salary s, while another is the mean salary over four months (s_4_m). We try to predict which employees are more likely to get a raise by applying a regression model on the salary. Still, our models are extremely biased toward features like s_4_m, which are highly correlated with the label feature. Are there best practices for …
Category: Data Science

At which stage should the correlation analysis be done?

I was thinking about it, but I couldn't find a logical explanation. I mostly follow the steps below once the data is ready: (1) correlation analysis and elimination; (2) apply dummy encoding if categorical variables exist; (3) balance the data if it is unbalanced; (4) scale the data; (5) feature selection (backward, stepwise, etc.); (6) train the model. Where in this path would it make the most sense to apply the correlation analysis? After the data is balanced? After scaling? Or at the start?
Category: Data Science

Distinguishing correlation and causation in data

I have a lab assignment to find a cause-effect relationship between 2 columns of a dataset. First, I want to ask: is causation a subset of correlation? I ask because I saw that we first need to find which columns are correlated with each other in order to find whether they have a cause-effect relationship. Can 2 columns with no correlation have causation? Second, regarding the real data: I use COVID data from link and crawl it day by …
Topic: correlation
Category: Data Science

Dropping highly correlated features

I am working on a classification project and I have this situation after using a seaborn heatmap. Column 0 is my target, where I have data with 3 classes. To my knowledge, I should remove columns highly correlated with the target value. My questions are: Should I also remove features that are highly correlated with features other than the target? For example, I can see very high correlation between columns 27 and 46. Should I remove one of them? This is the correlation heat map from my …
Category: Data Science
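
If one feature from each highly correlated pair is to be dropped, as this question considers, a common heuristic is to scan the absolute correlation matrix and drop any column that correlates above a threshold with an earlier column that is still kept. The sketch below assumes NumPy; the function name and the 0.9 threshold are illustrative assumptions, and whether dropping is appropriate at all depends on the model.

```python
import numpy as np


def correlated_columns_to_drop(X, threshold=0.9):
    """Indices of columns that exceed `threshold` correlation with an earlier kept column."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n_features = corr.shape[0]
    drop = []
    for j in range(n_features):
        for i in range(j):
            # Only compare against columns that have not been dropped themselves.
            if i not in drop and corr[i, j] > threshold:
                drop.append(j)
                break
    return drop
```

The result depends on column order, since the first column of each correlated group is the one that survives; `np.delete(X, drop, axis=1)` then yields the reduced feature matrix.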

Alternative methods for novelty detection and correlations

Hey mates, I have the following project: Imagine having two datasets A and B. Each dataset consists of 101 time series with the same length and identical time stamps. The two datasets were taken from the same experiment, therefore the data structure is identical. From the 101 time series there is one particular signal that is of interest in both datasets. Let's call that signal X(t)_101. Now we have the following case that the signal X(t)_101 from dataset A (good …
Category: Data Science

Drawing validation set from test set

I am building 3 neural network models on a dataset that is already separated into train and test sets. From my analysis, I found that this dataset has values in the test set which don't exist in the train set. And this puts a certain limitation or maximum capacity on my neural network model(s). By this I mean, I cannot seem to improve the accuracy even if I change the hyperparameters or the parameters of my models. I have …
Category: Data Science

How to interpret two continuous variables output using GAM?

I really need help with GAM. I have to find out whether the association is linear or non-linear by using GAM. The predictor variable is temperature at lag0 and the output is cardiovascular admissions (count variable). I have tried a lot but I am not able to understand how to interpret the graph and output that I am getting. I tried this formula using the mgcv package: model1 <- gam(cvd ~ s(templg0), family=poisson); summary(model1); plot(model1). So here is the output for summary that …
Category: Data Science

Filling NaN values

According to my knowledge, before filling NaN values we have to check whether data is missing because of MCAR, MAR or MNAR, which depends on how features are correlated with each other, and then decide which method to apply. So, my question is: is it good practice to check the dependency of features with a chi-square independence test? If not, please suggest which techniques to use to fill NaN values. I will be …
Category: Data Science

Does R2 diverge because of a lack of input dimensions?

I am trying to improve my R2 score between theoretical and real output values. In the picture you can see two cases: the blue one is an artificial case that I fully control, with 7 dimensions as input and 1 dimension as output; the orange curve is a real case, with 7 inputs and 1 output. As you can see, the blue curve responds as expected: the more data I add, the better the prediction. But with the orange case, it is the opposite. …
Category: Data Science

Time Posting Data Analysis

I work in a professional services company and would like to get some analytics on how the discipline with which fee earners post their time to the system may affect revenue. One area that I am thinking of is to see if there is a correlation between how late the time entry is posted after the work was done and whether that time entry is being written off (i.e. does not make it to the bill to the …
Topic: correlation
Category: Data Science
