Performing EDA on a dataset with missing features

I'm new to data science. I want to perform EDA on a dataset with missing values. These are the missing-value counts per feature for my train and test sets. train: Test_0 0, Test_1 31, Test_2 0, Test_3 141, Test_4 0, Test_5 0, Test_6 0, Test_7 0, Test_8 1045, Test_9 0, Test_10 0, Test_11 0, Test_12 0, Test_13 0, Test_14 0, Test_15 2967, Class 0 (dtype: int64). test: Test_0 0, Test_1 7, Test_2 0, Test_3 46, Test_4 0, Test_5 0, Test_6 0, Test_7 0, Test_8 …
Category: Data Science
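
Counts like the ones quoted in this question are usually easier to read next to the share of rows they represent. The sketch below is one way to produce such a summary with pandas; the helper name `missing_summary` and the toy frame are assumptions made for illustration, not part of the original post.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and shares for columns that have any NaN."""
    counts = df.isna().sum()   # number of NaN cells per column
    share = df.isna().mean()   # fraction of rows that are NaN per column
    out = pd.DataFrame({"n_missing": counts, "share_missing": share})
    # Keep only columns with at least one missing value, worst first.
    return out[out["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Hypothetical toy data; the real column names and values are not known.
toy = pd.DataFrame({
    "Test_0": [1.0, 2.0, 3.0, 4.0],
    "Test_1": [1.0, np.nan, 3.0, 4.0],
    "Test_3": [np.nan, np.nan, 3.0, 4.0],
})
summary = missing_summary(toy)
print(summary)
```

Running the same helper on the train and test frames separately makes it easy to check whether the same features are affected in both.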

Filling NaN values

As far as I know, before filling NaN values we have to check whether the data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). That depends on how the features are related to each other, and it determines which imputation method to apply. So my question is: is it good practice to check the dependency between features with a chi-square test of independence? If not, please suggest which techniques to use to fill NaN values. I will be …
Category: Data Science
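
One common way to apply a chi-square test to this question is to test whether the missingness indicator of one feature is independent of another, categorical feature. The sketch below assumes pandas and SciPy; the function name `missingness_vs_category` and the toy columns `group` and `value` are hypothetical. A small p-value is evidence against MCAR with respect to that feature, but observed data alone cannot separate MAR from MNAR.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def missingness_vs_category(df, col_with_nans, other_col):
    """Chi-square test of independence between 'is missing' and a category."""
    # Label each row as missing or observed for the column of interest.
    indicator = df[col_with_nans].isna().map({True: "missing", False: "observed"})
    table = pd.crosstab(indicator, df[other_col])
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return table, chi2, p_value, dof


# Hypothetical toy data: values are missing far more often in group "a".
toy = pd.DataFrame({
    "group": ["a"] * 20 + ["b"] * 20,
    "value": [np.nan] * 15 + [1.0] * 5 + [np.nan] * 2 + [1.0] * 18,
})
table, chi2, p_value, dof = missingness_vs_category(toy, "value", "group")
print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
```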

Industry analysis - multiple industries

I am trying to run logistic regression on marketing leads and use industry as a predictor of whether the lead converts (1/0). Often, when I enrich data from websites like Crunchbase, the company associated with a lead has multiple industry associations. I am looking for help in finding a method, as well as an R script, to separate these values and ultimately identify the industries that are the best predictors of conversion. If anyone can offer some guidance here it …
Category: Data Science
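
The question asks for R, but the idea carries over to any language: turn the multi-valued industry field into one 0/1 indicator column per industry, then feed those columns to the model. The sketch below shows that step in Python with pandas, assuming a comma-separated field; the column names `industries` and `converted` and the toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy leads; the real field name and separator may differ.
leads = pd.DataFrame({
    "industries": ["SaaS, FinTech", "FinTech", "Retail, SaaS", "Retail"],
    "converted": [1, 1, 0, 0],
})

# Normalise "a, b" to "a,b", then build one 0/1 column per industry.
cleaned = leads["industries"].str.replace(r"\s*,\s*", ",", regex=True)
indicators = cleaned.str.get_dummies(sep=",")

# Simple descriptive check: conversion rate among leads tagged with each industry.
rates = pd.Series({
    col: leads.loc[indicators[col] == 1, "converted"].mean()
    for col in indicators.columns
})
print(indicators)
print(rates)
```

The `indicators` frame can be joined back to the leads and used as predictors in a logistic regression; a lead with several industries simply has several columns set to 1.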

How to predict strategy based on given data using Machine Learning?

My basic goal is to predict a strategy from given data, for instance: a) predict which formation in a football match will maximize my winning rate; b) predict which product combination will maximize my sales rate in a grocery store. How are such problems handled in machine learning? What approaches are used for them?
Category: Data Science

Does the sign of correlation matter in feature selection?

If I understand correctly, the correlation between features and the target can be used to quantify whether those features are relevant to keep, hence the ritual of plotting the correlation matrix as a key step in data exploration. However, does the sign of the correlation matter when it comes to feature selection? Isn't the only thing that matters the strength of the correlation (or anti-correlation)?
Category: Data Science
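
If only the strength of the linear relationship is used for screening, features can be ranked by the absolute value of their correlation with the target while the signed value is kept for interpretation. A minimal sketch with pandas follows; the helper name and the toy columns `strong_neg`, `weak_pos`, and `target` are assumptions for illustration.

```python
import pandas as pd


def rank_by_abs_correlation(df, target):
    """Order features by |Pearson r| with the target, keeping the signed r."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Hypothetical toy data: one strongly anti-correlated feature, one weaker positive one.
toy = pd.DataFrame({
    "target": [1, 2, 3, 4, 5],
    "strong_neg": [10, 8, 6, 4, 2],
    "weak_pos": [1, 3, 2, 5, 4],
})
ranked = rank_by_abs_correlation(toy, "target")
print(ranked)
```

Here the negatively correlated feature ranks first, which illustrates that a strong anti-correlation carries as much linear information as a strong positive one.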

Practical Interpretation of PCAs for a supplier analysis

I am using PCA to validate and research a set of 13 suppliers of products against a set of about 50 variables and performance indicators against an ideal "wish"-Supplier, mostly based on G. Jankers Book on Factor Analysis for Supplier a Rating System. In RStudio I use my data to perform the PCA with prcomp. My question concerns the practical statements that can be made about the outcomes of the PCA and its factors. My goal is to identify the performance indicators, …
Category: Data Science

When would you use a feature optimization method instead of exploratory analysis to identify the best features?

I have a dataset with around 70 features. I'm currently just plotting graphs and trying to identify key information. I also want to build a predictive model later. What would be the best way to find the best features? Would it be wise to go through every column and try to spot trends and correlations? Or would it be sensible to just use a wrapper method or a genetic algorithm search? Or just do a random forest classifier on the whole …
Category: Data Science

Creating sub categories

I have data that we have collected quarterly over the last two years from two organisations. The data are collected using 29 questions. For each organisation, there are about 500 answers per question. The number produced for each quarter, question and organisation is an average score (1-10). Example of 5 questions is below: The issue I am trying to solve is the second column. We use these tags to create a sub category or score. However, having …
Category: Data Science

Determine which factor is responsible for a change in a top-line business metric?

Are there any techniques for determining which factor(s) is (are) responsible for a change in a top-line business metric? E.g., revenue drops - but was it because of a drop in global visitors, or perhaps a drop in conversion rate at the London store, or maybe there were heavy discounts on the weekend, etc. So far I've explored Value Driver Analysis, Sensitivity Analysis, Root Cause Analysis, Factor Analysis, but I'm not sure if they're useful. Example I have $n$ retail …
Category: Data Science

Why are correlation matrices used versus a matrix of R^2 values?

I'm relatively new to DS, so forgive me if this is a dumb question or in the wrong forum. When evaluating features, it seems that almost everywhere a correlation matrix is used [df.corr(), cor(df, method="pearson")]. The way I understand it, a correlation matrix describes the strength and direction of the linear relationship (strong negative through strong positive) between each feature/predictor and all others. However, if $R^2$ indicates the amount of variability explained by the linear relationship, between each …
Category: Data Science
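
For a simple linear regression with one predictor and an intercept, $R^2$ equals the square of the Pearson correlation $r$, so a matrix of $R^2$ values would carry the same strength information but lose the sign. The sketch below checks this numerically with NumPy; the data values are made up for illustration.

```python
import numpy as np

# Hypothetical toy data with a clear negative linear trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([9.1, 8.2, 6.8, 6.1, 4.9, 4.2])

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

# R^2 of the least-squares line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, R^2 = {r_squared:.4f}")
```

The two squared quantities agree up to floating-point error, while only `r` shows that the relationship is negative.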

What can I conclude when a variable is influenced by others but there isn't any correlation?

I am doing an exploratory analysis. The target is a continuous variable and the attributes are all categorical (discrete values). To find out whether each attribute has any influence on the target, I am running an ANOVA test like this: fvalue, pvalue = stats.f_oneway(df[y], df[x]) pvalue < 0.5 If that condition is true, there is a dependency between the variables. For all variables I get a true dependency with ANOVA, but the values of the correlation are between …
Category: Data Science
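
Note that `scipy.stats.f_oneway` expects one sample of target values per group, so passing the target column and the attribute column as two arguments compares them as if they were two groups. A sketch of the grouped call follows, assuming pandas and SciPy; the helper name, the toy columns `colour` and `price`, and the use of the conventional 0.05 threshold (rather than the 0.5 quoted in the excerpt) are assumptions.

```python
import pandas as pd
from scipy import stats


def anova_by_category(df, target, attribute):
    """One-way ANOVA of a continuous target across the levels of an attribute."""
    # Build one array of target values for each level of the attribute.
    groups = [g[target].to_numpy() for _, g in df.groupby(attribute)]
    return stats.f_oneway(*groups)


# Hypothetical toy data: the target mean clearly differs between levels.
toy = pd.DataFrame({
    "colour": ["a"] * 4 + ["b"] * 4 + ["c"] * 4,
    "price": [1.0, 1.2, 0.9, 1.1, 3.0, 3.1, 2.9, 3.2, 5.0, 5.2, 4.9, 5.1],
})
fvalue, pvalue = anova_by_category(toy, "price", "colour")
print(f"F = {fvalue:.2f}, p = {pvalue:.6f}")
print("group means differ" if pvalue < 0.05 else "no evidence of a difference")
```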

Factor Analysis with Mixed Data Concurrent Approach with PCAmixdata in R

I am trying to perform factor analysis on mixed data in R with the PCAmixdata package. My dataset is large, with almost 115000 records and almost 40 features, both categorical and continuous. When I try to run PCAmixdata, I get a memory error saying that the total memory allocation has been reached, and I am not able to proceed. I want to know whether it is valid to split the dataset row-wise, for example 30000 records at a time, and combine the …
Category: Data Science

SEM (Structural Equation Modelling) with Exploratory Factor Analysis

Problem statement: I need to do some Structural Equation Modelling at work to find the main factors in a marketing survey dataset. There are no assumed equations to perform SEM on, so what would be the best exploratory way to derive those equations from the data? All the variables are very highly correlated with one another, so how can we deal with that? Please let me know if I can provide any additional information.
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.