Does one-hot encoding affect the chi-square test?

I am doing feature selection for a data science project, and one of the features is a high-cardinality categorical variable (for context, it's nationality). I know the chi-square test can handle a multiclass feature like mine, but I need to one-hot encode it (splitting a multiclass variable into multiple binary variables based on its values) to be able to feed it into my machine learning algorithm (Spark MLlib). My question is: does one-hot encoding affect the result of a …
Category: Data Science
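The encoding question above can be illustrated with a minimal sketch (all counts below are made up; assumes scipy is available). One test on the full multiclass variable is not the same thing as a set of 2x2 tests on each one-hot dummy column:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3-nationality x 2-class contingency table (counts are made up).
table = np.array([
    [30, 10],   # nationality A
    [20, 20],   # nationality B
    [10, 30],   # nationality C
])

# One test on the full multiclass variable: df = (3-1)*(2-1) = 2.
chi2_full, p_full, dof_full, _ = chi2_contingency(table)

# One-hot view: a separate 2x2 test per dummy column ("is A?", "is B?", ...),
# each collapsing the remaining nationalities into one row: df = 1 each.
for i, name in enumerate(["A", "B", "C"]):
    rest = table.sum(axis=0) - table[i]
    chi2_i, p_i, dof_i, _ = chi2_contingency(np.vstack([table[i], rest]))
    print(name, round(chi2_i, 2), round(p_i, 4))

print("full:", round(chi2_full, 2), dof_full)
```

The K dummy tests are correlated with one another, so encoding changes what hypothesis is being tested, not just the input format.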

Filling NaN values

To my knowledge, before filling NaN values we have to check whether the data are missing under MCAR, MAR or MNAR, which depends on how the features are correlated with each other, and then decide which technique to apply. So my question is: is it good practice to check the dependency between features with a chi-square independence test? If not, please suggest what techniques to use to fill NaN values. I will be …
Category: Data Science
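One common way to probe MCAR vs. MAR is exactly a chi-square test between a missingness indicator and another feature. A minimal sketch on synthetic data (all column names and values are invented; assumes pandas and scipy):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: does missingness in "income" depend on "employed"?
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "employed": rng.choice(["yes", "no"], size=500),
    "income": rng.normal(50, 10, size=500),
})
# Make income missing more often for the unemployed (so it is MAR, not MCAR).
mask = (df["employed"] == "no") & (rng.random(500) < 0.4)
df.loc[mask, "income"] = np.nan

# Chi-square test: missingness indicator vs. the other feature.
indicator = df["income"].isna()
table = pd.crosstab(indicator, df["employed"])
chi2, p, dof, _ = chi2_contingency(table)
print(p)  # a small p suggests missingness depends on "employed" (not MCAR)
```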

Linear regression with a fixed intercept where everything is in log space

I have a set of values for a surface (in pixels) that grows exponentially over time. The surface consists of cells that divide over time. After doing some modelling, I came up with the following formula: $$S(t)=S_{initial}2^{t/a_d},$$ where $a_d$ is the age at which a cell divides. $S_{initial}$ is known. I am trying to estimate $a_d$. I simply tried the $\chi^2$ test: # Range of ages of division. a_range = np.linspace(1, 500, 100) # Set up an empty vector …
Category: Data Science
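Since $S_{initial}$ is known, the model is linear in log space with a fixed (zero) intercept, so $a_d$ can be estimated directly instead of scanning a grid of candidate ages. A minimal sketch on synthetic data (the true $a_d$, noise level and time grid are all made up):

```python
import numpy as np

# Hypothetical synthetic data: S(t) = S_init * 2**(t / a_d) with noise.
rng = np.random.default_rng(1)
a_d_true, S_init = 24.0, 100.0
t = np.linspace(0, 120, 60)
S = S_init * 2.0 ** (t / a_d_true) * rng.normal(1.0, 0.02, t.size)

# In log space the model is linear with a fixed (zero) intercept:
#   log2(S / S_init) = t / a_d
# so the least-squares slope through the origin gives 1 / a_d directly.
y = np.log2(S / S_init)
slope = np.sum(t * y) / np.sum(t * t)
a_d_hat = 1.0 / slope
print(a_d_hat)
```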

ValueError for Chi2 Python

I am running feature selection using chi2 code on two datasets: the diabetes dataset and the HR dataset from Kaggle. On the diabetes data all is well, because the values are all numeric and hence are converted to float. But the HR data has string values such as "Job Title", so understandably Python cannot convert them to float. My question is: is there a way I could run such code on non-numeric data to derive …
Category: Data Science
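The usual fix is to encode the string columns into non-negative numeric dummies before passing them to scikit-learn's chi2 scorer. A minimal sketch on invented HR-like data (column names and values are hypothetical; assumes pandas and scikit-learn):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical HR-like data with a string column.
df = pd.DataFrame({
    "job_title": ["Engineer", "Manager", "Engineer", "Analyst", "Manager", "Analyst"],
    "overtime_hours": [10, 2, 8, 3, 1, 4],
    "left_company": [1, 0, 1, 0, 0, 0],
})

# chi2 cannot take strings: one-hot encode them into non-negative dummies first.
X = pd.get_dummies(df[["job_title", "overtime_hours"]], columns=["job_title"])
y = df["left_company"]

selector = SelectKBest(chi2, k=2).fit(X, y)
print(dict(zip(X.columns, selector.scores_.round(3))))
print(list(X.columns[selector.get_support()]))
```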

p-value of chi squared test is exactly 0.0

I need to run a chi-square test on two of my dataset's categorical variables. These two variables have basically the same meaning but come from two different sources, so my idea is to use a chi-square test to see how "similar", or correlated, these two variables really are. To do so, I've written code in Python, but the p-value I get from it is exactly 0, which seems a little strange to me. The code is: from scipy.stats …
Category: Data Science
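A p-value of exactly 0.0 is usually float64 underflow: when two columns are nearly identical and the sample is large, the chi-square statistic is so big that the survival function is smaller than the smallest representable double. A minimal sketch with made-up counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Two hypothetical near-identical categorical columns with many rows:
# the crosstab is almost perfectly diagonal, the statistic is huge,
# and the p-value underflows to exactly 0.0 in float64.
n = 100_000
table = np.array([[n, 10], [10, n]])
chi2, p, dof, _ = chi2_contingency(table)
print(chi2, p)
```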

Why do I get this result with a chi-square test?

I have a question about the chi-squared independence test. I'm working on a dataset and am interested in finding the link between product category and gender, so I plotted my contingency table. I found that the p-value is 1.54*10^-5, implying that my variables are associated. I don't really understand how this is possible, because the proportions of men and women for each category are very similar.
Category: Data Science
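This is the classic large-sample effect: with enough rows, even tiny differences in proportions become statistically significant, so it helps to report an effect size such as Cramér's V alongside the p-value. A minimal sketch with invented counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical table: proportions of men/women per product category are very
# close, but the sample is large, so the p-value is still tiny.
table = np.array([
    [5100, 4900],   # category 1: men, women
    [4900, 5100],   # category 2: men, women
])
chi2, p, dof, _ = chi2_contingency(table)

# Effect size (Cramér's V): near 0 means a negligible association,
# however small the p-value is.
n = table.sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))
print(p, cramers_v)
```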

Chi-Square test for Validating Sampled Data

I have a large dataset (stored in a dataframe) that needs to be sampled, so I performed sampling on it (the sampled data is also stored in a dataframe) and now wish to check whether the sample is representative of the population using the chi-square test (for categorical variables). I could not find a good source for the Python implementation for such a case, so it would be much appreciated if anyone could help me out with how …
Category: Data Science
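One way to frame this is a goodness-of-fit test per categorical column: the observed sample counts against the counts expected from the population proportions, scaled to the sample size. A minimal sketch on synthetic data (categories and sizes are made up; assumes pandas and scipy):

```python
import numpy as np
import pandas as pd
from scipy.stats import chisquare

# Hypothetical population and a random sample drawn from it.
rng = np.random.default_rng(0)
population = pd.Series(rng.choice(["A", "B", "C"], size=100_000, p=[0.5, 0.3, 0.2]))
sample = population.sample(n=2_000, random_state=0)

# Goodness-of-fit: observed sample counts vs. counts expected from the
# population proportions (expected must be scaled to the sample size).
categories = ["A", "B", "C"]
observed = sample.value_counts().reindex(categories).to_numpy()
expected = (population.value_counts(normalize=True)
            .reindex(categories).to_numpy() * len(sample))
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(p)  # a small p would suggest the sample misrepresents the population
```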

Intragroup independence in a two-group analysis

I am working on an experiment in which I want to analyze the impact of a treatment on two different groups of customers. Most of the analysis methods I have checked (for example the t-test) assume both intragroup and cross-group independence. I can assume cross-group independence because the two groups are randomly split, but I have some doubts about the meaning of intragroup independence. We can assume that there is no causal effect of …
Category: Data Science

Low p-value in chi-squared test but low coefficient in logistic regression

I ran a chi-squared test on multiple features and also used these features to build a binary classifier with logistic regression. The feature with the lowest p-value (~0.1) had a low coefficient (=0), whereas the feature with a higher p-value (~0.3) had a high coefficient (~2.9). How do I interpret this? Is it possible for a feature to have a low p-value but a zero coefficient?
Category: Data Science

Multiple hypothesis tests in Python

I want to write a method to test multiple hypotheses for a pair of schools (say TAMU and UT Austin). I want to consider all possible pairs of words (Research, Thesis, Proposal, AI, Analytics) and test the hypothesis that the word counts differ significantly across the two schools, using the specified alpha (0.05) threshold. I only need to conduct tests on words that have non-zero counts for both schools, i.e., every row and column in the contingency table should sum to …
Category: Data Science
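The pairwise setup described above can be sketched as follows (all word counts are invented): one 2x2 chi-square test per pair of words, restricted to words with non-zero counts for both schools, with a Bonferroni correction since many pairs are tested at once:

```python
from itertools import combinations

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical word counts per school (all numbers are made up).
counts = {
    "TAMU":      {"Research": 120, "Thesis": 40, "Proposal": 25, "AI": 60, "Analytics": 30},
    "UT Austin": {"Research": 100, "Thesis": 80, "Proposal": 20, "AI": 90, "Analytics": 35},
}
alpha = 0.05
words = [w for w in counts["TAMU"]
         if counts["TAMU"][w] > 0 and counts["UT Austin"][w] > 0]

pairs = list(combinations(words, 2))
results = {}
for w1, w2 in pairs:
    # 2x2 table: rows = words, columns = schools; every cell is non-zero,
    # so every row and column sums to a positive value.
    table = np.array([[counts["TAMU"][w1], counts["UT Austin"][w1]],
                      [counts["TAMU"][w2], counts["UT Austin"][w2]]])
    chi2, p, dof, _ = chi2_contingency(table)
    # Bonferroni correction for running many tests at once.
    results[(w1, w2)] = p < alpha / len(pairs)

print(results)
```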

Should I remove features such as gender and birth month before drawing the heatmap because they are categorical?

I am working on a dataset that has both categorical and numerical (continuous and discrete) features (26 columns, 30244 rows). The target is categorical (1, 2, 3) and I am performing EDA on this dataset. The categorical features with numerical values (e.g., gender has values 0 and 1) are also included when drawing the heatmap with seaborn. As far as I know, the heatmap is drawn to check the correlation between continuous numerical features, right (correct me if I am wrong)? Should …
Category: Data Science

How do I get the correlation between the categories of two categorical variables?

I have a categorical variable ("Health") with 2 categories ('healthy', 'not_healthy') and another categorical variable ("country") with 5 categories ("english", "eua", "Australia", "spain", "Germany"). I want to check whether there is any relation between health and country. I can perform a chi-squared test and, given a p-value < 0.05, reject the null hypothesis and state that, at a 95% confidence level, country is related to health. However, what I now want to know is which country is most related …
Category: Data Science
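A standard way to see which cells drive a significant chi-square result is to inspect the Pearson (standardized) residuals of the contingency table: cells with |residual| much above 2 are the combinations contributing most to the association. A minimal sketch with made-up counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical health x country contingency table (counts are made up).
countries = ["english", "eua", "Australia", "spain", "Germany"]
table = np.array([
    [90, 60, 55, 40, 70],   # healthy
    [30, 45, 50, 60, 35],   # not_healthy
])
chi2, p, dof, expected = chi2_contingency(table)

# Pearson (standardized) residuals: cells with |residual| >~ 2 are the
# country/health combinations that drive the overall association.
residuals = (table - expected) / np.sqrt(expected)
for j, country in enumerate(countries):
    print(country, residuals[:, j].round(2))
```

For a table larger than 2x2, no Yates correction is applied, so the squared residuals sum back to the chi-square statistic.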

Chi-Squared test: ok for selecting significant features?

I have a question about a contingency table and its results. I was analysing names starting with symbols as a possible feature and got the following values:

Label          0.0    1.0
with_symb     1584    241
without_symb    16     14

The p-value lets me conclude that the variables are associated (since it is less than 0.05). My question is whether this might be a reliable result based on the chi-squared test, i.e., whether I can include the feature in the model. …
Category: Data Science

What is the best alternative to Fisher's Exact test for contingency tables that are NOT 2x2?

I am a newbie to data mining. I am trying to find associations between two categorical variables. Since more than 20% of my expected frequencies are less than 5, I wanted to use Fisher's exact test, but it turns out it is generally used for 2x2 contingency tables, and my variables have more than two values. I would really appreciate recommendations on the best course of action. Here are some options I found after some searching: use the Freeman-Halton …
Category: Data Science
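One practical alternative for RxC tables with small expected counts is a Monte Carlo permutation test: rebuild the raw (row, column) label pairs, shuffle the column labels, and compare each permuted chi-square statistic with the observed one. A minimal sketch on a made-up sparse 3x3 table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical sparse 3x3 table where many expected counts are below 5.
table = np.array([
    [8, 2, 1],
    [3, 6, 2],
    [1, 2, 7],
])

def perm_pvalue(table, n_perm=2000, seed=0):
    """Monte Carlo permutation p-value for independence in an RxC table."""
    rng = np.random.default_rng(seed)
    # Expand the table back into one (row_label, col_label) pair per count.
    r_idx, c_idx = np.indices(table.shape)
    rows = np.repeat(r_idx.ravel(), table.ravel())
    cols = np.repeat(c_idx.ravel(), table.ravel())
    observed = chi2_contingency(table, correction=False)[0]
    exceed = 0
    for _ in range(n_perm):
        # Shuffling the column labels preserves both margins exactly.
        t = np.zeros_like(table)
        np.add.at(t, (rows, rng.permutation(cols)), 1)
        if chi2_contingency(t, correction=False)[0] >= observed:
            exceed += 1
    # Add-one correction so the p-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1)

p = perm_pvalue(table)
print(p)
```

This is essentially what R's chisq.test(simulate.p.value = TRUE) does; the Freeman-Halton extension is the exact (non-simulated) counterpart.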

Chi-Square Goodness-of-Fit Test

I want to use a chi-square test, but I'm unsure if I'm using it right. The KickStarter website shows the frequency of projects in its main categories, and it is updated once a day. I have a dataset of KickStarter projects from 2009-2016. I want to filter the data by year, including only projects launched between January and June, and count the frequency of the categories. I would then perform multiple tests, one for each year, against what KickStarter posted. …
Category: Data Science
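The per-year comparison described above is a chi-square goodness-of-fit test: observed category counts for one year against the proportions the site posts, scaled to the year's total so both sides sum to the same value. A minimal sketch (all counts and proportions are invented):

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical counts: category frequencies of Jan-Jun projects in one year,
# and the category proportions the site posts (both are made up).
observed = np.array([320, 210, 150, 95, 45])
site_proportions = np.array([0.40, 0.25, 0.18, 0.11, 0.06])

# Scale the posted proportions to this year's total so the observed and
# expected counts sum to the same value, as chisquare requires.
expected = site_proportions * observed.sum()
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(stat, p)
```

Since one test is run per year, a multiple-testing correction (e.g. Bonferroni over the number of years) would also be appropriate.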

Interpreting the results based on Granger Causality test

I am trying to use the Granger causality test (https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.grangercausalitytests.html) to assess whether a "positivity score" affects value. Here is the code I am using:

# Applying differencing
condensed_df['value'] = condensed_df['value'] - condensed_df['value'].shift(1)
condensed_df = condensed_df.drop(0)
# Running Granger causality test
dct_pos_granger_causality = grangercausalitytests(
    condensed_df[["value", "daily_avg_positive_score"]], maxlag=4, verbose=False)

I have a total of 1,008 rows in the dataframe. The results are as follows: {1: ({'ssr_ftest': (0.005356633438031601, 0.941670291866298, 1003.0, 1), 'ssr_chi2test': (0.0053726552728412666, 0.9415686658133314, 1), 'lrtest': (0.005372640925997985, 0.9415687436896775, 1), 'params_ftest': (0.0053566334379265765, 0.9416702918669032, 1003.0, …
Category: Data Science

Chi-square test - how can I say if attributes are correlated?

I am experimenting with a course's theoretical contents on this dataset. After data cleaning, I am trying to use the chi-square test. I wrote the following code:

chisq.test(chocolate$CompanyMaker, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$SpecificBeanOriginOrBarName, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$CompanyLocation, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$BeanType, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$BroadBeanOrigin, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$CompanyMaker, chocolate$CocoaPerc, simulate.p.value = TRUE)
chisq.test(chocolate$SpecificBeanOriginOrBarName, chocolate$CocoaPerc, simulate.p.value = TRUE)
chisq.test(chocolate$CompanyLocation, chocolate$CocoaPerc, simulate.p.value = TRUE)
chisq.test(chocolate$BeanType, chocolate$CocoaPerc, simulate.p.value = TRUE)
chisq.test(chocolate$BroadBeanOrigin, chocolate$CocoaPerc, simulate.p.value = TRUE)

And these are my results: …
Category: Data Science

Using a Subset of Categories in a Categorical Column

I have an XGBoost model and I'm going to retrain it with new features added. There is a column in my data about the customers' professions, with 60 categories. I suppose there is no need to convert them to dummy variables, because tree-based models can handle them, but I figured that would require many splits, so I decided to use a subset of categories and group the other categories under …
Category: Data Science
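The grouping step described above can be sketched in a few lines of pandas (the profession values and the 5% frequency threshold below are made up):

```python
import pandas as pd

# Hypothetical profession column with a long tail of rare categories.
s = pd.Series(["engineer"] * 50 + ["teacher"] * 30 + ["nurse"] * 15
              + ["florist"] * 3 + ["falconer"] * 2)

# Keep categories covering at least 5% of rows; bucket the rest as "other".
freq = s.value_counts(normalize=True)
keep = freq[freq >= 0.05].index
grouped = s.where(s.isin(keep), "other")
print(grouped.value_counts())
```

The threshold is a tuning choice; grouping by target statistics or by domain knowledge are common alternatives.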
