I am doing feature selection for a data science project, with one of the features being a high-cardinality categorical variable (for context, it's nationality). I know the chi-square test can handle a multiclass feature like mine, but I need to one-hot encode it (dividing a multiclass variable into multiple binary variables based on its values) to be able to feed it into my machine learning algorithm (Spark MLlib). My question is: does one-hot encoding affect the result of a …
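A minimal sketch of the comparison behind this question, assuming a hypothetical pandas DataFrame df with columns 'nationality' and 'label' (names invented for illustration): run the chi-square test once on the full multiclass variable, and once per one-hot dummy.

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: df has a high-cardinality 'nationality' column and a binary 'label'.
# Chi-square on the full multiclass variable.
full_table = pd.crosstab(df['nationality'], df['label'])
chi2_full, p_full, dof_full, _ = chi2_contingency(full_table)

# Chi-square on each one-hot dummy separately (one binary-vs-label test per category).
dummies = pd.get_dummies(df['nationality'], prefix='nat')
for col in dummies.columns:
    table = pd.crosstab(dummies[col], df['label'])
    chi2, p, dof, _ = chi2_contingency(table)
    print(col, chi2, p)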
According to my knowledge, before filling NaN values we have to check whether data is missing because of MCAR, MAR or MNAR, which depends on how features are correlated with each other, and then decide which imputation to apply. So my question is: is it good practice to check the dependency between features with a chi-square independence test? If not, please suggest what techniques to use to fill NaN values. I will be …
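One common way to probe this is to turn missingness into its own indicator and test it against another categorical feature. A minimal sketch, assuming a hypothetical DataFrame df with columns 'income' (has NaNs) and 'gender' (both names invented for illustration):

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical DataFrame df: test whether missingness in 'income' depends on 'gender'.
missing_flag = df['income'].isna().map({True: 'missing', False: 'observed'})
table = pd.crosstab(missing_flag, df['gender'])
chi2, p, dof, expected = chi2_contingency(table)
print(p)   # a small p-value suggests missingness in 'income' is related to 'gender' (i.e., not MCAR with respect to it)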
I have a set of values for a surface (in pixels) that grows exponentially over time. The surface consists of cells that divide over time. After doing some modelling, I came up with the following formula: $$S(t)=S_{initial}2^{t/a_d},$$ where $a_d$ is the age at which the cell divides. $S_{initial}$ is known. I am trying to estimate $a_d$. I simply tried the $\chi^2$ test:

# Range of ages of division.
a_range = np.linspace(1, 500, 100)
# Set up an empty vector …
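A minimal sketch of how the grid search started above could be completed, assuming hypothetical arrays t_data and S_data holding the measured times and surfaces, and a known S_initial; it minimises a chi-square-like discrepancy between model and data over the candidate division ages:

import numpy as np

# Hypothetical measurements: times t_data, observed surfaces S_data; S_initial is known.
a_range = np.linspace(1, 500, 100)       # candidate division ages
chi2_vals = np.empty_like(a_range)

for i, a_d in enumerate(a_range):
    S_model = S_initial * 2 ** (t_data / a_d)
    # Chi-square-like criterion: squared residuals scaled by the model prediction.
    chi2_vals[i] = np.sum((S_data - S_model) ** 2 / S_model)

a_best = a_range[np.argmin(chi2_vals)]   # value of a_d that minimises the criterion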
I am running feature selection using chi2 code on some data: the diabetes dataset and the HR dataset from Kaggle. Running the code on diabetes is fine because the values are all numeric and hence can be converted to float. But the HR data has string values such as "Job Title", so understandably Python cannot convert it into a float. My question is: is there a way I could run such code on non-numeric data to derive …
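A minimal sketch of one way this is usually handled: encode the string columns first, then pass the result to sklearn's chi2 (which requires non-negative numeric input). The DataFrame df, the 'Department' column, and the 'Attrition' target below are assumptions for illustration; only "Job Title" comes from the question.

import pandas as pd
from sklearn.feature_selection import chi2

# Hypothetical HR-style frame: string columns such as 'Job Title', and a binary target 'Attrition'.
X = pd.get_dummies(df[['Job Title', 'Department']])   # 0/1 dummy columns, non-negative as chi2 requires
scores, p_values = chi2(X, df['Attrition'])
print(dict(zip(X.columns, p_values)))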
I need to do a chi-square test on two of my dataset's categorical variables. These two variables have basically the same meaning but come from two different sources, so my idea is to use a chi-square test to see how "similar", or correlated, these two variables really are. To do so, I've written code in Python, but the p-value I get from it is exactly 0, which seems a little strange to me. The code is: from scipy.stats …
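Since the code is cut off, here is a hedged sketch of how such a test is typically run, assuming a hypothetical DataFrame df with columns 'source_a' and 'source_b' holding the two versions of the variable. Note that with a large sample and two near-identical variables, the true p-value can be so small that it underflows to 0.0 in floating point.

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical columns 'source_a' and 'source_b' holding the two versions of the variable.
table = pd.crosstab(df['source_a'], df['source_b'])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)   # with a huge sample and strongly associated variables, p can print as exactly 0.0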
I have a question about the chi-squared independence test. I'm working on a dataset and I'm interested in finding the link between the product categories and gender, so I plotted my contingency table. I found that the p-value is 1.54*10^-5, implying that my variables are correlated. I don't really understand how this is possible, because the proportions of men and women for each category are very similar.
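A toy illustration of why this happens (the counts below are made up, not from the question): with large counts, even proportions that look almost identical produce a tiny p-value, because the test detects any departure from independence, however small, once the sample is big enough.

import numpy as np
from scipy.stats import chi2_contingency

# Made-up counts: 52% vs 48% across two categories, but 20,000 observations in total.
table = np.array([[5200, 4800],
                  [4800, 5200]])
chi2, p, dof, expected = chi2_contingency(table)
print(p)   # very small despite the similar-looking proportions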
I have a large dataset (stored in a dataframe) that needs to be sampled, so I have performed sampling on it (the sampled data is also stored in a dataframe) and now wish to check whether the sample data is correctly representative of the population data using the chi-square test (for categorical variables). I could not find a good source for the Python implementation of such a case, so it would be much appreciated if anyone could help me out with how …
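A minimal sketch of a goodness-of-fit version of this check, assuming hypothetical frames population_df and sample_df that both have a categorical column 'category' (names invented for illustration): compare the sample's observed category counts against the counts expected from the population proportions.

import pandas as pd
from scipy.stats import chisquare

# Hypothetical frames: population_df and sample_df, both with a categorical column 'category'.
pop_props = population_df['category'].value_counts(normalize=True)
obs = sample_df['category'].value_counts().reindex(pop_props.index, fill_value=0)
exp = pop_props * len(sample_df)          # expected counts if the sample mirrors the population
stat, p = chisquare(f_obs=obs, f_exp=exp)
print(p)   # a large p-value means no evidence that the sample's distribution differs from the population's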
I am working on an experiment in which I want to analyze the impact of a treatment on two different groups of customers. Most of the analysis methods I have checked (for example the t-test) assume both intragroup and cross-group independence. I can assume cross-group independence because the two groups are randomly split, but I have some doubts about the meaning of intragroup independence. We can assume that there is no causal effect of …
I ran a chi-squared test on multiple features and also used these features to build a binary classifier using logistic regression. The feature with the lowest p-value (~0.1) had a low coefficient (=0), whereas the feature with a higher p-value (~0.3) had a high coefficient (~2.9). How do I interpret this? Is it possible for a feature to have a low p-value but a zero coefficient?
I want to write a method to test multiple hypotheses for a pair of schools (say TAMU and UT Austin). I want to consider all possible pairs of words (Research, Thesis, Proposal, AI, Analytics) and test the hypothesis that the word counts differ significantly across the two schools, using the specified alpha (0.05) threshold. I only need to conduct tests on words that have non-zero values for both schools, i.e., every row and column in the contingency table should sum to …
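A minimal sketch of this loop, with made-up word counts (only the school and word names come from the question): for each pair of words, build a 2x2 table of counts (words x schools) and test it, skipping words with a zero count at either school.

from itertools import combinations
from scipy.stats import chi2_contingency

# Hypothetical word counts per school (numbers invented for illustration).
counts = {
    'TAMU':      {'Research': 120, 'Thesis': 45, 'Proposal': 30, 'AI': 60, 'Analytics': 25},
    'UT Austin': {'Research': 150, 'Thesis': 20, 'Proposal': 35, 'AI': 90, 'Analytics': 40},
}
alpha = 0.05
words = [w for w in counts['TAMU'] if counts['TAMU'][w] > 0 and counts['UT Austin'][w] > 0]

for w1, w2 in combinations(words, 2):
    # 2x2 table: rows are the two words, columns are the two schools.
    table = [[counts['TAMU'][w1], counts['UT Austin'][w1]],
             [counts['TAMU'][w2], counts['UT Austin'][w2]]]
    chi2, p, dof, expected = chi2_contingency(table)
    print(w1, w2, 'significant' if p < alpha else 'not significant', round(p, 4))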
I am working on a dataset that has both categorical and numerical (continuous and discrete) features (26 columns, 30244 rows). The target is categorical (1, 2, 3) and I am performing EDA on this dataset. The categorical features with numerical values (e.g., gender has values 0 and 1) are also included when drawing the correlation heatmap with seaborn. As far as I know, the heatmap is drawn to check the correlation between continuous numerical features, right (correct me if I am wrong)? Should …
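One common complement to a Pearson heatmap for the categorical columns is Cramér's V, which is derived from the chi-square statistic. A minimal sketch, assuming a hypothetical DataFrame df with a 'gender' column and a 'target' column (names invented for illustration):

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    # Association strength between two categorical series (0 = none, 1 = perfect).
    table = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt((chi2 / n) / min(r - 1, k - 1))

# Hypothetical usage: association between the binary 'gender' column and the target.
print(cramers_v(df['gender'], df['target']))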
I have a categorical variable ("Health") with 2 categories ('healthy', 'not_healthy') and another categorical variable ("country") with 5 categories ("english", "eua", "Australia", "spain", "Germany"). I want to check whether there is any relation between health and country. I can perform a chi-squared test and, having a p-value < 0.05, reject the null hypothesis and state that, at a 95% confidence level, country is related to health. However, what I now want to know is which country is most related …
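One standard follow-up is to inspect the Pearson (standardized) residuals per cell: the cells furthest from zero are the ones driving the overall association. A minimal sketch, assuming a hypothetical DataFrame df with columns 'country' and 'Health':

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical frame df with columns 'country' and 'Health'.
table = pd.crosstab(df['country'], df['Health'])
chi2, p, dof, expected = chi2_contingency(table)

# Pearson residuals: cells far from 0 (roughly |residual| > 2) contribute most to the association.
residuals = (table - expected) / expected ** 0.5
print(residuals)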
I have a question about the contingency table and its results. I was performing this analysis on names starting with symbols as a possible feature, getting the following values:

Label          0.0    1.0
with_symb     1584    241
without_symb    16     14

I got a p-value which lets me conclude that the variables are associated (since it is less than 0.05). My question is whether this is a good result based on the chi-squared test, i.e., whether I can include it in the model. …
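Since the counts are given, a minimal sketch of the test on exactly this table; printing the expected counts is a quick way to check whether the small without_symb row is a concern for the chi-square approximation:

import numpy as np
from scipy.stats import chi2_contingency

# The contingency table from the question: rows with_symb / without_symb, columns Label 0.0 / 1.0.
table = np.array([[1584, 241],
                  [  16,  14]])
chi2, p, dof, expected = chi2_contingency(table)   # Yates' correction is applied by default for 2x2 tables
print(p)
print(expected)   # worth checking that no expected count is too small for the test to be reliable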
I am a newbie to data mining. I am trying to find associations between two categorical variables. Since more than 20% of my expected frequencies are less than 5, I wanted to use Fisher's exact test, but it turns out it is generally used for 2x2 contingency tables, and my variables have more than two levels. I would really appreciate recommendations on the best course of action. Here are some options I found after some searching: use Freeman-Halton …
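One workaround when expected counts are small and the table is larger than 2x2 is a Monte Carlo permutation test (not Freeman-Halton itself, but the same idea as R's simulate.p.value option): shuffle one variable, recompute the chi-square statistic, and estimate the p-value from the permutation distribution. A sketch, assuming hypothetical categorical arrays x and y:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

# Hypothetical categorical data x and y (each with more than two levels).
x_vals = np.asarray(x)
y_vals = np.asarray(y)
observed = chi2_contingency(pd.crosstab(x_vals, y_vals), correction=False)[0]

n_perm = 2000
perm_stats = np.empty(n_perm)
for i in range(n_perm):
    # Shuffling y breaks any real association while keeping both sets of category counts fixed.
    perm_stats[i] = chi2_contingency(pd.crosstab(x_vals, rng.permutation(y_vals)), correction=False)[0]

# Monte Carlo p-value: how often a random relabelling looks at least as extreme as the data.
p_mc = (np.sum(perm_stats >= observed) + 1) / (n_perm + 1)
print(p_mc)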
I want to use a chi-square test but I'm unsure if I'm using it right. The KickStarter website shows the frequency of projects by main category. It is updated once a day. I got a dataset of KickStarter projects from 2009-2016. I wanted to filter the data by year, including only projects that launched between January and June, and count the frequency of the categories. I would then perform multiple tests, one for each year, against what KickStarter posted. …
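A minimal sketch of that per-year loop as a goodness-of-fit test, assuming a hypothetical DataFrame df with columns 'launched' and 'main_category', and a hypothetical dict posted_props holding the proportions shown on the KickStarter site (assumed to sum to 1 and cover every category in the data):

import pandas as pd
from scipy.stats import chisquare

# Hypothetical inputs: df with 'launched' (date) and 'main_category'; posted_props from the site.
df['launched'] = pd.to_datetime(df['launched'])
jan_jun = df[df['launched'].dt.month <= 6]

for year, grp in jan_jun.groupby(jan_jun['launched'].dt.year):
    obs = grp['main_category'].value_counts().reindex(list(posted_props), fill_value=0)
    exp = pd.Series(posted_props) * len(grp)     # expected counts under the posted distribution
    stat, p = chisquare(f_obs=obs, f_exp=exp)
    print(year, round(p, 4))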
I am trying to use the Granger causality test: https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.grangercausalitytests.html to assess whether "positivity score" affects value. Here is the code I am using:

# Applying differencing
condensed_df['value'] = condensed_df['value'] - condensed_df['value'].shift(1)
condensed_df = condensed_df.drop(0)

# Running Granger causality test
dct_pos_granger_causality = grangercausalitytests(
    condensed_df[["value", "daily_avg_positive_score"]], maxlag=4, verbose=False
)

I have a total of 1,008 rows in the dataframe. The results are as follows:

{1: ({'ssr_ftest': (0.005356633438031601, 0.941670291866298, 1003.0, 1),
     'ssr_chi2test': (0.0053726552728412666, 0.9415686658133314, 1),
     'lrtest': (0.005372640925997985, 0.9415687436896775, 1),
     'params_ftest': (0.0053566334379265765, 0.9416702918669032, 1003.0, …
I am experimenting with a course's theoretical contents on this dataset. After data cleaning, I am trying to use the chi-square test. I wrote the following code:

chisq.test(chocolate$CompanyMaker, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$SpecificBeanOriginOrBarName, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$CompanyLocation, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$BeanType, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$BroadBeanOrigin, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$CompanyMaker, chocolate$CocoaPerc, simulate.p.value = TRUE)
chisq.test(chocolate$SpecificBeanOriginOrBarName, chocolate$CocoaPerc, simulate.p.value = TRUE)
chisq.test(chocolate$CompanyLocation, chocolate$CocoaPerc, simulate.p.value = TRUE)
chisq.test(chocolate$BeanType, chocolate$CocoaPerc, simulate.p.value = TRUE)
chisq.test(chocolate$BroadBeanOrigin, chocolate$CocoaPerc, simulate.p.value = TRUE)

And these are my results: …
I have an XGBoost model and I'm going to retrain it by adding new features. There is a column in my data about the customers' professions, and it has 60 categories. I suppose there is no need to convert them to dummy variables because tree-based models can handle them, but I figured that handling them directly would require many splits, so I decided to use a subset of categories and group the other categories under …
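A minimal sketch of that grouping step in pandas, assuming a hypothetical DataFrame df with a 60-level 'profession' column; the cutoff of 15 kept categories is an arbitrary choice for illustration:

import pandas as pd

# Hypothetical frame df with a 60-level 'profession' column.
counts = df['profession'].value_counts()
top = counts.head(15).index                       # keep the 15 most frequent professions (arbitrary cutoff)
df['profession_grouped'] = df['profession'].where(df['profession'].isin(top), other='Other')
print(df['profession_grouped'].value_counts())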