I am doing feature selection for a data science project, with one of the features being a high-cardinality categorical variable (for context, it's nationality). I know the chi-square test can handle a multiclass feature like mine, but I need to one-hot encode it (dividing a multiclass variable into multiple binary variables based on its values) to be able to feed it into my machine learning algorithm (Spark MLlib). My question is: does one-hot encoding affect the result of a …
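A minimal sketch of the comparison behind this question, assuming a hypothetical pandas DataFrame df with columns 'nationality' and 'label' (names invented for illustration): run the chi-square test once on the full multiclass variable, and once per one-hot dummy.

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: df has a high-cardinality 'nationality' column and a binary 'label'.
# Chi-square on the full multiclass variable.
full_table = pd.crosstab(df['nationality'], df['label'])
chi2_full, p_full, dof_full, _ = chi2_contingency(full_table)

# Chi-square on each one-hot dummy separately (one binary-vs-label test per category).
dummies = pd.get_dummies(df['nationality'], prefix='nat')
for col in dummies.columns:
    table = pd.crosstab(dummies[col], df['label'])
    chi2, p, dof, _ = chi2_contingency(table)
    print(col, chi2, p)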
According to my knowledge, before filling NaN values we have to check whether data is missing because of MCAR, MAR or MNAR, which depends on how features are correlated with each other, and then decide which imputation to apply. So my question is: is it good practice to check the dependency between features with a chi-square independence test? If not, please suggest what techniques to use to fill NaN values. I will be …
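One common way to probe this is to turn missingness into its own indicator and test it against another categorical feature. A minimal sketch, assuming a hypothetical DataFrame df with columns 'income' (has NaNs) and 'gender' (both names invented for illustration):

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical DataFrame df: test whether missingness in 'income' depends on 'gender'.
missing_flag = df['income'].isna().map({True: 'missing', False: 'observed'})
table = pd.crosstab(missing_flag, df['gender'])
chi2, p, dof, expected = chi2_contingency(table)
print(p)   # a small p-value suggests missingness in 'income' is related to 'gender' (i.e., not MCAR with respect to it)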
I have a set of values for a surface (in pixels) that grows exponentially over time. The surface consists of cells that divide over time. After doing some modelling, I came up with the following formula: $$S(t)=S_{initial}2^{t/a_d},$$ where $a_d$ is the age at which the cell divides. $S_{initial}$ is known. I am trying to estimate $a_d$. I simply tried the $\chi^2$ test:

# Range of ages of division.
a_range = np.linspace(1, 500, 100)
# Set up an empty vector …
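A minimal sketch of how the grid search started above could be completed, assuming hypothetical arrays t_data and S_data holding the measured times and surfaces, and a known S_initial; it minimises a chi-square-like discrepancy between model and data over the candidate division ages:

import numpy as np

# Hypothetical measurements: times t_data, observed surfaces S_data; S_initial is known.
a_range = np.linspace(1, 500, 100)       # candidate division ages
chi2_vals = np.empty_like(a_range)

for i, a_d in enumerate(a_range):
    S_model = S_initial * 2 ** (t_data / a_d)
    # Chi-square-like criterion: squared residuals scaled by the model prediction.
    chi2_vals[i] = np.sum((S_data - S_model) ** 2 / S_model)

a_best = a_range[np.argmin(chi2_vals)]   # value of a_d that minimises the criterion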
I am running feature selection using chi2 code on some data: the diabetes dataset and the HR dataset from Kaggle. Running the code on diabetes is fine because the values are all numeric and hence can be converted to float. But the HR data has string values such as "Job Title", so understandably Python cannot convert it into a float. My question is: is there a way I could run such code on non-numeric data to derive …
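A minimal sketch of one way this is usually handled: encode the string columns first, then pass the result to sklearn's chi2 (which requires non-negative numeric input). The DataFrame df, the 'Department' column, and the 'Attrition' target below are assumptions for illustration; only "Job Title" comes from the question.

import pandas as pd
from sklearn.feature_selection import chi2

# Hypothetical HR-style frame: string columns such as 'Job Title', and a binary target 'Attrition'.
X = pd.get_dummies(df[['Job Title', 'Department']])   # 0/1 dummy columns, non-negative as chi2 requires
scores, p_values = chi2(X, df['Attrition'])
print(dict(zip(X.columns, p_values)))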
I need to do a chi-square test on two of my dataset's categorical variables. These two variables have basically the same meaning but come from two different sources, so my idea is to use a chi-square test to see how "similar", or correlated, these two variables really are. To do so, I've written code in Python, but the p-value I get from it is exactly 0, which seems a little strange to me. The code is: from scipy.stats …
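Since the code is cut off, here is a hedged sketch of how such a test is typically run, assuming a hypothetical DataFrame df with columns 'source_a' and 'source_b' holding the two versions of the variable. Note that with a large sample and two near-identical variables, the true p-value can be so small that it underflows to 0.0 in floating point.

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical columns 'source_a' and 'source_b' holding the two versions of the variable.
table = pd.crosstab(df['source_a'], df['source_b'])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)   # with a huge sample and strongly associated variables, p can print as exactly 0.0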
I have a question about the chi-squared independence test. I'm working on a dataset and I'm interested in finding the link between the product categories and gender, so I plotted my contingency table. I found that the p-value is 1.54*10^-5, implying that my variables are correlated. I don't really understand how this is possible, because the proportions of men and women for each category are very similar.
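A toy illustration of why this happens (the counts below are made up, not from the question): with large counts, even proportions that look almost identical produce a tiny p-value, because the test detects any departure from independence, however small, once the sample is big enough.

import numpy as np
from scipy.stats import chi2_contingency

# Made-up counts: 52% vs 48% across two categories, but 20,000 observations in total.
table = np.array([[5200, 4800],
                  [4800, 5200]])
chi2, p, dof, expected = chi2_contingency(table)
print(p)   # very small despite the similar-looking proportions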
I have a large dataset (stored in a dataframe) that needs to be sampled, so I have performed sampling on it (the sampled data is also stored in a dataframe) and now wish to check whether the sample data is correctly representative of the population data using the chi-square test (for categorical variables). I could not find a good source for the Python implementation of such a case, so it would be much appreciated if anyone could help me out with how …
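A minimal sketch of a goodness-of-fit version of this check, assuming hypothetical frames population_df and sample_df that both have a categorical column 'category' (names invented for illustration): compare the sample's observed category counts against the counts expected from the population proportions.

import pandas as pd
from scipy.stats import chisquare

# Hypothetical frames: population_df and sample_df, both with a categorical column 'category'.
pop_props = population_df['category'].value_counts(normalize=True)
obs = sample_df['category'].value_counts().reindex(pop_props.index, fill_value=0)
exp = pop_props * len(sample_df)          # expected counts if the sample mirrors the population
stat, p = chisquare(f_obs=obs, f_exp=exp)
print(p)   # a large p-value means no evidence that the sample's distribution differs from the population's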
I am working on an experiment in which I want to analyze the impact of a treatment on two different groups of customers. Most of the analysis methods I have checked (for example the t-test) assume both intragroup and cross-group independence. I can assume cross-group independence because the two groups are randomly split, but I have some doubts about the meaning of intragroup independence. We can assume that there is no causal effect of …
I ran a chi-squared test on multiple features and also used these features to build a binary classifier using logistic regression. The feature with the lowest p-value (~0.1) had a low coefficient (=0), whereas the feature with a higher p-value (~0.3) had a high coefficient (~2.9). How do I interpret this? Is it possible for a feature to have a low p-value but a zero coefficient?
I want to write a method to test multiple hypotheses for a pair of schools (say TAMU and UT Austin). I want to consider all possible pairs of words (Research, Thesis, Proposal, AI, Analytics) and test the hypothesis that the word counts differ significantly across the two schools, using the specified alpha (0.05) threshold. I only need to conduct tests on words that have non-zero values for both schools, i.e., every row and column in the contingency table should sum to …
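A minimal sketch of this loop, with made-up word counts (only the school and word names come from the question): for each pair of words, build a 2x2 table of counts (words x schools) and test it, skipping words with a zero count at either school.

from itertools import combinations
from scipy.stats import chi2_contingency

# Hypothetical word counts per school (numbers invented for illustration).
counts = {
    'TAMU':      {'Research': 120, 'Thesis': 45, 'Proposal': 30, 'AI': 60, 'Analytics': 25},
    'UT Austin': {'Research': 150, 'Thesis': 20, 'Proposal': 35, 'AI': 90, 'Analytics': 40},
}
alpha = 0.05
words = [w for w in counts['TAMU'] if counts['TAMU'][w] > 0 and counts['UT Austin'][w] > 0]

for w1, w2 in combinations(words, 2):
    # 2x2 table: rows are the two words, columns are the two schools.
    table = [[counts['TAMU'][w1], counts['UT Austin'][w1]],
             [counts['TAMU'][w2], counts['UT Austin'][w2]]]
    chi2, p, dof, expected = chi2_contingency(table)
    print(w1, w2, 'significant' if p < alpha else 'not significant', round(p, 4))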
I am working on a dataset that has both categorical and numerical (continuous and discrete) features (26 columns, 30244 rows). The target is categorical (1, 2, 3) and I am performing EDA on this dataset. The categorical features with numerical values (e.g., gender has values 0 and 1) are also included when drawing the correlation heatmap with seaborn. As far as I know, the heatmap is drawn to check the correlation between continuous numerical features, right (correct me if I am wrong)? Should …
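One common complement to a Pearson heatmap for the categorical columns is Cramér's V, which is derived from the chi-square statistic. A minimal sketch, assuming a hypothetical DataFrame df with a 'gender' column and a 'target' column (names invented for illustration):

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    # Association strength between two categorical series (0 = none, 1 = perfect).
    table = pd.crosstab(x, y)
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt((chi2 / n) / min(r - 1, k - 1))

# Hypothetical usage: association between the binary 'gender' column and the target.
print(cramers_v(df['gender'], df['target']))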
I have a categorical variable ("Health") with 2 categories ('healthy', 'not_healthy') and another categorical variable ("country") with 5 categories ("english", "eua", "Australia", "spain", "Germany"). I want to check whether there is any relation between health and country. I can perform a chi-squared test and, having a p-value < 0.05, reject the null hypothesis and state that, at a 95% confidence level, country is related to health. However, what I now want to know is which country is most related …
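One standard follow-up is to inspect the Pearson (standardized) residuals per cell: the cells furthest from zero are the ones driving the overall association. A minimal sketch, assuming a hypothetical DataFrame df with columns 'country' and 'Health':

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical frame df with columns 'country' and 'Health'.
table = pd.crosstab(df['country'], df['Health'])
chi2, p, dof, expected = chi2_contingency(table)

# Pearson residuals: cells far from 0 (roughly |residual| > 2) contribute most to the association.
residuals = (table - expected) / expected ** 0.5
print(residuals)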
I have a question about the contingency table and its results. I was performing this analysis on names starting with symbols as a possible feature, getting the following values:

Label          0.0    1.0
with_symb     1584    241
without_symb    16     14

I got a p-value which lets me conclude that the variables are associated (since it is less than 0.05). My question is whether this is a good result based on the chi-squared test, i.e., whether I can include it in the model. …
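Since the counts are given, a minimal sketch of the test on exactly this table; printing the expected counts is a quick way to check whether the small without_symb row is a concern for the chi-square approximation:

import numpy as np
from scipy.stats import chi2_contingency

# The contingency table from the question: rows with_symb / without_symb, columns Label 0.0 / 1.0.
table = np.array([[1584, 241],
                  [  16,  14]])
chi2, p, dof, expected = chi2_contingency(table)   # Yates' correction is applied by default for 2x2 tables
print(p)
print(expected)   # worth checking that no expected count is too small for the test to be reliable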
I am a newbie to data mining. I am trying to find associations between two categorical variables. Since more than 20% of my expected frequencies are less than 5, I wanted to use Fisher's exact test, but it turns out it is generally used for 2x2 contingency tables, and my variables have more than two levels. I would really appreciate recommendations on the best course of action. Here are some options I found after some searching: use Freeman-Halton …
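One workaround when expected counts are small and the table is larger than 2x2 is a Monte Carlo permutation test (not Freeman-Halton itself, but the same idea as R's simulate.p.value option): shuffle one variable, recompute the chi-square statistic, and estimate the p-value from the permutation distribution. A sketch, assuming hypothetical categorical arrays x and y:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

# Hypothetical categorical data x and y (each with more than two levels).
x_vals = np.asarray(x)
y_vals = np.asarray(y)
observed = chi2_contingency(pd.crosstab(x_vals, y_vals), correction=False)[0]

n_perm = 2000
perm_stats = np.empty(n_perm)
for i in range(n_perm):
    # Shuffling y breaks any real association while keeping both sets of category counts fixed.
    perm_stats[i] = chi2_contingency(pd.crosstab(x_vals, rng.permutation(y_vals)), correction=False)[0]

# Monte Carlo p-value: how often a random relabelling looks at least as extreme as the data.
p_mc = (np.sum(perm_stats >= observed) + 1) / (n_perm + 1)
print(p_mc)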
I want to use a chi-square test but I'm unsure if I'm using it right. The KickStarter website shows the frequency of projects by main category. It is updated once a day. I got a dataset of KickStarter projects from 2009-2016. I wanted to filter the data by year, including only projects that launched between January and June, and count the frequency of the categories. I would then perform multiple tests, one for each year, against what KickStarter posted. …
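A minimal sketch of that per-year loop as a goodness-of-fit test, assuming a hypothetical DataFrame df with columns 'launched' and 'main_category', and a hypothetical dict posted_props holding the proportions shown on the KickStarter site (assumed to sum to 1 and cover every category in the data):

import pandas as pd
from scipy.stats import chisquare

# Hypothetical inputs: df with 'launched' (date) and 'main_category'; posted_props from the site.
df['launched'] = pd.to_datetime(df['launched'])
jan_jun = df[df['launched'].dt.month <= 6]

for year, grp in jan_jun.groupby(jan_jun['launched'].dt.year):
    obs = grp['main_category'].value_counts().reindex(list(posted_props), fill_value=0)
    exp = pd.Series(posted_props) * len(grp)     # expected counts under the posted distribution
    stat, p = chisquare(f_obs=obs, f_exp=exp)
    print(year, round(p, 4))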
I am trying to use the Granger causality test: https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.grangercausalitytests.html to assess whether "positivity score" affects value. Here is the code I am using:

# Applying differencing
condensed_df['value'] = condensed_df['value'] - condensed_df['value'].shift(1)
condensed_df = condensed_df.drop(0)

# Running Granger causality test
dct_pos_granger_causality = grangercausalitytests(
    condensed_df[["value", "daily_avg_positive_score"]], maxlag=4, verbose=False
)

I have a total of 1,008 rows in the dataframe. The results are as follows:

{1: ({'ssr_ftest': (0.005356633438031601, 0.941670291866298, 1003.0, 1),
     'ssr_chi2test': (0.0053726552728412666, 0.9415686658133314, 1),
     'lrtest': (0.005372640925997985, 0.9415687436896775, 1),
     'params_ftest': (0.0053566334379265765, 0.9416702918669032, 1003.0, …
I am experimenting with a course's theoretical contents on this dataset. After data cleaning, I am trying to use the chi-square test. I wrote the following code:

chisq.test(chocolate$CompanyMaker, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$SpecificBeanOriginOrBarName, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$CompanyLocation, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$BeanType, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$BroadBeanOrigin, chocolate$Rating, simulate.p.value = TRUE)
chisq.test(chocolate$CompanyMaker, chocolate$CocoaPerc, simulate.p.value = TRUE)
chisq.test(chocolate$SpecificBeanOriginOrBarName, chocolate$CocoaPerc, simulate.p.value = TRUE)
chisq.test(chocolate$CompanyLocation, chocolate$CocoaPerc, simulate.p.value = TRUE)
chisq.test(chocolate$BeanType, chocolate$CocoaPerc, simulate.p.value = TRUE)
chisq.test(chocolate$BroadBeanOrigin, chocolate$CocoaPerc, simulate.p.value = TRUE)

And these are my results: …
I have an XGBoost model and I'm going to retrain it by adding new features. There is a column in my data about the customers' professions, and it has 60 categories. I suppose there is no need to convert them to dummy variables because tree-based models can handle them, but I figured that handling them directly would require many splits, so I decided to use a subset of categories and group the other categories under …
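A minimal sketch of that grouping step in pandas, assuming a hypothetical DataFrame df with a 60-level 'profession' column; the cutoff of 15 kept categories is an arbitrary choice for illustration:

import pandas as pd

# Hypothetical frame df with a 60-level 'profession' column.
counts = df['profession'].value_counts()
top = counts.head(15).index                       # keep the 15 most frequent professions (arbitrary cutoff)
df['profession_grouped'] = df['profession'].where(df['profession'].isin(top), other='Other')
print(df['profession_grouped'].value_counts())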