Multiple hypotheses in Python
I want to write a method that tests multiple hypotheses for a pair of schools (say TAMU and UT Austin). I want to consider all possible pairs of words (Research, Thesis, Proposal, AI, Analytics) and, for each pair, test the hypothesis that the word counts differ significantly across the two schools, using the specified alpha (0.05) threshold.
I only need to conduct tests on words that have non-zero values for both schools, i.e., every row and column in the contingency table should sum to more than 0.
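To make a single test concrete, here is a minimal sketch of what I mean for one pair of words, using the Research and Analytics counts for TAMU and UT Austin from the sample data. It assumes a 2x2 contingency table (schools as rows, words as columns) and `scipy.stats.chi2_contingency` with its default settings; the variable names are only for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: TAMU, UT Austin; columns: Research, Analytics (counts from the sample data).
table = np.array([[54, 5],
                  [22, 55]])

# The test is only meaningful if every row and column has a positive sum.
rows_ok = (table.sum(axis=1) > 0).all()
cols_ok = (table.sum(axis=0) > 0).all()

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the table of expected frequencies.
chi2, p, dof, expected = chi2_contingency(table)

alpha = 0.05  # threshold from the question
significant = p < alpha
print(f"chi2={chi2:.2f}, p={p:.3g}, dof={dof}, significant={significant}")
```

The full task would repeat this for every pair of eligible words and count how many tests come out significant.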
Finally, I want to return a tuple with:
- The total number of tests conducted, and
- The number of significant tests.
Sample data frame:
Names | Research | Thesis | Proposal | AI | Analytics Data |
---|---|---|---|---|---|
TAMU | 54 | 0 | 0 | 6 | 5 |
uiuc | 33 | 43 | 5 | 0 | 76 |
USC | 4 | 1 | 0 | 7 | 21 |
UT Austin | 22 | 31 | 0 | 0 | 55 |
UCLA | 55 | 6 | 7 | 9 | 11 |
import pandas as pd
from scipy.stats import chi2_contingency

def school_term_hypotheses(filename, college1, college2, alpha):
    df = pd.read_csv(filename)
    # keep only the rows for the two schools (column name as in the sample table)
    df = df[(df['Names'] == college1) | (df['Names'] == college2)]
    df = df.set_index('Names')
    # keep only the words with non-zero counts for both schools
    df = df.loc[:, df.ne(0).all()]
    # chi, p = chi2_contingency(df)[:2]
    # return p

school_term_hypotheses('test.csv', 'TAMU', 'UT Austin', 0.05)
I am clueless about what to do after getting a df with only non-zero values. I need some help figuring out how to test multiple hypotheses.
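One possible direction, as a rough sketch rather than a definitive answer: loop over every pair of remaining words with `itertools.combinations`, build a 2x2 table for each pair, and count the tests with p below alpha. The function name `count_significant_pairs` is hypothetical, it takes a DataFrame indexed by school name instead of a filename so it can run on its own, and the last column is assumed to be called `Analytics`. Each p-value is compared against alpha directly, with no multiple-comparison correction.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency


def count_significant_pairs(df, college1, college2, alpha=0.05):
    """Return (number of tests conducted, number of tests with p < alpha)."""
    # Rows: the two schools; columns: word counts.
    pair = df.loc[[college1, college2]]
    # Keep only the words with non-zero counts for both schools.
    pair = pair.loc[:, pair.ne(0).all()]

    n_tests = 0
    n_significant = 0
    for word1, word2 in combinations(pair.columns, 2):
        table = pair[[word1, word2]].to_numpy()  # 2x2: schools x words
        _, p, _, _ = chi2_contingency(table)
        n_tests += 1
        if p < alpha:
            n_significant += 1
    return n_tests, n_significant


# Sample data from the question, indexed by school name
# (the column name "Analytics" is an assumption).
sample = pd.DataFrame(
    {
        "Research": [54, 33, 4, 22, 55],
        "Thesis": [0, 43, 1, 31, 6],
        "Proposal": [0, 5, 0, 0, 7],
        "AI": [6, 0, 7, 0, 9],
        "Analytics": [5, 76, 21, 55, 11],
    },
    index=["TAMU", "uiuc", "USC", "UT Austin", "UCLA"],
)

print(count_significant_pairs(sample, "TAMU", "UT Austin", 0.05))
```

For TAMU and UT Austin only Research and Analytics are non-zero for both schools, so a single test is run; a pair of schools with more shared words (for example uiuc and UCLA) produces one test per pair of those words.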
Topic chi-square-test pvalue scipy machine-learning
Category Data Science