Understand how to simulate a statistic

This solution describes how to simulate a statistic to find a confidence interval. A journalist called 1000 people in town to ask which of two candidates, A and B, they will vote for. The observed result was 511 votes for A and 489 votes for B. This makes us think that candidate A will win, but we need to know whether this sample is truly representative of the underlying population. To find out, we simulate the poll 1000 times with the Python function below.

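import numpy as np
import pandas as pd

def sample(A,n=1000):
    # each uniform draw in [0, 1) is below A with probability A, so a voter picks 'A' with probability A
    return pd.DataFrame({'vote': np.where(np.random.rand(n) < A,'A','B')})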

s = sample(0.51,n=1000)

dist = pd.DataFrame([sample(0.51).vote.value_counts(normalize=True) for i in range(1000)])

What I cannot understand is the significance of the parameter A in the function definition.

Is it trying to simulate a sample where A occurs 51% of the time? If I am just trying to draw random samples from a population, why am I relying on 0.51 to do so?



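The simulation aims to quantify the uncertainty of the poll. The parameter A is the assumed true share of voters who support candidate A; the code sets it to 0.51, the share observed in the poll, as a best guess. If that were the true share, how likely is it that a poll of 1000 people shows A winning?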
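To see what the parameter does in practice, here is a minimal sketch. It assumes the comparison inside `sample` is `<` (the operator is missing from the posted code), and the seed and the large `n` are arbitrary choices made only to keep the output stable. The share of 'A' votes in each simulated sample should land close to whatever value of A is passed in.

```python
import numpy as np
import pandas as pd

def sample(A, n=1000):
    # Each uniform draw in [0, 1) falls below A with probability A,
    # so each simulated voter picks 'A' with probability A.
    return pd.DataFrame({'vote': np.where(np.random.rand(n) < A, 'A', 'B')})

np.random.seed(0)  # arbitrary seed, only for repeatability

# Share of 'A' votes in one large simulated poll for several assumed true shares.
shares = {
    A: sample(A, n=100_000).vote.value_counts(normalize=True).get('A', 0.0)
    for A in (0.3, 0.51, 0.8)
}
for A, share in shares.items():
    print(f"assumed true share {A:.2f} -> simulated share {share:.3f}")
```

So 0.51 is not a property of the sampling procedure itself; it is the hypothesis about the population that the simulation is run under.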

The code simulates this poll (1000 times, as you can see in the range call) so that we can estimate the probability of A winning the poll. You can calculate it with, e.g., dist.loc[dist.A>dist.B]['A'].count()/len(dist). You'll find that A wins with a confidence of about 73%, i.e., in about 73% of the simulated polls. With this you can answer questions like "How large should the polled sample ($n$) be to reach a confidence of 95%?"
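The sample-size question can be explored with a rough sketch like the one below. It is not the code from the question: the helper name `win_fraction` is hypothetical, and instead of building a DataFrame per poll it draws the number of A votes directly from a binomial distribution, which should be equivalent under the same assumed true share of 0.51. The values of `n` tried here are arbitrary.

```python
import numpy as np

def win_fraction(p, n, reps=1000, rng=None):
    """Fraction of simulated polls of size n in which A gets more votes than B.

    p is the assumed true share of A voters (hypothetical helper, not from the question).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Number of 'A' votes in each of `reps` simulated polls.
    votes_for_a = rng.binomial(n, p, size=reps)
    # A wins a poll when it has strictly more than half of the n votes.
    return np.mean(votes_for_a > n / 2)

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
for n in (1000, 2000, 5000, 10000):
    print(n, win_fraction(0.51, n, rng=rng))
```

Scanning the printed fractions for the first `n` that exceeds 0.95 gives an approximate answer; a finer grid of `n` values and more repetitions would narrow it down.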
