Given a hand-drawn constellation (a 2D distribution of points) and a map of all stars, how would you find the actual star distribution most similar to the drawn distribution? If it's helpful, suppose we can define some maximum allowable threshold of distortion (e.g. a maximum Kolmogorov-Smirnov distance) and we want to find one or more distributions of stars that match the hand-drawn distribution. I keep getting hung up on the fact that the hand-drawn constellation has no notion of scale …
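One way to sidestep the missing scale, sketched below under strong assumptions (point correspondences are already known, both sets have the same number of points, and all coordinates are made up), is to compare shapes after normalising away translation and scale, e.g. with a Procrustes disparity:

```python
import numpy as np
from scipy.spatial import procrustes

# Hypothetical drawn points and a candidate group of stars that is roughly the
# same shape but shifted, rescaled and noisy.
drawn = np.array([[0.0, 0.0], [1.0, 0.2], [2.1, 1.0], [3.0, 2.9]])
stars = 40.0 * drawn + 7.0 + np.random.default_rng(0).normal(0.0, 2.0, drawn.shape)

# procrustes centres both point sets, scales them to unit norm and finds the best
# rotation, so the returned disparity ignores position, scale and orientation.
_, _, disparity = procrustes(drawn, stars)
print(disparity)  # small disparity -> similar shape regardless of scale
```

A candidate group of stars could then be accepted whenever its disparity falls under the distortion threshold mentioned above; the hard part (which this sketch does not address) is searching the star map for candidate groups in the first place.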
I am analysing a bunch of data files which represent the responsiveness of cells to the addition of a drug. If the drug is not added, the cell responds normally; if it is added, it shows abnormal patterns (example traces omitted here). We decided to analyse this using an amplitude histogram, in order to distinguish between a change in amplitude and a change in the probability of eliciting the binary response. What we get with file 1 is shown in the (omitted) plot. So we fit a pdf on …
np.random.normal(mean, sigma, size) allows creating a Gaussian sample based only on the mean and standard deviation. I want to create a distribution based on function_name(mean, sigma, skew, kurtosis, size). I tried scipy.stats.gengamma but I don't understand how to use it. It takes two shape parameters, a and c, and creates a distribution, but it is difficult to work out for which values of a and c the function gives a particular value of skewness and kurtosis. Can anyone explain how to use gengamma or any other way to create …
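A sketch of one way to see how gengamma's shape parameters relate to skewness and kurtosis: ask SciPy for the theoretical moments of a frozen distribution and scan over a and c (the grid of values below is arbitrary):

```python
from scipy import stats

# Scan a few (a, c) pairs and print the theoretical skewness and excess
# kurtosis of scipy.stats.gengamma for each.
for a in (0.5, 1.0, 2.0, 5.0):
    for c in (0.5, 1.0, 2.0):
        mean, var, skew, kurt = stats.gengamma(a, c).stats(moments="mvsk")
        print(f"a={a:<4} c={c:<4} skew={float(skew):7.3f} excess kurtosis={float(kurt):7.3f}")

# Once an (a, c) pair with the desired shape is found, loc/scale shift and
# stretch the distribution to hit a target mean and sigma, and rvs() samples it.
sample = stats.gengamma(2.0, 1.5, loc=0.0, scale=1.0).rvs(size=10_000, random_state=0)
```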
I have a collection of time series, each with around 2 years of daily data points. I am thinking of a way to increase the number of data points so that the neural network gets a better understanding of the fluctuations in the data. My hypothesis is to cluster time series that follow similar distributions, in order to increase the number of data points fed into the neural network. Is this …
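A minimal sketch of the grouping idea on synthetic series (everything here is made up: 30 series, z-normalisation as the representation, KMeans with 3 clusters); series landing in the same cluster would then be pooled as extra training examples for one model:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_series, n_days = 30, 730                 # roughly 2 years of daily points
series = np.cumsum(rng.normal(size=(n_series, n_days)), axis=1)

# z-normalise each series so the clustering looks at shape/fluctuations
# rather than absolute level.
z = (series - series.mean(axis=1, keepdims=True)) / series.std(axis=1, keepdims=True)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(z)
print(np.bincount(labels))                 # how many series fall in each cluster
```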
I know that probability distributions are used for hypothesis testing, confidence interval construction, etc., and they clearly have many roles in statistical analysis. However, it is not obvious to me how probability distributions come in handy for machine learning problems: ML algorithms are expected to pick up distributions from the dataset automatically. I wonder whether there are any places where probability distributions help in solving ML problems better. Shortly put, how could statistical techniques related to probability distributions benefit …
My goal is to find the industries worst affected by COVID-19 in terms of unemployment. As for the data I will use for this task, I have a monthly county-wise unemployment rate time series and business distribution data. The business distribution data contains the number of establishments in each county by industry (e.g. Manufacturing: 121, Accommodation and Food Services: 564, Construction: 32). The unemployment rate data gives the monthly unemployment rate in each county. From this data, what would your …
I want to compare 2 datasets and check their similarity. I have tried statistical tests like the KS test and the z-test, but they gave a p-value of 0.0 for most columns. I then read that the KS test won't work here because the dataset size is huge and the test will exaggerate even slight differences. Then I tried the Bhattacharyya distance and the Hellinger distance, but the probability values come out around 0.01 (which is correct since it is a continuous variable). I am trying to …
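A sketch of the effect described above on two made-up samples: with very large n the two-sample KS p-value collapses to ~0 even for a negligible shift, while the KS statistic itself, or a histogram-based Hellinger distance, stays small and is more useful as a similarity measure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.00, 1.0, size=1_000_000)   # hypothetical column from dataset 1
y = rng.normal(0.01, 1.0, size=1_000_000)   # same column from dataset 2, tiny shift

# p-value is ~0 at this sample size, but the KS statistic (max CDF gap) is tiny.
ks_stat, p_value = stats.ks_2samp(x, y)
print(ks_stat, p_value)

# Hellinger distance between the two empirical distributions on a shared binning.
bins = np.histogram_bin_edges(np.concatenate([x, y]), bins=100)
p, _ = np.histogram(x, bins=bins)
q, _ = np.histogram(y, bins=bins)
p, q = p / p.sum(), q / q.sum()
hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
print(hellinger)                              # 0 = identical, 1 = no overlap
```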
Transfer learning (TL) is a research problem in machine learning (ML) that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. [1] Distribution shift: the conditions under which the system was developed will differ from those in which we use the system. [2] I consider there to be no difference between distribution shift and dataset shift. But what about transfer learning and distribution shift: what are the differences? Can we say that transfer …
I have sales by customer (B2B) and by date. I want to check whether the distribution of sales across the days of the week remains the same from week to week.

Initial dataset:

Customer  Date        Sales
Alpha     2019-02-23  527
Beta      2019-02-23  642
Alpha     2019-02-24  776
...       ...         ...
Beta      2021-07-28  1236

I transformed it into:

Customer  Week    Monday   Tuesday   Wednesday  Thursday  Friday    Saturday  Sunday
Alpha     201906  0.2202   0.15799   0.178202   0.160449  0.1528    0.130214  0.000067
Beta      201906  0.20573  0.183979  0.182207   0.179824  0.140596  0.107601  0.000061
...       ...     …
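For reference, a minimal sketch of that reshaping step in pandas (the tiny DataFrame below is a stand-in; only the column names Customer, Date, Sales come from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Customer": ["Alpha", "Beta", "Alpha", "Beta"],
    "Date": pd.to_datetime(["2019-02-23", "2019-02-23", "2019-02-24", "2021-07-28"]),
    "Sales": [527, 642, 776, 1236],
})

# ISO year + zero-padded week number, e.g. "201908".
iso = df["Date"].dt.isocalendar()
df["Week"] = iso["year"].astype(str) + iso["week"].astype(str).str.zfill(2)
df["Weekday"] = df["Date"].dt.day_name()

weekly = df.pivot_table(index=["Customer", "Week"], columns="Weekday",
                        values="Sales", aggfunc="sum", fill_value=0)

# Normalise each (Customer, Week) row so the weekday columns sum to 1,
# i.e. the share of that week's sales falling on each weekday.
shares = weekly.div(weekly.sum(axis=1), axis=0)
print(shares)
```

From there, tracking the weekday shares over time, or a per-customer chi-square test across weeks, are possible ways to check stability.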
When performing an A/B test on the number of clicks for users viewing two variants of an ad (each view is an impression), a binomial distribution can be assumed where each variant has a constant click-through rate. Example with two ads:
- Ad one has 1000 impressions and 20 clicks, so its CTR is 2%.
- Ad two has 900 impressions and 30 clicks, so its CTR is 3.3%.
Test whether there is a difference in click-through rate (CTR) between ads one and two. …
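A sketch of one standard way to run that comparison under the binomial assumption, a pooled two-proportion z-test on exactly the numbers above (other choices such as Fisher's exact test or a chi-square test would also fit):

```python
import numpy as np
from scipy import stats

clicks = np.array([20, 30])
impressions = np.array([1000, 900])
ctr = clicks / impressions                      # 0.02 and 0.0333...

# Pooled proportion under H0: both ads share the same click-through rate.
p_pool = clicks.sum() / impressions.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / impressions[0] + 1 / impressions[1]))

z = (ctr[0] - ctr[1]) / se
p_value = 2 * stats.norm.sf(abs(z))             # two-sided p-value
print(z, p_value)
```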
I have a log-normal mean and a standard deviation. After I converted them to the underlying normal distribution's parameters mu and sigma, I sampled from the log-normal distribution. However, when I take the mean and standard deviation of this sampled data, I don't get the values I plugged in at first. This only happens when the log-normal mean is much smaller than the log-normal standard deviation; otherwise it works. How do I prevent this from happening and get the input …
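A sketch of the setup as described, with the usual moment-matching conversion and made-up target values chosen so the mean is much smaller than the standard deviation; with that ratio the distribution is extremely heavy-tailed, which is why the sample moments drift from the targets:

```python
import numpy as np

# Hypothetical targets: log-normal mean m much smaller than the log-normal sd s.
m, s = 0.01, 1.0

# Standard conversion to the underlying normal parameters.
sigma = np.sqrt(np.log(1.0 + (s / m) ** 2))
mu = np.log(m) - 0.5 * sigma ** 2

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=mu, sigma=sigma, size=1_000_000)

# With s >> m, sigma is large (~3 here) and the tail is so heavy that the sample
# mean and especially the sample sd converge very slowly, so they can land far
# from the targets even with a million draws.
print(sample.mean(), sample.std())   # compare against m and s
```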
Let's say each element in my population has several attributes; call them A, B, C, D, E, F. Say, for simplicity, that each attribute has 10 values (but it could be any number between 2 and 30). Now I want to draw a sample such that the distribution of each attribute is the same as in the whole population. So, for example, if about 15% of the population has value 1 for attribute A, my sample should have the same share. What should …
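As a point of reference, a sketch of proportional stratified sampling on a single attribute with pandas (the toy population below is made up); matching the distribution of all attributes at once is harder and typically needs stratification on attribute combinations or a dedicated weighting method such as raking:

```python
import pandas as pd

# Toy population: two attributes, A and B (values are arbitrary).
population = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3] * 100,
    "B": list(range(10)) * 100,
})

# Sampling the same fraction within every value of A keeps the distribution
# of A in the sample close to the population's distribution.
sample = population.groupby("A", group_keys=False).sample(frac=0.1, random_state=0)

print(population["A"].value_counts(normalize=True).sort_index())
print(sample["A"].value_counts(normalize=True).sort_index())
```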
Hi, I'm taking a course about probability distributions in data science, and below is the derivation of the expected value form of the variance. Variance = expected value of the squared difference from the mean, for any value. But intuitively I think of variance as just the difference between a value and its mean. Why are we squaring and adding the expected value symbol? $$\sigma^2 = E((Y - \mu)^2) = E(Y^2) - \mu^2$$ For the first step in the derivation, why do we multiply the summation of $p(x)$ …
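For reference, the identity quoted in the question follows from expanding the square and using linearity of expectation, with $E(Y) = \mu$; in the discrete case each expectation is a sum weighted by the probabilities $p(y)$, which is presumably the $p(x)$ factor the course multiplies by:

$$
\sigma^2 = E\big((Y-\mu)^2\big)
         = E\big(Y^2 - 2\mu Y + \mu^2\big)
         = E(Y^2) - 2\mu\,E(Y) + \mu^2
         = E(Y^2) - \mu^2,
\qquad\text{where } E(Y^2) = \sum_y y^2\, p(y).
$$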
I was asked in an interview why we use the binomial distribution in logistic regression and how it is related to the class that we are predicting. Could anyone explain, without any mathematical equations, why we use the binomial rather than any other distribution?
TensorFlow Data Validation offers the capability of checking data skew and drift, and the documentation also mentions that "setting the correct distance is typically an iterative process requiring domain knowledge and experimentation". How can one specify an initial, reasonable jensen_shannon_divergence threshold (or an infinity_norm one for categorical features)? Is there some Python package/utility/code that one can leverage on a given dataset feature to compute a reasonable threshold? If not, what is recommended to properly conduct the experiment and find the …
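There is nothing TFDV-specific in the sketch below; it is just one way to calibrate a starting threshold empirically: split a baseline dataset into two halves that should not be flagged, compute the Jensen-Shannon divergence between their histograms (SciPy returns the JS distance, so it is squared here), and set the threshold comfortably above that "normal" variation. The feature and its distribution are made up:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
baseline = rng.gamma(shape=2.0, scale=3.0, size=50_000)   # hypothetical numeric feature
half_a, half_b = baseline[:25_000], baseline[25_000:]

# Shared binning so both halves are compared on the same support.
bins = np.histogram_bin_edges(baseline, bins=50)
p, _ = np.histogram(half_a, bins=bins)
q, _ = np.histogram(half_b, bins=bins)
p, q = p / p.sum(), q / q.sum()

# scipy's jensenshannon returns the JS *distance*; squaring gives the divergence.
js_divergence = jensenshannon(p, q, base=2) ** 2
print(js_divergence)   # a drift/skew threshold well above this avoids false alarms
```

TFDV computes its approximate Jensen-Shannon divergence on its own internal binning, so this only gives an order-of-magnitude starting point for the value placed in the feature's drift or skew comparator.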
Can somebody please help me explain this correlation between features? It is not a linear correlation, but there still seems to be some relationship. Here is the screenshot:
I have in my hands 3 different time series which model 3 different scenarios (base, downside, upside). Each of these time series depends on a set of 11 different attributes, which take values for different time intervals. Most of the input features are highly correlated. There is also a (CDF) probability function which defines how probable each scenario is (each quintile), for every point in time. In my case, I want to create more input data based …
Is there any way to measure statistical similarity or discrepancy between a dataset and a distribution? I have done some research, but most methods I found are intended to describe the discrepancy between data and data, or between a distribution and a distribution; that is, they always compare two objects of the same kind. What I am looking for is a method that can measure the discrepancy between a dataset and a distribution. It would be nice if there were a corresponding method that is easy …
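A sketch of two sample-versus-distribution measures that do exist, using SciPy and a made-up sample tested against a standard normal reference: the one-sample Kolmogorov-Smirnov statistic, and the average log-density of the data under the candidate distribution (a likelihood-based discrepancy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=0.1, scale=1.0, size=5_000)        # hypothetical dataset

reference = stats.norm(loc=0.0, scale=1.0)               # candidate distribution

# One-sample KS: maximum gap between the empirical CDF and the reference CDF.
ks_stat, p_value = stats.kstest(data, reference.cdf)
print(ks_stat, p_value)

# Average log-likelihood of the data under the reference distribution; higher
# means the distribution explains the dataset better.
print(reference.logpdf(data).mean())
```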
I have a dataset that follows Zipf's law, such that the majority of the values are concentrated at one end, with the remaining items making up a very small percentage. Training on the dataset as is would introduce a bias, so I was thinking of restructuring the data to fall into buckets. My model would then be a multi-class classification model rather than a regression model (I am training a NN). My question is whether I should draw up the …
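For what it's worth, a sketch of one common way to form such buckets, quantile-based bin edges with pandas on a made-up Zipf-like target; with heavy ties at the low end several quantiles coincide, so duplicates='drop' is needed and fewer than q buckets may come back:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.zipf(a=2.0, size=10_000).astype(float)     # hypothetical skewed target

# Quantile-based edges track the data's own distribution instead of its range;
# ties at the smallest values collapse some edges, hence duplicates="drop".
buckets, edges = pd.qcut(values, q=5, labels=False, retbins=True, duplicates="drop")

print(edges)                                             # resulting bucket boundaries
print(pd.Series(buckets).value_counts().sort_index())    # examples per bucket/class
```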
I am very new to the machine learning area, so my question might be trivial. I have two classes $U, V$ of binary vectors. In the training phase, I use $u_1,\ldots, u_{1000}$ from class $U$ and $v_1, \ldots, v_{1000}$ from class $V$. In the testing phase, I have to determine whether a vector comes from $U$ or $V$. How can we do that with good accuracy? Also, can we find out what internal criteria the ML model uses to make the classification?
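A minimal sketch of the standard supervised setup with synthetic binary vectors standing in for $U$ and $V$ (dimension, class sizes and generating probabilities below are all invented); the fitted coefficients also give one crude view of which vector positions drive the decision:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, dim = 1000, 50
U = (rng.random((n, dim)) < 0.4).astype(int)   # hypothetical class U vectors
V = (rng.random((n, dim)) < 0.6).astype(int)   # hypothetical class V vectors

X = np.vstack([U, V])
y = np.array([0] * n + [1] * n)                # 0 = U, 1 = V
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))               # test accuracy

# Largest-magnitude coefficients: the positions the model leans on most.
print(np.argsort(np.abs(clf.coef_[0]))[::-1][:5])
```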