Search one 2D distribution for point cluster most similar to another 2D distribution

Given a hand-drawn constellation (a 2D distribution of points) and a map of all stars, how would you find the actual star distribution most similar to the drawn one? If it's helpful, suppose we can define some maximum allowable threshold of distortion (e.g. a maximum Kolmogorov-Smirnov distance) and we want to find one or more distributions of stars that match the hand-drawn distribution. I keep getting hung up on the fact that the hand-drawn constellation has no notion of scale …
Category: Data Science

Co-joining multi-peak histograms

I am analysing a set of data files that represent the responsiveness of cells to the addition of a drug. If the drug is not added, the cell responds normally; if it is added, it shows abnormal patterns (figures omitted). We decided to analyse this using an amplitude histogram, in order to distinguish between a change in amplitude and a change in the probability of eliciting the binary response. What we get with file 1 is (figure omitted). So we fit a pdf on …
Category: Data Science

How to make a Gaussian distribution in Python considering mean, variance, skewness and kurtosis?

np.random.normal(mean, sigma, size) creates a Gaussian distribution based only on mean and standard deviation. I want to create a distribution based on function_name(mean, sigma, skew, kurtosis, size). I tried scipy.stats.gengamma but I don't understand how to use it. It takes two parameters, a and c, and creates a distribution. But it is difficult to tell which values of a and c give a particular skewness and kurtosis. Can anyone explain how to use gengamma or any other way to create …
Category: Data Science
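
One possible approach to the question above, sketched under assumptions: `scipy.stats.pearson3` is parameterized so that `loc` is the mean, `scale` is the standard deviation and the shape parameter is the skewness. It does not take a kurtosis argument; for this family the excess kurtosis is tied to the skewness (1.5 times the skewness squared), so this only covers three of the four requested moments. The helper name `sample_with_skew` and the numbers are illustrative, not from the question.

```python
import numpy as np
from scipy import stats


def sample_with_skew(mean, sigma, skew, size, seed=0):
    """Draw samples with a chosen mean, standard deviation and skewness.

    Hypothetical helper: kurtosis is NOT a free parameter here, it is
    determined by the skewness of the Pearson type III family.
    """
    rng = np.random.default_rng(seed)
    return stats.pearson3.rvs(skew, loc=mean, scale=sigma, size=size, random_state=rng)


samples = sample_with_skew(mean=10.0, sigma=2.0, skew=0.8, size=200_000)

# Compare the empirical moments with the requested ones.
print("mean    :", samples.mean())
print("std     :", samples.std())
print("skewness:", stats.skew(samples))
print("excess kurtosis:", stats.kurtosis(samples))
```

If kurtosis must be controlled independently, a four-parameter family would be needed instead; that is outside what this sketch shows.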

How to combine data having similar distribution?

I have a collection of time series data with data points of around 2 years of daily data. I am thinking of a way to increase the number of data points in it so that the neural network gets a better understanding of the fluctuations in the data. I am suggesting a hypothesis where I try to cluster similar time-series data following similar distribution, in order to increase the number of data points fed into the neural network. Is this …
Category: Data Science

Analysis of probability distribution of each feature and Machine Learning

I know that probability distributions are used for hypothesis testing, constructing confidence levels, etc., and that they have many roles in statistical analysis. However, it is not obvious to me how probability distributions come in handy for machine learning problems. ML algorithms are expected to pick up distributions from the dataset automatically. I wonder whether probability distributions have a place in solving ML problems better. In short, how can statistical techniques related to probability distributions benefit …
Category: Data Science

Finding the worst affected industry due to COVID in terms of unemployment

My goal is to find the industries worst affected by COVID-19 in terms of unemployment. For this task I have monthly county-wise unemployment rate time series and business distribution data. The business distribution data contains the number of establishments in each county by industry (Manufacturing - 121, Accommodation and Food Services - 564, Construction - 32, etc.). The unemployment rate data gives the monthly unemployment rate in each county. From this data, what would your …
Category: Data Science

How to compare distribution of 2 continuous variable datasets

I want to compare 2 datasets and check for their similarity. I have tried statistical tests like the KS test and z test, but they gave a p-value of 0.0 for most columns. I then read that the KS test won't work because the dataset size is huge and it will exaggerate even slight differences. Then I tried the Bhattacharyya distance and Hellinger distance, but the probability values are coming out as 0.01 (which is correct since it is a continuous variable). I am trying to …
Topic: distribution
Category: Data Science
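
A minimal sketch related to the question above, assuming the goal is a similarity measure that does not collapse to p = 0.0 on large samples: instead of the KS p-value, look at the KS statistic itself (the largest gap between the two empirical CDFs) and judge it as an effect size. The function name `ks_statistic` and the synthetic data are illustrative assumptions, and what counts as "small enough" is a domain decision.

```python
import bisect
import random


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of a and b.

    0.0 means the empirical CDFs coincide, 1.0 means the samples do not overlap.
    """
    a, b = sorted(a), sorted(b)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is reached at one of the observed values.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


rng = random.Random(0)
col_1 = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
col_2 = [rng.gauss(0.02, 1.0) for _ in range(20_000)]  # almost the same distribution

print("KS statistic:", ks_statistic(col_1, col_2))
```

With samples this large a significance test can flag even such a tiny shift, while the statistic itself stays close to zero, which is arguably closer to what "similar" means here.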

Distribution Shift vs Transfer Learning

Transfer learning (TL) is a research problem in machine learning (ML) that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem [1]. Distribution shift: the conditions under which the system was developed will differ from those in which we use the system [2]. I consider there to be no difference between distribution shift and dataset shift. But what about transfer learning and distribution shift? What are the differences? Can we say that transfer …
Category: Data Science

Check if distribution per week is the same

I have sales by customer (b2b) and by date. I want to check if the distribution per day inside weeks remains the same from week to week.

Initial dataset:

Customer  Date        Sales
Alpha     2019-02-23  527
Beta      2019-02-23  642
Alpha     2019-02-24  776
...       ...         ...
Beta      2021-07-28  1236

I transformed it into:

Customer  Week    Monday   Tuesday   Wednesday  Thursday  Friday    Saturday  Sunday
Alpha     201906  0.2202   0.15799   0.178202   0.160449  0.1528    0.130214  0.000067
Beta      201906  0.20573  0.183979  0.182207   0.179824  0.140596  0.107601  0.000061
...       ...     …
Category: Data Science

A/B Testing (Binomial Distribution vs Random Distribution)

When performing an A/B test for the number of clicks for users viewing (each view is an impression) two variants of an ad, a binomial distribution can be assumed where each variant has a constant click-through rate. Example: Two Ads, -> Ad one has 1000 impressions and 20 clicks, CTR is 2%; -> Ad two has 900 impressions and 30 clicks, CTR is 3.3%. Test whether there is a difference between Click Through Rate (CTR) between Ads one and two. …
Category: Data Science
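
The example in the question above can be worked through with a pooled two-proportion z-test. This is a sketch of one common choice under the binomial assumption the question states, not necessarily the test the asker had in mind; the function name is illustrative.

```python
import math


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Pooled two-sided z-test for a difference between two click-through rates."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Under the null hypothesis both ads share one CTR, estimated by pooling.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Numbers from the question: 20 clicks / 1000 impressions vs 30 clicks / 900 impressions.
z, p_value = two_proportion_z_test(20, 1000, 30, 900)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.3f}")
```

For these counts the z statistic comes out at about 1.8, so at the conventional 5% level the difference in CTR would not be declared significant by this particular test.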

The mean and standard deviation after sampling aren't the same as those of the input data I provided

I have a log-normal mean and standard deviation. After converting them to the parameters mu and sigma of the underlying normal distribution, I sampled from the log-normal distribution. However, when I take the mean and standard deviation of the sampled data, I don't get the values I plugged in at first. This only happens when the log-normal mean is much smaller than the log-normal standard deviation; otherwise it works. How do I prevent this from happening and get the input …
Category: Data Science
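
A sketch of the conversion described in the question above, using the standard moment relations for the log-normal distribution. The helper name and the example values (mean 1, standard deviation 5) are assumptions chosen to mimic the "mean much smaller than standard deviation" case. One plausible explanation for the mismatch is that in this regime the distribution is very heavy-tailed, so sample moments converge slowly even when the conversion is correct.

```python
import math
import random


def lognormal_params(mean, std):
    """Convert a log-normal mean/std into mu and sigma of the underlying normal."""
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)


mu, sigma = lognormal_params(mean=1.0, std=5.0)

# Analytic check: the parameters reproduce the requested moments exactly.
implied_mean = math.exp(mu + sigma**2 / 2.0)
implied_std = implied_mean * math.sqrt(math.exp(sigma**2) - 1.0)
print("implied mean/std:", implied_mean, implied_std)

# Empirical check: sample moments can still be far off for a finite sample.
rng = random.Random(0)
draws = [rng.lognormvariate(mu, sigma) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
sample_std = math.sqrt(sum((x - sample_mean) ** 2 for x in draws) / (len(draws) - 1))
print("sample mean/std :", sample_mean, sample_std)
```

Comparing the analytic check with the empirical one separates a conversion bug from plain sampling noise.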

Getting a balanced sample across many variables

Let’s say each element in my population has several attributes. Let’s call them A, B, C, D, E, F. Let’s say, for simplicity, each attribute has 10 values (but it could be any number between 2 and 30). Now I want to get a sample such that the distribution is the same across all features. So for example, if the whole population has about 15% of people in feature A with value 1, my sample should have the same. What should …
Category: Data Science

derivation for expected value for variance

Hi, I'm taking a course about probability distributions in data science, and below is the derivation of the expected value for the variance. Variance = expected value of the squared difference from the mean for any value. But generally, variance is just the difference between the value and its mean. Why are we squaring and adding the expected value symbol? $$\sigma^2 = E((Y - \mu)^2) = E(Y^2) - \mu^2$$ For the first step in derivation, why do we multiply the summation of $p(x)$ …
Category: Data Science
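
The identity quoted in the question above follows from expanding the square and using linearity of expectation, with $\mu = E(Y)$:

$$\begin{aligned} \sigma^2 = E((Y - \mu)^2) &= E(Y^2 - 2\mu Y + \mu^2) \\ &= E(Y^2) - 2\mu E(Y) + \mu^2 \\ &= E(Y^2) - 2\mu^2 + \mu^2 \\ &= E(Y^2) - \mu^2 \end{aligned}$$

The square is needed because the plain expected deviation is always zero, $E(Y - \mu) = E(Y) - \mu = 0$, so it carries no information about spread.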

How to properly setup jensen_shannon_divergence and infinity norm in tensorflow data validation for skew and drift checks

TensorFlow Data Validation offers the capability of checking data skew and drift, and the documentation also mentions that "Setting the correct distance is typically an iterative process requiring domain knowledge and experimentation". How can one specify a reasonable initial jensen_shannon_divergence threshold (or an infinity_norm one for categorical features)? Is there some Python package/utility/code that one can use on a given dataset feature to compute a reasonable threshold? If not, what is recommended to properly conduct the experiment and find the …
Category: Data Science

Generating a set of different scenarios based on some initial observations

I have in my hands 3 different time series which model 3 different scenarios (base, downside, upside). Each of these time series depends on a set of 11 different attributes, which take values for different time intervals. Most of the input features are highly correlated. There is also a (cdf) probability function which defines how probable each scenario is (every quintile), for every point in time. In my case, I want to create more input data based …
Category: Data Science

How to measure statistical similarity or discrepancy between a dataset and a distribution?

Is there any way to measure statistical similarity or discrepancy between a dataset and a distribution? I have done some research, but most of the methods I found are intended to describe the discrepancy between data and data, or between distribution and distribution. That is to say, they always measure the same kind of thing. What I am looking for is a method that can measure the discrepancy between a dataset and a distribution. It would be nice if there were a corresponding method that easy …
Category: Data Science
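
One standard way to compare a dataset against a fully specified distribution, offered as a sketch for the question above, is the one-sample Kolmogorov-Smirnov distance: the largest gap between the empirical CDF of the data and the theoretical CDF. The function name `ks_distance_to_cdf` and the normal reference distributions are illustrative assumptions; `scipy.stats.kstest` provides a ready-made test of the same kind.

```python
import random
from statistics import NormalDist


def ks_distance_to_cdf(data, cdf):
    """One-sample KS distance between a dataset and a theoretical CDF."""
    xs = sorted(data)
    n = len(xs)
    distance = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        distance = max(distance, (i + 1) / n - f, f - i / n)
    return distance


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(5_000)]

good_fit = ks_distance_to_cdf(data, NormalDist(0.0, 1.0).cdf)
poor_fit = ks_distance_to_cdf(data, NormalDist(1.0, 1.0).cdf)
print("distance to N(0, 1):", good_fit)
print("distance to N(1, 1):", poor_fit)
```

Any callable CDF can be passed in, so the same function works for other candidate distributions as long as their parameters are fixed in advance.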

Normal vs Uniform Distribution for machine learning

I have a dataset that follows Zipf's law such that the majority of the values are concentrated at one end, with the remaining items containing a very small percentage. Training on the dataset as is would introduce a bias, and thus I was thinking of restructuring the data to fall into buckets. Thus my model would be a multi-class classification model, rather than a regression model (I am training a NN). My question is whether I should draw up the …
Category: Data Science

Is it possible to find an internal event in a Classification between two classes?

I am very new to the machine learning area, so my question might be trivial. I have two classes $U, V$ of binary vectors. In the training phase, I use $u_1,\ldots, u_{1000}$ from class $U$ and $v_1, \ldots, v_{1000}$ from class $V$. In the testing phase, I have to determine whether a vector comes from $U$ or $V$. How can we do that with good accuracy? Also, can we find the internal event by which ML makes the classification?
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.