Given a hand-drawn constellation (a 2D distribution of points) and a map of all stars, how would you find the actual star distribution most similar to the drawn distribution? If it's helpful, suppose we can define some maximum allowable threshold of distortion (e.g. a maximum Kolmogorov-Smirnov distance) and we want to find one or more distributions of stars that match the hand-drawn distribution. I keep getting hung up on the fact that the hand-drawn constellation has no notion of scale …
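One way to sidestep the missing scale, sketched below under strong assumptions (point correspondences are already known, both sets have the same number of points, and all coordinates are made up), is to compare shapes after normalising away translation and scale, e.g. with a Procrustes disparity:

```python
import numpy as np
from scipy.spatial import procrustes

# Hypothetical drawn points and a candidate group of stars that is roughly the
# same shape but shifted, rescaled and noisy.
drawn = np.array([[0.0, 0.0], [1.0, 0.2], [2.1, 1.0], [3.0, 2.9]])
stars = 40.0 * drawn + 7.0 + np.random.default_rng(0).normal(0.0, 2.0, drawn.shape)

# procrustes centres both point sets, scales them to unit norm and finds the best
# rotation, so the returned disparity ignores position, scale and orientation.
_, _, disparity = procrustes(drawn, stars)
print(disparity)  # small disparity -> similar shape regardless of scale
```

A candidate group of stars could then be accepted whenever its disparity falls under the distortion threshold mentioned above; the hard part (which this sketch does not address) is searching the star map for candidate groups in the first place.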
I am analysing a bunch of data files which represent the responsiveness of cells to the addition of a drug. If the drug is not added, the cell responds normally; if it is added, it shows abnormal patterns (example traces omitted here). We decided to analyse this using an amplitude histogram, in order to distinguish between a change in amplitude and a change in the probability of eliciting the binary response. What we get with file 1 is shown in the (omitted) plot. So we fit a pdf on …
np.random.normal(mean, sigma, size) allows creating a Gaussian sample based only on the mean and standard deviation. I want to create a distribution based on function_name(mean, sigma, skew, kurtosis, size). I tried scipy.stats.gengamma but I don't understand how to use it. It takes two shape parameters, a and c, and creates a distribution, but it is difficult to work out for which values of a and c the function gives a particular value of skewness and kurtosis. Can anyone explain how to use gengamma or any other way to create …
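A sketch of one way to see how gengamma's shape parameters relate to skewness and kurtosis: ask SciPy for the theoretical moments of a frozen distribution and scan over a and c (the grid of values below is arbitrary):

```python
from scipy import stats

# Scan a few (a, c) pairs and print the theoretical skewness and excess
# kurtosis of scipy.stats.gengamma for each.
for a in (0.5, 1.0, 2.0, 5.0):
    for c in (0.5, 1.0, 2.0):
        mean, var, skew, kurt = stats.gengamma(a, c).stats(moments="mvsk")
        print(f"a={a:<4} c={c:<4} skew={float(skew):7.3f} excess kurtosis={float(kurt):7.3f}")

# Once an (a, c) pair with the desired shape is found, loc/scale shift and
# stretch the distribution to hit a target mean and sigma, and rvs() samples it.
sample = stats.gengamma(2.0, 1.5, loc=0.0, scale=1.0).rvs(size=10_000, random_state=0)
```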
I have a collection of time series, each with around 2 years of daily data points. I am thinking of a way to increase the number of data points so that the neural network gets a better understanding of the fluctuations in the data. My hypothesis is to cluster time series that follow similar distributions, in order to increase the number of data points fed into the neural network. Is this …
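A minimal sketch of the grouping idea on synthetic series (everything here is made up: 30 series, z-normalisation as the representation, KMeans with 3 clusters); series landing in the same cluster would then be pooled as extra training examples for one model:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_series, n_days = 30, 730                 # roughly 2 years of daily points
series = np.cumsum(rng.normal(size=(n_series, n_days)), axis=1)

# z-normalise each series so the clustering looks at shape/fluctuations
# rather than absolute level.
z = (series - series.mean(axis=1, keepdims=True)) / series.std(axis=1, keepdims=True)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(z)
print(np.bincount(labels))                 # how many series fall in each cluster
```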
I know that probability distributions are used for hypothesis testing, confidence interval construction, etc., and they clearly have many roles in statistical analysis. However, it is not obvious to me how probability distributions come in handy for machine learning problems: ML algorithms are expected to pick up distributions from the dataset automatically. I wonder whether there are any places where probability distributions help in solving ML problems better. Shortly put, how could statistical techniques related to probability distributions benefit …
My goal is to find the industries worst affected by COVID-19 in terms of unemployment. As for the data I will use for this task, I have a monthly county-wise unemployment rate time series and business distribution data. The business distribution data contains the number of establishments in each county by industry (e.g. Manufacturing: 121, Accommodation and Food Services: 564, Construction: 32). The unemployment rate data gives the monthly unemployment rate in each county. From this data, what would your …
I want to compare 2 datasets and check their similarity. I have tried statistical tests like the KS test and the z-test, but they gave a p-value of 0.0 for most columns. I then read that the KS test won't work here because the dataset size is huge and the test will exaggerate even slight differences. Then I tried the Bhattacharyya distance and the Hellinger distance, but the probability values come out around 0.01 (which is correct since it is a continuous variable). I am trying to …
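A sketch of the effect described above on two made-up samples: with very large n the two-sample KS p-value collapses to ~0 even for a negligible shift, while the KS statistic itself, or a histogram-based Hellinger distance, stays small and is more useful as a similarity measure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.00, 1.0, size=1_000_000)   # hypothetical column from dataset 1
y = rng.normal(0.01, 1.0, size=1_000_000)   # same column from dataset 2, tiny shift

# p-value is ~0 at this sample size, but the KS statistic (max CDF gap) is tiny.
ks_stat, p_value = stats.ks_2samp(x, y)
print(ks_stat, p_value)

# Hellinger distance between the two empirical distributions on a shared binning.
bins = np.histogram_bin_edges(np.concatenate([x, y]), bins=100)
p, _ = np.histogram(x, bins=bins)
q, _ = np.histogram(y, bins=bins)
p, q = p / p.sum(), q / q.sum()
hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
print(hellinger)                              # 0 = identical, 1 = no overlap
```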
Transfer learning (TL) is a research problem in machine learning (ML) that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. [1] Distribution shift: the conditions under which the system was developed will differ from those in which we use the system. [2] I consider there to be no difference between distribution shift and dataset shift. But what about transfer learning and distribution shift: what are the differences? Can we say that transfer …
I have sales by customer (B2B) and by date. I want to check whether the distribution of sales across the days of the week remains the same from week to week.

Initial dataset:

Customer  Date        Sales
Alpha     2019-02-23  527
Beta      2019-02-23  642
Alpha     2019-02-24  776
...       ...         ...
Beta      2021-07-28  1236

I transformed it into:

Customer  Week    Monday   Tuesday   Wednesday  Thursday  Friday    Saturday  Sunday
Alpha     201906  0.2202   0.15799   0.178202   0.160449  0.1528    0.130214  0.000067
Beta      201906  0.20573  0.183979  0.182207   0.179824  0.140596  0.107601  0.000061
...       ...     …
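For reference, a minimal sketch of that reshaping step in pandas (the tiny DataFrame below is a stand-in; only the column names Customer, Date, Sales come from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Customer": ["Alpha", "Beta", "Alpha", "Beta"],
    "Date": pd.to_datetime(["2019-02-23", "2019-02-23", "2019-02-24", "2021-07-28"]),
    "Sales": [527, 642, 776, 1236],
})

# ISO year + zero-padded week number, e.g. "201908".
iso = df["Date"].dt.isocalendar()
df["Week"] = iso["year"].astype(str) + iso["week"].astype(str).str.zfill(2)
df["Weekday"] = df["Date"].dt.day_name()

weekly = df.pivot_table(index=["Customer", "Week"], columns="Weekday",
                        values="Sales", aggfunc="sum", fill_value=0)

# Normalise each (Customer, Week) row so the weekday columns sum to 1,
# i.e. the share of that week's sales falling on each weekday.
shares = weekly.div(weekly.sum(axis=1), axis=0)
print(shares)
```

From there, tracking the weekday shares over time, or a per-customer chi-square test across weeks, are possible ways to check stability.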
When performing an A/B test on the number of clicks for users viewing two variants of an ad (each view is an impression), a binomial distribution can be assumed where each variant has a constant click-through rate. Example with two ads:
- Ad one has 1000 impressions and 20 clicks, so its CTR is 2%.
- Ad two has 900 impressions and 30 clicks, so its CTR is 3.3%.
Test whether there is a difference in click-through rate (CTR) between ads one and two. …
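A sketch of one standard way to run that comparison under the binomial assumption, a pooled two-proportion z-test on exactly the numbers above (other choices such as Fisher's exact test or a chi-square test would also fit):

```python
import numpy as np
from scipy import stats

clicks = np.array([20, 30])
impressions = np.array([1000, 900])
ctr = clicks / impressions                      # 0.02 and 0.0333...

# Pooled proportion under H0: both ads share the same click-through rate.
p_pool = clicks.sum() / impressions.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / impressions[0] + 1 / impressions[1]))

z = (ctr[0] - ctr[1]) / se
p_value = 2 * stats.norm.sf(abs(z))             # two-sided p-value
print(z, p_value)
```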
I have a log-normal mean and a standard deviation. After I converted them to the underlying normal distribution's parameters mu and sigma, I sampled from the log-normal distribution. However, when I take the mean and standard deviation of this sampled data, I don't get the values I plugged in at first. This only happens when the log-normal mean is much smaller than the log-normal standard deviation; otherwise it works. How do I prevent this from happening and get the input …
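A sketch of the setup as described, with the usual moment-matching conversion and made-up target values chosen so the mean is much smaller than the standard deviation; with that ratio the distribution is extremely heavy-tailed, which is why the sample moments drift from the targets:

```python
import numpy as np

# Hypothetical targets: log-normal mean m much smaller than the log-normal sd s.
m, s = 0.01, 1.0

# Standard conversion to the underlying normal parameters.
sigma = np.sqrt(np.log(1.0 + (s / m) ** 2))
mu = np.log(m) - 0.5 * sigma ** 2

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=mu, sigma=sigma, size=1_000_000)

# With s >> m, sigma is large (~3 here) and the tail is so heavy that the sample
# mean and especially the sample sd converge very slowly, so they can land far
# from the targets even with a million draws.
print(sample.mean(), sample.std())   # compare against m and s
```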
Let's say each element in my population has several attributes; call them A, B, C, D, E, F. Say, for simplicity, that each attribute has 10 values (but it could be any number between 2 and 30). Now I want to draw a sample such that the distribution of each attribute is the same as in the whole population. So, for example, if about 15% of the population has value 1 for attribute A, my sample should have the same share. What should …
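As a point of reference, a sketch of proportional stratified sampling on a single attribute with pandas (the toy population below is made up); matching the distribution of all attributes at once is harder and typically needs stratification on attribute combinations or a dedicated weighting method such as raking:

```python
import pandas as pd

# Toy population: two attributes, A and B (values are arbitrary).
population = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 3, 3, 3, 3, 3] * 100,
    "B": list(range(10)) * 100,
})

# Sampling the same fraction within every value of A keeps the distribution
# of A in the sample close to the population's distribution.
sample = population.groupby("A", group_keys=False).sample(frac=0.1, random_state=0)

print(population["A"].value_counts(normalize=True).sort_index())
print(sample["A"].value_counts(normalize=True).sort_index())
```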
Hi, I'm taking a course about probability distributions in data science, and below is the derivation of the expected value form of the variance. Variance = expected value of the squared difference from the mean, for any value. But intuitively I think of variance as just the difference between a value and its mean. Why are we squaring and adding the expected value symbol? $$\sigma^2 = E((Y - \mu)^2) = E(Y^2) - \mu^2$$ For the first step in the derivation, why do we multiply the summation of $p(x)$ …
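For reference, the identity quoted in the question follows from expanding the square and using linearity of expectation, with $E(Y) = \mu$; in the discrete case each expectation is a sum weighted by the probabilities $p(y)$, which is presumably the $p(x)$ factor the course multiplies by:

$$
\sigma^2 = E\big((Y-\mu)^2\big)
         = E\big(Y^2 - 2\mu Y + \mu^2\big)
         = E(Y^2) - 2\mu\,E(Y) + \mu^2
         = E(Y^2) - \mu^2,
\qquad\text{where } E(Y^2) = \sum_y y^2\, p(y).
$$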
I was asked in an interview why we use the binomial distribution in logistic regression and how it is related to the class that we are predicting. Could anyone explain, without any mathematical equations, why we use the binomial rather than any other distribution?
TensorFlow Data Validation offers the capability of checking data skew and drift, and the documentation also mentions that "setting the correct distance is typically an iterative process requiring domain knowledge and experimentation". How can one specify an initial, reasonable jensen_shannon_divergence threshold (or an infinity_norm one for categorical features)? Is there some Python package/utility/code that one can leverage on a given dataset feature to compute a reasonable threshold? If not, what is recommended to properly conduct the experiment and find the …
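There is nothing TFDV-specific in the sketch below; it is just one way to calibrate a starting threshold empirically: split a baseline dataset into two halves that should not be flagged, compute the Jensen-Shannon divergence between their histograms (SciPy returns the JS distance, so it is squared here), and set the threshold comfortably above that "normal" variation. The feature and its distribution are made up:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
baseline = rng.gamma(shape=2.0, scale=3.0, size=50_000)   # hypothetical numeric feature
half_a, half_b = baseline[:25_000], baseline[25_000:]

# Shared binning so both halves are compared on the same support.
bins = np.histogram_bin_edges(baseline, bins=50)
p, _ = np.histogram(half_a, bins=bins)
q, _ = np.histogram(half_b, bins=bins)
p, q = p / p.sum(), q / q.sum()

# scipy's jensenshannon returns the JS *distance*; squaring gives the divergence.
js_divergence = jensenshannon(p, q, base=2) ** 2
print(js_divergence)   # a drift/skew threshold well above this avoids false alarms
```

TFDV computes its approximate Jensen-Shannon divergence on its own internal binning, so this only gives an order-of-magnitude starting point for the value placed in the feature's drift or skew comparator.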
Can somebody please help me explain this correlation between features? It is not a linear correlation, but there still seems to be some relationship. Here is the screenshot:
I have in my hands 3 different time series which model 3 different scenarios (base, downside, upside). Each of these time series depends on a set of 11 different attributes, which take values for different time intervals. Most of the input features are highly correlated. There is also a (CDF) probability function which defines how probable each scenario is (each quintile), for every point in time. In my case, I want to create more input data based …
Is there any way to measure statistical similarity or discrepancy between a dataset and a distribution? I have done some research, but most methods I found are intended to describe the discrepancy between data and data, or between a distribution and a distribution; that is, they always compare two objects of the same kind. What I am looking for is a method that can measure the discrepancy between a dataset and a distribution. It would be nice if there were a corresponding method that is easy …
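A sketch of two sample-versus-distribution measures that do exist, using SciPy and a made-up sample tested against a standard normal reference: the one-sample Kolmogorov-Smirnov statistic, and the average log-density of the data under the candidate distribution (a likelihood-based discrepancy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=0.1, scale=1.0, size=5_000)        # hypothetical dataset

reference = stats.norm(loc=0.0, scale=1.0)               # candidate distribution

# One-sample KS: maximum gap between the empirical CDF and the reference CDF.
ks_stat, p_value = stats.kstest(data, reference.cdf)
print(ks_stat, p_value)

# Average log-likelihood of the data under the reference distribution; higher
# means the distribution explains the dataset better.
print(reference.logpdf(data).mean())
```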
I have a dataset that follows Zipf's law, such that the majority of the values are concentrated at one end, with the remaining items making up a very small percentage. Training on the dataset as is would introduce a bias, so I was thinking of restructuring the data to fall into buckets. My model would then be a multi-class classification model rather than a regression model (I am training a NN). My question is whether I should draw up the …
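For what it's worth, a sketch of one common way to form such buckets, quantile-based bin edges with pandas on a made-up Zipf-like target; with heavy ties at the low end several quantiles coincide, so duplicates='drop' is needed and fewer than q buckets may come back:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.zipf(a=2.0, size=10_000).astype(float)     # hypothetical skewed target

# Quantile-based edges track the data's own distribution instead of its range;
# ties at the smallest values collapse some edges, hence duplicates="drop".
buckets, edges = pd.qcut(values, q=5, labels=False, retbins=True, duplicates="drop")

print(edges)                                             # resulting bucket boundaries
print(pd.Series(buckets).value_counts().sort_index())    # examples per bucket/class
```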
I am very new to the machine learning area, so my question might be trivial. I have two classes $U, V$ of binary vectors. In the training phase, I use $u_1,\ldots, u_{1000}$ from class $U$ and $v_1, \ldots, v_{1000}$ from class $V$. In the testing phase, I have to determine whether a vector comes from $U$ or $V$. How can we do that with good accuracy? Also, can we find out what internal criteria the ML model uses to make the classification?
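A minimal sketch of the standard supervised setup with synthetic binary vectors standing in for $U$ and $V$ (dimension, class sizes and generating probabilities below are all invented); the fitted coefficients also give one crude view of which vector positions drive the decision:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, dim = 1000, 50
U = (rng.random((n, dim)) < 0.4).astype(int)   # hypothetical class U vectors
V = (rng.random((n, dim)) < 0.6).astype(int)   # hypothetical class V vectors

X = np.vstack([U, V])
y = np.array([0] * n + [1] * n)                # 0 = U, 1 = V
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))               # test accuracy

# Largest-magnitude coefficients: the positions the model leans on most.
print(np.argsort(np.abs(clf.coef_[0]))[::-1][:5])
```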