How to make a Gaussian distribution in Python considering mean, variance, skewness and kurtosis?

np.random.normal(mean,sigma,size) lets you create a Gaussian distribution based only on the mean and variance. I want to create a distribution based on function_name(mean,sigma,skew,kurtosis,size). I tried scipy.stats.gengamma but I don't understand how to use it. It takes two parameters, a and c, and creates a distribution. But it is hard to tell which values of a and c give a particular skewness and kurtosis. Can anyone explain how to use gengamma or any other way to create …
Category: Data Science
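
One way to see which skewness and kurtosis a given pair (a, c) produces is to ask SciPy for the first four moments directly. This is a minimal sketch, not a full answer to the question: the helper name `gengamma_moments` is hypothetical, and the kurtosis SciPy reports here is the excess kurtosis (0 for a normal distribution).

```python
import numpy as np
from scipy import stats


def gengamma_moments(a, c):
    """Return (mean, variance, skewness, excess kurtosis) of gengamma(a, c)."""
    mean, var, skew, kurt = stats.gengamma.stats(a, c, moments="mvsk")
    return float(mean), float(var), float(skew), float(kurt)


# Scan a few shape pairs to see how skewness and kurtosis respond.
for a in (0.5, 1.0, 2.0):
    for c in (0.5, 1.0, 2.0):
        m, v, s, k = gengamma_moments(a, c)
        print(f"a={a}, c={c}: mean={m:.3f}, var={v:.3f}, skew={s:.3f}, kurt={k:.3f}")

# loc and scale shift and stretch the samples; they do not change
# the skewness or the kurtosis, which depend only on a and c.
samples = stats.gengamma.rvs(2.0, 1.0, loc=0.0, scale=1.0, size=1000, random_state=0)
```

Once a suitable (a, c) pair is found, `loc` and `scale` can be chosen to match the target mean and standard deviation.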

Dendrogram: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

I am trying to plot a dendrogram to cluster data but this error is stopping me. My data is here. I first chose columns to work with:
    df_euro = pd.read_csv('https://assets.datacamp.com/production/repositories/655/datasets/2a1f3ab7bcc76eef1b8e1eb29afbd54c4ebf86f2/eurovision-2016.csv')
    samples = df_euro.iloc[:, 2:7].values[:42]
    country_names = df_euro.iloc[:, 1].values[:42]
    # Calculate the linkage: mergings
    mergings = linkage(samples , method = 'complete')
    # Plot the dendrogram
    dendrogram(mergings, labels = y, leaf_rotation = 90, leaf_font_size = 6)
    plt.show()
But I'm getting this error which I can't understand. I googled it and …
Category: Data Science
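
The excerpt passes `labels = y`, and `y` is not defined in the code shown, so the following rests on an assumption: if `y` is a pandas Series, some SciPy versions may evaluate its truth value and raise this error. A minimal sketch with stand-in data (the random samples and the names A to F are placeholders) that converts the labels to a plain list first:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
samples = rng.normal(size=(6, 5))                  # stand-in for the numeric columns
names = pd.Series(["A", "B", "C", "D", "E", "F"])  # stand-in for a label column

mergings = linkage(samples, method="complete")

# Convert the Series to a plain list so its truth value is never evaluated.
labels = names.tolist()

# no_plot=True returns the tree layout without needing matplotlib.
tree = dendrogram(mergings, labels=labels, no_plot=True)
print(tree["ivl"])  # leaf labels in plotting order
```

In the question's code, `country_names` is already a NumPy array, so passing `list(country_names)` would be the analogous change.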

Evaluate Dendrogram Statistical Significance

I have N=21 objects and each one has about 80 possible non-NaN descriptors. I carried out a hierarchical clustering on the objects and I obtained this dendrogram. I want some kind of 'confidence' index for the dendrogram or for each node. I saw many dendrograms with bootstrap values (as far as I understand it is the same as Monte Carlo cross-validation, but I might be wrong), and I think that in my case they could be used as well. …
Category: Data Science
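
A rough sketch of one way to get bootstrap support per node: resample the descriptors (columns) with replacement, recluster, and count how often each cluster of the original tree reappears. It assumes a complete matrix with no NaNs and uses average linkage; the function names are hypothetical, and the result is a plain bootstrap proportion, not a corrected p-value.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def clusters_from_linkage(Z, n):
    """Return one frozenset of leaf indices per internal node of the tree."""
    members = {i: frozenset([i]) for i in range(n)}
    nodes = []
    for k, row in enumerate(Z):
        left, right = int(row[0]), int(row[1])
        members[n + k] = members[left] | members[right]
        nodes.append(members[n + k])
    return nodes


def bootstrap_support(X, n_boot=200, method="average", seed=0):
    """Fraction of bootstrap trees in which each original cluster reappears."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    reference = clusters_from_linkage(linkage(X, method=method), n)
    counts = np.zeros(len(reference))
    for _ in range(n_boot):
        cols = rng.integers(0, d, size=d)  # resample descriptors with replacement
        boot = set(clusters_from_linkage(linkage(X[:, cols], method=method), n))
        counts += np.array([node in boot for node in reference], dtype=float)
    return reference, counts / n_boot


# Random stand-in data with the shape described in the question (21 x 80).
X_demo = np.random.default_rng(0).normal(size=(21, 80))
nodes, support = bootstrap_support(X_demo, n_boot=50)
```

Each entry of `support` lines up with the corresponding row of the original linkage matrix, so it can be annotated on the matching node of the dendrogram.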

Scipy curve_fit and method "dogbox"

I am trying to duplicate this paper's feature engineering for user activity. They take 14 days of accumulated user activity and keep the two parameters that fit a sigmoid to it. I would like to do the same except with 7 days of activity. http://hanj.cs.illinois.edu/pdf/kdd18_cyang.pdf They use the formula below and keep the parameters x0 and k as features.
    from scipy.optimize import curve_fit
    import numpy as np
    def sigmoid(x, x0, k):
        y = 1 / (1 + np.exp(-k*(x-x0)))
        return …
Topic: scipy
Category: Data Science
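
A minimal sketch of fitting the two-parameter sigmoid to 7 days of activity with `method="dogbox"`. The activity values are synthetic and assumed to be scaled to the range 0 to 1; the starting guess `p0` and the bounds are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit


def sigmoid(x, x0, k):
    """Logistic curve with midpoint x0 and steepness k."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))


days = np.arange(1, 8, dtype=float)   # day 1 .. day 7
activity = sigmoid(days, 4.0, 1.2)    # synthetic, noise-free stand-in data

# dogbox supports bounds, which keeps x0 inside the observed week and k positive.
(x0_hat, k_hat), covariance = curve_fit(
    sigmoid,
    days,
    activity,
    p0=[3.5, 1.0],
    method="dogbox",
    bounds=([0.0, 0.0], [7.0, 10.0]),
)
print(x0_hat, k_hat)
```

With real, noisy data the fit can be sensitive to `p0`, so starting near the middle of the observation window is a reasonable default.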

How to sample a dataframe or numpy array with a particular interval?

I have the following dataframe:
    A     B1    B2    B3    B4    B5    B6    B7
    0     0     0     0     0     0     0     0
    1     444   325   479   502   630   458   588
    2     1200  1255  1101  1259  1365  1400  1100
    3     2092  1764  2103  2359  2245  2397  2487
    4     2586  2232  2549  2597  2628  2718  2770
    5     2951  2762  2924  2757  2903  2934  2963
I want to sample the columns uniformly. For example, I want to divide the interval 0 to 1 for …
Category: Data Science

The mean and standard deviation aren't the same as those of the input data I provided after sampling

I have a log-normal mean and standard deviation. After I converted them to the underlying normal distribution's parameters mu and sigma, I sampled from the log-normal distribution. However, when I take the mean and standard deviation of the sampled data, I don't get the values I plugged in at first. This only happens when the log-normal mean is much smaller than the log-normal standard deviation; otherwise it works. How do I prevent this from happening and get the input …
Category: Data Science
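
For reference, a minimal sketch of the conversion described above, using the standard relations sigma^2 = ln(1 + sd^2/mean^2) and mu = ln(mean) - sigma^2/2. The helper name `lognormal_params` and the two example (mean, sd) pairs are illustrative assumptions.

```python
import numpy as np


def lognormal_params(mean, sd):
    """Convert a log-normal mean and sd to mu, sigma of the underlying normal."""
    sigma2 = np.log1p((sd / mean) ** 2)
    mu = np.log(mean) - 0.5 * sigma2
    return mu, np.sqrt(sigma2)


rng = np.random.default_rng(0)
for mean, sd in [(10.0, 2.0), (1.0, 20.0)]:
    mu, sigma = lognormal_params(mean, sd)
    x = rng.lognormal(mu, sigma, size=100_000)
    # Compare the requested moments with the sample moments.
    print(f"target mean={mean}, sd={sd}; sample mean={x.mean():.3f}, sd={x.std():.3f}")
```

When the standard deviation is much larger than the mean, sigma becomes large and the distribution is very heavy-tailed, so sample moments converge slowly even though the parameters are correct. Larger samples narrow the gap but do not remove it quickly.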

What kind of hypothesis testing in Python can be used to validate that 4 job titles are significantly different based on their skillset?

I have 4 job titles, for each of which I scraped hundreds of job descriptions and classified them by whether they contain words related to a predefined list of skills. For each job description, I now have a True/False flag for whether it mentions one of the skills. How can I validate that there is a significant difference between job descriptions that represent different job titles? I'm very new to this topic and all I could think of is using dummy …
Category: Data Science
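
One common option for True/False flags across several groups is a chi-square test of independence on a contingency table, run once per skill. The sketch below assumes that setup; the counts are placeholders for illustration only, not real data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: the 4 job titles. Columns: [mentions the skill, does not mention it].
# These counts are placeholders, not results from any real scrape.
table = np.array([
    [120, 80],
    [90, 110],
    [60, 140],
    [150, 50],
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.3g}")
```

If many skills are tested this way, the p-values would need a multiple-comparison correction before drawing conclusions.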

Error when drawing random numbers from a custom continuous distribution using scipy.rv_continuous

I am trying to generate a sample of random numbers from a custom distribution $$ p(x) = x^{n}e^{-xtn}. $$ After reading the tutorial on scipy's website, I wrote a subclass which I called kbayes:
    class kbayes(rv_continuous):
        def _pdf(self, x, t, n):
            p = x**n * np.exp(-t*n*x)
            s = np.sum(p)
            return p/s
The line s=np.sum(p) is there to normalize the distribution. The pdf seems to be ok when I check it on some numbers: running the following code ks = np.logspace(-5, …
Category: Data Science
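
A sketch of an alternative normalization, under the assumption that x ranges over (0, infinity) with t > 0 and n > 0. In that case the integral of x^n e^{-xtn} is Gamma(n+1)/(tn)^(n+1), so the density is a gamma distribution with shape n+1 and scale 1/(tn), and sampling can use `scipy.stats.gamma` directly. Dividing by `np.sum(p)` instead makes the result depend on which points happen to be evaluated. The class name `KBayes` is a renamed stand-in for the question's class.

```python
import numpy as np
from scipy import stats
from scipy.special import gamma as gamma_fn


class KBayes(stats.rv_continuous):
    """p(x) proportional to x**n * exp(-t*n*x) on x > 0, normalized analytically."""

    def _pdf(self, x, t, n):
        norm = (t * n) ** (n + 1) / gamma_fn(n + 1)
        return norm * x ** n * np.exp(-t * n * x)


kbayes = KBayes(a=0.0, name="kbayes")

t, n = 2.0, 3.0
x = np.linspace(0.1, 5.0, 20)
pdf_custom = kbayes.pdf(x, t, n)
pdf_gamma = stats.gamma.pdf(x, a=n + 1, scale=1.0 / (t * n))

# Sampling through the equivalent gamma distribution avoids the slow
# generic inversion that rv_continuous falls back to.
draws = stats.gamma.rvs(a=n + 1, scale=1.0 / (t * n), size=20_000, random_state=0)
```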

How do I properly write scipy.stats.binom.cdf() details

I need to calculate the probability of my random variable being $\le 0$. It's a binomial distribution with $10000$ trials and a probability of success of $\frac{10}{19}$ (roughly $0.53$). How do I properly use scipy.stats.binom.cdf() to do that? I've tried the following: stats.binom(10000, a).cdf(0) But it gives me the answer $0$. I feel like I might be missing something about the formula itself.
Category: Data Science
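
A short sketch, assuming the random variable is the number of successes: the call itself is fine, and the result is 0 because the true probability, (9/19)^10000, is far smaller than the smallest positive double. Working on the log scale shows its magnitude.

```python
import math
from scipy import stats

n, p = 10000, 10 / 19

# P(X <= 0): underflows to 0.0 in double precision.
prob = stats.binom.cdf(0, n, p)

# P(X <= 0) equals P(X = 0), which can be inspected on the log scale.
log_prob = stats.binom.logpmf(0, n, p)

print(prob)
print(log_prob, n * math.log1p(-p))  # the two log values should agree
```

If the quantity of interest is something else, for example net winnings being at most zero, the threshold passed to `cdf` would be the corresponding number of successes rather than 0.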

Vectorize scipy.stats.norm.logpdf

I am trying to train a Bayesian NN and at some point I need to compute log-likelihoods for some data points, according to a multivariate diagonal Gaussian distribution with parameters (mu, sigma). I have 2 problems: I don't know the size of the values in advance (note that I am guaranteed that 'values', 'mu' and 'rho' are the same size, but they could be either 1D or 2D), which forces me to have an ugly if statement. Ideally …
Topic: numpy scipy
Category: Data Science
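
A minimal sketch of a shape-agnostic version: `scipy.stats.norm.logpdf` works elementwise and broadcasts, so summing its output gives the diagonal-Gaussian log-likelihood for 1D and 2D inputs alike. The function name is hypothetical, and `sigma` is taken as given because the excerpt does not show how `rho` maps to it.

```python
import numpy as np
from scipy.stats import norm


def diag_gaussian_loglik(values, mu, sigma):
    """Sum of independent normal log-densities; works for any matching shapes."""
    return norm.logpdf(values, loc=mu, scale=sigma).sum()


rng = np.random.default_rng(0)

# 1D case
v1, m1, s1 = rng.normal(size=5), rng.normal(size=5), rng.uniform(0.5, 2.0, size=5)
ll_1d = diag_gaussian_loglik(v1, m1, s1)

# 2D case, same function with no branching on the number of dimensions
v2, m2, s2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4)), rng.uniform(0.5, 2.0, size=(3, 4))
ll_2d = diag_gaussian_loglik(v2, m2, s2)
```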

Stemming/lemmatization for German words

I have a huge dataset of German words and their frequency in a text corpus (so words like "der", "die", "das" have a very high frequency, whereas terminology-like words have a very low frequency). Different forms of the same word, such as plural or 3rd person forms do appear, but there is no guarantee that this happens for every word. I tried using spacy.load('de_core_news_sm') but it says it can't find the model. Other older posts don't mention anything reliable in …
Category: Data Science

Find the right balance between price of a property and agent fee

When buying a property, I would like to know when it is better for an estate agent to get a higher fee from me than from the seller if we agree a deal at a lower price. As an example, let's say that the property asking price is €350k, the agent fee for the buyer is 3%, and the agent fee for the seller is 3%. All of the above could be parameters. I would like e.g. to offer €300k (50k less …
Category: Data Science

Why does the 1st derivative appear to lag the slope of the fit in Scipy's Savitzky-Golay filter?

I have a simple script that applies the Savitzky-Golay filter to a toy dataset of forex prices from Yahoo Finance:
    import scipy.signal
    price_series = pandas.read_csv('AUDUSD=X.csv').set_index('Date')['Close']
    splinal_fit = scipy.signal.savgol_filter(price_series, window_length=21, polyorder=2, deriv=0, mode='mirror')
    splinal_fit = pandas.Series(splinal_fit, index=price_series.index, name='fit')
    splinal_deriv = scipy.signal.savgol_filter(price_series, window_length=21, polyorder=2, deriv=1, axis=0, delta=1)
    splinal_deriv = pandas.Series(splinal_deriv, index=price_series.index, name='fit')
The fit and derivative look broadly sensible; however, the x-axis seems skewed. Here is what I ran to plot the derivative alongside the original fit: import matplotlib.pyplot as plt mask …
Category: Data Science
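
The following check does not diagnose the plot in the question, which is cut off here. It only illustrates, on a synthetic sine wave with a known derivative, that `savgol_filter` with `deriv=1` is centered on each sample and does not itself introduce a lag. The grid and the interior slice are illustrative choices.

```python
import numpy as np
from scipy.signal import savgol_filter

x = np.linspace(0.0, 4.0 * np.pi, 400)
dx = x[1] - x[0]
y = np.sin(x)

# Same window and polynomial order as in the question, same boundary mode for both calls.
fit = savgol_filter(y, window_length=21, polyorder=2, deriv=0, mode="mirror")
deriv = savgol_filter(y, window_length=21, polyorder=2, deriv=1, delta=dx, mode="mirror")

# Compare with the analytic derivative away from the edges.
interior = slice(20, -20)
max_err = np.max(np.abs(deriv[interior] - np.cos(x)[interior]))
print(max_err)
```

Note that `delta` sets the sample spacing: with `delta=1` the derivative is expressed per sample rather than per unit of the x-axis.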

How to make scipy.optimize.basinhopping find the global optimal point

I am trying to find the global optimal point of the function below (from Python for Finance, 2nd edition, Chapter 11, Mathematical Tools).
    def fm(p):
        x, y = p
        return (np.sin(x) + 0.05 * x ** 2 + np.sin(y) + 0.05 * y ** 2)
The scipy.optimize.basinhopping documentation says it finds the global minimum: "Find the global minimum of a function using the basin-hopping algorithm". However, it looks like it does not find the global optimal point. Why is this and how can I make …
Category: Data Science
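
Basin-hopping is a stochastic heuristic, so a single run with default settings can stay in a local basin. A minimal sketch of two common adjustments, a larger `stepsize` with more iterations and several starting points; the specific values and the helper name `search` are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
from scipy.optimize import basinhopping


def fm(p):
    x, y = p
    return np.sin(x) + 0.05 * x ** 2 + np.sin(y) + 0.05 * y ** 2


def search(x0, niter=200, stepsize=3.0):
    """Run basin-hopping from x0 with a step large enough to cross between basins."""
    return basinhopping(
        fm,
        x0,
        niter=niter,
        stepsize=stepsize,
        minimizer_kwargs={"method": "L-BFGS-B"},
    )


# Try several starting points and keep the best result.
starts = [(0.0, 0.0), (5.0, 5.0), (-5.0, 5.0)]
best = min((search(x0) for x0 in starts), key=lambda result: result.fun)
print(best.x, best.fun)
```

The local minima of this function are a few units apart in each coordinate, which is why a step size much smaller than that tends to keep the search in its starting basin.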

Feature Selection: How to select categorical features in a regression problem

I am reviewing information on feature selection based on filter methods. I got info (link1, link2, link3, link4, link5) for: numerical input, numerical output; categorical input, categorical output; numerical input, categorical output. However, I'm having a hard time finding information on categorical input, numerical output (categorical features in a regression problem). I would be grateful if you could point me to information about it, or the name of a function that could carry out this task.
Category: Data Science
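
One filter-style option for a categorical feature and a numerical target is a one-way ANOVA F-test: group the target by category level and test whether the group means differ. A minimal sketch with synthetic data; the helper name `anova_score` and the example arrays are assumptions for illustration.

```python
import numpy as np
from scipy.stats import f_oneway


def anova_score(categories, target):
    """One-way ANOVA of a numeric target across the levels of a categorical feature."""
    categories = np.asarray(categories)
    target = np.asarray(target, dtype=float)
    groups = [target[categories == level] for level in np.unique(categories)]
    return f_oneway(*groups)


rng = np.random.default_rng(0)
feature = np.repeat(["a", "b", "c"], 50)
level_means = {"a": 0.0, "b": 5.0, "c": 10.0}
y = np.array([level_means[c] for c in feature]) + rng.normal(scale=1.0, size=feature.size)

result = anova_score(feature, y)
print(result.statistic, result.pvalue)
```

Features can then be ranked by the F statistic. `scipy.stats.kruskal` takes the same grouped input and is a rank-based alternative when the normality assumption is doubtful.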

Which is the best algorithm for entity extraction for unstructured document

I have unstructured documents from which I have to extract information such as buyer name, seller name, expiry date, buying date, etc. I had planned to use spaCy (custom entity recognition, following this blog: https://medium.com/@manivannan_data/how-to-train-ner-with-custom-training-data-using-spacy-188e0e508c6). But sometimes the buyer name is predicted as the seller name and vice versa, and sometimes several values are wrongly predicted as a single entity when I pass the whole document content. FYI, these documents have approximately 2-20 pages, so the content is large. Can someone share if we …
Category: Data Science

p-value of chi squared test is exactly 0.0

I need to do a chi-square test of two of my dataset's categorical variables. These two variables have basically the same meaning but come from two different sources, so my idea is to use a chi-square test to see how "similar" or correlated these two variables really are. To do so, I've written code in Python, but the p-value I get from it is exactly 0, which sounds a little strange to me. The code is: from scipy.stats …
Category: Data Science
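
When two variables are strongly associated and the sample is large, the chi-square statistic becomes huge and the p-value underflows to 0.0 in floating point. The p-value then says little about how similar the variables are, and an effect size such as Cramér's V is often more informative. A minimal sketch; the function name `cramers_v` and the toy input are assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency


def cramers_v(a, b):
    """Return (Cramér's V, chi-square statistic, p-value) for two categorical arrays."""
    a_codes = np.unique(np.asarray(a).ravel(), return_inverse=True)[1]
    b_codes = np.unique(np.asarray(b).ravel(), return_inverse=True)[1]

    # Build the contingency table of joint counts.
    table = np.zeros((a_codes.max() + 1, b_codes.max() + 1))
    np.add.at(table, (a_codes, b_codes), 1)

    chi2, p_value, dof, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
    return v, chi2, p_value


source_1 = np.array(["x", "y", "z"] * 20)
v_same, chi2_same, p_same = cramers_v(source_1, source_1)  # identical variables
print(v_same, chi2_same, p_same)
```

V ranges from 0 (no association) to 1 (one variable fully determines the other), independent of sample size.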

Optimizing an averaged perceptron algorithm using numpy and scipy instead of dictionaries

So I'm trying to write an averaged perceptron algorithm (page 48 here for the equation) in Python. Instead of storing the historical weights, I simply accumulate the weights multiplied by the consistency counter, $c$, in the variable w_accum. My implementation initially had the weight vectors and x represented as dictionaries where a feature is in the dictionary only if it's active; that was supposed to be the most efficient way I could think of. Here is that code: def …
Category: Data Science
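
For comparison, a generic dense NumPy sketch of an averaged perceptron that uses the same idea of a survival counter `c` and an accumulator `w_accum`. It is not the asker's implementation, which is cut off above; it assumes labels in {-1, +1} and a dense feature matrix.

```python
import numpy as np


def averaged_perceptron(X, y, epochs=10):
    """Train an averaged perceptron. X: (n, d) array, y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    w_accum = np.zeros(d)
    b_accum = 0.0
    c = 0  # number of examples seen with the current weights
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:
                # Mistake: bank the current weights, weighted by how long they lasted.
                w_accum += c * w
                b_accum += c * b
                w = w + yi * xi
                b = b + yi
                c = 1
            else:
                c += 1
    # Bank the final weight vector as well.
    w_accum += c * w
    b_accum += c * b
    total = epochs * n
    return w_accum / total, b_accum / total
```

For sparse features, the same updates can be applied to the rows of a `scipy.sparse` CSR matrix, though the accumulator itself stays dense in this formulation.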

Create Period column based on a date column where the first month is 1, second 2, etc

I have a dataset with many projects' monthly expenditures (cost curve), like this one:
    Project     Date      Expenditure(USD)
    Project A   12-2020   500
    Project A   01-2021   1257
    Project A   02-2021   125889
    Project A   03-2021   102447
    Project A   04-2021   1248
    Project A   05-2021   1222
    Project A   06-2021   856
    Project B   01-2021   5589
    Project B   02-2021   52874
    Project B   03-2021   5698745
    Project B   04-2021   2031487
    Project B   05-2021   2359874
    Project B   06-2021   25413
    Project B   07-2021   2014
    Project B   08-2021   2569
Using Python, I …
Category: Data Science
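
A minimal pandas sketch of the Period column named in the title, assuming the dates are strings in MM-YYYY format as shown. It counts months elapsed since each project's first month, so a skipped month would leave a gap in the numbering instead of shifting later periods. Only the first three rows of each project are reproduced here.

```python
import pandas as pd

df = pd.DataFrame({
    "Project": ["Project A"] * 3 + ["Project B"] * 3,
    "Date": ["12-2020", "01-2021", "02-2021", "01-2021", "02-2021", "03-2021"],
    "Expenditure(USD)": [500, 1257, 125889, 5589, 52874, 5698745],
})

# Turn MM-YYYY strings into a running month number.
dates = pd.to_datetime(df["Date"], format="%m-%Y")
month_index = dates.dt.year * 12 + dates.dt.month

# Period 1 is each project's earliest month.
first_month = month_index.groupby(df["Project"]).transform("min")
df["Period"] = month_index - first_month + 1
print(df)
```

If every project is guaranteed to have one row per consecutive month, sorting by date and using `df.groupby("Project").cumcount() + 1` would give the same numbering.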
