np.random.normal(mean,sigma,size) lets you create a Gaussian distribution from just the mean and standard deviation. I want to create a distribution based on function_name(mean,sigma,skew,kurtosis,size). I tried scipy.stats.gengamma but I don't understand how to use it. It takes two shape parameters, a and c, and creates a distribution, but it is difficult to tell which values of a and c will give a particular skewness and kurtosis. Can anyone explain how to use gengamma or any other way to create …
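A quick way to see which shape values produce which moments is to ask gengamma for its own moments rather than guessing; this is a small exploratory sketch (the grid of a and c values is arbitrary), not a closed-form inversion from skew/kurtosis to parameters:

```python
import numpy as np
from scipy import stats

# Inspect which (a, c) combinations give which skewness / excess kurtosis:
# stats() can return mean, variance, skew and kurtosis at once.
for a in (0.5, 1.0, 2.0, 5.0):
    for c in (0.5, 1.0, 2.0, 4.0):
        mean, var, skew, kurt = stats.gengamma.stats(a, c, moments='mvsk')
        print(f"a={a:>4} c={c:>4}  skew={float(skew):7.3f}  kurtosis={float(kurt):7.3f}")

# Once suitable shape parameters are chosen, loc and scale shift and stretch
# the distribution to hit the desired mean and standard deviation.
samples = stats.gengamma.rvs(2.0, 1.5, loc=0.0, scale=1.0, size=10_000)
```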
I am trying to plot a dendrogram to cluster data but this error is stopping me. My data is here. I first chose columns to work with: df_euro = pd.read_csv('https://assets.datacamp.com/production/repositories/655/datasets/2a1f3ab7bcc76eef1b8e1eb29afbd54c4ebf86f2/eurovision-2016.csv') samples = df_euro.iloc[:, 2:7].values[:42] country_names = df_euro.iloc[:, 1].values[:42] # Calculate the linkage: mergings mergings = linkage(samples , method = 'complete') # Plot the dendrogram dendrogram( mergings, labels = y, leaf_rotation = 90, leaf_font_size = 6 ) plt.show() But I'm getting this error, which I can't understand. I googled it and …
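One thing that stands out in the snippet is that dendrogram is given labels = y even though the label array was assigned to country_names. A minimal sketch of the same pipeline with the labels wired to country_names (assuming the selected columns are numeric and NaN-free; if not, rows would need to be dropped or imputed before linkage):

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

df_euro = pd.read_csv('https://assets.datacamp.com/production/repositories/655/datasets/2a1f3ab7bcc76eef1b8e1eb29afbd54c4ebf86f2/eurovision-2016.csv')
samples = df_euro.iloc[:, 2:7].values[:42]
country_names = df_euro.iloc[:, 1].values[:42]

# Hierarchical clustering with complete linkage
mergings = linkage(samples, method='complete')

# Label the leaves with the country names defined above
# (the original snippet passed an undefined `y` here)
dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=6)
plt.show()
```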
I have N=21 objects and each one has about 80 possible non-NaN descriptors. I carried out a hierarchical clustering on the objects and I obtained this dendrogram. I want some kind of 'confidence' index for the dendrogram or for each node. I saw many dendrograms with bootstrap values (as far as I understand, this is the same as Monte Carlo cross-validation, but I might be wrong), and I think that in my case they could be used as well. …
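A hedged sketch of the bootstrap idea: resample the descriptors (columns), recluster, and check how stable the result is. This gives an overall stability score at a chosen cut via the adjusted Rand index rather than true per-node bootstrap support, and the data, linkage method and number of clusters below are placeholders:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(21, 80))          # placeholder for the 21 x ~80 descriptor matrix

k = 3                                  # number of clusters to cut the tree at
reference = fcluster(linkage(X, method='complete'), t=k, criterion='maxclust')

scores = []
for _ in range(1000):
    cols = rng.integers(0, X.shape[1], size=X.shape[1])   # bootstrap the descriptors
    labels = fcluster(linkage(X[:, cols], method='complete'), t=k, criterion='maxclust')
    scores.append(adjusted_rand_score(reference, labels))

print(f"mean stability (ARI) over bootstraps: {np.mean(scores):.3f}")
```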
I am trying to duplicate this paper's feature engineering for user activity. They take 14 days of accumulated user activity and keep the two parameters of a sigmoid fitted to it. I would like to do the same, except with 7 days of activity. http://hanj.cs.illinois.edu/pdf/kdd18_cyang.pdf They use the formula below and keep the parameters x0 and k as features. from scipy.optimize import curve_fit import numpy as np def sigmoid(x, x0, k): y = 1 / (1 + np.exp(-k*(x-x0))) return …
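A minimal fitting sketch with 7 points, assuming the daily activity has been accumulated and scaled to [0, 1]; the example values and initial guess p0 are made up:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, x0, k):
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

days = np.arange(1, 8)                                            # 7 days instead of 14
activity = np.array([0.02, 0.08, 0.25, 0.55, 0.80, 0.93, 0.98])   # hypothetical cumulative activity in [0, 1]

# A reasonable initial guess and a generous maxfev help convergence with so few points
(x0_fit, k_fit), _ = curve_fit(sigmoid, days, activity, p0=[3.5, 1.0], maxfev=10_000)
print(x0_fit, k_fit)   # the two values kept as features
```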
I have a log-normal mean and a standard deviation. After I converted them to the underlying normal distribution's parameters mu and sigma, I sampled from the log-normal distribution. However, when I take the mean and standard deviation of this sampled data, I don't get the results I plugged in at first. This only happens when the log-normal mean is much smaller than the log-normal standard deviation; otherwise it works. How do I prevent this from happening and get the input …
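For reference, the usual conversion for a desired log-normal mean $m$ and standard deviation $s$ is $\sigma^2 = \ln(1 + s^2/m^2)$ and $\mu = \ln m - \sigma^2/2$. A sketch of that round trip (the target values are made up); note that when $s \gg m$ the distribution is extremely heavy-tailed, so the sample mean and especially the sample standard deviation converge very slowly, which is the likely reason the estimates look off:

```python
import numpy as np

m, s = 2.0, 10.0                     # desired log-normal mean and standard deviation
sigma2 = np.log(1.0 + (s / m) ** 2)  # variance of the underlying normal
mu = np.log(m) - sigma2 / 2.0        # mean of the underlying normal

rng = np.random.default_rng(0)
samples = rng.lognormal(mean=mu, sigma=np.sqrt(sigma2), size=10_000_000)

# With s >> m the tail dominates, so these estimates stay noisy even at this size
print(samples.mean(), samples.std())
```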
I have 4 job titles, and for each of them I scraped hundreds of job descriptions and classified them by whether they contain words related to a predefined list of skills. For each job description, I now have a True/False flag indicating whether it mentions one of the skills. How can I validate that there is a significant difference between job descriptions that represent different job titles? I'm very new to this topic and all I could think of is using dummy …
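One common way to test whether the proportion of skill mentions differs across the four job titles is a chi-square test of independence on the title-by-flag contingency table; a sketch with a hypothetical frame (the column names and values are made up):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical frame: one row per scraped job description
df = pd.DataFrame({
    "job_title": ["data scientist", "data analyst", "data scientist", "ml engineer",
                  "data engineer", "data analyst", "ml engineer", "data engineer"],
    "mentions_skill": [True, False, True, True, False, False, True, True],
})

# 4 x 2 contingency table: job title vs. skill mentioned or not
table = pd.crosstab(df["job_title"], df["mentions_skill"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(table, p_value, sep="\n")
```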
I am trying to generate a sample of random numbers from a custom distribution $$ p(x) = x^{n}e^{-xtn}. $$ After reading the tutorial on SciPy's website, I wrote a subclass which I called kbayes: class kbayes(rv_continuous): def _pdf(self, x, t, n): p = x**n * np.exp(-t*n*x) s = np.sum(p) return p/s The line s=np.sum(p) is there to normalize the distribution. The PDF seems to be OK when I check it on some numbers; running the following code ks = np.logspace(-5, …
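One issue with normalising via np.sum(p) is that it only normalises over whatever array of x values happens to be passed in, not over the continuous support, so rvs and cdf become inconsistent. Since $\int_0^\infty x^n e^{-tnx}\,dx = \Gamma(n+1)/(tn)^{n+1}$, the constant can be computed analytically; a sketch (sampling goes through rv_continuous's generic numerical inversion, which can be slow):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import rv_continuous

class kbayes_gen(rv_continuous):
    # pdf proportional to x**n * exp(-t*n*x) on x > 0, normalised analytically:
    # the constant is (t*n)**(n+1) / Gamma(n+1)
    def _pdf(self, x, t, n):
        log_norm = (n + 1.0) * np.log(t * n) - gammaln(n + 1.0)
        return np.exp(log_norm + n * np.log(x) - t * n * x)

kbayes = kbayes_gen(a=0.0, name="kbayes")   # a=0 sets the lower support bound
samples = kbayes.rvs(2.0, 3.0, size=5)      # shape parameters t=2, n=3
print(samples)
```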
I am interested in graph problems like 2-coloring, max-clique, stable sets, etc., but the documentation for scipy.optimize.anneal seems to be aimed at ordinary functions. How would one apply this library to graph formulations?
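The usual pattern is to encode a candidate solution as a vector and express constraint violations as a penalty to minimise. Note that scipy.optimize.anneal has been removed from recent SciPy releases, so this sketch uses scipy.optimize.dual_annealing instead, with a toy 2-coloring objective; rounding a continuous relaxation like this is crude, and dedicated graph heuristics or ILP solvers are usually a better fit:

```python
import numpy as np
from scipy.optimize import dual_annealing

# Toy graph: a 4-cycle plus one chord
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n_vertices = 4

def monochromatic_edges(x):
    """Number of edges whose endpoints get the same colour after rounding."""
    colours = np.rint(x)
    return float(sum(colours[u] == colours[v] for u, v in edges))

# Each vertex gets a value in [0, 1]; rounding yields a 2-colouring
result = dual_annealing(monochromatic_edges, bounds=[(0.0, 1.0)] * n_vertices,
                        maxiter=200, seed=1)
print(np.rint(result.x).astype(int), result.fun)   # colouring and remaining conflicts
```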
I need to calculate the probability of my random variable being $\le 0$. It's a binomial distribution with $10000$ trials and probability of success $\frac{10}{19}$ (roughly $0.53$). How do I properly use scipy.stats.binom.cdf() to do that? I've tried the following: stats.binom(10000, a).cdf(0) But it gives me an answer of $0$. I feel like I might be missing something about the formula itself.
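For a binomial count, $P(X \le 0) = P(X = 0) = (1-p)^{10000}$, which is so small that it underflows to exactly 0 in double precision; the log-scale methods make that visible. A quick check, assuming the variable really is the raw success count (if it is something derived from the count, the event $\le 0$ would first need to be rewritten in terms of the number of successes):

```python
from scipy import stats

n, p = 10_000, 10 / 19
print(stats.binom.cdf(0, n, p))      # 0.0 -- underflows in double precision
print(stats.binom.logpmf(0, n, p))   # log P(X = 0) = n * log(1 - p), roughly -7.5e3
```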
I am trying to train a Bayesian NN and at some point I need to compute log-likelihoods for some data points according to a multivariate diagonal Gaussian distribution with parameters (mu, sigma). I have 2 problems: I don't know the size of the values in advance (note that I am guaranteed that 'values', 'mu' and 'rho' are the same size), but they could be either 1D or 2D, which forces me to have an ugly if statement. Ideally …
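For a diagonal Gaussian the per-dimension log-densities simply add up, so summing over the last axis handles both the 1D and 2D cases without an if statement. A sketch; I'm assuming rho parameterises sigma through a softplus, as is common in Bayes-by-backprop setups, so that part is a guess:

```python
import numpy as np
from scipy.stats import norm

def diag_gaussian_log_likelihood(values, mu, rho):
    # Assumed parameterisation: sigma = softplus(rho), which is always positive
    sigma = np.log1p(np.exp(rho))
    # logpdf broadcasts elementwise; summing over the last axis collapses the
    # feature dimension whether the input is shaped (D,) or (batch, D)
    return norm.logpdf(values, loc=mu, scale=sigma).sum(axis=-1)

# Works unchanged for a single point or a batch
print(diag_gaussian_log_likelihood(np.zeros(3), np.zeros(3), np.zeros(3)))
print(diag_gaussian_log_likelihood(np.zeros((5, 3)), np.zeros((5, 3)), np.zeros((5, 3))))
```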
I have a huge dataset of German words and their frequency in a text corpus (so words like "der", "die", "das" have a very high frequency, whereas terminology-like words have a very low frequency). Different forms of the same word, such as plural or 3rd-person forms, do appear, but there is no guarantee that this happens for every word. I tried using spacy.load('de_core_news_sm') but it says it can't find the model. Other older posts don't mention anything reliable in …
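The "can't find model" error usually just means the model package has not been downloaded into the current environment; a sketch of downloading it and lemmatising a few inflected forms (the example sentence is arbitrary):

```python
# First, install the German model into the active environment, e.g.:
#   python -m spacy download de_core_news_sm
import spacy

nlp = spacy.load("de_core_news_sm")

doc = nlp("Die Häuser wurden gebaut und die Kinder spielten.")
# Lemmas collapse inflected forms (plural, 3rd person, ...) onto a base form
print([(token.text, token.lemma_) for token in doc])
```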
I would like to know, when buying a property, when it is better for the estate agent to take a higher fee from me (the buyer) than from the seller if we close the deal at a lower price. As an example, let's say that the property asking price is €350k, the agent fee for the buyer is 3%, and the agent fee for the seller is 3%. All of the above could be parameters. I would like e.g. to offer €300k (50k less …
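With those example numbers, the agent collects €350k × (3% + 3%) = €21k at the asking price, so on a €300k deal the buyer's fee would have to rise to at least €21k / €300k − 3% = 4% for the agent to come out ahead. A small sketch of that break-even calculation, using only the example parameters above:

```python
asking_price = 350_000
offer_price = 300_000
buyer_fee = 0.03
seller_fee = 0.03

# Agent's total commission if the property sells at the asking price
baseline_commission = asking_price * (buyer_fee + seller_fee)         # 21,000

# Buyer fee at which the agent earns the same amount on the lower offer
breakeven_buyer_fee = baseline_commission / offer_price - seller_fee  # 0.04 -> 4%

print(baseline_commission, breakeven_buyer_fee)
```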
I have a simple script that applies the Savitzky-Golay filter to a toy dataset of forex prices from Yahoo Finance: import scipy.signal price_series = pandas.read_csv('AUDUSD=X.csv').set_index('Date')['Close'] splinal_fit = scipy.signal.savgol_filter(price_series, window_length=21, polyorder=2, deriv=0, mode='mirror') splinal_fit = pandas.Series(splinal_fit, index=price_series.index, name='fit') splinal_deriv = scipy.signal.savgol_filter(price_series, window_length=21, polyorder=2, deriv=1, axis=0, delta=1) splinal_deriv = pandas.Series(splinal_deriv, index=price_series.index, name='fit') The fit and derivative look broadly sensible; however, the x-axis seems skewed. Here is what I ran to plot the derivative alongside the original fit: import matplotlib.pyplot as plt mask …
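Since both filtered outputs were wrapped in Series that reuse price_series.index, plotting everything against that shared date index should keep the x-axes aligned. A minimal plotting sketch along those lines, reusing the objects from the snippet above; the derivative goes on a secondary axis because its scale is much smaller, and this assumes the apparent skew is a plotting issue rather than a filter issue:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Price and smoothed fit share the same date index and the same y-axis
price_series.plot(ax=ax, label='Close', color='tab:blue', alpha=0.5)
splinal_fit.plot(ax=ax, label='Savitzky-Golay fit', color='tab:orange')
ax.legend(loc='upper left')

# The derivative lives on a very different scale, so give it a secondary y-axis
ax2 = ax.twinx()
splinal_deriv.plot(ax=ax2, label='derivative', color='tab:red')
ax2.legend(loc='upper right')

plt.show()
```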
Question: try to find the global optimum of the function below (reading Python for Finance, 2nd edition, Chapter 11, Mathematical Tools). def fm(p): x, y = p return (np.sin(x) + 0.05 * x ** 2 + np.sin(y) + 0.05 * y ** 2) scipy.optimize.basinhopping says it finds the global minimum: "Find the global minimum of a function using the basin-hopping algorithm". However, it looks like it does not find the global optimum. Why is this, and how can I make …
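basinhopping is stochastic and only explores basins reachable from its starting point with the chosen step size, so with the defaults it can easily settle in a local minimum of this function. A sketch that gives it more, larger random steps; the starting point, step size, iteration count and seed here are arbitrary choices, not the book's:

```python
import numpy as np
import scipy.optimize as spo

def fm(p):
    x, y = p
    return np.sin(x) + 0.05 * x ** 2 + np.sin(y) + 0.05 * y ** 2

# More iterations and a larger stepsize make it likelier to hop out of the
# basin around the starting point; the result still depends on the seed.
result = spo.basinhopping(fm, x0=[2.0, 2.0], niter=500, stepsize=5.0,
                          minimizer_kwargs={"method": "BFGS"}, seed=1)
print(result.x, result.fun)
```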
I am reviewing information on feature selection based on filter methods. I found material (link1, link2, link3, link4, link5) for: numerical input, numerical output; categorical input, categorical output; numerical input, categorical output. However, I'm having a hard time finding information on categorical input, numerical output (categorical features in a regression problem). I would be grateful if you could point me to information about it, or the name of a function that could carry out this task.
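For the categorical-input/numerical-output case, one standard filter-style option is a one-way ANOVA F-test: group the numeric target by the levels of the categorical feature and test whether the group means differ. A sketch with a made-up frame (mutual information, e.g. sklearn's mutual_info_regression on encoded categories, is another common choice):

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical data: one categorical feature, one numerical target
df = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "price":         [210, 195, 205, 340, 360, 355, 150, 160, 148],
})

# One group of target values per category level
groups = [group["price"].values for _, group in df.groupby("neighbourhood")]
f_stat, p_value = f_oneway(*groups)
print(f_stat, p_value)   # a small p-value suggests the feature is informative about the target
```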
I have unstructured documents from which I have to extract information such as buyer name, seller name, expiry date, buying date, etc. I had planned to use spaCy (custom entity recognition; I followed this blog: https://medium.com/@manivannan_data/how-to-train-ner-with-custom-training-data-using-spacy-188e0e508c6). But sometimes the buyer name is predicted as the seller name and vice versa, and sometimes multiple values are wrongly merged into a single predicted entity when I pass the whole document content. FYI, these documents have roughly 2-20 pages, so the content is large. Can someone share if we …
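One mitigation that often helps with very long documents is to split them into paragraph-sized chunks before prediction, so the NER model sees context closer in length to what it was trained on. A sketch using only the basic spaCy prediction API; the model path and chunk size are placeholders, and this does not by itself fix buyer/seller confusion, which usually needs more and better-balanced training examples:

```python
import spacy

# Placeholder path: the custom NER model trained following the blog post
nlp = spacy.load("path/to/custom_ner_model")

def extract_entities(full_text, max_chars=2000):
    """Run NER on paragraph-sized chunks instead of the whole 2-20 page document."""
    chunks, current = [], ""
    for paragraph in full_text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current)
            current = ""
        current += paragraph + "\n\n"
    if current:
        chunks.append(current)

    entities = []
    for doc in nlp.pipe(chunks):                 # batch prediction over the chunks
        entities.extend((ent.text, ent.label_) for ent in doc.ents)
    return entities
```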
I need to do a chi-square test on two of my dataset's categorical variables. These two variables have basically the same meaning but come from two different sources, so my idea is to use a chi-square test to see how "similar", or correlated, these two variables really are. To do so, I've written code in Python, but the p-value I get from it is exactly 0, which sounds a little strange to me. The code is: from scipy.stats …
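A p-value of exactly 0 from chi2_contingency is usually just numerical underflow: with a large sample and strongly associated variables the chi-square statistic is huge, and the corresponding tail probability drops below the smallest positive double. A sketch of the typical setup, plus a look at the statistic itself (the column names and data are placeholders):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Placeholder data: two categorical variables that mostly agree
rng = np.random.default_rng(0)
source_a = rng.choice(["low", "mid", "high"], size=100_000)
source_b = np.where(rng.random(100_000) < 0.9, source_a,
                    rng.choice(["low", "mid", "high"], size=100_000))
df = pd.DataFrame({"var_a": source_a, "var_b": source_b})

table = pd.crosstab(df["var_a"], df["var_b"])
chi2, p_value, dof, expected = chi2_contingency(table)

# chi2 is enormous here, so the p-value underflows to 0.0 rather than being "wrong"
print(chi2, p_value, dof)
```

If the goal is to quantify how similar the variables are rather than merely reject independence, an effect-size measure such as Cramér's V computed from the same statistic is more informative than the p-value.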
So I'm trying to write an averaged perceptron algorithm (page 48 here for the equation) in Python. Instead of storing the historical weights, I simply accumulate the weight updates multiplied by the consistency counter, $c$; that accumulator is the variable w_accum. My implementation initially had the weight vectors and x represented as dictionaries, where a feature is in the dictionary only if it's active; that was supposed to be the most efficient way I could think of. Here is that code: def …
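For reference, a dense NumPy sketch of the cached-weights trick that the accumulator is meant to implement; this uses arrays rather than the dictionary representation from the question and assumes labels in {-1, +1}. On every mistake the update is added once to w and, scaled by the counter c, to the accumulator u, and the averaged weights come out as w - u/c:

```python
import numpy as np

def averaged_perceptron(X, y, epochs=10):
    """Averaged perceptron with the cached-weights trick (w, accumulator u, counter c)."""
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0        # running weights and bias
    u, beta = np.zeros(n_features), 0.0     # counter-weighted accumulators
    c = 1.0                                 # consistency counter

    for _ in range(epochs):
        for xi, yi in zip(X, y):            # yi in {-1, +1}
            if yi * (w @ xi + b) <= 0:      # mistake: update both w and the accumulator
                w += yi * xi
                b += yi
                u += c * yi * xi
                beta += c * yi
            c += 1

    # Averaged parameters: subtract the counter-weighted accumulator
    return w - u / c, b - beta / c
```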
I have a dataset with many projects' monthly expenditures (cost curves), like this one:

Project    Date     Expenditure(USD)
Project A  12-2020        500
Project A  01-2021       1257
Project A  02-2021     125889
Project A  03-2021     102447
Project A  04-2021       1248
Project A  05-2021       1222
Project A  06-2021        856
Project B  01-2021       5589
Project B  02-2021      52874
Project B  03-2021    5698745
Project B  04-2021    2031487
Project B  05-2021    2359874
Project B  06-2021      25413
Project B  07-2021       2014
Project B  08-2021       2569

Using Python, I …
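The question is cut off before the actual task, but since the data describes per-project cost curves, a common first step is to build the cumulative expenditure curve for each project; a sketch of that reshaping (the file name is a placeholder for however the table above is loaded):

```python
import pandas as pd

# Placeholder load: columns Project, Date, Expenditure(USD) as in the table above
df = pd.read_csv("expenditures.csv")
df["Date"] = pd.to_datetime(df["Date"], format="%m-%Y")
df = df.sort_values(["Project", "Date"])

# Cumulative spend per project: the classic S-shaped cost curve
df["Cumulative"] = df.groupby("Project")["Expenditure(USD)"].cumsum()

# One column per project, indexed by month, ready for plotting or comparison
curves = df.pivot(index="Date", columns="Project", values="Cumulative")
print(curves)
```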