Estimating class prevalence in unlabelled data after predicting labels with a binary classifier

I'm looking to estimate the prevalence of 1's (i.e. the rate of positive labels) in a very large dataset that I have. However, I'd like to report this percentage as a 95% credible interval rather than a point estimate, taking the model's uncertainties into account. These are the steps I'm hoping to perform: train a binary classifier on labelled training data, then use a labelled test set to estimate the specificity and sensitivity of …
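The interval described can be sketched with a small Monte Carlo simulation. All counts below are hypothetical placeholders, not from the question; sensitivity, specificity, and the apparent positive rate each get a Beta posterior, and the draws are combined through the Rogan-Gladen correction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts (assumptions, not from the question):
tp, fn = 90, 10                    # positives in the labelled test set
tn, fp = 85, 15                    # negatives in the labelled test set
n_pos_pred, n_total = 3000, 10000  # predicted positives on the unlabelled data

# Posterior draws (Beta-Binomial with flat priors)
sens = rng.beta(tp + 1, fn + 1, size=100_000)
spec = rng.beta(tn + 1, fp + 1, size=100_000)
p_hat = rng.beta(n_pos_pred + 1, n_total - n_pos_pred + 1, size=100_000)

# Rogan-Gladen correction: true prevalence = (apparent - (1 - spec)) / (sens + spec - 1)
prev = (p_hat - (1 - spec)) / (sens + spec - 1)
prev = np.clip(prev, 0.0, 1.0)

lo, hi = np.percentile(prev, [2.5, 97.5])
print(f"95% interval for prevalence: [{lo:.3f}, {hi:.3f}]")
```

The width of the resulting interval is usually dominated by the uncertainty in sensitivity and specificity, not by the size of the unlabelled set.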
Category: Data Science

How to calculate lexical cohesion and semantic informativeness for a given dataset?

In 'Automatic construction of lexicons, taxonomies, ontologies, and other knowledge structures' they have mentioned; There are two slightly different classes of measure: lexical cohesion (sometimes called ‘unithood’ or ‘phraseness’), which quantifies the expectation of co-occurrence of words in a phrase (e.g., back-of-the-book index is significantly more cohesive than term name); and semantic informativeness (sometimes called ‘termhood’), which highlights phrases that are representative of a given document or domain. However, the review does not include the ways to calculate/derive these measures. …
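One common way to quantify lexical cohesion ('unithood'/'phraseness') is pointwise mutual information (PMI) over bigram counts, PMI(w1, w2) = log2(P(w1, w2) / (P(w1) P(w2))). A toy sketch with a made-up corpus:

```python
import math
from collections import Counter

# Toy corpus (hypothetical); real work would use a large domain corpus.
tokens = ("new york is a big city new york city is big "
          "the city is big the index of the book").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(w1, w2):
    """PMI of an adjacent word pair: log2(P(w1,w2) / (P(w1)*P(w2)))."""
    p12 = bigrams[(w1, w2)] / n_bi
    p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log2(p12 / (p1 * p2))

# A cohesive phrase scores higher than a chance co-occurrence
print(pmi("new", "york"), pmi("is", "big"))
```

Semantic informativeness ('termhood') is usually measured differently, e.g. with TF-IDF-style contrasts between a document/domain and a background corpus.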
Category: Data Science

Customer Segmentation: Should I use a variable, representing a product, that is unpopular in the dataset for K-Means Clustering?

I am working with a data set that, besides customer age and income, gives the balance a customer holds in different types of bank accounts: Checking, Shares, Investment, Savings, Deposit, Mortgage, Loan, and Certificates. For accounts other than Checking, 0 means the customer does not hold that account. There are 9800 customer observations, with roughly 6000 checking accounts and 4000 savings accounts. For the other account types, there are fewer than 300 observations each. I have to use K-Means Clustering analysis …
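One thing worth checking before deciding is how a rare, mostly-zero balance column behaves after scaling. A minimal sketch with synthetic stand-in data (the distributions below are invented, not the question's dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Hypothetical stand-in for the bank data: age, income, and a rare account
# balance that is zero for ~97% of customers (account does not exist).
age = rng.normal(45, 12, n)
income = rng.normal(60_000, 15_000, n)
rare = np.where(rng.random(n) < 0.03, rng.normal(5_000, 1_000, n), 0.0)
X = np.column_stack([age, income, rare])

# After standardization, the few non-zero entries of a rare column get very
# large z-scores and can dominate the distance-based cluster assignment.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(labels))
```

Comparing cluster assignments with and without the rare column (e.g. via adjusted Rand index) is one concrete way to decide whether it adds signal or just noise.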
Category: Data Science

A clear visualization of a two-way ANOVA

To provide a full yet simple picture of a 3-level, one-way ANOVA, I use the following visualization, where variation within each group (the filled circles) and variation between the groups (black arrows) are easy to understand. But I'm wondering whether it is possible to extend the current visualization to a 2 x 3 two-way ANOVA (adding another factor with two groups to the current visualization)? (Note: the dashed vertical lines denote each group's mean.)
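The quantities such a plot would display are the sums of squares of the two-way decomposition. A numeric sketch for a balanced 2 x 3 design with invented cell effects, showing how total variation splits into two main effects, an interaction, and within-cell variation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Balanced 2 x 3 design with n observations per cell (all values synthetic)
a_levels, b_levels, n = 2, 3, 30
data = rng.normal(0.0, 1.0, (a_levels, b_levels, n))
data += np.array([[0.0, 0.5, 1.0], [1.0, 1.5, 2.0]])[:, :, None]  # cell means

grand = data.mean()
cell = data.mean(axis=2)            # per-cell means (the dashed lines)
a_mean = data.mean(axis=(1, 2))     # factor-A marginal means
b_mean = data.mean(axis=(0, 2))     # factor-B marginal means

ss_total = ((data - grand) ** 2).sum()
ss_a = b_levels * n * ((a_mean - grand) ** 2).sum()
ss_b = a_levels * n * ((b_mean - grand) ** 2).sum()
ss_ab = n * ((cell - a_mean[:, None] - b_mean[None, :] + grand) ** 2).sum()
ss_within = ((data - cell[:, :, None]) ** 2).sum()

print(f"SS_A={ss_a:.1f}  SS_B={ss_b:.1f}  SS_AB={ss_ab:.1f}  SS_within={ss_within:.1f}")
```

For a balanced design these four pieces add up exactly to the total sum of squares, which is what the extended visualization would need to convey: arrows for each main effect, a term for the interaction, and the filled circles for within-cell spread.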
Category: Data Science

Linear Regression bad results after log transformation

I have a dataset that has the following columns: the variable I'm trying to predict is "rent". My dataset looks a lot like the one in this notebook. I tried to normalize the rent and area columns using a log transformation, since both had positive skewness. Here are the rent and area distributions before and after the log transformation. I thought after these changes my regression models would improve and in fact they …
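One common pitfall with this setup: after fitting on log(rent), predictions must be transformed back with exp() before computing errors on the original scale, otherwise the comparison against an untransformed model is misleading. A minimal sketch with synthetic rent/area data (invented, standing in for the notebook's columns):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical rent ~ area relationship with multiplicative noise,
# which is the case where a log transform actually helps.
area = rng.uniform(30, 200, n)
rent = 12.0 * area * rng.lognormal(0.0, 0.3, n)

X = np.log(area).reshape(-1, 1)
y = np.log(rent)
model = LinearRegression().fit(X, y)

# Back-transform before scoring on the original scale
pred_rent = np.exp(model.predict(X))
mape = np.mean(np.abs(pred_rent - rent) / rent)
print(f"slope on log-log scale: {model.coef_[0]:.2f}, MAPE: {mape:.2%}")
```

If the noise in the real data is additive rather than multiplicative, the log transform can genuinely hurt, which is another possible explanation for the degraded results.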
Category: Data Science

How to generate a rule-based system based on binary data?

I have a dataset where each row is a sample and each column is a binary variable. $X_{i, j} = 1$ means that we've seen feature $j$ for sample $i$; $X_{i, j} = 0$ means that we haven't seen this feature yet, but we still might. We have around $1000$ binary variables and around $200$k samples. The target variable $y$ is categorical. What I'd like to do is find subsets of variables that precisely predict some $y_k$. …
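One standard approach for this kind of rule extraction is a shallow decision tree, which on binary features reads directly as IF/THEN rules over variable subsets. A scaled-down sketch with synthetic data and an invented ground-truth rule:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n, p = 5000, 20  # scaled-down stand-in for the 200k x 1000 matrix

X = (rng.random((n, p)) < 0.3).astype(int)
# Hypothetical ground-truth rule: y = 1 when features 0 AND 3 are both present
y = ((X[:, 0] == 1) & (X[:, 3] == 1)).astype(int)

# A shallow tree on binary features prints directly as IF/THEN rules
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=[f"f{j}" for j in range(p)])
print(rules)
print("train accuracy:", tree.score(X, y))
```

Association-rule mining (e.g. the Apriori family) is the other common route when many overlapping rules, each with its own support and confidence, are wanted rather than one tree.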
Category: Data Science

Which statistical parameters are most useful to detect anomalies and outliers: mean, max, min, var?

This time series contains time frames, each of which is 8K (frequencies) × 151 (time samples) per 0.5 s (1.2288 million samples per half-second overall). I need to detect anomalies across the rows (frequencies) and report which rows are anomalous, using an unsupervised learning method. Do you have an idea of which statistical parameter is most useful for this: the mean, max, min, median, variance, or any other parameter of these 151 samples? Which parameter should I use? (I …
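A simple unsupervised baseline is to summarise each row with several candidate statistics and flag rows whose statistic is far from the bulk using a robust (median/MAD) z-score. A scaled-down sketch with synthetic data and two injected anomalous rows:

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_time = 1024, 151   # scaled-down stand-in for an 8K x 151 frame

frame = rng.normal(0.0, 1.0, (n_freq, n_time))
frame[[10, 500]] += 5.0      # inject a level shift on two frequencies

# Candidate per-row statistics; which one works depends on the anomaly type:
# a pure level shift shows up in the mean but not in the variance.
stats = {
    "mean": frame.mean(axis=1),
    "var": frame.var(axis=1),
    "max": frame.max(axis=1),
}

def robust_flags(x, thresh=5.0):
    """Flag entries far from the bulk using a median/MAD z-score."""
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))  # MAD -> sigma under normality
    return np.abs(x - med) / mad > thresh

flags = {name: np.where(robust_flags(s))[0] for name, s in stats.items()}
print({name: f.tolist() for name, f in flags.items()})
```

In practice it helps to run several statistics side by side as above, because level shifts, variance bursts, and single spikes are each caught by a different one.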
Category: Data Science

Confusion on Outliers

I am not able to decide how to detect outliers: when to go with the standard deviation, and when to go with the median. My understanding of the standard deviation approach is: if a data point is more than 2 standard deviations away from the mean, we consider it an outlier. Similarly for the median, we say that any data point not in between Q1 and Q3 is an outlier. So I am confused as to which one to choose. …
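The two rules can be put side by side on synthetic data (values invented). Note that the usual quartile-based rule is Tukey's fences, 1.5·IQR beyond Q1 and Q3, not "outside Q1..Q3", which would flag half of any dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, 200), [120.0])  # one clear outlier at 120

# Rule 1: mean +/- 2 standard deviations. Sensitive to the outlier itself,
# since the outlier inflates both the mean and the std it is judged against.
z_out = np.abs(x - x.mean()) > 2 * x.std()

# Rule 2: Tukey's fences, 1.5*IQR beyond [Q1, Q3]. Quartiles barely move
# when one extreme value is added, so the fences stay put.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_out = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print("flagged by 2-sigma rule:", int(z_out.sum()))
print("flagged by IQR rule:", int(iqr_out.sum()))
```

The rough guidance this illustrates: the median/IQR rule is robust (outliers don't distort the threshold), while the mean/std rule is appropriate when the data are close to normal and contamination is mild.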
Category: Data Science

How to find average lag time with variance & confidence of two time series

I have two variables as time series, one a consequence of the other. I would like to find the average time delay it takes the dependent variable to respond to the independent variable. Additionally, I would like to find the variance associated with the lag time and its respective confidence level. I am unsure how to go about this in a statistically valid way, but I am using Python. Currently I have used np.diff(np.sign(np.diff(df))) to isolate …
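The standard tool for an average lag is the cross-correlation function: compute the correlation of x[t] with y[t+k] over candidate lags k and take the argmax. A minimal sketch on synthetic series with an invented true lag of 7 samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_lag = 2000, 7

# Hypothetical pair: y is x delayed by 7 samples plus noise
x = rng.normal(0.0, 1.0, n)
y = np.roll(x, true_lag) + rng.normal(0.0, 0.5, n)

def estimate_lag(x, y, max_lag=50):
    """Lag k in 0..max_lag maximising the correlation of x[t] with y[t+k]."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    corrs = [np.mean(x[: len(x) - k] * y[k:]) if k else np.mean(x * y)
             for k in range(max_lag + 1)]
    return int(np.argmax(corrs))

lag = estimate_lag(x, y)
print("estimated average lag:", lag)
```

For the variance and confidence level, one common route is a block bootstrap: re-estimate the lag on resampled blocks of the series and report the spread of the estimates.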
Category: Data Science

Is there a safe and simple way to estimate a standard deviation for the next subset?

I receive only the standard deviation of a value $v$ (which is, by the way, normally distributed) from a sensor every 4 minutes, but I need to provide a standard deviation $\sigma$ for each 15-minute interval. Is there a safe way to do this? Two approaches came to mind: 1) One safe way is to take the mean, generate possible values using the standard deviation of the 4-minute interval for the 15-minute period (15*60 values), and calculate …
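If the sensor can also report each window's mean and sample count, no resampling is needed: the combined standard deviation follows exactly from the identity E[v²] = var + mean², pooled across windows. A sketch with synthetic 4-minute windows (the values and window sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: three ~4-minute windows of raw samples covering ~15 minutes.
# Only (mean, std, n) per window are assumed to be transmitted.
windows = [rng.normal(10.0, 2.0, 240) for _ in range(3)]
stats = [(w.mean(), w.std(), len(w)) for w in windows]

# Pooled variance from summary statistics alone:
#   sigma^2 = sum(n_i * (s_i^2 + m_i^2)) / N - grand_mean^2
n_total = sum(n for _, _, n in stats)
grand_mean = sum(n * m for m, _, n in stats) / n_total
pooled_var = sum(n * (s**2 + m**2) for m, s, n in stats) / n_total - grand_mean**2

# Sanity check against the std of all raw samples concatenated
combined = np.concatenate(windows)
print(np.sqrt(pooled_var), combined.std())
```

Note this exact combination requires the per-window means; from standard deviations alone the between-window component of the 15-minute variance cannot be recovered.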
Topic: statistics
Category: Data Science

Relationships between groups of features against independent variables

I have several groups of features that I'd like to test against independent variables. The idea is to find which groups tend to be associated with a specific value of an independent variable. Let's take the following example, where s are samples, f are features, and i are independent variables associated with each s:

        s1   s2   s3   s4   ...
    f1  0.3  0.9  0.7  0.8
    f2  ...
    f3  ...
    f4  ...
    f5  ...
    i1  low  low  med  high
    i2  0.9  1.6  2.3  …
Category: Data Science

Statistical learning for data-limited systems

I'm currently conducting a review for quantitative methods being used for tropical inland fisheries. One of the major problems for modeling methods in tropical inland fisheries is the lack of data available. Fisheries assessments are difficult with widely distributed, small-scale fisheries. As many of the people living in tropical regions are subsistence fishers, they directly consume fish without any recordings of the catch. I'm trying to find statistical/mathematical modeling methods that are able to deal with data-limited systems. I do …
Category: Data Science

Uncertainties in non-convex optimization problems (neural networks)

How do you treat statistical uncertainties coming from non-convex optimization problems? More specifically, suppose you have a neural network. It is well known that the loss is not convex; the optimization procedure with any approximate stochastic optimizer, together with the random weight initialization, introduces some randomness into the training process, translating into different "optimal" regions being reached at the end of training. Now, supposing that any minimum of the loss is an acceptable solution, there are no guarantees that those minima …
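One practical way to quantify this randomness is to repeat the optimization from many random initializations and report the spread of the solutions found (the idea behind deep ensembles). A toy sketch with a 1-D non-convex loss standing in for a network's loss surface:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-convex loss with several local minima (invented stand-in for a NN loss)
loss = lambda w: np.sin(3 * w) + 0.1 * w**2
grad = lambda w: 3 * np.cos(3 * w) + 0.2 * w

# "Ensemble over restarts": optimize from random initializations and treat
# the spread of the final losses as optimization-induced uncertainty.
finals = []
for _ in range(50):
    w = rng.uniform(-5.0, 5.0)
    for _ in range(500):            # plain gradient descent
        w -= 0.01 * grad(w)
    finals.append(loss(w))

finals = np.array(finals)
print(f"final loss across restarts: {finals.mean():.3f} +/- {finals.std():.3f}")
```

For a real network the same protocol applies with different seeds per run; the run-to-run spread of the validation metric is the uncertainty the question asks about, separate from the data-driven (e.g. bootstrap) uncertainty.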
Category: Data Science

What are the requirements for a word list to be used for Bayesian inference?

Intro: I need an input file of 5-letter English words to train my Bayesian model to infer the stochastic dependency between each position. For instance, is the probability of a letter at position 5 dependent on the probability of a letter at position 1, etc.? At the end of the day, I want to train this Bayesian network in order to be able to solve the Wordle game. What is Wordle? It's a game where you guess 5 …
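The positional-dependency question can be checked directly on any candidate word list by comparing a joint probability against the product of its marginals. A sketch with a tiny invented list (a real run would load the full ~13k-word Wordle list):

```python
from collections import Counter

# Tiny hypothetical word list standing in for the real one
words = ["crane", "slate", "trace", "robot", "shale",
         "stale", "crate", "blimp", "plane", "spade"]
n = len(words)

# Empirical marginals P(letter at position j) and the joint over positions 0 and 4
pos = [Counter(w[j] for w in words) for j in range(5)]
joint = Counter((w[0], w[4]) for w in words)

def dependency(a, b):
    """Return (P(pos0=a, pos4=b), P(pos0=a) * P(pos4=b)) for comparison."""
    return joint[(a, b)] / n, (pos[0][a] / n) * (pos[4][b] / n)

# If the two numbers differ, the positions are not independent in this list
print(dependency("s", "e"))
```

So the main requirement on the word list is simply that it matches the game's answer distribution; the independence test itself needs nothing beyond raw letter counts.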
Category: Data Science

Polynomial regression with two variables. How can I find expressions to describe the coefficients?

I'm not sure if this is an appropriate place for this question, so please feel free to redirect me if it is not. I just moved it from Super User, where it seemed like there weren't many similar questions. Please also feel free to suggest tags. I'm trying to modify part of an old piece of code. It uses regression to describe the relationship between two variables (described as "a fourth order power series in X and y"). I know very little …
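A two-variable power series fit comes down to ordinary least squares on a design matrix whose columns are the monomials x^i·y^j. A sketch with synthetic data and a second-order series for brevity; the same loop extends to the fourth order the old code mentions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Hypothetical data generated from known coefficients, to verify the fit
x, y = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
z = 1.0 + 2.0 * x - 3.0 * y + 0.5 * x * y + rng.normal(0, 0.01, n)

# Fit z = sum over i+j <= order of c_ij * x^i * y^j by least squares
order = 2
terms = [(i, j) for i in range(order + 1) for j in range(order + 1 - i)]
A = np.column_stack([x**i * y**j for i, j in terms])
coef, *_ = np.linalg.lstsq(A, z, rcond=None)

for (i, j), c in zip(terms, coef):
    print(f"x^{i} y^{j}: {c:+.4f}")
```

The fitted `coef` array is exactly the set of coefficient expressions the question asks about, one per (i, j) term.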
Category: Data Science

Store's unseen items sales forecasting

I am working on a sales forecasting problem. I am able to provide the algorithm with data about which items were sold and which were not. How can I provide the algorithm with information about items that are not present in the store? Is there any way we could encode this information in the data, or do any other algorithms accept this kind of information? Currently, I am using Neural Networks and Random Forests to forecast sales.
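One common encoding is to expand the data to the full store × item grid and add an explicit availability indicator, so "not stocked" is distinguishable from "stocked but sold zero". A minimal sketch with an invented toy table:

```python
import pandas as pd

# Hypothetical store-item sales table; item "z" is stocked nowhere,
# and item "y" is stocked only in store A.
sales = pd.DataFrame({
    "store": ["A", "A", "B"],
    "item":  ["x", "y", "x"],
    "units": [10, 3, 7],
})
all_items = ["x", "y", "z"]

# Build the full store x item grid, then mark which pairs were stocked
full = pd.MultiIndex.from_product(
    [sales["store"].unique(), all_items], names=["store", "item"]
).to_frame(index=False)
full = full.merge(sales, on=["store", "item"], how="left")
full["available"] = full["units"].notna().astype(int)
full["units"] = full["units"].fillna(0)
print(full)
```

Both a Random Forest and a neural network can then consume `available` as an ordinary input feature, letting the model learn that zero sales under `available=0` carries no demand signal.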
Category: Data Science

Sampling methods for Text datasets (NLP)

I am working on two text datasets: one has 68k text samples and the other has 100k. I have encoded the text datasets into BERT embeddings. Text sample > 'I am working on NLP' ==> BERT encoding ==> [0.98, 0.11, 0.12, ..., nth] # raw text 68k # BERT encoding [68000, 1024] I want to try different custom NLP models on these embeddings, but the datasets are too large to test a model's performance quickly. To check different models quickly, the best …
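For quick model comparisons, a stratified subsample keeps the label mix of the full dataset intact. A sketch with a random stand-in matrix in place of the real [68000, 1024] BERT embeddings (smaller width here to keep it light):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for the BERT embedding matrix and (hypothetical) labels
X = rng.normal(0, 1, (68_000, 32)).astype(np.float32)
y = rng.integers(0, 3, 68_000)

# Stratified 10% subsample: class proportions are preserved, so rankings
# of candidate models on the subsample stay representative.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0)

print(X_small.shape)
print(np.bincount(y_small) / len(y_small), np.bincount(y) / len(y))
```

Fixing `random_state` matters here: every candidate model should be evaluated on the same subsample, otherwise differences in scores partly reflect sampling noise.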
Category: Data Science

Interpreting interaction term coefficient in GLM/regression

I'm a psychology student trying to come up with a research plan involving a GLM. I'm thinking about adding an interaction term to the analysis, but I'm unsure about its interpretation. To make things simple, I'm going to use linear regression as an example. I'm expecting a (simplified) model like this: $$y = ax_{1} + bx_{2} + c(x_{1}*x_{2})+e$$ In my hypothesis, $x_{1}$ and $y$ are negatively correlated, and $x_{2}$ and $y$ are positively correlated. As for the correlation between $x_{1}$ …
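The interpretation can be made concrete by simulating data from exactly this model and recovering the coefficients: with an interaction, the marginal effect of $x_1$ is $a + c\,x_2$, i.e. it depends on the level of $x_2$. A sketch with invented coefficient values matching the hypothesis (negative $a$, positive $b$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulate from y = a*x1 + b*x2 + c*(x1*x2) + e with known coefficients
x1, x2 = rng.normal(0, 1, n), rng.normal(0, 1, n)
a, b, c = -1.0, 2.0, 0.5
y = a * x1 + b * x2 + c * x1 * x2 + rng.normal(0, 0.1, n)

# Ordinary least squares with intercept, x1, x2, and the interaction column
A = np.column_stack([np.ones(n), x1, x2, x1 * x2])
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
_, a_hat, b_hat, c_hat = coefs

# Marginal effect of x1 is (a + c*x2): negative for small x2, and it
# flips sign once x2 exceeds -a/c.
print(f"slope of x1 at x2=0: {a_hat:.2f}")
print(f"slope of x1 at x2=3: {a_hat + 3 * c_hat:.2f}")
```

This is why, with an interaction present, $a$ alone no longer summarizes the $x_1$-$y$ relationship; it is only the slope of $x_1$ at $x_2 = 0$.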
Category: Data Science
