Estimating class prevalence in unlabelled data after predicting labels with a binary classifier

I'm looking to estimate the prevalence of 1's (i.e. the rate of positive labels) in a very large dataset that I have. However, I'd like to report this percentage as a 95% credible interval rather than a point estimate, taking the model's uncertainties into account. These are the steps I'm hoping to perform: train a binary classifier on labelled training data, then use a labelled test set to estimate the specificity and sensitivity of …
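The interval described can be sketched with a small Monte Carlo simulation. All counts below are hypothetical placeholders, not from the question; sensitivity, specificity, and the apparent positive rate each get a Beta posterior, and the draws are combined through the Rogan-Gladen correction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts (assumptions, not from the question):
tp, fn = 90, 10                    # positives in the labelled test set
tn, fp = 85, 15                    # negatives in the labelled test set
n_pos_pred, n_total = 3000, 10000  # predicted positives on the unlabelled data

# Posterior draws (Beta-Binomial with flat priors)
sens = rng.beta(tp + 1, fn + 1, size=100_000)
spec = rng.beta(tn + 1, fp + 1, size=100_000)
p_hat = rng.beta(n_pos_pred + 1, n_total - n_pos_pred + 1, size=100_000)

# Rogan-Gladen correction: true prevalence = (apparent - (1 - spec)) / (sens + spec - 1)
prev = (p_hat - (1 - spec)) / (sens + spec - 1)
prev = np.clip(prev, 0.0, 1.0)

lo, hi = np.percentile(prev, [2.5, 97.5])
print(f"95% interval for prevalence: [{lo:.3f}, {hi:.3f}]")
```

The width of the resulting interval is usually dominated by the uncertainty in sensitivity and specificity, not by the size of the unlabelled set.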
Category: Data Science

How to calculate lexical cohesion and semantic informativeness for a given dataset?

In 'Automatic construction of lexicons, taxonomies, ontologies, and other knowledge structures' they have mentioned; There are two slightly different classes of measure: lexical cohesion (sometimes called ‘unithood’ or ‘phraseness’), which quantifies the expectation of co-occurrence of words in a phrase (e.g., back-of-the-book index is significantly more cohesive than term name); and semantic informativeness (sometimes called ‘termhood’), which highlights phrases that are representative of a given document or domain. However, the review does not include the ways to calculate/derive these measures. …
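One common way to quantify lexical cohesion ('unithood'/'phraseness') is pointwise mutual information (PMI) over bigram counts, PMI(w1, w2) = log2(P(w1, w2) / (P(w1) P(w2))). A toy sketch with a made-up corpus:

```python
import math
from collections import Counter

# Toy corpus (hypothetical); real work would use a large domain corpus.
tokens = ("new york is a big city new york city is big "
          "the city is big the index of the book").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(w1, w2):
    """PMI of an adjacent word pair: log2(P(w1,w2) / (P(w1)*P(w2)))."""
    p12 = bigrams[(w1, w2)] / n_bi
    p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log2(p12 / (p1 * p2))

# A cohesive phrase scores higher than a chance co-occurrence
print(pmi("new", "york"), pmi("is", "big"))
```

Semantic informativeness ('termhood') is usually measured differently, e.g. with TF-IDF-style contrasts between a document/domain and a background corpus.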
Category: Data Science

Customer Segmentation: Should I use a variable, representing a product, that is unpopular in the dataset for K-Means Clustering?

I am working with a data set that, besides customer age and income, gives the balance a customer holds in different types of bank accounts: Checking, Shares, Investment, Savings, Deposit, Mortgage, Loan, and Certificates. For accounts other than Checking, 0 means the customer does not hold that account. There are 9800 customer observations, with roughly 6000 checking accounts and 4000 savings accounts. For the other account types, there are fewer than 300 observations each. I have to use K-Means Clustering analysis …
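One thing worth checking before deciding is how a rare, mostly-zero balance column behaves after scaling. A minimal sketch with synthetic stand-in data (the distributions below are invented, not the question's dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Hypothetical stand-in for the bank data: age, income, and a rare account
# balance that is zero for ~97% of customers (account does not exist).
age = rng.normal(45, 12, n)
income = rng.normal(60_000, 15_000, n)
rare = np.where(rng.random(n) < 0.03, rng.normal(5_000, 1_000, n), 0.0)
X = np.column_stack([age, income, rare])

# After standardization, the few non-zero entries of a rare column get very
# large z-scores and can dominate the distance-based cluster assignment.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(labels))
```

Comparing cluster assignments with and without the rare column (e.g. via adjusted Rand index) is one concrete way to decide whether it adds signal or just noise.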
Category: Data Science

A clear visualization of a two-way ANOVA

To provide a full yet simple picture of a 3-level, one-way ANOVA, I use the following visualization, where variation within each group (the filled circles) and variation between the groups (black arrows) are easy to understand. But I'm wondering whether it is possible to extend the current visualization to a 2 x 3 two-way ANOVA (adding another factor with two groups to the current visualization)? (Note: the dashed vertical lines denote each group's mean.)
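The quantities such a plot would display are the sums of squares of the two-way decomposition. A numeric sketch for a balanced 2 x 3 design with invented cell effects, showing how total variation splits into two main effects, an interaction, and within-cell variation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Balanced 2 x 3 design with n observations per cell (all values synthetic)
a_levels, b_levels, n = 2, 3, 30
data = rng.normal(0.0, 1.0, (a_levels, b_levels, n))
data += np.array([[0.0, 0.5, 1.0], [1.0, 1.5, 2.0]])[:, :, None]  # cell means

grand = data.mean()
cell = data.mean(axis=2)            # per-cell means (the dashed lines)
a_mean = data.mean(axis=(1, 2))     # factor-A marginal means
b_mean = data.mean(axis=(0, 2))     # factor-B marginal means

ss_total = ((data - grand) ** 2).sum()
ss_a = b_levels * n * ((a_mean - grand) ** 2).sum()
ss_b = a_levels * n * ((b_mean - grand) ** 2).sum()
ss_ab = n * ((cell - a_mean[:, None] - b_mean[None, :] + grand) ** 2).sum()
ss_within = ((data - cell[:, :, None]) ** 2).sum()

print(f"SS_A={ss_a:.1f}  SS_B={ss_b:.1f}  SS_AB={ss_ab:.1f}  SS_within={ss_within:.1f}")
```

For a balanced design these four pieces add up exactly to the total sum of squares, which is what the extended visualization would need to convey: arrows for each main effect, a term for the interaction, and the filled circles for within-cell spread.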
Category: Data Science

Linear Regression bad results after log transformation

I have a dataset that has the following columns: the variable I'm trying to predict is "rent". My dataset looks a lot like the one in this notebook. I tried to normalize the rent and area columns using a log transformation, since both had positive skewness. Here are the rent and area distributions before and after the log transformation. I thought after these changes my regression models would improve and in fact they …
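One common pitfall with this setup: after fitting on log(rent), predictions must be transformed back with exp() before computing errors on the original scale, otherwise the comparison against an untransformed model is misleading. A minimal sketch with synthetic rent/area data (invented, standing in for the notebook's columns):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical rent ~ area relationship with multiplicative noise,
# which is the case where a log transform actually helps.
area = rng.uniform(30, 200, n)
rent = 12.0 * area * rng.lognormal(0.0, 0.3, n)

X = np.log(area).reshape(-1, 1)
y = np.log(rent)
model = LinearRegression().fit(X, y)

# Back-transform before scoring on the original scale
pred_rent = np.exp(model.predict(X))
mape = np.mean(np.abs(pred_rent - rent) / rent)
print(f"slope on log-log scale: {model.coef_[0]:.2f}, MAPE: {mape:.2%}")
```

If the noise in the real data is additive rather than multiplicative, the log transform can genuinely hurt, which is another possible explanation for the degraded results.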
Category: Data Science

How to generate a rule-based system based on binary data?

I have a dataset where each row is a sample and each column is a binary variable. $X_{i, j} = 1$ means that we've seen feature $j$ for sample $i$; $X_{i, j} = 0$ means that we haven't seen this feature yet, but we still might. We have around $1000$ binary variables and around $200$k samples. The target variable $y$ is categorical. What I'd like to do is find subsets of variables that precisely predict some $y_k$. …
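One standard approach for this kind of rule extraction is a shallow decision tree, which on binary features reads directly as IF/THEN rules over variable subsets. A scaled-down sketch with synthetic data and an invented ground-truth rule:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n, p = 5000, 20  # scaled-down stand-in for the 200k x 1000 matrix

X = (rng.random((n, p)) < 0.3).astype(int)
# Hypothetical ground-truth rule: y = 1 when features 0 AND 3 are both present
y = ((X[:, 0] == 1) & (X[:, 3] == 1)).astype(int)

# A shallow tree on binary features prints directly as IF/THEN rules
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=[f"f{j}" for j in range(p)])
print(rules)
print("train accuracy:", tree.score(X, y))
```

Association-rule mining (e.g. the Apriori family) is the other common route when many overlapping rules, each with its own support and confidence, are wanted rather than one tree.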
Category: Data Science

Which statistical parameters are most useful to detect anomalies and outliers: mean, max, min, var?

This time series contains time frames, each of which is 8K (frequencies) × 151 (time samples) per 0.5 s (1.2288 million samples per half-second overall). I need to detect anomalies across the rows (frequencies) and report which rows are anomalous, using an unsupervised learning method. Do you have an idea of which statistical parameter is most useful for this: the mean, max, min, median, variance, or any other parameter of these 151 samples? Which parameter should I use? (I …
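A simple unsupervised baseline is to summarise each row with several candidate statistics and flag rows whose statistic is far from the bulk using a robust (median/MAD) z-score. A scaled-down sketch with synthetic data and two injected anomalous rows:

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_time = 1024, 151   # scaled-down stand-in for an 8K x 151 frame

frame = rng.normal(0.0, 1.0, (n_freq, n_time))
frame[[10, 500]] += 5.0      # inject a level shift on two frequencies

# Candidate per-row statistics; which one works depends on the anomaly type:
# a pure level shift shows up in the mean but not in the variance.
stats = {
    "mean": frame.mean(axis=1),
    "var": frame.var(axis=1),
    "max": frame.max(axis=1),
}

def robust_flags(x, thresh=5.0):
    """Flag entries far from the bulk using a median/MAD z-score."""
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))  # MAD -> sigma under normality
    return np.abs(x - med) / mad > thresh

flags = {name: np.where(robust_flags(s))[0] for name, s in stats.items()}
print({name: f.tolist() for name, f in flags.items()})
```

In practice it helps to run several statistics side by side as above, because level shifts, variance bursts, and single spikes are each caught by a different one.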
Category: Data Science

Confusion on Outliers

I am not able to decide how to detect outliers: when to go with the standard deviation, and when to go with the median. My understanding of the standard deviation approach is: if a data point is more than 2 standard deviations away from the mean, we consider it an outlier. Similarly for the median, we say that any data point not in between Q1 and Q3 is an outlier. So I am confused as to which one to choose. …
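The two rules can be put side by side on synthetic data (values invented). Note that the usual quartile-based rule is Tukey's fences, 1.5·IQR beyond Q1 and Q3, not "outside Q1..Q3", which would flag half of any dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, 200), [120.0])  # one clear outlier at 120

# Rule 1: mean +/- 2 standard deviations. Sensitive to the outlier itself,
# since the outlier inflates both the mean and the std it is judged against.
z_out = np.abs(x - x.mean()) > 2 * x.std()

# Rule 2: Tukey's fences, 1.5*IQR beyond [Q1, Q3]. Quartiles barely move
# when one extreme value is added, so the fences stay put.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_out = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print("flagged by 2-sigma rule:", int(z_out.sum()))
print("flagged by IQR rule:", int(iqr_out.sum()))
```

The rough guidance this illustrates: the median/IQR rule is robust (outliers don't distort the threshold), while the mean/std rule is appropriate when the data are close to normal and contamination is mild.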
Category: Data Science

How to find average lag time with variance & confidence of two time series

I have two variables as time series, one a consequence of the other. I would like to find the average time delay it takes the dependent variable to respond to the independent variable. Additionally, I would like to find the variance associated with the lag time and its respective confidence level. I am unsure how to go about this in a statistically valid way, but I am using Python. Currently I have used np.diff(np.sign(np.diff(df))) to isolate …
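The standard tool for an average lag is the cross-correlation function: compute the correlation of x[t] with y[t+k] over candidate lags k and take the argmax. A minimal sketch on synthetic series with an invented true lag of 7 samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_lag = 2000, 7

# Hypothetical pair: y is x delayed by 7 samples plus noise
x = rng.normal(0.0, 1.0, n)
y = np.roll(x, true_lag) + rng.normal(0.0, 0.5, n)

def estimate_lag(x, y, max_lag=50):
    """Lag k in 0..max_lag maximising the correlation of x[t] with y[t+k]."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    corrs = [np.mean(x[: len(x) - k] * y[k:]) if k else np.mean(x * y)
             for k in range(max_lag + 1)]
    return int(np.argmax(corrs))

lag = estimate_lag(x, y)
print("estimated average lag:", lag)
```

For the variance and confidence level, one common route is a block bootstrap: re-estimate the lag on resampled blocks of the series and report the spread of the estimates.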
Category: Data Science

Is there a safe and simple way to estimate a standard deviation for the next subset?

I receive only the standard deviation of a value $v$ (which is, by the way, normally distributed) from a sensor every 4 minutes, but I need to provide a standard deviation $\sigma$ for each 15-minute interval. Is there a safe way to do this? Two approaches came to mind: 1) One safe way is to take the mean, generate possible values using the standard deviation of the 4-minute interval for the 15-minute period (15*60 values), and calculate …
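If the sensor can also report each window's mean and sample count, no resampling is needed: the combined standard deviation follows exactly from the identity E[v²] = var + mean², pooled across windows. A sketch with synthetic 4-minute windows (the values and window sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: three ~4-minute windows of raw samples covering ~15 minutes.
# Only (mean, std, n) per window are assumed to be transmitted.
windows = [rng.normal(10.0, 2.0, 240) for _ in range(3)]
stats = [(w.mean(), w.std(), len(w)) for w in windows]

# Pooled variance from summary statistics alone:
#   sigma^2 = sum(n_i * (s_i^2 + m_i^2)) / N - grand_mean^2
n_total = sum(n for _, _, n in stats)
grand_mean = sum(n * m for m, _, n in stats) / n_total
pooled_var = sum(n * (s**2 + m**2) for m, s, n in stats) / n_total - grand_mean**2

# Sanity check against the std of all raw samples concatenated
combined = np.concatenate(windows)
print(np.sqrt(pooled_var), combined.std())
```

Note this exact combination requires the per-window means; from standard deviations alone the between-window component of the 15-minute variance cannot be recovered.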
Topic: statistics
Category: Data Science

Relationships between groups of features against independent variables

I have several groups of features that I'd like to test against independent variables. The idea is to find which groups tend to be associated with a specific value of an independent variable. Let's take the following example, where s are samples, f are features, and i are independent variables associated with each s:

        s1   s2   s3   s4   ...
    f1  0.3  0.9  0.7  0.8
    f2  ...
    f3  ...
    f4  ...
    f5  ...
    i1  low  low  med  high
    i2  0.9  1.6  2.3  …
Category: Data Science

Statistical learning for data-limited systems

I'm currently conducting a review for quantitative methods being used for tropical inland fisheries. One of the major problems for modeling methods in tropical inland fisheries is the lack of data available. Fisheries assessments are difficult with widely distributed, small-scale fisheries. As many of the people living in tropical regions are subsistence fishers, they directly consume fish without any recordings of the catch. I'm trying to find statistical/mathematical modeling methods that are able to deal with data-limited systems. I do …
Category: Data Science

Uncertainties in non-convex optimization problems (neural networks)

How do you treat statistical uncertainties coming from non-convex optimization problems? More specifically, suppose you have a neural network. It is well known that the loss is not convex; the optimization procedure with any approximate stochastic optimizer, together with the random weight initialization, introduces some randomness into the training process, translating into different "optimal" regions being reached at the end of training. Now, supposing that any minimum of the loss is an acceptable solution, there are no guarantees that those minima …
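One practical way to quantify this randomness is to repeat the optimization from many random initializations and report the spread of the solutions found (the idea behind deep ensembles). A toy sketch with a 1-D non-convex loss standing in for a network's loss surface:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-convex loss with several local minima (invented stand-in for a NN loss)
loss = lambda w: np.sin(3 * w) + 0.1 * w**2
grad = lambda w: 3 * np.cos(3 * w) + 0.2 * w

# "Ensemble over restarts": optimize from random initializations and treat
# the spread of the final losses as optimization-induced uncertainty.
finals = []
for _ in range(50):
    w = rng.uniform(-5.0, 5.0)
    for _ in range(500):            # plain gradient descent
        w -= 0.01 * grad(w)
    finals.append(loss(w))

finals = np.array(finals)
print(f"final loss across restarts: {finals.mean():.3f} +/- {finals.std():.3f}")
```

For a real network the same protocol applies with different seeds per run; the run-to-run spread of the validation metric is the uncertainty the question asks about, separate from the data-driven (e.g. bootstrap) uncertainty.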
Category: Data Science

What are the requirements for a word list to be used for Bayesian inference?

Intro: I need an input file of 5-letter English words to train my Bayesian model to infer the stochastic dependency between each position. For instance, is the probability of a letter at position 5 dependent on the probability of a letter at position 1, etc.? At the end of the day, I want to train this Bayesian network in order to be able to solve the Wordle game. What is Wordle? It's a game where you guess 5 …
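The positional-dependency question can be checked directly on any candidate word list by comparing a joint probability against the product of its marginals. A sketch with a tiny invented list (a real run would load the full ~13k-word Wordle list):

```python
from collections import Counter

# Tiny hypothetical word list standing in for the real one
words = ["crane", "slate", "trace", "robot", "shale",
         "stale", "crate", "blimp", "plane", "spade"]
n = len(words)

# Empirical marginals P(letter at position j) and the joint over positions 0 and 4
pos = [Counter(w[j] for w in words) for j in range(5)]
joint = Counter((w[0], w[4]) for w in words)

def dependency(a, b):
    """Return (P(pos0=a, pos4=b), P(pos0=a) * P(pos4=b)) for comparison."""
    return joint[(a, b)] / n, (pos[0][a] / n) * (pos[4][b] / n)

# If the two numbers differ, the positions are not independent in this list
print(dependency("s", "e"))
```

So the main requirement on the word list is simply that it matches the game's answer distribution; the independence test itself needs nothing beyond raw letter counts.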
Category: Data Science

Polynomial regression with two variables. How can I find expressions to describe the coefficients?

I'm not sure if this is an appropriate place for this question, so please feel free to redirect me if it is not. I just moved it from Super User, where it seemed like there weren't many similar questions. Please also feel free to suggest tags. I'm trying to modify part of an old piece of code. It uses regression to describe the relationship between two variables (described as "a fourth order power series in X and y"). I know very little …
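A two-variable power series fit comes down to ordinary least squares on a design matrix whose columns are the monomials x^i·y^j. A sketch with synthetic data and a second-order series for brevity; the same loop extends to the fourth order the old code mentions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Hypothetical data generated from known coefficients, to verify the fit
x, y = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
z = 1.0 + 2.0 * x - 3.0 * y + 0.5 * x * y + rng.normal(0, 0.01, n)

# Fit z = sum over i+j <= order of c_ij * x^i * y^j by least squares
order = 2
terms = [(i, j) for i in range(order + 1) for j in range(order + 1 - i)]
A = np.column_stack([x**i * y**j for i, j in terms])
coef, *_ = np.linalg.lstsq(A, z, rcond=None)

for (i, j), c in zip(terms, coef):
    print(f"x^{i} y^{j}: {c:+.4f}")
```

The fitted `coef` array is exactly the set of coefficient expressions the question asks about, one per (i, j) term.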
Category: Data Science

Store's unseen items sales forecasting

I am working on a sales forecasting problem. I am able to provide the algorithm with data about which items were sold and which were not. How can I provide the algorithm with information about items that are not present in the store? Is there any way we could encode this information in the data, or do any other algorithms accept this kind of information? Currently, I am using Neural Networks and Random Forests to forecast sales.
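One common encoding is to expand the data to the full store × item grid and add an explicit availability indicator, so "not stocked" is distinguishable from "stocked but sold zero". A minimal sketch with an invented toy table:

```python
import pandas as pd

# Hypothetical store-item sales table; item "z" is stocked nowhere,
# and item "y" is stocked only in store A.
sales = pd.DataFrame({
    "store": ["A", "A", "B"],
    "item":  ["x", "y", "x"],
    "units": [10, 3, 7],
})
all_items = ["x", "y", "z"]

# Build the full store x item grid, then mark which pairs were stocked
full = pd.MultiIndex.from_product(
    [sales["store"].unique(), all_items], names=["store", "item"]
).to_frame(index=False)
full = full.merge(sales, on=["store", "item"], how="left")
full["available"] = full["units"].notna().astype(int)
full["units"] = full["units"].fillna(0)
print(full)
```

Both a Random Forest and a neural network can then consume `available` as an ordinary input feature, letting the model learn that zero sales under `available=0` carries no demand signal.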
Category: Data Science

Sampling methods for Text datasets (NLP)

I am working on two text datasets: one has 68k text samples and the other has 100k. I have encoded the text datasets into BERT embeddings. Text sample > 'I am working on NLP' ==> BERT encoding ==> [0.98, 0.11, 0.12, ..., nth] # raw text 68k # BERT encoding [68000, 1024] I want to try different custom NLP models on these embeddings, but the datasets are too large to test a model's performance quickly. To check different models quickly, the best …
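For quick model comparisons, a stratified subsample keeps the label mix of the full dataset intact. A sketch with a random stand-in matrix in place of the real [68000, 1024] BERT embeddings (smaller width here to keep it light):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for the BERT embedding matrix and (hypothetical) labels
X = rng.normal(0, 1, (68_000, 32)).astype(np.float32)
y = rng.integers(0, 3, 68_000)

# Stratified 10% subsample: class proportions are preserved, so rankings
# of candidate models on the subsample stay representative.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0)

print(X_small.shape)
print(np.bincount(y_small) / len(y_small), np.bincount(y) / len(y))
```

Fixing `random_state` matters here: every candidate model should be evaluated on the same subsample, otherwise differences in scores partly reflect sampling noise.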
Category: Data Science

Interpreting interaction term coefficient in GLM/regression

I'm a psychology student trying to come up with a research plan involving a GLM. I'm thinking about adding an interaction term to the analysis, but I'm unsure about its interpretation. To make things simple, I'm going to use linear regression as an example. I'm expecting a (simplified) model like this: $$y = ax_{1} + bx_{2} + c(x_{1}*x_{2})+e$$ In my hypothesis, $x_{1}$ and $y$ are negatively correlated, and $x_{2}$ and $y$ are positively correlated. As for the correlation between $x_{1}$ …
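The interpretation can be made concrete by simulating data from exactly this model and recovering the coefficients: with an interaction, the marginal effect of $x_1$ is $a + c\,x_2$, i.e. it depends on the level of $x_2$. A sketch with invented coefficient values matching the hypothesis (negative $a$, positive $b$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulate from y = a*x1 + b*x2 + c*(x1*x2) + e with known coefficients
x1, x2 = rng.normal(0, 1, n), rng.normal(0, 1, n)
a, b, c = -1.0, 2.0, 0.5
y = a * x1 + b * x2 + c * x1 * x2 + rng.normal(0, 0.1, n)

# Ordinary least squares with intercept, x1, x2, and the interaction column
A = np.column_stack([np.ones(n), x1, x2, x1 * x2])
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
_, a_hat, b_hat, c_hat = coefs

# Marginal effect of x1 is (a + c*x2): negative for small x2, and it
# flips sign once x2 exceeds -a/c.
print(f"slope of x1 at x2=0: {a_hat:.2f}")
print(f"slope of x1 at x2=3: {a_hat + 3 * c_hat:.2f}")
```

This is why, with an interaction present, $a$ alone no longer summarizes the $x_1$-$y$ relationship; it is only the slope of $x_1$ at $x_2 = 0$.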
Category: Data Science
