I am new to Data Science and am currently trying to predict customer churn for a company that offers subscription-based bookings-management software; its customers are gyms. I have a small, imbalanced dataset of historical data (False 670, True 230) with 2 numerical predictors: age (days since subscription) and number of active days in the last month (days on which a customer, i.e. a gym, had bookings), plus 1 categorical predictor: logo (boolean, whether the customer uploaded a logo in the software). Predictors have the following …
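For a setup like this, one possible baseline is a class-weighted logistic regression; the sketch below is only an illustration and assumes hypothetical column names (age_days, active_days, has_logo, churned) and a churn.csv file, none of which come from the question.

    # Minimal baseline sketch; class_weight='balanced' compensates for the 670/230 imbalance.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report

    df = pd.read_csv("churn.csv")                      # hypothetical file name
    X = df[["age_days", "active_days", "has_logo"]]    # hypothetical column names
    y = df["churned"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.25, random_state=0)

    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))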
I am new to Machine Learning and started solving the Titanic survivor problem on Kaggle. While solving it with Logistic Regression I fitted several models with polynomial features of degree $2, 3, 4, 5, 6$. Theoretically the accuracy on the training set should increase with the degree, yet it started decreasing past degree $2$. The graph is shown below.
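For reference, a sketch of how such an experiment is typically wired up in scikit-learn; X_train and y_train are assumed to already hold the numeric Titanic features and labels, and the scaling step and higher max_iter are additions of mine, since high-degree polynomial features can easily keep the solver from converging.

    # Training accuracy as a function of the polynomial degree (assumes X_train, y_train exist).
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler
    from sklearn.linear_model import LogisticRegression

    for degree in [2, 3, 4, 5, 6]:
        model = make_pipeline(
            PolynomialFeatures(degree=degree),
            StandardScaler(),
            LogisticRegression(max_iter=5000),
        )
        model.fit(X_train, y_train)
        print(degree, model.score(X_train, y_train))   # training-set accuracy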
I have a dataset where each row is a sample and each column is a binary variable. The meaning of $X_{i, j} = 1$ is that we've seen feature $j$ for sample $i$; $X_{i, j} = 0$ means that we haven't seen this feature yet, but we might in the future. We have around $1000$ binary variables and around $200k$ samples. The target variable $y$ is categorical. What I'd like to do is to find subsets of variables that precisely predict some $y_k$. …
From a conceptual standpoint I understand the trade-off involved with the ROC curve: you can increase the true-positive rate, but you will take on more false positives, and vice versa. I am wondering how one would target a specific point on the curve for a Logistic Regression model. Would you just raise the probability threshold for what constitutes a 0 or a 1 in the regression? (Like shifting the probability at which predictions start to get …
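One common way to target a specific operating point is indeed to threshold the predicted probabilities yourself rather than using the default 0.5. A minimal sketch, assuming a fitted sklearn LogisticRegression clf and a held-out set X_test, y_test (names assumed, not from the question):

    # Move along the ROC curve by choosing the decision threshold explicitly.
    import numpy as np
    from sklearn.metrics import roc_curve

    proba = clf.predict_proba(X_test)[:, 1]        # P(y = 1) for each sample
    fpr, tpr, thresholds = roc_curve(y_test, proba)

    threshold = 0.7                                 # stricter than the default 0.5
    y_pred = (proba >= threshold).astype(int)

Each candidate threshold corresponds to one (fpr, tpr) point returned by roc_curve, so you can pick the threshold whose point is closest to the operating point you want.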
I have a data set labelled with binary classes. I calculated the principal components from the data and applied the PC transformation. The goal is to find an optimal number of PCs so that the binary classification accuracy is good enough. I trained a binary classifier, sklearn.linear_model.LogisticRegressionCV (default parameters), on the PC-transformed data, with the number of PCs as the (hyper-)parameter being varied. I cannot interpret the resulting Accuracy vs. #PCs graph; why is it so strange? For …
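For reference, a sketch of the kind of loop being described, assuming X and y are the raw features and labels; the cross-validation details here are assumptions, not taken from the question:

    # Accuracy as a function of the number of principal components kept.
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    for n_pcs in range(1, X.shape[1] + 1):
        model = make_pipeline(PCA(n_components=n_pcs), LogisticRegressionCV())
        acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
        print(n_pcs, acc)

Putting PCA inside the pipeline keeps the transformation fitted only on each training fold, which avoids one common source of odd-looking accuracy curves.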
I'm working with a data source that provides itemised transactions, which I am aggregating into 1-hour blocks to derive a 'rate per hour' as the dependent (target) variable, i.e. like a time series. So far I've looked at Logistic Regression, Random Forest Regressor and Gradient Boosting Regressor and got reasonable results, but I am really trying to determine the weighting/impact of the independent variables, to see which have the biggest impact on the DV. Would there …
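One way to compare the variables across those models is to look at the coefficients of the linear model alongside the feature_importances_ of the tree ensembles. A rough sketch, assuming already fitted models named lin_reg, rf and gbr and a list feature_names (all hypothetical names, not from the question):

    # Compare how each model ranks the independent variables.
    import pandas as pd

    summary = pd.DataFrame({
        "feature": feature_names,
        "linear_coef": lin_reg.coef_.ravel(),        # sign and magnitude
        "rf_importance": rf.feature_importances_,     # impurity-based importance
        "gbr_importance": gbr.feature_importances_,
    })
    print(summary.sort_values("gbr_importance", ascending=False))

Note the linear coefficients are only comparable across features if the features were standardised first; permutation importance is a model-agnostic alternative.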
The log-odds have a linear relationship with the independent variables, which is why the log-odds equal a linear equation. What about the log of the probability? How is it related to the independent variables? Is there a way to check the relationship?
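For reference, writing the linear predictor as $\eta = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$, the log-probability follows from the logistic link but, unlike the log-odds, is not linear in the independent variables:
$$\log\frac{p}{1-p} = \eta, \qquad p = \frac{e^{\eta}}{1+e^{\eta}}, \qquad \log p = \eta - \log\left(1 + e^{\eta}\right).$$
So $\log p$ is a concave, softplus-shaped function of $\eta$: it approaches $\eta$ when $\eta$ is very negative and approaches $0$ when $\eta$ is very positive. One way to inspect the relationship empirically is to plot $\log \hat{p}$ from a fitted model against each predictor or against the fitted linear predictor.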
I would like to build a simple sentiment-analysis classifier using logistic regression. I downloaded a list of positive and negative words from cs.uic.edu; there are more than 6000 words, both positive and negative. A linear classifier has the form (Wikipedia reference): $$\sum_j w_j x_j$$ where $w_j$ is the weight for feature $x_j$. So, for example, if the weight of the word "awesome" is 3, then in the following sentence: "Food is awesome and music is awesome." according to the formula, it …
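With bag-of-words counts as features, the score is just the dot product of weights and counts, so a word that appears twice contributes its weight twice. A tiny sketch (the weight values here are made up purely for illustration):

    # Score = sum_j w_j * x_j, with x_j = count of word j in the sentence.
    from collections import Counter

    weights = {"awesome": 3, "bad": -2}                    # hypothetical weights
    counts = Counter("food is awesome and music is awesome".split())
    score = sum(weights.get(word, 0) * count for word, count in counts.items())
    print(score)                                           # 3 * 2 = 6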
I am creating a simple neural-network architecture, but I keep getting NaN in the results and can't figure out why. Below is my code.

    import pandas
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.wrappers.scikit_learn import KerasClassifier
    from keras.utils import np_utils
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import KFold
    from sklearn.preprocessing import LabelEncoder
    from sklearn.pipeline import Pipeline
    from collections import Counter
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.preprocessing import StandardScaler
    #from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from tensorflow.keras …
I would like to ask about the theoretical approach of using Logistic Regression for customer data, and more specifically churn prediction (in BigQuery and Python). I have customer data for an online shop and I would like to predict whether a customer will churn based on some characteristics. I have created my dataset and the churn label (based on the hypothesis that if the customer hasn't bought anything in the last year, then it is assumed that the customer …
I am trying to implement a logistic regression algorithm myself as a learning exercise, but I am having trouble achieving accuracy similar to sklearn's logistic regression. Here is the code I am using (the dataset is the Titanic 'training.csv' dataset from Kaggle, which you can download here if you want to test this yourself).

    import numpy as np
    import random
    import matplotlib.pyplot as plt
    #%matplotlib inline

    def cost(X, Y, W):
        """ …
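For comparison, here is a minimal from-scratch version of the usual pieces (sigmoid, cross-entropy cost, batch gradient descent). This is a generic sketch, not the asker's code; it assumes X already includes a bias column of ones and Y is a column vector of 0/1 labels.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost(X, Y, W):
        # Mean binary cross-entropy; eps avoids log(0).
        eps = 1e-12
        p = sigmoid(X @ W)
        return -np.mean(Y * np.log(p + eps) + (1 - Y) * np.log(1 - p + eps))

    def fit(X, Y, lr=0.1, n_iter=5000):
        W = np.zeros((X.shape[1], 1))
        for _ in range(n_iter):
            grad = X.T @ (sigmoid(X @ W) - Y) / len(Y)   # gradient of the mean cost
            W -= lr * grad
        return W

Common reasons a hand-rolled version lags sklearn include missing feature scaling, a learning rate that is too small or too large, and the absence of the L2 regularisation that sklearn applies by default.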
I have the data below. I want to explain the relationship between 'Milieu', which has two levels, and 'DAM'. As you may notice, the blue population is included in the red population. Can I apply a logistic regression?
In the context of multiple regression, I am wondering if there is a way to decompose $$VIF_i = \frac{1}{1-R_i^2}$$ where $R_i^2$ is the R-squared obtained from regressing predictor $i$ on all the other predictors. I want to decompose $VIF_i$ (or $R_i^2$) into individual factors to see how much each individual predictor contributes to $VIF_i$ (or $R_i^2$). Someone recommended using the square of the partial correlation coefficient, saying that value is linearly related to $R_i^2$. …
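As a starting point, the $VIF_i$ values themselves (before any decomposition) can be computed directly; a sketch assuming the predictors sit in a pandas DataFrame X (a name I am assuming here):

    # VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing column i on the other columns.
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    X_const = sm.add_constant(X)                  # intercept for each auxiliary regression
    vifs = [variance_inflation_factor(X_const.values, i)
            for i in range(1, X_const.shape[1])]  # skip the constant column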
I got this strange behavior when deploying my logistic regression, trained in scikit-learn, into production. I trained the model on my own machine and stored it as a .pickle file. I use the same set of data both locally and on the server side (with Docker), generating four columns for each sample in this binary classification problem: probability_of_class_0, probability_of_class_1, y_true, y_predict, where y_true and y_predict refer to the true label and the predicted label respectively for that sample row/record. And …
Given a multi-class logistic classifier $f(x)=\operatorname{argmax}(\operatorname{softmax}(Ax + \beta))$ and a specific class of interest $y$, is it possible to construct a binary logistic classifier $g(x)=(\sigma(\alpha^T x + b) > 0.5)$ such that $g(x)=y$ if and only if $f(x)=y$?
Given the relatively simple form of the standard logistic regression model, I was wondering if there is an exact calculation of SHAP values for logistic regression. To be clear, I am looking for a closed formula, depending on the features ($X_i$) and coefficients ($\beta_i$), for calculating Shapley values and their corresponding importance.
Here is my understanding of the relation between MLE and Gradient Descent in Logistic Regression; please correct me if I'm wrong: 1) MLE estimates the optimal parameters by taking the partial derivative of the log-likelihood function w.r.t. each parameter and equating it to 0. Gradient Descent, just like MLE, gives us the optimal parameters by taking the partial derivative of the loss function w.r.t. each parameter; GD also uses hyperparameters like the learning rate and step size in the process of obtaining …
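For concreteness, with the logistic log-likelihood
$$\ell(\beta) = \sum_i \left[ y_i\, x_i^T\beta - \log\!\left(1 + e^{x_i^T\beta}\right) \right],$$
MLE asks for the root of the score equation, while gradient ascent on $\ell$ (equivalently, gradient descent on the negative log-likelihood) iterates toward that root with a learning rate $\eta$:
$$\frac{\partial \ell}{\partial \beta} = \sum_i \left( y_i - \sigma(x_i^T\beta) \right) x_i = 0, \qquad \beta^{(t+1)} = \beta^{(t)} + \eta \sum_i \left( y_i - \sigma\!\left(x_i^T\beta^{(t)}\right) \right) x_i.$$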
Here's the thing: I have imbalanced data and I was thinking about using a SMOTE transformation. However, when doing that inside a sklearn pipeline, I get an error because of missing values. This is my code:

    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler

    # VARIABLE SELECTION
    categorical_features = ["MARRIED", "RACE"]
    continuous_features = ["AGE", "SALARY"]
    features = ["MARRIED", "RACE", "AGE", "SALARY"]

    # PIPELINE
    continuous_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("scaler", StandardScaler()),
        ]
    )
    categorical_transformer = Pipeline(
        steps=[ …
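One detail worth noting: scikit-learn's own Pipeline cannot hold SMOTE (it is a resampler with fit_resample, not a transformer), and SMOTE itself rejects NaNs, so the usual pattern is imblearn's Pipeline with SMOTE placed after the imputation/encoding step. A rough sketch under those assumptions, reusing the transformer names from the code above; the final LogisticRegression and the X_train/y_train names are placeholders of mine:

    # SMOTE goes after preprocessing so it never sees missing values.
    from imblearn.pipeline import Pipeline as ImbPipeline
    from imblearn.over_sampling import SMOTE
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression

    preprocessor = ColumnTransformer(
        transformers=[
            ("num", continuous_transformer, continuous_features),
            ("cat", categorical_transformer, categorical_features),
        ]
    )
    model = ImbPipeline(
        steps=[
            ("preprocess", preprocessor),
            ("smote", SMOTE(random_state=0)),
            ("clf", LogisticRegression(max_iter=1000)),
        ]
    )
    model.fit(X_train, y_train)    # assumes X_train / y_train already exist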
Fine-tuning is a concept commonly used in deep learning: we may have a pre-trained model and then fine-tune it on our specific task. Does that apply to simpler models, such as logistic regression? For example, let's say I have a dataset with attribute variables of an animal and I want to classify whether or not it is a mammal. The labels on that dataset are only "mammal"/"not mammal". I then train a logistic regression model for this …
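In scikit-learn, the closest analogues are warm-starting a LogisticRegression (the next fit continues from the previously learned coefficients) or incremental updates via SGDClassifier.partial_fit. A small sketch, where X_old/y_old and X_new/y_new are hypothetical "pre-training" and "fine-tuning" datasets:

    # Continue optimisation from previously learned coefficients.
    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression(warm_start=True, max_iter=1000)
    clf.fit(X_old, y_old)       # "pre-training" on the original data
    clf.fit(X_new, y_new)       # further fitting starts from the old coefficients

Note this only reuses the coefficients as a starting point; unlike deep-learning fine-tuning there are no frozen layers, since the model is a single linear layer.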