What to do when one feature has very large importance/weight?

I am new to data science and am currently trying to predict customer churn for a company that offers subscription-based booking management software. Its customers are gyms. I have a small, unbalanced dataset of historical data (False: 670, True: 230) with two numerical predictors, age (days since subscription) and number of active days in the last month (days on which a customer, i.e. a gym, had bookings), and one categorical predictor, logo (boolean: whether the customer uploaded a logo in the software). Predictors have the following …
Category: Data Science

Neural network is not giving the expected output after training in Python

My neural network is not giving the expected output after training in Python. Is there any error in the code? Is there any way to reduce the mean squared error (MSE)? I tried to train the network (run the program) repeatedly, but it is not learning; instead, it gives the same MSE and output every time. Here is the data I used: https://drive.google.com/open?id=1GLm87-5E_6YhUIPZ_CtQLV9F9wcGaTj2 Here is my code:

    # load and evaluate a saved model
    from numpy import loadtxt
    from tensorflow.keras.models import load_model
    …
Category: Data Science

Keras: very low accuracy that saturates after a few epochs of training

I am very new to the data science domain and jumped directly to TensorFlow models. I've worked through the examples provided on the website before, but this is my first project using it. I am building a cricket score predictor using Keras and TensorFlow. I have a dataset of player details in a CSV containing the columns "striker", "non_striker", "bowler", "run_per_ball", "run_per_ball_avg", and "ball_count". "ball_count" and "run_per_ball" are the labels of the model and the rest are features. I have a total of 51555 rows …
Category: Data Science

Reviewing a paper - common practice

I've been asked to review a paper in which the authors compare their new model (let's call it Model A) to other models (B, C, and D), and conclude theirs is superior on some metric (I know, big surprise!). Here's the problem: in my research, my supervisors always instructed me to code up the competing models and compare my model that way. The paper I'm reviewing, by contrast, just quotes results from previous literature. To clarify, here's what I would …
Category: Data Science

ML, Statistics and Mathematics

I have just started getting my feet wet in ML, and every time I try delving deeper into the concepts or code, I run into the mathematics and its cryptic notation. Coming from a computer science background, I understand some of it, but most of it goes over my head. Take, for example, the formulae below from this page: I really want to understand them, but I get confused and give up every time. Can you please suggest how to start with …
Category: Data Science

Error while running lasso.py

The following error is generated while running lasso.py. Can anybody help fix it? Here is the code:

    from cvxpy import *
    import numpy as np
    import cvxopt
    from multiprocessing import Pool

    # Problem data.
    n = 10
    m = 5
    A = cvxopt.normal(n,m)
    b = cvxopt.normal(n)
    gamma = Parameter(nonneg=True)

    # Construct the problem.
    x = Variable(m)
    objective = Minimize(sum_squares(A*x - b) + gamma*norm(x, 1))
    p = Problem(objective)

    # Assign a value to gamma and find the …
Category: Data Science

What's the right input format for GPT-2 in NLP?

I'm fine-tuning a pre-trained GPT-2 for text summarization. The dataset contains 'text' and 'reference summary'. My question is how to add special tokens to get the right input format. Currently I'm thinking of doing it like this: example1 <BOS> text <SEP> reference summary <EOS> , example2 <BOS> text <SEP> reference summary <EOS> , ..... Is this correct? If so, a follow-up question is whether the max token length (i.e. 1024 for GPT-2) also applies to the concatenated length of the text and the reference summary. Any comment …
Category: Data Science
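
The input layout described in the question above can be sketched roughly as follows. This is only an illustration, assuming the text and summary are already tokenized into integer ids; `build_example` and the three special-token ids are hypothetical names and values, not part of any library. It treats the 1024-token window as a budget for the whole concatenated sequence (text, summary, and special tokens together) and truncates the text, not the summary, to fit.

```python
from typing import List

MAX_LEN = 1024  # GPT-2 context window, as mentioned in the question


def build_example(text_ids: List[int], summary_ids: List[int],
                  bos_id: int, sep_id: int, eos_id: int,
                  max_len: int = MAX_LEN) -> List[int]:
    """Return <BOS> text <SEP> summary <EOS> as one id sequence.

    The length limit covers the whole concatenation, so the text is
    truncated to whatever room is left after the summary and the
    three special tokens.
    """
    budget = max_len - 3 - len(summary_ids)
    if budget < 0:
        raise ValueError("summary alone does not fit in the context window")
    return [bos_id] + text_ids[:budget] + [sep_id] + summary_ids + [eos_id]


# Hypothetical special-token ids (assumed to have been added to the vocabulary).
BOS, SEP, EOS = 50257, 50258, 50259

long_text = list(range(2000))   # stand-in for a tokenized article
summary = list(range(100))      # stand-in for a tokenized summary
ids = build_example(long_text, summary, BOS, SEP, EOS)
print(len(ids))
```

Truncating the text side keeps the full reference summary available as a training target; whether that is the right trade-off depends on the dataset.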

Unable to generate useful insights from high-cardinality data

I'm working on CRM data. I did some cleaning and encoding and ran a decision tree classifier, from which I plotted a feature_importance graph. From that I found that the sales person column is one of the important features, and it is a high-cardinality column (1300+ categories/sales persons). Now I'm trying to generate some insights on this column with respect to the target column (binary values). I would like to know, in general, how to create insights from such a large categorical column. P.S: Other columns are …
Category: Data Science

Handling IP addresses as features when creating machine learning model

I'm working on an ML model for fraud detection, and two of the features I have are sender_IP_address and receiver_IP_address. I think these are very important features that cannot be ignored. My question is, how can I handle this kind of feature? My dataset has around 100k rows and 80 columns. I know that IP is categorical data, and that I can use OneHotEncoder (for example), but from those 100k rows, I have around 70k unique IP addresses (one IP …
Category: Data Science
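
One possible way to avoid one-hot encoding tens of thousands of unique addresses is to derive a few lower-cardinality features from each IP. The sketch below uses only Python's standard `ipaddress` module; the helper names `ip_features` and `frequency_encode`, and the choice of a /24 (IPv4) or /64 (IPv6) prefix for grouping, are illustrative assumptions and not a recommendation from the original question.

```python
import ipaddress
from collections import Counter
from typing import Dict, List


def ip_features(ip: str) -> Dict[str, object]:
    """Derive simple features from one IP address string."""
    addr = ipaddress.ip_address(ip)
    # Assumed grouping granularity: /24 for IPv4, /64 for IPv6.
    prefix = 24 if addr.version == 4 else 64
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return {
        "ip_int": int(addr),                     # numeric form of the address
        "is_private": addr.is_private,           # private/reserved range flag
        "subnet": str(network.network_address),  # coarser grouping key
    }


def frequency_encode(values: List[str]) -> List[int]:
    """Replace each value by how often it occurs in the list."""
    counts = Counter(values)
    return [counts[v] for v in values]


senders = ["192.168.1.57", "8.8.8.8", "192.168.1.57"]
print([ip_features(ip) for ip in senders])
print(frequency_encode(senders))
```

Grouping by subnet or encoding by frequency reduces the number of distinct categories, at the cost of losing the exact address; which trade-off is acceptable for fraud detection would need to be checked on the actual data.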

ValueError: continuous is not supported

I am working on a regression problem and building a model using Random Forest Regressor, but while trying to get the accuracy I am getting ValueError: continuous is not supported.

    train=pd.read_csv(r"C:\Users\DELL\OneDrive\Documents\BigMart data\Train.csv")
    test=pd.read_csv(r"C:\Users\DELL\OneDrive\Documents\BigMart data\Test.csv")
    df=pd.concat([train,test])
    df.head()

After data preprocessing and visualization, I tried to build the model:
Please help with the error.
Category: Data Science
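
The scoring code is not shown in the question above, so the following rests on an assumption: this error message typically appears when continuous regression targets are passed to a classification metric such as scikit-learn's `accuracy_score`. If that is the cause, a regression metric is the appropriate replacement. The sketch below computes RMSE and R² in plain Python on made-up numbers; `sklearn.metrics.mean_squared_error` and `sklearn.metrics.r2_score` are the library equivalents.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values, purely for illustration.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 7.5]
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```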

Finding the worst affected industry due to COVID in terms of unemployment

My goal is to find the industries worst affected by COVID-19 in terms of unemployment. For this task, I have monthly time series data of county-wise unemployment rates, and business distribution data. The business distribution data contains the number of establishments in each county by industry (Manufacturing: 121, Accommodation and Food Services: 564, Construction: 32, etc.). The unemployment rate data gives the monthly unemployment rate in each county. From this data, what would your …
Category: Data Science

Combining two separate confusion matrix results from two separate machine learning models to increase the overall true positive accuracy

Is it possible to add two confusion matrix results together to get a better final prediction, and if so, what steps are involved? We have calculated two confusion matrices, one from naive Bayes and one from a decision tree. The aim is to increase the true positive totals and lessen the false negatives.
Category: Data Science

Massively imbalanced data

I am dealing with time series data of 200K+ records (one per minute for 6 months) from a gas turbine, and I am trying to detect faults early (0, or 1 for a fault). The issues with the data are: 1. The fault occurred only 5 times (identified by observing sudden shutdowns), which makes the data hugely imbalanced. 2. (Unsupervised) There is no binary output. I used 2 of the variables as my output and used them for binary clustering (k-means), but the results are not very good as there are false …
Category: Data Science

model.fit vs model.evaluate give different results?

The following is a small snippet of the code, but I'm trying to understand the results of model.fit with train and test dataset vs the model.evaluate results. I'm not sure if they do not match up or if I'm not understanding how to read the results?

    batch_size = 16
    img_height = 127
    img_width = 127
    channel = 3  # RGB
    train_dataset = image_dataset_from_directory(Train_data_dir, shuffle=True, batch_size=batch_size, image_size=(img_height, img_width), class_names=class_names)
    ## Transfer learning code from mobilenetV2/imagenet here to create model
    initial_epochs = …
Category: Data Science

List of main statistical models

I am not able to find a list of the main statistical models. Is it possible to divide statistical models into categories such as supervised (regression, classification) vs. unsupervised (clustering), or is that a distinction used in the field of machine learning but not for categorizing statistical models? Thank you
Category: Data Science

Two steps optimization of a credit card limit

I have a problem similar to the one in the title, but not the same; the problem in the title lets me explain the dynamics of what I need. I have to determine the optimal value of a variable called QUOTA or LIMIT for a credit card. The goal of the model is to let me minimize the probability of default, given this variable and others that characterize my customer. What is the best way to determine …
Category: Data Science

In Python, how can I transfer/remove duplicate columns from one dataset, such that the rows and columns of all datasets would be equal?

So I've been trying to improve my Random Decision Tree model for the Titanic Challenge on Kaggle by introducing a Validation Dataset, and now I encounter this roadblock, as shown by the images below: [image: Validation Dataset] [image: Test Dataset] After inspecting these datasets using the .info function, I've found that the Validation Dataset contains 178 and 714 non-null floats, while the Test Dataset contains an assorted 178 and 419 non-null floats and integers. Further, the Datasets contain duplicate rows, which I …
Category: Data Science

How to build an unbiased predictive ML model when records of the event are few compared to the total number of records?

I am trying to build a model that will predict the communication loss of a wireless device. For now I am using RandomForestClassifier with Device and Location as the features. I am getting both the train score and the test score as 99%, so I am pretty sure the model is giving a biased result. One of the reasons might be that the records of communication loss events are very few compared to the records with no communication loss. Some …
Category: Data Science
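
To illustrate why a 99% score can be misleading on rare-event data like that in the question above, the sketch below uses synthetic labels (not the asker's data) with 1% positives and a model that always predicts "no event". Accuracy looks excellent while recall on the rare class is zero. The helper names are hypothetical; `sklearn.metrics.precision_score` and `recall_score` are the library equivalents.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels where 1 is the rare event."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy_precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Accuracy, precision and recall; precision/recall fall back to 0.0 when undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Synthetic data: 990 "no loss" records and 10 "loss" records.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # a model that never predicts the rare event
acc, prec, rec = accuracy_precision_recall(y_true, y_pred)
print(acc, prec, rec)
```

Reporting precision and recall for the rare class alongside accuracy makes this failure mode visible.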

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.