I used a RandomForestClassifier for my prediction model, but the printed output is either 0 or in decimals. What do I need to do for my model to show me 0s and 1s instead of decimals? Note: I used feature importance and removed the least important columns; the accuracy is still the same and the output hasn't changed much. Also, I have my estimators set to 1000. Do I increase or decrease this?

Edit:

target col: 1, 0, 0, 1
output col: 0.994 …
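Decimals from a scikit-learn classifier usually mean the probability output is being read rather than the class labels. A minimal sketch under that assumption (toy data standing in for the real dataset): `predict` returns hard 0/1 labels, while `predict_proba` returns decimal probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy binary-classification data standing in for the real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=1000, random_state=0)
model.fit(X_train, y_train)

print(model.predict(X_test)[:5])        # hard class labels: 0s and 1s
print(model.predict_proba(X_test)[:5])  # per-class probabilities: decimals
```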
I have a dataset that contains some user-specific details like gender, age range, region etc., and also behavioural data containing the historical click-through rate (last 3 months) for the different ad types shown to them. A sample of the data is shown below. It has 3 ad types, i.e. ecommerce, auto, healthcare, but the actual data contains more ad types. I need to build a regression model using XGBRegressor that can tell which ad should be shown to a given new user in order to …
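A minimal sketch of one common framing of this problem (all column names, encodings, and values below are hypothetical): train an `XGBRegressor` to predict CTR from user features plus an ad-type column, then score every candidate ad type for a new user and show the one with the highest predicted CTR.

```python
import pandas as pd
from xgboost import XGBRegressor

# Hypothetical long-format training data: one row per (user, ad_type) with observed CTR
df = pd.DataFrame({
    "age_range": [1, 1, 2, 2, 3, 3],
    "gender":    [0, 1, 0, 1, 0, 1],
    "ad_type":   [0, 1, 0, 1, 2, 2],   # e.g. 0=ecommerce, 1=auto, 2=healthcare
    "ctr":       [0.02, 0.05, 0.01, 0.04, 0.03, 0.06],
})
X, y = df.drop(columns="ctr"), df["ctr"]

model = XGBRegressor(n_estimators=200, max_depth=3)
model.fit(X, y)

# For a new user, predict CTR for every ad type and pick the best one
new_user = {"age_range": 2, "gender": 1}
candidates = pd.DataFrame([{**new_user, "ad_type": t} for t in [0, 1, 2]])
best = candidates.loc[model.predict(candidates).argmax(), "ad_type"]
print("ad type to show:", best)
```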
I am working on a binary classification using a random forest model and neural networks, in which I am using SHAP to explain the model predictions. I followed the tutorial and wrote the code below to get the waterfall plot shown below:

```python
row_to_show = 20
data_for_prediction = ord_test_t.iloc[row_to_show]  # use 1 row of data here; could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
rf_boruta.predict_proba(data_for_prediction_array)

explainer = shap.TreeExplainer(rf_boruta)
# Calculate SHAP values
shap_values = explainer.shap_values(data_for_prediction)

shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0], ord_test_t.iloc[row_to_show])
```

This generated the plot as …
I have manually created a random dataset around some mean value, and I have tried to use gradient descent linear regression to predict this simple mean value. I have done exactly as in the manual, and for some reason my predictor coefficients are going to infinity, even though it worked for another case. Why, in this case, can it not predict a simple 1.4 value?

```matlab
clear all;
n = 10000;
t = 1.4;
sigma_R = t*0.001;
min_value_t = t - sigma_R;
max_value_t = t + sigma_R;
y_data …
```
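Coefficients running off to infinity in plain gradient descent are most often a step-size problem rather than a data problem. A minimal sketch (in Python rather than MATLAB, with made-up data of the same shape: noisy samples around 1.4) showing the same loss converging with a small learning rate and diverging with a large one:

```python
import numpy as np

rng = np.random.default_rng(0)
y = 1.4 + 0.0014 * rng.standard_normal(10000)  # noisy samples around 1.4

def fit_mean(y, lr, steps=200):
    w = 0.0  # single coefficient: the predicted mean
    for _ in range(steps):
        grad = 2 * np.mean(w - y)  # gradient of the MSE loss wrt w
        w -= lr * grad
    return w

print(fit_mean(y, lr=0.1))  # converges to ~1.4
print(fit_mean(y, lr=1.1))  # step too large: |w| grows without bound
```

The update is w ← (1 − 2·lr)·w + 2·lr·ȳ, so it converges only when |1 − 2·lr| < 1; outside that range every step overshoots further, which is exactly the blow-up described.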
I am working with a dataset that has enough observations and ~10 variables:

- half of the variables are numeric
- the other half of the variables are categorical with 2-3 levels (demographics)
- one ID variable
- one last variable that holds the sales value: 0 for no sale, and the bill amount for a sale

Using this information, I want to understand which segments of my customers to market to. I am using R for the code, but that's not relevant here. :) I am confused about …
I'm working on forecasting daily volumes and have used a time series model to check the data for stationarity. However, I'm struggling to forecast the data with 90% accuracy. Right now the variation is extremely high and I'm just unable to bring it down. I've used the log method to transform my data. Please find below the link to the folder containing the ipynb and csv files: https://drive.google.com/drive/folders/1QUJkTucLPIf2vjo2mRmoBU6be083dYpQ?usp=sharing Any help will be greatly appreciated. Thanks, Rahul
Suppose I have data with two independent variables $X_1$, $X_2$ and one dependent variable, say $y$, as follows:

$X_1$: $x_{1,1}$, $x_{1,2}$, $x_{1,3}$
$X_2$: $x_{2,1}$, $x_{2,2}$, $x_{2,3}$
$y$: $y_1$, $y_2$, $y_3$

I built a machine learning model which is good. Now I want to generate predictions not just for the test data but for all possible combinations of the test data. For example, if our test data looks like

$X_1$: $a$, $b$, $c$
$X_2$: $p$, $q$, $r$

then I want predictions …
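A minimal sketch of this grid-prediction step, assuming a fitted scikit-learn-style model (the `LinearRegression` stand-in, column names, and values below are all hypothetical): build the Cartesian product of the test values with `itertools.product` and predict on every combination.

```python
import itertools
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in model trained on toy data; replace with the real fitted model
model = LinearRegression().fit([[1, 4], [2, 5], [3, 6]], [1, 2, 3])

x1_values = [1.0, 2.0, 3.0]  # the test values a, b, c for X1
x2_values = [4.0, 5.0, 6.0]  # the test values p, q, r for X2

# All 9 combinations (a,p), (a,q), ..., (c,r)
grid = pd.DataFrame(list(itertools.product(x1_values, x2_values)),
                    columns=["X1", "X2"])
grid["prediction"] = model.predict(grid[["X1", "X2"]].to_numpy())
print(grid)
```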
I am building a model to predict, say, house prices. Within my data I have sales and rentals; the Y variable is the price of either the sale or the rental. I also have a number of X variables to predict Y, such as number of bedrooms, bathrooms, meters squared etc. I believe that the model will first make a split on the variable "sales" vs "rentals", as this would reduce the loss function (RMSE) the most. Do you …
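One way to check that intuition directly, sketched with made-up data (the `is_rental` column, the prices, and the tree depth are all assumptions): fit a tree and read the root node's split feature from scikit-learn's `tree_` attribute.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 1000
is_rental = rng.integers(0, 2, n)  # 0 = sale, 1 = rental
bedrooms = rng.integers(1, 6, n)
# Rentals priced far below sales, so this split cuts RMSE the most
price = np.where(is_rental == 1, 2000, 500000) + 10000 * bedrooms

X = np.column_stack([is_rental, bedrooms])
tree = DecisionTreeRegressor(max_depth=3).fit(X, price)

feature_names = ["is_rental", "bedrooms"]
print("root split on:", feature_names[tree.tree_.feature[0]])  # -> is_rental
```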
I am working on a model which will allow me to predict how long it will take for a "job" to be completed, based on historical data. Each job has a handful of categorical characteristics (all independent), and some historic data might look like:

JobID  Manager  City      Design       ClientType  TaskDuration
a1     George   Brisbane  BigKahuna    Personal    10
a2     George   Brisbane  SmallKahuna  Business    15
a3     George   Perth     BigKahuna    Investor    7

Thus far, my model has been relatively basic, following these basic steps: …
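A minimal sketch of a common baseline for this shape of data, using the sample rows above (the pipeline and model choice are assumptions, not the asker's actual steps): one-hot encode the categorical columns and fit a regressor on the encoded matrix.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "Manager":      ["George", "George", "George"],
    "City":         ["Brisbane", "Brisbane", "Perth"],
    "Design":       ["BigKahuna", "SmallKahuna", "BigKahuna"],
    "ClientType":   ["Personal", "Business", "Investor"],
    "TaskDuration": [10, 15, 7],
})
X, y = df.drop(columns="TaskDuration"), df["TaskDuration"]

cat_cols = ["Manager", "City", "Design", "ClientType"]
pipe = Pipeline([
    # handle_unknown="ignore" keeps prediction from failing on unseen categories
    ("encode", ColumnTransformer([("ohe", OneHotEncoder(handle_unknown="ignore"), cat_cols)])),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])
pipe.fit(X, y)
print(pipe.predict(X))  # in-sample check only; real use needs many more rows
```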
I have several groups of features that I'd like to test against independent variables. The idea is to find which groups tend to be associated with a specific value of an independent variable. Let's take the following example, where s are samples, f are features, and i are independent variables associated with each s:

      s1   s2   s3   s4   ...
f1   0.3  0.9  0.7  0.8
f2   ...
f3   ...
f4   ...
f5   ...
i1   low  low  med  high
i2   0.9  1.6  2.3 …
I already referred to these posts here and here. I also posted here, but since there is no response, I am posting here. Currently, I am working on customer segmentation using their purchase data. So, my data has the below info for each customer. Based on the above linked posts, I see that for clustering we have to scale the variables if they are in different units etc. But if I scale/normalize all of them to a uniform scale, wouldn't I lose the information …
I'm trying to put together a script that classifies comments as either adequate or inadequate. I put a question up here earlier with all my code, but I think I've isolated the problem down to the setup of the model, so I deleted that one, and hopefully this one is more streamlined and easier to follow. The example I'm trying to follow is the classic IMDB comment task, where the comments are either positive or negative, but again, in my instance, adequate …
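For reference, a minimal sketch of the classic IMDB-style binary setup this is modelled on (toy random data; the actual architecture from the deleted question isn't shown here): an embedding layer, pooling, and a single sigmoid output trained with binary cross-entropy.

```python
import numpy as np
from tensorflow import keras

# Toy integer-encoded "comments": 100 sequences of 20 token ids each
X = np.random.randint(1, 1000, size=(100, 20))
y = np.random.randint(0, 2, size=(100,))  # 1 = adequate, 0 = inadequate

model = keras.Sequential([
    keras.layers.Embedding(input_dim=1000, output_dim=16),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # single binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, verbose=0)
print(model.predict(X[:3]))  # probabilities of "adequate"
```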
I am working on a sales forecasting problem. I am able to provide the algorithm with data about which items were sold and which were not. How do I provide the algorithm with information about items that are not present in the store? Is there any way we could encode this information in the data, or do any other algorithms accept this kind of information? Currently, I am using neural networks and random forests to forecast sales.
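One common encoding, offered below as a suggestion rather than a known fix for this setup (all column names and values are hypothetical): build the full date-by-item grid and add an explicit availability flag, so "not in the store" stays distinguishable from "in the store but not sold".

```python
import pandas as pd

# Hypothetical daily sales rows; items absent from the store have no row at all
sales = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "item": ["A", "B", "A"],
    "units_sold": [3, 0, 5],
})

# Build the full date x item grid, then flag whether the item was in the store
all_items = ["A", "B", "C"]  # C was never stocked on these days
grid = pd.MultiIndex.from_product(
    [sales["date"].unique(), all_items], names=["date", "item"]
).to_frame(index=False)
grid = grid.merge(sales, how="left", on=["date", "item"])
grid["available"] = grid["units_sold"].notna().astype(int)  # 0 = not in store
grid["units_sold"] = grid["units_sold"].fillna(0)
print(grid)
```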
I have multiple columns for loan installment repayment. As there is a field for the month of repayment, I want to predict whether the customer is going to pay next month's installment or not. Since I have multiple variables and the target variable is installment paid (Y/N), and repayment depends on a time variable (i.e., installments paid in past months), I'm looking to solve this problem with time series classification. Any references will be appreciated.
I am trying to build a regression tree with 70 attributes, where the business team wants to fix the first two levels, namely country and product type. To achieve this, I have two proposals: build a separate tree for each combination of country and product type, using the corresponding subset of the data, and pass new observations to the respective tree for prediction (seen here in the comments). I have 88 levels in country and 3 levels in product type, so it will …
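A minimal sketch of that first proposal (column names, data, and the plain `DecisionTreeRegressor` are all assumptions): group the training data by (country, product type), fit one tree per group, and route each prediction through the same key.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: the 70 real attributes reduced to one "x" column here
df = pd.DataFrame({
    "country": ["US", "US", "DE", "DE"],
    "product": ["A", "A", "B", "B"],
    "x":       [1.0, 2.0, 3.0, 4.0],
    "y":       [10.0, 12.0, 30.0, 34.0],
})

# One tree per (country, product) combination -- up to 88 * 3 = 264 of them
models = {
    key: DecisionTreeRegressor().fit(g[["x"]], g["y"])
    for key, g in df.groupby(["country", "product"])
}

def predict(country, product, features):
    # Route the observation to the tree for its fixed (country, product) pair
    return models[(country, product)].predict(features)

print(predict("US", "A", pd.DataFrame({"x": [1.5]})))
```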
I would like to detect anomalies in univariate time series data. Most examples on the internet show that, after you fit the model, you calculate a threshold on the training data and an MAE test loss, and compare them to detect anomalies. So I am wondering: is this the correct way of doing it? Shouldn't there be a different threshold value for each dataset? Also, why do all of the examples only compute the MAE loss for anomalies?
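For concreteness, a sketch of the pattern those examples follow (the mean-plus-3-standard-deviations rule below is one common convention, not the only choice): derive the threshold from the training-error distribution, then flag test points whose absolute error exceeds it.

```python
import numpy as np

rng = np.random.default_rng(0)
train_errors = np.abs(rng.normal(0, 0.1, 1000))  # |prediction - actual| on training data
test_errors = np.abs(rng.normal(0, 0.1, 200))
test_errors[50] = 2.0                            # an injected anomaly

# Threshold from the training error distribution
threshold = train_errors.mean() + 3 * train_errors.std()

anomalies = np.where(test_errors > threshold)[0]
print("threshold:", round(threshold, 3), "anomalous indices:", anomalies)
```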
I am using the MNIST dataset with 10 classes (the digits 0 to 9), in a compressed version with 49 predictor variables (x1, x2, ..., x49). I have trained a Random Forest model and created a test data set, which is a grid, on which I have used the trained model to generate predictions, as class probabilities as well as the classes. I am trying to generalise the code here that generates a decision boundary when there are only two outcome …
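A minimal sketch of the usual multiclass generalisation, shown on a plottable 2-D projection of the digits rather than the 49-variable version: instead of thresholding one class probability at 0.5, take the argmax over all 10 class probabilities at each grid point.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Digits data projected to 2 dimensions so the grid is plottable
X, y = load_digits(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(X)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X2, y)

# Grid over the 2-D feature space
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min(), X2[:, 0].max(), 200),
    np.linspace(X2[:, 1].min(), X2[:, 1].max(), 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]

# For 10 classes, the boundary comes from the argmax over class probabilities
proba = clf.predict_proba(grid)             # shape (n_points, 10)
labels = proba.argmax(axis=1).reshape(xx.shape)
# labels can now be passed to plt.contourf(xx, yy, labels) to draw the regions
```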
Let's say that I have a small or medium-sized dataset of images, say 50,000. I use transfer learning to train a deep learning classification model; call this model A. Model A is deemed to have good enough performance to be deployed. I deploy model A to a production environment where many users are able to consume the service by sending an image to an endpoint and receiving back the predicted class. Now let's say the service becomes very popular, …
I have a dataset with the following data format:

3 -> a -> b -> c -> d -> ikd
a -> c -> 3 -> dk -> 2 -> l2i

Each row represents a path from start to end. Let's take the first row as an example: the start point is 3 and the end point is ikd. I have millions of rows like that, and each row may have a different length. What I want to do is let users …
Here's the thing: I have imbalanced data and I was thinking about using a SMOTE transformation. However, when doing that inside a sklearn pipeline, I get an error because of missing values. This is my code:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# VARIABLE SELECTION
categorical_features = ["MARRIED", "RACE"]
continuous_features = ["AGE", "SALARY"]
features = ["MARRIED", "RACE", "AGE", "SALARY"]

# PIPELINE
continuous_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("scaler", StandardScaler()),
    ]
)
categorical_transformer = Pipeline(
    steps=[ …
```
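For reference, a sketch of how SMOTE is usually combined with preprocessing (an assumption about the fix, since the full pipeline is cut off above): SMOTE cannot handle NaNs, so imputation has to run before it, and the outer pipeline must be imblearn's `Pipeline`, which accepts resampling steps, rather than sklearn's.

```python
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

continuous_features = ["AGE", "SALARY"]
categorical_features = ["MARRIED", "RACE"]

# All imputation happens inside the ColumnTransformer, before SMOTE runs
preprocess = ColumnTransformer([
    ("cont", Pipeline([("imputer", SimpleImputer(strategy="median")),
                       ("scaler", StandardScaler())]), continuous_features),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_features),
])

pipe = ImbPipeline([
    ("preprocess", preprocess),
    ("smote", SMOTE(k_neighbors=1, random_state=0)),  # k_neighbors=1 only because the toy minority class is tiny
    ("model", LogisticRegression()),
])

# Toy imbalanced data with missing values (hypothetical)
X = pd.DataFrame({
    "MARRIED": [0, 1, np.nan, 1, 0, 1, 0, 1, 1, 0],
    "RACE":    [1, 2, 1, np.nan, 2, 1, 2, 1, 1, 2],
    "AGE":     [25, 40, 31, np.nan, 52, 47, 29, 33, 61, 38],
    "SALARY":  [30e3, 80e3, np.nan, 55e3, 90e3, 75e3, 42e3, 50e3, 95e3, 48e3],
})
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
pipe.fit(X, y)  # no missing-value error: SMOTE only ever sees imputed data
```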