I used a RandomForestClassifier for my prediction model, but the printed output is either 0 or in decimals. What do I need to do for my model to show me 0s and 1s instead of decimals? Note: I used feature importance and removed the least important columns; the accuracy is still the same and the output hasn't changed much. Also, I have my estimators set to 1000. Do I increase or decrease this?

Edit:

target col: 1, 0, 0, 1
output col: 0.994 …
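Decimals from a scikit-learn classifier usually mean the probability output is being read rather than the class labels. A minimal sketch under that assumption (toy data standing in for the real dataset): `predict` returns hard 0/1 labels, while `predict_proba` returns decimal probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy binary-classification data standing in for the real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=1000, random_state=0)
model.fit(X_train, y_train)

print(model.predict(X_test)[:5])        # hard class labels: 0s and 1s
print(model.predict_proba(X_test)[:5])  # per-class probabilities: decimals
```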
I have a dataset that contains some user-specific details like gender, age range, region etc., and also behavioural data containing the historical click-through rate (last 3 months) for the different ad types shown to them. A sample of the data is shown below. It has 3 ad types, i.e. ecommerce, auto, healthcare, but the actual data contains more ad types. I need to build a regression model using XGBRegressor that can tell which ad should be shown to a given new user in order to …
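A minimal sketch of one common framing of this problem (all column names, encodings, and values below are hypothetical): train an `XGBRegressor` to predict CTR from user features plus an ad-type column, then score every candidate ad type for a new user and show the one with the highest predicted CTR.

```python
import pandas as pd
from xgboost import XGBRegressor

# Hypothetical long-format training data: one row per (user, ad_type) with observed CTR
df = pd.DataFrame({
    "age_range": [1, 1, 2, 2, 3, 3],
    "gender":    [0, 1, 0, 1, 0, 1],
    "ad_type":   [0, 1, 0, 1, 2, 2],   # e.g. 0=ecommerce, 1=auto, 2=healthcare
    "ctr":       [0.02, 0.05, 0.01, 0.04, 0.03, 0.06],
})
X, y = df.drop(columns="ctr"), df["ctr"]

model = XGBRegressor(n_estimators=200, max_depth=3)
model.fit(X, y)

# For a new user, predict CTR for every ad type and pick the best one
new_user = {"age_range": 2, "gender": 1}
candidates = pd.DataFrame([{**new_user, "ad_type": t} for t in [0, 1, 2]])
best = candidates.loc[model.predict(candidates).argmax(), "ad_type"]
print("ad type to show:", best)
```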
I am working on a binary classification using a random forest model and neural networks, in which I am using SHAP to explain the model predictions. I followed the tutorial and wrote the code below to get the waterfall plot shown below:

```python
row_to_show = 20
data_for_prediction = ord_test_t.iloc[row_to_show]  # use 1 row of data here; could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
rf_boruta.predict_proba(data_for_prediction_array)

explainer = shap.TreeExplainer(rf_boruta)
# Calculate SHAP values
shap_values = explainer.shap_values(data_for_prediction)

shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0], ord_test_t.iloc[row_to_show])
```

This generated the plot as …
I have manually created a random dataset around some mean value, and I have tried to use gradient descent linear regression to predict this simple mean value. I have done exactly as in the manual, and for some reason my predictor coefficients are going to infinity, even though it worked for another case. Why, in this case, can it not predict a simple 1.4 value?

```matlab
clear all;
n = 10000;
t = 1.4;
sigma_R = t*0.001;
min_value_t = t - sigma_R;
max_value_t = t + sigma_R;
y_data …
```
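Coefficients running off to infinity in plain gradient descent are most often a step-size problem rather than a data problem. A minimal sketch (in Python rather than MATLAB, with made-up data of the same shape: noisy samples around 1.4) showing the same loss converging with a small learning rate and diverging with a large one:

```python
import numpy as np

rng = np.random.default_rng(0)
y = 1.4 + 0.0014 * rng.standard_normal(10000)  # noisy samples around 1.4

def fit_mean(y, lr, steps=200):
    w = 0.0  # single coefficient: the predicted mean
    for _ in range(steps):
        grad = 2 * np.mean(w - y)  # gradient of the MSE loss wrt w
        w -= lr * grad
    return w

print(fit_mean(y, lr=0.1))  # converges to ~1.4
print(fit_mean(y, lr=1.1))  # step too large: |w| grows without bound
```

The update is w ← (1 − 2·lr)·w + 2·lr·ȳ, so it converges only when |1 − 2·lr| < 1; outside that range every step overshoots further, which is exactly the blow-up described.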
I am working with a dataset that has enough observations and ~10 variables:

- half of the variables are numeric
- the other half of the variables are categorical with 2-3 levels (demographics)
- one ID variable
- one last variable that holds the sales value: 0 for no sale, and the bill amount for a sale

Using this information, I want to understand which segments of my customers to market to. I am using R for the code, but that's not relevant here. :) I am confused about …
I'm working on forecasting daily volumes and have used a time series model to check the data for stationarity. However, I'm struggling to forecast the data with 90% accuracy. Right now the variation is extremely high and I'm just unable to bring it down. I've used the log method to transform my data. Please find below the link to the folder containing the ipynb and csv files: https://drive.google.com/drive/folders/1QUJkTucLPIf2vjo2mRmoBU6be083dYpQ?usp=sharing Any help will be greatly appreciated. Thanks, Rahul
Suppose I have data with two independent variables $X_1$, $X_2$ and one dependent variable, say $y$, as follows:

$X_1$: $x_{1,1}$, $x_{1,2}$, $x_{1,3}$
$X_2$: $x_{2,1}$, $x_{2,2}$, $x_{2,3}$
$y$: $y_1$, $y_2$, $y_3$

I built a machine learning model which is good. Now I want to generate predictions not just for the test data but for all possible combinations of the test data. For example, if our test data looks like

$X_1$: $a$, $b$, $c$
$X_2$: $p$, $q$, $r$

then I want predictions …
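A minimal sketch of this grid-prediction step, assuming a fitted scikit-learn-style model (the `LinearRegression` stand-in, column names, and values below are all hypothetical): build the Cartesian product of the test values with `itertools.product` and predict on every combination.

```python
import itertools
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in model trained on toy data; replace with the real fitted model
model = LinearRegression().fit([[1, 4], [2, 5], [3, 6]], [1, 2, 3])

x1_values = [1.0, 2.0, 3.0]  # the test values a, b, c for X1
x2_values = [4.0, 5.0, 6.0]  # the test values p, q, r for X2

# All 9 combinations (a,p), (a,q), ..., (c,r)
grid = pd.DataFrame(list(itertools.product(x1_values, x2_values)),
                    columns=["X1", "X2"])
grid["prediction"] = model.predict(grid[["X1", "X2"]].to_numpy())
print(grid)
```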
I am building a model to predict, say, house prices. Within my data I have sales and rentals; the Y variable is the price of either the sale or the rental. I also have a number of X variables to predict Y, such as number of bedrooms, bathrooms, meters squared etc. I believe that the model will first make a split on the variable "sales" vs "rentals", as this would reduce the loss function (RMSE) the most. Do you …
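One way to check that intuition directly, sketched with made-up data (the `is_rental` column, the prices, and the tree depth are all assumptions): fit a tree and read the root node's split feature from scikit-learn's `tree_` attribute.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 1000
is_rental = rng.integers(0, 2, n)  # 0 = sale, 1 = rental
bedrooms = rng.integers(1, 6, n)
# Rentals priced far below sales, so this split cuts RMSE the most
price = np.where(is_rental == 1, 2000, 500000) + 10000 * bedrooms

X = np.column_stack([is_rental, bedrooms])
tree = DecisionTreeRegressor(max_depth=3).fit(X, price)

feature_names = ["is_rental", "bedrooms"]
print("root split on:", feature_names[tree.tree_.feature[0]])  # -> is_rental
```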
I am working on a model which will allow me to predict how long it will take for a "job" to be completed, based on historical data. Each job has a handful of categorical characteristics (all independent), and some historic data might look like:

JobID  Manager  City      Design       ClientType  TaskDuration
a1     George   Brisbane  BigKahuna    Personal    10
a2     George   Brisbane  SmallKahuna  Business    15
a3     George   Perth     BigKahuna    Investor    7

Thus far, my model has been relatively basic, following these basic steps: …
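A minimal sketch of a common baseline for this shape of data, using the sample rows above (the pipeline and model choice are assumptions, not the asker's actual steps): one-hot encode the categorical columns and fit a regressor on the encoded matrix.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "Manager":      ["George", "George", "George"],
    "City":         ["Brisbane", "Brisbane", "Perth"],
    "Design":       ["BigKahuna", "SmallKahuna", "BigKahuna"],
    "ClientType":   ["Personal", "Business", "Investor"],
    "TaskDuration": [10, 15, 7],
})
X, y = df.drop(columns="TaskDuration"), df["TaskDuration"]

cat_cols = ["Manager", "City", "Design", "ClientType"]
pipe = Pipeline([
    # handle_unknown="ignore" keeps prediction from failing on unseen categories
    ("encode", ColumnTransformer([("ohe", OneHotEncoder(handle_unknown="ignore"), cat_cols)])),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])
pipe.fit(X, y)
print(pipe.predict(X))  # in-sample check only; real use needs many more rows
```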
I have several groups of features that I'd like to test against independent variables. The idea is to find which groups tend to be associated with a specific value of an independent variable. Let's take the following example, where s are samples, f are features, and i are independent variables associated with each s:

      s1   s2   s3   s4   ...
f1   0.3  0.9  0.7  0.8
f2   ...
f3   ...
f4   ...
f5   ...
i1   low  low  med  high
i2   0.9  1.6  2.3 …
I already referred to these posts here and here. I also posted here, but since there is no response, I am posting here. Currently, I am working on customer segmentation using their purchase data. So, my data has the below info for each customer. Based on the above linked posts, I see that for clustering we have to scale the variables if they are in different units etc. But if I scale/normalize all of them to a uniform scale, wouldn't I lose the information …
I'm trying to put together a script that classifies comments as either adequate or inadequate. I put a question up here earlier with all my code, but I think I've isolated the problem down to the setup of the model, so I deleted that one, and hopefully this one is more streamlined and easier to follow. The example I'm trying to follow is the classic IMDB comment task, where the comments are either positive or negative, but again, in my instance, adequate …
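For reference, a minimal sketch of the classic IMDB-style binary setup this is modelled on (toy random data; the actual architecture from the deleted question isn't shown here): an embedding layer, pooling, and a single sigmoid output trained with binary cross-entropy.

```python
import numpy as np
from tensorflow import keras

# Toy integer-encoded "comments": 100 sequences of 20 token ids each
X = np.random.randint(1, 1000, size=(100, 20))
y = np.random.randint(0, 2, size=(100,))  # 1 = adequate, 0 = inadequate

model = keras.Sequential([
    keras.layers.Embedding(input_dim=1000, output_dim=16),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # single binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, verbose=0)
print(model.predict(X[:3]))  # probabilities of "adequate"
```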
I am working on a sales forecasting problem. I am able to provide the algorithm with data about which items were sold and which were not. How do I provide the algorithm with information about items that are not present in the store? Is there any way we could encode this information in the data, or do any other algorithms accept this kind of information? Currently, I am using neural networks and random forests to forecast sales.
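One common encoding, offered below as a suggestion rather than a known fix for this setup (all column names and values are hypothetical): build the full date-by-item grid and add an explicit availability flag, so "not in the store" stays distinguishable from "in the store but not sold".

```python
import pandas as pd

# Hypothetical daily sales rows; items absent from the store have no row at all
sales = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "item": ["A", "B", "A"],
    "units_sold": [3, 0, 5],
})

# Build the full date x item grid, then flag whether the item was in the store
all_items = ["A", "B", "C"]  # C was never stocked on these days
grid = pd.MultiIndex.from_product(
    [sales["date"].unique(), all_items], names=["date", "item"]
).to_frame(index=False)
grid = grid.merge(sales, how="left", on=["date", "item"])
grid["available"] = grid["units_sold"].notna().astype(int)  # 0 = not in store
grid["units_sold"] = grid["units_sold"].fillna(0)
print(grid)
```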
I have multiple columns for loan installment repayment. As there is a field for the month of repayment, I want to predict whether the customer is going to pay next month's installment or not. Since I have multiple variables and the target variable is installment paid (Y/N), and repayment depends on a time variable (i.e., installments paid in past months), I'm looking to solve this problem with time series classification. Any references will be appreciated.
I am trying to build a regression tree with 70 attributes, where the business team wants to fix the first two levels, namely country and product type. To achieve this, I have two proposals: build a separate tree for each combination of country and product type, using the corresponding subset of the data, and pass new observations to the respective tree for prediction (seen here in the comments). I have 88 levels in country and 3 levels in product type, so it will …
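A minimal sketch of that first proposal (column names, data, and the plain `DecisionTreeRegressor` are all assumptions): group the training data by (country, product type), fit one tree per group, and route each prediction through the same key.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: the 70 real attributes reduced to one "x" column here
df = pd.DataFrame({
    "country": ["US", "US", "DE", "DE"],
    "product": ["A", "A", "B", "B"],
    "x":       [1.0, 2.0, 3.0, 4.0],
    "y":       [10.0, 12.0, 30.0, 34.0],
})

# One tree per (country, product) combination -- up to 88 * 3 = 264 of them
models = {
    key: DecisionTreeRegressor().fit(g[["x"]], g["y"])
    for key, g in df.groupby(["country", "product"])
}

def predict(country, product, features):
    # Route the observation to the tree for its fixed (country, product) pair
    return models[(country, product)].predict(features)

print(predict("US", "A", pd.DataFrame({"x": [1.5]})))
```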
I would like to detect anomalies in univariate time series data. Most examples on the internet show that, after you fit the model, you calculate a threshold on the training data and an MAE test loss, and compare them to detect anomalies. So I am wondering: is this the correct way of doing it? Shouldn't there be a different threshold value for each dataset? Also, why do all of the examples only compute the MAE loss for anomalies?
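For concreteness, a sketch of the pattern those examples follow (the mean-plus-3-standard-deviations rule below is one common convention, not the only choice): derive the threshold from the training-error distribution, then flag test points whose absolute error exceeds it.

```python
import numpy as np

rng = np.random.default_rng(0)
train_errors = np.abs(rng.normal(0, 0.1, 1000))  # |prediction - actual| on training data
test_errors = np.abs(rng.normal(0, 0.1, 200))
test_errors[50] = 2.0                            # an injected anomaly

# Threshold from the training error distribution
threshold = train_errors.mean() + 3 * train_errors.std()

anomalies = np.where(test_errors > threshold)[0]
print("threshold:", round(threshold, 3), "anomalous indices:", anomalies)
```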
I am using the MNIST dataset with 10 classes (the digits 0 to 9), in a compressed version with 49 predictor variables (x1, x2, ..., x49). I have trained a Random Forest model and created a test data set, which is a grid, on which I have used the trained model to generate predictions, as class probabilities as well as the classes. I am trying to generalise the code here that generates a decision boundary when there are only two outcome …
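A minimal sketch of the usual multiclass generalisation, shown on a plottable 2-D projection of the digits rather than the 49-variable version: instead of thresholding one class probability at 0.5, take the argmax over all 10 class probabilities at each grid point.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Digits data projected to 2 dimensions so the grid is plottable
X, y = load_digits(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(X)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X2, y)

# Grid over the 2-D feature space
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min(), X2[:, 0].max(), 200),
    np.linspace(X2[:, 1].min(), X2[:, 1].max(), 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]

# For 10 classes, the boundary comes from the argmax over class probabilities
proba = clf.predict_proba(grid)             # shape (n_points, 10)
labels = proba.argmax(axis=1).reshape(xx.shape)
# labels can now be passed to plt.contourf(xx, yy, labels) to draw the regions
```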
Let's say that I have a small or medium-sized dataset of images, say 50,000. I use transfer learning to train a deep learning classification model; call this model A. Model A is deemed to have good enough performance to be deployed. I deploy model A to a production environment where many users are able to consume the service by sending an image to an endpoint and receiving back the predicted class. Now let's say the service becomes very popular, …
I have a dataset with the following data format:

3 -> a -> b -> c -> d -> ikd
a -> c -> 3 -> dk -> 2 -> l2i

Each row represents a path from start to end. Let's take the first row as an example: the start point is 3 and the end point is ikd. I have millions of rows like that, and each row may have a different length. What I want to do is let users …
Here's the thing: I have imbalanced data and I was thinking about using a SMOTE transformation. However, when doing that inside a sklearn pipeline, I get an error because of missing values. This is my code:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# VARIABLE SELECTION
categorical_features = ["MARRIED", "RACE"]
continuous_features = ["AGE", "SALARY"]
features = ["MARRIED", "RACE", "AGE", "SALARY"]

# PIPELINE
continuous_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("scaler", StandardScaler()),
    ]
)
categorical_transformer = Pipeline(
    steps=[ …
```
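For reference, a sketch of how SMOTE is usually combined with preprocessing (an assumption about the fix, since the full pipeline is cut off above): SMOTE cannot handle NaNs, so imputation has to run before it, and the outer pipeline must be imblearn's `Pipeline`, which accepts resampling steps, rather than sklearn's.

```python
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

continuous_features = ["AGE", "SALARY"]
categorical_features = ["MARRIED", "RACE"]

# All imputation happens inside the ColumnTransformer, before SMOTE runs
preprocess = ColumnTransformer([
    ("cont", Pipeline([("imputer", SimpleImputer(strategy="median")),
                       ("scaler", StandardScaler())]), continuous_features),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_features),
])

pipe = ImbPipeline([
    ("preprocess", preprocess),
    ("smote", SMOTE(k_neighbors=1, random_state=0)),  # k_neighbors=1 only because the toy minority class is tiny
    ("model", LogisticRegression()),
])

# Toy imbalanced data with missing values (hypothetical)
X = pd.DataFrame({
    "MARRIED": [0, 1, np.nan, 1, 0, 1, 0, 1, 1, 0],
    "RACE":    [1, 2, 1, np.nan, 2, 1, 2, 1, 1, 2],
    "AGE":     [25, 40, 31, np.nan, 52, 47, 29, 33, 61, 38],
    "SALARY":  [30e3, 80e3, np.nan, 55e3, 90e3, 75e3, 42e3, 50e3, 95e3, 48e3],
})
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
pipe.fit(X, y)  # no missing-value error: SMOTE only ever sees imputed data
```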