Random Forest Classifier Output

I used a RandomForestClassifier for my prediction model, but the output printed is either 0 or in decimals. What do I need to do for my model to show me 0s and 1s instead of decimals? Note: I used feature importance and removed the least important columns, but the accuracy is the same and the output hasn't changed much. Also, I have my estimators set to 1000; do I increase or decrease this? Edit: target col 1 0 0 1, output col 0.994 …
Category: Data Science
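A minimal sketch (assuming scikit-learn, with synthetic data standing in for the asker's): decimal outputs usually mean `predict_proba()` (class probabilities) was called, or a regressor was used instead of a classifier. `predict()` on a classifier returns hard 0/1 labels.

```python
# Decimals vs. hard labels: predict_proba() returns class probabilities,
# while predict() returns the 0/1 class labels directly.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)  # toy stand-in data
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:3])   # decimals, e.g. rows like [0.99, 0.01]
labels = clf.predict(X[:3])        # hard class labels: 0 or 1
```

If the decimals instead come from a `RandomForestRegressor`, the usual fix is to switch to the classifier, or threshold the output (e.g. `(pred >= 0.5).astype(int)`).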

Does it make sense to scale input data with random forest regressor taking two different arrays as input?

I am exploring Random Forest regressors using sklearn by trying to predict the returns of a stock based on the past hour's data. I have two inputs: the return (% change) and the volume of the stock for the last 50 mins. My output is the predicted price for the next 10 minutes. Here is an example of the input data:

   Return      Volume
0  0.000420  119.447233
1 -0.001093   86.455629
2  0.000277  117.940777
3  0.000256   38.084008
4  0.001275   74.376315
... 45 …
Category: Data Science
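A small sketch of why scaling is generally unnecessary for random forests (synthetic data, not the asker's): trees split on per-feature thresholds, so any monotonic rescaling of an input column leaves the fitted trees' decisions unchanged. Scaling by a power of two keeps the arithmetic exact, so the predictions match bit-for-bit here.

```python
# Trees split on ordered thresholds, so rescaling an input feature does
# not change which side of a split each sample falls on.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # e.g. [return, volume]
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

raw = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
scaled = RandomForestRegressor(n_estimators=50, random_state=0).fit(X * 1024.0, y)

# Same seed, same data ordering: the two forests agree on every sample.
same = np.allclose(raw.predict(X), scaled.predict(X * 1024.0))
```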

When to use Random Forest over SVM and vice versa?

When would one use Random Forest over SVM and vice versa? I understand that cross-validation and model comparison are an important aspect of choosing a model, but here I would like to learn more about rules of thumb and heuristics for the two methods. Can someone please explain the subtleties, strengths, and weaknesses of the classifiers, as well as the problems best suited to each of them?
Category: Data Science
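Heuristics aside, the cross-validated comparison the question mentions is quick to run; a sketch on a stock sklearn dataset (note the SVM gets a scaling step, which it needs and the forest does not):

```python
# Side-by-side 5-fold comparison of a random forest and an RBF SVM.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, random_state=0)
svm = make_pipeline(StandardScaler(), SVC())   # SVMs are scale-sensitive

rf_score = cross_val_score(rf, X, y, cv=5).mean()
svm_score = cross_val_score(svm, X, y, cv=5).mean()
```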

Isolation Forest Score Function Theory

I am currently reading this paper on isolation forests. In the section about the score function, they mention the following. For context, $h(x)$ is defined as the path length of a data point traversing an iTree, and $n$ is the sample size used to grow the iTree. The difficulty in deriving such a score from $h(x)$ is that while the maximum possible height of an iTree grows in the order of $n$, the average height grows in the order of $\log(n)$. …
Category: Data Science
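For reference, the normalization the paper (Liu et al.) arrives at: $h(x)$ is divided by $c(n)$, the average path length of an unsuccessful search in a binary search tree of $n$ nodes, which pins the score to $(0, 1]$ regardless of sample size:

```latex
s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}, \qquad
c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad
H(i) \approx \ln(i) + 0.5772156649
```

Here $E(h(x))$ is the average path length over the forest and $H(i)$ is the harmonic number (approximated via Euler's constant); scores near 1 indicate anomalies, scores well below 0.5 indicate normal points.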

Determining increments for aggregated time series data to determine impact of individual features

I'm working with a data source that provides itemised transactions, which I am aggregating into 1-hour blocks to determine a 'rate per hour' as the dependent or target variable — i.e. like a time series. So far I've looked at Logistic Regression, Random Forest Regressor and Gradient Boosting Regressor and got reasonable results, but am really trying to determine the weighting/impact of the independent variables, to see which have the biggest impact on the DV. Would there …
Category: Data Science
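One way to get that per-variable weighting from the tree ensembles mentioned above (sketch on synthetic data): the fitted model's impurity-based importances come for free, and permutation importance is usually a more reliable read on each independent variable's impact.

```python
# Two importance measures for a fitted forest: built-in (impurity-based)
# and permutation (drop in score when a feature's values are shuffled).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=300, n_features=5, n_informative=2,
                       random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

builtin = model.feature_importances_                  # normalized, sums to 1
perm = permutation_importance(model, X, y, n_repeats=5, random_state=0)
```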

Any way to represent a random forest regressor model?

Currently doing some EDA into a random forest regressor that was built; there seem to be observations where the model prediction is off. What library can I use to visualise the representation of the random forest, for me to understand better how the model splits at each node, etc.? The model is built in PySpark (pyspark.ml.RandomForestRegressor).
Category: Data Science
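For the PySpark model, the fitted `RandomForestRegressionModel` exposes a `toDebugString` property that dumps every tree's split rules as text. The sketch below shows the same idea with scikit-learn's `export_text` (synthetic data), since it runs without a Spark session:

```python
# Text dump of one tree in a forest: one line per node, showing the
# feature, threshold, and leaf value at each split.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

X, y = make_regression(n_samples=100, n_features=3, random_state=0)
forest = RandomForestRegressor(n_estimators=5, max_depth=2,
                               random_state=0).fit(X, y)

# Inspect the first tree; repeat over forest.estimators_ for all of them.
rules = export_text(forest.estimators_[0], feature_names=["f0", "f1", "f2"])
```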

Decision trees for anomaly detection

Problem: From what I understand, a common method in anomaly detection consists of building a predictive model trained on non-anomalous training data, and performing anomaly detection using the error of the model when predicting on the observed data. This method requires the user to identify non-anomalous data beforehand. What if it's not possible to label non-anomalous data to train the model? Is there anything in the literature that explains how to overcome this issue? I have an idea, but I was …
Category: Data Science
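One established answer to the labelling problem above is an unsupervised tree method: Isolation Forest needs no clean training set at all, since it scores points by how easily random splits isolate them. A sketch on synthetic data:

```python
# Unsupervised anomaly detection: no "known normal" labels required.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 2))   # dense cluster
outliers = rng.uniform(6.0, 8.0, size=(5, 2))  # far-away points
X = np.vstack([normal, outliers])

iso = IsolationForest(random_state=0).fit(X)
pred = iso.predict(X)   # +1 = inlier, -1 = anomaly
```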

approach for predicting machine failure using maintenance history

I have been struggling with this problem for a while now, and I finally decided to post a question here to get some help. The problem I'm trying to solve is about predictive maintenance. Specifically, a system produces two kinds of maintenance messages when it runs: a basic-msg and a fatal-msg. A basic message indicates that there is a problem with the system that needs to be checked (it's not serious); a fatal-msg, on the other hand, signals that the …
Category: Data Science

Plotting decision boundary from Random Forest model for multiclass MNIST dataset

I am using the MNIST dataset with 10 classes (the digits 0 to 9), in a compressed version with 49 predictor variables (x1, x2, ..., x49). I have trained a Random Forest model and have created a test data set, which is a grid, on which I have used the trained model to generate predictions, both as class probabilities and as classes. I am trying to generalise the code here that generates a decision boundary when there are only two outcome …
Category: Data Science
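The grid-prediction step generalises to multiclass directly (a sketch with synthetic blobs standing in for the compressed MNIST data): predict over a mesh, reshape the labels back onto the grid, and the filled regions are the boundary (`matplotlib`'s `contourf(xx, yy, Z)` would draw `Z`).

```python
# Multiclass decision "surface": predict on a 2-D mesh grid, then
# reshape the class labels to the grid shape for plotting.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X, y = make_blobs(n_samples=300, centers=4, random_state=0)  # 4 classes, 2-D
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 100),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 100),
)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
```

With 49 predictors a boundary can only be drawn in a plotted plane, so the usual approach is to project to 2 components (e.g. PCA), train on those, and build the grid in that space.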

How do I design a random forest split with a "not sure" category?

Let's say I have data with two target labels, A and B. I want to design a random forest that has three outputs: A, B and Not sure. Items in the Not sure category would be a mix of A and B that would be about evenly distributed. I don't mind writing the RF from scratch. Two questions: What should my split criterion be? Can this problem be reposed in a standard RF framework?
Category: Data Science
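On the second question above: a standard RF can be given a "Not sure" output without a custom split criterion, by thresholding `predict_proba` (a reject option). A sketch on synthetic data, with the 0.35/0.65 cutoffs as arbitrary illustrative choices:

```python
# Reject option on a standard binary forest: abstain when the class
# probability is too close to 0.5 to call.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

p_b = clf.predict_proba(X)[:, 1]   # probability of class B
labels = np.where(p_b >= 0.65, "B",
                  np.where(p_b <= 0.35, "A", "Not sure"))
```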

How does ExtraTrees (Extremely Randomized Trees) learn?

I'm trying to understand the difference between random forests and extremely randomized trees (https://orbi.uliege.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf). I understand that ExtraTrees uses random splits and no bootstrapping, as covered here: https://stackoverflow.com/questions/22409855/randomforestclassifier-vs-extratreesclassifier-in-scikit-learn The question I'm struggling with is: if all the splits are randomized, how does an extremely randomized decision tree learn anything about the objective function? Where is the 'optimization' step?
Category: Data Science
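The short answer is that the cut-points are random but the selection among them is not: for each candidate feature ExtraTrees draws one random threshold, then keeps the (feature, threshold) pair with the best impurity reduction, so every split still improves the training objective, and averaging many trees does the rest. A quick empirical sketch on synthetic data:

```python
# Despite random cut-points, ExtraTrees scores comparably to a forest
# with fully optimized splits on the same data.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, random_state=0)
et_score = cross_val_score(ExtraTreesClassifier(random_state=0), X, y, cv=5).mean()
rf_score = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
```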

Does it make sense to use target encoding together with tree-based models?

I'm working on a regression problem with a few high-cardinality categorical features (forecasting different items with a single model). Someone suggested using target encoding (the mean/median of the target for each item) together with xgboost. While I understand how this new feature would improve a linear model (or GLMs in general), I do not understand how this approach fits into a tree-based model (regression trees, random forests, boosting). Given the feature is used for splitting, items with a mean below …
Category: Data Science
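A sketch of why the encoding helps a tree (toy data; sklearn's `GradientBoostingRegressor` stands in for xgboost): it turns the high-cardinality column into one numeric feature *ordered by the target*, so a single threshold can separate low-target items from high-target ones, instead of the many one-hot splits a tree would otherwise need.

```python
# Mean target encoding of a categorical "item" column for a tree model.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.DataFrame({
    "item": ["a", "a", "b", "b", "c", "c"] * 10,
    "y":    [1.0, 1.2, 5.0, 5.2, 9.0, 9.2] * 10,
})
# NOTE: in practice, compute the means on the training fold only (or with
# out-of-fold means) to avoid target leakage.
df["item_te"] = df["item"].map(df.groupby("item")["y"].mean())

model = GradientBoostingRegressor(random_state=0).fit(df[["item_te"]], df["y"])
```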

In Python, how can I transfer/remove duplicate columns from one dataset, such that the rows and columns of all datasets would be equal?

So I've been trying to improve my Random Decision Tree model for the Titanic Challenge on Kaggle by introducing a Validation Dataset, and now I've hit this roadblock, as shown by the images below: Validation Dataset, Test Dataset. After inspecting these datasets using the .info() method, I found that the Validation Dataset contains 178 and 714 non-null floats, while the Test Dataset contains an assortment of 178 and 419 non-null floats and integers. Further, the datasets contain duplicate rows, which I …
Category: Data Science
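A minimal pandas sketch of the alignment step (toy frames, not the Titanic data): drop duplicate rows, then restrict every dataset to the columns they share, so the shapes line up for the model.

```python
# Align two datasets: deduplicate rows, then keep only shared columns.
import pandas as pd

val = pd.DataFrame({"Age": [22, 38, 22], "Fare": [7.25, 71.3, 7.25],
                    "Extra": [1, 2, 1]})
test = pd.DataFrame({"Age": [30, 25], "Fare": [8.05, 9.5]})

val = val.drop_duplicates()                       # remove duplicate rows
common = val.columns.intersection(test.columns)   # columns both share
val, test = val[common], test[common]
```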

Interpreting the variance of feature importance outputs with each random forest run using the same parameters

I noticed that I am getting different feature importance results on each random forest run, even though they use the same parameters. Now, I know that a random forest model samples observations randomly, which causes the importance levels to vary; this is especially pronounced for the less important variables. My question is: how does one interpret the variance in random forest results when running it multiple times? I know that one can reduce the instability of the results …
Category: Data Science
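One concrete way to interpret that run-to-run variance (sketch on synthetic data): refit with several seeds and look at each feature's mean and standard deviation; a feature whose std is large relative to its mean has an unstable importance, exactly the pattern the question describes for weak variables.

```python
# Quantify importance instability by refitting under different seeds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           random_state=0)

runs = np.array([
    RandomForestClassifier(n_estimators=100, random_state=seed)
    .fit(X, y).feature_importances_
    for seed in range(5)
])
mean_imp, std_imp = runs.mean(axis=0), runs.std(axis=0)
```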

About

Geeks Mental is a community that publishes articles and tutorials about the Web, Android, Data Science, new techniques, and Linux security.