Sorry for the long story, but it is a long story. :) I am using the h2o library for Python to build a decision tree and to extract the decision rules from it. I am training on data where the labels take TRUE and FALSE values. My final goal is to extract the significant paths (leaves) of the tree, where the number of TRUE cases significantly exceeds that of FALSE ones. treemodel = H2OGradientBoostingEstimator(ntrees=3, max_depth=maxDepth, distribution="bernoulli") …
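One way to frame "significantly exceeds" is a one-sided binomial test on each leaf's TRUE/FALSE counts. A minimal sketch, assuming the per-leaf counts have already been collected (the leaf names and counts below are made up; in practice they could come from walking the fitted model with h2o.tree.H2OTree):

```python
from math import comb

def binom_p_one_sided(k, n, p=0.5):
    """One-sided binomial tail P(X >= k) under H0: P(TRUE) = p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical per-leaf (n_true, n_false) counts of training cases.
leaf_counts = {
    "leaf_0": (48, 2),
    "leaf_1": (10, 9),
    "leaf_2": (3, 40),
}

# Keep only leaves where TRUE significantly outnumbers FALSE.
significant = {
    leaf: binom_p_one_sided(t, t + f)
    for leaf, (t, f) in leaf_counts.items()
    if binom_p_one_sided(t, t + f) < 0.05
}
```

Here only leaf_0 survives: 48 TRUE out of 50 is far from a 50/50 split, while 10 out of 19 is not.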
I am trying to run H2O in databricks. However, when I do the following: hc = pysparkling.H2OContext.getOrCreate(spark) I get the following error: java.lang.AbstractMethodError Does anyone know what the problem could be?
I have an input dataset with more than 100 variables, where around 80% of the variables are categorical in nature. While some variables like gender, country, etc. can be one-hot encoded, I also have a few variables with an inherent order in their values, such as rating: Very good, good, bad, etc. Is there any AutoML approach we can use to do this encoding based on the variable type? For example, I would like to provide the below …
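Outside of any AutoML tooling, the split by variable type can also be done by hand; a minimal pandas sketch (the column names, values, and category order here are made-up assumptions, not from the original dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F"],
    "country": ["US", "DE", "US"],
    "rating": ["good", "bad", "Very good"],
})

# Nominal variables (no inherent order): one-hot encode.
nominal = pd.get_dummies(df[["gender", "country"]])

# Ordinal variables: map to integer codes via an ordered categorical,
# so "bad" < "good" < "Very good" is preserved as 0 < 1 < 2.
order = ["bad", "good", "Very good"]
ordinal = (df["rating"]
           .astype(pd.CategoricalDtype(categories=order, ordered=True))
           .cat.codes)

encoded = pd.concat([nominal, ordinal.rename("rating_code")], axis=1)
```

The same per-column dispatch could be driven by a dict mapping each variable name to "nominal" or "ordinal".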
Is it possible to plot the deviance residuals and leverage (e.g. Cook's distance) of every observation fitted in a GLM model using H2O? From H2O's documentation, it seems H2O only calculates the sum of all deviance residuals and cannot output the residual for each observation.
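Even if the library only reports the total deviance, per-observation deviance residuals can be computed from the fitted values using the standard GLM formula. A sketch for the binomial family (the y and mu arrays are made up; in practice mu would come from the model's predictions on the training frame). Leverage and Cook's distance additionally require the hat matrix and are not sketched here:

```python
import numpy as np

def binomial_deviance_residuals(y, mu, eps=1e-12):
    """Per-observation deviance residuals for a binomial GLM:
    r_i = sign(y_i - mu_i) * sqrt(d_i), where d_i is the unit deviance
    -2 * [y_i*log(mu_i) + (1 - y_i)*log(1 - mu_i)] for 0/1 labels."""
    mu = np.clip(mu, eps, 1 - eps)
    unit_dev = -2.0 * (y * np.log(mu) + (1 - y) * np.log(1 - mu))
    return np.sign(y - mu) * np.sqrt(unit_dev)

# Hypothetical observed labels and fitted probabilities.
y = np.array([1, 0, 1, 0])
mu = np.array([0.9, 0.2, 0.4, 0.7])
res = binomial_deviance_residuals(y, mu)
```

By construction, the squared residuals sum to the total deviance the documentation already reports, which is a convenient sanity check.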
I want to try H2O's Model Selection function in Python, but cannot load the library for some reason. The following code fails: from h2o.estimators import H2OModelSelectionEstimator Error message: cannot import name 'H2OModelSelectionEstimator' from 'h2o.estimators' Other H2O estimators like H2OGeneralizedLinearEstimator work fine for me, though. https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/model_selection.html
I have an unbalanced dataset with 11 classes, where one class is 30% and the rest are between 5-12%. I am not a hardcore programmer, so I am using the product from https://www.h2o.ai/. I used GBM and DRF with the option to balance the classes, and the training results are great (98-99% precision and recall) per the confusion matrix; however, when I test it on the validation set, the only class where I get decent accuracy is the class …
Does H2O's plain random forest use CART, C4.5, C5.0, or something else? I cannot find this information. sklearn's docs say they use a modified version of CART, and I assume H2O also uses something like CART.
Hope this finds you well! I am trying to build a model to classify customers with propensity to buy, but I cannot get rid of overfitting. My approach is the following: I created the training dataset with an unbalanced approach, and it now has a target-1 rate of 6%, with 6,755 rows and 252 columns. The test dataset, on the other hand, has 313,587 rows, and target 1 is only 34 of the cases (a really low %). …
The GBM implementation of the h2o package only allows the user to specify a loss function via the distribution argument, which defaults to multinomial for categorical response variables and gaussian for numerical response variables. According to the documentation, the loss functions are implied by the distributions, but I need to know which loss functions are used, and I can't find that anywhere in the documentation. I'm guessing it's MSE for gaussian and cross-entropy for multinomial - does anybody here …
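For reference, these are the conventional distribution-to-loss correspondences in gradient boosting generally; a sketch of both losses (this is the textbook definition, not code lifted from H2O's source):

```python
import numpy as np

def gaussian_loss(y, pred):
    """Gaussian distribution -> squared-error (MSE) loss on raw predictions."""
    return np.mean((y - pred) ** 2)

def multinomial_loss(y_idx, probs, eps=1e-15):
    """Multinomial distribution -> cross-entropy (log) loss: the mean
    negative log-probability assigned to each observation's true class."""
    probs = np.clip(probs, eps, 1.0)
    return -np.mean(np.log(probs[np.arange(len(y_idx)), y_idx]))
```

With perfect predictions both losses are 0; a uniform two-class probability gives a cross-entropy of log(2).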
I’m a novice to machine learning. I've been trying out different neural network implementations in R, including the neuralnet package and the deeplearning function of the h2o package. For neuralnet, the default setting is one hidden layer with one hidden neuron. With this setting, the model takes several minutes to fit to my data. In the h2o package, the default is two layers with 200 neurons each, and the model takes only a few seconds. How is this possible? Are …
I am exploring H2O deep learning and would like your opinion on the performance of my model on a classification problem. Do you think my model is overfitting?

dl_fit2 <- h2o.deeplearning(x = predictors,
                            y = response,
                            training_frame = train,
                            validation_frame = valid,
                            epochs = 200,
                            score_validation_samples = 10000,
                            score_duty_cycle = 0.025,
                            activation = "RectifierWithDropout",
                            hidden = c(80, 10, 80),
                            hidden_dropout_ratios = c(0.2, 0.2, 0.2),
                            loss = "CrossEntropy",
                            rate = 0.01,
                            rate_annealing = 2e-6,
                            adaptive_rate = FALSE,
                            momentum_start = 0.2,
                            momentum_ramp = 1e7,
                            momentum_stable = …