Why is H2O using only a part of the data?

I have this dataframe:

> head(df_clas_sn)
   country serial_no_of_generator_1 serial_no_of_generator_2 serial_no_of_generator_3 unit_type
11 Germany                       XY                       01                     0620      ORiP
12   India                       XY                       01                     0631      ORiP
13 Germany                       XY                       02                     0683      ORiP
14 Germany                       XZ                       02                     0735      KRIT
15 England                       XY                       03                     0844      KRIT
16 Germany                       XZ                       05                     0243      ORiP
   position_in_unit hours_balance status_code
11                Y          2771           1
12               DE          3783           1
13                G          1267           1
14               DE          7798           1
15                G          1136           1
16                M          6197           1

with these dimensions:

> dim(df_clas_sn)
[1] …
Category: Data Science

How to extract the sample split (values) of decision tree leaves (terminal nodes) using the h2o library

Sorry for a long story, but it is a long story. :) I am using the h2o library for Python to build a decision tree and to extract the decision rules out of it. I am using some data for training where labels get TRUE and FALSE values. My final goal is to extract the significant path (leaf) of the tree where the number of TRUE cases significantly exceeds that of FALSE ones. treemodel=H2OGradientBoostingEstimator(ntrees = 3, max_depth = maxDepth, distribution="bernoulli") …
Category: Data Science
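
The goal described in the question above (finding leaves where TRUE cases clearly outnumber FALSE ones) can be sketched independently of H2O once each training row has a leaf identifier. This is only a minimal illustration: the leaf ids and labels below are made up, the function names and the `min_ratio` and `min_true` thresholds are hypothetical, and a simple ratio check stands in for whatever significance test the asker has in mind. In h2o the per-row leaf ids would have to come from the trained model (the Python API appears to offer `predict_leaf_node_assignment` for this, but check the documentation for your version).

```python
from collections import Counter, defaultdict


def leaf_label_counts(leaf_ids, labels):
    """Count TRUE and FALSE labels that fall into each leaf."""
    counts = defaultdict(Counter)
    for leaf, label in zip(leaf_ids, labels):
        counts[leaf][bool(label)] += 1
    return counts


def significant_leaves(leaf_ids, labels, min_ratio=3.0, min_true=2):
    """Keep leaves with at least min_true TRUE rows and a TRUE count
    of at least min_ratio times the FALSE count (hypothetical rule)."""
    selected = {}
    for leaf, c in leaf_label_counts(leaf_ids, labels).items():
        n_true, n_false = c[True], c[False]
        # max(n_false, 1) avoids treating a leaf with zero FALSE rows
        # as infinitely pure when it holds very few rows.
        if n_true >= min_true and n_true >= min_ratio * max(n_false, 1):
            selected[leaf] = (n_true, n_false)
    return selected


# Made-up leaf paths and labels, for illustration only.
leaf_ids = ["LRL", "LRL", "LRL", "LRL", "RR", "RR", "RR", "LL"]
labels = [True, True, True, False, True, False, False, True]

result = significant_leaves(leaf_ids, labels)
print(result)
```

With a boosted model of several trees, the same counting would be repeated per tree, since each tree assigns its own leaf to every row.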

Running H2O in Databricks

I am trying to run H2O in Databricks. However, when I run hc = pysparkling.H2OContext.getOrCreate(spark), I get the following error: java.lang.AbstractMethodError. Does anyone know what the problem could be?
Category: Data Science

AutoML for categorical feature encoding

I have an input dataset with more than 100 variables, of which around 80% are categorical. Some variables, such as gender and country, can be one-hot encoded, but I also have a few variables whose values have an inherent order, such as rating (very good, good, bad, etc.). Is there any AutoML approach that can choose the encoding based on the variable type? For example, I would like to provide the below …
Category: Data Science
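
The distinction the question above draws, between nominal variables that suit one-hot encoding and ordered variables that suit an integer rank, can be shown with a small pure-Python sketch. The helper names `one_hot` and `ordinal` are hypothetical, the sample values are invented, and the level order for the rating has to be supplied by hand; nothing here is an AutoML feature.

```python
def one_hot(values):
    """Encode nominal values as 0/1 indicator columns (one per category)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def ordinal(values, order):
    """Encode ordered values as integer ranks following the given order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


# Invented sample data, for illustration only.
countries = ["Germany", "India", "Germany"]
ratings = ["good", "very good", "bad"]

encoded_countries, country_columns = one_hot(countries)
encoded_ratings = ordinal(ratings, order=["bad", "good", "very good"])

print(country_columns, encoded_countries)
print(encoded_ratings)
```

The ordinal mapping keeps "bad" below "good" below "very good", which a one-hot encoding would discard.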

H2O Python H2OModelSelectionEstimator

I want to try H2O's Model Selection function in Python, but cannot import it for some reason. The following code failed: from h2o.estimators import H2OModelSelectionEstimator. Error message: cannot import name 'H2OModelSelectionEstimator' from 'h2o.estimators'. Other H2O estimators like H2OGeneralizedLinearEstimator worked fine for me, though. Documentation: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/model_selection.html
Topic: h2o glm python
Category: Data Science

Multi-class classification: unbalanced data, good testing results, poor prediction results

I have an unbalanced dataset with 11 classes, where one class is 30% and the rest are between 5% and 12%. I am not a hardcore programmer, so I am using the product from https://www.h2o.ai/. I used GBM and DRF with the option to balance the classes, and the training results are great (98-99% precision and recall) as per the confusion matrix. However, when I test it on the validation set, the only class where I get decent accuracy is the class …
Category: Data Science
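
For the situation described above, it can help to look at recall per class on the held-out set rather than at a single overall score, since a dominant class can hide weak minority classes. The sketch below is a plain-Python illustration with invented labels; `per_class_recall` is a hypothetical helper, not an H2O function.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Invented validation labels: class "a" dominates.
y_true = ["a", "a", "a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "a", "a", "b", "a", "a"]

recall = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(recall)
print(accuracy)
```

Here the overall accuracy looks moderate while class "c" is never predicted correctly, which is the kind of gap a confusion matrix on the validation set would also expose.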

How can I prevent overfitting?

I hope this finds you well! I am trying to build a model to classify customers by propensity to buy, but I cannot get rid of overfitting. My approach is the following: I created the training dataset with an unbalanced approach, and it now has a target 1 of 6% and a total of 6.755 rows and 252 columns. On the other hand, the test dataset has 313.587 rows and target 1 is only 34 of the cases (really low %). …
Category: Data Science

Which loss functions does h2o.gbm use by default?

The GBM implementation of the h2o package only allows the user to specify a loss function via the distribution argument, which defaults to multinomial for categorical response variables and gaussian for numerical response variables. According to the documentation, the loss functions are implied by the distributions. But I need to know which loss functions are used, and I can't find that anywhere in the documentation. I'm guessing it's the MSE for gaussian and cross-entropy for multinomial - does anybody here …
Category: Data Science
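
The two losses the asker guesses at above can be written out in a few lines. This sketch only spells out what mean squared error and multinomial cross-entropy compute; it does not confirm what h2o.gbm does internally, and the function names and sample numbers are invented for illustration.

```python
import math


def squared_error(y_true, y_pred):
    """Mean squared error between numeric targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(true_class_indices, predicted_probs):
    """Mean negative log-probability assigned to the true class."""
    total = sum(
        math.log(probs[k])
        for k, probs in zip(true_class_indices, predicted_probs)
    )
    return -total / len(true_class_indices)


# Invented numbers, for illustration only.
mse = squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
ce = cross_entropy([0, 1], [[0.5, 0.5], [0.25, 0.75]])

print(mse)
print(ce)
```

Both quantities shrink toward zero as predictions approach the targets, which is why each is a natural candidate for the loss implied by its distribution.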

h2o much faster than neuralnet (in R)

I’m a novice to machine learning. I've been trying out different neural network implementations in R, including the neuralnet package and the deeplearning function of the h2o package. For neuralnet, the default setting is one hidden layer with one hidden neuron. With this setting, the model takes several minutes to fit to my data. In the h2o package, the default is two layers with 200 neurons each, and the model takes only a few seconds. How is this possible? Are …
Category: Data Science

H2O deep learning model performance

I am discovering H2O deeplearning and I would like your opinion on how my model performs on a classification problem. Do you think my model is overfitting? dl_fit2 <- h2o.deeplearning(x = predictors, y = response, training_frame = train, validation_frame = valid, epochs = 200, score_validation_samples=10000, score_duty_cycle=0.025, activation = "RectifierWithDropout", hidden = c(80, 10, 80), hidden_dropout_ratios = c(0.2, 0.2, 0.2), loss = "CrossEntropy", rate=0.01, rate_annealing=2e-6, adaptive_rate = FALSE, momentum_start = 0.2, momentum_ramp = 1e7, momentum_stable = …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.