Why is H2O using only a part of the data?

Ben

May 17, 2022, 08:19

I have this dataframe:

> head(df_clas_sn)
   country serial_no_of_generator_1 serial_no_of_generator_2 serial_no_of_generator_3 unit_type
11 Germany                       XY                       01                   0620      ORiP
12 India                         XY                       01                   0631      ORiP
13 Germany                       XY                       02                   0683      ORiP
14 Germany                       XZ                       02                   0735      KRIT
15 England                       XY                       03                   0844      KRIT
16 Germany                       XZ                       05                   0243      ORiP
   position_in_unit hours_balance status_code
11                Y                2771           1
12               DE                3783           1
13                G                1267           1
14               DE                7798           1
15                G                1136           1
16                M                6197           1

with these dimensions:

> dim(df_clas_sn)
[1] 4806    8

and I'm running H2O on it:

results2 <- lares::h2o_automl(df = df_clas_sn
                             , y = status_code
                             , seed = 123
                             , max_time = 240
                             , impute = FALSE
                             , center = TRUE
                             , scale = TRUE
                             , max_models = 5
                             , alarm = FALSE)

to predict the status of the devices. The resulting confusion matrix, results2$plots$metrics$conf_matrix, is this (image not reproduced):

I wonder about two things. First, why is the TN rate so incredibly low, almost 0%? Second, although there are almost 5000 observations, the confusion matrix shows that only 1442 were used. Why is that, and how can I ensure the entire dataset is used?

The model seems to be fine:

> results2
Model (1/5): DRF_1_AutoML_2_20220517_100320
Independent Variable: status_code
Type: Classification (2 classes)
Algorithm: DRF
Split: 70% training data (of 4806 observations)
Seed: 123

Test metrics:
   AUC = 0.90643
   ACC = 0.91401
   PRC = 0.91389
   TPR = 1
   TNR = 0.015873
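
A quick sanity check on these numbers, offered as a sketch rather than a confirmed explanation. It assumes the confusion matrix is computed only on the held-out test rows, which the "Split: 70% training data" line suggests but does not state outright. The counts tp, fn, tn and fp are hypothetical: they are not read from the confusion matrix plot, only chosen because they reproduce the reported metrics.

```r
# Figures taken from the post: dim(df_clas_sn) and the model printout.
n_total    <- 4806   # rows in df_clas_sn
train_frac <- 0.70   # "Split: 70% training data"

# Assumption: the confusion matrix covers only the held-out test rows.
n_test <- round(n_total * (1 - train_frac))
n_test  # 1442

# Hypothetical test-set counts, chosen to be consistent with the
# reported metrics; which status_code value counts as "positive" is unknown.
tp <- 1316
fn <- 0
tn <- 2
fp <- 124

metrics <- c(
  ACC = (tp + tn) / (tp + tn + fp + fn),  # accuracy
  PRC = tp / (tp + fp),                   # precision
  TPR = tp / (tp + fn),                   # true positive rate
  TNR = tn / (tn + fp)                    # true negative rate
)
signif(metrics, 5)
```

If this reading holds, the 1442 observations are simply the 30% test split, and only about 126 of them belong to the minority class, so a model that predicts the majority class almost everywhere still reaches roughly 91% accuracy. Running table(df_clas_sn$status_code) on the full data would confirm or refute the imbalance.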
