Why is H2O using only part of the data?
I have this dataframe:
head(df_clas_sn)
   country serial_no_of_generator_1 serial_no_of_generator_2 serial_no_of_generator_3 unit_type
11 Germany                       XY                       01                     0620      ORiP
12   India                       XY                       01                     0631      ORiP
13 Germany                       XY                       02                     0683      ORiP
14 Germany                       XZ                       02                     0735      KRIT
15 England                       XY                       03                     0844      KRIT
16 Germany                       XZ                       05                     0243      ORiP
   position_in_unit hours_balance status_code
11                Y          2771           1
12               DE          3783           1
13                G          1267           1
14               DE          7798           1
15                G          1136           1
16                M          6197           1
with these dimensions:
dim(df_clas_sn)
[1] 4806 8
and I'm running H2O on it:
results2 <- lares::h2o_automl(df = df_clas_sn
, y = status_code
, seed = 123
, max_time = 240
, impute = FALSE
, center = TRUE
, scale = TRUE
, max_models = 5
, alarm = FALSE)
to predict the status of the devices. The resulting confusion matrix, results2$plots$metrics$conf_matrix, is this:
[Figure: confusion matrix plot, image not available]
I wonder about two things. First, why is the TN rate so incredibly low, almost 0 %? Second, although there are almost 5000 observations, the confusion matrix reports only 1442 observations. Why, and how can I ensure that the entire data set is used?
The model seems to be fine:
results2
Model (1/5): DRF_1_AutoML_2_20220517_100320
Independent Variable: status_code
Type: Classification (2 classes)
Algorithm: DRF
Split: 70% training data (of 4806 observations)
Seed: 123
Test metrics:
AUC = 0.90643
ACC = 0.91401
PRC = 0.91389
TPR = 1
TNR = 0.015873
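
A quick sanity check on the printed numbers, as a sketch only: it assumes the "Test metrics" are computed on the hold-out part that the 70% split leaves over, and that "negative" means whichever class the summary treats as negative. The variable names are mine; the values are copied from the output above.

```r
# Figures copied from dim(df_clas_sn) and the printed model summary
n_total    <- 4806      # rows in df_clas_sn
train_frac <- 0.70      # "Split: 70% training data"
acc        <- 0.91401   # ACC
tnr        <- 0.015873  # TNR (TPR is reported as 1)

# Rows left over for the hold-out test set (assumption: 1 - train_frac)
n_test <- round(n_total * (1 - train_frac))

# With TPR = 1 every positive row is classified correctly, so
#   correct = P + tnr * N   and   P + N = n_test
# which rearranges to N = (n_test - correct) / (1 - tnr).
n_correct  <- round(acc * n_test)
n_negative <- round((n_test - n_correct) / (1 - tnr))
n_true_neg <- round(tnr * n_negative)

print(c(test_rows = n_test, negatives = n_negative, true_negatives = n_true_neg))
```

Under those assumptions this gives 1442 test rows, which matches the count in the confusion matrix, and about 126 negative rows of which about 2 are predicted correctly. If that reading is right, the negative class makes up under 10% of the test rows, and a model that almost always predicts the positive class would produce exactly this pattern of high ACC with near-zero TNR.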