Isolation Forest in R using Solitude - From the results how can I identify the anomalous records

Question

TheGoat

January 21, 2022, 23:54

I am trying to use the Isolation Forest algorithm from the solitude package to identify anomalous rows in my data.

I'm using the examples in the documentation to learn about the algorithm. This example uses the Pima Indians Diabetes dataset.

At the end of the example, it produces a data frame with the columns id, average_depth and anomaly_score, sorted from the highest score to the lowest.

How can I tie the results of the model back to the original dataset, so that I can see the rows with the highest anomaly scores?

Here's the example from the package documentation:

library(solitude)
library(tidyverse)
library(mlbench)

data(PimaIndiansDiabetes)
PimaIndiansDiabetes = as_tibble(PimaIndiansDiabetes)
PimaIndiansDiabetes

splitter   = PimaIndiansDiabetes %>%
  select(-diabetes) %>%
  rsample::initial_split(prop = 0.5)
pima_train = rsample::training(splitter)
pima_test  = rsample::testing(splitter)

iso = isolationForest$new()
iso$fit(pima_train)

scores_train = pima_train %>%
  iso$predict() %>%
  arrange(desc(anomaly_score))

scores_train

umap_train = pima_train %>%
  scale() %>%
  uwot::umap() %>%
  setNames(c("V1", "V2")) %>%
  as_tibble() %>%
  rowid_to_column() %>%
  left_join(scores_train, by = c("rowid" = "id"))

umap_train

umap_train %>%
  ggplot(aes(V1, V2)) +
  geom_point(aes(size = anomaly_score))

scores_test = pima_test %>%
  iso$predict() %>%
  arrange(desc(anomaly_score))

scores_test
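
For reference, here is a minimal sketch of one way the scores could be tied back to the rows they describe. It assumes that the id column returned by iso$predict() is the row position in the data frame passed to predict, which is what the join on rowid and id in the documentation example suggests but is not confirmed here. The names pima and scored are illustrative only, and the model is fitted on the full table rather than on a train/test split to keep the sketch short.

```r
library(solitude)
library(dplyr)
library(tibble)
library(mlbench)

data(PimaIndiansDiabetes)

# Drop the label column, as in the documentation example.
pima = PimaIndiansDiabetes %>%
  select(-diabetes) %>%
  as_tibble()

iso = isolationForest$new()
iso$fit(pima)

# ASSUMPTION: `id` in the prediction output is the row position of the input.
# Give the original rows the same key, join the scores on it, then sort.
scored = pima %>%
  rowid_to_column("id") %>%
  left_join(iso$predict(pima), by = "id") %>%
  arrange(desc(anomaly_score))

# Original columns next to average_depth and anomaly_score, most anomalous first.
head(scored, 10)
```

Under the same assumption, the pattern would carry over to pima_train or pima_test, with id referring to the row position within whichever data frame was scored.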

Topic isolation-forest unsupervised-learning r machine-learning

Category Data Science
