Isolation Forest in R using solitude: how can I identify the anomalous records from the results?

I am trying to use the Isolation Forest algorithm in the solitude package to identify anomalous rows in my data.

I'm using the examples in the documentation to learn about the algorithm. This example uses the Pima Indians Diabetes dataset.

At the end of the example it provides a data frame with the columns id, average_depth and anomaly_score, sorted from highest score to lowest.

How can I tie back the results of the model to the original dataset to see the rows with the highest anomaly score?

Here's the example from the package documentation:

library(solitude)
library(tidyverse)
library(mlbench)

data(PimaIndiansDiabetes)
PimaIndiansDiabetes = as_tibble(PimaIndiansDiabetes)
PimaIndiansDiabetes

splitter   = PimaIndiansDiabetes %>%
  select(-diabetes) %>%
  rsample::initial_split(prop = 0.5)
pima_train = rsample::training(splitter)
pima_test  = rsample::testing(splitter)

iso = isolationForest$new()
iso$fit(pima_train)

scores_train = pima_train %>%
  iso$predict() %>%
  arrange(desc(anomaly_score))

scores_train

umap_train = pima_train %>%
  scale() %>%
  uwot::umap() %>%
  setNames(c("V1", "V2")) %>%
  as_tibble() %>%
  rowid_to_column() %>%
  left_join(scores_train, by = c("rowid" = "id"))

umap_train

umap_train %>%
  ggplot(aes(V1, V2)) +
  geom_point(aes(size = anomaly_score))

scores_test = pima_test %>%
  iso$predict() %>%
  arrange(desc(anomaly_score))

scores_test
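
One possible way to tie the scores back to the original rows is sketched below. It is not confirmed by the package documentation quoted here. It assumes that the id column returned by predict() is the row position of the data frame that was passed in, which is what the example's own left_join on rowid suggests. The names pima and pima_scored are made up for illustration, and the forest is fitted on the full data rather than a train/test split to keep the sketch short.

```r
library(solitude)
library(dplyr)
library(tibble)
library(mlbench)

data(PimaIndiansDiabetes)

# Drop the label column so only the numeric predictors are scored.
pima = PimaIndiansDiabetes %>%
  as_tibble() %>%
  select(-diabetes)

# Fit the isolation forest with the package defaults.
iso = isolationForest$new()
iso$fit(pima)

# One row of scores per input row: id, average_depth, anomaly_score.
scores = iso$predict(pima)

# ASSUMPTION: `id` is the row position within the data given to predict().
# Give the original data a matching `id` column, then join the scores on it.
pima_scored = pima %>%
  rowid_to_column("id") %>%
  left_join(as_tibble(scores), by = "id") %>%
  arrange(desc(anomaly_score))

# The original rows with the highest anomaly scores come first.
head(pima_scored, 10)
```

If the assumption holds, the same join should work for pima_train or pima_test, as long as the row ids are added to the same data frame that was passed to predict() and before any reordering.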

