Isolation Forest in R using Solitude - From the results how can I identify the anomalous records
I am trying to use the Isolation Forest algorithm in the Solitude package to identify anomalous rows in my data.
I'm using the examples in the documentation to learn about the algorithm. The example below uses the Pima Indians Diabetes dataset.
At the end of the example, it returns a data frame with the columns id, average_depth and anomaly_score, sorted from the highest score to the lowest.
How can I tie back the results of the model to the original dataset to see the rows with the highest anomaly score?
Here's the example from the package documentation:
library(solitude)
library(tidyverse)
library(mlbench)
data(PimaIndiansDiabetes)
PimaIndiansDiabetes = as_tibble(PimaIndiansDiabetes)
PimaIndiansDiabetes
# drop the label column and split the rows 50/50 into train and test
splitter = PimaIndiansDiabetes %>%
  select(-diabetes) %>%
  rsample::initial_split(prop = 0.5)
pima_train = rsample::training(splitter)
pima_test = rsample::testing(splitter)
# fit the isolation forest on the training rows
iso = isolationForest$new()
iso$fit(pima_train)
# score the training rows, highest anomaly score first
scores_train = pima_train %>%
  iso$predict() %>%
  arrange(desc(anomaly_score))
scores_train
# 2-D UMAP embedding of the training rows, joined to their scores
umap_train = pima_train %>%
  scale() %>%
  uwot::umap() %>%
  setNames(c("V1", "V2")) %>%
  as_tibble() %>%
  rowid_to_column() %>%
  left_join(scores_train, by = c("rowid" = "id"))
umap_train
umap_train %>%
  ggplot(aes(V1, V2)) +
  geom_point(aes(size = anomaly_score))
# score the held-out test rows
scores_test = pima_test %>%
  iso$predict() %>%
  arrange(desc(anomaly_score))
scores_test
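To make the goal concrete, here is a minimal sketch of the result I'm after. It assumes that the id column returned by iso$predict() is simply the row position of the data passed to it, which is what the join on rowid in the documentation example suggests, but I have not confirmed it. The names pima, scores and pima_scored are my own placeholders, and for brevity the sketch fits on the full dataset rather than on a train/test split.

```r
library(solitude)
library(dplyr)
library(tibble)

# Load the data and drop the label column, as in the documentation example
data(PimaIndiansDiabetes, package = "mlbench")
pima <- PimaIndiansDiabetes %>%
  as_tibble() %>%
  select(-diabetes)

# Fit the isolation forest and score the same rows
iso <- isolationForest$new()
iso$fit(pima)
scores <- as_tibble(iso$predict(pima))  # columns: id, average_depth, anomaly_score

# ASSUMPTION: `id` is the row position in the data passed to predict().
# Give the original rows an explicit `id` and join the scores onto them.
pima_scored <- pima %>%
  rowid_to_column("id") %>%
  left_join(scores, by = "id") %>%
  arrange(desc(anomaly_score))

# Original rows with the highest anomaly scores
head(pima_scored, 10)
```

If that assumption holds, pima_scored has one row per original record, with the original columns next to average_depth and anomaly_score. The same join should then work for pima_train and pima_test with their respective scores.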
Topic isolation-forest unsupervised-learning r machine-learning
Category Data Science