--- title: "Using 'outForest'" date: "`r Sys.Date()`" bibliography: "biblio.bib" link-citations: true output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using 'outForest'} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE, fig.width = 6, fig.height = 4, fig.align = "center" ) ``` ## Overview {outForest} is a random forest-based implementation of the method described in Chapter 7.1.2 ("Regression model based anomaly detection") of [@Chandola2009]. Each numeric variable is regressed onto all other variables using a random forest. If the scaled absolute difference between observed value and out-of-bag prediction is suspiciously large (e.g., more than three times the RMSE of the out-of-bag predictions), then a value is considered an outlier. After identification of outliers, they can be replaced, for instance, by predictive mean matching from the non-outliers. The method can be viewed as a multivariate extension of a basic univariate outlier detection method, in which a value is considered an outlier if it deviates from the mean by more than, say, three times the standard deviation. In the multivariate case, instead of comparing a value with the *overall mean*, rather the difference to the *conditional mean* is considered. {outForest} estimates this conditional mean by a random forest. Specifically, the outlier score of the $i$-th observed value $x_{ij}$ of variable $j$ is defined as $$ s_{ij} = \frac{x_{ij} - \text{pred}_{ij}}{\text{rmse}_j}, $$ where $\text{pred}_{ij}$ is the corresponding out-of-bag prediction of the $j$-th random forest with RMSE $\text{rmse}_j$. If $|s_{ij}| > L$ with threshold $L$, then $x_{ij}$ is considered an outlier. Why using random forests for calculating the conditional means? They often work well without parameter tuning, outliers in the input variables are no issue and the out-of-bag mechanism helps to provide fair outlier scores. {outForest} utilizes the high-performance random forest implementation {ranger} [@wright2017]. Since it does not allow for missing values, any missing value is first being imputed by chained random forests implemented in {missRanger}. In the examples below, we will meet different functions from the {outForest} package: - `outForest()`: This is the main function. It identifies and replaces outliers in numeric variables of a data frame. - `print()`, `summary()` and `plot()`: To inspect outlier information. - `outliers()` and `Data()`: To extract information on outliers and the data with replaced outliers. - `generateOutliers()`: To add nasty outliers to a data set. - `predict()`: To apply a fitted "outForest" object to fresh data. ## Installation ```r # From CRAN install.packages("outForest") # Development version devtools::install_github("mayer79/outForest") ``` ## Usage We first generate a data set with two multivariate outliers. Then, we use `outForest()` to find them. ``` {r} library(outForest) # Create data with multivariate outlier set.seed(3) t <- seq(0, pi, by = 0.01) dat <- data.frame(x = cos(t), y = sin(t) + runif(length(t), -0.1, 0.1)) dat[c(100, 200), ] <- cbind(c(-0.5, 0.5), c(0.4, 0.4)) plot(y ~ x, data = dat) # Let's run outForest on that data ch <- outForest(dat) # What outliers did we find? outliers(ch) # Bingo! How does the fixed data set looks like? 
Note that `outForest()` offers a `...` argument to pass options to its workhorse random forest implementation `ranger()`, e.g., `num.trees` or `mtry`. How would we use its "extra trees" variant with 50 trees? As example data, we add a couple of outliers to the famous iris flower data set.

```{r}
head(irisWithOutliers <- generateOutliers(iris, p = 0.02))

out <- outForest(
  irisWithOutliers, splitrule = "extratrees", num.trees = 50, verbose = 0
)

# The worst outliers
head(outliers(out))

# Summary of outliers
summary(out)

# Basic plot of the number of outliers per variable
plot(out)

# Basic plot of the scores of the outliers per variable
plot(out, what = "scores")

# The fixed data
head(Data(out))
```

## Pipe

`outForest()` can be combined with the pipe:

```r
irisWithOutliers |>
  outForest(verbose = 0) |>
  Data() |>
  head()
```

## Out-of-sample application

Once an "outForest" object has been fitted with the option `allow_predictions = TRUE`, it can be used to find and replace outliers in new data without refitting. Note that random forests can be huge, so use this option with care. Further note that, currently, missing values in the new data are allowed in only a single variable to be checked.

```{r}
out <- outForest(iris, allow_predictions = TRUE, verbose = 0)
iris1 <- iris[1:2, ]
iris1$Sepal.Length[1] <- -1
pred <- predict(out, newdata = iris1)
outliers(pred)
Data(pred)
```

## The formula interface

By default, `outForest()` uses all columns in the data set to check all numeric columns. To override this behavior, use its formula interface: The left hand side specifies the variables to be checked (variable names separated by a `+`), while the right hand side lists the variables used for checking.

If you want to check only the variable `Sepal.Length` based on `Species`, use this syntax:

```{r}
out <- outForest(irisWithOutliers, Sepal.Length ~ Species, verbose = 0)
summary(out)
```

If you want to prevent `outForest()` from identifying and replacing outliers in the response variable of a subsequent model, say `Sepal.Length`, subtract that variable from the left hand side:

```{r}
out <- outForest(irisWithOutliers, . - Sepal.Length ~ ., verbose = 0)
summary(out)
```

## Too many values are identified as outliers. Crap!

For large data sets, just by chance, many values can surpass the default threshold of 3. To reduce the number of outliers, the threshold can be increased. Alternatively, the number of outliers can be limited by the two arguments `max_n_outliers` and `max_prop_outliers`. For instance, if at most three outliers are to be identified, set `max_n_outliers = 3`. If at most 1% of the values are to be declared outliers, set `max_prop_outliers = 0.01`.

```{r}
outliers(outForest(irisWithOutliers, max_n_outliers = 3, verbose = 0))
```

## I don't want outliers to be replaced!

By default, `outForest()` replaces outliers by predictive mean matching on the out-of-bag predictions. Alternatively, you can set the `replace` argument to a different value:

- `"no"`: Outliers are not replaced.
- `"predicted"`: Outliers are replaced by the out-of-bag predictions (without predictive mean matching).
- `"NA"`: Outliers are replaced by missing values.

So if you would like to replace outliers by NAs, write:

```{r}
out <- outForest(irisWithOutliers, replace = "NA", verbose = 0)
head(Data(out))
```

## The algorithm takes too much time. What can I do?

By default, `outForest()` fits a random forest with 500 trees per numeric variable, so the overall process can take very long. Here are a few tweaks to make things faster (combined in the sketch below):

- Use fewer trees, e.g., by setting `num.trees = 50`.
- Use a smaller `mtry`, e.g., `mtry = 2`.
- Use shorter trees, e.g., by setting `max.depth = 6`.
- Use larger leaves, e.g., `min.node.size = 10000`.
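All of these are `ranger()` tuning parameters. For illustration, a sketch combining them; the values are made up, and on a small data set like iris the gain is negligible (a `min.node.size` as large as 10000 only makes sense for big data, so a smaller value is used here):

```r
out_fast <- outForest(
  irisWithOutliers,
  num.trees = 50,      # fewer trees
  mtry = 2,            # fewer split candidates per node
  max.depth = 6,       # shorter trees
  min.node.size = 20,  # larger leaves (use much larger values for big data)
  verbose = 0
)
summary(out_fast)
```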
## References