A classical but still popular approach to detecting univariate outliers
is the boxplot method introduced by Tukey (1977). The boxplot is a simple
graphical tool for displaying information about continuous univariate data
(e.g., median, lower and upper quartile). Outliers are defined as values
lying more than \(1.5 * IQR\) below the 1st quartile (\(Q_{25}\)) or above
the 3rd quartile (\(Q_{75}\)). The strength of Tukey’s
method is that it makes no distributional assumptions and is thus also
applicable to skewed or non-mound-shaped data (Seo 2006). Nevertheless,
this method tends to flag frequently occurring measurements, which are
then falsely interpreted as *true* outliers.
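As a minimal sketch of Tukey's rule in base R: the helper name `tukey_outliers` and the toy vector are illustrative assumptions and not part of `dataquieR`. The quartiles come from R's default `quantile()` definition, which may differ slightly from the hinges used when drawing a boxplot.

```r
# Flag values lying more than k * IQR below Q25 or above Q75 (k = 1.5 for Tukey).
tukey_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)  # toy data with one extreme value
x[tukey_outliers(x)]                   # values outside the fences
```

For this toy vector the fences are -3.5 and 14.5, so only the value 50 is flagged.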

A somewhat more conservative approach for symmetric and/or normal distributions is the 3 standard deviation (SD) method, i.e., any measurement outside the interval \(\bar{x} \pm 3*SD\) is considered an outlier.
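The 3SD rule can be sketched in a few lines of base R. The helper name `sd3_outliers` and both toy vectors are assumptions made for illustration only.

```r
# Flag values farther than k sample standard deviations from the mean.
sd3_outliers <- function(x, k = 3) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  abs(x - m) > k * s
}

small <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)  # n = 10
large <- c(rep(10, 20), 100)               # n = 21

any(sd3_outliers(small))    # is anything flagged in the small sample?
which(sd3_outliers(large))  # position of flagged values in the larger sample
```

Because the extreme value inflates both the mean and the SD, the value 50 in the small sample is not flagged, while the value 100 in the larger sample is. This illustrates why the rule is more conservative than Tukey's fences.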

Neither of the methods mentioned above is ideally suited to skewed
distributions. As many biomarkers, such as laboratory measurements,
have skewed distributions, the methods above may be insufficient.
The approach of Hubert and Vandervieren (2008) adjusts the
boxplot for the skewness of the distribution. This approach is
implemented in several R packages such as `robustbase`, which
is used in the `dataquieR` implementation.
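The adjusted fences of Hubert and Vandervieren (2008) can be sketched as follows. The helper `adjusted_fences` is a hypothetical name; it takes the medcouple (a robust skewness measure between -1 and 1) as an input. The sketch uses `robustbase::mc()` to compute the medcouple if that package is installed; how `dataquieR` calls `robustbase` internally is not shown here.

```r
# Fences of the adjusted boxplot for a given medcouple value `mc`.
# For mc = 0 these reduce to the ordinary Tukey fences.
adjusted_fences <- function(x, mc, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  if (mc >= 0) {
    c(lower = q[1] - k * exp(-4 * mc) * iqr,
      upper = q[2] + k * exp(3 * mc) * iqr)
  } else {
    c(lower = q[1] - k * exp(-3 * mc) * iqr,
      upper = q[2] + k * exp(4 * mc) * iqr)
  }
}

set.seed(1)
x <- rexp(200)  # right-skewed toy data

# Guarded call: falls back to mc = 0 (plain Tukey fences) without robustbase.
mc_x <- if (requireNamespace("robustbase", quietly = TRUE)) robustbase::mc(x) else 0
f <- adjusted_fences(x, mc_x)
x[x < f["lower"] | x > f["upper"]]
```

For right-skewed data (positive medcouple) the upper fence moves outward and the lower fence moves inward, so fewer values in the long right tail are flagged than with the unadjusted boxplot.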

Another, completely heuristic approach is also included to identify
outliers. It is based on the assumption that the distances
between measurements from the same underlying distribution should be
homogeneous. To understand this approach: a) consider an ordered
sequence of all measurements; b) the distances between these measurements
are calculated; c) the occurrence of larger distances between two
neighboring measurements may then indicate a distortion of the data. For
the *heuristic* definition of a **large distance**,
\(1*\sigma\) has been chosen.
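The gap heuristic can be sketched in base R as follows. Only the criterion "gap between neighboring sorted values larger than \(1*\sigma\)" is taken from the description above. The helper name `sigma_gap_outliers` and the rule for which side of a gap is flagged (values beyond the gap, on the side away from the median) are assumptions of this sketch and may differ from the `dataquieR` implementation.

```r
# Flag values separated from the bulk of the data by a gap larger than k * SD.
sigma_gap_outliers <- function(x, k = 1) {
  ok <- !is.na(x)
  xs <- sort(x[ok])
  flagged <- rep(FALSE, length(x))
  big <- which(diff(xs) > k * sd(xs))  # indices of large gaps in the sorted data
  if (length(big) == 0) return(flagged)

  med <- median(xs)
  left <- xs[big]        # value just below each large gap
  right <- xs[big + 1]   # value just above each large gap
  upper_cut <- left[left >= med]    # gaps located above the median
  lower_cut <- right[right <= med]  # gaps located below the median

  if (length(upper_cut) > 0) flagged[ok & x > min(upper_cut)] <- TRUE
  if (length(lower_cut) > 0) flagged[ok & x < max(lower_cut)] <- TRUE
  flagged
}

x <- c(1:9, 50)
which(sigma_gap_outliers(x))  # position of values beyond a large gap
```

In the toy vector, the gap between 9 and 50 exceeds one standard deviation of the data, so the last value is flagged. Evenly spaced data such as `1:10` produce no flags.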

In this way, the `acc_robust_univariate_outlier` function
is an implementation of the Univariate
outliers indicator, which belongs to the Unexpected distributions domain in the
Accuracy dimension.

For more details, see the user’s manual and the source code.

```
acc_robust_univariate_outlier(
  resp_vars = NULL,
  label_col = NULL,
  study_data = sd1,
  meta_data = md1,
  exclude_roles = NULL,
  n_rules = 4,
  max_non_outliers_plot = 10000
)
```

The function has the following arguments:

- **study_data**: mandatory, the data frame containing the measurements.
- **meta_data**: mandatory, the data frame containing the study data’s metadata.
- **resp_vars**: mandatory, a character specifying the measurement variable of interest. The variable must be of float or integer type.
- **label_col**: optional, the column in the metadata data frame containing the labels of all the variables in the study data.
- **exclude_roles**: optional, a character (vector) of variable roles not included.
- **n_rules**: optional, the number of rules that must be violated for a value to be classified as an outlier.
- **max_non_outliers_plot**: optional, an integer (default = 10000) specifying the maximum number of observations (not classified as outliers) used in the plots, relevant for large data to reduce plot size. **CAVEAT:** if this argument is used, the ggplot output will contain fewer observations than the original data.

The function is designed for unimodal data only and does not use thresholds other than defined by the applied methods. See Description for details.

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

For the `acc_robust_univariate_outlier` function, the
columns `DATA_TYPE` and `MISSING_LIST` in the
metadata are relevant.

Row | VAR_NAMES | LABEL | MISSING_LIST | DATA_TYPE |
---|---|---|---|---|
3 | v00002 | SEX_0 | NA | integer |
4 | v00003 | AGE_0 | NA | integer |
6 | v01003 | AGE_1 | NA | integer |
7 | v01002 | SEX_1 | NA | integer |
15 | v00109 | ARM_CIRC_DISC_0 | 99980 \| 99981 \| 99982 \| 99983 \| 99984 \| 99985 \| 99986 \| 99987 \| 99988 \| 99989 \| 99990 \| 99991 \| 99992 \| 99993 \| 99994 \| 99995 | integer |
16 | v00010 | ARM_CUFF_0 | 99980 \| 99987 | integer |
19 | v00013 | EXAM_DT_0 | NA | datetime |
24 | v00017 | LAB_DT_0 | NA | datetime |
26 | v00018 | EDUCATION_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
27 | v01018 | EDUCATION_1 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
31 | v00022 | EATING_PREFS_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
32 | v00023 | MEAT_CONS_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
33 | v00024 | SMOKING_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
34 | v00025 | SMOKE_SHOP_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
38 | v00029 | PREGNANT_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |

This example specifies the analyses of univariate outliers for the complete dataset:

```
univ_outlier_1 <- acc_robust_univariate_outlier(
  resp_vars = NULL,
  label_col = "LABEL",
  study_data = sd1,
  meta_data = md1
)
```

The summary table of this function is accessed using
`univ_outlier_1$SummaryTable`.

Variables | Mean | No.records | SD | Median | Skewness | Tukey (N) | 3SD (N) | Hubert (N) | Sigma-gap (N) | NUM_acc_ud_outlu | Outliers, low (N) | Outliers, high (N) | GRADING | PCT_acc_ud_outlu |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AGE_0 | 49.91 | 2940 | 4.42 | 50.00 | 0.00 | 11 | 2 | 11 | 0 | 0 | 0 | 0 | 0 | 0.00 |
AGE_1 | 49.87 | 2940 | 4.43 | 50.00 | 0.00 | 11 | 1 | 11 | 0 | 0 | 0 | 0 | 0 | 0.00 |
SBP_0 | 126.52 | 2561 | 9.61 | 127.00 | 0.00 | 12 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 0.00 |
DBP_0 | 81.29 | 2544 | 9.21 | 81.00 | 0.00 | 14 | 3 | 14 | 0 | 0 | 0 | 0 | 0 | 0.00 |
GLOBAL_HEALTH_VAS_0 | 5.03 | 2618 | 2.92 | 5.00 | 0.02 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ARM_CIRC_0 | 25.03 | 2657 | 3.96 | 25.00 | 0.00 | 4 | 9 | 4 | 0 | 0 | 0 | 0 | 0 | 0.00 |
CRP_0 | 2.89 | 2699 | 1.81 | 2.59 | 0.16 | 66 | 27 | 12 | 0 | 0 | 0 | 0 | 0 | 0.00 |
BSG_0 | 14.86 | 2686 | 12.13 | 11.00 | 0.33 | 93 | 42 | 93 | 1 | 1 | 0 | 1 | 1 | 0.04 |
DEV_NO_0 | 2.76 | 2692 | 1.35 | 3.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 |
N_CHILD_0 | 2.50 | 2336 | 1.53 | 2.00 | 0.33 | 32 | 8 | 173 | 0 | 0 | 0 | 0 | 0 | 0.00 |
N_INJURIES_0 | 4.59 | 2199 | 2.42 | 4.00 | 0.20 | 38 | 20 | 30 | 0 | 0 | 0 | 0 | 0 | 0.00 |
N_BIRTH_0 | 3.46 | 1099 | 1.77 | 3.00 | 0.20 | 27 | 5 | 30 | 1 | 1 | 0 | 1 | 1 | 0.09 |
N_ATC_CODES_0 | 2.26 | 2058 | 2.73 | 1.00 | 0.50 | 121 | 39 | 0 | 2 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_1_0 | 3.04 | 2248 | 1.76 | 3.00 | 0.00 | 34 | 12 | 34 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_2_0 | 2.99 | 2197 | 1.70 | 3.00 | 0.00 | 24 | 5 | 24 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_3_0 | 3.01 | 2184 | 1.72 | 3.00 | 0.00 | 26 | 7 | 26 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_4_0 | 3.00 | 2143 | 1.72 | 3.00 | 0.00 | 32 | 8 | 32 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_5_0 | 6.02 | 2074 | 2.37 | 6.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_6_0 | 5.95 | 2048 | 2.37 | 6.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_7_0 | 6.04 | 2068 | 2.40 | 6.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_8_0 | 5.89 | 2013 | 2.40 | 6.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 |

The respective plot list is obtained by
`univ_outlier_1$SummaryPlotList`:

Only selected output is shown to reduce the size of this file.

In this example, the plot size (or more accurately, the number of
plotted observations) is reduced by setting
`max_non_outliers_plot = 500`. The function samples n = 500
observations from those not classified as outliers. This can be beneficial to
reduce plotting times and plot size in rendered documents.
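The idea of thinning only the non-outliers before plotting can be sketched in base R. The helper `thin_for_plot`, its arguments, and the toy data are hypothetical; the sketch only mirrors the behaviour described above and is not the package's internal code.

```r
# Keep every flagged observation, but at most `max_non_outliers_plot`
# randomly sampled non-flagged observations.
thin_for_plot <- function(values, is_outlier, max_non_outliers_plot = 10000) {
  non_out <- which(!is_outlier)
  if (length(non_out) > max_non_outliers_plot) {
    non_out <- sample(non_out, max_non_outliers_plot)
  }
  keep <- sort(c(which(is_outlier), non_out))
  data.frame(value = values[keep], outlier = is_outlier[keep])
}

set.seed(42)
v <- c(rnorm(2000), 15, -12)  # toy data with two extreme values appended
flag <- abs(v) > 10           # stand-in for the outlier classification
plot_df <- thin_for_plot(v, flag, max_non_outliers_plot = 500)
table(plot_df$outlier)
```

All flagged points remain visible, while the bulk of the data is represented by a random subset, which is why the resulting plot contains fewer observations than the original data.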

```
univ_outlier_2 <- acc_robust_univariate_outlier(
  resp_vars = NULL,
  label_col = "LABEL",
  study_data = sd1,
  meta_data = md1,
  max_non_outliers_plot = 500
)
```

Statistical outliers do not necessarily represent implausible measurements. It is up to the user how outliers are handled.

1. Select all variables of type float in the study data
2. Remove missing codes from the study data (if defined in the metadata)
3. Remove measurements deviating from limits defined in the metadata
4. Identify outliers according to the approaches of Tukey (Tukey 1977), the 3SD method (Saleem et al. 2021), Hubert (Hubert and Vandervieren 2008), and SigmaGap (heuristic)
5. Generate an output data frame that indicates the number of possible outliers, the direction of deviations (too low, too high) for all methods, and a summary score that sums up the deviations of the different rules
6. Generate a scatter plot for all examined variables, flagging observations according to the number of violated rules (step 5).
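The way missing codes are removed and rule violations are summed can be sketched as follows. For brevity the sketch applies only two of the four rules (Tukey and 3SD); the helper `count_violations`, the toy vector, and the threshold value are assumptions for illustration and do not reproduce the package's internal code.

```r
# Count, per observation, how many outlier rules are violated.
count_violations <- function(x, missing_codes = numeric(0)) {
  x[x %in% missing_codes] <- NA  # replace missing codes by NA

  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  tukey <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
  sd3 <- abs(x - mean(x, na.rm = TRUE)) > 3 * sd(x, na.rm = TRUE)

  rowSums(cbind(tukey = tukey, sd3 = sd3))  # NA for missing observations
}

x <- c(rep(10:14, 5), 40, 99980)  # 99980 acts as a missing code here
n_viol <- count_violations(x, missing_codes = 99980)

n_rules <- 2                 # rules that must be violated to classify as outlier
which(n_viol >= n_rules)     # observations classified as outliers
```

In this toy example the value 40 violates both rules and is classified as an outlier, while the missing code 99980 is excluded before any rule is applied. Without that exclusion, the code itself would dominate the mean and SD.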

This implementation uses several approaches to identify outliers but is not comprehensive, i.e., further outlier detection methods exist.

This function still has some limitations. For example, the argument
`n_rules` currently considers only the number of violated
rules. This functionality will be replaced by the possibility
to select specific outlier rules in a future release. Further, this
implementation can be applied to discrete data elements. In some cases
this will not make sense, i.e., meaningful application depends on
user discretion.

- Data quality Indicator Univariate outliers

Hubert, M., and Vandervieren, E. (2008). An adjusted boxplot for skewed
distributions. Computational Statistics & Data Analysis *52*,
5186–5201.

Saleem, S., Aslam, M., and Shaukat, M.R. (2021). A review and empirical
comparison of univariate outlier detection methods. Pakistan Journal of
Statistics *37*.

Seo, S. (2006). A review and comparison of methods for detecting
outliers in univariate data sets.

Tukey, J.W. (1977). Exploratory data analysis (Addison-Wesley).