A classical but still popular approach to detecting univariate outliers is the boxplot method introduced by Tukey (1977). The boxplot is a simple graphical tool to display information about continuous univariate data (e.g., median, lower and upper quartile). Outliers are defined as values lying more than \(1.5 * IQR\) below the 1st quartile (\(Q_{25}\)) or above the 3rd quartile (\(Q_{75}\)). The strength of Tukey’s method is that it makes no distributional assumptions and is thus also applicable to skewed or non-mound-shaped data (Seo, 2006). Nevertheless, this method tends to flag many ordinary measurements, which are then falsely interpreted as true outliers.
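The rule described above can be sketched in a few lines of base R. This is only an illustration, not the code used by dataquieR; the helper name tukey_outliers and the toy vector are hypothetical, and quartiles are computed with the default settings of quantile(), which may differ slightly from the hinges used by boxplot().

```r
# Minimal sketch of Tukey's boxplot rule (hypothetical helper, base R only).
tukey_outliers <- function(x, coef = 1.5) {
  x <- x[!is.na(x)]
  # 1st and 3rd quartile (default quantile type of R)
  q <- stats::quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - coef * iqr
  upper <- q[2] + coef * iqr
  # return the values outside the fences
  x[x < lower | x > upper]
}

# Toy data with one clearly separated value
x <- c(2, 3, 3, 4, 4, 4, 5, 5, 6, 20)
tukey_result <- tukey_outliers(x)
print(tukey_result)
```

For the toy vector, the quartiles are 3.25 and 5, so the fences are 0.625 and 7.625 and only the value 20 is returned.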
A somewhat more conservative approach for symmetric and/or normal distributions is the 3 standard deviation (SD) method, i.e., any measurement outside the interval \(\bar{x} \pm 3*SD\) is considered an outlier.
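A minimal base R sketch of the 3 SD rule follows. The helper name three_sd_outliers and the toy data are hypothetical and only serve as an illustration; the package's internal code may differ.

```r
# Minimal sketch of the 3 SD rule (hypothetical helper, base R only).
three_sd_outliers <- function(x, k = 3) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- stats::sd(x)
  # return the values outside mean +/- k * SD
  x[abs(x - m) > k * s]
}

# Toy data: 21 values between 9 and 11 plus one extreme value
x <- c(rep(c(9, 10, 11), times = 7), 50)
sd_result <- three_sd_outliers(x)
print(sd_result)
```

Note that the mean and the SD are themselves influenced by extreme values, so in small samples a single extreme value may inflate the SD enough to escape detection.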
Neither of the methods mentioned above is ideally suited to skewed distributions. As many biomarkers, such as laboratory measurements, follow skewed distributions, these methods may be insufficient. The approach of Hubert and Vandervieren (2008) adjusts the boxplot for the skewness of the distribution. This approach is implemented in several R packages such as robustbase, which is used in this implementation of dataquieR.
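For orientation, the fences of the skewness-adjusted boxplot can be written as follows. The symbol \(MC\) (the medcouple, a robust measure of skewness) is not used elsewhere in this text and is introduced here as in the cited paper; the constants are those commonly reported for this method and should be checked against the original publication.

\[
[L, U] =
\begin{cases}
\left[\,Q_{25} - 1.5 \, e^{-4\,MC} \, IQR,\;\; Q_{75} + 1.5 \, e^{3\,MC} \, IQR\,\right] & \text{if } MC \ge 0,\\[4pt]
\left[\,Q_{25} - 1.5 \, e^{-3\,MC} \, IQR,\;\; Q_{75} + 1.5 \, e^{4\,MC} \, IQR\,\right] & \text{if } MC < 0.
\end{cases}
\]

Values outside \([L, U]\) are flagged. For a symmetric distribution (\(MC = 0\)) the interval reduces to the fences of Tukey’s method, \(Q_{25} - 1.5 * IQR\) and \(Q_{75} + 1.5 * IQR\).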
Another, completely heuristic approach is also included to identify outliers. The approach is based on the assumption that the distances between measurements of the same underlying distribution should be homogeneous. To understand this approach: a) consider an ordered sequence of all measurements; b) calculate the distances between all neighboring measurements; c) the occurrence of a larger distance between two neighboring measurements may then indicate a distortion of the data. As the heuristic definition of a large distance, \(1*\sigma\) has been chosen.
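Steps a) to c) can be sketched in base R as shown below. This is one possible reading of the heuristic, not the package's implementation: the helper name sigma_gap_outliers is hypothetical, and the decision to treat everything beyond a large gap (on the side away from the median) as an outlier is an assumption made for this illustration.

```r
# Sketch of the gap heuristic (hypothetical helper, base R only).
sigma_gap_outliers <- function(x, k = 1) {
  x <- sort(x[!is.na(x)])                 # a) ordered sequence
  gaps <- diff(x)                         # b) distances between neighbors
  big <- which(gaps > k * stats::sd(x))   # c) gaps larger than k * sigma
  med <- stats::median(x)
  # Assumption: a large gap below the median cuts off low outliers,
  # a large gap above the median cuts off high outliers.
  below <- big[x[big + 1] <= med]
  above <- big[x[big] >= med]
  low_cut  <- if (length(below) > 0) x[max(below)] else -Inf
  high_cut <- if (length(above) > 0) x[min(above) + 1] else Inf
  list(low = x[x <= low_cut], high = x[x >= high_cut])
}

# Toy data: the gap between 6 and 20 (14) exceeds one SD (about 5.2)
x <- c(2, 3, 3, 4, 4, 4, 5, 5, 6, 20)
gap_result <- sigma_gap_outliers(x)
print(gap_result)
```

In the toy data, all other neighboring distances are 0 or 1, so only the value 20 is separated from the bulk of the measurements.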
In this way, the acc_robust_univariate_outlier function is an implementation of the Univariate outliers indicator, which belongs to the Unexpected distributions domain in the Accuracy dimension.
For more details, see the user’s manual and the source code.
acc_robust_univariate_outlier(
resp_vars = NULL,
label_col = NULL,
study_data = sd1,
meta_data = md1,
exclude_roles = NULL,
n_rules = 4,
max_non_outliers_plot = 10000
)
The function has the following arguments:
The function is designed for unimodal data only and does not use thresholds other than defined by the applied methods. See Description for details.
To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.
For the acc_robust_univariate_outlier function, the columns DATA_TYPE and MISSING_LIST in the metadata are relevant.
Row | VAR_NAMES | LABEL | MISSING_LIST | DATA_TYPE |
---|---|---|---|---|
3 | v00002 | SEX_0 | NA | integer |
4 | v00003 | AGE_0 | NA | integer |
6 | v01003 | AGE_1 | NA | integer |
7 | v01002 | SEX_1 | NA | integer |
15 | v00109 | ARM_CIRC_DISC_0 | 99980 \| 99981 \| 99982 \| 99983 \| 99984 \| 99985 \| 99986 \| 99987 \| 99988 \| 99989 \| 99990 \| 99991 \| 99992 \| 99993 \| 99994 \| 99995 | integer |
16 | v00010 | ARM_CUFF_0 | 99980 \| 99987 | integer |
19 | v00013 | EXAM_DT_0 | NA | datetime |
24 | v00017 | LAB_DT_0 | NA | datetime |
26 | v00018 | EDUCATION_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
27 | v01018 | EDUCATION_1 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
31 | v00022 | EATING_PREFS_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
32 | v00023 | MEAT_CONS_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
33 | v00024 | SMOKING_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
34 | v00025 | SMOKE_SHOP_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
38 | v00029 | PREGNANT_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
This example specifies the analyses of univariate outliers for the complete dataset:
univ_outlier_1 <- acc_robust_univariate_outlier(
resp_vars = NULL,
label_col = "LABEL",
study_data = sd1,
meta_data = md1
)
The summary table of this function is accessed via univ_outlier_1$SummaryTable:
Variables | Mean | No.records | SD | Median | Skewness | Tukey (N) | 3SD (N) | Hubert (N) | Sigma-gap (N) | NUM_acc_ud_outlu | Outliers, low (N) | Outliers, high (N) | GRADING | PCT_acc_ud_outlu |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AGE_0 | 49.91 | 2940 | 4.42 | 50.00 | 0.00 | 11 | 2 | 11 | 0 | 0 | 0 | 0 | 0 | 0.00 |
AGE_1 | 49.87 | 2940 | 4.43 | 50.00 | 0.00 | 11 | 1 | 11 | 0 | 0 | 0 | 0 | 0 | 0.00 |
SBP_0 | 126.52 | 2561 | 9.61 | 127.00 | 0.00 | 12 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 0.00 |
DBP_0 | 81.29 | 2544 | 9.21 | 81.00 | 0.00 | 14 | 3 | 14 | 0 | 0 | 0 | 0 | 0 | 0.00 |
GLOBAL_HEALTH_VAS_0 | 5.03 | 2618 | 2.92 | 5.00 | 0.02 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ARM_CIRC_0 | 25.03 | 2657 | 3.96 | 25.00 | 0.00 | 4 | 9 | 4 | 0 | 0 | 0 | 0 | 0 | 0.00 |
CRP_0 | 2.89 | 2699 | 1.81 | 2.59 | 0.16 | 66 | 27 | 12 | 0 | 0 | 0 | 0 | 0 | 0.00 |
BSG_0 | 14.86 | 2686 | 12.13 | 11.00 | 0.33 | 93 | 42 | 93 | 1 | 1 | 0 | 1 | 1 | 0.04 |
DEV_NO_0 | 2.76 | 2692 | 1.35 | 3.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 |
N_CHILD_0 | 2.50 | 2336 | 1.53 | 2.00 | 0.33 | 32 | 8 | 173 | 0 | 0 | 0 | 0 | 0 | 0.00 |
N_INJURIES_0 | 4.59 | 2199 | 2.42 | 4.00 | 0.20 | 38 | 20 | 30 | 0 | 0 | 0 | 0 | 0 | 0.00 |
N_BIRTH_0 | 3.46 | 1099 | 1.77 | 3.00 | 0.20 | 27 | 5 | 30 | 1 | 1 | 0 | 1 | 1 | 0.09 |
N_ATC_CODES_0 | 2.26 | 2058 | 2.73 | 1.00 | 0.50 | 121 | 39 | 0 | 2 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_1_0 | 3.04 | 2248 | 1.76 | 3.00 | 0.00 | 34 | 12 | 34 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_2_0 | 2.99 | 2197 | 1.70 | 3.00 | 0.00 | 24 | 5 | 24 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_3_0 | 3.01 | 2184 | 1.72 | 3.00 | 0.00 | 26 | 7 | 26 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_4_0 | 3.00 | 2143 | 1.72 | 3.00 | 0.00 | 32 | 8 | 32 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_5_0 | 6.02 | 2074 | 2.37 | 6.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_6_0 | 5.95 | 2048 | 2.37 | 6.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_7_0 | 6.04 | 2068 | 2.40 | 6.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 |
ITEM_8_0 | 5.89 | 2013 | 2.40 | 6.00 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 |
The respective plot list is obtained by univ_outlier_1$SummaryPlotList:
Only selected output is shown to reduce the size of this file.
In this example, the plot size, or more accurately the number of plotted observations, is reduced by setting max_non_outliers_plot = 500. The function then samples n = 500 observations from those that are not outliers. This can reduce plotting times and the size of plots in rendered documents.
univ_outlier_2 <- acc_robust_univariate_outlier(
resp_vars = NULL,
label_col = "LABEL",
study_data = sd1,
meta_data = md1,
max_non_outliers_plot = 500
)
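The idea behind max_non_outliers_plot can be illustrated with a short base R sketch: all flagged observations are kept, while the remaining ones are randomly thinned. The helper name thin_for_plot and the simulated data are hypothetical; how dataquieR performs the sampling internally is not shown here.

```r
# Sketch: keep all outliers, sample at most `max_non_outliers` other values.
thin_for_plot <- function(x, is_outlier, max_non_outliers = 500) {
  stopifnot(length(x) == length(is_outlier))
  non_out <- which(!is_outlier)
  if (length(non_out) > max_non_outliers) {
    # random subset of the non-outlying observations
    non_out <- sample(non_out, size = max_non_outliers)
  }
  keep <- sort(c(which(is_outlier), non_out))
  x[keep]
}

set.seed(42)
x <- c(rnorm(2000), 15, -12)   # simulated data with two extreme values
flag <- abs(x) > 10            # crude flag, for illustration only
thinned <- thin_for_plot(x, flag, max_non_outliers = 500)
length(thinned)
```

Because the outliers are never dropped, the thinned data still show every flagged point, while the bulk of the distribution is represented by a random subset.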
Statistical outliers do not necessarily represent implausible measurements. It is up to the user how outliers are handled.
This implementation uses several ways to identify outliers but is not comprehensive, i.e., further methods of this kind exist.
This function still has some deficits. For example, the formal argument n_rules currently considers only the number of violated rules. In a future release, this functionality will be replaced by the option to select specific outlier rules. Furthermore, this implementation can be applied to discrete data elements. In some cases this will not make sense, i.e., whether the application is meaningful is left to the user’s discretion.
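The counting logic behind n_rules can be sketched as follows. The flag matrix is made up for illustration, and the assumption that an observation counts as an outlier when at least n_rules of the four rules are violated is ours; consult the user's manual for the exact behavior.

```r
# Sketch: combine per-rule flags by counting violated rules (made-up flags).
flags <- cbind(
  tukey     = c(TRUE, TRUE,  FALSE, TRUE),
  three_sd  = c(TRUE, FALSE, FALSE, TRUE),
  hubert    = c(TRUE, TRUE,  FALSE, FALSE),
  sigma_gap = c(TRUE, FALSE, FALSE, FALSE)
)

n_violated <- rowSums(flags)   # number of rules violated per observation

# Assumption: "at least n_rules" violated rules define an outlier
strict  <- n_violated >= 4     # all four rules must agree
lenient <- n_violated >= 2     # any two rules suffice
print(data.frame(n_violated, strict, lenient))
```

Counting alone cannot distinguish which rules were violated: the second and fourth observation both violate two rules, but different ones. Selecting specific rules, as announced for a future release, would make this distinction possible.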