A standard tool to detect multivariate outliers is the Mahalanobis distance (Mahalanobis 1936, Filzmoser 2004). This approach is very helpful for the interpretation of the plausibility of a measurement given the value of another.
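For reference, the textbook definition of the Mahalanobis distance of an observation \(x\) from a distribution with mean vector \(\mu\) and covariance matrix \(\Sigma\) is given below. The symbols \(x\), \(\mu\), and \(\Sigma\) are introduced here for illustration and do not appear in the original text.

\[
D_M(x) = \sqrt{(x - \mu)^{\top} \, \Sigma^{-1} \, (x - \mu)}
\]

For a single variable with standard deviation \(\sigma\), this reduces to \(|x - \mu| / \sigma\), the absolute standardized deviation from the mean.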
In the acc_multivariate_outlier function, the Mahalanobis distance is itself used as a univariate measure. We apply the same rules for the identification of outliers as for univariate outliers: Tukey, 3SD, Hubert (via the robustbase package), and sigma-gap.
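As a rough illustration of treating the Mahalanobis distance as a univariate measure, the following sketch computes the distances with base R and applies two univariate rules to them. It is not the package's implementation: the data are simulated, the variable names are hypothetical, and the Tukey (1.5 × IQR) and 3SD thresholds are the textbook definitions, which may differ in detail from what acc_multivariate_outlier does internally.

```r
# Minimal sketch: Mahalanobis distance treated as a univariate measure.
# Simulated data; the column names are hypothetical.
set.seed(1)
x <- cbind(
  sbp = rnorm(200, mean = 125, sd = 15),
  dbp = rnorm(200, mean = 80, sd = 10),
  age = rnorm(200, mean = 50, sd = 12)
)

# stats::mahalanobis() returns squared distances, so take the square root.
md <- sqrt(mahalanobis(x, center = colMeans(x), cov = cov(x)))

# Tukey rule (textbook form): more than 1.5 * IQR outside the quartiles.
q <- quantile(md, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
tukey_flag <- md < q[[1]] - 1.5 * iqr | md > q[[2]] + 1.5 * iqr

# 3SD rule (textbook form): more than three standard deviations from the mean.
sd3_flag <- abs(md - mean(md)) > 3 * sd(md)

# Number of the two rules that each observation violates (0, 1, or 2).
n_violated <- tukey_flag + sd3_flag
table(n_violated)
```

Because the distances are non-negative and typically right-skewed, rules designed for skewed distributions can behave quite differently from the symmetric 3SD rule on the same values.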
In this way, the acc_multivariate_outlier function is an implementation of the Multivariate outliers indicator, which belongs to the Unexpected distributions domain in the Accuracy dimension.
For more details, see the user’s manual, source code, and vignette for univariate outliers.
acc_multivariate_outlier(
  variable_group = NULL,
  id_vars = NULL,
  label_col = NULL,
  n_rules = 4,
  max_non_outliers_plot = NULL,
  criteria = NULL,
  study_data = sd1,
  meta_data = md1
)
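The function has the following arguments, as shown in the usage above: variable_group, id_vars, label_col, n_rules, max_non_outliers_plot, criteria, study_data, and meta_data. If no ID variables are specified, row numbers are used. The available outlier criteria are tukey, 3SD, hubert, and sigmagap.

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.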
For the acc_multivariate_outlier function, the columns DATA_TYPE and MISSING_LIST in the metadata are relevant:
Row | VAR_NAMES | LABEL | MISSING_LIST | DATA_TYPE |
---|---|---|---|---|
3 | v00002 | SEX_0 | NA | integer |
4 | v00003 | AGE_0 | NA | integer |
6 | v01003 | AGE_1 | NA | integer |
7 | v01002 | SEX_1 | NA | integer |
15 | v00109 | ARM_CIRC_DISC_0 | 99980 \| 99981 \| 99982 \| 99983 \| 99984 \| 99985 \| 99986 \| 99987 \| 99988 \| 99989 \| 99990 \| 99991 \| 99992 \| 99993 \| 99994 \| 99995 | integer |
16 | v00010 | ARM_CUFF_0 | 99980 \| 99987 | integer |
19 | v00013 | EXAM_DT_0 | NA | datetime |
24 | v00017 | LAB_DT_0 | NA | datetime |
26 | v00018 | EDUCATION_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
27 | v01018 | EDUCATION_1 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
31 | v00022 | EATING_PREFS_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
32 | v00023 | MEAT_CONS_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
33 | v00024 | SMOKING_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
34 | v00025 | SMOKE_SHOP_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
38 | v00029 | PREGNANT_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | integer |
This example runs the multivariate outlier analysis for a group of three variables:
mult_outlier <- acc_multivariate_outlier(
  variable_group = c("SBP_0", "DBP_0", "AGE_0"),
  label_col = "LABEL",
  study_data = sd1,
  meta_data = md1
)
The summary table contains a single row for the set of variables tested for multivariate outliers. Depending on the number of rules that must be violated (the n_rules argument), the last column, GRADING, takes a value \(\in \{0, 1\}\). In this example, only one observation appears to be a multivariate outlier according to all four rules. The summary table is shown using mult_outlier$SummaryTable:
Variables | Tukey (N) | 3SD (N) | Hubert (N) | Sigma-gap (N) | NUM_acc_ud_outlm | PCT_acc_ud_outlm | GRADING |
---|---|---|---|---|---|---|---|
SBP_0 \| DBP_0 \| AGE_0 | 78 | 32 | 6 | 1 | 1 | 0.04 | 1 |
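To make the relationship between the per-rule counts, n_rules, and GRADING more concrete, here is a small sketch with made-up flags for five observations. It rests on assumptions that the text does not confirm: that an observation counts as a multivariate outlier when it violates at least n_rules rules, that GRADING is 1 as soon as one such observation exists, and that the PCT column is a percentage. The column names in the sketch are hypothetical.

```r
# Hypothetical 0/1 flags for five observations under the four rules.
flags <- data.frame(
  tukey    = c(1, 1, 0, 1, 0),
  sd3      = c(1, 0, 0, 1, 0),
  hubert   = c(1, 0, 0, 0, 0),
  sigmagap = c(1, 0, 0, 0, 0)
)
n_rules <- 4

# Outliers per rule, analogous to the "(N)" columns of the summary table.
per_rule <- colSums(flags)

# Assumption: an observation counts if it violates at least n_rules rules.
n_violated <- rowSums(flags)
num_outliers <- sum(n_violated >= n_rules)
pct_outliers <- round(100 * num_outliers / nrow(flags), 2)

# Assumption: GRADING is 1 if at least one such observation exists.
grading <- as.integer(num_outliers > 0)
```

With a lower n_rules, more observations would count as outliers under this reading, which is why the per-rule counts in the table can be much larger than the final number.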
In addition to the SummaryTable, an object called FlaggedStudyData is returned. This object can be used to identify the observations that are flagged as multivariate outliers.
The summary plot uses five different colors to indicate how plausible it is that an observation is a multivariate outlier. Observations shown in dark red violate all four outlier rules.
mult_outlier$SummaryPlot
The FlaggedStudyData contains the original data frame with the additional columns tukey, 3SD, Hubert, and SigmaGap. Every observation is coded 0 if no outlier was detected in the respective column and 1 if an outlier was detected. This can be used to exclude observations with outliers.
The respective data can be accessed using:
mult_outlier$FlaggedStudyData
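One possible way to use the 0/1 flag columns for exclusion is sketched below. The data frame is a small hypothetical stand-in for mult_outlier$FlaggedStudyData. The flag column names follow the text above and may differ from those in the actual object, and the threshold of four violated rules is only an example.

```r
# Hypothetical stand-in for mult_outlier$FlaggedStudyData.
flagged <- data.frame(
  id       = 1:5,
  SBP_0    = c(120, 135, 250, 118, 142),
  tukey    = c(0, 1, 1, 0, 0),
  `3SD`    = c(0, 0, 1, 0, 0),
  Hubert   = c(0, 0, 1, 0, 0),
  SigmaGap = c(0, 0, 1, 0, 0),
  check.names = FALSE  # keep the non-syntactic column name "3SD"
)

flag_cols <- c("tukey", "3SD", "Hubert", "SigmaGap")
n_rules <- 4  # example threshold: all four rules must be violated

# Count violated rules per observation and drop those at or above the threshold.
is_outlier <- rowSums(flagged[, flag_cols]) >= n_rules
cleaned <- flagged[!is_outlier, ]
```

Keeping the flagged rows in a separate object, rather than deleting them, makes it easier to review them before deciding how to handle them.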
An outlier according to statistical criteria does not necessarily imply an implausible measurement. It is up to the user to decide how outliers are handled. For a more detailed discussion of the methods, see Morgenthaler (2007).
This implementation has several limitations because it uses a heuristic approach to classify multivariate outliers. It is based on the Mahalanobis distance (Mahalanobis 1936), which provides a univariate, standardized measure of distance from the multivariate center of the data. However, we found no recommendations on how to use these values to classify multivariate outliers. Applying the rules for univariate outliers to the Mahalanobis distance has shown reasonable results. Nevertheless, this approach is not supported by an underlying theory.