The function acc_distributions implements a range of indicators and descriptors belonging to the Unexpected distributions domain in the Accuracy dimension. It performs location and proportion checks, as defined in the metadata, providing data quality indicators for Unexpected location and Unexpected proportion.
Moreover, this implementation generates histograms (for float data types) and bar plots (for integer data types), which are a common way to visualize the data distribution and possible data quality issues. In this way, acc_distributions also serves as a descriptor for Univariate outliers, Unexpected shape, and Unexpected scale. Note, however, that for outliers there are dedicated functions that not only provide descriptors but also identify outliers.
acc_distributions(
resp_vars = NULL,
group_vars = NULL,
label_col = "LABEL",
study_data = sd1,
meta_data = md1
)
The function has the following arguments: resp_vars, group_vars, label_col, study_data, and meta_data. Set group_vars to NULL for output without grouping.
To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.
To calculate the Unexpected location and Unexpected proportion indicators, the columns LOCATION_METRIC, LOCATION_RANGE, and PROPORTION_RANGE must be specified in the metadata.
If the metadata does not contain these columns, the output will only provide distribution plots for the variables with float or integer data types.
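As an illustration of what these three columns hold, the following minimal sketch builds a toy metadata excerpt. The object md_toy and the helper variables are hypothetical and not part of dataquieR; the cell values are copied from the expected ranges reported in the SummaryData output of the first example, and it is an assumption that md1 stores them in exactly this string form.

```r
# Toy metadata excerpt (hypothetical object `md_toy`, not the bundled md1).
# Values mirror the ranges shown in the SummaryData output; the exact
# string syntax used in md1 is an assumption.
md_toy <- data.frame(
  LABEL            = c("SBP_0", "SEX_0", "ITEM_4_0"),
  LOCATION_METRIC  = c("mean", NA, NA),
  LOCATION_RANGE   = c("(100;140)", NA, NA),
  PROPORTION_RANGE = c(NA, "0 in [48;52]",
                       "4 in (2;10] | 5 in (5;15] | 6 in (2;10]"),
  stringsAsFactors = FALSE
)

# Columns needed for the two indicators
required <- c("LOCATION_METRIC", "LOCATION_RANGE", "PROPORTION_RANGE")
missing_cols <- setdiff(required, names(md_toy))

# Which variables have enough metadata for each check?
has_location   <- !is.na(md_toy$LOCATION_METRIC) & !is.na(md_toy$LOCATION_RANGE)
has_proportion <- !is.na(md_toy$PROPORTION_RANGE)

md_toy$LABEL[has_location]    # variables eligible for the location check
md_toy$LABEL[has_proportion]  # variables eligible for the proportion check
```

In this toy excerpt only SBP_0 carries a location metric and range, while SEX_0 and ITEM_4_0 carry proportion ranges.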
This is the simplest example, specifying only the response variables (SBP_0 for systolic blood pressure measurement, SEX_0, and ITEM_4_0 of a questionnaire), the study data, and the associated metadata:
dist_1 <- acc_distributions(
resp_vars = c("SBP_0", "SEX_0", "ITEM_4_0"),
label_col = "LABEL",
study_data = sd1,
meta_data = md1
)
Output 1: SummaryTable
acc_distributions returns three objects. The first two data frames (SummaryTable and SummaryData) contain the data quality checks for Unexpected location (FLG_acc_ud_loc and VAL_acc_ud_loc) and Unexpected proportion for the response variables. SummaryTable provides a concise summary of the results, which is used by dq_report2 to populate the accuracy section of the data quality report. Hence, the output is minimal and the column names are abbreviations. The VAL columns give the calculated value(s) for unexpected location or proportion, respectively. When an unexpected location or proportion is found, the FLG columns provide a flag for the corresponding variable. Call it with dist_1$SummaryTable:
Variables | values_from_data | GRADING | FLG_acc_ud_loc | loc_func | FLG_acc_ud_prop | prop_range |
---|---|---|---|---|---|---|
SBP_0 | 126.516204607575 | 0 | FALSE | mean | NA | NA |
SEX_0 | 0 = 50.3 \| 1 = 49.7 | 0 | NA | NA | FALSE | 0 in [48;52] |
ITEM_4_0 | 0 = 4.9 \| 1 = 14.2 \| 2 = 22.9 \| 3 = 23.3 \| 4 = 16.6 \| 5 = 10.3 \| 6 = 4.2 \| 7 = 2.1 \| 8 = 1.1 \| 9 = 0.3 \| 10 = 0.1 | 1 | NA | NA | TRUE | 4 in (2;10] \| 5 in (5;15] \| 6 in (2;10] |
Output 2: SummaryData
The next output, SummaryData, presents the data quality checks using explicit labels. For each response variable analysed, it includes the expected range and the measure of location (both specified in the metadata) as a reference. The columns Value and Proportions show the calculated result; based on this, a binary flag is raised if the values fall outside the expected range. Use dist_1$SummaryData to print the result:
Variables | Range of expected values | Flag | Measure of location | Value | Proportions |
---|---|---|---|---|---|
SBP_0 | (100;140) | FALSE | mean | 126.5162 | NA |
SEX_0 | 0 in [48;52] | FALSE | NA | NA | 0 = 50.3 \| 1 = 49.7 |
ITEM_4_0 | 4 in (2;10] \| 5 in (5;15] \| 6 in (2;10] | TRUE | NA | NA | 0 = 4.9 \| 1 = 14.2 \| 2 = 22.9 \| 3 = 23.3 \| 4 = 16.6 \| 5 = 10.3 \| 6 = 4.2 \| 7 = 2.1 \| 8 = 1.1 \| 9 = 0.3 \| 10 = 0.1 |
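The flags in this table can be retraced by hand. The following sketch is not the dataquieR implementation: in_range is a hypothetical helper, and reading an entry such as "4 in (2;10]" as "the percentage of category 4 should lie in the half-open interval (2;10]" is an assumption. The numbers are taken from the table above.

```r
# Hypothetical helper: is `value` inside an interval with the given bounds?
# `lower_closed` / `upper_closed` encode "[" vs "(" and "]" vs ")".
in_range <- function(value, lower, upper, lower_closed, upper_closed) {
  above <- if (lower_closed) value >= lower else value > lower
  below <- if (upper_closed) value <= upper else value < upper
  above & below
}

# Unexpected location: mean of SBP_0 against the expected range (100;140)
sbp_mean <- 126.5162
flag_location <- !in_range(sbp_mean, 100, 140, FALSE, FALSE)

# Unexpected proportion: percentages of ITEM_4_0 categories 4, 5 and 6
props <- c("4" = 16.6, "5" = 10.3, "6" = 4.2)
lower <- c("4" = 2,    "5" = 5,    "6" = 2)
upper <- c("4" = 10,   "5" = 15,   "6" = 10)
inside <- in_range(props, lower, upper, FALSE, TRUE)  # all ranges are (a;b]

# A single category outside its range is enough to raise the flag
flag_proportion <- any(!inside)
```

Under these assumptions the mean of SBP_0 lies inside (100;140), so no location flag is raised, whereas category 4 of ITEM_4_0 (16.6) exceeds the upper bound of (2;10], which agrees with the TRUE flag reported for that variable.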
Output 3: SummaryPlotList
The last output contains a list of ggplots, one for each variable in resp_vars. The plot shows the LOCATION_RANGE or PROPORTION_RANGE as well as the LOCATION_METRIC. Observations are highlighted if they fall outside of the expected range.
dist_1$SummaryPlotList
## $SBP_0
##
## $SEX_0
##
## $ITEM_4_0
This example considers SBP_0 (systolic blood pressure measurement) with the grouping variable USR_BP_0 (examiner for the blood pressure measurement):
dist_2 <- acc_distributions(
resp_vars = "SBP_0",
group_vars = "USR_BP_0",
label_col = "LABEL",
study_data = sd1,
meta_data = md1
)
When the user specifies group_vars, the output dist_2$SummaryPlotList includes a list of distribution plots with their respective empirical Cumulative Distribution Function (eCDF).
dist_2$SummaryPlotList
## $SBP_0
The higher the number of variables with unexpected location or proportions, the lower the data quality. Deviations from the expected central tendency or unexpected proportions might indicate data issues and should be further investigated.
- [...] NA or only one unique value (excluding NAs).
- Unexpected location: uses LOCATION_METRIC (either mean or median) and LOCATION_RANGE (the range of expected values for the mean or median, respectively).
- Unexpected proportion: uses PROPORTION_RANGE (the range of expected values for the proportions of the categories).
- Plot histograms and bar charts.
- If group_vars is specified by the user, output group-wise empirical cumulative distributions.

Because histogram classes approximate the density of the respective distributions, acc_distributions uses the method of Freedman and Diaconis (Freedman and Diaconis 1981), instead of the default approach from Sturges 1926, to define the number of bins and breaks in histograms. The rule sets the bin width as:
\[ Bin \: width = 2 \cdot \frac{IQR(x)}{\sqrt[3]{n}} \]
where n is the number of observations. The number of bins then follows as the range of x divided by this bin width.
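A minimal base R sketch of the Freedman-Diaconis rule: the expression 2 * IQR(x) / n^(1/3) gives the width of one bin, and the number of bins is the data range divided by that width, rounded up. The sample x is hypothetical toy data, and whether acc_distributions derives its breaks in exactly this way is not confirmed here.

```r
# Toy sample of systolic blood pressure values (hypothetical, not from sd1)
x <- c(98, 105, 110, 112, 118, 120, 121, 124,
       126, 128, 130, 133, 137, 141, 150, 162)

# Freedman-Diaconis bin width: 2 * IQR(x) / n^(1/3)
bin_width <- 2 * IQR(x) / length(x)^(1/3)

# Number of bins: data range divided by the bin width, rounded up
n_bins <- ceiling(diff(range(x)) / bin_width)

# Equally spaced breaks covering the whole range
breaks <- seq(min(x), max(x), length.out = n_bins + 1)
```

Base R ships the same rule as grDevices::nclass.FD(), so nclass.FD(x) can serve as a cross-check for n_bins.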
If group_vars is given, the empirical Cumulative Distribution Function (eCDF) is also presented (Drion et al. 1952).
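To make the group-wise eCDF concrete, here is a small sketch using stats::ecdf() from base R. The readings and the examiner labels USR_A and USR_B are invented for illustration and do not come from sd1; the plotting done by acc_distributions itself is not reproduced.

```r
# Toy data: hypothetical readings from two examiners
sbp <- c(118, 122, 125, 131, 140,
         110, 115, 119, 121, 127)
examiner <- rep(c("USR_A", "USR_B"), each = 5)

# One eCDF per group; stats::ecdf() returns a step function
ecdf_by_group <- lapply(split(sbp, examiner), ecdf)

# Share of readings less than or equal to 120 for each examiner
share_le_120 <- sapply(ecdf_by_group, function(Fn) Fn(120))
```

Curves that are shifted against each other, as in this toy data where examiner USR_B records lower values, are the kind of pattern the group-wise eCDF plot is meant to reveal.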
For more details, see the user’s manual and source code.