Data quality reports may need to be adapted to the specific demands of the context in which they are used. Therefore, the output of dq_report2 can be customized in multiple ways:
When a report is created, a default title (“Data quality report”) and subtitle (the date of creation) are provided. However, users usually need to describe the report’s content to give readers context. Therefore, the title and subtitle can be customized using the arguments title and subtitle when calling the function dq_report2.
dq_report2(study_data = ...,
           meta_data_v2 = ...,
           title = "my title",
           subtitle = "my subtitle")
Hereafter we show a practical example using our SHIP-based Example data and metadata.
title and subtitle
You can define the report title and subtitle as shown hereafter:
rep_title <- dq_report2(study_data = "ship",
                        meta_data_v2 = "ship_meta_v2",
                        title = "SHIP-based example data quality report",
                        subtitle = "The SHIP-based example data are suitable to illustrate the functioning of dq_report2")
You can then display the output in the Viewer panel by typing:
rep_title
Data quality reports may address distinct data quality dimensions. The reference data quality concept distinguishes four dimensions. By default, dq_report2 computes the descriptive statistics and assesses three of these dimensions: Integrity, Completeness, and Consistency. Accuracy is omitted because it is the computationally most demanding dimension. To adapt the report’s scope to individual needs, users can select which dimensions to assess using the dimensions argument when calling dq_report2. However, Integrity and descriptive statistics are always present, even if not explicitly requested.
Note: the scope of data quality checks depends on the provided metadata. For example, Consistency checks can be requested, but if the necessary metadata (e.g., the limits needed to detect range violations) are not provided, this section of the report will be empty.
Hereafter, we show a practical example using our SHIP-based Example data and metadata.
dimensions
To obtain the default results, simply call dq_report2 with the study data and metadata; it will then generate descriptive statistics and all possible results for the three dimensions Integrity, Completeness, and Consistency.
dq_report2(study_data = "ship",
           meta_data_v2 = "ship_meta_v2")
To create a report containing all possible results from all four dimensions, you have to specify dimensions = NULL:
dq_report2(study_data = "ship",
           meta_data_v2 = "ship_meta_v2",
           dimensions = NULL)
To create a report containing only descriptive statistics and the Integrity dimension, you have to set dimensions = "Integrity":
dq_report2(study_data = "ship",
           meta_data_v2 = "ship_meta_v2",
           dimensions = "Integrity")
To compute descriptive statistics and the dimensions Integrity, Completeness, and Accuracy, you have to set dimensions = c("Completeness", "Accuracy"). Integrity does not need to be listed because it is always included:
dq_report2(study_data = "ship",
           meta_data_v2 = "ship_meta_v2",
           dimensions = c("Completeness", "Accuracy"))
Note: dimensions can also be abbreviated using “int”, “com”, “con”, and “acc”.
dq_report2(study_data = "ship",
           meta_data_v2 = "ship_meta_v2",
           dimensions = c("com", "acc"))
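The selection rules described above can be summarized in a few lines of base R. The sketch below is illustrative only: expand_dimensions and all_dimensions are hypothetical names introduced here, and the code does not claim to reproduce how dq_report2 processes the argument internally. It merely encodes the three rules stated in the text: NULL selects all four dimensions, abbreviations are accepted, and Integrity is always included.

```r
# Hypothetical helper, not part of the package: mimics the selection rules
# described in the text for the `dimensions` argument.
all_dimensions <- c("Integrity", "Completeness", "Consistency", "Accuracy")

expand_dimensions <- function(dimensions) {
  # NULL requests every dimension
  if (is.null(dimensions)) {
    return(all_dimensions)
  }
  # Match full names or abbreviations such as "com" by
  # case-insensitive, unambiguous prefix matching
  idx <- pmatch(tolower(dimensions), tolower(all_dimensions),
                duplicates.ok = TRUE)
  if (anyNA(idx)) {
    stop("Unknown or ambiguous dimension: ",
         paste(dimensions[is.na(idx)], collapse = ", "))
  }
  # Integrity (position 1) is always reported; keep the canonical order
  all_dimensions[sort(unique(c(1L, idx)))]
}

print(expand_dimensions(c("com", "acc")))
print(expand_dimensions("Integrity"))
print(expand_dimensions(NULL))
```

Under these assumptions, requesting c("com", "acc") yields Integrity, Completeness, and Accuracy, which corresponds to the report scope described above.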
For each dimension, dq_report2 applies a default set of data quality indicator functions (see here for a complete list). However, only some of these functions might be relevant for a specific application. Therefore, you can select specific functions using the filter_indicator_functions argument when calling dq_report2:
dq_report2(study_data = ...,
           meta_data_v2 = ...,
           filter_indicator_functions = "...")
Hereafter we show a practical example using the SHIP-based Example data and metadata.
The function con_limit_deviations is used to check range violations. To create a report containing only the results of this function, write the following code:
rep1 <- dq_report2(study_data = "ship",
                   meta_data_v2 = "ship_meta_v2",
                   filter_indicator_functions = "con_limit_deviations.*")
rep1
Note: con_limit_deviations.* is a regular expression. The .* (dot star) allows dq_report2 to assess range violations for all variables.
In this example, we want to create a report containing the descriptive statistics (functions des_summary_categorical and des_summary_continuous), range violations (con_limit_deviations), and univariate outliers (acc_univariate_outlier) for all variables. To do so, all relevant functions need to be listed in filter_indicator_functions:
rep2 <- dq_report2(study_data = "ship",
                   meta_data_v2 = "ship_meta_v2",
                   dimensions = NULL,
                   filter_indicator_functions = c("des_summary.*",
                                                  "con_limit_deviations.*",
                                                  "acc_univariate_outlier.*"))
rep2
Note: any regular expression that identifies the relevant functions will work. However, it must always end with .* so that all the output for the specified function is selected.
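To see why the trailing .* matters, the following base R sketch applies the patterns to a list of names. It rests on two loudly stated assumptions: the names in call_names are made up for illustration (the names generated internally may look different), and each pattern is assumed to have to match a whole name, which would explain the requirement above. The helpers matches_whole and select_calls are hypothetical and not part of the package.

```r
# Hypothetical names of the form <function>.<variable>, for illustration only
call_names <- c("des_summary_continuous.sbp1",
                "des_summary_categorical.sex",
                "con_limit_deviations.sbp1",
                "con_limit_deviations.dbp1",
                "acc_univariate_outlier.sbp1",
                "com_item_missingness.sbp1")

# Assumption: a pattern must match the whole name, not just a part of it
matches_whole <- function(pattern, x) {
  grepl(paste0("^(", pattern, ")$"), x)
}

# Keep every name that is matched by at least one of the patterns
select_calls <- function(patterns, x) {
  keep <- Reduce(`|`,
                 lapply(patterns, matches_whole, x = x),
                 rep(FALSE, length(x)))
  x[keep]
}

# Without the trailing .* nothing is selected under this assumption
print(select_calls("con_limit_deviations", call_names))

# With the trailing .* the results for all variables are selected
print(select_calls("con_limit_deviations.*", call_names))

# Several patterns can be combined in one character vector
print(select_calls(c("des_summary.*",
                     "con_limit_deviations.*",
                     "acc_univariate_outlier.*"),
                   call_names))
```

Testing a pattern against a list of names in this way is a cheap sanity check before running a full report.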
Now we want to assess univariate and multivariate outliers. Two functions exist for this purpose: acc_univariate_outlier and acc_multivariate_outlier. You may either enter both functions in full and separately:
rep3a <- dq_report2(study_data = "ship",
                    meta_data_v2 = "ship_meta_v2",
                    dimensions = NULL,
                    filter_indicator_functions = c("acc_univariate_outlier.*",
                                                   "acc_multivariate_outlier.*"))
rep3a
or use an appropriate regular expression to call both at once:
rep3b <- dq_report2(study_data = "ship",
                    meta_data_v2 = "ship_meta_v2",
                    dimensions = NULL,
                    filter_indicator_functions = c(".*outlier.*"))
rep3b
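Whether a shorter regular expression selects the same functions as an explicit list can be checked in base R before creating the report. The sketch below compares both variants on the four function names mentioned in this tutorial; the list is deliberately incomplete, so a broad pattern such as .*outlier.* might match further functions that are not shown here.

```r
# Function names taken from this tutorial; the list is not complete
function_names <- c("acc_univariate_outlier",
                    "acc_multivariate_outlier",
                    "con_limit_deviations",
                    "com_item_missingness")

explicit <- c("acc_univariate_outlier.*", "acc_multivariate_outlier.*")
wildcard <- ".*outlier.*"

# Combine the explicit patterns with "|" so that either one may match
hits_explicit <- function_names[grepl(paste(explicit, collapse = "|"),
                                      function_names)]
hits_wildcard <- function_names[grepl(wildcard, function_names)]

print(hits_explicit)
print(hits_wildcard)

# TRUE if both variants select the same functions
print(identical(hits_explicit, hits_wildcard))
```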
In addition to selecting data quality indicator functions, you may want to modify how they work. This can be done by passing values for any of a function's arguments through the argument specific_args when calling dq_report2. A list of valid arguments for each function can be found here. Using specific_args overrides any default settings.
This approach is illustrated in the code below:
dq_report2(study_data = ...,
           meta_data_v2 = ...,
           specific_args = list("function name" = list(argument1 = ...,
                                                       argument2 = ...,
                                                       ...)))
Hereafter we show a practical example using our Synthetic Example data and metadata.
The com_item_missingness function describes the missingness of individual variables in the data. Depending on your preferences, the argument include_sysmiss of this function can be used to control the inclusion of system missing values (NAs) in the resulting plot. The default setting is TRUE, including system missing values; setting it to FALSE will hide system missing values.
rep_miss <- dq_report2(study_data = "study_data",
                       meta_data_v2 = "meta_data_v2",
                       specific_args = list("com_item_missingness" =
                                              list(include_sysmiss = FALSE)))
rep_miss
Below, the first plot represents the result for a default report, showing system missing values in the first category on the x-axis (ADDED:SysMiss). The second plot shows the output without ADDED:SysMiss.
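The structure expected by specific_args is a named list with one entry per function, each of which is itself a named list of arguments. The base R sketch below builds such a list and uses utils::modifyList to illustrate the idea that user-supplied values take precedence over defaults. The defaults object is an assumption constructed from the text above (include_sysmiss = TRUE); the code is not the package's own merging logic.

```r
# Assumed default, taken from the description in the text
defaults <- list(com_item_missingness = list(include_sysmiss = TRUE))

# User-supplied settings in the structure expected by specific_args
specific_args <- list(com_item_missingness = list(include_sysmiss = FALSE))

# Inspect the nesting: function name first, then its arguments
str(specific_args)

# modifyList() merges nested lists; values from the second list win
effective <- utils::modifyList(defaults, specific_args)
print(effective$com_item_missingness$include_sysmiss)
```

Building the list as a separate object first makes it easier to inspect and to reuse across several reports.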
A data quality report can be created for selected variables from the data. This can be done by listing the desired variables in the argument resp_vars when calling dq_report2. This approach is illustrated in the code below:
dq_report2(resp_vars = ...,
           study_data = ...,
           meta_data_v2 = ...)
Hereafter we show a practical example using the SHIP-based Example data and metadata.
Depending on your needs, you may want to restrict the report to a specific set of variables from the data. In the following example, we use the argument resp_vars to create a report containing data quality checks for three variables: “sbp1”, “dbp1”, and “height”.
rep_vars <- dq_report2(resp_vars = c("sbp1", "dbp1", "height"),
                       study_data = "ship",
                       meta_data_v2 = "ship_meta_v2")
rep_vars