Data quality reports may need to be adapted to specific demands of the context in which they are used. Therefore, the output of dq_report2 can be customized in multiple ways:

  1. specifying the report title and subtitle;
  2. selecting the data quality dimensions;
  3. selecting data quality indicator functions within data quality dimensions;
  4. modifying the use of single data quality indicator functions;
  5. selecting the variables.


1. Specifying the report title and subtitle

When a report is created, it receives a default title (“Data quality report”) and a default subtitle (the date of creation). To give readers more context, you can set both yourself with the arguments title and subtitle when calling the function dq_report2.

dq_report2(study_data = ..., 
           meta_data_v2 = ..., 
           title = "my title", 
           subtitle = "my subtitle")

Hereafter we show a practical example using our SHIP-based Example data and metadata.

Example: modify the argument title and subtitle

You can define the report title and subtitle as shown hereafter:

rep_title <- dq_report2(study_data = "ship",
                        meta_data_v2 = "ship_meta_v2",
                        title = "SHIP-based example data quality report",
                        subtitle = "The SHIP-based example data are suitable to illustrate the functioning of dq_report2")

You can then display the output in the Viewer panel by typing:

rep_title


2. Selecting the data quality dimensions

Data quality reports may address distinct data quality dimensions. The reference data quality concept distinguishes four dimensions: Integrity, Completeness, Consistency, and Accuracy. By default, dq_report2 computes descriptive statistics and the first three of these dimensions. Accuracy is omitted because it is the most computationally demanding dimension. To adapt the report’s scope to individual needs, users can select which dimensions to assess using the dimensions argument when calling dq_report2. However, Integrity and descriptive statistics are always included, even if not explicitly requested.

Note: the scope of data quality checks depends on the provided metadata. For example, Consistency checks can be requested, but if the necessary metadata (e.g., the limits needed to detect range violations) are not provided, this section of the report will be empty.

Hereafter, we show a practical example using our SHIP-based Example data and metadata.

Example: modify the argument dimensions

To obtain the default results, simply call dq_report2 with the study data and metadata; it will then generate descriptive statistics and all possible results for the three dimensions: Integrity, Completeness, and Consistency.

dq_report2(study_data = "ship", 
           meta_data_v2 = "ship_meta_v2")

To create a report containing all possible results from all four dimensions, specify dimensions = NULL:

dq_report2(study_data = "ship", 
           meta_data_v2 = "ship_meta_v2", 
           dimensions = NULL)

To create a report containing only descriptive statistics and the Integrity dimension, set dimensions = "Integrity":

dq_report2(study_data = "ship", 
           meta_data_v2 = "ship_meta_v2", 
           dimensions = "Integrity")

To compute descriptive statistics and the dimensions Integrity, Completeness, and Accuracy, set dimensions = c("Completeness", "Accuracy"); Integrity is included automatically:

dq_report2(study_data = "ship", 
           meta_data_v2 = "ship_meta_v2", 
           dimensions = c("Completeness", "Accuracy"))

Note: dimensions can also be abbreviated using “int”, “com”, “con”, and “acc”.

dq_report2(study_data = "ship", 
           meta_data_v2 = "ship_meta_v2", 
           dimensions = c("com", "acc"))
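
The rules above (NULL requests everything, Integrity is always included, abbreviations are accepted) can be summarized in a small base R sketch. The helper reported_dimensions is hypothetical and not part of the package; it only models the behavior described in this section, and its default value is an assumption chosen to reproduce the default report.

```r
# Hypothetical helper (not part of the package): models which dimensions
# end up in the report for a given value of the `dimensions` argument.
all_dims <- c("Integrity", "Completeness", "Consistency", "Accuracy")

reported_dimensions <- function(dimensions = c("Completeness", "Consistency")) {
  # NULL requests every dimension
  if (is.null(dimensions)) {
    return(all_dims)
  }
  # Resolve full names and abbreviations such as "com" or "acc"
  # by case-insensitive prefix matching
  idx <- pmatch(tolower(dimensions), tolower(all_dims))
  if (anyNA(idx)) {
    stop("Unknown dimension: ", paste(dimensions[is.na(idx)], collapse = ", "))
  }
  # Integrity (position 1) is always reported, even if not requested
  all_dims[sort(unique(c(1L, idx)))]
}

print(reported_dimensions())                 # default report
print(reported_dimensions(NULL))             # all four dimensions
print(reported_dimensions("Integrity"))      # Integrity only
print(reported_dimensions(c("com", "acc")))  # abbreviations
```

Descriptive statistics are left out of the sketch because, as stated above, they are always part of the report.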


3. Selecting data quality indicator functions within data quality dimensions

For each dimension, dq_report2 applies a default set of data quality indicator functions (see here for a complete list). However, only some of these functions might be relevant for a specific application. Therefore, you can select specific functions using the filter_indicator_functions argument when calling dq_report2:

dq_report2(study_data = ..., 
           meta_data_v2 = ..., 
           filter_indicator_functions = "...")

Hereafter we show a practical example using the SHIP-based Example data and metadata.

Example I: assessing only range violations

The function con_limit_deviations is used to check range violations. To create a report containing only the results of this function, write the following code:

rep1 <- dq_report2(study_data = "ship", 
                   meta_data_v2 = "ship_meta_v2",
                   filter_indicator_functions = "con_limit_deviations.*")
rep1

Note: con_limit_deviations.* is a regular expression. The trailing .* (dot star) matches any sequence of characters and thereby allows dq_report2 to assess range violations for all variables.


Example II: compute and assess descriptive statistics, range violations, and univariate outliers

In this example, we want to create a report containing the descriptive statistics (functions des_summary_categorical and des_summary_continuous), range violations (con_limit_deviations), and univariate outliers (acc_univariate_outlier) for all variables. To do so, all relevant functions need to be listed in filter_indicator_functions:

rep2 <- dq_report2(study_data = "ship", 
                   meta_data_v2 = "ship_meta_v2",
                   dimensions = NULL,
                   filter_indicator_functions = c("des_summary.*", 
                                                  "con_limit_deviations.*", 
                                                  "acc_univariate_outlier.*"))
rep2

Note: any regular expression that identifies the relevant functions will work. However, it must always end with .* so that all the output of the specified function is selected.
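
To preview which functions a set of patterns would select, the patterns can be tried out with base R before building a report. The following is a minimal sketch, not the package's own selection logic: the helper select_functions is hypothetical, the candidate list contains only the function names mentioned in this tutorial, and requiring each pattern to match the whole name is an assumption made for illustration.

```r
# Function names mentioned in this tutorial (not a complete list).
fun_names <- c("des_summary_categorical", "des_summary_continuous",
               "con_limit_deviations", "com_item_missingness",
               "acc_univariate_outlier", "acc_multivariate_outlier")

# Hypothetical helper: keep the names that are fully matched by at least
# one pattern. Anchoring with ^ and $ is an assumption of this sketch.
select_functions <- function(patterns, candidates) {
  combined <- paste0("^(", paste(patterns, collapse = "|"), ")$")
  candidates[grepl(combined, candidates)]
}

# Patterns from Example II
selected <- select_functions(c("des_summary.*",
                               "con_limit_deviations.*",
                               "acc_univariate_outlier.*"),
                             fun_names)
print(selected)

# Under the whole-name assumption, dropping the trailing .* selects nothing
print(select_functions("des_summary", fun_names))
```

In this sketch, the three patterns select both descriptive summary functions, con_limit_deviations, and acc_univariate_outlier, while acc_multivariate_outlier and com_item_missingness are left out.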


Example III: assess univariate and multivariate outliers

Now we want to assess univariate and multivariate outliers. Two functions exist for this purpose: acc_univariate_outlier and acc_multivariate_outlier.

You may either enter both functions in full and separately:

rep3a <- dq_report2(study_data = "ship", 
                    meta_data_v2 = "ship_meta_v2",
                    dimensions = NULL,
                    filter_indicator_functions = c("acc_univariate_outlier.*",
                                                   "acc_multivariate_outlier.*"))
rep3a

or use an appropriate regular expression to call both at once:

rep3b <- dq_report2(study_data = "ship", 
                    meta_data_v2 = "ship_meta_v2",
                    dimensions = NULL,
                    filter_indicator_functions = c(".*outlier.*"))
rep3b
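
Whether a broader pattern selects the same functions as the explicit list can be checked with base R. This is a sketch under the assumption that only the function names mentioned in this tutorial are candidates; with the full list of indicator functions, .*outlier.* might match additional functions whose names contain "outlier".

```r
# Function names mentioned in this tutorial (not a complete list).
fun_names <- c("des_summary_categorical", "des_summary_continuous",
               "con_limit_deviations", "com_item_missingness",
               "acc_univariate_outlier", "acc_multivariate_outlier")

separate <- c("acc_univariate_outlier.*", "acc_multivariate_outlier.*")
combined <- ".*outlier.*"

# A name is selected if it matches at least one of the patterns
sel_separate <- fun_names[grepl(paste(separate, collapse = "|"), fun_names)]
sel_combined <- fun_names[grepl(combined, fun_names)]

print(sel_separate)
print(sel_combined)
print(identical(sel_separate, sel_combined))
```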


4. Modifying the use of single data quality indicator functions

In addition to selecting data quality indicator functions, you may want to modify how they work. To do so, pass values for any of a function’s arguments through the argument specific_args when calling dq_report2. A list of valid arguments for each function can be found here. Values given in specific_args override the default settings.

This approach is illustrated in the code below:

dq_report2(study_data = ..., 
           meta_data_v2 = ..., 
           specific_args = list("function name" = list(argument1 = ..., 
                                                       argument2 = ..., 
                                                       ...)))

Hereafter we show a practical example using our Synthetic Example data and metadata.

Example: handling of system missing values (NAs) in the report

The com_item_missingness function describes the missingness of individual variables in the data. Depending on your preferences, the argument include_sysmiss of this function can be used to control the inclusion of system missing values (NAs) in the resulting plot. The default setting is TRUE, including system missing values; setting it to FALSE will hide system missing values.

rep_miss <- dq_report2(study_data = "study_data", 
                       meta_data_v2 = "meta_data_v2",
                       specific_args = list("com_item_missingness" = 
                                              list(include_sysmiss = FALSE)))
rep_miss

Below, the first plot represents the result for a default report, showing system missing values in the first category on the x-axis (ADDED:SysMiss). The second plot shows the output without ADDED:SysMiss.
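
The value of specific_args is a nested list: the outer names are function names, the inner names are argument names. The sketch below builds the list from the example and inspects it with base R. The defaults list and the use of modifyList are only an illustrative model of "overriding a default", not the package's internal mechanism.

```r
# Outer level: function name; inner level: argument name and value
specific_args <- list(
  com_item_missingness = list(include_sysmiss = FALSE)
)

# Show the nesting
str(specific_args)

# Illustrative model of the override (an assumption, not package internals):
# the default stated above is include_sysmiss = TRUE
defaults <- list(
  com_item_missingness = list(include_sysmiss = TRUE)
)
effective <- modifyList(defaults, specific_args)

print(effective[["com_item_missingness"]][["include_sysmiss"]])
```

Settings for further functions would be added as additional named entries of the outer list.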


5. Selecting the variables

A data quality report can be created for selected variables from the data. To do so, list the desired variables in the argument resp_vars when calling dq_report2. This approach is illustrated in the code below:

dq_report2(resp_vars = ..., 
           study_data = ...,
           meta_data_v2 = ...)

Hereafter we show a practical example using the SHIP-based Example data and metadata.

Example: selecting a few variables

In the following example, we use resp_vars to restrict the report to data quality checks for three variables: “sbp1”, “dbp1”, and “height”.

rep_vars <- dq_report2(resp_vars = c("sbp1", "dbp1", "height"), 
                       study_data = "ship", 
                       meta_data_v2 = "ship_meta_v2")
rep_vars
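
The customizations described in this tutorial are all arguments of the same function, so they can presumably be combined in a single call. The following sketch collects argument values taken from the examples above and passes them to dq_report2 from the dataquieR package. The subtitle text, the wrapper function build_custom_report, and this particular combination of arguments are assumptions of the sketch and have not been verified against the package. The call is wrapped in a function so that nothing is computed until you invoke it.

```r
# Argument values taken from the examples above; combining them in one
# call is an assumption of this sketch.
custom_args <- list(
  resp_vars = c("sbp1", "dbp1", "height"),
  study_data = "ship",
  meta_data_v2 = "ship_meta_v2",
  title = "SHIP-based example data quality report",
  subtitle = "Range violations for selected variables",  # illustrative text
  dimensions = c("Completeness", "Consistency"),
  filter_indicator_functions = "con_limit_deviations.*"
)

# Hypothetical wrapper: builds the report only when called
build_custom_report <- function(args = custom_args) {
  if (!requireNamespace("dataquieR", quietly = TRUE)) {
    stop("Package 'dataquieR' is required for this sketch.")
  }
  do.call(dataquieR::dq_report2, args)
}
```

Calling rep_custom <- build_custom_report() and then typing rep_custom would display the result in the same way as the other examples.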



