How to customize grading rulesets

Grading ruleset description

A grading ruleset is a table containing the information needed to classify results. The results are represented by a set of indicator metrics, i.e., numeric values that correlate with the size of a data quality problem related to one specific data quality indicator (the larger the number, the larger the issue, or vice versa). Examples of indicator metrics are the percentage of missing values and the ICC value. Indicator metric values are classified into a maximum of 5 categories (dqi_cat_1 to dqi_cat_5), each defined by a range of values. Category 1 (dqi_cat_1) corresponds to no data quality issues, whereas category 5 (dqi_cat_5) indicates critical data quality issues.

Here is the grading ruleset table used by dataquieR (you can download it using the link https://dataquality.qihs.uni-greifswald.de/extdata/grading_rulesets.xlsx):

indicator_metric	Description	dqi_cat_1	dqi_cat_2	dqi_cat_3	dqi_cat_4	dqi_cat_5
PCT_int_vfe_type	Percentage of observational units with expected and observed data type not matching	[0; 0]	NA	NA	(0; 1)	[1; 100]
PCT_int_uenc	Percentage of observational units with text containing invalid characters with respect to the expected encoding	[0; 0]	(0; 100]	NA	NA	NA
PCT_com_crm_mv	Percentage of observational units with missing values (NAs, jumps, and missing codes)	[0; 1)	[1; 100]	NA	NA	NA
PCT_com_qum_nonresp	Percentage of non-response rate	[0; 1)	[1; 20)	[20; 100]	NA	NA
PCT_com_qum_refusal	Percentage of refusal rate	[0; 1)	[1; 20) ]	NA	NA	NA
PCT_con_rvv_inum	Percentage of observational units with a numerical value outside the expected interval (admissible values) provided in the metadata	[0; 0]	NA	(0; 2)	[2; 5) ]	NA	NA	NA
PCT_con_rvv_itdat	Percentage of observational units with a date-time value outside the expected interval (admissible values) provided in the metadata	[0; 0]	NA	(0; 2)	[2; 5) ]	NA	NA	NA
PCT_acc_ud_outlu	Percentage of observational units considered an outlier with respect to one variable	[0; 0]	(0; 2)	[2; 5)	[5; 10) ]	NA	NA	NA
FLG_acc_ud_prop	TRUE/FALSE value indicating if the frequencies of the categories of a variable are inside an expected range of possible values provided in the metadata	[0; 0]	[1; 1]	NA	NA	NA
FLG_acc_ud_loc	TRUE/FALSE value indicating if the mean/median of a variable is inside an expected range provided in the metadata	[0; 0]	[1; 1]	NA	NA	NA
ICC_acc_ud_loc	ICC values computed using a mixed effects model	[0; 0.02)	[0.02; 0.03)	[0.03; 0.05)	[0.05; 0.1)	[0.1; 1]
NUM_con_con_contc	Number of observational units that are in conflict with the provided rule (implies that the combination of values in the rule are impossible)	[0; 1) ]	NA
CAT_applicability	Technical use only, please do not remove	[1; 1]	[2; 2]	[3; 3]	[4; 4]	[5; 5]
CAT_error	Technical use only, please do not remove	[1; 1]	[2; 2]	[3; 3]	[4; 4]	[5; 5]
CAT_anamat	Technical use only, please do not remove	[1; 1]	[2; 2]	[3; 3]	[4; 4]	[5; 5]
CAT_indicator_or_descriptor	Technical use only, please do not remove	[1; 1]	[2; 2]	[3; 3]	[4; 4]	[5; 5]

The default grading ruleset used to classify the results is identified by GRADING_RULESET equal to 0. This can be read in the first column of the grading ruleset table. Additional grading rulesets can be added to the table, using increasing numbers for GRADING_RULESET (see the section on how to customize the ruleset below). Since data quality is defined as fitness for a certain purpose, grading rulesets also depend on the purpose (ISO 2022) and may therefore differ from the one provided as default.

Below are the column names of the table and their definitions.

GRADING_RULESET, a number identifying the ruleset. The default ruleset is 0;
indicator_metric, the indicator metrics present in dataquieR for which a classification is defined. (If an indicator has multiple metrics, such as a flag FLG, a number NUM, or a percentage PCT, usually only one, i.e., the percentage, is listed here);
dqi_cat_1 to dqi_cat_5, the range in which a value of the given indicator metric must lie to be assigned to the respective category.

indicator_metric	Category
dqi_cat_1	Ok
dqi_cat_2	Unclear
dqi_cat_3	Moderate
dqi_cat_4	Important
dqi_cat_5	Critical
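
The classification described above can be sketched in a few lines of base R. This is only an illustration of how a value is matched against the ranges of one table row; the helper names parse_interval(), in_interval(), and classify_metric() are hypothetical and are not part of dataquieR, whose internal implementation may differ.

```r
# Parse an interval string such as "[0; 1)" into bounds and closedness.
parse_interval <- function(x) {
  x <- trimws(x)
  parts <- strsplit(substr(x, 2, nchar(x) - 1), ";", fixed = TRUE)[[1]]
  list(
    lower = as.numeric(trimws(parts[1])),
    upper = as.numeric(trimws(parts[2])),
    lower_closed = substr(x, 1, 1) == "[",
    upper_closed = substr(x, nchar(x), nchar(x)) == "]"
  )
}

# Check whether a single numeric value lies inside the interval.
in_interval <- function(value, interval) {
  i <- parse_interval(interval)
  above <- if (i$lower_closed) value >= i$lower else value > i$lower
  below <- if (i$upper_closed) value <= i$upper else value < i$upper
  above && below
}

# Return the label of the first category whose range contains the value,
# or NA if no range matches. Categories without a range (NA) are skipped.
classify_metric <- function(value, ranges,
                            labels = c("Ok", "Unclear", "Moderate",
                                       "Important", "Critical")) {
  for (k in seq_along(ranges)) {
    if (!is.na(ranges[k]) && in_interval(value, ranges[k])) {
      return(labels[k])
    }
  }
  NA_character_
}

# Ranges copied from the PCT_com_qum_nonresp row of the default table.
nonresp <- c("[0; 1)", "[1; 20)", "[20; 100]", NA, NA)
classify_metric(0.5, nonresp)  # "Ok"
classify_metric(20, nonresp)   # "Moderate"
```

Note how the bracket type decides the boundary cases: a non-response rate of exactly 20 falls into dqi_cat_3 because dqi_cat_2 is open at 20.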

How to customize the ruleset

CASE 1: a customized ruleset that applies to all the variables

A customized ruleset can be specified by providing a new table “grading_rulesets” using prep_add_data_frames() before rendering the report.

prep_add_data_frames(grading_rulesets = "~/Documents/my_grading_rulesets.xlsx")

This new table should have the same columns as the default ruleset table shown above, with GRADING_RULESET set to 0.
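
Before handing a customized table to dataquieR, it can help to check that it still has the expected shape. The following base R sketch is one possible sanity check, not a dataquieR function: check_ruleset() is a hypothetical helper, and the rule that the four technical CAT_ rows must be present under ruleset 0 is an assumption derived from the "please do not remove" note in the default table.

```r
# Columns described in this document; the default table may contain more
# (for example a Description column), which this check ignores.
required_cols <- c("GRADING_RULESET", "indicator_metric",
                   paste0("dqi_cat_", 1:5))
technical_rows <- c("CAT_applicability", "CAT_error",
                    "CAT_anamat", "CAT_indicator_or_descriptor")

# Return a character vector of problems; empty if none were found.
check_ruleset <- function(rs) {
  missing_cols <- setdiff(required_cols, colnames(rs))
  if (length(missing_cols) > 0) {
    return(paste("missing columns:", paste(missing_cols, collapse = ", ")))
  }
  problems <- character(0)
  is_default <- rs$GRADING_RULESET == 0
  if (!any(is_default)) {
    problems <- c(problems, "no rows with GRADING_RULESET = 0")
  }
  lost <- setdiff(technical_rows, rs$indicator_metric[is_default])
  if (length(lost) > 0) {
    problems <- c(problems,
                  paste("missing technical rows:", paste(lost, collapse = ", ")))
  }
  problems
}

# A toy table with one real metric plus the technical rows.
toy <- data.frame(
  GRADING_RULESET = 0,
  indicator_metric = c("PCT_com_crm_mv", technical_rows),
  dqi_cat_1 = c("[0; 1)", rep("[1; 1]", 4)),
  dqi_cat_2 = c("[1; 100]", rep("[2; 2]", 4)),
  dqi_cat_3 = c(NA, rep("[3; 3]", 4)),
  dqi_cat_4 = c(NA, rep("[4; 4]", 4)),
  dqi_cat_5 = c(NA, rep("[5; 5]", 4)),
  stringsAsFactors = FALSE
)

check_ruleset(toy)                                         # no problems
check_ruleset(toy[toy$indicator_metric != "CAT_error", ])  # one problem
```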

A complete example follows.

First, create a report. Here we use the SHIP-based example data.

library(dataquieR)  # load the package that provides dq_report2()
r1 <- dq_report2(study_data = "ship",
                 meta_data_v2 = "ship_meta_v2",
                 dimensions = NULL)

Then modify the grading ruleset table, which can be downloaded from https://dataquality.qihs.uni-greifswald.de/extdata/grading_rulesets.xlsx.

For example, you can modify it so that there are no ranges in dqi_cat_1, which corresponds to the category “Ok”.

Then save the new table as my_grading_rulesets.xlsx.

Example of grading ruleset

Now you have to add the new grading rulesets to the dataquieR data frame storage area (see tutorial for more details), before rendering the report.

prep_add_data_frames(grading_rulesets = "~/my_grading_rulesets.xlsx")

Now you can render the report.

r1  # if you want to see it in the Viewer
print(r1, "~/Desktop/report_example") # if you want to save the report in a folder on the Desktop

CASE 2: different versions of grading rulesets for different variables

It is possible to assign different rulesets to specific variables. To do that, you have to:

add new rows with a different GRADING_RULESET number to the table grading_rulesets (e.g., a new set with GRADING_RULESET = 1; this can be a completely new table with different values or only a subset of indicator metrics);
add a new column GRADING_RULESET to the item-level metadata, associating each variable with the number of the ruleset you want to use for that specific variable.

If no customized ruleset table is provided, the ruleset shipped with dataquieR will be used.

Below is an example with the SHIP-based example data, where the grading ruleset is modified only for a few variables:

First, create a new grading rule for the indicator metric ICC_acc_ud_loc in the table “my_grading_rulesets.xlsx”, and save the table as "my_new_grading_rule_example2.xlsx". The new rule is shown below in yellow (the default version is the one in green).

Example of new grading ruleset
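
The same step can be sketched in base R instead of editing the spreadsheet by hand. The ranges used for ruleset 1 are an assumption: the original values are only visible in the image, so the sketch uses a boundary at 0.005, which is merely consistent with the results discussed later in this example.

```r
# Default ICC row, copied from the default grading ruleset table.
default_icc <- data.frame(
  GRADING_RULESET = 0,
  indicator_metric = "ICC_acc_ud_loc",
  dqi_cat_1 = "[0; 0.02)",
  dqi_cat_2 = "[0.02; 0.03)",
  dqi_cat_3 = "[0.03; 0.05)",
  dqi_cat_4 = "[0.05; 0.1)",
  dqi_cat_5 = "[0.1; 1]",
  stringsAsFactors = FALSE
)

# New row for ruleset 1 (assumed ranges, see the note above).
custom_icc <- default_icc
custom_icc$GRADING_RULESET <- 1
custom_icc[paste0("dqi_cat_", 1:3)] <- NA_character_
custom_icc$dqi_cat_4 <- "[0; 0.005)"
custom_icc$dqi_cat_5 <- "[0.005; 1]"

# Ruleset 1 only needs the rows that differ from the default.
rulesets <- rbind(default_icc, custom_icc)
rulesets[, c("GRADING_RULESET", "indicator_metric", "dqi_cat_4", "dqi_cat_5")]
```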

This new grading rule (in yellow in the image) should apply only to the variables sbp1, sbp2, dbp1, and dbp2. To achieve this, modify the item_level metadata sheet of the original metadata file for the SHIP-based example data: for example, add a new column GRADING_RULESET and set it to 1 for the variables that should use the new grading rule.

Example of item_level new column
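
If the item-level metadata is available as a data frame, the new column can also be added programmatically. The sketch below uses a toy stand-in for the item_level sheet; the variable name "height" is hypothetical, and filling the remaining rows with 0 is an assumption (the text above only requires 1 for the selected variables).

```r
# Toy stand-in for the item_level sheet; the real sheet has many more columns.
item_level <- data.frame(
  VAR_NAMES = c("sbp1", "sbp2", "dbp1", "dbp2", "height"),
  stringsAsFactors = FALSE
)

# Variables that should be graded with ruleset 1.
custom_vars <- c("sbp1", "sbp2", "dbp1", "dbp2")

# 1 for the selected variables, 0 (the default ruleset) for all others.
item_level$GRADING_RULESET <- ifelse(item_level$VAR_NAMES %in% custom_vars, 1, 0)
item_level
```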

After saving the new metadata file as "ship_meta_v2_new.xlsx" on the Desktop, you can create the report using the new version of the metadata file (because the item_level is now different from before).

r2 <- dq_report2(study_data = "ship", 
                 meta_data_v2 = "~/Desktop/ship_meta_v2_new.xlsx", 
                 dimensions = NULL)

Now add the new grading rulesets to the dataquieR data frame storage area before rendering the report.

prep_add_data_frames(grading_rulesets = "~/my_new_grading_rule_example2.xlsx")

Finally you can render the report and obtain the following result for “Variance proportion device”.

r2  # if you want to see it in the Viewer
print(r2, "~/Desktop/report_example_n2") # if you want to save the report in a folder on the Desktop

Example of ICC results

There you can see that the systolic and diastolic blood pressure variables are graded Important (two variables with ICC < 0.005) or Critical (two variables with ICC > 0.005) based on the new GRADING_RULESET (1), whereas body height is graded Unclear based on the default GRADING_RULESET (0).

How to customize the grading format

A grading format is a table containing the information on the RGB color codes (color) and text (label) used to represent the five data quality categories inside the report.

Here is the default grading format table used by dataquieR:

category	label	color
1	Ok	33 102 172
2	Unclear	67 147 195
3	Moderate	227 186 20
4	Important	214 96 77
5	Critical	178 23 43

The desired color can be specified in different ways:

3 decimal numbers separated by spaces, representing the RGB components of the colors (e.g., “227 186 20”);
hexadecimal color codes (e.g., “#rrggbb”); or
color names (e.g., “red”)

Please look here for more information.
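
To preview how the three notations relate to each other, they can be converted to one common form with base R's grDevices. This is only a sketch for checking your own color choices; to_hex() is a hypothetical helper, and dataquieR's own parsing of the color column may work differently.

```r
# Convert any of the three notations to an upper-case "#RRGGBB" string.
to_hex <- function(color) {
  color <- trimws(color)
  # Three decimal numbers separated by spaces, e.g. "227 186 20".
  if (grepl("^[0-9]+( +[0-9]+){2}$", color)) {
    parts <- as.numeric(strsplit(color, " +")[[1]])
    return(grDevices::rgb(parts[1], parts[2], parts[3], maxColorValue = 255))
  }
  # Hexadecimal codes and color names are both understood by col2rgb().
  channels <- grDevices::col2rgb(color)
  grDevices::rgb(channels[1, 1], channels[2, 1], channels[3, 1],
                 maxColorValue = 255)
}

to_hex("227 186 20")  # "#E3BA14"
to_hex("#e3ba14")     # "#E3BA14"
to_hex("red")         # "#FF0000"
```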

You can modify colors and text by providing a new table grading_formats using prep_add_data_frames() before rendering the report. For example:

prep_add_data_frames(grading_formats = "~/Documents/my_grading_formats.xlsx")

Summary of indicator metrics currently present in the grading ruleset of dataquieR

Data quality indicator metric acronym	Data quality indicator metric	Dimension - Domain	Indicator	Function	Description
PCT_int_vfe_type	Percentage	Integrity - Value format error	Data type mismatch	int_datatype_matrix	Percentage of observational units with expected and observed data type not matching
PCT_int_uenc	Percentage	Integrity - Value format error	Inadmissible data format	int_encoding_errors	Percentage of observational units with text containing invalid characters with respect to the expected encoding
PCT_com_crm_mv	Percentage	Completeness - Crude missingness	Crude missingness	com_item_missingness	Percentage of observational units with missing values (NAs, jumps, and missing codes)
PCT_com_qum_nonresp	Percentage	Completeness - Qualified missingness	Item non-response rate	com_qualified_item_missingness, com_qualified_segment_missingness	Percentage of non-response rate; see the explanation on qualified missingness labels from AAPOR
PCT_com_qum_refusal	Percentage	Completeness - Qualified missingness	Refusal rate	com_qualified_item_missingness, com_qualified_segment_missingness	Percentage of refusal rate; see the explanation on qualified missingness labels from AAPOR
PCT_con_rvv_icat	Percentage	Consistency - Range and value violations	Inadmissible categorical values	con_inadmissible_categorical	Percentage of observational units with observed categories not matching any possible expected one
PCT_con_rvv_inum	Percentage	Consistency - Range and value violations	Inadmissible numerical values	con_limit_deviations	Percentage of observational units with a numerical value outside the expected interval (admissible values) provided in the metadata
PCT_con_rvv_unum	Percentage	Consistency - Range and value violations	Uncertain numerical values	con_limit_deviations	Percentage of observational units with a numerical value outside the expected interval (plausible values) provided in the metadata
PCT_con_rvv_itdat	Percentage	Consistency - Range and value violations	Inadmissible datetime values	con_limit_deviations	Percentage of observational units with a date-time value outside the expected interval (admissible values) provided in the metadata
PCT_con_rvv_utdat	Percentage	Consistency - Range and value violations	Uncertain datetime values	con_limit_deviations	Percentage of observational units with a date-time value outside the expected interval (plausible values) provided in the metadata
PCT_acc_ud_outlu	Percentage	Accuracy - Unexpected distributions	Univariate outliers	acc_univariate_outlier	Percentage of observational units considered an outlier with respect to one variable
FLG_acc_ud_shape	Flag	Accuracy - Unexpected distributions	Unexpected shape, unexpected scale	acc_shape_or_scale	TRUE/FALSE value indicating if the single intervals in the study data deviate significantly from a theoretical expected distribution
PCT_acc_ud_loc	Percentage	Accuracy - Unexpected distributions	Unexpected location	acc_margins	Percentage of groups (e.g., examiners) with a marginal mean outside the possible range identified by the threshold value
FLG_acc_ud_prop	Flag	Accuracy - Unexpected distributions	Unexpected proportion	acc_distributions	TRUE/FALSE value indicating if the frequencies of the categories of a variable are inside an expected range of possible values provided in the metadata
FLG_acc_ud_loc	Flag	Accuracy - Unexpected distributions	Unexpected location	acc_distributions	TRUE/FALSE value indicating if the mean/median of a variable is inside an expected range provided in the metadata
ICC_acc_ud_loc	Variance proportion	Accuracy - Unexpected distributions	Unexpected location	acc_varcomp	ICC values computed using a mixed effects model
NUM_con_con_contc	Number	Consistency - Contradictions	Logical contradictions	con_contradictions_redcap	Number of observational units that are in conflict with the provided rule (implies that the combination of values in the rule is impossible)
PCT_con_con_contu	Percentage	Consistency - Contradictions	Empirical contradictions	con_contradictions_redcap	Percentage of observational units that are in conflict with the provided rule (implies that the combination of values in the rule is not impossible but unlikely)

Advanced users: modify options

A user can modify the options to specify the name of the table containing the grading rules, if that is different from “grading_rulesets”.

For example, let’s say you already have a sheet called mygrading in your metadata “mymeta.xlsx” (as in the image below), containing the grading rules for the report.

Example of metadata with grading ruleset

First you need to create a report with that metadata.

r1 <- dq_report2(study_data = "study_data", 
                 meta_data_v2 = "~/Desktop/mymeta.xlsx", 
                 dimensions = NULL)

If you continue immediately after creating the report, you do not need to re-add the metadata to the dataquieR data frame storage area. Otherwise, you need to reload the metadata with prep_load_workbook_like_file("~/Desktop/mymeta.xlsx").

Then you need to modify the options to specify the name of the table you want to use as grading rules, in this example “mygrading”.

options("dataquieR.grading_rulesets" =
          "mygrading")

Now you can render the report.

r1 #it will be visible in the Viewer in RStudio
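
When changing the option for a single report, it can be convenient to restore the previous setting afterwards. The following sketch uses only base R's options() and getOption(); how dataquieR reads the option internally is not shown here, and the fallback name "grading_rulesets" is taken from the text above.

```r
# Table name that applies when the option is not set (from the text above).
default_name <- "grading_rulesets"

# options() returns the previous value, so it can be restored later.
old <- options(dataquieR.grading_rulesets = "mygrading")
active <- getOption("dataquieR.grading_rulesets", default_name)
active  # "mygrading"

# ... render the report here ...

# Restore whatever was set before.
options(old)
```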

ISO (2022). ISO 8000-1:2022 data quality part 1: overview.