Grading ruleset description

A grading ruleset is a table that contains the information needed to classify results. The results are represented by indicator metrics: numeric values that correlate with the size of a data quality problem related to one specific data quality indicator (the larger the number, the larger the issue, or vice versa). Examples of indicator metrics are the percentage of missing values and the ICC value. Indicator metric values are classified into a maximum of 5 categories (dqi_cat_1 to dqi_cat_5), each defined by a range of values. Category 1 (dqi_cat_1) corresponds to no data quality issues, whereas category 5 (dqi_cat_5) indicates critical data quality issues.

Here is the grading ruleset table used by dataquieR (you can download it using the link https://dataquality.qihs.uni-greifswald.de/extdata/grading_rulesets.xlsx):

GRADING_RULESET | indicator_metric | Description | dqi_cat_1 | dqi_cat_2 | dqi_cat_3 | dqi_cat_4 | dqi_cat_5
0 | PCT_int_vfe_type | Percentage of observational units with expected and observed data type not matching | [0; 0] | NA | NA | (0; 1) | [1; 100]
0 | PCT_int_uenc | Percentage of observational units with text containing invalid characters with respect to the expected encoding | [0; 0] | (0; 100] | NA | NA | NA
0 | PCT_com_crm_mv | Percentage of observational units with missing values (NAs, jumps, and missing codes) | [0; 1) | [1; 100] | NA | NA | NA
0 | PCT_com_qum_nonresp | Percentage of non-response rate | [0; 1) | [1; 20) | [20; 100] | NA | NA
0 | PCT_com_qum_refusal | Percentage of refusal rate | [0; 1) | [1; 20) | ] NA NA NA
0 | PCT_con_rvv_inum | Percentage of observational units with a numerical value outside the expected interval (admissible values) provided in the metadata | [0; 0] | NA | (0; 2) | [2; 5) | ] NA NA NA
0 | PCT_con_rvv_itdat | Percentage of observational units with a date-time value outside the expected interval (admissible values) provided in the metadata | [0; 0] | NA | (0; 2) | [2; 5) | ] NA NA NA
0 | PCT_acc_ud_outlu | Percentage of observational units considered an outlier with respect to one variable | [0; 0] | (0; 2) | [2; 5) | [5; 10) | ] NA NA NA
0 | FLG_acc_ud_prop | TRUE/FALSE value indicating if the frequencies of the categories of a variable are inside an expected range of possible values provided in the metadata | [0; 0] | [1; 1] | NA | NA | NA
0 | FLG_acc_ud_loc | TRUE/FALSE value indicating if the mean/median of a variable is inside an expected range provided in the metadata | [0; 0] | [1; 1] | NA | NA | NA
0 | ICC_acc_ud_loc | ICC values computed using a mixed effects model | [0; 0.02) | [0.02; 0.03) | [0.03; 0.05) | [0.05; 0.1) | [0.1; 1]
0 | NUM_con_con_contc | Number of observational units that are in conflict with the provided rule (implies that the combination of values in the rule are impossible) | [0; 1) | ] NA
0 | CAT_applicability | Technical use only, please do not remove | [1; 1] | [2; 2] | [3; 3] | [4; 4] | [5; 5]
0 | CAT_error | Technical use only, please do not remove | [1; 1] | [2; 2] | [3; 3] | [4; 4] | [5; 5]
0 | CAT_anamat | Technical use only, please do not remove | [1; 1] | [2; 2] | [3; 3] | [4; 4] | [5; 5]
0 | CAT_indicator_or_descriptor | Technical use only, please do not remove | [1; 1] | [2; 2] | [3; 3] | [4; 4] | [5; 5]


The default grading ruleset used to classify the results is identified by GRADING_RULESET equal to 0, as shown in the first column of the table above. Additional grading rulesets can be added to the table, using increasing numbers for GRADING_RULESET (see the section on how to customize the ruleset below). Since data quality is defined as fitness for a certain purpose, grading rulesets also depend on the purpose (ISO 2022) and may therefore differ from the default one.

The columns of the table are defined as follows:

  • GRADING_RULESET: a number identifying the ruleset. The default ruleset is 0;
  • indicator_metric: the indicator metrics available in dataquieR for which a classification is defined. (If an indicator has multiple metrics, such as a flag FLG, a number NUM, or a percentage PCT, usually only one of them, typically the percentage, is listed here);
  • dqi_cat_1 to dqi_cat_5: the range in which a value of the given indicator metric must lie to be assigned to the respective category.
indicator_metric | Category
dqi_cat_1 | Ok
dqi_cat_2 | Unclear
dqi_cat_3 | Moderate
dqi_cat_4 | Important
dqi_cat_5 | Critical
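
To make the interval notation concrete, the following base R sketch classifies a single indicator metric value against one row of the ruleset. The helpers parse_interval(), in_interval(), and classify() are hypothetical names written for this illustration; they are not dataquieR functions and are not meant to reproduce how the package evaluates rulesets internally. The intervals are those of the PCT_com_qum_nonresp row of the default table.

```r
# Hypothetical helpers for illustration only (not part of dataquieR).

# Split an interval such as "[1; 20)" into bounds and open/closed flags.
parse_interval <- function(x) {
  x <- trimws(x)
  n <- nchar(x)
  bounds <- as.numeric(strsplit(substr(x, 2, n - 1), ";", fixed = TRUE)[[1]])
  list(lower = bounds[1],
       upper = bounds[2],
       lower_closed = substr(x, 1, 1) == "[",
       upper_closed = substr(x, n, n) == "]")
}

# TRUE if value lies inside the interval; an NA interval never matches.
in_interval <- function(value, interval) {
  if (is.na(interval)) return(FALSE)
  iv <- parse_interval(interval)
  above <- if (iv$lower_closed) value >= iv$lower else value > iv$lower
  below <- if (iv$upper_closed) value <= iv$upper else value < iv$upper
  above && below
}

# Return the name of the first category whose interval contains value.
classify <- function(value, rule) {
  for (cat_name in names(rule)) {
    if (in_interval(value, rule[[cat_name]])) return(cat_name)
  }
  NA_character_
}

rule_nonresp <- c(dqi_cat_1 = "[0; 1)",
                  dqi_cat_2 = "[1; 20)",
                  dqi_cat_3 = "[20; 100]",
                  dqi_cat_4 = NA,
                  dqi_cat_5 = NA)

classify(0.5, rule_nonresp)  # "dqi_cat_1" (Ok)
classify(20, rule_nonresp)   # "dqi_cat_3" (Moderate)
```

Note how the bracket type decides the boundary case: 20 is excluded from [1; 20) and included in [20; 100], so it falls into dqi_cat_3.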


How to customize the ruleset

CASE 1: a customized ruleset that applies to all the variables

A customized ruleset can be specified by providing a new table “grading_rulesets” with prep_add_data_frames() before rendering the report.

prep_add_data_frames(grading_rulesets = "~/Documents/my_grading_rulesets.xlsx")

This new table must have the same columns as the default ruleset table shown above, with the value 0 in the GRADING_RULESET column.
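
Before registering a customized table, it can help to check these two requirements in R. The sketch below is one possible check in base R; check_ruleset() is a hypothetical helper, not a dataquieR function, and the one-row table is only a stand-in for a table read from your own file. A real customized table should keep all rows of the downloaded default, including the CAT_* rows marked "Technical use only".

```r
# Columns of the default grading ruleset table.
required_cols <- c("GRADING_RULESET", "indicator_metric", "Description",
                   paste0("dqi_cat_", 1:5))

# Hypothetical helper: stop with a message if the table cannot serve as
# a replacement for the default ruleset.
check_ruleset <- function(rules) {
  missing_cols <- setdiff(required_cols, names(rules))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (!any(rules$GRADING_RULESET == 0)) {
    stop("No rows with GRADING_RULESET equal to 0")
  }
  invisible(TRUE)
}

# Stand-in for a customized table (one row only, for illustration).
my_rules <- data.frame(
  GRADING_RULESET = 0,
  indicator_metric = "PCT_com_crm_mv",
  Description = "Percentage of observational units with missing values",
  dqi_cat_1 = "[0; 1)",
  dqi_cat_2 = "[1; 100]",
  dqi_cat_3 = NA,
  dqi_cat_4 = NA,
  dqi_cat_5 = NA,
  stringsAsFactors = FALSE
)

check_ruleset(my_rules)  # passes silently
```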

A complete example follows.

First, create a report. Here we use the SHIP-based example data.

library(dataquieR)

# Create the report from the SHIP-based example data
r1 <- dq_report2(study_data = "ship",
                 meta_data_v2 = "ship_meta_v2",
                 dimensions = NULL)

Then modify the grading ruleset table, which can be downloaded from https://dataquality.qihs.uni-greifswald.de/extdata/grading_rulesets.xlsx.

For example, you can modify it so that dqi_cat_1, which corresponds to the category “Ok”, contains no ranges.

Then save the new table as my_grading_rulesets.xlsx.

Now you have to add the new grading rulesets to the dataquieR data frame storage area (see tutorial for more details), before rendering the report.

prep_add_data_frames(grading_rulesets = "~/my_grading_rulesets.xlsx")

Now you can render the report.

r1  # if you want to see it in the Viewer
print(r1, "~/Desktop/report_example") # if you want to save the report in a folder on the Desktop

CASE 2: different versions of grading rulesets for different variables

It is possible to assign different rulesets to specific variables. To do that, you have to:

  • add new rows with a different GRADING_RULESET number to the table grading_rulesets (e.g., a new set of rows with GRADING_RULESET = 1; this can be a completely new table with different values or only a subset of the indicator metrics);

  • add a new column GRADING_RULESET in the item-level metadata, associating each variable to the number of the corresponding ruleset you want to use for that specific variable.

If no customized ruleset table is provided, the ruleset shipped with dataquieR will be used.

The following example uses the SHIP-based example data, with a modified grading ruleset for only a few variables.

First, create a new grading rule for the indicator metric ICC_acc_ud_loc in the table “my_grading_rulesets.xlsx”, and save the table as "my_new_grading_rule_example2.xlsx". The new rule is shown below in yellow (the default version is the one in green).
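
If you prefer to prepare the rows in R instead of a spreadsheet, the step could look like the sketch below. The default rule is copied from the ruleset table above (Description column omitted for brevity). The thresholds of the new rule are illustrative placeholders chosen for this sketch; they are not taken from the tutorial's spreadsheet.

```r
# Default rule for ICC_acc_ud_loc (GRADING_RULESET 0), from the table above.
default_rule <- data.frame(
  GRADING_RULESET = 0,
  indicator_metric = "ICC_acc_ud_loc",
  dqi_cat_1 = "[0; 0.02)",
  dqi_cat_2 = "[0.02; 0.03)",
  dqi_cat_3 = "[0.03; 0.05)",
  dqi_cat_4 = "[0.05; 0.1)",
  dqi_cat_5 = "[0.1; 1]",
  stringsAsFactors = FALSE
)

# Stricter rule under GRADING_RULESET 1.
# ASSUMPTION: placeholder thresholds, for illustration only.
new_rule <- data.frame(
  GRADING_RULESET = 1,
  indicator_metric = "ICC_acc_ud_loc",
  dqi_cat_1 = NA,
  dqi_cat_2 = NA,
  dqi_cat_3 = NA,
  dqi_cat_4 = "[0; 0.005)",
  dqi_cat_5 = "[0.005; 1]",
  stringsAsFactors = FALSE
)

# Both versions live in the same table, told apart by GRADING_RULESET.
rules <- rbind(default_rule, new_rule)
rules[rules$indicator_metric == "ICC_acc_ud_loc",
      c("GRADING_RULESET", "dqi_cat_4", "dqi_cat_5")]
```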


This new grading rule (in yellow in the image) should apply only to the variables sbp1, sbp2, dbp1, and dbp2. To do this, modify the item_level sheet of the original metadata file for the SHIP-based example data: for example, add a new column GRADING_RULESET containing 1 for the variables that should use the new grading rule.
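
The same assignment can be sketched in base R on a data frame that stands in for the item-level metadata. The toy table below holds only a VAR_NAMES column. The variable name "height" and the choice to give all remaining variables the default value 0 are assumptions made for this illustration.

```r
# Toy stand-in for the item-level metadata sheet (assumed column: VAR_NAMES).
item_level <- data.frame(
  VAR_NAMES = c("sbp1", "sbp2", "dbp1", "dbp2", "height"),
  stringsAsFactors = FALSE
)

# Variables that should be graded with ruleset 1.
custom_vars <- c("sbp1", "sbp2", "dbp1", "dbp2")

# ASSUMPTION: all other variables are explicitly set to the default ruleset 0.
item_level$GRADING_RULESET <- ifelse(item_level$VAR_NAMES %in% custom_vars, 1, 0)

item_level
```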


After saving the new metadata file as "ship_meta_v2_new.xlsx" on the Desktop, you can create the report using the new version of the metadata file (because the item_level is now different from before).

r2 <- dq_report2(study_data = "ship", 
                 meta_data_v2 = "~/Desktop/ship_meta_v2_new.xlsx", 
                 dimensions = NULL)

Now add the new grading rulesets to the dataquieR data frame storage area before rendering the report.

prep_add_data_frames(grading_rulesets = "~/my_new_grading_rule_example2.xlsx")

Finally you can render the report and obtain the following result for “Variance proportion device”.

r2  # if you want to see it in the Viewer
print(r2, "~/Desktop/report_example_n2") # if you want to save the report in a folder on the Desktop


There you can see that the systolic and diastolic blood pressure variables are graded as Important (two variables with ICC < 0.005) or Critical (two variables with ICC > 0.005) based on the new GRADING_RULESET (1), whereas body height is graded as Unclear based on the default GRADING_RULESET (0).


How to customize the grading format

A grading format is a table containing the information on the RGB color codes (color) and text (label) used to represent the five data quality categories inside the report.

Here is the default grading format table used by dataquieR:

category | label | color
1 | Ok | 33 102 172
2 | Unclear | 67 147 195
3 | Moderate | 227 186 20
4 | Important | 214 96 77
5 | Critical | 178 23 43


The desired color can be specified in different ways:

  • 3 decimal numbers separated by spaces, representing the RGB components of the colors (e.g., “227 186 20”);

  • hexadecimal color codes (e.g., “#rrggbb”); or

  • color names (e.g., “red”)

Please look here for more information.
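
To see how the three notations relate, base R can convert a hexadecimal code or a color name into its RGB components with grDevices::col2rgb(). The wrapper to_rgb_string() is a hypothetical helper for this sketch that formats the result as three space-separated decimal numbers.

```r
# Hypothetical helper: convert any R color specification to "R G B".
to_rgb_string <- function(col) {
  paste(grDevices::col2rgb(col), collapse = " ")
}

to_rgb_string("#E3BA14")  # "227 186 20", the default color of category 3
to_rgb_string("red")      # "255 0 0"
```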

You can modify colors and labels by providing a new table grading_formats with prep_add_data_frames() before rendering the report. For example:

prep_add_data_frames(grading_formats = "~/Documents/my_grading_formats.xlsx")
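
As a sketch of what such a table could contain, the following base R code rebuilds the default grading format and changes one label and one color. The new label "Check" and the new color are arbitrary examples. The commented-out last line assumes that prep_add_data_frames() also accepts a data frame passed as a named argument; otherwise, save the table to a file and pass its path as shown above.

```r
# Default grading format, as listed in the table above.
my_formats <- data.frame(
  category = 1:5,
  label = c("Ok", "Unclear", "Moderate", "Important", "Critical"),
  color = c("33 102 172", "67 147 195", "227 186 20", "214 96 77", "178 23 43"),
  stringsAsFactors = FALSE
)

# Example changes (arbitrary choices for illustration).
my_formats$label[my_formats$category == 2] <- "Check"
my_formats$color[my_formats$category == 5] <- "139 0 0"

my_formats

# ASSUMPTION: data frames can be passed directly as named arguments.
# prep_add_data_frames(grading_formats = my_formats)
```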


Summary of indicator metrics currently present in the grading ruleset of dataquieR

Data quality indicator metric acronym | Data quality indicator metric | Dimension - Domain | Indicator | Function | Description
PCT_int_vfe_type | Percentage | Integrity - Value format error | Data type mismatch | int_datatype_matrix | Percentage of observational units with expected and observed data type not matching
PCT_int_uenc | Percentage | Integrity - Value format error | Inadmissible data format | int_encoding_errors | Percentage of observational units with text containing invalid characters with respect to the expected encoding
PCT_com_crm_mv | Percentage | Completeness - Crude missingness | Crude missingness | com_item_missingness | Percentage of observational units with missing values (NAs, jumps, and missing codes)
PCT_com_qum_nonresp | Percentage | Completeness - Qualified missingness | Item non-response rate | com_qualified_item_missingness, com_qualified_segment_missingness | Percentage of non-response rate (see the explanation on qualified missingness labels from AAPOR)
PCT_com_qum_refusal | Percentage | Completeness - Qualified missingness | Refusal rate | com_qualified_item_missingness, com_qualified_segment_missingness | Percentage of refusal rate (see the explanation on qualified missingness labels from AAPOR)
PCT_con_rvv_icat | Percentage | Consistency - Range and value violations | Inadmissible categorical values | con_inadmissible_categorical | Percentage of observational units with observed categories not matching any possible expected one
PCT_con_rvv_inum | Percentage | Consistency - Range and value violations | Inadmissible numerical values | con_limit_deviations | Percentage of observational units with a numerical value outside the expected interval (admissible values) provided in the metadata
PCT_con_rvv_unum | Percentage | Consistency - Range and value violations | Uncertain numerical values | con_limit_deviations | Percentage of observational units with a numerical value outside the expected interval (plausible values) provided in the metadata
PCT_con_rvv_itdat | Percentage | Consistency - Range and value violations | Inadmissible datetime values | con_limit_deviations | Percentage of observational units with a date-time value outside the expected interval (admissible values) provided in the metadata
PCT_con_rvv_utdat | Percentage | Consistency - Range and value violations | Uncertain datetime values | con_limit_deviations | Percentage of observational units with a date-time value outside the expected interval (plausible values) provided in the metadata
PCT_acc_ud_outlu | Percentage | Accuracy - Unexpected distributions | Univariate outliers | acc_univariate_outlier | Percentage of observational units considered an outlier with respect to one variable
FLG_acc_ud_shape | Flag | Accuracy - Unexpected distributions | Unexpected shape, unexpected scale | acc_shape_or_scale | TRUE/FALSE value indicating if the single intervals in the study data deviate significantly from a theoretical expected distribution
PCT_acc_ud_loc | Percentage | Accuracy - Unexpected distributions | Unexpected location | acc_margins | Percentage of groups (e.g., examiners) with a marginal mean outside the possible range identified by the threshold value
FLG_acc_ud_prop | Flag | Accuracy - Unexpected distributions | Unexpected proportion | acc_distributions | TRUE/FALSE value indicating if the frequencies of the categories of a variable are inside an expected range of possible values provided in the metadata
FLG_acc_ud_loc | Flag | Accuracy - Unexpected distributions | Unexpected location | acc_distributions | TRUE/FALSE value indicating if the mean/median of a variable is inside an expected range provided in the metadata
ICC_acc_ud_loc | Variance proportion | Accuracy - Unexpected distributions | Unexpected location | acc_varcomp | ICC values computed using a mixed effects model
NUM_con_con_contc | Number | Consistency - Contradictions | Logical contradictions | con_contradictions_redcap | Number of observational units that are in conflict with the provided rule (implies that the combination of values in the rule are impossible)
PCT_con_con_contu | Percentage | Consistency - Contradictions | Empirical contradictions | con_contradictions_redcap | Percentage of observational units that are in conflict with the provided rule (implies that the combination of values in the rule is not impossible but unlikely)

Advanced users: modify options

A user can modify the options to specify the name of the table containing the grading rules, if that is different from “grading_rulesets”.

For example, let’s say you already have a sheet called mygrading in your metadata “mymeta.xlsx” (as in the image below), containing the grading rules for the report.

First you need to create a report with that metadata.

r1 <- dq_report2(study_data = "study_data", 
                 meta_data_v2 = "~/Desktop/mymeta.xlsx", 
                 dimensions = NULL)

If you continue immediately after creating the report, you do not need to re-add the metadata to the dataquieR data frame storage area. Otherwise, you need to reload the metadata with prep_load_workbook_like_file("~/Desktop/mymeta.xlsx").

Then you need to modify the options to specify the name of the table you want to use as grading rules, in this example “mygrading”.

options("dataquieR.grading_rulesets" =
          "mygrading")

Now you can render the report.

r1 #it will be visible in the Viewer in RStudio
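
Because options() changes the setting for the whole R session, a cautious variant is to store the previous value and restore it after rendering. The sketch below uses only base R; the option name and the table name "mygrading" are the ones from the example above.

```r
# Set the option and keep the previous value (options() returns it invisibly).
old <- options(dataquieR.grading_rulesets = "mygrading")

# Confirm which table name is currently configured.
current <- getOption("dataquieR.grading_rulesets")
current  # "mygrading"

# ... render the report here ...

# Restore the previous setting so that later reports are not affected.
options(old)
```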
