A grading ruleset is a table containing the information needed to
classify results. The results are represented by a set of indicator
metrics, i.e., numeric values that correlate with the size of a data
quality problem related to one specific data quality indicator (the
larger the number, the larger the issue, or vice versa). Examples of
indicator metrics are the percentage of missing values and the ICC
value. The indicator metric values are classified into at most five
categories (dqi_cat_1 to dqi_cat_5), each defined by a range of values.
Category 1 (dqi_cat_1) corresponds to no data quality issues, whereas
category 5 indicates critical data quality issues.
Here is the grading ruleset table used by dataquieR (you can download it using the link https://dataquality.qihs.uni-greifswald.de/extdata/grading_rulesets.xlsx):
GRADING_RULESET | indicator_metric | Description | dqi_cat_1 | dqi_cat_2 | dqi_cat_3 | dqi_cat_4 | dqi_cat_5 |
---|---|---|---|---|---|---|---|
0 | PCT_int_vfe_type | Percentage of observational units with expected and observed data type not matching | [0; 0] | NA | NA | (0; 1) | [1; 100] |
0 | PCT_int_uenc | Percentage of observational units with text containing invalid characters with respect to the expected encoding | [0; 0] | (0; 100] | NA | NA | NA |
0 | PCT_com_crm_mv | Percentage of observational units with missing values (NAs, jumps, and missing codes) | [0; 1) | [1; 100] | NA | NA | NA |
0 | PCT_com_qum_nonresp | Percentage of non-response rate | [0; 1) | [1; 20) | [20; 100] | NA | NA |
0 | PCT_com_qum_refusal | Percentage of refusal rate | [0; 1) | [1; 20) ] | NA | NA | NA |
0 | PCT_con_rvv_inum | Percentage of observational units with a numerical value outside the expected interval (admissible values) provided in the metadata | [0; 0] | NA | (0; 2) | [2; 5) ] | NA |
0 | PCT_con_rvv_itdat | Percentage of observational units with a date-time value outside the expected interval (admissible values) provided in the metadata | [0; 0] | NA | (0; 2) | [2; 5) ] | NA |
0 | PCT_acc_ud_outlu | Percentage of observational units considered an outlier with respect to one variable | [0; 0] | (0; 2) | [2; 5) | [5; 10) ] | NA |
0 | FLG_acc_ud_prop | TRUE/FALSE value indicating if the frequencies of the categories of a variable are inside an expected range of possible values provided in the metadata | [0; 0] | [1; 1] | NA | NA | NA |
0 | FLG_acc_ud_loc | TRUE/FALSE value indicating if the mean/median of a variable is inside an expected range provided in the metadata | [0; 0] | [1; 1] | NA | NA | NA |
0 | ICC_acc_ud_loc | ICC values computed using a mixed effects model | [0; 0.02) | [0.02; 0.03) | [0.03; 0.05) | [0.05; 0.1) | [0.1; 1] |
0 | NUM_con_con_contc | Number of observational units that are in conflict with the provided rule (implies that the combination of values in the rule is impossible) | [0; 1) ] | NA | | | |
0 | CAT_applicability | Technical use only, please do not remove | [1; 1] | [2; 2] | [3; 3] | [4; 4] | [5; 5] |
0 | CAT_error | Technical use only, please do not remove | [1; 1] | [2; 2] | [3; 3] | [4; 4] | [5; 5] |
0 | CAT_anamat | Technical use only, please do not remove | [1; 1] | [2; 2] | [3; 3] | [4; 4] | [5; 5] |
0 | CAT_indicator_or_descriptor | Technical use only, please do not remove | [1; 1] | [2; 2] | [3; 3] | [4; 4] | [5; 5] |
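To make the interval notation concrete, here is a minimal sketch in base R of how a metric value could be matched against the category ranges of one rule. It assumes the usual reading of the notation (square brackets include the endpoint, parentheses exclude it, NA means the category is not used). The helper names parse_interval() and classify_metric() are hypothetical and are not part of dataquieR; this is not the package's own implementation.

```r
# Hypothetical helper: split an interval string such as "[0; 1)" into bounds.
# Returns NULL for NA or empty cells, i.e., categories that are not used.
parse_interval <- function(x) {
  x <- trimws(x)
  if (is.na(x) || x == "" || x == "NA") return(NULL)
  inner <- substr(x, 2, nchar(x) - 1)
  bounds <- as.numeric(strsplit(inner, ";", fixed = TRUE)[[1]])
  if (length(bounds) != 2 || anyNA(bounds)) stop("Cannot parse interval: ", x)
  list(lower = bounds[1], upper = bounds[2],
       lower_closed = substr(x, 1, 1) == "[",
       upper_closed = substr(x, nchar(x), nchar(x)) == "]")
}

# Hypothetical helper: return the number of the first category whose range
# contains the value, or NA if no range matches.
classify_metric <- function(value, rule) {
  for (category in seq_along(rule)) {
    iv <- parse_interval(rule[[category]])
    if (is.null(iv)) next
    above <- if (iv$lower_closed) value >= iv$lower else value > iv$lower
    below <- if (iv$upper_closed) value <= iv$upper else value < iv$upper
    if (above && below) return(category)
  }
  NA_integer_
}

# Ranges copied from the PCT_com_qum_nonresp row of the default ruleset
rule_nonresp <- c(dqi_cat_1 = "[0; 1)", dqi_cat_2 = "[1; 20)",
                  dqi_cat_3 = "[20; 100]", dqi_cat_4 = NA, dqi_cat_5 = NA)

classify_metric(0.4, rule_nonresp)  # 1
classify_metric(1, rule_nonresp)    # 2
classify_metric(20, rule_nonresp)   # 3
```

With these ranges, a non-response rate of exactly 1 percent already falls into category 2, because the upper bound of category 1 is open.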
The default grading ruleset used to classify the results is
identified by the code GRADING_RULESET equal to 0, shown in the first
column of the grading ruleset table above. Additional grading rulesets
can be added to the table, using increasing numbers for
GRADING_RULESET (see below for how to customize rulesets). Since data
quality is defined as fitness for a certain purpose, the grading
rulesets also depend on the purpose (ISO 2022) and may therefore
differ from the default one.
The columns of the table and their definitions are as follows:
GRADING_RULESET, a number identifying the ruleset. The default ruleset is 0;
indicator_metric, contains the indicator metrics present in dataquieR for which we want to define a classification. (If, for example, an indicator has multiple metrics, such as a flag FLG, a number NUM, or a percentage PCT, usually only one (i.e., the percentage) is listed here);
dqi_cat_1 to dqi_cat_5, contain the range in which a value must lie to be assigned to the respective category, given the indicator metric.
The five category columns correspond to the following category labels:
Column | Category |
---|---|
dqi_cat_1 | Ok |
dqi_cat_2 | Unclear |
dqi_cat_3 | Moderate |
dqi_cat_4 | Important |
dqi_cat_5 | Critical |
A customized ruleset can be specified by providing a new table
“grading_rulesets” with prep_add_data_frames() before rendering the
report.
prep_add_data_frames(grading_rulesets = "~/Documents/my_grading_rulesets.xlsx")
This new table should have the same columns as the default ruleset
table shown above, and its GRADING_RULESET column should contain the
number 0, the code of the default ruleset.
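Before handing a customized table to prep_add_data_frames(), it can help to check its structure. The following base R sketch checks three things: that the columns described above are present, that the rows marked "Technical use only, please do not remove" are still there, and that at least one row uses ruleset 0. The function check_ruleset_table() and the example ranges are hypothetical. Whether dataquieR needs further columns (such as Description) is not stated here, so the list of required columns is an assumption.

```r
# Columns taken from the column definitions above (assumed to be the minimum)
required_columns <- c("GRADING_RULESET", "indicator_metric",
                      paste0("dqi_cat_", 1:5))
# Rows marked "Technical use only, please do not remove" in the default table
technical_rows <- c("CAT_applicability", "CAT_error", "CAT_anamat",
                    "CAT_indicator_or_descriptor")

# Hypothetical checker: returns a character vector of problems (empty if none)
check_ruleset_table <- function(tab) {
  problems <- character(0)
  missing_columns <- setdiff(required_columns, colnames(tab))
  if (length(missing_columns) > 0) {
    problems <- c(problems, paste("missing column:", missing_columns))
  }
  if ("indicator_metric" %in% colnames(tab)) {
    missing_rows <- setdiff(technical_rows, tab$indicator_metric)
    if (length(missing_rows) > 0) {
      problems <- c(problems, paste("missing technical row:", missing_rows))
    }
  }
  if ("GRADING_RULESET" %in% colnames(tab) &&
      !any(tab$GRADING_RULESET == 0, na.rm = TRUE)) {
    problems <- c(problems, "no row with GRADING_RULESET equal to 0")
  }
  problems
}

# Toy customized table; the ranges for PCT_com_crm_mv are made up
my_rules <- data.frame(
  GRADING_RULESET = 0,
  indicator_metric = c("PCT_com_crm_mv", technical_rows),
  dqi_cat_1 = c("[0; 5)", rep("[1; 1]", 4)),
  dqi_cat_2 = c("[5; 100]", rep("[2; 2]", 4)),
  dqi_cat_3 = c(NA, rep("[3; 3]", 4)),
  dqi_cat_4 = c(NA, rep("[4; 4]", 4)),
  dqi_cat_5 = c(NA, rep("[5; 5]", 4)),
  stringsAsFactors = FALSE
)

check_ruleset_table(my_rules)  # character(0), i.e., no problems found
```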
Here is a complete example:
First, create a report. We use the SHIP-based example data here.
r1 <- dq_report2(study_data = "ship",
                 meta_data_v2 = "ship_meta_v2",
                 dimensions = NULL)
Then you modify the grading ruleset table that can be downloaded from the website with this link https://dataquality.qihs.uni-greifswald.de/extdata/grading_rulesets.xlsx
For example, you can modify it so that there are no ranges in
dqi_cat_1, which corresponds to the category “Ok”.
Then save the new table as my_grading_rulesets.xlsx.
Now you have to add the new grading rulesets to the dataquieR data frame storage area (see tutorial for more details), before rendering the report.
prep_add_data_frames(grading_rulesets = "~/my_grading_rulesets.xlsx")
Now you can render the report.
r1 # if you want to see it in the Viewer
print(r1, "~/Desktop/report_example") # if you want to save the report in a folder on the Desktop
It is possible to assign different rulesets to specific variables. To do that, you have to:
add new rows with a different GRADING_RULESET number to the table grading_rulesets (e.g., a new set with GRADING_RULESET = 1; this can be a completely new table with different values or only a subset of the indicator metrics);
add a new column GRADING_RULESET to the item-level metadata, assigning to each variable the number of the ruleset you want to use for that specific variable.
If no customized ruleset table is provided, the ruleset shipped with dataquieR will be used.
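The relationship between the two tables can be sketched in base R as a simple lookup: the item-level metadata names a ruleset per variable, and the ruleset table supplies the ranges for that ruleset and metric. The sketch makes two assumptions that are not confirmed above: variables without a ruleset number use ruleset 0, and a metric missing from a partial ruleset falls back to ruleset 0. The function find_rule(), the variable other_variable, and the ranges for ruleset 1 are hypothetical.

```r
# Toy ruleset table: ruleset 1 only redefines ICC_acc_ud_loc (made-up range)
rulesets <- data.frame(
  GRADING_RULESET = c(0, 0, 1),
  indicator_metric = c("ICC_acc_ud_loc", "PCT_com_crm_mv", "ICC_acc_ud_loc"),
  dqi_cat_1 = c("[0; 0.02)", "[0; 1)", "[0; 0.001)"),
  stringsAsFactors = FALSE
)

# Toy item-level metadata: only sbp1 is assigned to ruleset 1
item_level <- data.frame(
  VAR_NAMES = c("sbp1", "other_variable"),
  GRADING_RULESET = c(1, NA),
  stringsAsFactors = FALSE
)

# Hypothetical lookup of the rule that applies to one variable and metric
find_rule <- function(variable, metric) {
  ruleset <- item_level$GRADING_RULESET[item_level$VAR_NAMES == variable]
  # Assumption: no entry or NA means the default ruleset 0
  if (length(ruleset) != 1 || is.na(ruleset)) ruleset <- 0
  hit <- rulesets[rulesets$GRADING_RULESET == ruleset &
                    rulesets$indicator_metric == metric, ]
  if (nrow(hit) == 0) {
    # Assumption: metrics not covered by a partial ruleset use ruleset 0
    hit <- rulesets[rulesets$GRADING_RULESET == 0 &
                      rulesets$indicator_metric == metric, ]
  }
  hit
}

find_rule("sbp1", "ICC_acc_ud_loc")$dqi_cat_1            # "[0; 0.001)"
find_rule("sbp1", "PCT_com_crm_mv")$dqi_cat_1            # "[0; 1)"
find_rule("other_variable", "ICC_acc_ud_loc")$dqi_cat_1  # "[0; 0.02)"
```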
The following example uses the SHIP-based example data, with the grading ruleset modified for only a few variables:
First you need to create a new grading rule for the indicator metric
ICC_acc_ud_loc in the table “my_grading_rulesets.xlsx” and save the
table as "my_new_grading_rule_example2.xlsx".
You can see the new rule for this indicator metric below in yellow
(the default version is the one in green).
You have to apply this new grading rule (in yellow in the image) only
to the variables sbp1, sbp2, dbp1, and dbp2. To do so, you need to
modify the item_level metadata sheet of the original metadata file
for the SHIP-based example data. For example, you can add a new column
GRADING_RULESET containing 1 for the variables that should use the new
grading rule.
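If the item-level metadata is available as a data frame in R, the new column can also be added programmatically instead of editing the spreadsheet by hand. This is a minimal sketch on a toy table. It assumes that the variable identifiers are stored in a column named VAR_NAMES and that all other variables are explicitly assigned to ruleset 0. The variable other_variable is made up.

```r
# Toy stand-in for the item-level metadata sheet
item_level <- data.frame(
  VAR_NAMES = c("sbp1", "sbp2", "dbp1", "dbp2", "other_variable"),
  stringsAsFactors = FALSE
)

# Variables that should be graded with ruleset 1
new_rule_vars <- c("sbp1", "sbp2", "dbp1", "dbp2")

# 1 for the selected variables, 0 (default ruleset) for all others
item_level$GRADING_RULESET <-
  ifelse(item_level$VAR_NAMES %in% new_rule_vars, 1, 0)

item_level
```

The modified table would then have to be written back into the metadata workbook, which is not shown here.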
After saving the new metadata file as "ship_meta_v2_new.xlsx" on the
Desktop, you can create the report using the new version of the
metadata file (because the item_level sheet is now different from
before).
r2 <- dq_report2(study_data = "ship",
                 meta_data_v2 = "~/Desktop/ship_meta_v2_new.xlsx",
                 dimensions = NULL)
Now add the new grading rulesets to the dataquieR data frame storage area before rendering the report.
prep_add_data_frames(grading_rulesets = "~/my_new_grading_rule_example2.xlsx")
Finally you can render the report and obtain the following result for “Variance proportion device”.
r2 # if you want to see it in the Viewer
print(r2, "~/Desktop/report_example_n2") # if you want to save the report in a folder on the Desktop
There you can see that systolic and diastolic blood pressures have
the categories Important (two variables with ICC < 0.005) and Critical
(two variables with ICC > 0.005) based on the new GRADING_RULESET (1),
whereas body height is Unclear based on the default GRADING_RULESET (0).
A grading format is a table containing the RGB color codes (color)
and the text (label) used to represent the five data quality
categories in the report.
Here is the default grading format table used by dataquieR:
category | label | color |
---|---|---|
1 | Ok | 33 102 172 |
2 | Unclear | 67 147 195 |
3 | Moderate | 227 186 20 |
4 | Important | 214 96 77 |
5 | Critical | 178 23 43 |
The desired color can be specified in different ways:
three decimal numbers separated by spaces, representing the RGB components of the color (e.g., “227 186 20”);
hexadecimal color codes (e.g., “#rrggbb”); or
color names (e.g., “red”).
Please look here for more information.
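To compare colors given in the three notations, they can all be converted to one form. The sketch below uses the base R functions grDevices::rgb() and grDevices::col2rgb() to turn any of the three notations into a hexadecimal code. The helper to_hex() is hypothetical and only illustrates the notations; it is not how dataquieR itself reads the color column.

```r
# Hypothetical helper: convert a color specification to "#RRGGBB"
to_hex <- function(spec) {
  spec <- trimws(spec)
  if (grepl("^[0-9]+( +[0-9]+){2}$", spec)) {
    # Three decimal RGB components separated by spaces
    parts <- as.numeric(strsplit(spec, " +")[[1]])
    return(grDevices::rgb(parts[1], parts[2], parts[3], maxColorValue = 255))
  }
  # Hexadecimal codes and color names are both understood by col2rgb()
  channels <- grDevices::col2rgb(spec)
  grDevices::rgb(channels[1, 1], channels[2, 1], channels[3, 1],
                 maxColorValue = 255)
}

to_hex("227 186 20")  # "#E3BA14" (default color of category 3)
to_hex("33 102 172")  # "#2166AC" (default color of category 1)
to_hex("red")         # "#FF0000"
```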
You can modify colors and text by providing a new table
grading_formats with prep_add_data_frames() before rendering the
report. For example:
prep_add_data_frames(grading_formats = "~/Documents/my_grading_formats.xlsx")
Data quality indicator metric acronym | Data quality indicator metric | Dimension - Domain | Indicator | Function | Description |
---|---|---|---|---|---|
PCT_int_vfe_type | Percentage | Integrity - Value format error | Data type mismatch | int_datatype_matrix | Percentage of observational units with expected and observed data type not matching |
PCT_int_uenc | Percentage | Integrity - Value format error | Inadmissible data format | int_encoding_errors | Percentage of observational units with text containing invalid characters with respect to the expected encoding |
PCT_com_crm_mv | Percentage | Completeness - Crude missingness | Crude missingness | com_item_missingness | Percentage of observational units with missing values (NAs, jumps, and missing codes) |
PCT_com_qum_nonresp | Percentage | Completeness - Qualified missingness | Item non-response rate | com_qualified_item_missingness, com_qualified_segment_missingness | Percentage of non-response rate (see the explanation on qualified missingness labels from AAPOR) |
PCT_com_qum_refusal | Percentage | Completeness - Qualified missingness | Refusal rate | com_qualified_item_missingness, com_qualified_segment_missingness | Percentage of refusal rate (see the explanation on qualified missingness labels from AAPOR) |
PCT_con_rvv_icat | Percentage | Consistency - Range and value violations | Inadmissible categorical values | con_inadmissible_categorical | Percentage of observational units with observed categories not matching any possible expected one |
PCT_con_rvv_inum | Percentage | Consistency - Range and value violations | Inadmissible numerical values | con_limit_deviations | Percentage of observational units with a numerical value outside the expected interval (admissible values) provided in the metadata |
PCT_con_rvv_unum | Percentage | Consistency - Range and value violations | Uncertain numerical values | con_limit_deviations | Percentage of observational units with a numerical value outside the expected interval (plausible values) provided in the metadata |
PCT_con_rvv_itdat | Percentage | Consistency - Range and value violations | Inadmissible datetime values | con_limit_deviations | Percentage of observational units with a date-time value outside the expected interval (admissible values) provided in the metadata |
PCT_con_rvv_utdat | Percentage | Consistency - Range and value violations | Uncertain datetime values | con_limit_deviations | Percentage of observational units with a date-time value outside the expected interval (plausible values) provided in the metadata |
PCT_acc_ud_outlu | Percentage | Accuracy - Unexpected distributions | Univariate outliers | acc_univariate_outlier | Percentage of observational units considered an outlier with respect to one variable |
FLG_acc_ud_shape | Flag | Accuracy - Unexpected distributions | Unexpected shape, unexpected scale | acc_shape_or_scale | TRUE/FALSE value indicating if the single intervals in the study data deviate significantly from a theoretical expected distribution |
PCT_acc_ud_loc | Percentage | Accuracy - Unexpected distributions | Unexpected location | acc_margins | Percentage of groups (e.g., examiners) with a marginal mean outside the possible range identified by the threshold value |
FLG_acc_ud_prop | Flag | Accuracy - Unexpected distributions | Unexpected proportion | acc_distributions | TRUE/FALSE value indicating if the frequencies of the categories of a variable are inside an expected range of possible values provided in the metadata |
FLG_acc_ud_loc | Flag | Accuracy - Unexpected distributions | Unexpected location | acc_distributions | TRUE/FALSE value indicating if the mean/median of a variable is inside an expected range provided in the metadata |
ICC_acc_ud_loc | Variance proportion | Accuracy - Unexpected distributions | Unexpected location | acc_varcomp | ICC values computed using a mixed effects model |
NUM_con_con_contc | Number | Consistency - Contradictions | Logical contradictions | con_contradictions_redcap | Number of observational units that are in conflict with the provided rule (implies that the combination of values in the rule is impossible) |
PCT_con_con_contu | Percentage | Consistency - Contradictions | Empirical contradictions | con_contradictions_redcap | Percentage of observational units that are in conflict with the provided rule (implies that the combination of values in the rule is not impossible but unlikely) |
A user can modify the options to specify the name of the table containing the grading rules, if that is different from “grading_rulesets”.
For example, let’s say you already have a sheet called mygrading in
your metadata “mymeta.xlsx” (as in the image below), containing the
grading rules for the report.
First you need to create a report with that metadata.
r1 <- dq_report2(study_data = "study_data",
                 meta_data_v2 = "~/Desktop/mymeta.xlsx",
                 dimensions = NULL)
If you continue immediately after creating the report, you do not
need to re-add the metadata to the dataquieR data frame storage area.
Otherwise, you need to reload the metadata with:
prep_load_workbook_like_file("~/Desktop/mymeta.xlsx")
Then you need to modify the options to specify the name of the table you want to use as grading rules, in this example “mygrading”.
options("dataquieR.grading_rulesets" = "mygrading")
Now you can render the report.
r1 # it will be visible in the Viewer in RStudio
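To confirm which table name is currently configured, the option can be read back with base R's getOption(). In this sketch the fallback value "grading_rulesets" is supplied by the snippet itself, based on the default name mentioned above. How dataquieR initializes the option internally is not shown here.

```r
# Point dataquieR to the sheet that holds the grading rules
options("dataquieR.grading_rulesets" = "mygrading")

# Read the option back; the second argument is only a fallback used by this
# snippet when the option is not set at all
table_name <- getOption("dataquieR.grading_rulesets", "grading_rulesets")
table_name  # "mygrading"

# Unsetting the option makes the fallback of this snippet apply again
options("dataquieR.grading_rulesets" = NULL)
fallback_name <- getOption("dataquieR.grading_rulesets", "grading_rulesets")
fallback_name  # "grading_rulesets"
```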