The data source for this data quality assessment is an ongoing seroprevalence study of the Study of Health in Pomerania (SHIP). For confidentiality reasons, the raw data are not shown here.
The metadata have been published in an OPAL repository for COVID-19 studies. Because these data follow a standardized annotation, only a few steps were required to transform them into a format conforming to dataquieR.
The R code for this transformation is available upon request. The following metadata adhere to the conventions of the dataquieR R package.
In the SHIP-C19 study data, missing codes are used to qualify missing data. These codes were not specified in the publicly accessible data dictionary (DD). However, upon request, the use of the following codes was reported:
These codes were manually added to the DD.
ship_meta <- openxlsx::read.xlsx(xlsxFile = "C:/Users/richtera/Documents/git_projects/dfg_website/_data/SHIP-C19/shipc19_dd_dataquieR_m.xlsx",
sheet = 1)
The respective code labels are saved in a data frame.
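The code that builds this data frame is not shown. As a minimal sketch of what such a missing-code table could look like: the object name shipc19_mc is taken from the com_item_missingness() call further below, the column names CODE_VALUE and CODE_LABEL are assumed to follow the layout dataquieR expects for cause_label_df, and the code values and labels are purely hypothetical placeholders, not the actual SHIP-C19 codes.

```r
# Hypothetical missing-code table; the real SHIP-C19 codes are not reproduced here.
shipc19_mc <- data.frame(
  CODE_VALUE = c(99801, 99802, 99803),            # placeholder code values
  CODE_LABEL = c("Placeholder: refusal",
                 "Placeholder: not applicable",
                 "Placeholder: unknown"),         # placeholder labels
  stringsAsFactors = FALSE
)

# Each code value should be unique so that labels can be assigned unambiguously.
stopifnot(!anyDuplicated(shipc19_mc$CODE_VALUE))
```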
None of the data element names in the study data match the VAR_NAMES in the metadata:
length(setdiff(names(ship), as.character(ship_meta$VAR_NAMES)))
## [1] 33
Fix: because the mismatch is caused by a simple prefix added to the variable names, it can be resolved by replacing that prefix:
ship_names <- names(ship)
ship_names <- gsub("^saq_covid_loop_", "q", perl = TRUE, ship_names)
names(ship) <- ship_names
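To make the effect of the prefix replacement visible without the confidential data, here is a small sketch on toy names. The vector toy_names is hypothetical and only mimics the naming pattern; names that do not start with the prefix are left unchanged.

```r
# Toy variable names mimicking the prefixed export (hypothetical examples).
toy_names <- c("id", "saq_covid_loop_01", "saq_covid_loop_02a", "intro_beg")

# Replace the leading prefix with "q"; names without the prefix stay unchanged.
fixed_names <- sub("^saq_covid_loop_", "q", toy_names)

print(fixed_names)
# [1] "id"        "q01"       "q02a"      "intro_beg"
```

Because the pattern is anchored with `^`, only a prefix at the start of a name is replaced, which explains why "id" and "intro_beg" still need separate handling.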
A check on whether the study data are fully documented in the metadata shows that two variables are still not found in the metadata.
setdiff(names(ship), as.character(ship_meta$VAR_NAMES))
## [1] "id" "intro_beg"
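The check above only looks in one direction (study data not covered by metadata). A minimal sketch of a two-way comparison, using a hypothetical helper compare_names() and toy name vectors rather than the real SHIP-C19 objects, could look like this:

```r
# Hypothetical helper: compare study-data names and metadata names in both directions.
compare_names <- function(data_names, meta_names) {
  list(
    only_in_data = setdiff(data_names, meta_names),  # undocumented variables
    only_in_meta = setdiff(meta_names, data_names)   # documented but not in the data
  )
}

# Toy inputs (not the real SHIP-C19 names).
res <- compare_names(
  data_names = c("id", "intro_beg", "q01", "q02"),
  meta_names = c("q01", "q02", "q12")
)
print(res)
```

Checking both directions early surfaces metadata entries without study data as well, which otherwise only appear later as a warning.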
This has been fixed using the dataquieR::prep_add_to_meta() function, which adds the characteristics of these variables to the metadata.
ship_meta_m <- dataquieR::prep_add_to_meta(VAR_NAMES = "id",
DATA_TYPE = "integer",
LABEL = NA,
VALUE_LABELS = NA,
KEY_STUDY_SEGMENT = "Intro",
meta_data = ship_meta)
ship_meta_m <- dataquieR::prep_add_to_meta(VAR_NAMES = "intro_beg",
DATA_TYPE = "string",
LABEL = NA,
VALUE_LABELS = NA,
KEY_STUDY_SEGMENT = "Intro",
meta_data = ship_meta_m)
shipc19_app <- dataquieR::pro_applicability_matrix(study_data = ship,
meta_data = ship_meta_m)
## Warning: In dataquieR::pro_applicability_matrix: Lost 2.9% of the meta data because of missing/not assignable study-data
## > dataquieR::pro_applicability_matrix(study_data = ship, meta_data = ship_meta_m)
## Found meta data for the following variables not found in the study data: "q12"
One variable (q12) has been specified in the metadata but was not found in the study data. It is possible that the respective variable has not been exported or that another error caused this issue. For further analyses the respective metadata are excluded.
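The exclusion itself is not shown as code. One possible way to do it explicitly, sketched here with toy stand-ins (toy_meta and toy_data are hypothetical, not the real ship_meta_m and ship), is to keep only the metadata rows whose VAR_NAMES occur in the study data:

```r
# Toy metadata and study data (hypothetical stand-ins for ship_meta_m and ship).
toy_meta <- data.frame(
  VAR_NAMES = c("id", "q01", "q12"),
  DATA_TYPE = c("integer", "integer", "integer"),
  stringsAsFactors = FALSE
)
toy_data <- data.frame(id = 1:3, q01 = c(2L, 3L, NA))

# Keep only metadata rows whose variable exists in the study data.
toy_meta_used <- toy_meta[toy_meta$VAR_NAMES %in% names(toy_data), , drop = FALSE]

print(toy_meta_used$VAR_NAMES)
# [1] "id"  "q01"
```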
shipc19_app$ApplicabilityPlot
UM <- dataquieR::com_unit_missingness(study_data = ship,
meta_data = ship_meta_m,
id_vars = "id")
Unit missingness is found in 34 of the 50 records, which corresponds to 68%.
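As an illustration of the idea behind this number, the following sketch counts records in which every variable except the ID is missing. It uses toy data and treats only NA as missing; how dataquieR handles missing codes internally is not covered here, so this is an assumption-laden simplification rather than the package's implementation.

```r
# Toy study data: "id" is always present, the remaining columns may be missing.
toy <- data.frame(
  id  = 1:5,
  q01 = c(1, NA, NA, 2, NA),
  q02 = c(0, NA, NA, NA, NA)
)

# A record counts as unit missing if all non-id variables are NA.
item_cols    <- setdiff(names(toy), "id")
unit_missing <- rowSums(!is.na(toy[, item_cols, drop = FALSE])) == 0

n_missing   <- sum(unit_missing)               # 3 records (rows 2, 3 and 5)
pct_missing <- 100 * n_missing / nrow(toy)     # 60 percent of the toy records

# The same arithmetic for the reported result: 34 of 50 records.
reported_pct <- 100 * 34 / 50                  # 68
```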
SM <- dataquieR::com_segment_missingness(study_data = ship,
meta_data = ship_meta_m,
threshold_value = 80,
direction = "high")
## Warning: In dataquieR::com_segment_missingness: Specified VARIABLE_ROLE(s) were not found in metadata. All variables are included here.
## > dataquieR::com_segment_missingness(study_data = ship, meta_data = ship_meta_m,
## threshold_value = 80, direction = "high")
SM$SummaryPlot
## $SummaryPlot
IM <- dataquieR::com_item_missingness(study_data = ship,
meta_data = ship_meta_m,
cause_label_df = shipc19_mc,
show_causes = TRUE,
threshold_value = 32,
include_sysmiss = TRUE)
## Warning: In dataquieR::com_item_missingness: Setting suppressWarnings to its default FALSE
## > dataquieR::com_item_missingness(study_data = ship, meta_data = ship_meta_m,
## cause_label_df = shipc19_mc, show_causes = TRUE, threshold_value = 32,
IM$SummaryPlot
kable(IM$SummaryTable, "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Variables | Observations N | Sysmiss N (%) | Datavalues N (%) | Missing codes N (%) | Jumps N (%) | Measurements N (%) | GRADING |
---|---|---|---|---|---|---|---|
q01 | 50 | 3 (6) | 47 (94) | 31 (62) | 0 (0) | 16 (32) | 0 |
q02 | 50 | 3 (6) | 47 (94) | 31 (62) | 0 (0) | 16 (32) | 0 |
q02a | 50 | 3 (6) | 47 (94) | 31 (62) | 16 (32) | 0 (0) | 1 |
q02b | 50 | 3 (6) | 47 (94) | 31 (62) | 16 (32) | 0 (0) | 1 |
q02c | 50 | 3 (6) | 47 (94) | 31 (62) | 16 (32) | 0 (0) | 1 |
q02d | 50 | 3 (6) | 47 (94) | 31 (62) | 16 (32) | 0 (0) | 1 |
q02e | 50 | 3 (6) | 47 (94) | 31 (62) | 16 (32) | 0 (0) | 1 |
q02f | 50 | 3 (6) | 47 (94) | 31 (62) | 16 (32) | 0 (0) | 1 |
q02g | 50 | 3 (6) | 47 (94) | 31 (62) | 16 (32) | 0 (0) | 1 |
q02h | 50 | 3 (6) | 47 (94) | 31 (62) | 16 (32) | 0 (0) | 1 |
q02i | 50 | 3 (6) | 47 (94) | 31 (62) | 16 (32) | 0 (0) | 1 |
q02j | 50 | 3 (6) | 47 (94) | 31 (62) | 16 (32) | 0 (0) | 1 |
q02k | 50 | 3 (6) | 47 (94) | 31 (62) | 16 (32) | 0 (0) | 1 |
q02l | 50 | 3 (6) | 47 (94) | 31 (62) | 16 (32) | 0 (0) | 1 |
q02m | 50 | 3 (6) | 47 (94) | 31 (62) | 16 (32) | 0 (0) | 1 |
q03 | 50 | 3 (6) | 47 (94) | 31 (62) | 0 (0) | 16 (32) | 0 |
q03a | 50 | 3 (6) | 47 (94) | 31 (62) | 10 (20) | 6 (15) | 1 |
q03b | 50 | 34 (68) | 16 (32) | 0 (0) | 16 (32) | 0 (0) | 1 |
q04 | 50 | 3 (6) | 47 (94) | 32 (64) | 0 (0) | 15 (30) | 1 |
q04a | 50 | 34 (68) | 16 (32) | 1 (2) | 15 (30) | 0 (0) | 1 |
q05 | 50 | 3 (6) | 47 (94) | 31 (62) | 0 (0) | 16 (32) | 0 |
q06 | 50 | 3 (6) | 47 (94) | 31 (62) | 0 (0) | 16 (32) | 0 |
q07 | 50 | 3 (6) | 47 (94) | 31 (62) | 0 (0) | 16 (32) | 0 |
q07a | 50 | 3 (6) | 47 (94) | 31 (62) | 13 (26) | 3 (8.11) | 1 |
q07b | 50 | 3 (6) | 47 (94) | 34 (68) | 13 (26) | 0 (0) | 1 |
q08a | 50 | 3 (6) | 47 (94) | 31 (62) | 0 (0) | 16 (32) | 0 |
q08b | 50 | 3 (6) | 47 (94) | 31 (62) | 0 (0) | 16 (32) | 0 |
q08c | 50 | 3 (6) | 47 (94) | 31 (62) | 0 (0) | 16 (32) | 0 |
q09 | 50 | 3 (6) | 47 (94) | 31 (62) | 0 (0) | 16 (32) | 0 |
q10 | 50 | 3 (6) | 47 (94) | 31 (62) | 0 (0) | 16 (32) | 0 |
q11 | 50 | 3 (6) | 47 (94) | 31 (62) | 0 (0) | 16 (32) | 0 |
id | 50 | 0 (0) | 50 (100) | 0 (0) | 0 (0) | 50 (100) | 0 |
intro_beg | 50 | 34 (68) | 16 (32) | 0 (0) | 0 (0) | 16 (32) | 0 |
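The columns of this table are linked by simple arithmetic, which can be checked with a short sketch. The counts below are copied from the rows q01, q03a and q07a. The rule for the measurement percentage (measurements divided by observations minus jumps) is inferred from the printed values, not taken from the dataquieR documentation, and should be read as an assumption.

```r
# Counts copied from the rows q01, q03a and q07a of the summary table above.
tab <- data.frame(
  var     = c("q01", "q03a", "q07a"),
  obs     = c(50, 50, 50),
  sysmiss = c(3, 3, 3),
  codes   = c(31, 31, 31),
  jumps   = c(0, 10, 13)
)

# Data values are observations that are not system missing.
tab$datavalues <- tab$obs - tab$sysmiss
# Measurements are data values that are neither missing codes nor jumps.
tab$measurements <- tab$datavalues - tab$codes - tab$jumps
# Assumption inferred from the table: the percentage refers to records in
# which a measurement was expected (observations minus jumps).
tab$measurements_pct <- round(100 * tab$measurements / (tab$obs - tab$jumps), 2)

print(tab[, c("var", "datavalues", "measurements", "measurements_pct")])
```

Under this reading, q03a yields 6 measurements out of 40 expected (15%) and q07a yields 3 out of 37 (8.11%), matching the table.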
IAC <- dataquieR::con_inadmissible_categorical(study_data = ship,
meta_data = ship_meta_m)
## Warning: In dataquieR::con_inadmissible_categorical: All variables with VALUE_LABELS in the metadata are used.
## > dataquieR::con_inadmissible_categorical(study_data = ship, meta_data = ship_meta_m)
kable(IAC$SummaryTable, "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Variables | OBSERVED_CATEGORIES | DEFINED_CATEGORIES | NON_MATCHING | NON_MATCHING_N | GRADING |
---|---|---|---|---|---|
q01 | 2, 3 | 1, 2, 3, 4, 5 | | 0 | 0 |
q02 | 0 | 0, 1 | | 0 | 0 |
q02a | | 0, 1 | | 0 | 0 |
q02b | | 0, 1 | | 0 | 0 |
q02c | | 0, 1 | | 0 | 0 |
q02d | | 0, 1 | | 0 | 0 |
q02e | | 0, 1 | | 0 | 0 |
q02f | | 0, 1 | | 0 | 0 |
q02g | | 0, 1 | | 0 | 0 |
q02h | | 0, 1 | | 0 | 0 |
q02i | | 0, 1 | | 0 | 0 |
q02j | | 0, 1 | | 0 | 0 |
q02k | | 0, 1 | | 0 | 0 |
q02l | | 0, 1 | | 0 | 0 |
q02m | | 0, 1 | | 0 | 0 |
q03 | 1, 0 | 0, 1 | | 0 | 0 |
q03a | 0 | 0, 1 | | 0 | 0 |
q04 | 0 | 0, 1 | | 0 | 0 |
q05 | 0, 1 | 0, 1, 2 | | 0 | 0 |
q06 | 0, 1 | 0, 1, 2 | | 0 | 0 |
q07 | 0, 1 | 0, 1 | | 0 | 0 |
q07a | 1 | 0, 1 | | 0 | 0 |
q07b | | 0, 1 | | 0 | 0 |
q08a | 2, 3, 1, 4 | 1, 2, 3, 4, 5 | | 0 | 0 |
q08b | 2, 1 | 1, 2, 3, 4, 5 | | 0 | 0 |
q08c | 1, 2 | 1, 2, 3, 4, 5 | | 0 | 0 |
q09 | 3, 2, 1, 4 | 1, 2, 3, 4, 5 | | 0 | 0 |
q10 | 3, 2, 4 | 1, 2, 3, 4, 5 | | 0 | 0 |
q11 | 3, 4, 2 | 1, 2, 3, 4, 5 | | 0 | 0 |
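The logic behind this table can be sketched in base R as a set difference between observed and defined categories. The helper find_inadmissible() is hypothetical and not part of dataquieR. It assumes value labels are given as a string such as "0 = no | 1 = yes" with "|" as separator, and the labels "no" and "yes" are placeholders.

```r
# Hypothetical helper: find observed values that are not among the defined categories.
find_inadmissible <- function(x, value_labels, split_char = "|") {
  # Split "0 = no | 1 = yes" into single assignments and keep the code before "=".
  parts    <- strsplit(value_labels, split_char, fixed = TRUE)[[1]]
  defined  <- trimws(sub("=.*$", "", parts))
  observed <- unique(as.character(x[!is.na(x)]))
  non_matching <- setdiff(observed, defined)
  list(
    observed       = observed,
    defined        = defined,
    non_matching   = non_matching,
    non_matching_n = sum(as.character(x) %in% non_matching)
  )
}

# All observed values are admissible.
res_ok  <- find_inadmissible(c(0, 1, 1, NA), "0 = no | 1 = yes")
# The value 7 is not defined and occurs twice.
res_bad <- find_inadmissible(c(0, 1, 7, 7, NA), "0 = no | 1 = yes")
```

A variable that contains only NA values has no observed categories, which is why such rows show an empty OBSERVED_CATEGORIES cell.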
DIS <- dataquieR::acc_distributions(study_data = ship,
meta_data = ship_meta_m)
## Warning: In dataquieR::acc_distributions: All variables defined to be integer or float in the metadata are used
## > dataquieR::acc_distributions(study_data = ship, meta_data = ship_meta_m)
## Warning: In dataquieR::acc_distributions: Variables q02a, q02b, q02c, q02d, q02e, q02f, q02g, q02h, q02i, q02j, q02k, q02l, q02m, q07b contain NAs only and will be removed from analyses.
## > dataquieR::acc_distributions(study_data = ship, meta_data = ship_meta_m)
## Warning: In dataquieR::acc_distributions: Variables q02, q03a, q04, q07a contain only one value and will be removed from analyses.
## > dataquieR::acc_distributions(study_data = ship, meta_data = ship_meta_m)
Distributional plots for all data including true measurements:
This example of a data quality report for a COVID-19 use case provides limited information for two reasons:
For example, no short LABELS are defined; only long labels containing the full item questions are available:
Defining shorter labels, as has been done in the following example, considerably increases the interpretability of such reports:
shipc19_app <- dataquieR::pro_applicability_matrix(study_data = ship,
meta_data = ship_meta_m2,
label_col = LABEL)
## Warning: In dataquieR::pro_applicability_matrix: Lost 6.1% of the study data because of missing/not assignable meta-data
## > dataquieR::pro_applicability_matrix(study_data = ship, meta_data = ship_meta_m2,
## label_col = LABEL)
## Did not find any meta data for the following variables from the study data: "id", "intro_beg"
## Warning: In dataquieR::pro_applicability_matrix: Lost 3.1% of the meta data because of missing/not assignable study-data
## > dataquieR::pro_applicability_matrix(study_data = ship, meta_data = ship_meta_m2,
## label_col = LABEL)
## Found meta data for the following variables not found in the study data: "q12"
shipc19_app$ApplicabilityPlot
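How the metadata object ship_meta_m2 was built is not shown. A minimal sketch of adding short labels to a metadata table follows; the toy metadata, the item texts and the short labels are all hypothetical, and only the column names VAR_NAMES, LONG_LABEL and LABEL are assumed to follow dataquieR conventions.

```r
# Toy metadata with long item texts in LONG_LABEL (hypothetical content).
toy_meta <- data.frame(
  VAR_NAMES  = c("q01", "q02"),
  LONG_LABEL = c("How would you rate your general health at present?",
                 "Have you had any of the following symptoms recently?"),
  stringsAsFactors = FALSE
)

# Hypothetical short labels, keyed by variable name.
short_labels <- c(q01 = "General health", q02 = "Any symptoms")

# Add a LABEL column by looking up each variable name in the named vector.
toy_meta$LABEL <- unname(short_labels[toy_meta$VAR_NAMES])

# Short labels should be present and unique for every variable.
stopifnot(!anyNA(toy_meta$LABEL), !anyDuplicated(toy_meta$LABEL))
```

Looking up the labels by name rather than by position guards against a mismatch if the metadata rows are reordered.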