dataquieR
This tutorial introduces the creation of data quality reports in R with dataquieR.
Creating reports requires the appropriate setup of study data and metadata, as shown in the figure below:
We can load the synthetic example data from dataquieR via the following:
sd1 <- prep_get_data_frame("study_data")
This example study data has 3000 observations and 53 variables:
sd1
v00000 | v00001 | v00002 | v00003 | v00004 | v00005 | v01003 | v01002 | v00103 | v00006 |
---|---|---|---|---|---|---|---|---|---|
3 | LEIIX715 | 0 | 49 | 127 | 77 | 49 | 0 | 40-49 | 3.8 |
1 | QHNKM456 | 0 | 47 | 114 | 76 | 47 | 0 | 40-49 | 1.9 |
1 | HTAOB589 | 0 | 50 | 114 | 71 | 50 | 0 | 50-59 | 0.8 |
5 | HNHFV585 | 0 | 48 | 120 | 65 | 48 | 0 | 40-49 | 3.8 |
1 | UTDLS949 | 0 | 56 | 119 | 78 | 56 | 0 | 50-59 | 4.1 |
5 | YQFGE692 | 1 | 47 | 133 | 81 | 47 | 1 | 40-49 | 9.5 |
1 | AVAEH932 | 0 | 53 | 114 | 78 | 53 | 0 | 50-59 | 5.0 |
3 | QDOPT378 | 1 | 48 | 116 | 86 | 48 | 1 | 40-49 | 9.6 |
3 | BMOAK786 | 0 | 44 | 115 | 71 | 44 | 0 | 40-49 | 2.0 |
5 | ZDKNF462 | 0 | 50 | 116 | 74 | 50 | 0 | 50-59 | 2.4 |
We can see that the study data variables have abstract names (e.g. v00001, v00002). Hence, the appropriate labels must be mapped from the metadata. Besides all variables' data types and labels, the metadata stores further expected characteristics and static information about the study data.
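To make the idea of label mapping concrete, the following is a minimal base-R sketch. The two small data frames are made-up stand-ins for the study data and the item-level metadata; only the column names VAR_NAMES and LABEL are taken from this tutorial. dataquieR handles this mapping internally, and the sketch is not its implementation.

```r
# Toy stand-ins for study data and item-level metadata (illustrative values only)
toy_sd <- data.frame(
  v00000 = c(3L, 1L, 5L),
  v00002 = c(0L, 0L, 1L),
  v00003 = c(49L, 47L, 48L)
)
toy_md <- data.frame(
  VAR_NAMES = c("v00000", "v00002", "v00003"),
  LABEL     = c("CENTER_0", "SEX_0", "AGE_0"),
  stringsAsFactors = FALSE
)

# Look up each column name in VAR_NAMES and take the matching LABEL;
# columns without a metadata entry keep their original name
idx <- match(names(toy_sd), toy_md$VAR_NAMES)
labelled_sd <- toy_sd
names(labelled_sd) <- ifelse(is.na(idx), names(toy_sd), toy_md$LABEL[idx])

print(names(labelled_sd))
```

Matching by name rather than by position keeps the mapping correct even if the metadata rows are ordered differently from the study data columns.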
We can read in the example metadata via the following:
prep_load_workbook_like_file("meta_data_v2")
md1 <- prep_get_data_frame("item_level")
cil <- prep_get_data_frame("cross-item_level")
sl <- prep_get_data_frame("segment_level")
VAR_NAMES | LABEL | DATA_TYPE | VALUE_LABELS | STANDARDIZED_VOCABULARY_TABLE | MISSING_LIST_TABLE | HARD_LIMITS | DETECTION_LIMITS |
---|---|---|---|---|---|---|---|
v00000 | CENTER_0 | integer | 1 = Berlin \| 2 = Hamburg \| 3 = Leipzig \| 4 = Cologne \| 5 = Munich | NA | NA | NA | NA |
v00001 | PSEUDO_ID | string | NA | NA | NA | NA | NA |
v00002 | SEX_0 | integer | 0 = females \| 1 = males | NA | NA | NA | NA |
v00003 | AGE_0 | integer | NA | NA | NA | [18;Inf) | NA |
v00103 | AGE_GROUP_0 | string | NA | NA | NA | NA | NA |
v01003 | AGE_1 | integer | NA | NA | NA | [18;Inf) | NA |
v01002 | SEX_1 | integer | 0 = females \| 1 = males | NA | NA | NA | NA |
v10000 | PART_STUDY | integer | 0 = no \| 1 = yes | NA | NA | NA | NA |
v00004 | SBP_0 | float | NA | NA | missing_table | [80;180] | [0;265] |
v00005 | DBP_0 | float | NA | NA | missing_table | [50;Inf) | [0;265] |
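The HARD_LIMITS column above stores admissible ranges as interval strings such as [80;180] or [18;Inf). As a rough illustration of how such an entry could be turned into a check, here is a minimal base-R sketch. The helpers parse_interval() and outside_limits() are hypothetical names introduced for this example and are not part of dataquieR; the blood pressure values are made up.

```r
# Parse an interval string such as "[80;180]" or "[18;Inf)" into its bounds.
# Hypothetical helper for illustration; not part of dataquieR.
parse_interval <- function(x) {
  x <- trimws(x)
  n <- nchar(x)
  bounds <- as.numeric(strsplit(substr(x, 2, n - 1), ";", fixed = TRUE)[[1]])
  list(
    lower = bounds[1],
    upper = bounds[2],
    lower_closed = substr(x, 1, 1) == "[",
    upper_closed = substr(x, n, n) == "]"
  )
}

# TRUE where a value lies outside the interval; missing values stay NA
outside_limits <- function(values, interval) {
  lim <- parse_interval(interval)
  below <- if (lim$lower_closed) values < lim$lower else values <= lim$lower
  above <- if (lim$upper_closed) values > lim$upper else values >= lim$upper
  below | above
}

sbp <- c(127, 114, 79, 181, NA)  # made-up systolic blood pressure values
flags <- outside_limits(sbp, "[80;180]")
print(flags)
```

Square brackets are treated as closed bounds and round brackets as open bounds, so a value of exactly 80 or 180 is not flagged for [80;180].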
VARIABLE_LIST | CHECK_LABEL | CONTRADICTION_TERM | CONTRADICTION_TYPE | MULTIVARIATE_OUTLIER_CHECKTYPE | N_RULES | ASSOCIATION_RANGE | ASSOCIATION_METRIC |
---|---|---|---|---|---|---|---|
NA | Age follow-up | [AGE_1] < [AGE_0] | LOGICAL | NA | NA | NA | NA |
NA | Sex follow-up | [SEX_1] <> [SEX_0] | LOGICAL | NA | NA | NA | NA |
NA | Education follow-up | [EDUCATION_1] < [EDUCATION_0] | LOGICAL | NA | NA | NA | NA |
NA | Nutrition inconsistency vegetarian | [EATING_PREFS_0] = "vegetarian" and ([MEAT_CONS_0] in set("1-2d a week", "3-4d a week", "5-6d a week", "daily")) | LOGICAL | NA | NA | NA | NA |
NA | Nutrition inconsistency vegan | [EATING_PREFS_0] = "vegan" and ([MEAT_CONS_0] in set("1-2d a week", "3-4d a week", "5-6d a week", "daily")) | LOGICAL | NA | NA | NA | NA |
NA | Nutrition inconsistency | [EATING_PREFS_0] = "none" and [MEAT_CONS_0] = "never" | EMPIRICAL | NA | NA | NA | NA |
NA | Non-smokers inconsistency | [SMOKING_0] = "no" and ([SMOKE_SHOP_0] in set("1-2d a week", "3-4d a week", "5-6d a week", "daily")) | EMPIRICAL | NA | NA | NA | NA |
NA | Smokers inconsistency | ([SMOKING_0] = "yes") and ([SMOKE_SHOP_0] = "never") | EMPIRICAL | NA | NA | NA | NA |
NA | Blood pressure false cuff | [ARM_CIRC_DISC_0] <> [ARM_CUFF_0] | LOGICAL | NA | NA | NA | NA |
NA | Pregnancy at high age | [PREGNANT_0] = "yes" and [AGE_0] > 55 | EMPIRICAL | NA | NA | NA | NA |
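Each CONTRADICTION_TERM above describes the contradiction itself, so an observation for which the term is true is a finding. The following base-R sketch evaluates the first two rules by hand on made-up data. It assumes that <> means "not equal"; dataquieR evaluates the rule strings itself, and this is not its implementation.

```r
# Made-up baseline (_0) and follow-up (_1) values, using the labels from the rules
toy <- data.frame(
  AGE_0 = c(49, 47, 50, 48),
  AGE_1 = c(49, 46, 50, NA),
  SEX_0 = c(0, 0, 1, 1),
  SEX_1 = c(0, 0, 0, 1)
)

# [AGE_1] < [AGE_0]: follow-up age below baseline age
age_contradiction <- with(toy, AGE_1 < AGE_0)

# [SEX_1] <> [SEX_0]: sex differs between baseline and follow-up
sex_contradiction <- with(toy, SEX_1 != SEX_0)

# Number of contradictions per rule; rows with missing values are not counted
n_found <- colSums(cbind(age_contradiction, sex_contradiction), na.rm = TRUE)
print(n_found)
```

A comparison involving a missing value yields NA rather than TRUE, which is why the counts are computed with na.rm = TRUE.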
 | STUDY_SEGMENT | SEGMENT_RECORD_COUNT | SEGMENT_ID_TABLE | SEGMENT_RECORD_CHECK | SEGMENT_ID_VARS | SEGMENT_UNIQUE_ROWS |
---|---|---|---|---|---|---|
1 | PART_STUDY | 3000 | expected_id | exact | v00001 | false |
2 | PART_PHYS_EXAM | 3000 | expected_id | exact | v00001 | false |
3 | PART_LAB | 3000 | expected_id | exact | v00001 | false |
4 | PART_INTERVIEW | 3000 | expected_id | exact | v00001 | false |
5 | PART_QUESTIONNAIRE | 3000 | expected_id | exact | v00001 | false |
NA | NA | NA | NA | NA | NA | NA |
NA.1 | NA | NA | NA | NA | NA | NA |
NA.2 | NA | NA | NA | NA | NA | NA |
NA.3 | NA | NA | NA | NA | NA | NA |
NA.4 | NA | NA | NA | NA | NA | NA |
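The segment-level metadata pairs each study segment with an expected record count and an ID variable. As a loose illustration of the kind of check this information allows, the base-R sketch below compares a record count against an expected value and looks for repeated IDs. This is only one plausible reading of the column names, the data are made up, and the sketch does not reproduce what dataquieR actually computes.

```r
# Made-up segment data with one repeated ID
toy_segment <- data.frame(
  v00001 = c("LEIIX715", "QHNKM456", "HTAOB589", "QHNKM456"),
  stringsAsFactors = FALSE
)
expected_count <- 3L  # plays the role of SEGMENT_RECORD_COUNT in this toy example

# Does the number of records match the expectation exactly?
count_ok <- nrow(toy_segment) == expected_count

# Which IDs occur more than once?
dup_ids <- unique(toy_segment$v00001[duplicated(toy_segment$v00001)])

print(count_ok)
print(dup_ids)
```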
For more information on the synthetic example data and metadata, see "Documentation of simulated data including errors and data quality issues" and "Definition and use of metadata".
We can create a default report using the dq_report2() function, which requires only the study data and the metadata loaded above with prep_load_workbook_like_file():
dq_report2(study_data = sd1) # the metadata are found automatically if prep_load_workbook_like_file() was run beforehand
The animation below shows a quick workflow for reporting data quality with dataquieR:
r <- dq_report2("ship", meta_data_v2 = "ship_meta_v2")
dir.create("report_v2/")
print(r, dir = "report_v2/")
This example uses data from the Study of Health in Pomerania (SHIP) project, which is also included in dataquieR. You can see the example report generated by dq_report2() here.
The full code shown in the animation to produce a report is given here:
# --------------------------------------------------------------------------------------------------
# D A T A   Q U A L I T Y   I N   E P I D E M I O L O G I C A L   R E S E A R C H
#
# == dataquieR
#
# dq_report2() eases the generation of data quality reports as it automatically calls dataquieR functions
#
#
# Installation/Further Information -----------------------------------------------------------------
#
# Please see our website:
# https://dataquality.qihs.uni-greifswald.de/
#
# Install dataquieR from CRAN with the install.packages() call below
#
# or
#
# (currently recommended) install the development version as described
# on https://dataquality.ship-med.uni-greifswald.de/DownloadR.html
install.packages("dataquieR")
# load the package
library(dataquieR)
# data ---------------------------------------------------------------------------------------------
# path to the Study of Health in Pomerania (SHIP) example data shipped with dataquieR
sd1 <- system.file("extdata", "ship.RDS", package = "dataquieR")
print(sd1)
# path to the matching metadata workbook
md1 <- system.file("extdata", "ship_meta_v2.xlsx", package = "dataquieR")
print(md1)
# dq_report2() - a crude approach -------------------------------------------------------------------
my_dq_report <- dq_report2(study_data = sd1,
                           meta_data_v2 = md1,
                           label_col = LABEL)
# view the results
print(my_dq_report)
The function dq_report2() and the print() method for such reports accept further arguments and settings. However, this sparse version is a good start for gaining insight into the data and may serve as the basis for tailoring more specific reports.