This tutorial introduces the creation of data quality reports in R with dataquieR.

Loading data and metadata

Creating reports requires the appropriate setup of study data and metadata, as shown in the figure below:

We can load the synthetic example data from dataquieR via the following:

sd1 <- prep_get_data_frame("study_data")

This example study data has 3000 observations and 53 variables. The output below shows an excerpt of 10 rows and 10 variables:

sd1
v00000  v00001    v00002  v00003  v00004  v00005  v01003  v01002  v00103  v00006
3       LEIIX715  0       49      127     77      49      0       40-49   3.8
1       QHNKM456  0       47      114     76      47      0       40-49   1.9
1       HTAOB589  0       50      114     71      50      0       50-59   0.8
5       HNHFV585  0       48      120     65      48      0       40-49   3.8
1       UTDLS949  0       56      119     78      56      0       50-59   4.1
5       YQFGE692  1       47      133     81      47      1       40-49   9.5
1       AVAEH932  0       53      114     78      53      0       50-59   5.0
3       QDOPT378  1       48      116     86      48      1       40-49   9.6
3       BMOAK786  0       44      115     71      44      0       40-49   2.0
5       ZDKNF462  0       50      116     74      50      0       50-59   2.4


The study data variables have abstract names (e.g. v00001, v00002), so the appropriate labels must be mapped from the metadata. In addition to the data type and label of each variable, the metadata stores further expected characteristics and static information about the study data, such as value labels, missing-value tables, and admissible value ranges (hard limits).
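
As a rough illustration of this mapping, the following base R sketch renames the columns of a small, hand-built data frame using a matching metadata table. The objects `study_data`, `meta_data` and `labelled` are toy objects created for this example (with names and labels taken from the example data), and the sketch is not how dataquieR performs the mapping internally.

```r
# Toy study data with abstract variable names (assumed excerpt, built by hand)
study_data <- data.frame(
  v00000 = c(3L, 1L, 1L),
  v00002 = c(0L, 0L, 0L),
  v00003 = c(49L, 47L, 50L)
)

# Toy item-level metadata: one row per study variable
meta_data <- data.frame(
  VAR_NAMES = c("v00000", "v00002", "v00003"),
  LABEL     = c("CENTER_0", "SEX_0", "AGE_0"),
  stringsAsFactors = FALSE
)

# match() finds the metadata row of each study data column;
# the corresponding LABEL then replaces the abstract name
labelled <- study_data
names(labelled) <- meta_data$LABEL[match(names(study_data), meta_data$VAR_NAMES)]

names(labelled)
# "CENTER_0" "SEX_0" "AGE_0"
```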

We can read in the example metadata via the following:

prep_load_workbook_like_file("meta_data_v2")
md1 <- prep_get_data_frame("item_level")
cil <- prep_get_data_frame("cross-item_level")
sl <- prep_get_data_frame("segment_level")

Item-level metadata (md1), excerpt:

VAR_NAMES  LABEL        DATA_TYPE  VALUE_LABELS                                                       STANDARDIZED_VOCABULARY_TABLE  MISSING_LIST_TABLE  HARD_LIMITS  DETECTION_LIMITS
v00000     CENTER_0     integer    1 = Berlin | 2 = Hamburg | 3 = Leipzig | 4 = Cologne | 5 = Munich  NA                             NA                  NA           NA
v00001     PSEUDO_ID    string     NA                                                                 NA                             NA                  NA           NA
v00002     SEX_0        integer    0 = females | 1 = males                                            NA                             NA                  NA           NA
v00003     AGE_0        integer    NA                                                                 NA                             NA                  [18;Inf)     NA
v00103     AGE_GROUP_0  string     NA                                                                 NA                             NA                  NA           NA
v01003     AGE_1        integer    NA                                                                 NA                             NA                  [18;Inf)     NA
v01002     SEX_1        integer    0 = females | 1 = males                                            NA                             NA                  NA           NA
v10000     PART_STUDY   integer    0 = no | 1 = yes                                                   NA                             NA                  NA           NA
v00004     SBP_0        float      NA                                                                 NA                             missing_table       [80;180]     [0;265]
v00005     DBP_0        float      NA                                                                 NA                             missing_table       [50;Inf)     [0;265]

Cross-item level metadata (cil), excerpt. The columns VARIABLE_LIST, MULTIVARIATE_OUTLIER_CHECKTYPE, N_RULES, ASSOCIATION_RANGE and ASSOCIATION_METRIC are NA in all rows shown and are left out of the layout; CONTRADICTION_TERM is placed last for readability:

CHECK_LABEL                         CONTRADICTION_TYPE  CONTRADICTION_TERM
Age follow-up                       LOGICAL             [AGE_1] < [AGE_0]
Sex follow-up                       LOGICAL             [SEX_1] <> [SEX_0]
Education follow-up                 LOGICAL             [EDUCATION_1] < [EDUCATION_0]
Nutrition inconsistency vegetarian  LOGICAL             [EATING_PREFS_0] = "vegetarian" and ([MEAT_CONS_0] in set("1-2d a week", "3-4d a week", "5-6d a week", "daily"))
Nutrition inconsistency vegan       LOGICAL             [EATING_PREFS_0] = "vegan" and ([MEAT_CONS_0] in set("1-2d a week", "3-4d a week", "5-6d a week", "daily"))
Nutrition inconsistency             EMPIRICAL           [EATING_PREFS_0] = "none" and [MEAT_CONS_0] = "never"
Non-smokers inconsistency           EMPIRICAL           [SMOKING_0] = "no" and ([SMOKE_SHOP_0] in set("1-2d a week", "3-4d a week", "5-6d a week", "daily"))
Smokers inconsistency               EMPIRICAL           ([SMOKING_0] = "yes") and ([SMOKE_SHOP_0] = "never")
Blood pressure false cuff           LOGICAL             [ARM_CIRC_DISC_0] <> [ARM_CUFF_0]
Pregnancy at high age               EMPIRICAL           [PREGNANT_0] = "yes" and [AGE_0] > 55

Segment-level metadata (sl); the first column holds the row names:

      STUDY_SEGMENT       SEGMENT_RECORD_COUNT  SEGMENT_ID_TABLE  SEGMENT_RECORD_CHECK  SEGMENT_ID_VARS  SEGMENT_UNIQUE_ROWS
1     PART_STUDY          3000                  expected_id       exact                 v00001           false
2     PART_PHYS_EXAM      3000                  expected_id       exact                 v00001           false
3     PART_LAB            3000                  expected_id       exact                 v00001           false
4     PART_INTERVIEW      3000                  expected_id       exact                 v00001           false
5     PART_QUESTIONNAIRE  3000                  expected_id       exact                 v00001           false
NA    NA                  NA                    NA                NA                    NA               NA
NA.1  NA                  NA                    NA                NA                    NA               NA
NA.2  NA                  NA                    NA                NA                    NA               NA
NA.3  NA                  NA                    NA                NA                    NA               NA
NA.4  NA                  NA                    NA                NA                    NA               NA
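
The HARD_LIMITS and DETECTION_LIMITS columns of the item-level metadata use interval notation such as [80;180] or [18;Inf), where a square bracket includes the bound and a round bracket excludes it. The base R sketch below shows one way such a string could be parsed and applied to a vector of values. The helpers `parse_limits()` and `within_limits()` are hypothetical names introduced for this example and are not part of dataquieR, and the blood pressure values are made up.

```r
# Hypothetical helper: split an interval string like "[80;180]" into its parts
parse_limits <- function(x) {
  n <- nchar(x)
  # Drop the two bracket characters, then split the bounds at ";"
  bounds <- as.numeric(strsplit(substr(x, 2, n - 1), ";", fixed = TRUE)[[1]])
  list(
    lower        = bounds[1],
    upper        = bounds[2],
    lower_closed = substr(x, 1, 1) == "[",
    upper_closed = substr(x, n, n) == "]"
  )
}

# Hypothetical helper: TRUE for every value that lies inside the interval
within_limits <- function(values, limits) {
  lim <- parse_limits(limits)
  above <- if (lim$lower_closed) values >= lim$lower else values > lim$lower
  below <- if (lim$upper_closed) values <= lim$upper else values < lim$upper
  above & below
}

sbp <- c(127, 114, 75, 190)  # made-up values; 75 and 190 violate [80;180]
within_limits(sbp, "[80;180]")
# TRUE TRUE FALSE FALSE

within_limits(c(17, 18, 90), "[18;Inf)")
# FALSE TRUE TRUE
```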


For more information on the synthetic example data and metadata, see "Documentation of simulated data including errors and data quality issues" and "Definition and use of metadata".
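
To make the cross-item level rules shown above more concrete: a CONTRADICTION_TERM such as [AGE_1] < [AGE_0] describes the contradictory state itself, so a record matches the rule when the follow-up age is lower than the baseline age. A minimal base R sketch of this single check could look as follows. It uses made-up records and a plain vector comparison rather than dataquieR's own rule evaluation, and the object names `toy` and `contradiction` are chosen only for this example.

```r
# Made-up records with baseline (AGE_0) and follow-up (AGE_1) ages
toy <- data.frame(
  PSEUDO_ID = c("A", "B", "C"),
  AGE_0     = c(49L, 47L, 50L),
  AGE_1     = c(49L, 45L, 52L),
  stringsAsFactors = FALSE
)

# Rule "[AGE_1] < [AGE_0]": TRUE marks a record in the contradictory state
contradiction <- toy$AGE_1 < toy$AGE_0

# IDs of the records that match the rule
toy$PSEUDO_ID[contradiction]
# "B"
```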

Generating a report

We can create a default report using the dq_report2() function. It requires only the study data, provided the metadata has been loaded beforehand with prep_load_workbook_like_file():

dq_report2(study_data = sd1) # the metadata is found automatically if prep_load_workbook_like_file() has been run before

Minimal workflow example

The animation below shows a quick workflow for reporting data quality with dataquieR:

r <- dq_report2("ship", meta_data_v2 = "ship_meta_v2")
dir.create("report_v2/")
print(r, dir = "report_v2/")
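
The dir.create() call above gives a warning if the directory already exists, for example when the workflow is run a second time. A small guard avoids this. In the sketch below, `ensure_dir()` is a hypothetical helper name, not a dataquieR function, and the directory is placed under tempdir() only for illustration.

```r
# Hypothetical helper: create the output directory only if it is missing
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  invisible(path)
}

out_dir <- file.path(tempdir(), "report_v2")  # temporary location, for illustration
ensure_dir(out_dir)
ensure_dir(out_dir)  # the second call does nothing and gives no warning
dir.exists(out_dir)
# TRUE
```

The resulting path could then be passed as the dir argument of print(), as in the workflow above.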

This example uses data from the Study of Health in Pomerania (SHIP) project, which is also included in dataquieR. You can see the example report generated by dq_report2() here.

Example code

The full code shown in the animation to produce a report is given here:

# --------------------------------------------------------------------------------------------------
# D A T A    Q U A L I T Y   I N    E P I D E M I O L O G I C A L    R E S E A R C H
#
# == dataquieR
#
# dq_report2() eases the generation of data quality reports as it automatically calls dataquieR functions
# 
#
# Installation/Further Information -----------------------------------------------------------------
#
# Please see our website:
# https://dataquality.qihs.uni-greifswald.de/
#
# Install the released version of dataquieR from CRAN:

install.packages("dataquieR")

# Alternatively (currently recommended), install the development version as described
# on https://dataquality.ship-med.uni-greifswald.de/DownloadR.html


# load the package

library(dataquieR)

# data ---------------------------------------------------------------------------------------------

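# Study of Health in Pomerania (SHIP) example data:
# system.file() returns the path to the file bundled with dataquieR

sd1 <- system.file("extdata", "ship.RDS", package = "dataquieR")

print(sd1)

# metadata workbook (again, the path to the bundled file)

md1 <- system.file("extdata", "ship_meta_v2.xlsx", package = "dataquieR")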

print(md1)

# dq_report2() - a crude approach -------------------------------------------------------------------

my_dq_report <- dq_report2(study_data   = sd1,
                           meta_data_v2 = md1,
                           label_col    = LABEL)

# view the results

print(my_dq_report)

The dq_report2() function and the print() method for such reports accept further arguments and settings. However, this sparse version is a good starting point for gaining insight into the data and can serve as a basis for tailoring more specific reports.
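
One detail of the example code is worth guarding against: system.file() returns an empty string instead of raising an error when the requested file or package cannot be found, so a missing installation would only surface later. The sketch below wraps the call in a check. `find_pkg_file()` is a hypothetical helper name introduced here, not part of dataquieR, and the demonstration uses the DESCRIPTION file of the stats package because it ships with every R installation.

```r
# Hypothetical helper: resolve a file shipped with a package,
# stopping with a clear message if nothing is found
find_pkg_file <- function(..., package) {
  path <- system.file(..., package = package)
  if (!nzchar(path)) {
    stop("File not found in package '", package, "': ", file.path(...),
         call. = FALSE)
  }
  path
}

# Demonstration with a file that every R installation provides
desc_path <- find_pkg_file("DESCRIPTION", package = "stats")
file.exists(desc_path)
# TRUE

# For the files used in this tutorial, the call would be, for example:
# sd1 <- find_pkg_file("extdata", "ship.RDS", package = "dataquieR")
```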