This tutorial introduces the creation of data quality reports in R with dataquieR.
Creating reports requires the appropriate setup of study data and metadata, as shown in the figure below:
The first step is to load dataquieR:
library(dataquieR)
Then, we can load one of the example data sets using:
sd1 <- prep_get_data_frame("ship")
This data set comes from the Study of Health in Pomerania (SHIP) project. The study data has 2154 observations and 29 variables:
sd1
id | exdate | age | sex | obs_bp | obs_soma | obs_int | dev_bp | dev_length | dev_weight |
---|---|---|---|---|---|---|---|---|---|
3861 | 1998-09-22 | 65 | 1 | 9 | 9 | 11 | 18 | 11 | 11 |
6506 | 1998-01-21 | 70 | 1 | 4 | 4 | 3 | 9 | 3 | 1 |
6096 | 1999-04-07 | 43 | 2 | 4 | 4 | 2 | 10 | 3 | 1 |
6674 | 2000-10-06 | 55 | 2 | 3 | 5 | 2 | 22 | 4 | 1 |
6490 | 1998-11-17 | 69 | 2 | 7 | 7 | 12 | 18 | 11 | 11 |
5366 | 1997-11-27 | 65 | 1 | 5 | 5 | 1 | 10 | 3 | 1 |
5735 | 1999-09-01 | 40 | 2 | 7 | 7 | 23 | 15 | 11 | 11 |
4031 | 1999-08-12 | 51 | 2 | 9 | 9 | 12 | 20 | 11 | 11 |
3578 | 2000-02-26 | 25 | 1 | 9 | 9 | 22 | 15 | 11 | 11 |
4807 | 2000-07-13 | 80 | 2 | 3 | 3 | 2 | 18 | 4 | 1 |
We can see that not all variable names are intuitive. Hence, the appropriate labels must be mapped from the metadata. Besides all variables' data types and labels, the metadata stores further expected characteristics and static information about the study data.
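To make the idea of label mapping concrete, the following base R sketch renames columns through a lookup table. The small metadata frame and the single data row are illustrative copies of values shown in this tutorial, and the column names `VAR_NAMES` and `LABEL` are assumed to hold the technical names and the labels. dataquieR takes care of this mapping itself when a report is created, so this is only meant to show the principle.

```r
# Illustrative excerpt of item-level metadata (assumed columns VAR_NAMES and LABEL)
md_small <- data.frame(
  VAR_NAMES = c("id", "exdate", "age", "sex"),
  LABEL     = c("ID", "EXAM_DT_0", "AGE_0", "SEX_0"),
  stringsAsFactors = FALSE
)

# One illustrative study data row using the technical variable names
sd_small <- data.frame(id = 3861, exdate = "1998-09-22", age = 65, sex = 1)

# Look up the label for each column name and rename the columns accordingly
sd_labelled <- sd_small
names(sd_labelled) <- md_small$LABEL[match(names(sd_small), md_small$VAR_NAMES)]

print(names(sd_labelled))
```

The `match()` call keeps the mapping independent of the column order in either table.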
We can load the corresponding example metadata using:
prep_load_workbook_like_file("ship_meta_v2")
The metadata is a workbook containing several sheets or tables that can be called individually. The main metadata table is the item-level metadata, which includes descriptions and expectations about single variables or items (e.g., columns in the study data table):
md1 <- prep_get_data_frame("item_level")
VAR_NAMES | LABEL | DATA_TYPE | VALUE_LABELS | STANDARDIZED_VOCABULARY_TABLE | MISSING_LIST_TABLE | HARD_LIMITS | DETECTION_LIMITS |
---|---|---|---|---|---|---|---|
id | ID | integer | NA | NA | NA | NA | NA |
exdate | EXAM_DT_0 | datetime | NA | NA | NA | [1995-01-01;) | NA |
sex | SEX_0 | integer | 1 = males \| 2 = females | NA | NA | NA | NA |
age | AGE_0 | integer | NA | NA | NA | [20;Inf) | NA |
obs_bp | OBS_BP_0 | integer | 1 = Obs_01 \| 2 = Obs_02 \| 3 = Obs_03 \| 4 = Obs_04 \| 5 = Obs_05 \| 6 = Obs_06 \| 7 = Obs_07 \| 8 = Obs_08 \| 9 = Obs_09 \| 10 = Obs_10 \| 11 = Obs_11 \| 12 = Obs_12 \| 13 = Obs_13 \| 14 = Obs_14 \| 15 = Obs_15 \| 16 = Obs_16 \| 17 = Obs_17 \| 18 = Obs_18 \| 19 = Obs_19 \| 20 = Obs_20 | NA | missing_table | NA | NA |
dev_bp | DEV_BP_0 | integer | 1 = Dev_01 \| 2 = Dev_02 \| 3 = Dev_03 \| 4 = Dev_04 \| 5 = Dev_05 \| 6 = Dev_06 \| 7 = Dev_07 \| 8 = Dev_08 \| 9 = Dev_09 \| 10 = Dev_10 \| 11 = Dev_11 \| 12 = Dev_12 \| 13 = Dev_13 \| 14 = Dev_14 \| 15 = Dev_15 \| 16 = Dev_16 \| 17 = Dev_17 \| 18 = Dev_18 \| 19 = Dev_19 \| 20 = Dev_20 \| 21 = Dev_21 \| 22 = Dev_22 \| 23 = Dev_23 \| 24 = Dev_24 \| 25 = Dev_25 | NA | missing_table | NA | NA |
sbp1 | SBP_0.1 | integer | NA | NA | NA | [80;200] | [0;265] |
sbp2 | SBP_0.2 | integer | NA | NA | NA | [80;200] | [0;265] |
dbp1 | DBP_0.1 | integer | NA | NA | NA | [40;160] | [0;265] |
dbp2 | DBP_0.2 | integer | NA | NA | NA | [40;160] | [0;265] |
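The `HARD_LIMITS` entries in the table use interval notation such as `[80;200]` or `[20;Inf)`. The following base R sketch shows how such an expectation could be checked by hand, assuming that a square bracket denotes an included bound, a round bracket an excluded bound, and an omitted bound no limit. The helpers `parse_interval()` and `within_limits()` are hypothetical names introduced for illustration and handle numeric limits only; they are not part of dataquieR, which evaluates the limits itself.

```r
# Hypothetical helper: split a numeric interval string into its bounds
parse_interval <- function(x) {
  x <- trimws(x)
  n <- nchar(x)
  bounds <- strsplit(substr(x, 2, n - 1), ";", fixed = TRUE)[[1]]
  # An omitted bound is treated as unbounded (assumption)
  lower <- if (length(bounds) < 1 || bounds[1] == "") -Inf else as.numeric(bounds[1])
  upper <- if (length(bounds) < 2 || bounds[2] == "") Inf else as.numeric(bounds[2])
  list(
    lower = lower,
    upper = upper,
    lower_closed = substr(x, 1, 1) == "[",
    upper_closed = substr(x, n, n) == "]"
  )
}

# Hypothetical helper: TRUE where a value lies inside the interval, NA for missing values
within_limits <- function(values, interval) {
  iv <- parse_interval(interval)
  above <- if (iv$lower_closed) values >= iv$lower else values > iv$lower
  below <- if (iv$upper_closed) values <= iv$upper else values < iv$upper
  above & below
}

# Made-up systolic blood pressure values checked against the limits of sbp1
sbp <- c(75, 80, 120, 200, 210, NA)
print(within_limits(sbp, "[80;200]"))

# Made-up ages checked against the limits of age
print(within_limits(c(19, 20, 85), "[20;Inf)"))
```

Values flagged as `FALSE` would count as violations of the hard limits; missing values stay `NA` rather than being counted as violations.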
Additional details and expectations about the joint use of two or more variables or items are defined in the cross-item level metadata:
cil <- prep_get_data_frame("cross-item_level")
VARIABLE_LIST | CHECK_LABEL | CONTRADICTION_TERM | CONTRADICTION_TYPE | MULTIVARIATE_OUTLIER_CHECKTYPE | N_RULES | ASSOCIATION_RANGE | ASSOCIATION_METRIC |
---|---|---|---|---|---|---|---|
NA | Systolic blood pressure lower than dyastolic blood pressure, first measurement | [sbp1] < [dbp1] | LOGICAL | NA | NA | NA | NA |
NA | Systolic blood pressure lower than dyastolic blood pressure, second measurement | [sbp2] < [dbp2] | LOGICAL | NA | NA | NA | NA |
NA | Body height lower than body weight | [BODY_HEIGHT_0] < [BODY_WEIGHT_0] | LOGICAL | NA | NA | NA | NA |
NA | Body height lower than waist circumference | [BODY_HEIGHT_0] < [WAIST_CIRC_0] | LOGICAL | NA | NA | NA | NA |
NA | Contraception inconsistency | [SEX_0] = “males” and [CONTRACEPTIVA_EVER_0] = “yes” | LOGICAL | NA | NA | NA | NA |
NA | Diabetes age inconsistency 1 | [DIABETES_KNOWN_0] = “yes” NOT [DIAB_AGE_ONSET_0] > 0 | EMPIRICAL | NA | NA | NA | NA |
NA | Diabetes age inconsistency 2 | [DIAB_AGE_ONSET_0] > 0 and [DIABETES_KNOWN_0] = “no” | LOGICAL | NA | NA | NA | NA |
sbp1 \| sbp2 | Systolic blood pressure checks | NA | NA | Hubert | 1 | NA | NA |
dbp1 \| dbp2 | Diastolic blood pressure checks | NA | NA | Hubert | 1 | NA | NA |
sbp1 \| sbp2 | Systolic blood pressure checks | NA | NA | NA | NA | (0.7;) | Pearson |
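As a rough illustration of what two of these cross-item expectations mean, the base R sketch below evaluates the contradiction rule `[sbp1] < [dbp1]` and a Pearson correlation between `sbp1` and `sbp2` on made-up values. Reading `(0.7;)` as "greater than 0.7" is an assumption, and dataquieR's own checks are more elaborate than this.

```r
# Made-up blood pressure measurements; not taken from the SHIP example data
bp <- data.frame(
  sbp1 = c(120, 70, 135, 150, 110),
  dbp1 = c(80, 90, 85, 95, 70),
  sbp2 = c(118, 74, 130, 148, 112)
)

# Contradiction rule: a systolic value below the diastolic value is implausible
contradiction <- bp$sbp1 < bp$dbp1
flagged_rows <- which(contradiction)
print(flagged_rows)

# Association check: the two systolic measurements are expected to correlate strongly
r <- cor(bp$sbp1, bp$sbp2, method = "pearson")
association_ok <- r > 0.7
print(association_ok)
```

In this toy data only the second row violates the contradiction rule, since 70 is below 90.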
Descriptions and expectations about the provided segments (e.g., different study examinations) are given in the segment-level metadata:
sl <- prep_get_data_frame("segment_level")
STUDY_SEGMENT | SEGMENT_RECORD_COUNT | SEGMENT_ID_TABLE | SEGMENT_RECORD_CHECK | SEGMENT_ID_VARS | SEGMENT_UNIQUE_ROWS |
---|---|---|---|---|---|
INTRO | 2154 | expected_id_segment | exact | id | TRUE |
SOMATOMETRY | 500 | expected_id_segment | exact | id | TRUE |
INTERVIEW | 2150 | expected_id_segment | exact | id | TRUE |
LABORATORY | 500 | expected_id_segment | subset | id | TRUE |
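One plausible reading of `SEGMENT_RECORD_COUNT` and `SEGMENT_UNIQUE_ROWS` is that a segment should contain an expected number of records and no repeated identifiers. The base R sketch below checks both on made-up identifiers. This interpretation is an assumption made for illustration, and the sketch does not reproduce how dataquieR evaluates these columns.

```r
# Made-up participant identifiers for one segment; the last one is a duplicate
segment_ids <- c(101, 102, 103, 103)

# Assumed expectation, analogous to a SEGMENT_RECORD_COUNT entry
expected_count <- 4

# Does the number of records match the expectation?
count_ok <- length(segment_ids) == expected_count

# Are all identifiers unique? anyDuplicated() returns 0 if there is no duplicate
unique_ok <- anyDuplicated(segment_ids) == 0

print(c(count_ok = count_ok, unique_ok = unique_ok))
```

Here the record count matches, but the uniqueness check fails because the identifier 103 occurs twice.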
For more information on the example data and metadata, see the example data description and the metadata tutorial.
We can create a default report using the dq_report2() function, which requires only the data and metadata previously loaded:
dq_report2(study_data = sd1) # the metadata is found automatically if prep_load_workbook_like_file() was run before
The animation below shows a quick workflow for reporting data quality with dataquieR:
r <- dq_report2("ship", meta_data_v2 = "ship_meta_v2")
dir.create("report_v2/")
print(r, dir = "report_v2/")
You can see the example report generated by dq_report2() here.
The code shown in the animation to produce a report is given here:
# --------------------------------------------------------------------------------------------------
# D A T A   Q U A L I T Y   I N   E P I D E M I O L O G I C A L   R E S E A R C H
#
# == dataquieR
#
# dq_report2() eases the generation of data quality reports as it automatically calls dataquieR functions
#
#
# Installation/Further Information -----------------------------------------------------------------
#
# Please see our website:
# https://dataquality.qihs.uni-greifswald.de/
#
# Install dataquieR from CRAN using
install.packages("dataquieR")
#
# or
#
# currently, you should install the development version as described
# on https://dataquality.ship-med.uni-greifswald.de/DownloadR.html
# load the package
library(dataquieR)
# data ---------------------------------------------------------------------------------------------
# Study of Health in Pomerania example data
sd1 <- prep_get_data_frame("ship")
print(sd1)
# metadata
prep_load_workbook_like_file("ship_meta_v2")
md1 <- prep_get_data_frame("item_level") # item-level metadata sheet of the loaded workbook
print(md1)
# dq_report2() - a crude approach -------------------------------------------------------------------
my_dq_report <- dq_report2(study_data = sd1,
                           meta_data_v2 = "ship_meta_v2",
                           label_col = LABEL)
# view the results
print(my_dq_report)
The function dq_report2() and the print() method for such reports accept further arguments and settings. However, this sparse version is a good starting point for gaining insight into the data and may serve as the basis for more specifically tailored reports.