This document describes the generation of a quality report using a 50% random data sample from the Study of Health in Pomerania START cohort, baseline examination (SHIP-START-0, 1997-2001). For further information on this study please see Völzke et al. 2010. Some noise has been introduced to the data to secure anonymity and for illustrative purposes.
The first step in the data quality assessment workflow evaluates the compliance of the study data to the respective metadata regarding formal and structural requirements. Users must provide both data and metadata as data frames.
Note: The metadata file is the primary point of reference for generating data quality reports.
In this example, the SHIP data are loaded from the dataquieR package:
sd1 <- readRDS(system.file("extdata", "ship.RDS", package = "dataquieR"))
The imported study data consists of:
Similarly, the respective metadata are loaded from dataquieR:
metadata_file <- system.file("extdata", "ship_meta_v2.xlsx", package = "dataquieR")
prep_load_workbook_like_file(metadata_file)
md1 <- prep_get_data_frame("item_level") # item_level is a sheet in ship_meta_v2.xlsx
The imported metadata provide information for:
An identical number of variables in both files is desirable but not necessary. Attributes (i.e. columns in the metadata) comprise information on each variable of the study data file, such as labels or admissibility limits.
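A minimal way to compare both files is to inspect the dimensions of the study data and the attribute names of the metadata:
# Number of observations and variables in the study data
dim(sd1)
# Attributes (columns) available in the item-level metadata
colnames(md1)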
The integrity check starts by calling the function pro_applicability_matrix(). The data quality indicators covered by this function are:
pro_applicability_matrix() generates a heatmap-like plot for the applicability of all dataquieR functions to the study data, using the provided metadata as a point of reference:
appmatrix <- pro_applicability_matrix(study_data = sd1,
meta_data = md1,
label_col = LABEL,
split_segments = TRUE)
The heatmap can be retrieved by the command:
appmatrix$ApplicabilityPlot
As split_segments = TRUE
was used as an argument, all
output is organized by the study segments defined in the metadata. In
this case, there are data from four examination segments: the
computer-assisted interview, intro (basic information on the
participants, such as sociodemographic information and examination
date), laboratory analysis, and somatometric examination. The assignment
of variables to segments is done in the metadata file.
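The segment assignment can be inspected directly in the metadata; a minimal sketch, assuming the segment column of this item-level metadata is named STUDY_SEGMENT:
# Tabulate the number of variables assigned to each study segment
# (STUDY_SEGMENT is assumed to be the segment column in this metadata version)
table(md1$STUDY_SEGMENT)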
The results of the applicability checks are technical, i.e., the function compares, for example, the data types defined in the metadata with those observed in the study data. The light blue areas indicate that additional checks would be possible for many variables if additional metadata were provided.
Note: Applying all technically feasible data quality implementations to all study data variables is not advisable. For example, detection limits are not meaningful for participants' IDs. However, the variable ID is represented as an integer, which technically allows checking detection limits.
All datatype issues found by pro_applicability_matrix() should be checked data element by data element. For instance, a major issue was found in the variable WAIST_CIRC_0. This variable is represented in the study data with datatype character, which differs from the expected datatype float defined in the metadata. Some basic checks show the misuse of a comma as the decimal delimiter.
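A minimal sketch of such a check, using the study-data column waist that also appears in the correction step below:
# Inspect the character representation of the waist circumference values
head(sd1$waist)
# Count values that use a comma instead of a decimal point
sum(grepl(",", sd1$waist))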
A direct conversion of WAIST_CIRC_0 to datatype numeric would coerce the affected values to NA, which should be avoided. Hence, to correct the issue, we replace the comma with the correct delimiter and convert the datatype without losing data values. The resulting applicability plot shows no more issues.
# replace comma with the correct delimiter
sd1$waist <- as.numeric(gsub(",", ".", sd1$waist))
pro_applicability_matrix(study_data = sd1, meta_data = md1, label_col = LABEL)$ApplicabilityPlot
The next major step in the data quality assessment workflow is to assess the occurrence and patterns of missing data. The sequence of checks in this example is ordered according to common stages of data collection:
| Level | Description |
|---|---|
| Unit missingness | Subjects without information on any of the provided study variables |
| Segment missingness | Subjects without information for all variables of a defined study segment (e.g., some examination) |
| Item missingness | Subjects without information on data fields within segments |
Following this sequence enables calculating the correct denominators to compute item missingness. Such calculations are particularly important for complex cohort studies in which different levels of examination programs are conducted. For example, only half of a study population might be selected for an MRI examination. In the remaining 50%, the respective MRI variables are not included per study design. This should be considered if item missingness is examined.
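As a purely illustrative calculation with hypothetical numbers: if, by design, only 500 of 1000 participants were selected for the MRI examination, the denominator for an MRI item is 500, not 1000.
# Hypothetical numbers illustrating the denominator correction
n_selected_mri <- 500   # participants selected for MRI by design
n_missing_item <- 25    # missing values observed for an MRI item
n_missing_item / n_selected_mri   # 5% item missingness, not 25/1000 = 2.5%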
This check identifies subjects without any measurements on the provided target variables for a data quality check.
Note: The interpretation of findings depends on the scope of the provided variables and data records. In this example, the study data set comprises examined SHIP participants, not the target sample. Accordingly, the check is not about study participation. Rather, it identifies cases for which unexpectedly no information has been provided at all. Any identified case would indicate a data management problem.
The indicator covered by com_unit_missingness() is:
Unit missingness can be assessed with:
my_unit_missings2 <- com_unit_missingness(study_data = sd1,
meta_data = md1,
label_col = LABEL,
id_vars = "ID")
my_unit_missings2$SummaryData
In total, 0 units have missing values in all variables of the study data. Thus, for each participant there is at least one variable with information.
Subsequently, a check is performed that identifies subjects without any measurements within each of the four defined study segments.
The indicator covered by com_segment_missingness() is:
The table output can be retrieved with:
MissSegs <- com_segment_missingness(study_data = sd1,
meta_data = md1,
threshold_value = 1,
direction = "high",
exclude_roles = c("secondary", "process"))
MissSegs$SummaryData
Exploring segment missingness over time requires another variable in the study data. Information regarding this variable can be added to the metadata using the dataquieR function prep_add_to_meta():
# create a discretized version of the examination year
sd1$exyear <- as.integer(lubridate::year(sd1$exdate))
# add metadata for this variable
md1 <- dataquieR::prep_add_to_meta(VAR_NAMES = "exyear",
DATA_TYPE = "integer",
LABEL = "EX_YEAR_0",
VALUE_LABELS = "1997 = 1st | 1998 = 2nd | 1999 = 3rd | 2000 = 4th | 2001 = 5th",
VARIABLE_ROLE = "process",
meta_data = md1)
With a discretized variable for examination year (EX_YEAR_0), the occurrence pattern by year can subsequently be assessed using the command com_segment_missingness():
MissSegs <- com_segment_missingness(study_data = sd1,
meta_data = md1,
threshold_value = 1,
label_col = LABEL,
group_vars = "EX_YEAR_0",
direction = "high",
exclude_roles = "process")
MissSegs$SummaryPlot
The plot is a descriptor of the indicator:
It illustrates that missing information from the laboratory examination is distributed unequally across examination years, with the highest proportion of missing data occurring in the 1st, 2nd, and 5th years.
The final check in the completeness dimension identifies subjects with missing information on individual variables across all study segments. The indicators covered by the function com_item_missingness() are:
Item missingness can be assessed by using the following call:
item_miss <- com_item_missingness(study_data = sd1,
meta_data = md1,
show_causes = TRUE,
label_col = "LABEL",
include_sysmiss = TRUE,
threshold_value = 95
)
An overview of the results can be obtained by requesting the summary table of this function:
item_miss$SummaryTable
The table provides one line for each of the 29 variables. Of particular interest are:
The table shows that the variable HOUSE_INCOME_MONTH_0 (monthly net household income) is affected by many missing values. In addition, age of diabetes onset (DIAB_AGE_ONSET_0) was only coded for 173 subjects, but most values are missing because of an intended jump.
Note: If jump codes have been used, e.g., for the variable CONTRACEPTIVE_EVER_0, the denominator for the calculation of item missingness is corrected for the number of jump codes used.
The summary plot delivers a different view of missing data by providing the frequency of the specified reasons for missing data. The corresponding indicator is:
item_miss$SummaryPlot
In the plot, the balloon size is determined by the number of missing data fields. It can now be inferred that, for example, the elevated number of missing values for the item HOUSE_INCOME_MONTH_0 is mainly caused by participants refusing to answer the respective question.
Consistency is targeted after completeness has been examined because it requires data without missing and jump codes. Consistency, a main aspect of correctness, describes the degree to which data values are free of breaks in conventions or contradictions. Different data types may be addressed in respective checks.
The indicator covered by con_limit_deviations() when specifying limits = "HARD_LIMITS" is:
Note: When specifying limits = "SOFT_LIMITS", the check identifies values that are not inadmissible but uncertain, according to the specified ranges. The related indicator is then:
The call in this example is:
MyValueLimits <- con_limit_deviations(study_data = sd1,
meta_data = md1,
label_col = "LABEL",
limits = "HARD_LIMITS")
A table output provides the number and percentage of all range violations for the variables with limits specified in the metadata:
MyValueLimits$SummaryData
The last column of the table also provides a GRADING. If the percentage of violations is above some threshold, a GRADING of 1 is assigned; in this example, any occurrence of violations is classified as problematic. Otherwise, the GRADING is 0.
The following statement assigns all variables identified as problematic to the R object whichdeviate to enable a more targeted output, for example to plot the distributions of any variable with violations of the specified limits:
# select variables with deviations
whichdeviate <- as.character(MyValueLimits$SummaryTable$Variables)[MyValueLimits$SummaryTable$GRADING == 1]
We can restrict the plots to those variables with limit deviations, i.e., those with a GRADING of 1 in the table above, using MyValueLimits$SummaryPlotList[whichdeviate] (only the first two are displayed below to reduce file size):
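A minimal sketch, assuming whichdeviate contains at least two variables:
# Display only the first two plots for variables with limit deviations
MyValueLimits$SummaryPlotList[whichdeviate[1:2]]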
A comparable check may be performed for categorical variables using the command con_inadmissible_categorical():
The covered indicator is:
The call is:
IAVCatAll <- con_inadmissible_categorical(study_data = sd1,
meta_data = md1,
label_col = "LABEL")
As with inadmissible numerical values, a table output displays the observed categories, the defined categories, any non-matching level, its count, and a GRADING:
IAVCatAll$SummaryData
The results show that there is one variable, SCHOOL_GRAD_0, with one inadmissible level.
The second main type of checks within the consistency dimension concerns contradictions. The indicators covered by the function con_contradictions_redcap() are:
The rules to identify contradictions must first be loaded from a spreadsheet. The creation of this spreadsheet is supported by a Shiny app. Overall, 12 different logical comparisons can be applied; an overview is given in the respective tutorial. Each line within the spreadsheet defines one check rule.
checks <- prep_get_data_frame("cross-item_level") # cross-item_level is a sheet in ship_meta_v2.xlsx, which was loaded earlier
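To get an impression of how the rules are defined, the first rows of this table can be inspected:
# Inspect the first check rules (one rule per row)
head(checks)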
Subsequently, the contradictions assessment may be triggered by con_contradictions_redcap(), using the table checks as the point of reference:
AnyContradictions <- con_contradictions_redcap(study_data = sd1,
meta_data = md1,
label_col = "LABEL",
meta_data_cross_item = checks,
threshold_value = 1)
A summary table shows the number and percentage of contradictions for each defined rule:
AnyContradictions$SummaryTable
In this example, rule seven leads to the identification of 35 contradictions: age of onset for diabetes is provided (DIAB_AGE_ONSET_0), but the variable on the presence of diabetes (DIABETES_KNOWN_0) does not indicate a known disease.
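To review the definition of this rule, the corresponding row of the checks table can be displayed (assuming the rules appear in the same order as in the summary table, one rule per row):
# Display the definition of rule seven
checks[7, ]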
The distributions may also be displayed as a plot:
AnyContradictions$SummaryPlot
In contrast to most consistency-related indicators, accuracy findings indicate an elevated probability that some data quality issue exists, rather than a certain issue.
Univariate outliers are assessed based on statistical criteria. The covered indicator is:
The function acc_robust_univariate_outlier() identifies outliers according to the approaches of Tukey, SixSigma, Hubert, and the heuristic approach of SigmaGap. It may be called as follows:
UnivariateOutlier <- dataquieR:::acc_robust_univariate_outlier(study_data = sd1,
meta_data = md1,
label_col = "LABEL")
The first output is a table that provides descriptive statistics and detected outliers according to the different criteria:
UnivariateOutlier$SummaryTable
Most variables contain outliers according to at least two criteria, but only for the diastolic blood pressure variables (DBP_0.1 and DBP_0.2) have two outliers been detected using the SigmaGap criterion.
To obtain better insight into the univariate distributions, a plot is provided (call it with UnivariateOutlier$SummaryPlotList). It highlights observations for each variable according to the number of violated rules (only the first four are shown here):
The function acc_multivariate_outlier() identifies outliers related to the indicator:
acc_multivariate_outlier() uses the same rules as acc_robust_univariate_outlier() for the identification of outliers.
The following function call relates the systolic and diastolic blood pressure measurements to age and weight, and a table output is created showing the number of detected multivariate outliers:
MVO_SBP0.1 <- acc_multivariate_outlier(variable_group = c("SBP_0.1", "DBP_0.1", "AGE_0", "BODY_WEIGHT_0"),
study_data = sd1,
meta_data = md1,
id_vars = "ID",
label_col = "LABEL")
MVO_SBP0.1$SummaryTable
The number of outliers varies considerably, depending on the criterion. Subsequently, a parallel coordinate plot may be requested to inspect the results further:
MVO_SBP0.1$SummaryPlot
Another example is the inspection of the first and second systolic blood pressure measurements:
MVO_DBP <- acc_multivariate_outlier(variable_group = c("SBP_0.1", "SBP_0.2"),
study_data = sd1,
meta_data = md1,
label_col = "LABEL")
MVO_DBP$SummaryTable
MVO_DBP$SummaryPlot
The function acc_distributions() describes distributions using histograms and displays empirical cumulative distribution functions (ECDFs) if a grouping variable is provided. The function is only descriptive and not related to a specific indicator. Instead, the results may be relevant to most indicators within the unexpected distribution domain.
The following example examines measurements in which a possible influence of the examiners is considered:
ECDFSoma <- acc_distributions(resp_vars = c("WAIST_CIRC_0", "BODY_HEIGHT_0", "BODY_WEIGHT_0"),
group_vars = "OBS_SOMA_0",
study_data = sd1,
meta_data = md1,
label_col = "LABEL")
The respective list of plots may be displayed using ECDFSoma$SummaryPlotList (only the first two plots are displayed below):
The function acc_margins() is mainly related to the indicators:
However, it also provides descriptive output such as violin and box plots for continuous variables, count plots for categorical data, and density plots for both. The main application of acc_margins() is to make inferences about effects related to process variables such as examiners, devices, or study centers. The function determines whether measurements are provided as continuous or discrete. Alternatively, metadata specifications may provide this information.
In the first example, acc_margins() is applied to the variable waist circumference (WAIST_CIRC_0). In this case, dependencies related to the examiners (OBS_SOMA_0) are assessed while the raw measurements are controlled for the variables age and sex (AGE_0, SEX_0):
marginal_dists <- acc_margins(resp_vars = "WAIST_CIRC_0",
co_vars = c("AGE_0", "SEX_0"),
group_vars = "OBS_SOMA_0",
study_data = sd1,
meta_data = md1,
label_col = "LABEL")
A plot is provided to view the results:
marginal_dists$SummaryPlot
Based on a statistical test, no examiner's mean waist circumference differed significantly (p < 0.05) from the overall mean.
The situation is different when assessing the coded myocardial infarction (MYOCARD_YN_0) across examiners while controlling for age and sex:
marginal_dists <- acc_margins(resp_vars = "MYOCARD_YN_0",
co_vars =c("AGE_0", "SEX_0"),
group_vars = "OBS_INT_0",
study_data = sd1,
meta_data = md1,
label_col = "LABEL")
marginal_dists$SummaryPlot
The result shows elevated proportions for the examiners 05 and 07.
An important and related issue is the quantification of the observed examiner effects, which is accomplished by the function acc_varcomp(), related to the indicators:
acc_varcomp() computes the proportion of variance in a target variable that is attributable to the grouping variable, while controlling for other variables (here age and sex). The output may be reviewed in a table format:
vcs <- acc_varcomp(resp_vars = "WAIST_CIRC_0",
co_vars = c("AGE_0", "SEX_0"),
group_vars = "OBS_SOMA_0",
study_data = sd1,
meta_data = md1,
label_col = "LABEL")
vcs$SummaryTable
For the variable WAIST_CIRC_0, an ICC of 0.019 has been found, which is below the threshold. The same is the case for the variable MYOCARD_YN_0, probably because the case count for the two deviating observers 05 and 07 is low:
vcs <- acc_varcomp(resp_vars = "MYOCARD_YN_0",
co_vars =c("AGE_0", "SEX_0"),
group_vars = "OBS_INT_0",
study_data = sd1,
meta_data = md1,
label_col = "LABEL")
vcs$SummaryTable
The study of effects across groups and time is particularly complex. The function acc_loess() provides a descriptor related to the indicator:
acc_loess() may also be used to obtain information related to other indicators in the domain of unexpected distributions. A sample call using waist circumference as the target variable is:
timetrends <- acc_loess(resp_vars = "WAIST_CIRC_0",
co_vars =c("AGE_0", "SEX_0"),
group_vars = "OBS_SOMA_0",
time_vars = "EXAM_DT_0",
study_data = sd1,
meta_data = md1,
label_col = "LABEL")
invisible(lapply(timetrends$SummaryPlotList, print))
The graph for this variable indicates no major discrepancies between the examiners over the examination period.
Next to location parameters, the shape of a distribution is an important aspect of accuracy.
The related indicator is:
Observed distributions can be tested against expected distributions using the function acc_shape_or_scale().
In this example the uniform distribution for the use of measurement devices is examined:
MyUnexpDist1 <- acc_shape_or_scale(resp_vars = "DEV_BP_0",
guess = TRUE,
label_col = "LABEL",
dist_col = "DISTRIBUTION",
study_data = sd1,
meta_data = md1)
MyUnexpDist1$SummaryPlot
The plot illustrates that devices have not been used with comparable frequencies.
Another example examines the normal distribution of blood pressure:
MyUnexpDist2 <- acc_shape_or_scale(resp_vars = "SBP_0.2",
guess = TRUE,
label_col = "LABEL",
dist_col = "DISTRIBUTION",
study_data = sd1,
meta_data = md1)
MyUnexpDist2$SummaryPlot
The result reveals a slight discrepancy from the normality assumption. It is up to the person responsible for the data quality assessments to decide whether such a discrepancy is relevant.
The analysis of end digit preferences is a specific implementation related to the indicator:
In this example, the uniform distribution of the end digits of body height is examined. Body height in SHIP-START-0 was a measurement that required manual reading and transfer of the data into an eCRF.
MyEndDigits <- acc_end_digits(resp_vars = "BODY_HEIGHT_0",
label_col = LABEL,
study_data = sd1,
meta_data = md1)
MyEndDigits$SummaryPlot
The graph highlights no relevant effects across the ten categories.
Output within the accuracy dimension frequently combines descriptive and inferential content, which is necessary to facilitate valid conclusions on data quality issues. Further details on all functions can be obtained following the links and in the software section.