The function con_contradictions_redcap flags a contradiction when impossible or seemingly erroneous combinations of data are observed in one participant. For example, if the age of a participant is recorded repeatedly, it must never decrease from one examination to the next. Contradiction checks rely on comparing two or more variables. Each value used in a comparison may be plausible on its own, but the combination of the values is considered impossible or unlikely.
Thus, con_contradictions_redcap is an implementation of the Logical contradictions and Empirical contradictions indicators, which belong to the Contradictions domain in the Consistency dimension. Logical contradictions are impossible combinations of data values. Empirical contradictions are combinations that are highly unlikely given our knowledge of the facts. The approach does not consider implausible or inadmissible values.
For more details, see the user’s manual and the source code. This implementation is similar to the con_contradictions function, but it includes a parser that translates contradiction rules written in REDCap notation into R, which facilitates the handling of rule definitions.
con_contradictions_redcap(
  study_data = study_data,
  meta_data = meta_data,
  label_col = label_col,
  threshold_value = threshold_value,
  meta_data_cross_item = meta_data_cross_item,
  summarize_categories = FALSE
)
The con_contradictions_redcap function has the following arguments:
study_data: mandatory, the data frame containing the measurements.
meta_data: mandatory, the data frame containing the item-level metadata.
meta_data_cross_item: mandatory, the data frame containing the cross-item level metadata with the definitions of the contradictions. See the Definition of contradictions for details on the required structure.
label_col: optional, the column in the metadata data frame containing the labels of all the variables in the study data.
threshold_value: mandatory, a percentage between 0 and 100 that serves as the cut-off for grading encountered contradictions as problematic.
summarize_categories: optional, if TRUE a summary output is generated for the defined categories, plus one plot per category. Requires a column CONTRADICTION_TYPE in meta_data_cross_item.
use_value_labels: optional, whether to use the VALUE_LABELS column in the metadata to match the labels in the contradiction terms of meta_data_cross_item.
To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.
Contradictions for this example are loaded as follows:
file_name <- system.file("extdata", "meta_data_v2.xlsx", package = "dataquieR")
prep_load_workbook_like_file(file_name)
meta_data_cross_item <- prep_get_data_frame("cross-item_level") # cross-item_level is a sheet in meta_data_v2.xlsx
The following table shows the contradictions that were defined for this study data:
For the con_contradictions_redcap function, the columns CONTRADICTION_TERM, CHECK_LABEL, and CONTRADICTION_TYPE in the check table are necessary. See Definition of contradictions for more details about how to enter the CONTRADICTION_TERM.
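As a rough illustration of the three columns named above, a small check table could be assembled by hand in base R. This is only a sketch: the two rules are copied from the example output in this section, the type values LOGICAL and EMPIRICAL are assumed from the names of the categorized output objects, the spelling CHECK_LABEL follows the summary table, and the real cross-item metadata may require further columns that are not shown here.

```r
# Hand-built check table with the three columns described above.
# Assumption: the package may expect additional columns (e.g. identifiers);
# only the columns named in the text are sketched here.
rules <- data.frame(
  CHECK_LABEL = c("Age follow-up", "Nutrition inconsistency"),
  CONTRADICTION_TERM = c(
    "[AGE_1] < [AGE_0]",
    "[EATING_PREFS_0] = \"none\" and [MEAT_CONS_0] = \"never\""
  ),
  # Assumed spelling of the type values, taken from the output object names.
  CONTRADICTION_TYPE = c("LOGICAL", "EMPIRICAL"),
  stringsAsFactors = FALSE
)

# Simple sanity checks before handing such a table to any function.
required <- c("CONTRADICTION_TERM", "CHECK_LABEL", "CONTRADICTION_TYPE")
missing_cols <- setdiff(required, names(rules))
print(rules)
```

Checking for missing columns up front, as in the last lines, tends to give clearer error messages than letting a downstream call fail.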
The next call specifies the simplest analysis, without specifying the type of contradictions, and setting the threshold to 1%:
contradictions <- con_contradictions_redcap(study_data = sd1,
                                            meta_data = md1,
                                            label_col = "LABEL",
                                            meta_data_cross_item = meta_data_cross_item,
                                            threshold_value = 1)
con_contradictions_redcap returns three objects: FlaggedStudyData, SummaryTable and SummaryPlot.
Output 1: FlaggedStudyData
The data frame FlaggedStudyData indicates whether contradictions were found (TRUE) or not (FALSE), for each observation in the study data. The object can be accessed via contradictions$FlaggedStudyData:
Obs | flag_con01 | flag_con02 | flag_con03 | flag_con04 | flag_con05 | flag_con06 | flag_con07 | flag_con08 | flag_con09 | flag_con10 | flag_con11 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE |
2 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
3 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
4 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE |
5 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE |
6 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE |
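A flag table of this shape can be summarized with base R. The sketch below does not call the package; it rebuilds two of the flag columns from the six rows shown above (flag_con06 and flag_con09) as a toy data frame, then counts flagged observations per check and marks observations with at least one contradiction.

```r
# Toy copy of two flag columns from the first six observations shown above.
flagged <- data.frame(
  Obs = 1:6,
  flag_con06 = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  flag_con09 = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# Select the flag columns by their name prefix.
flag_cols <- grep("^flag_", names(flagged), value = TRUE)

# Number of flagged observations per contradiction check.
n_flagged <- colSums(flagged[flag_cols], na.rm = TRUE)

# TRUE for every observation with at least one contradiction.
any_flag <- rowSums(flagged[flag_cols], na.rm = TRUE) > 0

print(n_flagged)
print(flagged$Obs[any_flag])
```

With the real output, the same steps would presumably start from contradictions$FlaggedStudyData instead of the toy data frame.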
Output 2: Summary table
The second output is a data frame that shows the number and percentage of contradictions for each variable that has been examined. Based on this result, a binary grading is also provided; see Calculation of contradictions for more details. Additionally, the output contains the columns CONTRADICTION_TERM, CONTRADICTION_TYPE and VARIABLE_TYPE from meta_data_cross_item. A new column VARIABLE_LIST indicates which variables were involved in the contradiction check. The table can be accessed via contradictions$SummaryTable:
VARIABLE_LIST | CHECK_LABEL | CONTRADICTION_TERM | GOLDSTANDARD | DATA_PREPARATION | CHECK_ID | NUM_con_con |
---|---|---|---|---|---|---|
AGE_0 \| AGE_1 | Age follow-up | [AGE_1] < [AGE_0] | NA | LABEL \| MISSING_NA \| LIMITS | 1 | 150 |
SEX_0 \| SEX_1 | Sex follow-up | [SEX_1] <> [SEX_0] | NA | LABEL \| MISSING_NA \| LIMITS | 2 | 150 |
EDUCATION_0 \| EDUCATION_1 | Education follow-up | [EDUCATION_1] < [EDUCATION_0] | NA | LABEL \| MISSING_NA \| LIMITS | 3 | 121 |
EATING_PREFS_0 \| MEAT_CONS_0 | Nutrition inconsistency vegetarian | [EATING_PREFS_0] = "vegetarian" and ([MEAT_CONS_0] in set("1-2d a week", "3-4d a week", "5-6d a week", "daily")) | NA | LABEL \| MISSING_NA \| LIMITS | 4 | 54 |
EATING_PREFS_0 \| MEAT_CONS_0 | Nutrition inconsistency vegan | [EATING_PREFS_0] = "vegan" and ([MEAT_CONS_0] in set("1-2d a week", "3-4d a week", "5-6d a week", "daily")) | NA | LABEL \| MISSING_NA \| LIMITS | 5 | 19 |
EATING_PREFS_0 \| MEAT_CONS_0 | Nutrition inconsistency | [EATING_PREFS_0] = "none" and [MEAT_CONS_0] = "never" | NA | LABEL \| MISSING_NA \| LIMITS | 6 | 64 |
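The role of threshold_value can be pictured with a small calculation. The following is a sketch under stated assumptions, not the package's own code: the counts are the NUM_con_con values from the table above, the denominator n_obs is a hypothetical number of observations, and a check is treated as problematic when its percentage exceeds the threshold. The denominator and the exact comparison used by the package may differ.

```r
# Counts of contradictions per check, taken from the table above.
num_con <- c(150, 150, 121, 54, 19, 64)

# Hypothetical number of observations used as the denominator (assumption).
n_obs <- 3000

# Threshold in percent, as in the example call (threshold_value = 1).
threshold_value <- 1

# Percentage of observations with a contradiction, per check.
pct_con <- 100 * num_con / n_obs

# Binary grading: TRUE if the percentage exceeds the threshold
# (assumed rule; the package may compare differently).
grading <- pct_con > threshold_value

print(round(pct_con, 2))
print(grading)
```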
Output 3: Summary plot
The third output displays the information in SummaryTable as a plot:
contradictions$SummaryPlot
This example is similar to the previous one but considers the type of contradictions by setting summarize_categories = TRUE:
contradictions_cat <- con_contradictions_redcap(study_data = sd1,
                                                meta_data = md1,
                                                label_col = "LABEL",
                                                meta_data_cross_item = meta_data_cross_item,
                                                threshold_value = 1,
                                                summarize_categories = TRUE)
When the user sets summarize_categories = TRUE, the output of con_contradictions_redcap is classified according to the column CONTRADICTION_TYPE in meta_data_cross_item. The list output then includes the objects EMPIRICAL (for contradictions of empirical type), LOGICAL (for logical contradictions), and all_checks (including both types of contradictions). Each of these objects contains the respective data frames FlaggedStudyData and SummaryTable, plus the SummaryPlot. The example below shows the summaries for empirical and logical type contradictions.
 | VARIABLE_LIST | CHECK_LABEL | GOLDSTANDARD | DATA_PREPARATION | CHECK_ID | NUM_con_con |
---|---|---|---|---|---|---|
6 | EATING_PREFS_0 \| MEAT_CONS_0 | Nutrition inconsistency | NA | LABEL \| MISSING_NA \| LIMITS | 6 | 64 |
7 | SMOKE_SHOP_0 \| SMOKING_0 | Non-smokers inconsistency | NA | LABEL \| LIMITS | 7 | 91 |
8 | SMOKE_SHOP_0 \| SMOKING_0 | Smokers inconsistency | NA | LABEL \| MISSING_NA \| LIMITS | 8 | 118 |
10 | AGE_0 \| PREGNANT_0 | Pregnancy at high age | NA | LABEL \| MISSING_NA \| LIMITS | 10 | 5 |
contradictions_cat$EMPIRICAL$SummaryPlot
 | VARIABLE_LIST | CHECK_LABEL | GOLDSTANDARD | DATA_PREPARATION | CHECK_ID | NUM_con_con |
---|---|---|---|---|---|---|
1 | AGE_0 \| AGE_1 | Age follow-up | NA | LABEL \| MISSING_NA \| LIMITS | 1 | 150 |
2 | SEX_0 \| SEX_1 | Sex follow-up | NA | LABEL \| MISSING_NA \| LIMITS | 2 | 150 |
3 | EDUCATION_0 \| EDUCATION_1 | Education follow-up | NA | LABEL \| MISSING_NA \| LIMITS | 3 | 121 |
4 | EATING_PREFS_0 \| MEAT_CONS_0 | Nutrition inconsistency vegetarian | NA | LABEL \| MISSING_NA \| LIMITS | 4 | 54 |
5 | EATING_PREFS_0 \| MEAT_CONS_0 | Nutrition inconsistency vegan | NA | LABEL \| MISSING_NA \| LIMITS | 5 | 19 |
9 | ARM_CIRC_DISC_0 \| ARM_CUFF_0 | Blood pressure false cuff | NA | LABEL \| MISSING_NA \| LIMITS | 9 | 463 |
contradictions_cat$LOGICAL$SummaryPlot
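The grouping by contradiction type can be mimicked in base R to see how a list with the elements EMPIRICAL, LOGICAL and all_checks might arise. This is an illustrative sketch only: it uses a hand-typed subset of the checks listed above, and it does not reproduce how the package builds its output internally.

```r
# Hand-typed subset of the checks shown above, with their types.
checks <- data.frame(
  CHECK_ID = c(1, 2, 6, 10),
  CHECK_LABEL = c("Age follow-up", "Sex follow-up",
                  "Nutrition inconsistency", "Pregnancy at high age"),
  CONTRADICTION_TYPE = c("LOGICAL", "LOGICAL", "EMPIRICAL", "EMPIRICAL"),
  NUM_con_con = c(150, 150, 64, 5),
  stringsAsFactors = FALSE
)

# One data frame per contradiction type, named after the type.
by_type <- split(checks, checks$CONTRADICTION_TYPE)

# Add the unsplit table under the name used in the text.
by_type$all_checks <- checks

print(names(by_type))
print(by_type$EMPIRICAL)
```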
Any contradiction in the study data should be resolved by appropriate data curation steps. In case of logical contradictions an error is encountered with certainty, implying the need to change or exclude at least one value from the data set. In case of an empirical contradiction it must be checked whether there is any possibility in the particular data set that the checked combination is true. If so, the decision rule should be adapted.
In case of a confirmed contradiction it still remains to be ascertained whether data errors stem from the data collection process itself or subsequent data management steps.
Contradictions can be defined via logical comparison of variables. Assume \(A\) and \(B\) to represent two variables in the study data. Then:
if \(A \gt B\) a contradiction may follow
if \(A\) is not missing, then \(B\) should not be observed
if \(A \lt 18\) then \(B \ne \:"adult"\)
The definition of such comparisons in the CONTRADICTION_TERM column of meta_data_cross_item broadly follows the REDCap notation for Data Quality Rules. These include the logical operators AND, OR, NOT, and =, as well as the mathematical operators for standard operations (+, -, /, *), comparisons (>, <, >=, <=) and precedence ((, )). To build statements, the variable names must be enclosed by square brackets (e.g., [AGE_0]). An operator then follows the variable (e.g., >=) and a comparison value, which can be a value (e.g., a number or a date) or another variable name. For example, [AGE_0] >= 18 or [AGE_1] < [AGE_0]. See here for more examples.
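To make the notation more concrete, the two example terms can be written by hand as ordinary R comparisons on a toy data frame. This is a sketch, not the output of the package's parser: the data values are invented, and the way missing values are treated by the package is not shown here.

```r
# Toy data; column names mirror the example terms, values are invented.
toy <- data.frame(
  AGE_0 = c(45, 60, 33, NA),
  AGE_1 = c(47, 58, 33, 40)
)

# REDCap-style term: [AGE_1] < [AGE_0]
# Hand-written R equivalent: TRUE marks a contradiction.
flag_age <- with(toy, AGE_1 < AGE_0)

# REDCap-style term: [AGE_0] >= 18 AND [AGE_1] < [AGE_0]
# In R, the logical operator AND becomes the vectorized `&`.
flag_adult_age <- with(toy, AGE_0 >= 18 & AGE_1 < AGE_0)

print(flag_age)
print(flag_adult_age)
```

In plain R, a comparison involving a missing value yields NA rather than TRUE or FALSE, which is why the last observation is neither flagged nor cleared in this sketch.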
Contradictions involving categorical variables can be specified either using labels, if they are given in the metadata column VALUE_LABELS, or using codes. For example, if VALUE_LABELS defines the labels, a contradiction for the variable PREGNANT_0 can be created with the term [PREGNANT_0] = "yes" (see row 10 of the contradictions in meta_data_cross_item in Example output). This is the default notation if the metadata contains a non-empty VALUE_LABELS column. Alternatively, the contradiction term can specify the code directly; the corresponding example would then be written as [PREGNANT_0] = 1. To use this notation even if VALUE_LABELS is defined in the metadata, include the (optional) argument use_value_labels = FALSE in con_contradictions_redcap.
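The equivalence of the two notations can be checked on a toy vector in base R. The sketch below assumes that code 1 stands for "yes", as in the example terms, and additionally assumes a code 0 for "no"; the lookup vector value_labels is a hypothetical stand-in and does not reproduce the format of the VALUE_LABELS metadata column.

```r
# Toy coded values for PREGNANT_0 (invented data).
codes <- c(0, 1, 1, NA, 0)

# Hypothetical code-to-label lookup; 1 = "yes" follows the example terms,
# 0 = "no" is an assumption.
value_labels <- c("0" = "no", "1" = "yes")

# Translate codes to labels; missing codes stay missing.
labels <- unname(value_labels[as.character(codes)])

# Label notation: [PREGNANT_0] = "yes"
flag_by_label <- labels == "yes"

# Code notation: [PREGNANT_0] = 1
flag_by_code <- codes == 1

print(identical(flag_by_label, flag_by_code))
```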