Description

The function con_contradictions_redcap flags a contradiction when impossible or seemingly erroneous combinations of data are observed in one participant. For example, if the age of a participant is recorded repeatedly, the recorded values must never decrease over time. Contradiction checks rely on comparing two or more variables. Each value used in a comparison may be possible on its own, but the combination of the values is considered impossible or unlikely. Thus, con_contradictions_redcap is an implementation of the Logical contradictions and Empirical contradictions indicators, which belong to the Contradictions domain in the Consistency dimension. Logical contradictions are impossible combinations of data values. Empirical contradictions are combinations that are highly unlikely given our knowledge of the facts.
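
As a minimal illustration in base R (not the package's implementation), the age example can be written as a row-wise comparison of two variables. The toy data frame and its values are invented for this sketch; the variable names AGE_0 and AGE_1 follow the example metadata used further down this page.

```r
# Toy data (invented): baseline age (AGE_0) and follow-up age (AGE_1)
# for four participants.
toy <- data.frame(
  AGE_0 = c(45, 52, 60, 38),
  AGE_1 = c(47, 50, 63, NA)
)

# A follow-up age below the baseline age is treated as a contradiction.
# Rows with a missing value in either variable are not flagged.
flag_age <- with(toy, !is.na(AGE_0) & !is.na(AGE_1) & AGE_1 < AGE_0)

flag_age
which(flag_age)  # row indices of contradictory observations
```

Here only the second participant is flagged, because each of the two ages is plausible on its own and only their combination is impossible.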

The approach does not consider implausible or inadmissible values. For more details, see the user’s manual and the source code. This implementation is similar to the con_contradictions function, but it includes a parser that translates contradiction rules written in REDCap notation into R, which facilitates the handling of rule definitions.

Usage and arguments

con_contradictions_redcap(
  study_data = study_data, 
  meta_data = meta_data, 
  label_col = label_col,
  threshold_value = threshold_value, 
  meta_data_cross_item = meta_data_cross_item,
  summarize_categories = FALSE
)

The con_contradictions_redcap function has the following arguments:

  • study_data: mandatory, the data frame containing the measurements.

  • meta_data: mandatory, the data frame containing the item level metadata.

  • meta_data_cross_item: mandatory, the data frame containing the cross-item level metadata containing definitions for the contradictions. See the Definition of contradictions for details on the required structure.

  • label_col: optional, the column in the metadata data frame containing the labels of all the variables in the study data.

  • threshold_value: mandatory, a numerical value based on percentages ranging from 0 to 100 which decides on the grading of encountered contradictions as problematic.

  • summarize_categories: optional, if TRUE a summary output is generated for the defined categories plus one plot per category. Requires a column 'CONTRADICTION_TYPE' in the meta_data_cross_item.

  • use_value_labels: optional, whether to use the VALUE_LABELS column in the metadata to match the labels in the contradiction terms of meta_data_cross_item.
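
The role of threshold_value can be sketched as follows. This is only an illustration under stated assumptions: the counts and the number of observations are invented, and the rule "percentage above the threshold means problematic" is an assumption of this sketch, not a statement about the package internals.

```r
# Hypothetical counts of flagged observations for three checks, out of
# n_obs observations (all numbers are invented for this sketch).
n_obs   <- 3000
num_con <- c(check_1 = 150, check_2 = 19, check_3 = 64)

threshold_value <- 1  # percent

# Percentage of observations flagged by each check.
pct_con <- 100 * num_con / n_obs

# Assumption: a check is graded as problematic (1) when its percentage
# exceeds the threshold, and as unproblematic (0) otherwise.
grading <- as.integer(pct_con > threshold_value)
names(grading) <- names(num_con)

round(pct_con, 2)
grading
```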

Example output

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

Contradictions for this example are loaded as follows:

file_name <- system.file("extdata", "meta_data_v2.xlsx", package = "dataquieR")
prep_load_workbook_like_file(file_name)
meta_data_cross_item <- prep_get_data_frame("cross-item_level") # cross-item_level is a sheet in meta_data_v2.xlsx

The following table shows the contradictions that were defined for this study data:

For the con_contradictions_redcap function, the columns CONTRADICTION_TERM, CHECK_LABEL, and CONTRADICTION_TYPE in the check table are necessary. See Definition of contradictions for more details about how to enter the CONTRADICTION_TERM.

Contradictions without categories

The next call specifies the simplest analysis, without specifying the type of contradictions, and setting the threshold to 1%:

contradictions <- con_contradictions_redcap(study_data           = sd1,
                                            meta_data            = md1,
                                            label_col            = "LABEL",
                                            meta_data_cross_item = meta_data_cross_item,
                                            threshold_value      = 1)

con_contradictions_redcap returns three objects: FlaggedStudyData, SummaryTable and SummaryPlot.

Output 1: FlaggedStudyData

The data frame FlaggedStudyData indicates, for each observation in the study data, whether a contradiction was found (TRUE) or not (FALSE). The object can be accessed via contradictions$FlaggedStudyData:

Obs flag_con01 flag_con02 flag_con03 flag_con04 flag_con05 flag_con06 flag_con07 flag_con08 flag_con09 flag_con10 flag_con11
1 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
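
A flag table of this shape can be summarized with a few lines of base R. The sketch below uses a small stand-in data frame that mirrors two columns of the excerpt above; with real results, the same calls would be applied to contradictions$FlaggedStudyData. The object names flagged, n_flagged and obs_any are chosen for this illustration only.

```r
# Toy stand-in for contradictions$FlaggedStudyData (two flag columns only).
flagged <- data.frame(
  Obs        = 1:6,
  flag_con06 = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  flag_con09 = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# Select all flag columns by their common prefix.
flag_cols <- grep("^flag_", names(flagged), value = TRUE)

# Number of flagged observations per check.
n_flagged <- colSums(flagged[flag_cols])
n_flagged

# Observations with at least one contradiction.
obs_any <- flagged$Obs[rowSums(flagged[flag_cols]) > 0]
obs_any
```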


Output 2: Summary table

The second output is a data frame that shows the number and percentage of contradictions for each contradiction check that has been examined. Based on this result, a binary grading is also provided, see [Calculation of contradictions] for more details. Additionally, the output contains the columns CONTRADICTION_TERM, CONTRADICTION_TYPE and VARIABLE_TYPE from the meta_data_cross_item. A new column VARIABLE_LIST indicates which variables were involved in the contradiction check. The table can be accessed via contradictions$SummaryTable:

VARIABLE_LIST                 CHECK_LABEL                         CONTRADICTION_TERM                                                                                                GOLDSTANDARD  DATA_PREPARATION             CHECK_ID  NUM_con_con
AGE_0 | AGE_1                 Age follow-up                       [AGE_1] < [AGE_0]                                                                                                 NA            LABEL | MISSING_NA | LIMITS  1         150
SEX_0 | SEX_1                 Sex follow-up                       [SEX_1] <> [SEX_0]                                                                                                NA            LABEL | MISSING_NA | LIMITS  2         150
EDUCATION_0 | EDUCATION_1     Education follow-up                 [EDUCATION_1] < [EDUCATION_0]                                                                                     NA            LABEL | MISSING_NA | LIMITS  3         121
EATING_PREFS_0 | MEAT_CONS_0  Nutrition inconsistency vegetarian  [EATING_PREFS_0] = "vegetarian" and ([MEAT_CONS_0] in set("1-2d a week", "3-4d a week", "5-6d a week", "daily"))  NA            LABEL | MISSING_NA | LIMITS  4         54
EATING_PREFS_0 | MEAT_CONS_0  Nutrition inconsistency vegan       [EATING_PREFS_0] = "vegan" and ([MEAT_CONS_0] in set("1-2d a week", "3-4d a week", "5-6d a week", "daily"))       NA            LABEL | MISSING_NA | LIMITS  5         19
EATING_PREFS_0 | MEAT_CONS_0  Nutrition inconsistency             [EATING_PREFS_0] = "none" and [MEAT_CONS_0] = "never"                                                             NA            LABEL | MISSING_NA | LIMITS  6         64


Output 3: Summary plot

The third output displays the information in SummaryTable as a plot:

contradictions$SummaryPlot

Contradictions with categories

This example is similar to the previous one but considers the type of contradictions by setting summarize_categories = TRUE:

contradictions_cat <- con_contradictions_redcap(study_data           = sd1,
                                                meta_data            = md1,
                                                label_col            = "LABEL",
                                                meta_data_cross_item = meta_data_cross_item,
                                                threshold_value      = 1,
                                                summarize_categories = TRUE)

When summarize_categories = TRUE, the output of con_contradictions_redcap is organized according to the column CONTRADICTION_TYPE in the meta_data_cross_item. The returned list then includes the objects EMPIRICAL (for contradictions of empirical type), LOGICAL (for logical contradictions), and all_checks (including both types of contradictions). Each of these objects contains the respective data frames FlaggedStudyData and SummaryTable, plus the SummaryPlot. The example below shows the summaries for empirical and logical type contradictions.

    VARIABLE_LIST                 CHECK_LABEL                GOLDSTANDARD  DATA_PREPARATION             CHECK_ID  NUM_con_con
6   EATING_PREFS_0 | MEAT_CONS_0  Nutrition inconsistency    NA            LABEL | MISSING_NA | LIMITS  6         64
7   SMOKE_SHOP_0 | SMOKING_0      Non-smokers inconsistency  NA            LABEL | LIMITS               7         91
8   SMOKE_SHOP_0 | SMOKING_0      Smokers inconsistency      NA            LABEL | MISSING_NA | LIMITS  8         118
10  AGE_0 | PREGNANT_0            Pregnancy at high age      NA            LABEL | MISSING_NA | LIMITS  10        5


contradictions_cat$EMPIRICAL$SummaryPlot

    VARIABLE_LIST                 CHECK_LABEL                         GOLDSTANDARD  DATA_PREPARATION             CHECK_ID  NUM_con_con
1   AGE_0 | AGE_1                 Age follow-up                       NA            LABEL | MISSING_NA | LIMITS  1         150
2   SEX_0 | SEX_1                 Sex follow-up                       NA            LABEL | MISSING_NA | LIMITS  2         150
3   EDUCATION_0 | EDUCATION_1     Education follow-up                 NA            LABEL | MISSING_NA | LIMITS  3         121
4   EATING_PREFS_0 | MEAT_CONS_0  Nutrition inconsistency vegetarian  NA            LABEL | MISSING_NA | LIMITS  4         54
5   EATING_PREFS_0 | MEAT_CONS_0  Nutrition inconsistency vegan       NA            LABEL | MISSING_NA | LIMITS  5         19
9   ARM_CIRC_DISC_0 | ARM_CUFF_0  Blood pressure false cuff           NA            LABEL | MISSING_NA | LIMITS  9         463


contradictions_cat$LOGICAL$SummaryPlot

Interpretation

Any contradiction in the study data should be resolved by appropriate data curation steps. In case of logical contradictions an error is encountered with certainty, implying the need to change or exclude at least one value from the data set. In case of an empirical contradiction it must be checked whether there is any possibility in the particular data set that the checked combination is true. If so, the decision rule should be adapted.

In case of a confirmed contradiction it still remains to be ascertained whether data errors stem from the data collection process itself or subsequent data management steps.

Definition of contradictions

Contradictions can be defined via logical comparison of variables. Assume \(A\) and \(B\) to represent two variables in the study data. Then:

  • if \(A \gt B\) a contradiction may follow

  • if \(A\) is not missing, then \(B\) should not be observed

  • if \(A \lt 18\) then \(B \ne \text{"adult"}\)

The definition of such comparisons in the CONTRADICTION_TERM column of the meta_data_cross_item broadly follows the REDCap notation for Data Quality Rules. This notation includes the logical operators AND, OR, NOT, and =, as well as the mathematical operators for standard operations (+, -, /, *), comparisons (>, <, >=, <=) and precedence ((, )). To build statements, variable names must be enclosed in square brackets (e.g., [AGE_0]). The variable is followed by an operator (e.g., >=) and a comparison value, which can be a literal value (e.g., a number or a date) or another variable name. For example, [AGE_0] >= 18 or [AGE_1] < [AGE_0]. See here for more examples.
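
To make the idea of translating such terms into R more concrete, the following toy function rewrites a small subset of the notation with regular expressions and evaluates the result on a data frame. It is not the parser shipped with dataquieR: the function name redcap_to_r, the toy data and the example term are assumptions made for this sketch, and NOT, dates, set expressions and operators inside quoted strings are not handled.

```r
# Toy translation of a REDCap-style term into an R expression string.
redcap_to_r <- function(term) {
  # [AGE_0] -> AGE_0
  term <- gsub("\\[([A-Za-z0-9_]+)\\]", "\\1", term)
  # <> (not equal) -> !=
  term <- gsub("<>", "!=", term, fixed = TRUE)
  # a single = (not part of >=, <=, != or ==) -> ==
  term <- gsub("(?<![<>=!])=(?!=)", "==", term, perl = TRUE)
  # and / or -> & / |
  term <- gsub("\\band\\b", "&", term, ignore.case = TRUE, perl = TRUE)
  term <- gsub("\\bor\\b", "|", term, ignore.case = TRUE, perl = TRUE)
  term
}

# Toy data (invented values).
toy <- data.frame(
  AGE_0      = c(30, 72, 55),
  PREGNANT_0 = c("yes", "yes", "no"),
  stringsAsFactors = FALSE
)

r_term <- redcap_to_r('[AGE_0] >= 50 and [PREGNANT_0] = "yes"')
r_term

# Evaluate the translated term with the data frame columns as variables.
flag <- eval(parse(text = r_term), envir = toy)
flag
```

Evaluating parsed text is acceptable for a sketch with trusted rule definitions; a production parser would build the expression from tokens instead of string substitution.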

Contradictions involving categorical variables can be specified either using labels, if they are given in the metadata column VALUE_LABELS, or using codes. For example, if VALUE_LABELS defines the labels, a contradiction for the variable PREGNANT_0 can be created with the term [PREGNANT_0] = "yes" (see row 10 of the meta_data_cross_item in Example output). This is the default notation if the metadata contains a non-empty VALUE_LABELS column. Alternatively, the contradiction term can directly specify the code; the corresponding example would then be written as [PREGNANT_0] = 1. To use this notation even if VALUE_LABELS is defined in the metadata, include the (optional) argument use_value_labels = FALSE in con_contradictions_redcap.
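
The equivalence of the two notations can be checked on toy data. In the sketch below, the coding 0 = "no" and 1 = "yes" is an assumption made for illustration (the text above only implies that code 1 corresponds to "yes"), and the values are invented.

```r
# Assumed coding for this sketch: 0 = "no", 1 = "yes".
codes  <- c(1, 0, 1, NA, 0)
labels <- c("0" = "no", "1" = "yes")

# Label-based view of the same variable (NA stays NA).
labelled <- unname(labels[as.character(codes)])

# Label notation, as in [PREGNANT_0] = "yes".
flag_by_label <- !is.na(labelled) & labelled == "yes"

# Code notation, as in [PREGNANT_0] = 1.
flag_by_code <- !is.na(codes) & codes == 1

identical(flag_by_label, flag_by_code)
```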

Algorithm of the implementation

  1. Remove missing codes from the study data (if defined in the metadata).
  2. Remove measurements deviating from limits defined in the metadata.
  3. Assign labels to the levels of categorical variables (if applicable).
  4. Apply contradiction checks.
  5. Identify measurements fulfilling contradiction rules. Two output data frames are generated:
    • one on the level of the observations, to flag each contradictory value combination, and
    • a summary table for each contradiction check.
  6. Generate a summary plot illustrating the number of contradictions.
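
The steps above can be sketched on toy data in base R. This is a simplified illustration, not the package code: the missing code 99980, the limits, the data values and the output column names n_flagged and pct_flagged are all assumptions of this sketch, and the plotting step is omitted.

```r
# Toy study data (values invented). 99980 is a hypothetical missing code.
study <- data.frame(
  AGE_0 = c(45, 52, 99980, 130, 60),
  AGE_1 = c(47, 50, 40, 61, 58)
)
missing_codes <- 99980
lower <- 18   # hypothetical lower limit for both age variables
upper <- 110  # hypothetical upper limit for both age variables

# 1. Replace missing codes by NA.
study[] <- lapply(study, function(x) replace(x, x %in% missing_codes, NA))

# 2. Remove values outside the limits.
study[] <- lapply(study, function(x) {
  replace(x, !is.na(x) & (x < lower | x > upper), NA)
})

# 3. No categorical variables in this toy example, so no labels to assign.

# 4. and 5. Apply the check AGE_1 < AGE_0 and flag each observation.
flagged <- data.frame(
  Obs        = seq_len(nrow(study)),
  flag_con01 = with(study, !is.na(AGE_0) & !is.na(AGE_1) & AGE_1 < AGE_0)
)

# Summary table with one row per check.
summary_table <- data.frame(
  CHECK_ID    = 1,
  n_flagged   = sum(flagged$flag_con01),
  pct_flagged = 100 * mean(flagged$flag_con01)
)

flagged
summary_table
```

Because steps 1 and 2 run before the check, the missing code and the out-of-range value 130 do not produce spurious contradictions in rows 3 and 4.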

Concept relations