Description

The function con_limit_deviations examines the admissibility or uncertainty of numerical study data according to the intervals defined in the metadata. The target values can be of type integer, float or datetime. Thus, con_limit_deviations is an implementation of the Inadmissible numerical values and Uncertain numerical values indicators, as well as the Inadmissible time-date values and Uncertain time-date values indicators. These belong to the Range and value violations domain in the Consistency dimension.

For more details, see the user’s manual and the source code.

Usage and arguments

con_limit_deviations(
  resp_vars = NULL,
  label_col = NULL,
  study_data = sd1,
  meta_data = md1,
  limits = c("HARD_LIMITS", "SOFT_LIMITS", "DETECTION_LIMITS")
)

The con_limit_deviations function has the following arguments:

  • resp_vars: the name(s) of the measurement variable(s) to examine; if NULL, all variables with defined limits are checked
  • label_col: the column of the metadata that contains the variable labels, if labels should be used
  • limits: which limits should be investigated (HARD_LIMITS, SOFT_LIMITS, or DETECTION_LIMITS)
  • study_data: the name of the data frame that contains the measurements
  • meta_data: the name of the data frame that contains item-level metadata

This implementation makes no use of thresholds.

CAVEAT:

In naming this function we deviate from other implementations. The reason is its generic use: the same function can process different types of limits, i.e. HARD_LIMITS, SOFT_LIMITS, or DETECTION_LIMITS. A necessary convention is that all limits are defined in the same way, as shown in the next example.

Example output

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

For the con_limit_deviations function, the columns HARD_LIMITS, MISSING_LIST and JUMP_LIST in the metadata are particularly relevant.

HARD_LIMITS have to be defined as intervals:

  • \([0; 100]\): any value between 0 and 100, including 0 or 100

  • \((0; 100)\): any value between 0 and 100, not including 0 or 100

  • \([0; Inf)\): any non-negative numerical value, i.e. 0 or greater
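The interval notation above can be read mechanically. The following is a minimal sketch in base R, not the parser used inside dataquieR; the helper name parse_interval and the fields of the returned list are hypothetical and serve only to illustrate how brackets and parentheses map to inclusive and exclusive bounds.

```r
# Hypothetical helper (not part of dataquieR): parse an interval string
# such as "[0;100]" or "[18;Inf)" into numeric bounds and inclusiveness flags.
parse_interval <- function(x) {
  x <- gsub("\\s+", "", x)                    # drop all whitespace
  n <- nchar(x)
  # text between the outer delimiters, split at the semicolon
  bounds <- strsplit(substr(x, 2, n - 1), ";", fixed = TRUE)[[1]]
  list(
    lower     = as.numeric(bounds[1]),
    upper     = as.numeric(bounds[2]),
    incl_low  = substr(x, 1, 1) == "[",       # "[" includes the lower bound
    incl_high = substr(x, n, n) == "]"        # "]" includes the upper bound
  )
}

iv <- parse_interval("[0; 100)")
str(iv)
```

A bound written as Inf is converted by as.numeric() to R's infinite value, so open-ended intervals need no special handling in this sketch.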

This table shows the metadata columns of the example data that are required for this implementation:

VAR_NAMES LABEL MISSING_LIST JUMP_LIST HARD_LIMITS
4 v00003 AGE_0 NA NA [18;Inf)
39 v00030 MEDICATION_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 NA [0;1]
1 v00000 CENTER_0 NA NA NA
34 v00025 SMOKE_SHOP_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 NA [0;4]
23 v00016 DEV_NO_0 NA NA NA
43 v40000 PART_INTERVIEW NA NA NA
14 v00009 ARM_CIRC_0 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 NA [0;Inf)
18 v00012 USR_BP_0 99981 | 99982 NA NA
33 v00024 SMOKING_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 NA [0;1]
21 v00014 CRP_0 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99988 | 99989 | 99990 | 99991 | 99992 | 99994 | 99995 NA [0;Inf)


However, this function can also be used with other columns of the metadata that contain limit definitions according to the conventions mentioned above. Currently, SOFT_LIMITS and DETECTION_LIMITS are also handled by the function.

For selected response variables

The function can be applied to selected variables by passing a vector of response variables. The output comprises two tables as well as one plot for each selected variable. The function checks whether the respective limits are specified for each selected variable. If not, a warning is issued.

limit_deviations_1 <- con_limit_deviations(resp_vars  = c("AGE_0", "SBP_0", "SEX_0"),
                                      label_col  = "LABEL",
                                      study_data = sd1,
                                      meta_data  = md1,
                                      limits     = "HARD_LIMITS", 
                                      return_flagged_study_data = TRUE)

Output 1: FlaggedStudyData

The first table is related to the study data by a 1:1 relationship, i.e. for each observation it is checked whether the value is below, within, or above the limits. Display it with limit_deviations_1$FlaggedStudyData:

PSEUDO_ID QUEST_DT_0 PART_QUESTIONNAIRE AGE_0_HARD_LIMITS SBP_0_HARD_LIMITS SBP_0_DETECTION_LIMITS SBP_0_SOFT_LIMITS
LEIIX715 2018-01-16 00:00:00 1 within within within within
QHNKM456 2018-01-13 00:00:00 1 within within within within
HTAOB589 2018-01-16 02:54:46 1 within within within within
HNHFV585 2018-01-11 05:49:33 1 within within within within
UTDLS949 2018-01-13 05:49:33 1 within within within within
YQFGE692 2018-01-14 08:44:20 1 within within within within
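The per-observation check can be pictured as follows. This is a simplified sketch in base R under stated assumptions, not the package's own code: the helper flag_limits is hypothetical, the interval is treated as closed on both sides, the values are made up, and the labels "below" and "above" are assumed by analogy with the "within" label shown in the table.

```r
# Hypothetical helper: classify each value against a closed interval
# [lower; upper]; missing values stay missing.
flag_limits <- function(x, lower, upper) {
  out <- rep("within", length(x))
  out[!is.na(x) & x < lower] <- "below"
  out[!is.na(x) & x > upper] <- "above"
  out[is.na(x)] <- NA_character_
  factor(out, levels = c("below", "within", "above"))
}

age <- c(17, 18, 45, NA, 130)   # made-up values, not taken from sd1
age_flags <- flag_limits(age, lower = 18, upper = 120)
age_flags
```

Because one flag is produced per input value, the result has the same length as the study data column, which is what the 1:1 relationship refers to.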


Output 2: SummaryData

The second table summarizes this information for each variable. Use limit_deviations_1$SummaryData to display it:

Variables Limits Below.limits-N (%) Within.limits-N (%) Above.limits-N (%)
1 AGE_0 HARD_LIMITS 0 (0) 2940 (100) 0 (0)
4 SBP_0 HARD_LIMITS 0 (0) 2561 (100) 0 (0)
7 SBP_0 DETECTION_LIMITS 0 (0) 2561 (100) 0 (0)
10 SBP_0 SOFT_LIMITS 0 (0) 2561 (100) 0 (0)
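The "N (%)" cells in this table can be reproduced from per-observation flags. The sketch below uses base R and made-up flags; rounding percentages to two decimals is an assumption inferred from the tables shown on this page, not a documented property of the function.

```r
# Made-up flags for five observations of one variable
flags <- factor(c("within", "within", "above", "below", "within"),
                levels = c("below", "within", "above"))

counts <- as.vector(table(flags))               # N per category
pct    <- round(100 * counts / sum(counts), 2)  # share of non-missing values
cells  <- paste0(counts, " (", pct, ")")        # "N (%)" strings
names(cells) <- paste0(c("Below", "Within", "Above"), ".limits-N (%)")
cells
```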

Output 3: SummaryPlotList

The plot for each variable is either a histogram (continuous) or a barplot (discrete). All plots are collected in a list, which is accessed via limit_deviations_1$SummaryPlotList.

Output 4: ModifiedStudyData

The fourth output object is a data frame similar to the study data, except that values deviating from the limits have been removed. Access it using limit_deviations_1$ModifiedStudyData.
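As a rough illustration of what removing limit deviations can mean, the sketch below replaces out-of-range values with NA while keeping all rows. Both the toy data and the assumption that removal is represented by NA are illustrative; the limits 18 and 120 are made up.

```r
# Toy study data (made up); the assumed hard limits are [18; 120]
sd_toy <- data.frame(ID = 1:5, AGE_0 = c(17, 18, 45, 60, 130))
lower <- 18
upper <- 120

modified <- sd_toy
outside <- !is.na(modified$AGE_0) &
  (modified$AGE_0 < lower | modified$AGE_0 > upper)
modified$AGE_0[outside] <- NA   # assumption: removed values become NA
modified
```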

Without specification of response variables

It is not necessary to specify variables. In this case the function searches for all variables with defined limits. If the function identifies limit deviations, the respective values are removed in the ModifiedStudyData data frame.

limit_deviations_2 <- con_limit_deviations(label_col  = "LABEL",
                                      study_data = sd1,
                                      meta_data  = md1,
                                      limits     = "HARD_LIMITS")

Output 2: Summary data table

Variables Limits Below.limits-N (%) Within.limits-N (%) Above.limits-N (%)
1 AGE_0 HARD_LIMITS 0 (0) 2940 (100) 0 (0)
4 AGE_1 HARD_LIMITS 0 (0) 2940 (100) 0 (0)
7 SBP_0 HARD_LIMITS 0 (0) 2561 (100) 0 (0)
10 SBP_0 DETECTION_LIMITS 0 (0) 2561 (100) 0 (0)
13 SBP_0 SOFT_LIMITS 0 (0) 2561 (100) 0 (0)
16 DBP_0 HARD_LIMITS 0 (0) 2544 (100) 0 (0)
19 DBP_0 DETECTION_LIMITS 0 (0) 2544 (100) 0 (0)
22 DBP_0 SOFT_LIMITS 3 (0.12) 2470 (97.09) 71 (2.79)
25 GLOBAL_HEALTH_VAS_0 HARD_LIMITS 0 (0) 2618 (100) 0 (0)
28 GLOBAL_HEALTH_VAS_0 SOFT_LIMITS 257 (9.82) 2090 (79.83) 271 (10.35)
31 ASTHMA_0 HARD_LIMITS 0 (0) 2641 (100) 0 (0)
34 ARM_CIRC_0 HARD_LIMITS 0 (0) 2657 (100) 0 (0)
37 ARM_CIRC_0 SOFT_LIMITS 0 (0) 2657 (100) 0 (0)
40 ARM_CIRC_DISC_0 HARD_LIMITS 0 (0) 2633 (100) 0 (0)
43 ARM_CUFF_0 HARD_LIMITS 0 (0) 2623 (100) 0 (0)
46 EXAM_DT_0 HARD_LIMITS 0 (0) 2940 (100) 0 (0)
49 CRP_0 HARD_LIMITS 0 (0) 2699 (100) 0 (0)
52 CRP_0 DETECTION_LIMITS 5 (0.19) 2694 (99.81) 0 (0)
55 CRP_0 SOFT_LIMITS 130 (4.82) 2561 (94.89) 8 (0.3)
58 BSG_0 HARD_LIMITS 0 (0) 2686 (100) 0 (0)
61 BSG_0 SOFT_LIMITS 92 (3.43) 2264 (84.29) 330 (12.29)
64 LAB_DT_0 HARD_LIMITS 0 (0) 2940 (100) 0 (0)
67 EDUCATION_0 HARD_LIMITS 0 (0) 2472 (100) 0 (0)
70 EDUCATION_1 HARD_LIMITS 0 (0) 2422 (99.88) 3 (0.12)
73 MARRIED_0 HARD_LIMITS 0 (0) 2366 (100) 0 (0)
76 N_CHILD_0 SOFT_LIMITS 0 (0) 2249 (96.28) 87 (3.72)
79 EATING_PREFS_0 HARD_LIMITS 0 (0) 2328 (100) 0 (0)
82 MEAT_CONS_0 HARD_LIMITS 0 (0) 2302 (100) 0 (0)
85 SMOKING_0 HARD_LIMITS 0 (0) 2292 (100) 0 (0)
88 SMOKE_SHOP_0 HARD_LIMITS 0 (0) 782 (97.02) 24 (2.98)
91 N_INJURIES_0 SOFT_LIMITS 0 (0) 2161 (98.27) 38 (1.73)
94 N_BIRTH_0 SOFT_LIMITS 0 (0) 1098 (99.91) 1 (0.09)
97 PREGNANT_0 HARD_LIMITS 0 (0) 1065 (100) 0 (0)
100 MEDICATION_0 HARD_LIMITS 0 (0) 292 (45.55) 349 (54.45)
103 N_ATC_CODES_0 HARD_LIMITS 0 (0) 2058 (100) 0 (0)
106 INT_DT_0 HARD_LIMITS 0 (0) 2940 (100) 0 (0)
109 ITEM_1_0 HARD_LIMITS 0 (0) 2248 (100) 0 (0)
112 ITEM_2_0 HARD_LIMITS 0 (0) 2197 (100) 0 (0)
115 ITEM_3_0 HARD_LIMITS 0 (0) 2184 (100) 0 (0)
118 ITEM_4_0 HARD_LIMITS 0 (0) 2143 (100) 0 (0)
121 ITEM_5_0 HARD_LIMITS 0 (0) 2074 (100) 0 (0)
124 ITEM_6_0 HARD_LIMITS 0 (0) 2048 (100) 0 (0)
127 ITEM_7_0 HARD_LIMITS 0 (0) 2068 (100) 0 (0)
130 ITEM_8_0 HARD_LIMITS 0 (0) 2013 (100) 0 (0)
133 QUEST_DT_0 HARD_LIMITS 9 (0.31) 2931 (99.69) 0 (0)


Output 3: Plot List

Here, only five plots are displayed. However, for each variable with limits, a plot has been generated.

Variables of type datetime

The con_limit_deviations function can also be applied to datetime variables:

limit_deviations_3 <- con_limit_deviations(resp_vars  = c("QUEST_DT_0"),
                                      label_col  = "LABEL",
                                      study_data = sd1,
                                      meta_data  = md1,
                                      limits     = "HARD_LIMITS")

Output 2: Summary Data

Variables Limits Below.limits-N (%) Within.limits-N (%) Above.limits-N (%)
QUEST_DT_0 HARD_LIMITS 9 (0.31) 2931 (99.69) 0 (0)
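Datetime values can be compared against limits in the same way as numbers once both sides are converted to a datetime class. The following base R sketch uses made-up timestamps and an assumed lower limit; it does not reproduce the limit stored in the example metadata.

```r
# Made-up timestamps and an assumed lower limit, both in UTC
quest_dt <- as.POSIXct(c("2017-12-30 10:00:00", "2018-01-16 00:00:00"),
                       tz = "UTC")
lower <- as.POSIXct("2018-01-01 00:00:00", tz = "UTC")

below_limit <- quest_dt < lower   # TRUE where the timestamp precedes the limit
below_limit
```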

Output 3: Plot List

Interpretation

The definition of HARD_LIMITS is a common issue in the data curation process. For example, values of a numeric rating scale (0 - 10) should not exceed these limits, and values outside these limits must be removed or at least verified, as they certainly represent incorrect measurements. Nevertheless, there are measurements for which the definition of such limits is difficult. In this case the alternative definition of SOFT_LIMITS is recommended.

Algorithm of the implementation

  1. Missing codes are removed from the study data (if defined in the metadata).
  2. The variable-specific intervals supplied in the metadata are interpreted.
  3. Measurements outside the defined limits are identified. For this purpose, two output data frames are generated:
    • one at the level of observations, flagging each deviation, and
    • a summary table for each variable.
  4. A list of plots is generated, one for each variable examined for limit deviations. The histogram-like plots indicate the respective limits as well as deviations.
  5. Values exceeding the limits are removed in a data frame of modified study data.
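Step 1 matters because missing codes such as 99980 would otherwise be counted as values above the limits. A minimal sketch of this step in base R is given below; it assumes that the codes are stored as a single string separated by "|", as in the MISSING_LIST column shown earlier, and it uses made-up measurements. It is not the package's internal implementation.

```r
# MISSING_LIST entry in the format shown in the metadata table
missing_list <- "99980 | 99983 | 99988"

# Split at "|" and convert the trimmed pieces to numbers
codes <- as.numeric(trimws(strsplit(missing_list, "|", fixed = TRUE)[[1]]))

x <- c(0, 1, 99980, 1, 99988, 0)   # made-up measurements with missing codes
x[x %in% codes] <- NA              # missing codes become NA before the check
x
```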

Concept relations