The function con_limit_deviations examines the admissibility or uncertainty of numerical study data according to the intervals defined in the metadata. The target values can be of type integer, float or datetime. Thus, con_limit_deviations is an implementation of the Inadmissible numerical values and Uncertain numerical values indicators, as well as the Inadmissible time-date values and Uncertain time-date values indicators. These belong to the Range and value violations domain in the Consistency dimension.
For more details, see the user’s manual and the source code.
con_limit_deviations(
resp_vars = NULL,
label_col = NULL,
study_data = sd1,
meta_data = md1,
limits = c("HARD_LIMITS", "SOFT_LIMITS", "DETECTION_LIMITS")
)
The con_limit_deviations function takes the arguments shown in the call above: resp_vars, label_col, study_data, meta_data, and limits (one or more of HARD_LIMITS, SOFT_LIMITS, or DETECTION_LIMITS).
This implementation makes no use of thresholds.
CAVEAT:
The name of this function deviates from other implementations. The reason is that the function is generic: it can process different types of limits, e.g. SOFT_LIMITS or DETECTION_LIMITS. This requires that all limits are defined in the same way, as shown in the next example.
To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.
For the con_limit_deviations function, the columns HARD_LIMITS, MISSING_LIST and JUMP_LIST in the metadata are particularly relevant.
HARD_LIMITS have to be defined as intervals:
\([0; 100]\): any value between 0 and 100, including 0 and 100
\((0; 100)\): any value between 0 and 100, not including 0 or 100
\([0; Inf)\): any non-negative numerical value (0 or greater)
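The interval notation above can be read mechanically. The following is a minimal sketch in base R, assuming limits are given as strings such as "[0;100]"; the helpers parse_interval and in_interval are hypothetical illustrations and not part of dataquieR.

```r
# Hypothetical helper, not part of dataquieR: parse an interval string such as
# "[0;100]", "(0;100)" or "[0;Inf)" into bounds and open/closed flags.
parse_interval <- function(x) {
  x <- gsub("\\s+", "", x)                  # drop whitespace, e.g. "[0; 100]"
  n <- nchar(x)
  left  <- substr(x, 1, 1)                  # "[" (closed) or "(" (open)
  right <- substr(x, n, n)                  # "]" (closed) or ")" (open)
  stopifnot(left %in% c("[", "("), right %in% c("]", ")"))
  bounds <- strsplit(substr(x, 2, n - 1), ";", fixed = TRUE)[[1]]
  stopifnot(length(bounds) == 2)
  list(
    lower = as.numeric(bounds[1]),          # as.numeric("Inf") yields Inf
    upper = as.numeric(bounds[2]),
    lower_closed = left == "[",
    upper_closed = right == "]"
  )
}

# TRUE where a value lies inside the interval, respecting open/closed ends.
in_interval <- function(values, interval) {
  i <- parse_interval(interval)
  above_lower <- if (i$lower_closed) values >= i$lower else values > i$lower
  below_upper <- if (i$upper_closed) values <= i$upper else values < i$upper
  above_lower & below_upper
}

in_interval(c(0, 50, 100), "[0;100]")   # closed: both end points are admissible
in_interval(c(0, 50, 100), "(0;100)")   # open: both end points are excluded
in_interval(c(-1, 0, 1e6), "[0;Inf)")   # no finite upper bound
```

The sketch only covers numeric bounds; how dataquieR itself parses the metadata may differ.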
This table shows the metadata defined for the example data that are required for this implementation:
Row | VAR_NAMES | LABEL | MISSING_LIST | JUMP_LIST | HARD_LIMITS |
---|---|---|---|---|---|
4 | v00003 | AGE_0 | NA | NA | [18;Inf) |
39 | v00030 | MEDICATION_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | NA | [0;1] |
1 | v00000 | CENTER_0 | NA | NA | NA |
34 | v00025 | SMOKE_SHOP_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | NA | [0;4] |
23 | v00016 | DEV_NO_0 | NA | NA | NA |
43 | v40000 | PART_INTERVIEW | NA | NA | NA |
14 | v00009 | ARM_CIRC_0 | 99980 \| 99981 \| 99982 \| 99983 \| 99984 \| 99985 \| 99986 \| 99987 \| 99988 \| 99989 \| 99990 \| 99991 \| 99992 \| 99993 \| 99994 \| 99995 | NA | [0;Inf) |
18 | v00012 | USR_BP_0 | 99981 \| 99982 | NA | NA |
33 | v00024 | SMOKING_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | NA | [0;1] |
21 | v00014 | CRP_0 | 99980 \| 99981 \| 99982 \| 99983 \| 99984 \| 99985 \| 99986 \| 99988 \| 99989 \| 99990 \| 99991 \| 99992 \| 99994 \| 99995 | NA | [0;Inf) |
However, this function can also be used with other columns of the metadata that contain limit definitions according to the conventions mentioned above. Currently, SOFT_LIMITS and DETECTION_LIMITS are also handled by the function.
The function can be applied to selected variables by passing a vector of response variables. The output comprises two tables and plots for each selected variable. The function checks whether the respective limits are specified for each selected variable. If not, a warning is issued.
limit_deviations_1 <- con_limit_deviations(resp_vars = c("AGE_0", "SBP_0", "SEX_0"),
label_col = "LABEL",
study_data = sd1,
meta_data = md1,
limits = "HARD_LIMITS")
Output 1: FlaggedStudyData
The first table is related to the study data by a 1:1 relationship, i.e. each observation is checked for whether its value is below or above the limits. Display it with limit_deviations_1$FlaggedStudyData:
Output 2: SummaryData
The second table summarizes this information for each variable. Use limit_deviations_1$SummaryData to display it:
Variables | Section | Limits | Number | Percentage |
---|---|---|---|---|
AGE_0 | below | HARD_LIMITS | 0 | 0 |
AGE_0 | within | HARD_LIMITS | 2940 | 100 |
AGE_0 | above | HARD_LIMITS | 0 | 0 |
SBP_0 | below | HARD_LIMITS | 0 | 0 |
SBP_0 | within | HARD_LIMITS | 2561 | 100 |
SBP_0 | above | HARD_LIMITS | 0 | 0 |
SBP_0 | below | DETECTION_LIMITS | 0 | 0 |
SBP_0 | within | DETECTION_LIMITS | 2561 | 100 |
SBP_0 | above | DETECTION_LIMITS | 0 | 0 |
SBP_0 | below | SOFT_LIMITS | 0 | 0 |
SBP_0 | within | SOFT_LIMITS | 2561 | 100 |
SBP_0 | above | SOFT_LIMITS | 0 | 0 |
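A summary of this shape can be approximated in base R. The sketch below uses a hypothetical helper summarize_limits (not part of dataquieR) with made-up values and assumes closed intervals. It also assumes that percentages refer to the non-missing values, which appears consistent with the tables shown in this tutorial.

```r
# Hypothetical helper, not part of dataquieR: count values below, within and
# above a closed interval [lower; upper]; missing values are ignored.
summarize_limits <- function(values, lower, upper, variable, limits) {
  values <- values[!is.na(values)]
  section <- ifelse(values < lower, "below",
                    ifelse(values > upper, "above", "within"))
  # A factor with fixed levels keeps empty sections in the table as 0 counts.
  counts <- table(factor(section, levels = c("below", "within", "above")))
  data.frame(
    Variables  = variable,
    Section    = names(counts),
    Limits     = limits,
    Number     = as.integer(counts),
    Percentage = round(100 * as.integer(counts) / length(values), 2)
  )
}

# Made-up measurements, not taken from the example study data.
x <- c(55, 62, 70, 85, 90, 101, NA, 48)
summary_x <- summarize_limits(x, lower = 50, upper = 100,
                              variable = "X_0", limits = "SOFT_LIMITS")
summary_x
```

Using a factor with fixed levels is a small design choice that guarantees three rows per variable, even when no value falls below or above the limits.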
Output 3: SummaryPlotList
The plot for each variable is either a histogram (continuous) or a barplot (discrete). All plots are collected in a list, which is accessed via limit_deviations_1$SummaryPlotList.
Output 4: ModifiedStudyData
The fourth output object is a data frame similar to the study data, but with limit deviations removed. Access it using limit_deviations_1$ModifiedStudyData.
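To make the idea of removing limit deviations concrete, here is a minimal base R sketch. It assumes that removal means replacing the offending values with NA; the helper remove_limit_deviations and the data are hypothetical and do not reproduce the package's internal code.

```r
# Hypothetical helper, not part of dataquieR: set values outside the closed
# interval [lower; upper] to NA and leave all other values untouched.
remove_limit_deviations <- function(values, lower = -Inf, upper = Inf) {
  outside <- !is.na(values) & (values < lower | values > upper)
  values[outside] <- NA
  values
}

# Made-up study data; the limits mirror an interval such as [0;4].
study <- data.frame(id = 1:5, score = c(0, 2, 7, NA, 4))
modified <- study
modified$score <- remove_limit_deviations(study$score, lower = 0, upper = 4)
modified
```

The number of rows stays the same, so the modified data remain aligned with the original study data.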
It is not necessary to specify variables. In this case, the function searches for all numeric variables with defined limits. If the function identifies limit deviations, the respective values are removed in the ModifiedStudyData data frame.
limit_deviations_2 <- con_limit_deviations(label_col = "LABEL",
study_data = sd1,
meta_data = md1,
limits = "HARD_LIMITS")
## Did not find any 'SCALE_LEVEL' column in item-level meta_data. Predicting it from the data -- please verify these predictions, they may be wrong and lead to functions claiming not to be reasonably applicable to a variable.
## All variables for which limits are specified in the metadata are used.
Output 2: Summary data table
Variables | Section | Limits | Number | Percentage |
---|---|---|---|---|
AGE_0 | below | HARD_LIMITS | 0 | 0.00 |
AGE_0 | within | HARD_LIMITS | 2940 | 100.00 |
AGE_0 | above | HARD_LIMITS | 0 | 0.00 |
AGE_1 | below | HARD_LIMITS | 0 | 0.00 |
AGE_1 | within | HARD_LIMITS | 2940 | 100.00 |
AGE_1 | above | HARD_LIMITS | 0 | 0.00 |
SBP_0 | below | HARD_LIMITS | 0 | 0.00 |
SBP_0 | within | HARD_LIMITS | 2561 | 100.00 |
SBP_0 | above | HARD_LIMITS | 0 | 0.00 |
SBP_0 | below | DETECTION_LIMITS | 0 | 0.00 |
SBP_0 | within | DETECTION_LIMITS | 2561 | 100.00 |
SBP_0 | above | DETECTION_LIMITS | 0 | 0.00 |
SBP_0 | below | SOFT_LIMITS | 0 | 0.00 |
SBP_0 | within | SOFT_LIMITS | 2561 | 100.00 |
SBP_0 | above | SOFT_LIMITS | 0 | 0.00 |
DBP_0 | below | HARD_LIMITS | 0 | 0.00 |
DBP_0 | within | HARD_LIMITS | 2544 | 100.00 |
DBP_0 | above | HARD_LIMITS | 0 | 0.00 |
DBP_0 | below | DETECTION_LIMITS | 0 | 0.00 |
DBP_0 | within | DETECTION_LIMITS | 2544 | 100.00 |
DBP_0 | above | DETECTION_LIMITS | 0 | 0.00 |
DBP_0 | below | SOFT_LIMITS | 3 | 0.12 |
DBP_0 | within | SOFT_LIMITS | 2470 | 97.09 |
DBP_0 | above | SOFT_LIMITS | 71 | 2.79 |
GLOBAL_HEALTH_VAS_0 | below | HARD_LIMITS | 0 | 0.00 |
GLOBAL_HEALTH_VAS_0 | within | HARD_LIMITS | 2618 | 100.00 |
GLOBAL_HEALTH_VAS_0 | above | HARD_LIMITS | 0 | 0.00 |
GLOBAL_HEALTH_VAS_0 | below | SOFT_LIMITS | 257 | 9.82 |
GLOBAL_HEALTH_VAS_0 | within | SOFT_LIMITS | 2090 | 79.83 |
GLOBAL_HEALTH_VAS_0 | above | SOFT_LIMITS | 271 | 10.35 |
ASTHMA_0 | below | HARD_LIMITS | 0 | 0.00 |
ASTHMA_0 | within | HARD_LIMITS | 2641 | 100.00 |
ASTHMA_0 | above | HARD_LIMITS | 0 | 0.00 |
ARM_CIRC_0 | below | HARD_LIMITS | 0 | 0.00 |
ARM_CIRC_0 | within | HARD_LIMITS | 2657 | 100.00 |
ARM_CIRC_0 | above | HARD_LIMITS | 0 | 0.00 |
ARM_CIRC_0 | below | SOFT_LIMITS | 0 | 0.00 |
ARM_CIRC_0 | within | SOFT_LIMITS | 2657 | 100.00 |
ARM_CIRC_0 | above | SOFT_LIMITS | 0 | 0.00 |
ARM_CIRC_DISC_0 | below | HARD_LIMITS | 0 | 0.00 |
ARM_CIRC_DISC_0 | within | HARD_LIMITS | 2633 | 100.00 |
ARM_CIRC_DISC_0 | above | HARD_LIMITS | 0 | 0.00 |
ARM_CUFF_0 | below | HARD_LIMITS | 0 | 0.00 |
ARM_CUFF_0 | within | HARD_LIMITS | 2623 | 100.00 |
ARM_CUFF_0 | above | HARD_LIMITS | 0 | 0.00 |
EXAM_DT_0 | below | HARD_LIMITS | 0 | 0.00 |
EXAM_DT_0 | within | HARD_LIMITS | 2940 | 100.00 |
EXAM_DT_0 | above | HARD_LIMITS | 0 | 0.00 |
CRP_0 | below | HARD_LIMITS | 0 | 0.00 |
CRP_0 | within | HARD_LIMITS | 2699 | 100.00 |
CRP_0 | above | HARD_LIMITS | 0 | 0.00 |
CRP_0 | below | DETECTION_LIMITS | 5 | 0.19 |
CRP_0 | within | DETECTION_LIMITS | 2694 | 99.81 |
CRP_0 | above | DETECTION_LIMITS | 0 | 0.00 |
CRP_0 | below | SOFT_LIMITS | 130 | 4.82 |
CRP_0 | within | SOFT_LIMITS | 2561 | 94.89 |
CRP_0 | above | SOFT_LIMITS | 8 | 0.30 |
BSG_0 | below | HARD_LIMITS | 0 | 0.00 |
BSG_0 | within | HARD_LIMITS | 2686 | 100.00 |
BSG_0 | above | HARD_LIMITS | 0 | 0.00 |
BSG_0 | below | SOFT_LIMITS | 92 | 3.43 |
BSG_0 | within | SOFT_LIMITS | 2264 | 84.29 |
BSG_0 | above | SOFT_LIMITS | 330 | 12.29 |
LAB_DT_0 | below | HARD_LIMITS | 0 | 0.00 |
LAB_DT_0 | within | HARD_LIMITS | 2940 | 100.00 |
LAB_DT_0 | above | HARD_LIMITS | 0 | 0.00 |
EDUCATION_0 | below | HARD_LIMITS | 0 | 0.00 |
EDUCATION_0 | within | HARD_LIMITS | 2472 | 100.00 |
EDUCATION_0 | above | HARD_LIMITS | 0 | 0.00 |
EDUCATION_1 | below | HARD_LIMITS | 0 | 0.00 |
EDUCATION_1 | within | HARD_LIMITS | 2422 | 99.88 |
EDUCATION_1 | above | HARD_LIMITS | 3 | 0.12 |
MARRIED_0 | below | HARD_LIMITS | 0 | 0.00 |
MARRIED_0 | within | HARD_LIMITS | 2366 | 100.00 |
MARRIED_0 | above | HARD_LIMITS | 0 | 0.00 |
N_CHILD_0 | below | SOFT_LIMITS | 0 | 0.00 |
N_CHILD_0 | within | SOFT_LIMITS | 2249 | 96.28 |
N_CHILD_0 | above | SOFT_LIMITS | 87 | 3.72 |
EATING_PREFS_0 | below | HARD_LIMITS | 0 | 0.00 |
EATING_PREFS_0 | within | HARD_LIMITS | 2328 | 100.00 |
EATING_PREFS_0 | above | HARD_LIMITS | 0 | 0.00 |
MEAT_CONS_0 | below | HARD_LIMITS | 0 | 0.00 |
MEAT_CONS_0 | within | HARD_LIMITS | 2302 | 100.00 |
MEAT_CONS_0 | above | HARD_LIMITS | 0 | 0.00 |
SMOKING_0 | below | HARD_LIMITS | 0 | 0.00 |
SMOKING_0 | within | HARD_LIMITS | 2292 | 100.00 |
SMOKING_0 | above | HARD_LIMITS | 0 | 0.00 |
SMOKE_SHOP_0 | below | HARD_LIMITS | 0 | 0.00 |
SMOKE_SHOP_0 | within | HARD_LIMITS | 782 | 97.02 |
SMOKE_SHOP_0 | above | HARD_LIMITS | 24 | 2.98 |
N_INJURIES_0 | below | SOFT_LIMITS | 0 | 0.00 |
N_INJURIES_0 | within | SOFT_LIMITS | 2161 | 98.27 |
N_INJURIES_0 | above | SOFT_LIMITS | 38 | 1.73 |
N_BIRTH_0 | below | SOFT_LIMITS | 0 | 0.00 |
N_BIRTH_0 | within | SOFT_LIMITS | 1098 | 99.91 |
N_BIRTH_0 | above | SOFT_LIMITS | 1 | 0.09 |
PREGNANT_0 | below | HARD_LIMITS | 0 | 0.00 |
PREGNANT_0 | within | HARD_LIMITS | 1065 | 100.00 |
PREGNANT_0 | above | HARD_LIMITS | 0 | 0.00 |
MEDICATION_0 | below | HARD_LIMITS | 0 | 0.00 |
MEDICATION_0 | within | HARD_LIMITS | 292 | 45.55 |
MEDICATION_0 | above | HARD_LIMITS | 349 | 54.45 |
N_ATC_CODES_0 | below | HARD_LIMITS | 0 | 0.00 |
N_ATC_CODES_0 | within | HARD_LIMITS | 2058 | 100.00 |
N_ATC_CODES_0 | above | HARD_LIMITS | 0 | 0.00 |
INT_DT_0 | below | HARD_LIMITS | 0 | 0.00 |
INT_DT_0 | within | HARD_LIMITS | 2940 | 100.00 |
INT_DT_0 | above | HARD_LIMITS | 0 | 0.00 |
ITEM_1_0 | below | HARD_LIMITS | 0 | 0.00 |
ITEM_1_0 | within | HARD_LIMITS | 2248 | 100.00 |
ITEM_1_0 | above | HARD_LIMITS | 0 | 0.00 |
ITEM_2_0 | below | HARD_LIMITS | 0 | 0.00 |
ITEM_2_0 | within | HARD_LIMITS | 2197 | 100.00 |
ITEM_2_0 | above | HARD_LIMITS | 0 | 0.00 |
ITEM_3_0 | below | HARD_LIMITS | 0 | 0.00 |
ITEM_3_0 | within | HARD_LIMITS | 2184 | 100.00 |
ITEM_3_0 | above | HARD_LIMITS | 0 | 0.00 |
ITEM_4_0 | below | HARD_LIMITS | 0 | 0.00 |
ITEM_4_0 | within | HARD_LIMITS | 2143 | 100.00 |
ITEM_4_0 | above | HARD_LIMITS | 0 | 0.00 |
ITEM_5_0 | below | HARD_LIMITS | 0 | 0.00 |
ITEM_5_0 | within | HARD_LIMITS | 2074 | 100.00 |
ITEM_5_0 | above | HARD_LIMITS | 0 | 0.00 |
ITEM_6_0 | below | HARD_LIMITS | 0 | 0.00 |
ITEM_6_0 | within | HARD_LIMITS | 2048 | 100.00 |
ITEM_6_0 | above | HARD_LIMITS | 0 | 0.00 |
ITEM_7_0 | below | HARD_LIMITS | 0 | 0.00 |
ITEM_7_0 | within | HARD_LIMITS | 2068 | 100.00 |
ITEM_7_0 | above | HARD_LIMITS | 0 | 0.00 |
ITEM_8_0 | below | HARD_LIMITS | 0 | 0.00 |
ITEM_8_0 | within | HARD_LIMITS | 2013 | 100.00 |
ITEM_8_0 | above | HARD_LIMITS | 0 | 0.00 |
QUEST_DT_0 | below | HARD_LIMITS | 9 | 0.31 |
QUEST_DT_0 | within | HARD_LIMITS | 2931 | 99.69 |
QUEST_DT_0 | above | HARD_LIMITS | 0 | 0.00 |
Output 3: Plot List
Here, only five plots are displayed. However, for each variable with limits, a plot has been generated.
datetime
The con_limit_deviations function can also be applied to datetime variables:
limit_deviations_3 <- con_limit_deviations(resp_vars = c("QUEST_DT_0"),
label_col = "LABEL",
study_data = sd1,
meta_data = md1,
limits = "HARD_LIMITS")
## Did not find any 'SCALE_LEVEL' column in item-level meta_data. Predicting it from the data -- please verify these predictions, they may be wrong and lead to functions claiming not to be reasonably applicable to a variable.
Output 2: Summary Data
Variables | Section | Limits | Number | Percentage |
---|---|---|---|---|
QUEST_DT_0 | below | HARD_LIMITS | 9 | 0.31 |
QUEST_DT_0 | within | HARD_LIMITS | 2931 | 99.69 |
QUEST_DT_0 | above | HARD_LIMITS | 0 | 0.00 |
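The same below/within/above logic carries over to time stamps. The sketch below is a base R illustration with made-up dates and a made-up lower limit. The helper section_dt is hypothetical, and the way datetime limits are written in the metadata is not shown here, so the limits are passed directly as POSIXct values.

```r
# Hypothetical helper, not part of dataquieR: assign each time stamp to a
# section relative to optional lower and upper limits (both inclusive).
section_dt <- function(values, lower = NULL, upper = NULL) {
  section <- rep("within", length(values))
  if (!is.null(lower)) section[which(values < lower)] <- "below"
  if (!is.null(upper)) section[which(values > upper)] <- "above"
  section[is.na(values)] <- NA              # missing dates stay missing
  factor(section, levels = c("below", "within", "above"))
}

# Made-up time stamps and limit; neither comes from the example data.
quest_dt <- as.POSIXct(
  c("2017-12-30 10:15:00", "2018-01-05 09:00:00", "2018-03-12 14:30:00"),
  tz = "UTC"
)
lower_limit <- as.POSIXct("2018-01-01 00:00:00", tz = "UTC")

dt_counts <- table(section_dt(quest_dt, lower = lower_limit))
dt_counts
```

Fixing the time zone explicitly avoids comparisons that silently depend on the local settings of the machine running the code.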
Output 3: Plot List
Defining HARD_LIMITS is a common issue in the data curation process. For example, values of a numeric rating scale (0 - 10) should not exceed these limits, and values outside them must be removed or at least verified because they are certainly incorrect measurements. Nevertheless, for some measurements such limits are difficult to define. In this case, defining SOFT_LIMITS instead is recommended.
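The distinction between the two kinds of limits can be sketched in base R as follows. The helper classify_value, the soft range and the ratings are hypothetical, and the sketch assumes that hard-limit violations count as inadmissible while soft-limit violations inside the hard limits count as uncertain.

```r
# Hypothetical helper, not part of dataquieR: classify values against closed
# hard and soft intervals, each given as c(lower, upper).
classify_value <- function(values, hard, soft) {
  status <- rep("ok", length(values))
  status[which(values < soft[1] | values > soft[2])] <- "uncertain"
  # Hard-limit violations overwrite the weaker "uncertain" label.
  status[which(values < hard[1] | values > hard[2])] <- "inadmissible"
  status[is.na(values)] <- NA
  status
}

# Made-up ratings on a 0 to 10 scale with a made-up soft range of 2 to 8.
ratings <- c(0, 3, 9, 11, NA, -1)
rating_status <- classify_value(ratings, hard = c(0, 10), soft = c(2, 8))
rating_status
```

In such a scheme, inadmissible values would be removed or corrected, whereas uncertain values would be kept and flagged for review.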