Description

The function con_inadmissible_categorical examines whether all observed levels in the study data are valid according to the value lists defined in the metadata for each categorical variable. It thus implements the Inadmissible categorical values indicator, which belongs to the Range and value violations domain in the Consistency dimension.

For more details, see the user’s manual and the source code.

Usage and arguments

con_inadmissible_categorical(
  resp_vars = NULL,
  study_data = sd1,
  meta_data = md1,
  label_col = NULL,
  threshold = NULL
)

The con_inadmissible_categorical function has the following arguments:

  • resp_vars: a character vector of categorical variables
  • study_data: the name of the data frame that contains the measurements
  • meta_data: the name of the data frame that contains metadata attributes of study data
  • label_col: the column of the metadata that contains the labels, if labels should be used

The threshold argument appears in the signature, but no threshold is implemented.

Example output

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

For the con_inadmissible_categorical function, the columns VALUE_LABELS, MISSING_LIST and JUMP_LIST in the metadata are particularly relevant.

VALUE_LABELS have to be defined as follows:

  • "0 = male | 1 = female"

  • "A = good | B = moderate | C = bad"
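
The format above can be illustrated with a minimal sketch in base R. The helper parse_value_labels is hypothetical and not part of dataquieR; it only assumes that entries are separated by "|" and that each entry has the form "code = label".

```r
# Hypothetical helper (not part of dataquieR): split a VALUE_LABELS string
# such as "0 = no | 1 = yes" into a named character vector (code -> label).
parse_value_labels <- function(x) {
  # Split the string into entries at the pipe separator
  entries <- trimws(strsplit(x, "|", fixed = TRUE)[[1]])
  # Split each entry at the first "=" only, so a label may itself contain "="
  pos    <- as.integer(regexpr("=", entries, fixed = TRUE))
  codes  <- trimws(substr(entries, 1, pos - 1))
  labels <- trimws(substr(entries, pos + 1, nchar(entries)))
  stats::setNames(labels, codes)
}

# Numeric and character codes are handled the same way
parse_value_labels("0 = none | 1 = vegetarian | 2 = vegan")
parse_value_labels("A = good | B = moderate | C = bad ")
```

The names of the returned vector are the admissible codes, which is the part of VALUE_LABELS that matters for this check.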

This table shows the metadata defined for the example data that are required for this implementation:

VAR_NAMES LABEL MISSING_LIST JUMP_LIST VALUE_LABELS
5 v00103 AGE_GROUP_0 NA NA NA
12 v00007 ASTHMA_0 99980 | 99988 | 99989 | 99991 | 99993 | 99994 | 99995 NA 0 = no | 1 = yes
39 v00030 MEDICATION_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 NA 0 = no | 1 = yes
36 v00027 N_BIRTH_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 88880 NA
40 v00031 N_ATC_CODES_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 NA NA
43 v40000 PART_INTERVIEW NA NA 0 = no | 1 = yes
31 v00022 EATING_PREFS_0 99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995 NA 0 = none | 1 = vegetarian | 2 = vegan
8 v10000 PART_STUDY NA NA 0 = no | 1 = yes
20 v20000 PART_PHYS_EXAM NA NA 0 = no | 1 = yes
10 v00005 DBP_0 99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995 NA NA


In this example, resp_vars is not specified, so all variables with assigned VALUE_LABELS will be examined.

IAVCatAll <- con_inadmissible_categorical(study_data = sd1,
                                          meta_data  = md1,
                                          label_col  = "LABEL")
names(IAVCatAll)
## [1] "SummaryData"       "SummaryTable"      "ModifiedStudyData"
## [4] "FlaggedStudyData"

Summary Table:

The SummaryTable object contains a summary of all examined variables/data elements. Those showing categories that were not specified in the metadata are flagged, i.e., the column GRADING has the value 1.

Variables NUM_con_rvv_icat PCT_con_rvv_icat GRADING FLG_con_rvv_icat
CENTER_0 0 0.0 0 FALSE
SEX_0 0 0.0 0 FALSE
SEX_1 0 0.0 0 FALSE
PART_STUDY 0 0.0 0 FALSE
ASTHMA_0 0 0.0 0 FALSE
VO2_CAPCAT_0 0 0.0 0 FALSE
ARM_CIRC_DISC_0 0 0.0 0 FALSE
ARM_CUFF_0 0 0.0 0 FALSE
USR_VO2_0 0 0.0 0 FALSE
USR_BP_0 0 0.0 0 FALSE
PART_PHYS_EXAM 0 0.0 0 FALSE
PART_LAB 0 0.0 0 FALSE
EDUCATION_0 0 0.0 0 FALSE
EDUCATION_1 3 0.1 1 TRUE
FAM_STAT_0 2389 79.6 1 TRUE
MARRIED_0 0 0.0 0 FALSE
EATING_PREFS_0 0 0.0 0 FALSE
MEAT_CONS_0 0 0.0 0 FALSE
SMOKING_0 0 0.0 0 FALSE
SMOKE_SHOP_0 24 0.8 1 TRUE
INCOME_GROUP_0 0 0.0 0 FALSE
PREGNANT_0 0 0.0 0 FALSE
MEDICATION_0 349 11.6 1 TRUE
USR_SOCDEM_0 172 5.7 1 TRUE
PART_INTERVIEW 0 0.0 0 FALSE
PART_QUESTIONNAIRE 0 0.0 0 FALSE


Modified data:

The modified data set is identical to the study data except that inadmissible values have been removed. For example, in EDUCATION_1, the three occurrences of the undefined category “7” have been replaced by NA.


Flagged data:

For each variable with inadmissible values, a separate column is added that flags the observations with inadmissible categories.

EDUCATION_1_IAV FAM_STAT_0_IAV SMOKE_SHOP_0_IAV MEDICATION_0_IAV USR_SOCDEM_0_IAV
0 0 0 0 0
0 1 0 0 0
0 1 0 0 0
0 1 0 0 0
0 0 0 0 1
0 0 0 0 0
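
As a small illustration of how such flag columns can be used, the following base R sketch rebuilds the six rows shown above as a toy data frame (the object name flags is an assumption) and counts the flags per variable and per observation.

```r
# Toy copy of the six rows of flag columns shown above
flags <- data.frame(
  EDUCATION_1_IAV  = c(0, 0, 0, 0, 0, 0),
  FAM_STAT_0_IAV   = c(0, 1, 1, 1, 0, 0),
  SMOKE_SHOP_0_IAV = c(0, 0, 0, 0, 0, 0),
  MEDICATION_0_IAV = c(0, 0, 0, 0, 0, 0),
  USR_SOCDEM_0_IAV = c(0, 0, 0, 0, 1, 0)
)

# Select the flag columns by their "_IAV" suffix
iav_cols <- grep("_IAV$", names(flags), value = TRUE)

# Number of flagged observations per variable
colSums(flags[, iav_cols])

# Row numbers of observations with at least one inadmissible category
which(rowSums(flags[, iav_cols]) > 0)
```

In these six rows, only FAM_STAT_0_IAV (three observations) and USR_SOCDEM_0_IAV (one observation) carry flags.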

Interpretation

The higher the number of inadmissible values, the lower the data quality of the respective data element. Similarly, the higher the number of data elements with inadmissible values, the lower the overall data quality.

Note: If the majority of data values appear inadmissible, the correct specification of metadata should be reviewed.
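
The note above can be turned into a simple screening step. The sketch below uses a toy data frame with the five flagged rows from the summary table. The 50% cutoff is an assumption that stands in for "the majority" and is not a dataquieR setting.

```r
# Toy copy of the flagged rows from the summary table above
summary_tab <- data.frame(
  Variables        = c("EDUCATION_1", "FAM_STAT_0", "SMOKE_SHOP_0",
                       "MEDICATION_0", "USR_SOCDEM_0"),
  PCT_con_rvv_icat = c(0.1, 79.6, 0.8, 11.6, 5.7),
  stringsAsFactors = FALSE
)

# Assumed cutoff for "majority of values inadmissible" (not a package default)
review_cutoff <- 50

# Variables whose metadata specification should be reviewed first
to_review <- summary_tab$Variables[summary_tab$PCT_con_rvv_icat > review_cutoff]
to_review
```

With the values shown, only FAM_STAT_0 exceeds the cutoff, which suggests checking its VALUE_LABELS before interpreting the result as a data problem.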

Algorithm of the implementation

  1. Missing codes are removed from the study data (if defined in the metadata).
  2. The variable-specific VALUE_LABELS supplied in the metadata are interpreted.
  3. Measurements that do not correspond to the expected categories are identified. From this, two output data frames are generated:
    • one on the level of observations, flagging each undefined category, and
    • a summary table with one row per variable.
  4. Values that do not correspond to defined categories are removed in a data frame of modified study data.
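
The steps above can be sketched in base R for a single variable. This is a simplified illustration under stated assumptions, not the package's source code: the toy data, the value lists, and the helper split_list are hypothetical, the percentage is taken over all rows, and missing codes are assumed to stay unchanged in the modified data.

```r
# Toy study data: 99980 is a missing code, 7 is not a defined category
study <- data.frame(EDUCATION_1 = c(0, 1, 7, 99980, 2, NA))

# Toy metadata for this variable, in the format used in the metadata table
missing_list <- "99980 | 99988"
value_labels <- "0 = low | 1 = medium | 2 = high"

# Hypothetical helper: split a "|"-separated list into trimmed entries
split_list <- function(x) trimws(strsplit(x, "|", fixed = TRUE)[[1]])

# Step 1: remove missing codes before checking the categories
x <- study$EDUCATION_1
x[as.character(x) %in% split_list(missing_list)] <- NA

# Step 2: interpret VALUE_LABELS, keeping only the code before each "="
admissible <- sub("\\s*=.*$", "", split_list(value_labels))

# Step 3: identify observed values that are not admissible codes
iav <- !is.na(x) & !(as.character(x) %in% admissible)
flagged <- data.frame(EDUCATION_1_IAV = as.integer(iav))
summary_tab <- data.frame(
  Variables = "EDUCATION_1",
  NUM       = sum(iav),
  PCT       = round(100 * mean(iav), 1)  # assumption: share of all rows
)

# Step 4: set inadmissible values to NA in a copy of the study data
# (assumption: missing codes are left as they are in this copy)
modified <- study
modified$EDUCATION_1[iav] <- NA

flagged
summary_tab
modified
```

Here only the value 7 is flagged. The missing code 99980 is excluded in step 1 and therefore not counted as inadmissible.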

Concept relations