Description

Item missingness (also called item nonresponse; De Leeuw et al. 2003) describes the missingness of single values in a data set, such as blanks or empty data cells. For example, item missingness occurs if a participant does not provide information for a certain question, a question is overlooked by accident, a programming failure occurs, or a provided answer was missed while entering the data.

The com_item_missingness function implements the Missing values indicator, which belongs to the Crude Missingness domain in the Completeness dimension. Additionally, com_item_missingness is an implementation of the Uncertain missingness status indicator, which belongs to the Value format error domain in the Integrity dimension.

For more details, see the user’s manual and the source code.

Usage and arguments

com_item_missingness(
  study_data = sd1,
  meta_data = md1,
  label_col = "LABEL",
  show_causes = TRUE,
  cause_label_df = code_labels,
  include_sysmiss = TRUE,
  threshold_value = 80
)

The com_item_missingness function has the following arguments:

  • study_data: mandatory, the data frame containing the measurements.
  • meta_data: mandatory, the data frame containing the study data’s metadata.
  • resp_vars: optional (but recommended), a character vector specifying the measurement variables of interest.
  • label_col: optional, the column in the metadata data frame containing the labels of all the variables in the study data.
  • show_causes: optional, if TRUE, then the distribution of underlying missing codes is cross-tabulated and illustrated in a figure.
  • cause_label_df: optional, the data frame containing the labels of the missing codes. Should only be used if missing codes are harmonized across all the study variables.
  • include_sysmiss: optional, if TRUE, system missings (NAs) are shown in the summary plot.
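  • threshold_value: mandatory, a percentage between 0 and 100. Variables whose completeness (the share of expected observations that are not missing) falls below this value are flagged; with threshold_value = 80, this corresponds to more than 20% missing values among the expected observations.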
  • suppressWarnings: optional with default FALSE. If TRUE, suppresses any warnings about using mixed codes for missings and jumps.

Example output

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

For the item missingness function, the metadata columns MISSING_LIST and JUMP_LIST are crucial. The two lists should be disjoint, i.e., no code should appear in both:

  1. MISSING_LIST: a list of numeric codes used to qualify the reasons for missing values. The list must be pipe | separated to be interpretable by the function.
  2. JUMP_LIST: a list of numeric codes used to qualify the reasons for expected missing values, for example, by design. The list must be pipe | separated to be interpretable by the function.

    VAR_NAMES  LABEL           MISSING_LIST                                                                                                                   JUMP_LIST
4   v00003     AGE_0           NA                                                                                                                             NA
39  v00030     MEDICATION_0    99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995                                                          NA
1   v00000     CENTER_0        NA                                                                                                                             NA
34  v00025     SMOKE_SHOP_0    99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995                                                          NA
23  v00016     DEV_NO_0        NA                                                                                                                             NA
43  v40000     PART_INTERVIEW  NA                                                                                                                             NA
14  v00009     ARM_CIRC_0      99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99987 | 99988 | 99989 | 99990 | 99991 | 99992 | 99993 | 99994 | 99995  NA
18  v00012     USR_BP_0        99981 | 99982                                                                                                                  NA
33  v00024     SMOKING_0       99980 | 99983 | 99988 | 99989 | 99990 | 99991 | 99993 | 99994 | 99995                                                          NA
21  v00014     CRP_0           99980 | 99981 | 99982 | 99983 | 99984 | 99985 | 99986 | 99988 | 99989 | 99990 | 99991 | 99992 | 99994 | 99995                  NA
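
The disjointness requirement can be checked before calling the function. The following is a minimal sketch in base R, not part of dataquieR: parse_code_list is a hypothetical helper, and the second metadata row is invented to show what an overlap looks like.

```r
# Hypothetical helper (not a dataquieR function): split a pipe-separated
# code list such as "99981 | 99982" into a numeric vector.
parse_code_list <- function(x) {
  if (is.na(x)) {
    return(numeric(0))  # no codes defined for this variable
  }
  as.numeric(trimws(strsplit(x, "|", fixed = TRUE)[[1]]))
}

# USR_BP_0 as in the metadata excerpt; HYPOTHETICAL_VAR is invented
# for illustration and does not occur in the example metadata.
md_example <- data.frame(
  LABEL = c("USR_BP_0", "HYPOTHETICAL_VAR"),
  MISSING_LIST = c("99981 | 99982", "99980 | 88880"),
  JUMP_LIST = c(NA, "88880 | 88890"),
  stringsAsFactors = FALSE
)

# Codes that appear in both lists, per variable
overlap <- lapply(seq_len(nrow(md_example)), function(i) {
  intersect(
    parse_code_list(md_example$MISSING_LIST[i]),
    parse_code_list(md_example$JUMP_LIST[i])
  )
})
names(overlap) <- md_example$LABEL

# Variables whose MISSING_LIST and JUMP_LIST are not disjoint
not_disjoint <- names(overlap)[lengths(overlap) > 0]
print(not_disjoint)
```

An empty result means that every variable has disjoint lists; any label that is printed should be corrected in the metadata first.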


This example considers labeled missing codes, which are optional but recommended. The missing code labels for the simulated study data are loaded as follows:

file_name <- system.file("extdata", "meta_data_v2.xlsx", package = "dataquieR")
prep_load_workbook_like_file(file_name)
code_labels <- prep_get_data_frame("missing_table") # missing_table is a sheet in meta_data_v2.xlsx

CODE_VALUE  CODE_LABEL                                CODE_INTERPRET  CODE_CLASS
99980       Missing - other reason                    O               MISSING
99981       Missing - exclusion criteria              NE              MISSING
99982       Missing - refusal                         R               MISSING
99983       Missing - not assessable                  NC              MISSING
99984       Missing - technical problem               O               MISSING
99985       Missing - not available (material)        NC              MISSING
99986       Missing - not usable (material)           NC              MISSING
99987       Missing - reason unknown                  O               MISSING
99988       Missing - optional value                  NE              MISSING
99989       Deleted - other reason                    O               MISSING
99990       Deleted - contradiction                   O               MISSING
99991       Deleted - value outside limits            O               MISSING
99992       Inaccurate - other reason                 O               MISSING
99993       Inaccurate - value above detection limit  O               MISSING
99994       Inaccurate - value below detection limit  O               MISSING
99995       Data management ongoing                   O               MISSING
88880       JUMP 88880                                NE              JUMP
88890       JUMP 88890                                NE              JUMP
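
To show how such a label table relates codes to causes, the sketch below looks up labels for the values of a single variable using base R only. It is an illustration, not the package's internal code: the three table rows are copied from above, while the vector x is invented.

```r
# Three rows copied from the missing code table above
code_labels_example <- data.frame(
  CODE_VALUE = c(99981, 99982, 99983),
  CODE_LABEL = c(
    "Missing - exclusion criteria",
    "Missing - refusal",
    "Missing - not assessable"
  ),
  stringsAsFactors = FALSE
)

# Invented raw values of one variable: measurements, missing codes, one NA
x <- c(120, 99982, 135, 99981, NA, 99982)

# Look up the label of each value; measurements and NA yield NA
cause <- code_labels_example$CODE_LABEL[
  match(x, code_labels_example$CODE_VALUE)
]

# Frequency of each missing cause (NA entries are dropped by table())
cause_counts <- table(cause)
print(cause_counts)
```

This kind of lookup only makes sense if a given code carries the same label for every variable.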


CAVEAT: if missing codes are not harmonized across all the study variables (i.e., labels of missing codes vary across variables), the cause_label_df argument should not be used.

The next function call enables the analysis of missing value causes (show_causes = TRUE) and the display of system missings (include_sysmiss = TRUE), passes the missing code labels, and sets the threshold to 80%.

item_miss_1 <- com_item_missingness(
  study_data = sd1,
  meta_data = md1,
  label_col = "LABEL",
  show_causes = TRUE,
  cause_label_df = code_labels,
  include_sysmiss = TRUE,
  threshold_value = 80
)

The function outputs a list containing the object SummaryTable, which includes item missingness per response variable. Run item_miss_1$SummaryTable to see the data frame:

Variables            Expected observations N  Sysmiss N  Datavalues N  Missing codes N  Jumps N  Measurements N  Missing_expected_obs  PCT_com_crm_mv  GRADING
CENTER_0             2940                     0          2940          0                0        2940            0                     0.00            0
PSEUDO_ID            2940                     0          2940          0                0        2940            0                     0.00            0
SEX_0                2940                     0          2940          0                0        2940            0                     0.00            0
AGE_0                2940                     0          2940          0                0        2940            0                     0.00            0
AGE_GROUP_0          2940                     0          2940          0                0        2940            0                     0.00            0
AGE_1                2940                     0          2940          0                0        2940            0                     0.00            0
SEX_1                2940                     0          2940          0                0        2940            0                     0.00            0
PART_STUDY           3000                     60         2940          0                0        2940            60                    2.00            0
SBP_0                2940                     239        2701          140              0        2561            379                   12.63           0
DBP_0                2940                     233        2707          163              0        2544            396                   13.20           0
GLOBAL_HEALTH_VAS_0  2940                     246        2694          76               0        2618            322                   10.73           0
ASTHMA_0             2940                     227        2713          72               0        2641            299                   9.97            0
VO2_CAPCAT_0         2940                     225        2715          120              0        2595            345                   11.50           0
ARM_CIRC_0           2940                     220        2720          63               0        2657            283                   9.43            0
ARM_CIRC_DISC_0      2940                     238        2702          69               0        2633            307                   10.23           0
ARM_CUFF_0           2940                     236        2704          81               0        2623            317                   10.57           0
USR_VO2_0            2940                     89         2851          69               0        2782            158                   5.27            0
USR_BP_0             2940                     80         2860          85               0        2775            165                   5.50            0
EXAM_DT_0            2940                     0          2940          0                0        2940            0                     0.00            0
PART_PHYS_EXAM       2940                     0          2940          0                0        2940            0                     0.00            0
CRP_0                2940                     172        2768          69               0        2699            241                   8.03            0
BSG_0                2940                     182        2758          72               0        2686            254                   8.47            0
DEV_NO_0             2940                     248        2692          0                0        2692            248                   8.27            0
LAB_DT_0             2940                     0          2940          0                0        2940            0                     0.00            0
PART_LAB             2940                     0          2940          0                0        2940            0                     0.00            0
EDUCATION_0          2924                     72         2852          380              0        2472            452                   15.07           0
EDUCATION_1          2924                     83         2841          416              0        2425            499                   16.63           0
FAM_STAT_0           2924                     122        2802          413              0        2389            535                   17.83           0
MARRIED_0            2924                     126        2798          432              0        2366            558                   18.60           0
N_CHILD_0            2924                     160        2764          428              0        2336            588                   19.60           1
EATING_PREFS_0       2924                     148        2776          448              0        2328            596                   19.87           1
MEAT_CONS_0          2924                     171        2753          451              0        2302            622                   20.73           1
SMOKING_0            2924                     183        2741          449              0        2292            632                   21.07           1
SMOKE_SHOP_0         2924                     1605       1319          513              0        806             2118                  70.60           1
N_INJURIES_0         2924                     244        2680          481              0        2199            725                   24.17           1
N_BIRTH_0            2924                     213        2711          499              1113     1099            1825                  60.83           1
INCOME_GROUP_0       2924                     235        2689          515              0        2174            750                   25.00           1
PREGNANT_0           2924                     274        2650          519              1066     1065            1859                  61.97           1
MEDICATION_0         2924                     1733       1191          550              0        641             2283                  76.10           1
N_ATC_CODES_0        2924                     310        2614          556              0        2058            866                   28.87           1
USR_SOCDEM_0         2924                     306        2618          332              0        2286            638                   21.27           1
INT_DT_0             2924                     0          2924          0                0        2924            0                     0.00            0
PART_INTERVIEW       2940                     0          2940          0                0        2940            0                     0.00            0
ITEM_1_0             2864                     317        2547          299              0        2248            616                   20.53           1
ITEM_2_0             2864                     343        2521          324              0        2197            667                   22.23           1
ITEM_3_0             2864                     355        2509          325              0        2184            680                   22.67           1
ITEM_4_0             2864                     347        2517          374              0        2143            721                   24.03           1
ITEM_5_0             2864                     416        2448          374              0        2074            790                   26.33           1
ITEM_6_0             2864                     427        2437          389              0        2048            816                   27.20           1
ITEM_7_0             2864                     395        2469          401              0        2068            796                   26.53           1
ITEM_8_0             2864                     424        2440          427              0        2013            851                   28.37           1
QUEST_DT_0           2864                     0          2864          0                0        2864            0                     0.00            0
PART_QUESTIONNAIRE   2924                     0          2924          0                0        2924            0                     0.00            0
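
The count columns of the table are related to one another. The sketch below checks three such relations on rows copied from the table, all of which have no jump codes. The relations are inferred from the example numbers rather than taken from the package documentation, and the column names are shortened for readability.

```r
# Three rows copied from the SummaryTable above (all with N Jumps = 0).
# Column names are shortened; they are not the names used by dataquieR.
st <- data.frame(
  Variables = c("SBP_0", "N_CHILD_0", "MEDICATION_0"),
  expected = c(2940, 2924, 2924),
  sysmiss = c(239, 160, 1733),
  datavalues = c(2701, 2764, 1191),
  missing_codes = c(140, 428, 550),
  jumps = c(0, 0, 0),
  measurements = c(2561, 2336, 641),
  missing_expected_obs = c(379, 588, 2283),
  stringsAsFactors = FALSE
)

checks <- with(st, data.frame(
  Variables = Variables,
  # data values = expected observations minus system missings
  datavalues_ok = datavalues == expected - sysmiss,
  # measurements = data values minus missing codes and jump codes
  measurements_ok = measurements == datavalues - missing_codes - jumps,
  # missing expected observations = system missings plus missing codes
  missing_ok = missing_expected_obs == sysmiss + missing_codes
))
print(checks)
```

All three checks hold for these rows. Rows with jump codes (N_BIRTH_0 and PREGNANT_0) are left out here because the third relation, as written, does not account for jumps.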


When show_causes = TRUE, the output list also includes SummaryData and SummaryPlot. SummaryData is the data frame underlying SummaryPlot. Display the plot with item_miss_1$SummaryPlot.

Item missingness for selected variables

An optional argument of this function is resp_vars, which can be used to compute item missingness only for selected variables:

item_miss_2 <- com_item_missingness(
  study_data = sd1,
  meta_data = md1,
  resp_vars = c(
    "AGE_0", "SBP_0", "DBP_0", "N_INJURIES_0",
    "N_CHILD_0", "INCOME_GROUP_0"
  ),
  label_col = "LABEL",
  show_causes = TRUE,
  cause_label_df = code_labels,
  include_sysmiss = FALSE,
  threshold_value = 80
)

The output provides the same information as above, but restricted to the selected variables. Display the plot with item_miss_2$SummaryPlot.

Interpretation

The higher the percentage of item missingness, the lower the data quality.

Algorithm of the implementation

  1. From the metadata, select lists of missing codes and, if applicable, jump codes.
  2. Calculate the number of system missings (NA) in each variable.
  3. Calculate the number of missing codes in each variable.
  4. Calculate the number of jump codes in each variable.
  5. Generate two data frames, a summary on the level of observations and a summary for each variable.
  6. Optional: if show_causes = TRUE, provide a summary plot.
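
Steps 2 to 4 and the per-variable summary of step 5 can be sketched in base R as follows. This is a simplified illustration on invented data, not the dataquieR source code: the code lists are given directly as R vectors, and the percentage uses the number of rows as its denominator, which is an assumption made for this sketch.

```r
# Invented study data: 99980 is a missing code, 88880 a jump code
study <- data.frame(
  VAR_A = c(1, 2, NA, 99980, 5, 88880),
  VAR_B = c(NA, NA, 3, 4, 99980, 6)
)

# Step 1 (simplified): code lists per variable, already parsed
missing_codes <- list(VAR_A = 99980, VAR_B = 99980)
jump_codes <- list(VAR_A = 88880, VAR_B = numeric(0))

summary_rows <- lapply(names(study), function(v) {
  x <- study[[v]]
  n_sysmiss <- sum(is.na(x))                   # step 2: system missings
  n_missing <- sum(x %in% missing_codes[[v]])  # step 3: missing codes
  n_jumps <- sum(x %in% jump_codes[[v]])       # step 4: jump codes
  data.frame(
    Variables = v,
    n_sysmiss = n_sysmiss,
    n_missing_codes = n_missing,
    n_jumps = n_jumps,
    # Assumption: percentage relative to all rows of the toy data
    pct_missing = round(100 * (n_sysmiss + n_missing) / length(x), 2)
  )
})

# Step 5 (per-variable part only): one summary row per variable
summary_table <- do.call(rbind, summary_rows)
print(summary_table)
```

Note that %in% never matches NA against a numeric code, so system missings and coded missings are counted separately.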

Concept relations

De Leeuw, E.D., Hox, J.J., and Huisman, M. (2003). Prevention and treatment of item nonresponse. Journal of Official Statistics 19, 153–176.