Item missingness (also called item nonresponse; De Leeuw et al. 2003) describes the missingness of single values in a data set, such as blanks or empty data cells. For example, item missingness occurs if a participant does not provide information for a certain question, a question is overlooked by accident, a programming failure occurs, or a provided answer was missed while entering the data.
The com_item_missingness function implements the Missing values indicator, which belongs to the Crude Missingness domain in the Completeness dimension. Additionally, com_item_missingness is an implementation of the Uncertain missingness status indicator, which belongs to the Value format error domain in the Integrity dimension.
For more details, see the user’s manual and the source code.
com_item_missingness(
study_data = sd1,
meta_data = md1,
label_col = "LABEL",
show_causes = TRUE,
cause_label_df = code_labels,
include_sysmiss = TRUE,
threshold_value = 80
)
The com_item_missingness function has the following arguments:

show_causes: if TRUE, the distribution of underlying missing codes is cross-tabulated and illustrated in a figure.

include_sysmiss: if TRUE, system missings (NAs) are shown in the summary plot.

An optional flag that defaults to FALSE: if TRUE, it suppresses any warnings about using mixed codes for missings and jumps.

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.
For the item missingness function, the metadata columns MISSING_LIST and JUMP_LIST are crucial. Moreover, MISSING_LIST and JUMP_LIST should be disjoint sets:
MISSING_LIST: a list of numeric codes used to qualify the reasons for missing values. The list must be pipe (|) separated to be interpretable by the function.

JUMP_LIST: a list of numeric codes used to qualify the reasons for expected missing values, for example, by design. The list must be pipe (|) separated to be interpretable by the function.

 | VAR_NAMES | LABEL | MISSING_LIST | JUMP_LIST |
---|---|---|---|---|
4 | v00003 | AGE_0 | NA | NA |
39 | v00030 | MEDICATION_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | NA |
1 | v00000 | CENTER_0 | NA | NA |
34 | v00025 | SMOKE_SHOP_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | NA |
23 | v00016 | DEV_NO_0 | NA | NA |
43 | v40000 | PART_INTERVIEW | NA | NA |
14 | v00009 | ARM_CIRC_0 | 99980 \| 99981 \| 99982 \| 99983 \| 99984 \| 99985 \| 99986 \| 99987 \| 99988 \| 99989 \| 99990 \| 99991 \| 99992 \| 99993 \| 99994 \| 99995 | NA |
18 | v00012 | USR_BP_0 | 99981 \| 99982 | NA |
33 | v00024 | SMOKING_0 | 99980 \| 99983 \| 99988 \| 99989 \| 99990 \| 99991 \| 99993 \| 99994 \| 99995 | NA |
21 | v00014 | CRP_0 | 99980 \| 99981 \| 99982 \| 99983 \| 99984 \| 99985 \| 99986 \| 99988 \| 99989 \| 99990 \| 99991 \| 99992 \| 99994 \| 99995 | NA |
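
The pipe-separated format and the requirement that both lists are disjoint can be checked with a few lines of base R. The following is a minimal sketch and not part of dataquieR: the helper names parse_code_list and lists_disjoint are hypothetical, the missing codes are those of USR_BP_0, and the jump lists are illustrative values.

```r
# Hypothetical helper: split a pipe-separated code list into numeric codes.
parse_code_list <- function(x) {
  if (is.na(x) || !nzchar(trimws(x))) {
    return(numeric(0))  # no codes defined for this variable
  }
  as.numeric(trimws(strsplit(x, "|", fixed = TRUE)[[1]]))
}

# Hypothetical helper: TRUE if no code appears in both lists.
lists_disjoint <- function(missing_list, jump_list) {
  shared <- intersect(parse_code_list(missing_list),
                      parse_code_list(jump_list))
  length(shared) == 0
}

print(parse_code_list("99981 | 99982"))
print(lists_disjoint("99981 | 99982", "88880 | 88890"))  # no shared code
print(lists_disjoint("99981 | 99982", "99982 | 88880"))  # 99982 is in both
```

Splitting with fixed = TRUE matters here, because an unescaped | would otherwise be read as a regular expression alternation.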
This example considers labeled missing codes, which are optional but recommended. The missing code labels for the simulated study data are loaded as follows:
file_name <- system.file("extdata", "meta_data_v2.xlsx", package = "dataquieR")
prep_load_workbook_like_file(file_name)
code_labels <- prep_get_data_frame("missing_table") # missing_table is a sheet in meta_data_v2.xlsx
CODE_VALUE | CODE_LABEL | CODE_INTERPRET | CODE_CLASS |
---|---|---|---|
99980 | Missing - other reason | O | MISSING |
99981 | Missing - exclusion criteria | NE | MISSING |
99982 | Missing - refusal | R | MISSING |
99983 | Missing - not assessable | NC | MISSING |
99984 | Missing - technical problem | O | MISSING |
99985 | Missing - not available (material) | NC | MISSING |
99986 | Missing - not usable (material) | NC | MISSING |
99987 | Missing - reason unknown | O | MISSING |
99988 | Missing - optional value | NE | MISSING |
99989 | Deleted - other reason | O | MISSING |
99990 | Deleted - contradiction | O | MISSING |
99991 | Deleted - value outside limits | O | MISSING |
99992 | Inaccurate - other reason | O | MISSING |
99993 | Inaccurate - value above detection limit | O | MISSING |
99994 | Inaccurate - value below detection limit | O | MISSING |
99995 | Data management ongoing | O | MISSING |
88880 | JUMP 88880 | NE | JUMP |
88890 | JUMP 88890 | NE | JUMP |
CAVEAT: If missing codes are not harmonized across all study variables (i.e., the label of a given missing code varies across variables), labeled missing codes should not be used.
The next function call enables the analysis of missing value causes (show_causes = TRUE) and the display of system missings (include_sysmiss = TRUE), passes the missing code labels, and sets the threshold to 80%.
item_miss_1 <- com_item_missingness(
study_data = sd1,
meta_data = md1,
label_col = "LABEL",
show_causes = TRUE,
cause_label_df = code_labels,
include_sysmiss = TRUE,
threshold_value = 80
)
The function outputs a list containing the object SummaryTable, which includes item missingness per response variable. Run item_miss_1$SummaryTable to see the data frame:
Variables | Expected observations N | Sysmiss N | Datavalues N | Missing codes N | Jumps N | Measurements N | Missing_expected_obs | PCT_com_crm_mv | GRADING |
---|---|---|---|---|---|---|---|---|---|
CENTER_0 | 2940 | 0 | 2940 | 0 | 0 | 2940 | 0 | 0.00 | 0 |
PSEUDO_ID | 2940 | 0 | 2940 | 0 | 0 | 2940 | 0 | 0.00 | 0 |
SEX_0 | 2940 | 0 | 2940 | 0 | 0 | 2940 | 0 | 0.00 | 0 |
AGE_0 | 2940 | 0 | 2940 | 0 | 0 | 2940 | 0 | 0.00 | 0 |
AGE_GROUP_0 | 2940 | 0 | 2940 | 0 | 0 | 2940 | 0 | 0.00 | 0 |
AGE_1 | 2940 | 0 | 2940 | 0 | 0 | 2940 | 0 | 0.00 | 0 |
SEX_1 | 2940 | 0 | 2940 | 0 | 0 | 2940 | 0 | 0.00 | 0 |
PART_STUDY | 3000 | 60 | 2940 | 0 | 0 | 2940 | 60 | 2.00 | 0 |
SBP_0 | 2940 | 239 | 2701 | 140 | 0 | 2561 | 379 | 12.63 | 0 |
DBP_0 | 2940 | 233 | 2707 | 163 | 0 | 2544 | 396 | 13.20 | 0 |
GLOBAL_HEALTH_VAS_0 | 2940 | 246 | 2694 | 76 | 0 | 2618 | 322 | 10.73 | 0 |
ASTHMA_0 | 2940 | 227 | 2713 | 72 | 0 | 2641 | 299 | 9.97 | 0 |
VO2_CAPCAT_0 | 2940 | 225 | 2715 | 120 | 0 | 2595 | 345 | 11.50 | 0 |
ARM_CIRC_0 | 2940 | 220 | 2720 | 63 | 0 | 2657 | 283 | 9.43 | 0 |
ARM_CIRC_DISC_0 | 2940 | 238 | 2702 | 69 | 0 | 2633 | 307 | 10.23 | 0 |
ARM_CUFF_0 | 2940 | 236 | 2704 | 81 | 0 | 2623 | 317 | 10.57 | 0 |
USR_VO2_0 | 2940 | 89 | 2851 | 69 | 0 | 2782 | 158 | 5.27 | 0 |
USR_BP_0 | 2940 | 80 | 2860 | 85 | 0 | 2775 | 165 | 5.50 | 0 |
EXAM_DT_0 | 2940 | 0 | 2940 | 0 | 0 | 2940 | 0 | 0.00 | 0 |
PART_PHYS_EXAM | 2940 | 0 | 2940 | 0 | 0 | 2940 | 0 | 0.00 | 0 |
CRP_0 | 2940 | 172 | 2768 | 69 | 0 | 2699 | 241 | 8.03 | 0 |
BSG_0 | 2940 | 182 | 2758 | 72 | 0 | 2686 | 254 | 8.47 | 0 |
DEV_NO_0 | 2940 | 248 | 2692 | 0 | 0 | 2692 | 248 | 8.27 | 0 |
LAB_DT_0 | 2940 | 0 | 2940 | 0 | 0 | 2940 | 0 | 0.00 | 0 |
PART_LAB | 2940 | 0 | 2940 | 0 | 0 | 2940 | 0 | 0.00 | 0 |
EDUCATION_0 | 2924 | 72 | 2852 | 380 | 0 | 2472 | 452 | 15.07 | 0 |
EDUCATION_1 | 2924 | 83 | 2841 | 416 | 0 | 2425 | 499 | 16.63 | 0 |
FAM_STAT_0 | 2924 | 122 | 2802 | 413 | 0 | 2389 | 535 | 17.83 | 0 |
MARRIED_0 | 2924 | 126 | 2798 | 432 | 0 | 2366 | 558 | 18.60 | 0 |
N_CHILD_0 | 2924 | 160 | 2764 | 428 | 0 | 2336 | 588 | 19.60 | 1 |
EATING_PREFS_0 | 2924 | 148 | 2776 | 448 | 0 | 2328 | 596 | 19.87 | 1 |
MEAT_CONS_0 | 2924 | 171 | 2753 | 451 | 0 | 2302 | 622 | 20.73 | 1 |
SMOKING_0 | 2924 | 183 | 2741 | 449 | 0 | 2292 | 632 | 21.07 | 1 |
SMOKE_SHOP_0 | 2924 | 1605 | 1319 | 513 | 0 | 806 | 2118 | 70.60 | 1 |
N_INJURIES_0 | 2924 | 244 | 2680 | 481 | 0 | 2199 | 725 | 24.17 | 1 |
N_BIRTH_0 | 2924 | 213 | 2711 | 499 | 1113 | 1099 | 1825 | 60.83 | 1 |
INCOME_GROUP_0 | 2924 | 235 | 2689 | 515 | 0 | 2174 | 750 | 25.00 | 1 |
PREGNANT_0 | 2924 | 274 | 2650 | 519 | 1066 | 1065 | 1859 | 61.97 | 1 |
MEDICATION_0 | 2924 | 1733 | 1191 | 550 | 0 | 641 | 2283 | 76.10 | 1 |
N_ATC_CODES_0 | 2924 | 310 | 2614 | 556 | 0 | 2058 | 866 | 28.87 | 1 |
USR_SOCDEM_0 | 2924 | 306 | 2618 | 332 | 0 | 2286 | 638 | 21.27 | 1 |
INT_DT_0 | 2924 | 0 | 2924 | 0 | 0 | 2924 | 0 | 0.00 | 0 |
PART_INTERVIEW | 2940 | 0 | 2940 | 0 | 0 | 2940 | 0 | 0.00 | 0 |
ITEM_1_0 | 2864 | 317 | 2547 | 299 | 0 | 2248 | 616 | 20.53 | 1 |
ITEM_2_0 | 2864 | 343 | 2521 | 324 | 0 | 2197 | 667 | 22.23 | 1 |
ITEM_3_0 | 2864 | 355 | 2509 | 325 | 0 | 2184 | 680 | 22.67 | 1 |
ITEM_4_0 | 2864 | 347 | 2517 | 374 | 0 | 2143 | 721 | 24.03 | 1 |
ITEM_5_0 | 2864 | 416 | 2448 | 374 | 0 | 2074 | 790 | 26.33 | 1 |
ITEM_6_0 | 2864 | 427 | 2437 | 389 | 0 | 2048 | 816 | 27.20 | 1 |
ITEM_7_0 | 2864 | 395 | 2469 | 401 | 0 | 2068 | 796 | 26.53 | 1 |
ITEM_8_0 | 2864 | 424 | 2440 | 427 | 0 | 2013 | 851 | 28.37 | 1 |
QUEST_DT_0 | 2864 | 0 | 2864 | 0 | 0 | 2864 | 0 | 0.00 | 0 |
PART_QUESTIONNAIRE | 2924 | 0 | 2924 | 0 | 0 | 2924 | 0 | 0.00 | 0 |
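
The columns of this table are related by simple arithmetic, which the following sketch reproduces for three rows copied from the table. The relations are inferred from the printed numbers, not taken from the dataquieR source. In particular, the denominator of 3000 for the percentage is an assumption that happens to match every row shown (it equals the largest number of expected observations in the table). The shortened column names are for illustration only.

```r
# Three rows copied from the SummaryTable above (column names shortened).
tab <- data.frame(
  Variables     = c("SBP_0", "N_BIRTH_0", "MEDICATION_0"),
  expected      = c(2940, 2924, 2924),
  sysmiss       = c(239, 213, 1733),
  missing_codes = c(140, 499, 550),
  jumps         = c(0, 1113, 0)
)

# Data values: expected observations that are not system missings (NA).
tab$datavalues <- tab$expected - tab$sysmiss
# Measurements: data values that are neither missing codes nor jump codes.
tab$measurements <- tab$datavalues - tab$missing_codes - tab$jumps
# Missing expected observations: everything without a measurement.
tab$missing_expected_obs <- tab$sysmiss + tab$missing_codes + tab$jumps
# Assumption inferred from the table: the percentage is relative to 3000.
n_total <- 3000
tab$pct <- round(100 * tab$missing_expected_obs / n_total, 2)

print(tab)
```

For SBP_0, for example, this gives 2940 - 239 = 2701 data values, 2701 - 140 = 2561 measurements, and 239 + 140 = 379 missing expected observations.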
When show_causes = TRUE, the output list also includes SummaryData and SummaryPlot. SummaryData is the data frame underlying the SummaryPlot. Call the plot using item_miss_1$SummaryPlot:
An optional argument of this function is resp_vars, which can be used to compute item missingness only for selected variables:
item_miss_2 <- com_item_missingness(
study_data = sd1,
meta_data = md1,
resp_vars = c(
"AGE_0", "SBP_0", "DBP_0", "N_INJURIES_0",
"N_CHILD_0", "INCOME_GROUP_0"
),
label_col = "LABEL",
show_causes = TRUE,
cause_label_df = code_labels,
include_sysmiss = FALSE,
threshold_value = 80
)
The output provides the same information as above, but restricted to the selected variables. Call item_miss_2$SummaryPlot:
The higher the percentage of item missingness, the lower the data quality.
Count the system missings (NA) in each variable.

If show_causes = TRUE, provide a summary plot.
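
Counting system missings and missing codes per variable can be sketched in base R as follows. This is an illustration on a small made-up data frame, not the package's implementation. The object toy and its values are hypothetical, and the missing codes are assumed to be the same for both columns.

```r
# Toy study data (hypothetical values, not taken from sd1).
toy <- data.frame(
  SBP = c(120, NA, 99981, 135, NA),
  DBP = c(80, 99982, 85, NA, 90)
)
missing_codes <- c(99981, 99982)  # illustrative missing codes

# System missings: number of NA values per variable.
sysmiss_n <- colSums(is.na(toy))
# Missing codes: number of values per variable that match a missing code.
codes_n <- vapply(toy, function(x) sum(x %in% missing_codes), integer(1))
# Measurements: observations that are neither NA nor a missing code.
measurements_n <- nrow(toy) - sysmiss_n - codes_n

print(rbind(sysmiss_n, codes_n, measurements_n))
```

Because NA never matches a numeric code in %in%, system missings and missing codes are counted separately and do not overlap.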