The function com_unit_missingness targets unit missingness, also called unit nonresponse (Kalton and Kasprzyk 1986). It does so without analyzing why data are missing. For this reason, com_unit_missingness is an implementation of the Missing values indicator, which belongs to the Crude Missingness domain in the Completeness dimension.
com_unit_missingness checks whether all measurement variables in the provided study dataset are missing for an observational unit. Any decision on unit missingness therefore depends on the scope of the provided dataset.
For more details, see the user’s manual and the source code.
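The core idea of the check can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the dataquieR implementation: the toy data frame `toy`, the helper `flag_unit_missing()`, and the column name `unit_missing` are hypothetical, and missing values are assumed to be coded as `NA` already.

```r
# Toy data: one ID column and two measurement variables (all names hypothetical).
toy <- data.frame(
  id  = c("a", "b", "c", "d"),
  sbp = c(120, NA, 135, NA),
  dbp = c(80, NA, NA, 90)
)

# Hypothetical helper: flag rows in which every non-ID variable is NA.
flag_unit_missing <- function(data, id_vars = character(0)) {
  measurements <- data[, setdiff(names(data), id_vars), drop = FALSE]
  rowSums(!is.na(measurements)) == 0
}

toy$unit_missing <- flag_unit_missing(toy, id_vars = "id")

# Crude count and percentage of units without any measurement.
n_missing   <- sum(toy$unit_missing)
pct_missing <- 100 * mean(toy$unit_missing)

print(toy)
cat(n_missing, "of", nrow(toy), "units have no measurements:", pct_missing, "%\n")
```

Only unit "b" is flagged here, because units "c" and "d" each have at least one non-missing measurement.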
com_unit_missingness(study_data,
meta_data,
id_vars = NULL,
strata_vars = NULL,
label_col
)
The com_unit_missingness function has the following arguments:
Crude unit missingness can be calculated for stratified data. In this case, strata_vars must be specified. No threshold value is implemented.
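What stratification means for the crude rate can be sketched in base R as follows. This is an illustration only, not how dataquieR computes its output: the toy data, the variable `center`, and the output column names are hypothetical, and missing values are assumed to be coded as `NA`.

```r
# Toy data with a hypothetical stratification variable `center`.
toy <- data.frame(
  center = c("A", "A", "A", "B", "B"),
  x      = c(1, NA, NA, 2, NA),
  y      = c(NA, NA, NA, 3, NA)
)

# A unit is missing if all measurement variables (here x and y) are NA.
unit_missing <- rowSums(!is.na(toy[, c("x", "y")])) == 0

# Count observations and unit-missing observations within each stratum.
n_obs  <- tapply(unit_missing, toy$center, length)
n_miss <- tapply(unit_missing, toy$center, sum)

summary_by_center <- data.frame(
  center           = names(n_obs),
  n_obs            = as.vector(n_obs),
  n_unit_missing   = as.vector(n_miss),
  pct_unit_missing = round(100 * as.vector(n_miss) / as.vector(n_obs), 2)
)
print(summary_by_center)
```

Because no threshold is applied, the sketch reports the rates and leaves their grading to the reader.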
To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.
The first example analyzes unit missingness without stratification:
unit_miss_1 <- com_unit_missingness(
study_data = sd1,
meta_data = md1,
id_vars = c("CENTER_0", "PSEUDO_ID"),
label_col = "LABEL"
)
The function returns the outputs FlaggedStudyData and SummaryData. FlaggedStudyData contains a data frame of the study data in which a flag marks observations without any measurements at all. SummaryData contains two elements: (1) the number of observations showing unit missingness, and (2) the percentage of unit missingness.
Run unit_miss_1$SummaryData to see the summary output. In this example, unit missingness is observed in n = 60 observations, which equals 2% of this dataset.
Unit missingness can also be calculated using a discrete variable for stratification, for example, in multi-center studies:
unit_miss_2 <- com_unit_missingness(
study_data = sd1,
meta_data = md1,
id_vars = c("CENTER_0", "PSEUDO_ID"),
strata_vars = "CENTER_0",
label_col = "LABEL"
)
The stratified summary data frame, unit_miss_2$SummaryData, shows unit missingness for each stratum:
CENTER_0 | N_OBS | N_UNIT_MISSINGS | N_UNIT_MISSINGS_(%) |
---|---|---|---|
Berlin | 617 | 15 | 2.43 |
Hamburg | 581 | 11 | 1.89 |
Leipzig | 593 | 9 | 1.52 |
Cologne | 564 | 13 | 2.30 |
Munich | 585 | 12 | 2.05 |
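As a cross-check, the percentages in this table can be recomputed from its count columns. The sketch below copies the counts from the table by hand; the object names `strata`, `PCT`, `total_missing`, and `overall_pct` are hypothetical. Whether N_OBS covers every row of the dataset is not stated here, so the pooled percentage is only an approximate comparison with the unstratified result.

```r
# Counts copied from the stratified summary table.
strata <- data.frame(
  CENTER_0        = c("Berlin", "Hamburg", "Leipzig", "Cologne", "Munich"),
  N_OBS           = c(617, 581, 593, 564, 585),
  N_UNIT_MISSINGS = c(15, 11, 9, 13, 12)
)

# Per-stratum percentage: unit-missing observations relative to N_OBS.
strata$PCT <- round(100 * strata$N_UNIT_MISSINGS / strata$N_OBS, 2)

# Pooled figures across all strata.
total_missing <- sum(strata$N_UNIT_MISSINGS)
total_obs     <- sum(strata$N_OBS)
overall_pct   <- round(100 * total_missing / total_obs, 2)

print(strata)
cat("Unit-missing observations across strata:", total_missing, "\n")
cat("Pooled percentage over tabulated N_OBS:", overall_pct, "\n")
```

The stratum counts add up to the 60 unit-missing observations reported for the unstratified example, and the pooled rate is close to the roughly 2% stated there.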
com_unit_missingness provides the number and proportion of units without a single valid measurement value on any provided variable. Generally, the higher the proportion of units with missing data, the lower the data quality.
Unit missingness should be distinguished from segment and item missingness because it may have different causes and underlying mechanisms. For example, unit nonresponse may be selective with respect to the targeted study population, or it may occur for technical reasons, such as record linkage.
Some notes of caution apply:
com_unit_missingness calculates a crude rate of unit missingness, meaning that it ignores why information is missing. Because missingness may have several causes, com_unit_missingness does not, for example, distinguish design-related missingness, which does not per se indicate inferior data quality.
com_unit_missingness only looks at the provided variables. The results therefore indicate for which observational units no information is available within the intended scope of variables. In terms of the conceptual distinction between unit, segment, and item missingness, the meaning of the results of com_unit_missingness may vary. For example, if the variables to be checked come from only one segment (e.g., one examination) of a study, the results in fact describe segment missingness, or possibly even item missingness, rather than unit missingness. Users must therefore keep the scope of variables in mind to interpret results correctly.
Variables that provide relevant non-measurement information, such as IDs, must be excluded from the analysis to obtain meaningful results. In general, all variables that are by default filled in completely, regardless of participation status, must be excluded.
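The effect of leaving such variables in the check can be illustrated with a small base R sketch. It is not the dataquieR implementation: the toy data and the helper `count_unit_missing()` are hypothetical, and missing values are assumed to be coded as `NA`.

```r
# Toy data: `id` and `center` are always filled, regardless of participation.
toy <- data.frame(
  id     = c("p1", "p2", "p3"),
  center = c("A", "A", "B"),
  weight = c(70, NA, NA),
  height = c(175, NA, 168)
)

# Hypothetical helper: count rows in which all checked variables are NA.
count_unit_missing <- function(data, exclude = character(0)) {
  checked <- data[, setdiff(names(data), exclude), drop = FALSE]
  sum(rowSums(!is.na(checked)) == 0)
}

# Including the always-filled variables hides the missing unit.
with_ids <- count_unit_missing(toy)

# Excluding them reveals that "p2" has no measurements at all.
without_ids <- count_unit_missing(toy, exclude = c("id", "center"))

cat("Unit missingness with ID variables included:", with_ids, "\n")
cat("Unit missingness with ID variables excluded:", without_ids, "\n")
```

In dataquieR, the id_vars argument shown in the examples above serves this purpose.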