Segment missingness can be annotated in the study data in two main ways:
(1) Participation in study segments is not recorded by specific variables. For example, there is no variable to acknowledge that a participant refused or could not take part in a specific examination, even though all the measurements for this participant in this segment are missing.
(2) There are specific variables to record individual participation in each study segment. For instance, a variable may indicate participation in the laboratory examination.
Use case (1) may be common in smaller studies. To calculate segment missingness, this implementation assumes that study variables are nested in respective segments and that the metadata specifies this structure. The function identifies all variables within each study segment, returns TRUE if all variables in a segment are missing, and FALSE otherwise.
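The rule described for use case (1) can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the toy data, the variable names (sbp, dbp, crp, hba1c), and the reduced two-column metadata are all hypothetical.

```r
# Hypothetical toy study data: three participants, four measurements.
study_data <- data.frame(
  sbp   = c(120, NA, 131),
  dbp   = c(80,  NA, NA),
  crp   = c(NA,  NA, 1.2),
  hba1c = c(NA,  NA, 5.4)
)

# Hypothetical minimal metadata: maps each variable to its segment.
meta_data <- data.frame(
  VAR_NAMES     = c("sbp", "dbp", "crp", "hba1c"),
  STUDY_SEGMENT = c("PHYS_EXAM", "PHYS_EXAM", "LAB", "LAB"),
  stringsAsFactors = FALSE
)

# For each segment: TRUE if every variable of the segment is missing
# for a participant, FALSE otherwise.
segment_missing <- sapply(
  split(meta_data$VAR_NAMES, meta_data$STUDY_SEGMENT),
  function(vars) rowSums(!is.na(study_data[, vars, drop = FALSE])) == 0
)

# Percentage of participants with a completely missing segment.
pct_segment_missing <- 100 * colMeans(segment_missing)
```

The result is a logical matrix with one row per participant and one column per segment, so column means directly give the crude segment missingness rate.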
Use case (2) assumes a more complex structure of study data and metadata, in which the study data include so-called intro-variables (coded either as TRUE/FALSE or as codes for non-participation). The column STUDY_SEGMENT (previously KEY_STUDY_SEGMENT) in the metadata contains the name of the segment (usually the label of the respective intro-variable for each measurement variable). The column PART_VAR names the variable that indicates whether a measurement variable should be present or not, likely reflecting the hierarchical study structure. In the subsequent calculation, this structure makes it possible to obtain the correct denominators for the missingness rates.
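To see why a participation variable changes the denominator, consider the following base R sketch. It only illustrates the idea and does not reproduce the package's algorithm; the vectors part_lab, crp, and glucose are hypothetical.

```r
# Hypothetical intro-variable: TRUE if the participant took part in the lab segment.
part_lab <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

# Hypothetical lab measurements for the same five participants.
crp     <- c(1.2, NA, NA, 0.8, NA)
glucose <- c(5.1, NA, NA, NA,  NA)

# A segment counts as missing if all of its measurements are missing.
lab_all_missing <- is.na(crp) & is.na(glucose)

# Naive rate: every participant is in the denominator.
naive_rate <- 100 * mean(lab_all_missing)

# Adjusted rate: only participants expected in the segment are in the denominator.
adjusted_rate <- 100 * sum(lab_all_missing & part_lab) / sum(part_lab)
```

In this toy example the naive rate counts non-participants as missing, whereas the adjusted rate restricts both numerator and denominator to participants for whom lab data were expected.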
The com_segment_missingness function implements the Missing values indicator, which belongs to the Crude Missingness domain in the Completeness dimension. For more details, see the user’s manual and the source code.
com_segment_missingness(
study_data = sd1,
meta_data = md1,
label_col = "LABEL",
threshold_value = 5,
color_gradient_direction = "above",
exclude_roles = c("secondary", "process")
)
The com_segment_missingness function has the following arguments:
group_vars: the default is NULL for output without grouping.
strata_vars: the default is NULL for no stratification.
color_gradient_direction: the default is above; it can be either above or below with respect to the threshold_value. Are the critical deviations above or below the threshold value? If values above the threshold are considered critical, above should be selected; otherwise, below should be used. See also Algorithm of the implementation.
Segment missingness can be calculated for stratified data. In this case, strata_vars must be specified.
To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.
For the segment missingness function, the metadata column STUDY_SEGMENT is crucial. According to use case (2) (see Description), this column specifies the intro-variable ID for each measurement variable. However, the content of this column can also be strings.
Row | VAR_NAMES | LABEL | DATA_TYPE | STUDY_SEGMENT |
---|---|---|---|---|
4 | v00003 | AGE_0 | integer | STUDY |
39 | v00030 | MEDICATION_0 | integer | INTERVIEW |
1 | v00000 | CENTER_0 | integer | STUDY |
34 | v00025 | SMOKE_SHOP_0 | integer | INTERVIEW |
23 | v00016 | DEV_NO_0 | integer | LAB |
43 | v40000 | PART_INTERVIEW | integer | INTERVIEW |
14 | v00009 | ARM_CIRC_0 | float | PHYS_EXAM |
18 | v00012 | USR_BP_0 | string | PHYS_EXAM |
33 | v00024 | SMOKING_0 | integer | INTERVIEW |
21 | v00014 | CRP_0 | float | LAB |
The next function call specifies the analyses of missing segments without stratification, setting the threshold to 5%:
seg_miss_1 <- com_segment_missingness(
study_data = sd1,
meta_data = md1,
label_col = "LABEL",
threshold_value = 5,
color_gradient_direction = "above",
exclude_roles = c("secondary", "process")
)
The function returns the objects SummaryData, ReportSummaryTable, and SummaryPlot. The SummaryData data frame expands over all possible combinations of aux_variable levels and examinations identified in the metadata. The threshold_value and the color_gradient_direction specified by the user are added to the data frame. Since color_gradient_direction = "above", all values above the threshold are considered critical and flagged with GRADING = 1.
Run seg_miss_1$SummaryData to see the output:
Group | Examinations | No. of Participants | No. of missing segments | (%) of missing segments | threshold | direction | GRADING |
---|---|---|---|---|---|---|---|
1 | STUDY | 2940 | 0 | 0.00 | 5 | above | 0 |
1 | PHYS_EXAM | 2940 | 160 | 5.44 | 5 | above | 1 |
1 | LAB | 2940 | 113 | 3.84 | 5 | above | 0 |
1 | INTERVIEW | 2924 | 332 | 11.35 | 5 | above | 1 |
1 | QUESTIONNAIRE | 2864 | 0 | 0.00 | 5 | above | 0 |
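The percentages and gradings in this table follow directly from the counts. The base R sketch below recomputes them from the values shown above; the column names n_participants, n_missing, and pct_missing are shorthand chosen for this illustration and are not the names used by the function.

```r
# Counts copied from the SummaryData table above.
summary_data <- data.frame(
  Examinations   = c("STUDY", "PHYS_EXAM", "LAB", "INTERVIEW", "QUESTIONNAIRE"),
  n_participants = c(2940, 2940, 2940, 2924, 2864),
  n_missing      = c(0, 160, 113, 332, 0)
)

threshold_value <- 5

# Percentage of missing segments, rounded to two decimals as in the table.
summary_data$pct_missing <- round(
  100 * summary_data$n_missing / summary_data$n_participants, 2
)

# With direction "above", values exceeding the threshold are graded 1.
summary_data$GRADING <- as.integer(summary_data$pct_missing > threshold_value)
```

Note that the denominator differs between segments (for example 2924 for INTERVIEW), so each rate is computed against the number of participants listed for that segment.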
The second output, ReportSummaryTable, is a heatmap-like graphic that highlights critical values depending on the respective threshold_value and color_gradient_direction.
Call it with seg_miss_1$SummaryPlot:
For some analyses, it is necessary to add new, transformed variables to the study data:
# use the month function of the lubridate package to extract the month of the exam date
library(lubridate)
# apply changes to a copy of the data
sd2 <- sd1
# extract the month (1 to 12) of the exam date
sd2$month <- month(sd2$v00013)
In this case, the variable metadata must be added to the study metadata:
md_temp <- prep_add_to_meta(
VAR_NAMES = "month",
DATA_TYPE = "integer",
LABEL = "EXAM_MONTH",
VALUE_LABELS = "1 = January | 2 = February | 3 = March |
4 = April | 5 = May | 6 = June | 7 = July |
8 = August | 9 = September | 10 = October |
11 = November | 12 = December",
MISSING_LIST = "",
PART_VAR = "v20000",
meta_data = md1
)
A subsequent call of the function may include the new variable:
seg_miss_2 <- com_segment_missingness(
study_data = sd2,
meta_data = md_temp,
group_vars = "EXAM_MONTH",
label_col = "LABEL",
threshold_value = 1,
color_gradient_direction = "above",
exclude_roles = c("secondary", "process")
)
The output of seg_miss_2$SummaryPlot now uses facets from the package ggplot2, such that each stratum of the new variable represents one facet:
This indicator uses a simple user-defined threshold. By default, the highest deviation from the threshold value is always displayed in dark red, irrespective of the absolute deviation. Classifying a deviation as critical is up to the user and involves qualitative interpretation.
This implementation uses one threshold to discriminate critical from non-critical values. For instance, if threshold_value = 9 and color_gradient_direction = "above", then all values lower than the threshold_value are considered normal (displayed in dark blue in the plot and flagged with GRADING = 0 in the data frame), and all values above the threshold_value are considered critical. The displayed color shifts to a darker red as the values deviate more from the threshold. All critical values are highlighted with GRADING = 1 in the summary data frame. By default, the highest values are always shown in dark red irrespective of the absolute deviation.
Conversely, if color_gradient_direction = "below" (for the same threshold_value = 9), all values greater than the threshold_value are assumed normal (displayed in dark blue and with GRADING = 0) and values lower than the threshold_value are considered as deviations.