For a data quality evaluation, we may be interested in a single examination or a group of examinations within a study. We refer to this as the segment level: data quality reporting is based on a subset of the data elements in the study database. For example, we might want to know whether we have the correct subset of participants for a given examination. This tutorial shows how to compute data quality indicators at the segment level using dataquieR.
To illustrate the functionalities, we use a subset of the example SHIP data that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.
We first load the package and the study data:
library(dataquieR)
sd1 <- prep_get_data_frame("ship")
To evaluate data quality at the segment level, we must provide descriptions and expectations about sections of the study data in dataquieR’s metadata format. For instance, for the SHIP example data, the metadata may be loaded as follows:
prep_load_workbook_like_file("ship_meta_v2") # load the metadata workbook ship_meta_v2.xlsx
meta_data_item <- prep_get_data_frame("item_level") # load the item_level sheet from ship_meta_v2.xlsx
meta_data_segment <- prep_get_data_frame("segment_level") # load the segment_level sheet from ship_meta_v2.xlsx
| STUDY_SEGMENT | SEGMENT_RECORD_COUNT | SEGMENT_ID_REF_TABLE | SEGMENT_RECORD_CHECK | SEGMENT_ID_VARS | SEGMENT_UNIQUE_ROWS | SEGMENT_PART_VARS |
|---|---|---|---|---|---|---|
| INTRO | 2154 | expected_id_segment | exact | id | TRUE | seg_part_intro |
| SOMATOMETRY | 500 | expected_id_segment | exact | id | TRUE | seg_part_somatometry |
| INTERVIEW | 2150 | expected_id_segment | exact | id | TRUE | seg_part_interview |
| LABORATORY | 500 | expected_id_segment | subset | id | TRUE | seg_part_laboratory |
A detailed description of how to set up this metadata file is available here. See here for an overview of dataquieR’s metadata usage.
Some quality indicators at the segment level are related to the Integrity of the data. These are: Unexpected data element set, Unexpected data record count, and Duplicates. These data quality
indicators are implemented in the
int_all_datastructure_segment function.
Additionally, one may compute some Completeness indicators: Missing values using the
com_segment_missingness function, and Non-response rate and Refusal rate using the
com_qualified_segment_missingness function.
See the documentation of int_all_datastructure_segment,
com_segment_missingness, and com_qualified_segment_missingness
for a detailed explanation of the input and output of the functions.
To evaluate Integrity at the segment level, we apply the function
int_all_datastructure_segment in the following way:
segment_structure <- int_all_datastructure_segment(
study_data = sd1,
meta_data = meta_data_item,
meta_data_segment = meta_data_segment
)
The function returns a nested list with the elements
SegmentTable, SegmentData and
SegmentDataList. SegmentTable is used for
reporting purposes, so the results are abbreviated. Hence, here we focus
on the readable output from SegmentData and
SegmentDataList. SegmentData shows a summary
of all the indicators computed per segment:
segment_structure$SegmentData
| Segment | Unexpected data record count N (%) | Unexpected data record count (Grading) | Unexpected data record set N (%) | Unexpected data record set (Grading) | Duplicates N (%) | Duplicates (Grading) | Unexp. Variables | Unexpected data element set N (%) | Unexpected data element set (Grading) |
|---|---|---|---|---|---|---|---|---|---|
| INTERVIEW | 4 (0.19%) | 1 | 3 (0.07%) | 1 | 0 (0%) | 0 | NA | 0 (0%) | 0 |
| INTRO | 0 (0%) | 0 | 0 (0%) | 0 | 0 (0%) | 0 | NA | 0 (0%) | 0 |
| LABORATORY | 1640 (328%) | 1 | 1642 (62.22%) | 1 | 0 (0%) | 0 | NA | 0 (0%) | 0 |
| SOMATOMETRY | 1653 (330.6%) | 1 | 1654 (62.37%) | 1 | 0 (0%) | 0 | NA | 0 (0%) | 0 |
Note that a Grading is given per indicator and segment
to show whether there are data quality issues (Grading = 1)
or not (Grading = 0).
SegmentDataList contains a more detailed output with six
data frames (one or more per indicator), each with a
Segment column, indicating the name of each part of the
study. The data frames are the following:
segment_structure$SegmentDataList$`Unexpected data record count`
| Segment | Check | Unexpected records | Number of records in data | Number of records in metadata | Number of mismatches | Percentage of mismatches | GRADING |
|---|---|---|---|---|---|---|---|
| INTRO | Records | FALSE | 2154 | 2154 | 0 | 0.000 | 0 |
| SOMATOMETRY | Records | TRUE | 2153 | 500 | 1653 | 330.600 | 1 |
| INTERVIEW | Records | TRUE | 2154 | 2150 | 4 | 0.186 | 1 |
| LABORATORY | Records | TRUE | 2140 | 500 | 1640 | 328.000 | 1 |
The table reports the level of the check (in this case only records
are relevant), whether unexpected records were found, the number of
records present in the study data, the number of records expected
according to the metadata, and, if unexpected records are detected, the
number and percentage of mismatches. Based on this result, a binary
GRADING is also provided. Here, only the INTRO segment
agrees with the expectations provided in the metadata, so it is the only
segment with GRADING = 0; the other three segments have GRADING = 1.
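The arithmetic behind the mismatch columns can be retraced in base R. The following is a minimal sketch, not dataquieR's internal implementation: the counts are copied from the table above, and the formula (absolute difference, expressed relative to the expected count) is an assumption that happens to reproduce the reported values. The data frame `record_counts` and its column names are made up for this illustration.

```r
# Counts copied from the table above; the formula below is an assumption
# that reproduces the reported values, not the package's internal code.
record_counts <- data.frame(
  Segment  = c("INTRO", "SOMATOMETRY", "INTERVIEW", "LABORATORY"),
  observed = c(2154, 2153, 2154, 2140),  # records found in the study data
  expected = c(2154, 500, 2150, 500)     # SEGMENT_RECORD_COUNT in the metadata
)

# Absolute difference between observed and expected record counts
record_counts$mismatches <- abs(record_counts$observed - record_counts$expected)

# Mismatches relative to the expected count, in percent
record_counts$pct_mismatch <- round(
  100 * record_counts$mismatches / record_counts$expected, 3
)

# Flag every segment with at least one mismatch
record_counts$grading <- as.integer(record_counts$mismatches > 0)

print(record_counts)
```

Because the percentage is taken relative to the expected count, it can exceed 100%, as it does for SOMATOMETRY and LABORATORY.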
segment_structure$SegmentDataList$Duplicates
| Check | Segment | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING | ID Vars |
|---|---|---|---|---|---|---|
| IDs | INTRO | FALSE | 0 | 0 | 0 | id |
| IDs | SOMATOMETRY | FALSE | 0 | 0 | 0 | id |
| IDs | INTERVIEW | FALSE | 0 | 0 | 0 | id |
| IDs | LABORATORY | FALSE | 0 | 0 | 0 | id |
These results are based on IDs. The columns indicate whether any
duplicates were found and, if so, the number and percentage of
duplicates. Any duplicated entries are also returned in a
vector. Here, GRADING = 0 for every segment because there are
no duplicate IDs.
segment_structure$SegmentDataList$int_sts_dupl_content
| Check | Segment | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
|---|---|---|---|---|---|
| Duplicate records | INTRO | FALSE | 0 | 0.000 | 0 |
| Duplicate records | SOMATOMETRY | FALSE | 0 | 0.000 | 0 |
| Duplicate records | INTERVIEW | TRUE | 978 | 45.404 | 1 |
| Duplicate records | LABORATORY | FALSE | 0 | 0.000 | 0 |
These results are based on row content (i.e. the uniqueness of rows
in the study data). The columns indicate whether any duplicates were
found and, if so, the number and percentage of duplicates.
Any duplicated entries are also returned in a vector.
Here, GRADING = 1 for the INTERVIEW segment, where 978 duplicates
(45.4%) were found. For all other segments, GRADING = 0 because there
are no duplicates.
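The difference between the ID-based and the content-based check can be illustrated with base R's `duplicated()`. This is a sketch on hypothetical toy data, not the package's implementation; in particular, `duplicated()` marks only the second and later occurrences of a row, and dataquieR may count duplicates differently.

```r
# Hypothetical toy data: 'id' is the ID variable; 'smoking' and 'school'
# stand in for the content variables of one segment.
toy <- data.frame(
  id      = c(1, 2, 3, 4, 5),
  smoking = c(0, 1, 0, 1, 0),
  school  = c(2, 3, 2, 1, 2)
)

# ID-based check: is any ID repeated?
id_dups <- duplicated(toy$id)

# Content-based check: is any row repeated once the ID is ignored?
content_dups <- duplicated(toy[, c("smoking", "school")])

sum(id_dups)              # number of repeated IDs
sum(content_dups)         # number of rows repeating earlier content
100 * mean(content_dups)  # percentage of such rows
```

All IDs are unique in this toy example, yet several rows share the same content. This is why a segment can pass the ID check and still be flagged by the content check.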
segment_structure$SegmentDataList$`Unexpected data element set`
| Segment | MISSING | Unexpected data element set (Percentage (0 to 100)) | Unexpected data element set (Number) | resp_vars | GRADING |
|---|---|---|---|---|---|
| INTRO | NA | 0 | 0 | NA | 0 |
| SOMATOMETRY | NA | 0 | 0 | NA | 0 |
| INTERVIEW | NA | 0 | 0 | NA | 0 |
| LABORATORY | NA | 0 | 0 | NA | 0 |
If there is an unexpected element set, the column
MISSING indicates whether it is missing from the study data
or the metadata. The next columns show the percentage and number of
unexpected element sets, respectively, while resp_vars
contains the names of the affected elements. According to the presence
of unexpected element sets, a binary GRADING is also
provided to flag the discrepancies.
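Conceptually, an unexpected element set is a mismatch between the variable names in the study data and those described in the metadata. A minimal base R sketch of this comparison, using hypothetical variable names rather than the SHIP data and not reflecting the package's internals, could look like this:

```r
# Hypothetical variable names, chosen only for illustration
study_vars <- c("id", "sex", "age", "height", "bmi")
meta_vars  <- c("id", "sex", "age", "height", "weight")

# Present in the study data but not described in the metadata
not_in_metadata <- setdiff(study_vars, meta_vars)

# Described in the metadata but absent from the study data
not_in_study_data <- setdiff(meta_vars, study_vars)

not_in_metadata
not_in_study_data
```

The two directions of `setdiff()` correspond to the two cases distinguished in the MISSING column.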
To evaluate Completeness at the segment level, we apply the functions
com_segment_missingness and
com_qualified_segment_missingness.
Missingness per segment can be calculated using:
seg_miss <- com_segment_missingness(
study_data = sd1,
meta_data = meta_data_item,
meta_data_segment = meta_data_segment,
label_col = "LONG_LABEL"
)
This check identifies subjects without any measurements within each
of the four study segments. The output is a list containing
ResultData and ReportSummaryTable.
ReportSummaryTable is an abbreviated table produced for
reporting purposes. ResultData shows an expanded output
with the number and percentage of missing values per segment:
seg_miss$ResultData
| Group | Examinations | No. Participants | No. missing segments | (%) of missing segments | threshold | direction |
|---|---|---|---|---|---|---|
| 1 | INTRO | 2154 | 0 | 0.00 | 10 | above |
| 1 | SOMATOMETRY | 2154 | 1 | 0.05 | 10 | above |
| 1 | INTERVIEW | 2154 | 0 | 0.00 | 10 | above |
| 1 | LABORATORY | 2154 | 14 | 0.65 | 10 | above |
The result is also displayed as a plot:
seg_miss$ReportSummaryTable
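The idea of a subject without any measurements in a segment can be sketched in base R. The example below uses hypothetical toy data and assumes that missing values are already coded as `NA`; it is not the package's implementation, which also has to handle the missing codes defined in the metadata.

```r
# Hypothetical toy data for one segment with two variables
toy <- data.frame(
  id  = 1:4,
  hdl = c(1.2, NA, 1.5, NA),
  ldl = c(3.1, NA, NA, 2.8)
)
lab_vars <- c("hdl", "ldl")

# TRUE for subjects with no observed value in any variable of the segment
segment_missing <- rowSums(!is.na(toy[, lab_vars])) == 0

n_missing   <- sum(segment_missing)         # number of missing segments
pct_missing <- 100 * mean(segment_missing)  # percentage of missing segments
```

Only the subject with `NA` in both variables counts as missing the segment; subjects with a single missing value do not.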

Exploring segment missingness over time requires additional information in the study data, for instance a variable that specifies the year of the examination:
# create a discretized version of the examination year
sd1$exyear <- as.integer(lubridate::year(sd1$exdate))
Information regarding this new exyear variable can be
added to the metadata using the function
prep_add_to_meta:
# add metadata for this variable
meta_data_item <- prep_add_to_meta(
VAR_NAMES = "exyear",
DATA_TYPE = "integer",
SCALE_LEVEL = "ordinal",
LABEL = "EX_YEAR_0",
VALUE_LABELS = "1997 = 1st | 1998 = 2nd | 1999 = 3rd | 2000 = 4th | 2001 = 5th",
VARIABLE_ROLE = "process",
STUDY_SEGMENT = "INTRO",
meta_data = meta_data_item
)
Using this discretized variable for examination year, the occurrence
pattern by year can subsequently be assessed using
com_segment_missingness:
MissSegs <- com_segment_missingness(
study_data = sd1,
meta_data = meta_data_item,
threshold_value = 1,
label_col = LABEL,
group_vars = "EX_YEAR_0",
direction = "high",
exclude_roles = "process"
)
MissSegs$SummaryPlot
## Error in base::tryCatch(base::withCallingHandlers({ :
## Internal error: inherits(r$ReportSummaryTable, "ReportSummaryTable") is
## not TRUE

The plot is a descriptor of the indicator Missing values. It illustrates that missing information from the laboratory examination is distributed unequally across examination years, with the highest proportion of missing data occurring in the 1st, 2nd, and 5th years.
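The grouped proportions underlying such a plot can be approximated in base R with `tapply()`. This sketch uses hypothetical toy data with a made-up logical column `lab_missing`; it is not how dataquieR computes the result.

```r
# Hypothetical toy data: examination year and whether the laboratory
# segment is missing for each participant
toy <- data.frame(
  exyear      = c(1997, 1997, 1998, 1998, 1998, 1999),
  lab_missing = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

# Percentage of participants with a missing segment, per examination year
pct_by_year <- 100 * tapply(toy$lab_missing, toy$exyear, mean)

pct_by_year
```

Comparing these percentages across years shows whether missingness is spread evenly or concentrated in particular examination periods.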
A more refined measure of missingness can be obtained when missing codes and their interpretations
are provided in the metadata. In this example, the column
MISSING_LIST_TABLE in the item level metadata contains the
name of another table where missing codes are given per variable or
segment:
| VAR_NAMES | MISSING_LIST_TABLE |
|---|---|
| id | NA |
| exdate | NA |
| sex | NA |
| age | NA |
| obs_bp | missing_table |
| dev_bp | missing_table |
| sbp1 | NA |
| sbp2 | NA |
| dbp1 | NA |
| dbp2 | NA |
| obs_soma | missing_table |
| height | missing_table |
| dev_length | missing_table |
| weight | missing_table |
| dev_weight | missing_table |
| waist | missing_table |
| obs_int | missing_table |
| school | missing_table |
| family | missing_table |
| smoking | missing_table |
| stroke | missing_table |
| myocard | missing_table |
| diab_known | missing_table |
| diab_age | missing_table |
| contraception | missing_table |
| income | missing_table |
| hdl | missing_table |
| ldl | missing_table |
| cholesterol | missing_table |
| seg_part_intro | segment_missing_table |
| seg_part_somatometry | segment_missing_table |
| seg_part_interview | segment_missing_table |
| seg_part_laboratory | segment_missing_table |
| exyear | NA |
The missing tables contain the following information:
| CODE_VALUE | CODE_LABEL | CODE_INTERPRET | CODE_CLASS |
|---|---|---|---|
| 99800 | JUMP - other reason | NE | JUMP |
| 99801 | JUMP - not applicable | NE | JUMP |
| 99802 | JUMP - design change | O | JUMP |
| 99900 | Missing - other reason | O | MISSING |
| 99901 | Missing - refusal | R | MISSING |
| 99902 | Missing - not assessable | NC | MISSING |
| 99903 | Missing - technical problem | O | MISSING |
| 99904 | Missing - not available (material) | O | MISSING |
| 99905 | Missing - not usable (material) | O | MISSING |
| 99906 | Missing - reason unknown | UO | MISSING |
| 99907 | Missing - optional value | NE | MISSING |
| 99908 | Deleted - other reason | O | MISSING |
| 99909 | Deleted - contradiction | O | MISSING |
| 99910 | Deleted - value outside limits | O | MISSING |
| 99912 | Value above detection limit | O | MISSING |
| 99913 | Value below detection limit | O | MISSING |
| 99914 | Data management ongoing | O | MISSING |
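To see how such a table qualifies missing values, the codes found in a variable can be looked up with `match()`. The sketch below copies a few rows from the table above and applies them to a hypothetical vector `x`; it only illustrates the lookup and does not reproduce the rate calculations performed by the package.

```r
# A few rows copied from the missing table above
missing_table <- data.frame(
  CODE_VALUE     = c(99800, 99901, 99902, 99906),
  CODE_INTERPRET = c("NE", "R", "NC", "UO")
)

# Hypothetical values of one variable: two measurements and three missing codes
x <- c(120, 99901, 135, 99906, 99901)

# Look up the interpretation of each value; real measurements yield NA
interpretation <- missing_table$CODE_INTERPRET[match(x, missing_table$CODE_VALUE)]

# Tabulate the interpretations of the coded values
table(interpretation)
```

Values that do not appear in the missing table are treated as measurements here, so only the coded entries are counted.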
This allows us to calculate missingness rates using
com_qualified_segment_missingness:
seg_miss_qual <- com_qualified_segment_missingness(
study_data = sd1,
meta_data = meta_data_item,
meta_data_segment = meta_data_segment,
label_col = "LONG_LABEL"
)
The output contains the elements SegmentTable and
SegmentData. SegmentTable is the abbreviated
output for reporting, while SegmentData shows the
missingness rates:
seg_miss_qual$SegmentData
| Segment | Non-response rate (Percentage (0 to 100)) | Refusal rate (Percentage (0 to 100)) |
|---|---|---|
| INTRO | 0% | 0% |
| SOMATOMETRY | 0.19% | 0.05% |
| INTERVIEW | 0.51% | 0.32% |
| LABORATORY | 0.79% | 0.05% |