Introduction

A data quality evaluation may target a single examination or a group of examinations within a study. We refer to this as the segment level: only a subset of the data elements in the study database is used for data quality reporting. For example, we might want to know whether we have the correct subset of participants for a given examination. This tutorial shows how to compute data quality indicators at the segment level using dataquieR.

Example data

To illustrate the functionalities, we use a subset of the example SHIP data that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

The study data can be loaded as follows:

library(dataquieR)
sd1 <- prep_get_data_frame("ship")

Required metadata

To evaluate data quality at the segment level, we must provide descriptions and expectations about sections of the study data in dataquieR’s metadata format. For instance, for the SHIP example data, the metadata may be loaded as follows:

prep_load_workbook_like_file("ship_meta_v2") # load the metadata workbook ship_meta_v2.xlsx
meta_data_item <- prep_get_data_frame("item_level") # load the item_level sheet from ship_meta_v2.xlsx
meta_data_segment <- prep_get_data_frame("segment_level") # load the segment_level sheet from ship_meta_v2.xlsx

STUDY_SEGMENT	SEGMENT_RECORD_COUNT	SEGMENT_ID_REF_TABLE	SEGMENT_RECORD_CHECK	SEGMENT_ID_VARS	SEGMENT_UNIQUE_ROWS	SEGMENT_PART_VARS
INTRO	2154	expected_id_segment	exact	id	TRUE	seg_part_intro
SOMATOMETRY	500	expected_id_segment	exact	id	TRUE	seg_part_somatometry
INTERVIEW	2150	expected_id_segment	exact	id	TRUE	seg_part_interview
LABORATORY	500	expected_id_segment	subset	id	TRUE	seg_part_laboratory

A detailed description of how to set up this metadata file is available here. See here for an overview of dataquieR’s metadata usage.
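To make the structure of the segment-level metadata concrete, the table above can be rebuilt by hand as a plain data frame. This is only a sketch in base R: the object name meta_data_segment_sketch is hypothetical, and whether a hand-built table can replace the workbook sheet in every dataquieR call is not verified here.

```r
# Hand-built copy of the segment-level metadata table shown above (base R only).
# The object name is hypothetical and used for illustration.
meta_data_segment_sketch <- data.frame(
  STUDY_SEGMENT        = c("INTRO", "SOMATOMETRY", "INTERVIEW", "LABORATORY"),
  SEGMENT_RECORD_COUNT = c(2154L, 500L, 2150L, 500L),
  SEGMENT_ID_REF_TABLE = "expected_id_segment",   # same reference table for all segments
  SEGMENT_RECORD_CHECK = c("exact", "exact", "exact", "subset"),
  SEGMENT_ID_VARS      = "id",
  SEGMENT_UNIQUE_ROWS  = TRUE,
  SEGMENT_PART_VARS    = c("seg_part_intro", "seg_part_somatometry",
                           "seg_part_interview", "seg_part_laboratory"),
  stringsAsFactors = FALSE
)

# Inspect column names and types
str(meta_data_segment_sketch)
```

Each row describes one segment, and each column holds one expectation about it, such as the expected number of records or the ID variable used to check for duplicates.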

Data quality indicators at the segment level

Some quality indicators at the segment level are related to the Integrity of the data. These are: Unexpected data element set, Unexpected data record count, and Duplicates. These data quality indicators are implemented in the int_all_datastructure_segment function.

Additionally, one may compute some Completeness indicators: Missing values using the com_segment_missingness function, and Non-response rate and Refusal rate using the com_qualified_segment_missingness function.

See the documentation of int_all_datastructure_segment, com_segment_missingness and com_qualified_segment_missingness for a detailed explanation of the input and output of the functions.

Example usage and output

Integrity

To evaluate Integrity at the segment level, we apply the function int_all_datastructure_segment in the following way:

segment_structure <- int_all_datastructure_segment(
  study_data = sd1,
  meta_data = meta_data_item,
  meta_data_segment = meta_data_segment
)

The function returns a nested list with the elements SegmentTable, SegmentData and SegmentDataList. SegmentTable is used for reporting purposes, so the results are abbreviated. Hence, here we focus on the readable output from SegmentData and SegmentDataList. SegmentData shows a summary of all the indicators computed per segment:

segment_structure$SegmentData

Segment	Unexpected data record count N (%)	Unexpected data record count (Grading)	Unexpected data record set N (%)	Unexpected data record set (Grading)	Unexp. Variables
INTERVIEW	4 (0.19%)	1	3 (0.07%)	1	NA
INTRO	0 (0%)	0	0 (0%)	0	NA
LABORATORY	1640 (328%)	1	1642 (62.22%)	1	NA
SOMATOMETRY	1653 (330.6%)	1	1654 (62.37%)	1	NA

Note that a Grading is given per indicator and segment to show whether there are data quality issues (Grading = 1) or not (Grading = 0).

SegmentDataList contains a more detailed output with six data frames (one or more per indicator), each with a Segment column, indicating the name of each part of the study. The data frames are the following:

Unexpected data record count

segment_structure$SegmentDataList$`Unexpected data record count`

Segment	Check	Unexpected records	Number of records in data	Number of records in metadata	Number of mismatches	Percentage of mismatches	GRADING
INTRO	Records	FALSE	2154	2154	0	0.000	0
SOMATOMETRY	Records	TRUE	2153	500	1653	330.600	1
INTERVIEW	Records	TRUE	2154	2150	4	0.186	1
LABORATORY	Records	TRUE	2140	500	1640	328.000	1

The table reports the level of the check (in this case only records are relevant), whether unexpected records were found, the number of records present in the study data, the number of records expected according to the metadata, and, if unexpected records are detected, the number and percentage of mismatches. Based on this result, a binary GRADING is also provided. Here, only the INTRO segment agrees with the expectations provided in the metadata (GRADING = 0); the other three segments are flagged with GRADING = 1.
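The numbers in the table above can be reproduced with a few lines of base R. The sketch below assumes that the number of mismatches is the absolute difference between observed and expected record counts, and that the percentage is taken relative to the expected count. Both assumptions are consistent with the displayed values but are not taken from the dataquieR source code. The SEGMENT_RECORD_CHECK setting is ignored here.

```r
# Record counts copied from the table above
observed <- c(INTRO = 2154, SOMATOMETRY = 2153, INTERVIEW = 2154, LABORATORY = 2140)
expected <- c(INTRO = 2154, SOMATOMETRY = 500,  INTERVIEW = 2150, LABORATORY = 500)

# Assumption: mismatches are the absolute difference of the two counts
mismatches <- abs(observed - expected)

# Assumption: the percentage is relative to the expected count
pct <- round(100 * mismatches / expected, 3)

# Flag a segment as soon as there is at least one mismatch
grading <- as.integer(mismatches > 0)

data.frame(Segment = names(observed), mismatches, pct, grading, row.names = NULL)
```

Because the percentage is relative to the expected count, it can exceed 100% when a segment contains far more records than the metadata anticipate, as for SOMATOMETRY and LABORATORY.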

ID duplicates

segment_structure$SegmentDataList$Duplicates

Check	Segment	Any duplicates	ID Vars
IDs	INTRO	FALSE	id
IDs	SOMATOMETRY	FALSE	id
IDs	INTERVIEW	FALSE	id
IDs	LABORATORY	FALSE	id

These results are based on IDs. The column Any duplicates indicates whether duplicated IDs were found, and ID Vars names the ID variable that was checked. If duplicates are found, their number and percentage are reported, and the duplicated entries are returned in a vector. In this example, there are no duplicate IDs in any segment, so none of them is flagged (GRADING = 0).
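The idea behind the ID check can be illustrated in base R on a toy vector of IDs. This is a minimal sketch of the concept, not the implementation used by dataquieR; the vector ids is made up for illustration.

```r
# Toy ID vector (made up); the ID 102 occurs twice
ids <- c(101, 102, 103, 102, 104)

# Are there any duplicated IDs at all?
any_dup <- anyDuplicated(ids) > 0

# Which IDs occur more than once?
dup_ids <- unique(ids[duplicated(ids)])

any_dup
dup_ids
```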

Row duplicates

segment_structure$SegmentDataList$int_sts_dupl_content

Check	Segment	Any duplicates	Number of duplicates	Percentage of duplicates	GRADING
Duplicate records	INTRO	FALSE	0	0.000	0
Duplicate records	SOMATOMETRY	FALSE	0	0.000	0
Duplicate records	INTERVIEW	TRUE	978	45.404	1
Duplicate records	LABORATORY	FALSE	0	0.000	0

These results are based on row content (i.e. the uniqueness of rows in the study data). The columns indicate whether any duplicates were found, and if so, the number and percentage of duplicates are reported. Any duplicated entries are also returned in a vector. In this case, GRADING = 1 for the INTERVIEW segment, where 978 duplicated rows (45.404%) were found; the other segments contain no duplicated rows (GRADING = 0).
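A content-based duplicate check can be sketched in base R with duplicated(), which flags every row that repeats an earlier row. The toy data frame and the choice to compare only the measurement columns (leaving out the ID) are assumptions for illustration; dataquieR may select columns and count duplicates differently.

```r
# Toy data (made up): rows 1 and 3 carry identical measurements
toy <- data.frame(
  id  = c(1, 2, 3, 4),
  sbp = c(120, 130, 120, 130),
  dbp = c(80, 85, 80, 90)
)

# Assumption: compare row content without the ID column
content <- toy[, c("sbp", "dbp")]

# duplicated() marks the second and later occurrences of a row
is_dup  <- duplicated(content)
n_dup   <- sum(is_dup)
pct_dup <- round(100 * n_dup / nrow(content), 3)

n_dup
pct_dup
```

If the ID column were included in the comparison, no row in this toy example would count as a duplicate, which is why the choice of columns matters for this indicator.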

Unexpected data element set

segment_structure$SegmentDataList$`Unexpected data element set`

Segment	MISSING	resp_vars
INTRO	NA	NA
SOMATOMETRY	NA	NA
INTERVIEW	NA	NA
LABORATORY	NA	NA

If there is an unexpected element set, the column MISSING indicates whether it is missing from the study data or the metadata. The next columns show the percentage and number of unexpected element sets, respectively, while resp_vars contains the names of the affected elements. Depending on the presence of unexpected element sets, a binary GRADING is also provided to flag the discrepancies. In this example, all entries are NA, which suggests that no unexpected element sets were detected.
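Conceptually, this check compares the variables present in the study data with those declared in the metadata. The following base R sketch uses two made-up name vectors and setdiff() to show both directions of the comparison; it illustrates the idea only and does not reproduce the dataquieR output format.

```r
# Made-up variable names for illustration
study_vars <- c("id", "sex", "age", "extra_var")  # columns found in the study data
meta_vars  <- c("id", "sex", "age", "height")     # variables declared in the metadata

# Present in the study data but not described in the metadata
missing_from_metadata <- setdiff(study_vars, meta_vars)

# Described in the metadata but absent from the study data
missing_from_study <- setdiff(meta_vars, study_vars)

missing_from_metadata
missing_from_study
```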

Completeness

To evaluate Completeness at the segment level, we apply the functions com_segment_missingness and com_qualified_segment_missingness.

Missing values

Missingness per segment can be calculated using:

seg_miss <- com_segment_missingness(
  study_data = sd1,
  meta_data = meta_data_item,
  meta_data_segment = meta_data_segment,
  label_col = "LONG_LABEL"
)

This check identifies subjects without any measurements within each of the four study segments. The output is a list containing ResultData and ReportSummaryTable. ReportSummaryTable is an abbreviated table produced for reporting purposes. ResultData shows an expanded output with the number and percentage of missing values per segment:

seg_miss$ResultData

Group	Examinations	No. Participants	No. missing segments	(%) of missing segments	threshold	direction
1	INTRO	2154	0	0.00	10	above
1	SOMATOMETRY	2154	1	0.05	10	above
1	INTERVIEW	2154	0	0.00	10	above
1	LABORATORY	2154	14	0.65	10	above
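The notion of a missing segment, i.e. a subject without any measurement in a segment, can be illustrated in base R. The toy data, the segments list and the use of NA as the only marker of missingness are assumptions for this sketch; com_segment_missingness relies on the metadata and may treat missing codes differently.

```r
# Toy study data (made up); NA marks a missing measurement
toy <- data.frame(
  id     = 1:4,
  hdl    = c(1.2, NA, 1.5, NA),
  ldl    = c(3.1, NA, NA, NA),
  height = c(170, 165, NA, 180)
)

# Hypothetical assignment of variables to segments
segments <- list(LABORATORY = c("hdl", "ldl"), SOMATOMETRY = "height")

# A segment is missing for a subject if none of its variables is observed
n_missing <- vapply(
  segments,
  function(vars) sum(rowSums(!is.na(toy[, vars, drop = FALSE])) == 0),
  integer(1)
)
pct_missing <- round(100 * n_missing / nrow(toy), 2)

n_missing
pct_missing
```

In this toy example, subjects 2 and 4 have no laboratory value at all, so the LABORATORY segment is missing for two of four subjects, while subject 3 still counts as observed because one laboratory value is present.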

The result is also displayed as a plot:

seg_miss$ReportSummaryTable

Exploring segment missingness over time requires additional information in the study data, for instance, a variable that specifies the year of the examination:

# create a discretized version of the examination year
sd1$exyear <- as.integer(lubridate::year(sd1$exdate))

Information regarding this new exyear variable can be added to the metadata using the function prep_add_to_meta:

# add metadata for this variable
meta_data_item <- prep_add_to_meta(
  VAR_NAMES = "exyear", 
  DATA_TYPE = "integer",
  SCALE_LEVEL = "ordinal", 
  LABEL = "EX_YEAR_0",
  VALUE_LABELS = "1997 = 1st | 1998 = 2nd | 1999 = 3rd | 2000 = 4th | 2001 = 5th",
  VARIABLE_ROLE = "process",
  STUDY_SEGMENT = "INTRO",
  meta_data = meta_data_item
)

Using this discretized variable for examination year, the occurrence pattern by year can subsequently be assessed using com_segment_missingness:

MissSegs <- com_segment_missingness(
  study_data = sd1, 
  meta_data = meta_data_item, 
  threshold_value = 1, 
  label_col = LABEL,
  group_vars = "EX_YEAR_0",
  direction = "high",
  exclude_roles = "process"
)

MissSegs$SummaryPlot

The plot is a descriptor of the indicator Missing values. It illustrates that missing information from the laboratory examination is distributed unequally across examination years, with the highest proportion of missing data occurring in the 1st, 2nd, and 5th years.
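The same kind of pattern can be tabulated without a plot. The sketch below computes the percentage of missing segments per examination year in base R; the toy data and the column name lab_missing are made up for illustration and do not reproduce the SHIP results.

```r
# Toy data (made up): one row per subject, with examination year and
# a flag indicating whether the laboratory segment is missing
toy <- data.frame(
  exyear      = c(1997, 1997, 1998, 1998, 1999, 1999),
  lab_missing = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Percentage of subjects with a missing laboratory segment, per year
pct_by_year <- tapply(toy$lab_missing, toy$exyear, function(x) 100 * mean(x))

pct_by_year
```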

Non-response and refusal rates

A more refined measure of missingness can be obtained when missing codes and their interpretations are provided in the metadata. In this example, the column MISSING_LIST_TABLE in the item-level metadata contains the name of another table where missing codes are given per variable or segment:

VAR_NAMES	MISSING_LIST_TABLE
id	NA
exdate	NA
sex	NA
age	NA
obs_bp	missing_table
dev_bp	missing_table
sbp1	NA
sbp2	NA
dbp1	NA
dbp2	NA
obs_soma	missing_table
height	missing_table
dev_length	missing_table
weight	missing_table
dev_weight	missing_table
waist	missing_table
obs_int	missing_table
school	missing_table
family	missing_table
smoking	missing_table
stroke	missing_table
myocard	missing_table
diab_known	missing_table
diab_age	missing_table
contraception	missing_table
income	missing_table
hdl	missing_table
ldl	missing_table
cholesterol	missing_table
seg_part_intro	segment_missing_table
seg_part_somatometry	segment_missing_table
seg_part_interview	segment_missing_table
seg_part_laboratory	segment_missing_table
exyear	NA

The missing tables contain the following information:

CODE_VALUE	CODE_LABEL	CODE_INTERPRET	CODE_CLASS
99800	JUMP - other reason	NE	JUMP
99801	JUMP - not applicable	NE	JUMP
99802	JUMP - design change	O	JUMP
99900	Missing - other reason	O	MISSING
99901	Missing - refusal	R	MISSING
99902	Missing - not assessable	NC	MISSING
99903	Missing - technical problem	O	MISSING
99904	Missing - not available (material)	O	MISSING
99905	Missing - not usable (material)	O	MISSING
99906	Missing - reason unknown	UO	MISSING
99907	Missing - optional value	NE	MISSING
99908	Deleted - other reason	O	MISSING
99909	Deleted - contradiction	O	MISSING
99910	Deleted - value outside limits	O	MISSING
99912	Value above detection limit	O	MISSING
99913	Value below detection limit	O	MISSING
99914	Data management ongoing	O	MISSING
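To show how such a code table can be used, the following base R sketch looks up the interpretation of each observed value with match(). The object names and the toy values are made up, and only four codes from the table above are included. The sketch covers the lookup step only; it does not reproduce how dataquieR derives non-response or refusal rates.

```r
# Excerpt of the missing code table shown above (four codes only)
missing_table_sketch <- data.frame(
  CODE_VALUE     = c(99900, 99901, 99902, 99906),
  CODE_INTERPRET = c("O", "R", "NC", "UO"),
  stringsAsFactors = FALSE
)

# Toy measurements (made up): regular values mixed with missing codes
values <- c(172, 99901, 165, 99906, 99901, 180)

# Look up the interpretation; regular measurements yield NA
interpret <- missing_table_sketch$CODE_INTERPRET[
  match(values, missing_table_sketch$CODE_VALUE)
]

# Tabulate the interpretations, keeping regular measurements as NA
table(interpret, useNA = "ifany")
```

Here, two of the six values carry the refusal code 99901 (interpretation R), one carries 99906 (UO), and the remaining three are regular measurements.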

This allows us to calculate missingness rates using com_qualified_segment_missingness:

seg_miss_qual <- com_qualified_segment_missingness(
    study_data = sd1, 
    meta_data = meta_data_item, 
    meta_data_segment = meta_data_segment, 
    label_col = "LONG_LABEL"
)

The output contains the elements SegmentTable and SegmentData. SegmentTable is the abbreviated output for reporting, while SegmentData shows the missingness rates:

seg_miss_qual$SegmentData

Segment	Non-response rate (Percentage (0 to 100))	Refusal rate (Percentage (0 to 100))
INTRO	0%	0%
SOMATOMETRY	0.19%	0.05%
INTERVIEW	0.51%	0.32%
LABORATORY	0.79%	0.05%
