Structural data set error includes the indicators: Unexpected data element count, Unexpected data element set, Unexpected data record count, Unexpected data record set, and Duplicates. These data quality indicators can be applied at the data frame level or at the segment level, and they are implemented in the functions int_all_datastructure_dataframe and int_all_datastructure_segment, respectively.

Data frame level

Structural data set error at the data frame level can be assessed using:

# Load dataquieR
library(dataquieR)

# Load data
sd1 <- prep_get_data_frame("ship")

# Load metadata
file_name <- system.file("extdata", "ship_meta_v2.xlsx", package = "dataquieR")
prep_load_workbook_like_file(file_name)
meta_data_item <- prep_get_data_frame("item_level") # item_level is a sheet in ship_meta_v2.xlsx
meta_data_dataframe <- prep_get_data_frame("dataframe_level") # dataframe_level is another sheet in ship_meta_v2.xlsx

# Apply indicator function
dataframe_structure <- int_all_datastructure_dataframe(
   meta_data_dataframe = meta_data_dataframe,
   meta_data = meta_data_item
)

The function returns a nested list with the elements DataframeTable and DataframeDataList. DataframeTable is intended for reporting, so its results are abbreviated. We therefore focus on the more readable output in DataframeDataList. This list contains six data frames (one or more per indicator), each with a Data frame column that gives the name of each study data frame analyzed. The data frames are:

Unexpected data element count

dataframe_structure$DataframeDataList$`Unexpected data element count`
| Check | Data frame | Unexpected elements | Number of elements in data | Number of elements in metadata | Number of mismatches | Percentage of mismatches | GRADING |
|---|---|---|---|---|---|---|---|
| Elements | ship | TRUE | 33 | 29 | 4 | 13.793 | 1 |


The columns indicate whether unexpected elements (e.g., variables) were found, the number of elements present in the study data, the number of elements in the metadata, and, if unexpected elements are detected, the number and percentage of mismatches. Based on this result, a binary GRADING flags any discrepancy. In this case, GRADING = 1 because the study data contain 33 elements while the metadata list 29.
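As a quick check of how the numbers in this table relate to each other, the following base R sketch recomputes the mismatch count, the percentage, and the flag from the two element counts. It assumes that the percentage is taken relative to the number of elements in the metadata. That assumption matches the printed value (4 / 29 is about 13.793 %), but it is not taken from the dataquieR source code, and the variable names are illustrative only.

```r
# Counts taken from the table above
n_elements_data <- 33      # elements (variables) found in the study data
n_elements_metadata <- 29  # elements expected according to the metadata

# Assumption: mismatches are the absolute difference of the two counts
n_mismatches <- abs(n_elements_data - n_elements_metadata)

# Assumption: the percentage uses the metadata count as the denominator
pct_mismatches <- round(100 * n_mismatches / n_elements_metadata, 3)

# Binary flag: 1 if there is any mismatch, 0 otherwise
grading <- as.integer(n_mismatches > 0)

print(c(mismatches = n_mismatches, percentage = pct_mismatches, grading = grading))
```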

Unexpected data element set

dataframe_structure$DataframeDataList$`Unexpected data element set`
| MISSING | Unexpected data element set: Percentage (0 to 100) | Unexpected data element set: Number | resp_vars | GRADING | Data frame |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | ship |


If there is an unexpected element set, the column MISSING indicates whether it is missing from the study data or the metadata. The next columns show the percentage and number of unexpected element sets, respectively, while resp_vars contains the names of the affected elements. Note that the table above shows only zeros because no unexpected elements were identified, so GRADING = 0.
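Conceptually, an element set check compares the variable names found in the study data with those listed in the metadata. The base R sketch below illustrates that idea with made-up variable names. It is not dataquieR's implementation, and `study_vars` and `metadata_vars` are hypothetical.

```r
# Hypothetical variable names, for illustration only
study_vars <- c("id", "sex", "age", "sbp")
metadata_vars <- c("id", "sex", "age", "dbp")

# Elements present in the study data but not described in the metadata
missing_from_metadata <- setdiff(study_vars, metadata_vars)

# Elements described in the metadata but absent from the study data
missing_from_study_data <- setdiff(metadata_vars, study_vars)

# Flag a discrepancy if either set is non-empty
grading <- as.integer(length(missing_from_metadata) > 0 ||
                        length(missing_from_study_data) > 0)

print(missing_from_metadata)
print(missing_from_study_data)
print(grading)
```

Keeping the two directions separate mirrors the role of the MISSING column, which records on which side an element is absent.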

Unexpected data record count

dataframe_structure$DataframeDataList$`Unexpected data record count`
| Check | Data frame | Unexpected records | Number of records in data | Number of records in metadata | Number of mismatches | Percentage of mismatches | GRADING |
|---|---|---|---|---|---|---|---|
| Records | ship | FALSE | 2154 | 2154 | 0 | 0 | 0 |


The columns indicate whether unexpected records were found, the number of records present in the study data, the number of records expected according to the metadata, and the number and percentage of mismatches. Here, both counts are 2154, so GRADING = 0.

Unexpected data record set

dataframe_structure$DataframeDataList$`Unexpected data record set`
| Check | Data frame | Unexpected records in set | Number of records in data | Number of records in metadata | Number of mismatches | Percentage of mismatches | Expected match type | Actual match type | GRADING |
|---|---|---|---|---|---|---|---|---|---|
| Record set | ship | FALSE | 2154 | 2154 | 0 | 0 | exact | exact | 0 |


In this data frame, the columns show whether unexpected records were found in the set, the number of records present in the study data, the number of records expected according to the metadata, the number and percentage of mismatches, and the expected and actual match type. In this example, GRADING = 0 because no unexpected records were found and the actual match type equals the expected one (exact).

Duplicates

ID duplicates

dataframe_structure$DataframeDataList$Duplicates
| Check | Data frame | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
|---|---|---|---|---|---|
| IDs | ship | FALSE | 0 | 0 | 0 |


These results are based on IDs. The columns indicate whether any duplicates were found and, if so, the number and percentage of duplicates. Any duplicated entries are also returned in a vector. Here, GRADING = 0 because there are no duplicate IDs.

Row duplicates

dataframe_structure$DataframeDataList$int_sts_dupl_row
| Check | Data frame | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
|---|---|---|---|---|---|
| Duplicates | ship | FALSE | 0 | 0 | 0 |


These results are based on row content (i.e., the uniqueness of rows in the study data). The columns indicate whether any duplicates were found and, if so, the number and percentage of duplicates. Any duplicated entries are also returned in a vector. GRADING = 0 in this case because there are no duplicate rows.
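The difference between ID duplicates and row duplicates can be illustrated with a small base R sketch. The toy data frame and its column names are made up, and the sketch counts every repeated occurrence after the first as a duplicate. Whether dataquieR counts duplicates in exactly this way is an assumption.

```r
# Toy data, for illustration only
toy <- data.frame(
  id = c(1, 2, 2, 3, 4),
  value = c(10, 20, 25, 30, 30)
)

# ID duplicates: the same identifier occurs in more than one row
id_is_duplicate <- duplicated(toy$id)
duplicated_ids <- unique(toy$id[id_is_duplicate])

# Row duplicates: an entire row repeats an earlier row
row_is_duplicate <- duplicated(toy)

n_id_duplicates <- sum(id_is_duplicate)
n_row_duplicates <- sum(row_is_duplicate)

print(duplicated_ids)
print(c(id_duplicates = n_id_duplicates, row_duplicates = n_row_duplicates))
```

In this toy example the identifier 2 appears twice, but the two rows differ in `value`, so there is one ID duplicate and no row duplicate.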

Segment level

To evaluate Structural data set error at the segment level, we apply the function int_all_datastructure_segment in the following way:

# Load dataquieR
library(dataquieR)

# Load data
sd1 <- prep_get_data_frame("ship")

# Load metadata
file_name <- system.file("extdata", "ship_meta_v2.xlsx", package = "dataquieR")
prep_load_workbook_like_file(file_name)
meta_data_item <- prep_get_data_frame("item_level") # item_level is a sheet in ship_meta_v2.xlsx
meta_data_segment <- prep_get_data_frame("segment_level") # segment_level is another sheet in ship_meta_v2.xlsx

# Apply indicator function
segment_structure <- int_all_datastructure_segment(
  study_data = sd1,
  meta_data = meta_data_item,
  meta_data_segment = meta_data_segment
)

The function returns a nested list with the elements SegmentTable, SegmentData and SegmentDataList. SegmentTable is used for reporting purposes, so the results are abbreviated. Hence, here we focus on the readable output from SegmentData and SegmentDataList. SegmentData shows a summary of all the indicators computed per segment:

segment_structure$SegmentData
| Segment | Unexpected data record count N (%) | Unexpected data record count (Grading) | Unexpected data record set N (%) | Unexpected data record set (Grading) | Duplicates N (%) | Duplicates (Grading) | Unexpected data element set N (%) | Unexpected data element set (Grading) |
|---|---|---|---|---|---|---|---|---|
| INTERVIEW | 4 (0.19%) | 1 | 1 (0.05%) | 1 | 0 (0%) | 0 | 0 (0%) | 0 |
| INTRO | 0 (0%) | 0 | 1 (0.05%) | 1 | 0 (0%) | 0 | 0 (0%) | 0 |
| LABORATORY | 1640 (328%) | 1 | 1 (0.2%) | 0 | 0 (0%) | 0 | 0 (0%) | 0 |
| SOMATOMETRY | 1653 (330.6%) | 1 | 1 (0.2%) | 1 | 0 (0%) | 0 | 0 (0%) | 0 |


Note that a Grading is given per indicator and segment to show whether there are data quality issues (Grading = 1) or not (Grading = 0).
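To see at a glance which segments and indicators are flagged, the grading columns can be summed. The base R sketch below copies the gradings from the SegmentData output shown above into a small data frame. The short column names are illustrative and are not the ones used by dataquieR.

```r
# Grading columns copied from the SegmentData output above;
# the short column names are illustrative, not those used by dataquieR
gradings <- data.frame(
  segment = c("INTERVIEW", "INTRO", "LABORATORY", "SOMATOMETRY"),
  record_count = c(1, 0, 1, 1),
  record_set = c(1, 1, 0, 1),
  duplicates = c(0, 0, 0, 0),
  element_set = c(0, 0, 0, 0)
)

# Indicators that are flagged in at least one segment
flagged_indicators <- names(which(colSums(gradings[, 2:5]) > 0))

# Number of flagged indicators per segment
gradings$n_flagged <- rowSums(gradings[, 2:5])

print(flagged_indicators)
print(gradings[, c("segment", "n_flagged")])
```

In this output, only the two record-based indicators are flagged, and every segment has at least one flagged indicator.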

SegmentDataList contains a more detailed output with six data frames (one or more per indicator), each with a Segment column, indicating the name of each part of the study. The data frames are the following:

Unexpected data record count

segment_structure$SegmentDataList$`Unexpected data record count`
| Segment | Check | Unexpected records | Number of records in data | Number of records in metadata | Number of mismatches | Percentage of mismatches | GRADING |
|---|---|---|---|---|---|---|---|
| INTRO | Records | FALSE | 2154 | 2154 | 0 | 0.000 | 0 |
| SOMATOMETRY | Records | TRUE | 2153 | 500 | 1653 | 330.600 | 1 |
| INTERVIEW | Records | TRUE | 2154 | 2150 | 4 | 0.186 | 1 |
| LABORATORY | Records | TRUE | 2140 | 500 | 1640 | 328.000 | 1 |


The table reports the level of the check (in this case only records are relevant), whether unexpected records were found, the number of records present in the study data, the number of records expected according to the metadata, and, if unexpected records are detected, the number and percentage of mismatches. Based on this result, a binary GRADING is also provided. Here, only the INTRO segment agrees with the expectations provided in the metadata and therefore has GRADING = 0; the other three segments have GRADING = 1.
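Percentages above 100 may look surprising at first. The base R sketch below recomputes the mismatch columns from the record counts in the table, assuming that the percentage is expressed relative to the number of records expected according to the metadata. That assumption reproduces the printed values but is not taken from the dataquieR source code. The column names are illustrative.

```r
# Record counts copied from the table above
record_counts <- data.frame(
  segment = c("INTRO", "SOMATOMETRY", "INTERVIEW", "LABORATORY"),
  n_data = c(2154, 2153, 2154, 2140),
  n_metadata = c(2154, 500, 2150, 500)
)

# Assumption: mismatches are the absolute difference of the two counts
record_counts$n_mismatches <- abs(record_counts$n_data - record_counts$n_metadata)

# Assumption: the percentage uses the metadata count as the denominator,
# so it exceeds 100 when the data hold far more records than expected
record_counts$pct_mismatches <- round(
  100 * record_counts$n_mismatches / record_counts$n_metadata, 3
)

# Binary flag: 1 if there is any mismatch, 0 otherwise
record_counts$grading <- as.integer(record_counts$n_mismatches > 0)

print(record_counts)
```

For SOMATOMETRY, for example, 1653 mismatches against 500 expected records give 330.6 %.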

ID duplicates

segment_structure$SegmentDataList$Duplicates
| Check | Segment | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
|---|---|---|---|---|---|
| IDs | INTRO | FALSE | 0 | 0 | 0 |
| IDs | SOMATOMETRY | FALSE | 0 | 0 | 0 |
| IDs | INTERVIEW | FALSE | 0 | 0 | 0 |
| IDs | LABORATORY | FALSE | 0 | 0 | 0 |


These results are based on IDs. The columns indicate whether any duplicates were found and, if so, the number and percentage of duplicates. Any duplicated entries are also returned in a vector. Here, GRADING = 0 for every segment because there are no duplicate IDs.

Row duplicates

segment_structure$SegmentDataList$int_sts_dupl_content
| Check | Segment | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
|---|---|---|---|---|---|
| Duplicate records | INTRO | FALSE | 0 | 0 | 0 |
| Duplicate records | SOMATOMETRY | FALSE | 0 | 0 | 0 |
| Duplicate records | INTERVIEW | FALSE | 0 | 0 | 0 |
| Duplicate records | LABORATORY | FALSE | 0 | 0 | 0 |


These results are based on row content (i.e., the uniqueness of rows in the study data). The columns indicate whether any duplicates were found and, if so, the number and percentage of duplicates. Any duplicated entries are also returned in a vector. GRADING = 0 for every segment because there are no duplicate rows.

Unexpected data element set

segment_structure$SegmentDataList$`Unexpected data element set`
| Segment | MISSING | Unexpected data element set: Percentage (0 to 100) | Unexpected data element set: Number | resp_vars | GRADING |
|---|---|---|---|---|---|
| INTRO | NA | 0 | 0 | NA | 0 |
| SOMATOMETRY | NA | 0 | 0 | NA | 0 |
| INTERVIEW | NA | 0 | 0 | NA | 0 |
| LABORATORY | NA | 0 | 0 | NA | 0 |


If there is an unexpected element set, the column MISSING indicates whether it is missing from the study data or the metadata. The next columns show the percentage and number of unexpected element sets, respectively, while resp_vars contains the names of the affected elements. According to the presence of unexpected element sets, a binary GRADING is also provided to flag the discrepancies.
