Structural data set error includes the indicators Unexpected data element count, Unexpected data element set, Unexpected data record count, Unexpected data record set, and Duplicates. These data quality indicators can be applied at the data frame level or at the segment level, and they are implemented in the functions int_all_datastructure_dataframe and int_all_datastructure_segment, respectively.
Structural data set error at the data frame level can be assessed using:
# Load dataquieR
library(dataquieR)
# Load data
sd1 <- prep_get_data_frame("ship")
# Load metadata
file_name <- system.file("extdata", "ship_meta_v2.xlsx", package = "dataquieR")
prep_load_workbook_like_file(file_name)
meta_data_item <- prep_get_data_frame("item_level") # item_level is a sheet in ship_meta_v2.xlsx
meta_data_dataframe <- prep_get_data_frame("dataframe_level") # dataframe_level is another sheet in ship_meta_v2.xlsx
# Apply indicator function
dataframe_structure <- int_all_datastructure_dataframe(
  meta_data_dataframe = meta_data_dataframe,
  meta_data = meta_data_item
)
The function returns a nested list with the elements DataframeTable and DataframeDataList. DataframeTable is used for reporting purposes, so its results are abbreviated. Hence, here we focus on the readable output from DataframeDataList. This list contains six data frames (one or more per indicator), each with a Data frame column that gives the name of each study data frame analyzed. The data frames are:
dataframe_structure$DataframeDataList$`Unexpected data element count`
Check | Data frame | Unexpected elements | Number of elements in data | Number of elements in metadata | Number of mismatches | Percentage of mismatches | GRADING |
---|---|---|---|---|---|---|---|
Elements | ship | TRUE | 33 | 29 | 4 | 13.793 | 1 |
The columns indicate whether unexpected elements (e.g., variables) were found, the number of elements present in the study data, the number of elements in the metadata, and, if unexpected elements are detected, the number and percentage of mismatches. Based on this result, a binary GRADING is also provided to flag any discrepancy. In this case, GRADING = 1 means that there is a mismatch: the study data contain 33 elements, but the metadata describe only 29.
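The figures in the table can be reproduced by hand. The following is a minimal base R sketch, not dataquieR's implementation; it assumes that the percentage is taken relative to the number of elements in the metadata, which is an inference from the numbers shown above. The variable names are illustrative.

```r
# Values copied from the table above
n_data <- 33  # elements found in the study data
n_meta <- 29  # elements described in the metadata

# Absolute difference between observed and expected element counts
n_mismatch <- abs(n_data - n_meta)

# Assumption: the percentage is relative to the metadata count
pct_mismatch <- round(n_mismatch / n_meta * 100, 3)

# Binary flag: 1 if there is any mismatch, 0 otherwise
grading <- as.integer(n_mismatch > 0)

print(c(n_mismatch = n_mismatch, pct_mismatch = pct_mismatch, grading = grading))
```

With these inputs the sketch yields 4 mismatches and a percentage that matches the table when rounded to three decimals.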
dataframe_structure$DataframeDataList$`Unexpected data element set`
MISSING | Unexpected data element set: Percentage (0 to 100) | Unexpected data element set: Number | resp_vars | GRADING | Data frame |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | ship |
If there is an unexpected element set, the column MISSING indicates whether it is missing from the study data or from the metadata. The next columns show the percentage and number of unexpected element sets, respectively, while resp_vars contains the names of the affected elements. Note that the table above shows only zeros because no unexpected element sets were identified, so GRADING = 0.
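The idea of comparing element sets in both directions can be illustrated with base R's setdiff(). This is only a sketch of the concept with hypothetical element names; it does not reproduce how dataquieR performs the check or the exact layout of its output.

```r
# Hypothetical element names, for illustration only
data_elements <- c("id", "age", "sex", "sbp_0")
meta_elements <- c("id", "age", "sex", "dbp_0")

# Elements in the study data that the metadata do not describe
not_in_meta <- setdiff(data_elements, meta_elements)

# Elements described in the metadata that are absent from the study data
not_in_data <- setdiff(meta_elements, data_elements)

# One row per mismatching element, with the side it is missing from
mismatches <- data.frame(
  element = c(not_in_meta, not_in_data),
  missing_from = c(rep("metadata", length(not_in_meta)),
                   rep("study data", length(not_in_data)))
)
print(mismatches)
```

An empty mismatches data frame would correspond to the situation in the table above, where no discrepancy is flagged.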
dataframe_structure$DataframeDataList$`Unexpected data record count`
Check | Data frame | Unexpected records | Number of records in data | Number of records in metadata | Number of mismatches | Percentage of mismatches | GRADING |
---|---|---|---|---|---|---|---|
Records | ship | FALSE | 2154 | 2154 | 0 | 0 | 0 |
The columns indicate whether unexpected records were found, the number of records present in the study data, the number of records expected according to the metadata, and, if unexpected records are detected, the number and percentage of mismatches. Here, there is a perfect match (2154 records in both), so GRADING = 0.
dataframe_structure$DataframeDataList$`Unexpected data record set`
Check | Data frame | Unexpected records in set | Number of records in data | Number of records in metadata | Number of mismatches | Percentage of mismatches | Expected match type | Actual match type | GRADING |
---|---|---|---|---|---|---|---|---|---|
Record set | ship | FALSE | 2154 | 2154 | 0 | 0 | exact | exact | 0 |
In this data frame, the columns show whether unexpected records were found in the record set, the number of records present in the study data, the number of records expected according to the metadata, and, if unexpected records are detected, the number and percentage of mismatches. The expected and actual match types are reported as well. In this example, GRADING = 0 because no unexpected records were found.
dataframe_structure$DataframeDataList$Duplicates
Check | Data frame | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
---|---|---|---|---|---|
IDs | ship | FALSE | 0 | 0 | 0 |
These results are based on IDs. The columns indicate whether any duplicates were found and, if so, the number and percentage of duplicates. Any duplicated entries are also returned in a vector. Here, GRADING = 0 because there are no duplicate IDs.
dataframe_structure$DataframeDataList$int_sts_dupl_row
Check | Data frame | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
---|---|---|---|---|---|
Duplicates | ship | FALSE | 0 | 0 | 0 |
These results are based on row content (i.e., the uniqueness of rows in the study data). The columns indicate whether any duplicates were found and, if so, the number and percentage of duplicates. Any duplicated entries are also returned in a vector. GRADING = 0 in this case because there are no duplicates.
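The difference between ID-based and content-based duplicates can be shown with base R's duplicated(). The sketch below uses hypothetical toy data rather than the ship data set, and it is not dataquieR's implementation. Using the number of rows as the denominator for the percentage is an assumption.

```r
# Hypothetical toy data, not taken from the ship data set
toy <- data.frame(
  id  = c(1, 2, 2, 3, 3),
  age = c(30, 41, 41, 55, 56),
  sex = c("f", "m", "m", "f", "f"),
  stringsAsFactors = FALSE
)

# Duplicates based on IDs only: every repeated occurrence of an ID
dup_ids <- toy$id[duplicated(toy$id)]

# Duplicates based on row content: rows identical to an earlier row
dup_rows <- which(duplicated(toy))

# Assumption: percentages are relative to the number of rows
n_dup_ids <- length(dup_ids)
pct_dup_ids <- n_dup_ids / nrow(toy) * 100

print(dup_ids)
print(dup_rows)
```

In this toy example the ID 3 occurs twice with different content, so it counts as a duplicate ID but not as a duplicate row, whereas the second row with ID 2 is a duplicate under both checks.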
To evaluate Structural data set error at the segment level, we apply the function int_all_datastructure_segment in the following way:
# Load dataquieR
library(dataquieR)
# Load data
sd1 <- prep_get_data_frame("ship")
# Load metadata
file_name <- system.file("extdata", "ship_meta_v2.xlsx", package = "dataquieR")
prep_load_workbook_like_file(file_name)
meta_data_item <- prep_get_data_frame("item_level") # item_level is a sheet in ship_meta_v2.xlsx
meta_data_segment <- prep_get_data_frame("segment_level") # segment_level is another sheet in ship_meta_v2.xlsx
# Apply indicator function
segment_structure <- int_all_datastructure_segment(
  study_data = sd1,
  meta_data = meta_data_item,
  meta_data_segment = meta_data_segment
)
The function returns a nested list with the elements SegmentTable, SegmentData, and SegmentDataList. SegmentTable is used for reporting purposes, so its results are abbreviated. Hence, here we focus on the readable output from SegmentData and SegmentDataList. SegmentData shows a summary of all the indicators computed per segment:
segment_structure$SegmentData
Segment | Unexpected data record count N (%) | Unexpected data record count (Grading) | Unexpected data record set N (%) | Unexpected data record set (Grading) | Duplicates N (%) | Duplicates (Grading) | Unexpected data element set N (%) | Unexpected data element set (Grading) |
---|---|---|---|---|---|---|---|---|
INTERVIEW | 4 (0.19%) | 1 | 1 (0.05%) | 1 | 0 (0%) | 0 | 0 (0%) | 0 |
INTRO | 0 (0%) | 0 | 1 (0.05%) | 1 | 0 (0%) | 0 | 0 (0%) | 0 |
LABORATORY | 1640 (328%) | 1 | 1 (0.2%) | 0 | 0 (0%) | 0 | 0 (0%) | 0 |
SOMATOMETRY | 1653 (330.6%) | 1 | 1 (0.2%) | 1 | 0 (0%) | 0 | 0 (0%) | 0 |
Note that a Grading is given per indicator and segment to show whether there are data quality issues (Grading = 1) or not (Grading = 0).
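One way to get a quick overview is to count the flagged indicators per segment. The following base R sketch copies the Grading columns from the summary table above into a small data frame; the shortened column names are hypothetical stand-ins for the longer names in the real output.

```r
# Grading columns copied from the SegmentData table above
# (column names shortened for illustration)
gradings <- data.frame(
  Segment      = c("INTERVIEW", "INTRO", "LABORATORY", "SOMATOMETRY"),
  record_count = c(1, 0, 1, 1),
  record_set   = c(1, 1, 0, 1),
  duplicates   = c(0, 0, 0, 0),
  element_set  = c(0, 0, 0, 0)
)

# Number of flagged indicators per segment (all columns except Segment)
gradings$n_flagged <- rowSums(gradings[, -1])

# Segments with at least one flagged indicator
flagged <- gradings$Segment[gradings$n_flagged > 0]

print(gradings[, c("Segment", "n_flagged")])
```

In this example every segment has at least one flagged indicator, all of them relating to unexpected records rather than to duplicates or unexpected element sets.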
SegmentDataList contains a more detailed output with six data frames (one or more per indicator), each with a Segment column indicating the name of each part of the study. The data frames are the following:
segment_structure$SegmentDataList$`Unexpected data record count`
Segment | Check | Unexpected records | Number of records in data | Number of records in metadata | Number of mismatches | Percentage of mismatches | GRADING |
---|---|---|---|---|---|---|---|
INTRO | Records | FALSE | 2154 | 2154 | 0 | 0.000 | 0 |
SOMATOMETRY | Records | TRUE | 2153 | 500 | 1653 | 330.600 | 1 |
INTERVIEW | Records | TRUE | 2154 | 2150 | 4 | 0.186 | 1 |
LABORATORY | Records | TRUE | 2140 | 500 | 1640 | 328.000 | 1 |
The table reports the level of the check (in this case only records are relevant), whether unexpected records were found, the number of records present in the study data, the number of records expected according to the metadata, and, if unexpected records are detected, the number and percentage of mismatches. Based on this result, a binary GRADING is also provided. Here, only the INTRO segment agrees with the expectations provided in the metadata, so its GRADING = 0, while the other three segments have GRADING = 1.
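Percentages above 100 may look surprising at first. They follow from the counts in the table if the mismatch is expressed relative to the expected number of records, as this base R sketch shows. It is a recomputation from the displayed values under that assumption, not dataquieR's implementation, and the column names n_data, n_meta, n_mismatch, and pct_mismatch are illustrative.

```r
# Counts copied from the table above
records <- data.frame(
  Segment = c("INTRO", "SOMATOMETRY", "INTERVIEW", "LABORATORY"),
  n_data  = c(2154, 2153, 2154, 2140),  # records in the study data
  n_meta  = c(2154, 500, 2150, 500)     # records expected per metadata
)

# Absolute difference between observed and expected record counts
records$n_mismatch <- abs(records$n_data - records$n_meta)

# Assumption: the percentage is relative to the expected count
records$pct_mismatch <- round(records$n_mismatch / records$n_meta * 100, 3)

# Binary flag: 1 if there is any mismatch, 0 otherwise
records$GRADING <- as.integer(records$n_mismatch > 0)

print(records)
```

For SOMATOMETRY and LABORATORY the metadata expect 500 records while more than 2000 are present, so the mismatch is more than three times the expected count.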
segment_structure$SegmentDataList$Duplicates
Check | Segment | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
---|---|---|---|---|---|
IDs | INTRO | FALSE | 0 | 0 | 0 |
IDs | SOMATOMETRY | FALSE | 0 | 0 | 0 |
IDs | INTERVIEW | FALSE | 0 | 0 | 0 |
IDs | LABORATORY | FALSE | 0 | 0 | 0 |
These results are based on IDs. The columns indicate whether any duplicates were found and, if so, the number and percentage of duplicates. Any duplicated entries are also returned in a vector. Here, GRADING = 0 because there are no duplicate IDs in any segment.
segment_structure$SegmentDataList$int_sts_dupl_content
Check | Segment | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
---|---|---|---|---|---|
Duplicate records | INTRO | FALSE | 0 | 0 | 0 |
Duplicate records | SOMATOMETRY | FALSE | 0 | 0 | 0 |
Duplicate records | INTERVIEW | FALSE | 0 | 0 | 0 |
Duplicate records | LABORATORY | FALSE | 0 | 0 | 0 |
These results are based on row content (i.e., the uniqueness of rows in the study data). The columns indicate whether any duplicates were found and, if so, the number and percentage of duplicates. Any duplicated entries are also returned in a vector. GRADING = 0 in this case because there are no duplicates.
segment_structure$SegmentDataList$`Unexpected data element set`
Segment | MISSING | Unexpected data element set: Percentage (0 to 100) | Unexpected data element set: Number | resp_vars | GRADING |
---|---|---|---|---|---|
INTRO | NA | 0 | 0 | NA | 0 |
SOMATOMETRY | NA | 0 | 0 | NA | 0 |
INTERVIEW | NA | 0 | 0 | NA | 0 |
LABORATORY | NA | 0 | 0 | NA | 0 |
If there is an unexpected element set, the column MISSING indicates whether it is missing from the study data or from the metadata. The next columns show the percentage and number of unexpected element sets, respectively, while resp_vars contains the names of the affected elements. A binary GRADING is also provided to flag any discrepancies. Here, no segment has an unexpected element set, so GRADING = 0 throughout.