The function int_all_datastructure_segment tests for unexpected elements and records, as well as duplicated identifiers and content, at the segment level.
int_all_datastructure_segment implements indicators for Unexpected data element set, Unexpected data record count, and Duplicates, which belong to the Structural data set error domain in the Integrity dimension.
For more details, see the user’s manual and source code.
int_all_datastructure_segment(
meta_data_segment = "segment_level",
meta_data = "item_level",
study_data = "study_data")
The function has the following arguments:
VAR_NAMES, i.e., the column names used in data frames and known from the metadata.
To illustrate the output, we use a subset of the example SHIP data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.
segment_structure <- int_all_datastructure_segment(
meta_data_segment = meta_data_segment,
study_data = "ship",
meta_data = "ship_meta"
)
Output 1: SegmentDataList
int_all_datastructure_segment returns a nested list. The second element, SegmentDataList, contains four data frames that summarize the results using explicit labels. Each data frame contains a Segment column, which indicates the name of each analyzed segment.
The first data frame is Unexpected data record count, and presents the results for the Unexpected data record count indicator. The columns indicate: the level of the check (in this case only records are relevant), whether unexpected records were found, the number of records present in the study data, the number of records expected according to the metadata, and, if unexpected records are detected, the number and percentage of mismatches. Based on this result, a binary GRADING is also provided. See the result using segment_structure$SegmentDataList$`Unexpected data record count`:
Segment | Check | Unexpected records | Number of records in data | Number of records in metadata | Number of mismatches | Percentage of mismatches | GRADING |
---|---|---|---|---|---|---|---|
INTRO | Records | FALSE | 2154 | 2154 | 0 | 0.000 | 0 |
SOMATOMETRY | Records | TRUE | 2154 | 500 | 1654 | 330.800 | 1 |
INTERVIEW | Records | TRUE | 2154 | 2150 | 4 | 0.186 | 1 |
LABORATORY | Records | TRUE | 2140 | 500 | 1640 | 328.000 | 1 |
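The mismatch columns in the table above appear to follow directly from the two record counts. The sketch below recomputes them in base R from the printed values. The formula (absolute difference between the counts, expressed as a percentage of the number of records in the metadata) is an assumption inferred from the table, not taken from the dataquieR source, and the column names n_data, n_meta, n_mismatch, and pct_mismatch are illustrative only.

```r
# Record counts copied from the table above (illustrative column names).
records <- data.frame(
  Segment = c("INTRO", "SOMATOMETRY", "INTERVIEW", "LABORATORY"),
  n_data  = c(2154, 2154, 2154, 2140),  # records found in the study data
  n_meta  = c(2154, 500, 2150, 500),    # records expected per the metadata
  stringsAsFactors = FALSE
)

# Assumed formula: mismatches relative to the expected (metadata) count.
records$n_mismatch   <- abs(records$n_data - records$n_meta)
records$pct_mismatch <- round(100 * records$n_mismatch / records$n_meta, 3)

# Binary grading: 1 if any mismatch was found, 0 otherwise.
records$GRADING <- as.integer(records$n_mismatch > 0)

print(records)
```

Because the denominator is the expected count, the percentage can exceed 100, as it does for SOMATOMETRY and LABORATORY.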
The next two data frames deal with duplicates. The data frame Duplicates returns the result for the Duplicates indicator based on IDs. The columns indicate whether any duplicates were found and, if so, their number and percentage. Any duplicated entries are also returned in a vector. Based on the result of the assessment, a binary GRADING is provided. Get the result using segment_structure$SegmentDataList$Duplicates:
Check | Segment | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
---|---|---|---|---|---|
IDs | INTRO | FALSE | 0 | 0 | 0 |
IDs | SOMATOMETRY | FALSE | 0 | 0 | 0 |
IDs | INTERVIEW | FALSE | 0 | 0 | 0 |
IDs | LABORATORY | FALSE | 0 | 0 | 0 |
The data frame int_sts_dupl_content contains the results of the Duplicates indicator based on content (i.e., the uniqueness of rows in the study data). The columns indicate whether any duplicates were found and, if so, their number and percentage. Any duplicated entries are also returned in a vector. Based on the result of the assessment, a binary GRADING is provided. Use segment_structure$SegmentDataList$int_sts_dupl_content to print the output:
Check | Segment | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
---|---|---|---|---|---|
Duplicate records | INTRO | FALSE | 0 | 0 | 0 |
Duplicate records | SOMATOMETRY | FALSE | 0 | 0 | 0 |
Duplicate records | INTERVIEW | FALSE | 0 | 0 | 0 |
Duplicate records | LABORATORY | FALSE | 0 | 0 | 0 |
Please note that both duplicate tables above contain only zeros because no duplicates were identified. A new example that better demonstrates the output of this function will be available soon.
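To make the difference between the two duplicate checks concrete, here is a minimal base R sketch on a made-up data frame. It is not the dataquieR implementation: the toy data, the variable names, and the choice of the total row count as the denominator for the percentages are all assumptions made for illustration.

```r
# Hypothetical toy segment: 'id' is the identifier, 'height' a measurement.
toy <- data.frame(
  id     = c(1, 2, 2, 3, 3),
  height = c(170, 165, 180, 172, 172)
)

# Duplicates based on IDs: identifiers that occur more than once.
dup_ids     <- toy$id[duplicated(toy$id)]
n_dup_ids   <- length(dup_ids)
pct_dup_ids <- 100 * n_dup_ids / nrow(toy)  # assumed denominator: all rows

# Duplicates based on content: rows identical in every column.
dup_rows        <- toy[duplicated(toy), , drop = FALSE]
n_dup_content   <- nrow(dup_rows)
pct_dup_content <- 100 * n_dup_content / nrow(toy)

# Binary grading as in the tables above: 1 if any duplicate exists.
grading_ids     <- as.integer(n_dup_ids > 0)
grading_content <- as.integer(n_dup_content > 0)
```

In this toy data the identifiers 2 and 3 are each repeated, but only the two rows for identifier 3 are identical in every column, so the ID check reports two duplicates and the content check reports one.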
The last data frame, Unexpected data element set, shows the results for the Unexpected data element set indicator in each segment. If there is an unexpected element set, the column MISSING indicates whether it is missing from the study data or the metadata. The next columns show the percentage and number of unexpected element sets, respectively, while resp_vars contains the names of the affected elements. Based on the presence of unexpected element sets, a binary GRADING is also provided to flag the discrepancies. Use segment_structure$SegmentDataList$`Unexpected data element set` to view the result:
Segment | MISSING | Unexpected data element set: Percentage (0 to 100) | Unexpected data element set: Number | resp_vars | GRADING |
---|---|---|---|---|---|
INTRO | NA | 0 | 0 | NA | 0 |
SOMATOMETRY | NA | 0 | 0 | NA | 0 |
INTERVIEW | NA | 0 | 0 | NA | 0 |
LABORATORY | NA | 0 | 0 | NA | 0 |
As with the duplicate tables, no unexpected elements were identified in the table above. A new example that includes unexpected records will be available soon.
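The idea behind this check can be sketched with set differences in base R. The variable names below are hypothetical, and the code only illustrates the comparison between the elements listed in the metadata and those found in the study data; it does not reproduce how dataquieR fills the MISSING or resp_vars columns.

```r
# Hypothetical element sets for one segment.
expected_vars <- c("id", "height", "weight")  # listed in the metadata
observed_vars <- c("id", "height", "bmi")     # columns in the study data

# Elements expected by the metadata but absent from the study data.
missing_from_data <- setdiff(expected_vars, observed_vars)

# Elements present in the study data but unknown to the metadata.
missing_from_meta <- setdiff(observed_vars, expected_vars)

# Flag the segment if either set is non-empty.
n_unexpected <- length(missing_from_data) + length(missing_from_meta)
grading      <- as.integer(n_unexpected > 0)
```

Here weight is missing from the study data and bmi is missing from the metadata, so the segment would be flagged.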
Output 2: SegmentTable
The first element, SegmentTable, summarizes the above integrity results per segment. dq_report2 uses this data frame to populate the integrity section of the data quality report; hence the output is minimal, and the column names are abbreviations. Columns with the NUM prefix give the results as numbers of observations, while the PCT columns give percentages. A binary GRADING is also provided. Display the result with segment_structure$SegmentTable:
Segment | NUM_int_sts_countre | PCT_int_sts_countre | GRADING_int_sts_countre | NUM_int_sts_setrc | PCT_int_sts_setrc | GRADING_int_sts_setrc | NUM_int_sts_dupl_ids | PCT_int_sts_dupl_ids | GRADING_int_sts_dupl_ids | NUM_int_sts_dupl_content | PCT_int_sts_dupl_content | GRADING_int_sts_dupl_content | MISSING | PCT_int_sts_element | NUM_int_sts_element | resp_vars | GRADING_int_sts_element |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
INTERVIEW | 4 | 0.186 | 1 | 1 | 0.046 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 0 | 0 | NA | 0 |
INTRO | 0 | 0.000 | 0 | 1 | 0.046 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 0 | 0 | NA | 0 |
LABORATORY | 1640 | 328.000 | 1 | 1 | 0.200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 0 | 0 | NA | 0 |
SOMATOMETRY | 1654 | 330.800 | 1 | 1 | 0.200 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 0 | 0 | NA | 0 |
Any discrepancy indicates a data quality problem that needs to be investigated. In addition, the higher the discrepancy, the lower the data quality may be.
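One way to decide where to start investigating is to count the raised GRADING flags per segment. The sketch below does this in base R on a hand-copied excerpt of the SegmentTable shown above (GRADING columns only). Ranking by the number of flags is an illustrative choice, not something dataquieR does, and the helper column n_flags is hypothetical.

```r
# GRADING columns copied from the SegmentTable output above.
segment_table <- data.frame(
  Segment                      = c("INTERVIEW", "INTRO", "LABORATORY", "SOMATOMETRY"),
  GRADING_int_sts_countre      = c(1, 0, 1, 1),
  GRADING_int_sts_setrc        = c(1, 1, 0, 1),
  GRADING_int_sts_dupl_ids     = c(0, 0, 0, 0),
  GRADING_int_sts_dupl_content = c(0, 0, 0, 0),
  GRADING_int_sts_element      = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Select every grading column by its prefix.
grading_cols <- grep("^GRADING_", names(segment_table), value = TRUE)

# Count how many indicators flag each segment.
segment_table$n_flags <- rowSums(segment_table[grading_cols] == 1)

# Segments with the most flags first.
ranked <- segment_table[order(-segment_table$n_flags), c("Segment", "n_flags")]
print(ranked)
```

With the real result, the same steps could be applied to segment_structure$SegmentTable instead of the hand-copied excerpt.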