The function int_all_datastructure_dataframe tests for
unexpected elements and records, as well as duplicated identifiers and
content, at the data frame level. The unexpected record checks
can be conducted by providing the number of expected records or an
additional table with the expected records. It is possible to conduct
the checks by study segments or to consider only selected segments.
int_all_datastructure_dataframe implements indicators for
Unexpected data element count, Unexpected data element set, Unexpected data record count, Unexpected data record set, and Duplicates, which belong to the Structural data set error domain in the
Integrity dimension.
For more details, see the user’s manual and source code.
int_all_datastructure_dataframe(
meta_data_dataframe = "dataframe_level",
meta_data = "item_level")
The function has the following arguments:
VAR_NAMES, i.e., the column names used in data
frames and known from the metadata.
To illustrate the output, we use a subset of the example SHIP data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.
dataframe_structure <- int_all_datastructure_dataframe(
meta_data_dataframe = meta_data_dataframe,
meta_data = "ship_meta_v2"
)
Output 1: DataframeDataList
int_all_datastructure_dataframe returns a nested list.
The second element, DataframeDataList, contains six data
frames that summarize the results using explicit labels. Each data frame
contains a Data frame column, which indicates the name of
each study data frame analyzed.
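Because several of these data frames have names that contain spaces, they cannot be reached with a plain `$`. The following is a minimal sketch of the two usual ways to access such elements in R. The object toy_result is a hypothetical stand-in built by hand for illustration; it is not real output of int_all_datastructure_dataframe.

```r
# Hand-built stand-in for the nested return value (an assumption, not real output).
toy_result <- list(
  DataframeTable = data.frame(DF_NAME = "ship"),
  DataframeDataList = list(
    `Unexpected data element count` = data.frame(
      `Data frame` = "ship", GRADING = 1, check.names = FALSE
    ),
    Duplicates = data.frame(
      `Data frame` = "ship", GRADING = 0, check.names = FALSE
    )
  )
)

# Names containing spaces must be wrapped in backticks when using `$` ...
by_dollar <- toy_result$DataframeDataList$`Unexpected data element count`

# ... or passed as a quoted string to `[[`.
by_bracket <- toy_result$DataframeDataList[["Unexpected data element count"]]

# Both forms return the same data frame.
identical(by_dollar, by_bracket)

# List the available result names before picking one.
names(toy_result$DataframeDataList)
```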
The first data frame is Unexpected data element count,
which comprises the results for the Unexpected data element count indicator in the
study data frames. The columns indicate whether unexpected elements were
found, the number of elements present in the study data, the number of
elements in the metadata, and, if unexpected elements are detected, the
number and percentage of mismatches. Based on this
result, a binary GRADING is also provided to flag any
discrepancy. Get the result using
dataframe_structure$DataframeDataList$`Unexpected data element count`:
Check | Data frame | Unexpected elements | Number of elements in data | Number of elements in metadata | Number of mismatches | Percentage of mismatches | GRADING |
---|---|---|---|---|---|---|---|
Elements | ship | TRUE | 33 | 29 | 4 | 13.793 | 1 |
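The figures in this row can be reproduced by hand. The sketch below assumes that the percentage is taken relative to the number of elements in the metadata; this is inferred from the displayed values (4 of 29 is about 13.793 %) rather than from the package source, so treat it as an illustration only.

```r
n_data <- 33  # number of elements in the study data, taken from the table above
n_meta <- 29  # number of elements in the metadata, taken from the table above

# Absolute difference between the two counts.
n_mismatch <- abs(n_data - n_meta)

# Assumption: the denominator is the metadata count.
pct_mismatch <- round(100 * n_mismatch / n_meta, 3)

# Binary flag: 1 if there is any discrepancy, 0 otherwise.
grading <- as.integer(n_mismatch > 0)

c(mismatches = n_mismatch, percentage = pct_mismatch, GRADING = grading)
```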
The second data frame, Unexpected data element set,
shows the results for the Unexpected data
element set indicator in the study data frames. If there is an unexpected
element set, the column MISSING indicates whether it is
missing from the study data or the metadata. The next columns show the
percentage and number of unexpected element sets, respectively, while
resp_vars contains the names of the affected elements.
Based on the presence of unexpected element sets, a binary
GRADING is also provided to flag the discrepancies. Use
dataframe_structure$DataframeDataList$`Unexpected data element set` to view the result:
MISSING | Unexpected data element set: Percentage (0 to 100) | Unexpected data element set: Number | resp_vars | GRADING | Data frame |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | ship |
Please note that the table above contains only zeros because no unexpected element set was identified. A new example that better demonstrates the output of this function will be available soon.
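To make the idea of an unexpected element set more concrete, here is a minimal base R sketch that compares the column names of a study data frame with the VAR_NAMES listed in item-level metadata. The objects study_data and item_meta are invented toy data, and the comparison via setdiff is an assumption about the general idea, not the package's implementation.

```r
# Toy study data with one column that the metadata does not know about.
study_data <- data.frame(
  id        = 1:3,
  age       = c(40, 52, 61),
  extra_col = c("a", "b", "c")
)

# Toy item-level metadata listing the expected column names.
item_meta <- data.frame(VAR_NAMES = c("id", "age", "sex"))

# Elements present in the study data but missing from the metadata.
not_in_meta <- setdiff(names(study_data), item_meta$VAR_NAMES)

# Elements listed in the metadata but missing from the study data.
not_in_data <- setdiff(item_meta$VAR_NAMES, names(study_data))

list(missing_from_metadata = not_in_meta, missing_from_data = not_in_data)
```

Keeping the two directions separate mirrors the purpose of the MISSING column, which distinguishes where an element is absent.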
The data frame Unexpected data record count presents the
results for the Unexpected data record
count indicator. The columns indicate the number of records expected
according to the metadata, the actual number of records present in the
study data, and, if unexpected records are detected, the number and
percentage of mismatches. Based on this result, a binary
GRADING is also provided. See the result using
dataframe_structure$DataframeDataList$`Unexpected data record count`:
Check | Data frame | Unexpected records | Number of records in data | Number of records in metadata | Number of mismatches | Percentage of mismatches | GRADING |
---|---|---|---|---|---|---|---|
Records | ship | FALSE | 2154 | 2154 | 0 | 0 | 0 |
The next data frame, Unexpected data record set, returns
the output of the indicator for Unexpected data record set. In this data
frame, the columns show the number of records expected according to the
metadata, the actual number of records present in the study data, and,
if unexpected records are detected, the number and percentage of
mismatches, as well as the expected and actual match types. Based on this result, a binary
GRADING is also provided. Use
dataframe_structure$DataframeDataList$`Unexpected data record set` to view the output:
Check | Data frame | Unexpected records in set | Number of records in data | Number of records in metadata | Number of mismatches | Percentage of mismatches | Expected match type | Actual match type | GRADING |
---|---|---|---|---|---|---|---|---|---|
Record set | ship | FALSE | 2154 | 2154 | 0 | 0 | exact | exact | 0 |
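The record set check relies on an additional table with the expected records. A minimal base R sketch of such a comparison follows. The toy tables study_data and expected_ids are invented, and the sketch only models an exact match of identifiers, which is an assumption made for illustration and may not cover every match type the function supports.

```r
# Toy study data and a toy table of expected record identifiers.
study_data   <- data.frame(id = c("P01", "P02", "P04"), value = c(1.2, 3.4, 5.6))
expected_ids <- data.frame(id = c("P01", "P02", "P03"))

# Expected records that do not appear in the study data.
missing_from_data <- setdiff(expected_ids$id, study_data$id)

# Records in the study data that were not expected.
unexpected_in_data <- setdiff(study_data$id, expected_ids$id)

# An exact match requires both sets of identifiers to be equal.
exact_match <- setequal(study_data$id, expected_ids$id)

list(
  missing_from_data  = missing_from_data,
  unexpected_in_data = unexpected_in_data,
  exact_match        = exact_match
)
```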
The data frame Duplicates returns the result for the Duplicates indicator based on IDs. The
columns indicate whether any duplicates were found and, if so, the
number and percentage of duplicates. Any duplicated entries
are also returned in a vector. Based on the result of the
assessment, a binary GRADING is provided. Get the result
using dataframe_structure$DataframeDataList$Duplicates:
Check | Data frame | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
---|---|---|---|---|---|
IDs | ship | FALSE | 0 | 0 | 0 |
The last data frame, int_sts_dupl_row, contains the
results of the Duplicates indicator
based on content (i.e., the uniqueness of rows in the study data). The
columns indicate whether any duplicates were found and, if so, the
number and percentage of duplicates. Any duplicated entries
are also returned in a vector. Based on the result of the
assessment, a binary GRADING is provided. Use
dataframe_structure$DataframeDataList$int_sts_dupl_row to
print the output:
Check | Data frame | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
---|---|---|---|---|---|
Duplicates | ship | FALSE | 0 | 0 | 0 |
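The difference between duplicates based on IDs and duplicates based on content can be illustrated with base R's duplicated(). This is a sketch on invented toy data; the choice of the row count as denominator for the percentage is an assumption and is not taken from the package.

```r
# Toy data: ids 2 and 3 each occur twice, but only the last two rows are identical.
study_data <- data.frame(
  id    = c(1, 2, 2, 3, 3),
  value = c(10, 20, 25, 30, 30)
)

# Duplicates based on IDs: repeated values in the identifier column.
dup_ids   <- study_data$id[duplicated(study_data$id)]
n_dup_ids <- length(dup_ids)

# Duplicates based on content: rows that repeat an earlier row in every column.
dup_rows   <- duplicated(study_data)
n_dup_rows <- sum(dup_rows)

# Assumption: percentage relative to the total number of rows.
pct_dup_rows <- 100 * n_dup_rows / nrow(study_data)

list(dup_ids = dup_ids, n_dup_rows = n_dup_rows, pct_dup_rows = pct_dup_rows)
```

In this toy example two identifiers are repeated, yet only one row is a full content duplicate, which is why the two checks are reported separately.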
Output 2: DataframeTable
The first element of the returned list, DataframeTable, summarizes the above
integrity results, with one row per study data frame. dq_report2 uses this data
frame to populate the integrity section of the data quality report;
hence the output is minimal, and the names of the columns are
abbreviations. The columns with the NUM prefix give the
results in terms of the number of observations, while the
PCT columns present the percentage. A binary
GRADING is also provided for each indicator. Display the result with
dataframe_structure$DataframeTable:
DF_NAME | NUM_int_sts_countel | PCT_int_sts_countel | GRADING_int_sts_countel | MISSING | PCT_int_sts_element | NUM_int_sts_element | resp_vars | GRADING_int_sts_element | NUM_int_sts_countre | PCT_int_sts_countre | GRADING_int_sts_countre | NUM_int_sts_setrc | PCT_int_sts_setrc | GRADING_int_sts_setrc | NUM_int_sts_dupl_ids | PCT_int_sts_dupl_ids | GRADING_int_sts_dupl_ids | NUM_int_sts_dupl_content | PCT_int_sts_dupl_content | GRADING_int_sts_dupl_row | Level |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ship | 4 | 13.793 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Dataframe |
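Since the column names follow a prefix convention, the flagged indicators can be picked out programmatically. The sketch below rebuilds a few columns of the row above by hand as toy_table (an assumption made to keep the example self-contained) and lists the indicators whose GRADING is 1.

```r
# Hand-copied subset of the table above; not produced by calling the package.
toy_table <- data.frame(
  DF_NAME                 = "ship",
  NUM_int_sts_countel     = 4,
  PCT_int_sts_countel     = 13.793,
  GRADING_int_sts_countel = 1,
  NUM_int_sts_countre     = 0,
  PCT_int_sts_countre     = 0,
  GRADING_int_sts_countre = 0
)

# Select all grading columns by their prefix.
grading_cols <- grep("^GRADING_", names(toy_table), value = TRUE)

# Keep the grading columns that are flagged (value 1) in the first row.
flagged <- grading_cols[unlist(toy_table[1, grading_cols]) == 1]

# Strip the prefix to obtain the indicator abbreviations.
flagged_indicators <- sub("^GRADING_", "", flagged)
flagged_indicators
```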
Any discrepancy indicates a data quality problem that needs to be investigated. In addition, the larger the discrepancy, the lower the data quality may be.