Description

The function int_all_datastructure_dataframe tests for unexpected elements and records, as well as duplicated identifiers and content, at the data frame level. The check for unexpected records can be conducted by providing either the number of expected records or an additional table with the expected records. The checks can be run by study segment or restricted to selected segments. int_all_datastructure_dataframe implements the indicators Unexpected data element count, Unexpected data element set, Unexpected data record count, Unexpected data record set, and Duplicates, which belong to the Structural data set error domain in the Integrity dimension.

For more details, see the user’s manual and source code.

Usage and arguments

int_all_datastructure_dataframe(
  meta_data_dataframe = "dataframe_level",
  meta_data = "item_level") 

The function has the following arguments:

  • meta_data_dataframe: mandatory, the data frame that contains the metadata for the data frame level.
  • meta_data: mandatory, the data frame that contains metadata attributes of the study data. The metadata data frame is assumed to contain the information from all studies; this is needed to obtain the VAR_NAMES, i.e., the column names used in the data frames and known from the metadata.

Example output

To illustrate the output, we use a subset of the example SHIP data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

Integrity checks at the data frame level

dataframe_structure <- int_all_datastructure_dataframe(
   meta_data_dataframe = meta_data_dataframe,
   meta_data = "ship_meta_v2"
)

Output 1: DataframeDataList

int_all_datastructure_dataframe returns a nested list. Its element DataframeDataList contains six data frames that summarize the results using explicit labels. Each data frame contains a Data frame column, which gives the name of each study data frame analyzed.
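Several of the data frames in DataframeDataList have names that contain spaces, so they must be quoted when accessed. The following minimal sketch shows the two usual ways to do this in R. It uses a hand-built stand-in list (toy_result is hypothetical and not real dataquieR output), so only the access pattern should be taken from it:

```r
# Hypothetical stand-in for the nested list returned by
# int_all_datastructure_dataframe(); contents are invented.
toy_result <- list(
  DataframeTable = data.frame(DF_NAME = "ship"),
  DataframeDataList = list(
    "Unexpected data element count" =
      data.frame(`Data frame` = "ship", check.names = FALSE),
    Duplicates =
      data.frame(`Data frame` = "ship", check.names = FALSE)
  )
)

# Names with spaces need backticks after `$` ...
by_dollar <- toy_result$DataframeDataList$`Unexpected data element count`

# ... or a quoted string inside [[ ]].
by_bracket <- toy_result$DataframeDataList[["Unexpected data element count"]]

# Both forms return the same data frame.
identical(by_dollar, by_bracket)

# List the available data frames by name.
names(toy_result$DataframeDataList)
```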

The first data frame, Unexpected data element count, holds the results of the Unexpected data element count indicator for the study data frames. The columns indicate whether unexpected elements were found, the number of elements present in the study data, the number of elements in the metadata, and, if unexpected elements are detected, the number and percentage of mismatches. Based on this result, a binary GRADING flags any discrepancy. Get the result using dataframe_structure$DataframeDataList$`Unexpected data element count`:

| Check | Data frame | Unexpected elements | Number of elements in data | Number of elements in metadata | Number of mismatches | Percentage of mismatches | GRADING |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Elements | ship | TRUE | 33 | 29 | 4 | 13.793 | 1 |
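The figures in this table are consistent with the percentage being the number of mismatches relative to the number of elements in the metadata. The short base R check below reproduces them under that assumption; the variable names are illustrative and the formula is inferred from the numbers shown, not taken from the dataquieR source:

```r
# Values taken from the table above.
n_data <- 33  # elements (columns) found in the study data
n_meta <- 29  # elements expected according to the metadata

# Assumed definitions: absolute difference, expressed relative to the
# number of elements in the metadata.
n_mismatch   <- abs(n_data - n_meta)
pct_mismatch <- round(100 * n_mismatch / n_meta, 3)

# Binary grading: 1 if there is any discrepancy, 0 otherwise.
grading <- as.integer(n_mismatch > 0)

c(n_mismatch = n_mismatch, pct_mismatch = pct_mismatch, grading = grading)
```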


The second data frame, Unexpected data element set, shows the results of the Unexpected data element set indicator for the study data frames. If there is an unexpected element set, the column MISSING indicates whether it is missing from the study data or from the metadata. The next columns show the percentage and number of unexpected element sets, respectively, while resp_vars contains the names of the affected elements. Depending on whether unexpected element sets are present, a binary GRADING flags the discrepancies. Use dataframe_structure$DataframeDataList$`Unexpected data element set` to view the result:

| MISSING | Unexpected data element set: Percentage (0 to 100) | Unexpected data element set: Number | resp_vars | GRADING | Data frame |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | ship |


Please note that the table above is empty because no unexpected elements were identified. A new example that better demonstrates the output of this function will be available soon.
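To make the idea of an unexpected element set concrete, the sketch below compares the column names of a toy study data frame with the VAR_NAMES of a toy metadata table using base R. Both tables and their contents are invented for illustration, and this is not necessarily how dataquieR implements the check:

```r
# Invented study data with one column that the metadata does not know.
toy_study <- data.frame(
  id        = 1:3,
  age       = c(34, 51, 47),
  sbp       = c(120, 135, 128),
  extra_col = c("a", "b", "c")
)

# Invented item-level metadata; only the VAR_NAMES column is needed here.
toy_meta <- data.frame(VAR_NAMES = c("id", "age", "sbp", "dbp"))

# Elements present in the study data but missing from the metadata.
only_in_data <- setdiff(names(toy_study), toy_meta$VAR_NAMES)

# Elements described in the metadata but missing from the study data.
only_in_meta <- setdiff(toy_meta$VAR_NAMES, names(toy_study))

only_in_data
only_in_meta
```

Note that in this toy case both tables have four elements, so a pure count comparison would not reveal the problem; only the set comparison does.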

The data frame Unexpected data record count presents the results of the Unexpected data record count indicator. The columns indicate the number of records expected according to the metadata, the actual number of records present in the study data, and, if unexpected records are detected, the number and percentage of mismatches. Based on this result, a binary GRADING is also provided. See the result using dataframe_structure$DataframeDataList$`Unexpected data record count`:

| Check | Data frame | Unexpected records | Number of records in data | Number of records in metadata | Number of mismatches | Percentage of mismatches | GRADING |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Records | ship | FALSE | 2154 | 2154 | 0 | 0 | 0 |


The next data frame, Unexpected data record set, returns the output of the Unexpected data record set indicator. The columns show the number of records expected according to the metadata, the actual number of records present in the study data, and, if unexpected records are detected, the number and percentage of mismatches, as well as the expected and actual match type. Based on this result, a binary GRADING is also provided. Use dataframe_structure$DataframeDataList$`Unexpected data record set` to view the output:

| Check | Data frame | Unexpected records in set | Number of records in data | Number of records in metadata | Number of mismatches | Percentage of mismatches | Expected match type | Actual match type | GRADING |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Record set | ship | FALSE | 2154 | 2154 | 0 | 0 | exact | exact | 0 |
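The record count and record set checks answer different questions: two data sets can have the same number of records and still contain different records. The base R sketch below illustrates this with invented identifiers. It assumes that an "exact" match means the observed and expected ID sets are identical, which is a reading of the table above and not confirmed by the dataquieR source:

```r
# Invented identifiers: what the study data contain vs. what a
# (hypothetical) table of expected records lists.
observed_ids <- c("p01", "p02", "p03", "p05")
expected_ids <- c("p01", "p02", "p03", "p04")

# Count check: only compares the number of records.
count_matches <- length(observed_ids) == length(expected_ids)

# Set check: compares which records are present.
set_matches   <- setequal(observed_ids, expected_ids)
unexpected    <- setdiff(observed_ids, expected_ids)  # in data, not expected
not_found     <- setdiff(expected_ids, observed_ids)  # expected, not in data

count_matches
set_matches
unexpected
not_found
```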


The data frame Duplicates returns the result of the Duplicates indicator based on IDs. The columns indicate whether any duplicates were found and, if so, the number and percentage of duplicates. Any duplicated entries are also returned in a vector. Based on the result of the assessment, a binary GRADING is provided. Get the result using dataframe_structure$DataframeDataList$Duplicates:

| Check | Data frame | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
| --- | --- | --- | --- | --- | --- |
| IDs | ship | FALSE | 0 | 0 | 0 |


The last data frame, int_sts_dupl_row, contains the results of the Duplicates indicator based on content (i.e., the uniqueness of rows in the study data). The columns indicate whether any duplicates were found and, if so, the number and percentage of duplicates. Any duplicated entries are also returned in a vector. Based on the result of the assessment, a binary GRADING is provided. Use dataframe_structure$DataframeDataList$int_sts_dupl_row to print the output:

| Check | Data frame | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING |
| --- | --- | --- | --- | --- | --- |
| Duplicates | ship | FALSE | 0 | 0 | 0 |
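The difference between duplicates based on IDs and duplicates based on content can be illustrated with base R's duplicated(). The toy data frame below is invented, and the sketch only shows the general idea, not the dataquieR implementation:

```r
# Invented data: id 2 and id 3 each occur twice, but only the two
# rows with id 3 are identical in every column.
toy <- data.frame(
  id = c(1, 2, 2, 3, 3),
  x  = c(10, 20, 21, 30, 30)
)

# ID-based duplicates: the same identifier appears more than once.
dup_id_flags <- duplicated(toy$id)
dup_ids      <- unique(toy$id[dup_id_flags])

# Content-based duplicates: an entire row repeats an earlier row.
dup_row_flags <- duplicated(toy)

any(dup_id_flags)   # were any duplicated IDs found?
sum(dup_id_flags)   # number of repeated ID occurrences
dup_ids             # which IDs are affected
sum(dup_row_flags)  # number of fully repeated rows
```

Here two records repeat an earlier ID, but only one record repeats an earlier row in full, so the two checks can return different counts for the same data.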


Output 2: DataframeTable

The other element of the returned list, DataframeTable, summarizes the above integrity results in a single table. dq_report2 uses this data frame to populate the integrity section of the data quality report, so the output is minimal and the column names are abbreviated. Columns with the NUM prefix give the results as numbers of observations, while the PCT columns give percentages. A binary GRADING is also provided for each check. Display the result with dataframe_structure$DataframeTable:

| DF_NAME | NUM_int_sts_countel | PCT_int_sts_countel | GRADING_int_sts_countel | MISSING | PCT_int_sts_element | NUM_int_sts_element | resp_vars | GRADING_int_sts_element | NUM_int_sts_countre | PCT_int_sts_countre | GRADING_int_sts_countre | NUM_int_sts_setrc | PCT_int_sts_setrc | GRADING_int_sts_setrc | NUM_int_sts_dupl_ids | PCT_int_sts_dupl_ids | GRADING_int_sts_dupl_ids | NUM_int_sts_dupl_content | PCT_int_sts_dupl_content | GRADING_int_sts_dupl_row | Level |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ship | 4 | 13.793 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Dataframe |
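Because the column names follow a prefix pattern (NUM_, PCT_, GRADING_), the flagged checks can be picked out programmatically. The sketch below rebuilds a subset of the table above by hand (toy_table is a stand-in, not the real returned object) and lists the checks whose grading is 1:

```r
# Hand-built subset of the DataframeTable shown above.
toy_table <- data.frame(
  DF_NAME                  = "ship",
  NUM_int_sts_countel      = 4,
  PCT_int_sts_countel      = 13.793,
  GRADING_int_sts_countel  = 1,
  NUM_int_sts_countre      = 0,
  PCT_int_sts_countre      = 0,
  GRADING_int_sts_countre  = 0,
  NUM_int_sts_dupl_ids     = 0,
  PCT_int_sts_dupl_ids     = 0,
  GRADING_int_sts_dupl_ids = 0
)

# Select all grading columns by their prefix.
grading_cols <- grep("^GRADING_", names(toy_table), value = TRUE)

# Keep those flagged with 1 for the first (and only) data frame.
is_flagged <- unlist(toy_table[1, grading_cols]) == 1
flagged_checks <- sub("^GRADING_", "", grading_cols[is_flagged])

flagged_checks
```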


Interpretation

Any discrepancy indicates a data quality problem that needs to be investigated. In addition, the larger the discrepancy, the lower the data quality may be.

Algorithm of the implementation

  1. Compare the study data frames with the information provided in the metadata.
  2. Return the output in two summary tables: one with user-friendly descriptions and one with concise names, intended only for report generation.

Concept relations