Description

The function int_all_datastructure_segment tests for unexpected elements and records, as well as duplicated identifiers and content, at the segment level. int_all_datastructure_segment implements indicators for Unexpected data element set, Unexpected data record count, and Duplicates, which belong to the Structural data set error domain in the Integrity dimension.

For more details, see the user’s manual and source code.

Usage and arguments

int_all_datastructure_segment(
  meta_data_segment = "segment_level",
  meta_data = "item_level",
  study_data = "study_data") 

The function has the following arguments:

  • meta_data_segment: mandatory, the data frame that contains the metadata for the segment level.
  • meta_data: mandatory, the data frame that contains metadata attributes of the study data. The metadata data frame is assumed to contain the information from all the segments. This is needed to obtain the VAR_NAMES, i.e., the column names used in the data frames and known from the metadata.
  • study_data: mandatory, the data frame containing the study measurements.

Example output

To illustrate the output, we use a subset of the example SHIP data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

Integrity checks at the segment level

segment_structure <- int_all_datastructure_segment(
  meta_data_segment = meta_data_segment,
  study_data = "ship",
  meta_data = "ship_meta"
)

Output 1: SegmentDataList

int_all_datastructure_segment returns a nested list. The second element, SegmentDataList, contains four data frames that summarize the results using explicit labels. Each data frame contains a Segment column, which indicates the name of each analyzed segment.

The first data frame, Unexpected data record count, presents the results for the Unexpected data record count indicator. The columns indicate: the level of the check (in this case only records are relevant), whether unexpected records were found, the number of records present in the study data, the number of records expected according to the metadata, and, if unexpected records are detected, the number and percentage of mismatches. Based on this result, a binary GRADING is also provided. See the result using segment_structure$SegmentDataList$`Unexpected data record count`:

Segment | Check | Unexpected records | Number of records in data | Number of records in metadata | Number of mismatches | Percentage of mismatches | GRADING
INTRO | Records | FALSE | 2154 | 2154 | 0 | 0.000 | 0
SOMATOMETRY | Records | TRUE | 2154 | 500 | 1654 | 330.800 | 1
INTERVIEW | Records | TRUE | 2154 | 2150 | 4 | 0.186 | 1
LABORATORY | Records | TRUE | 2140 | 500 | 1640 | 328.000 | 1
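
The values in this table are consistent with counting mismatches as the absolute difference between the two record counts and expressing the percentage relative to the count expected from the metadata. The following base R sketch reproduces the table under that assumption. The formula is inferred from the values shown here, not taken from the dataquieR source code, and the data frame record_counts and its column names are hypothetical.

```r
# Hypothetical input: observed and expected record counts per segment,
# copied from the table above.
record_counts <- data.frame(
  Segment = c("INTRO", "SOMATOMETRY", "INTERVIEW", "LABORATORY"),
  n_data = c(2154, 2154, 2154, 2140),
  n_meta = c(2154, 500, 2150, 500),
  stringsAsFactors = FALSE
)

# Assumption: mismatches are the absolute difference between both counts,
# and the percentage is relative to the count expected from the metadata.
record_counts$n_mismatch <- abs(record_counts$n_data - record_counts$n_meta)
record_counts$pct_mismatch <- round(
  100 * record_counts$n_mismatch / record_counts$n_meta, 3
)

# Binary grading: 1 if any mismatch was found, 0 otherwise.
record_counts$GRADING <- as.integer(record_counts$n_mismatch > 0)

print(record_counts)
```

Because the percentage is taken relative to the expected count, it can exceed 100 when the study data hold far more records than the metadata declare, as in the SOMATOMETRY and LABORATORY rows.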


The next two data frames deal with duplicates. The data frame Duplicates returns the result for the Duplicates indicator based on IDs. The columns indicate whether any duplicates were found and, if so, the number and percentage of duplicates. Any duplicated entries are also returned in a vector. Based on the result of the assessment, a binary GRADING is provided. Get the result using segment_structure$SegmentDataList$Duplicates:

Check | Segment | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING
IDs | INTRO | FALSE | 0 | 0 | 0
IDs | SOMATOMETRY | FALSE | 0 | 0 | 0
IDs | INTERVIEW | FALSE | 0 | 0 | 0
IDs | LABORATORY | FALSE | 0 | 0 | 0
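
Since the SHIP example contains no duplicated IDs, a small made-up data frame may help to show what such a check looks for. This is a minimal base R sketch, not the dataquieR implementation: the data frame segment_data, the column name ID, and the way duplicates are counted (every repeat after the first occurrence) are assumptions for illustration.

```r
# Hypothetical segment data in which the identifier "A2" occurs twice.
segment_data <- data.frame(
  ID = c("A1", "A2", "A2", "A3"),
  value = c(10, 11, 12, 13),
  stringsAsFactors = FALSE
)

# duplicated() flags every occurrence of an ID after its first one.
dup_flag <- duplicated(segment_data$ID)

dup_summary <- list(
  any_duplicates = any(dup_flag),
  n_duplicates = sum(dup_flag),
  pct_duplicates = round(100 * sum(dup_flag) / nrow(segment_data), 3),
  duplicated_ids = unique(segment_data$ID[dup_flag])
)

print(dup_summary)
```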


The third data frame, int_sts_dupl_content, contains the results of the Duplicates indicator based on content (i.e., the uniqueness of rows in the study data). The columns indicate whether any duplicates were found and, if so, the number and percentage of duplicates. Any duplicated entries are also returned in a vector. Based on the result of the assessment, a binary GRADING is provided. Use segment_structure$SegmentDataList$int_sts_dupl_content to print the output:

Check | Segment | Any duplicates | Number of duplicates | Percentage of duplicates | GRADING
Duplicate records | INTRO | FALSE | 0 | 0 | 0
Duplicate records | SOMATOMETRY | FALSE | 0 | 0 | 0
Duplicate records | INTERVIEW | FALSE | 0 | 0 | 0
Duplicate records | LABORATORY | FALSE | 0 | 0 | 0
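
A content-based duplicate check compares whole rows rather than identifiers. The sketch below illustrates the idea in base R on made-up data. It assumes that all columns of the segment take part in the comparison, which may differ from what dataquieR does internally, and the data frame segment_content is hypothetical.

```r
# Hypothetical measurements: row 3 repeats the content of row 1.
segment_content <- data.frame(
  height = c(170, 182, 170, 165),
  weight = c(65, 80, 65, 58)
)

# Applied to a data frame, duplicated() compares complete rows.
dup_rows <- duplicated(segment_content)

n_dup_rows <- sum(dup_rows)
pct_dup_rows <- round(100 * n_dup_rows / nrow(segment_content), 3)

# Row indices of the repeated records.
print(which(dup_rows))
```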


Please note that both duplicate tables above contain only zeros because no duplicates were identified. A new example that better demonstrates the output of this function will be available soon.

The last data frame, Unexpected data element set, shows the results for the Unexpected data element set indicator in each segment. If there is an unexpected element set, the column MISSING indicates whether it is missing from the study data or from the metadata. The next columns show the percentage and number of unexpected element sets, respectively, while resp_vars contains the names of the affected elements. A binary GRADING is also provided to flag any discrepancies. Use segment_structure$SegmentDataList$`Unexpected data element set` to view the result:

Segment | MISSING | Unexpected data element set: Percentage (0 to 100) | Unexpected data element set: Number | resp_vars | GRADING
INTRO | NA | 0 | 0 | NA | 0
SOMATOMETRY | NA | 0 | 0 | NA | 0
INTERVIEW | NA | 0 | 0 | NA | 0
LABORATORY | NA | 0 | 0 | NA | 0
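
To show what a non-empty result could contain, the following base R sketch compares two made-up sets of variable names in both directions. It is an illustration only: the vectors study_vars and meta_vars and the labels used in the MISSING column are assumptions, not the values dataquieR reports.

```r
# Hypothetical column names of one segment in the study data, and the
# VAR_NAMES that the metadata declare for the same segment.
study_vars <- c("id", "height", "weight", "extra_var")
meta_vars <- c("id", "height", "weight", "waist")

# Elements present in the study data but unknown to the metadata.
missing_from_meta <- setdiff(study_vars, meta_vars)

# Elements declared in the metadata but absent from the study data.
missing_from_study <- setdiff(meta_vars, study_vars)

unexpected <- data.frame(
  resp_vars = c(missing_from_meta, missing_from_study),
  MISSING = c(
    rep("metadata", length(missing_from_meta)),
    rep("study data", length(missing_from_study))
  ),
  stringsAsFactors = FALSE
)

print(unexpected)
```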


As with the duplicate tables, no unexpected elements were identified in the table above. A new example that includes unexpected elements will be available soon.

Output 2: SegmentTable

SegmentTable, the first element of the returned list, summarizes the above integrity results with one row per segment. dq_report2 uses this data frame to populate the integrity section of the data quality report; hence the output is minimal, and the column names are abbreviations. The columns with the NUM prefix give the results as numbers of observations, while the PCT columns give percentages. A binary GRADING is also provided for each check. Display the result with segment_structure$SegmentTable:

Segment | NUM_int_sts_countre | PCT_int_sts_countre | GRADING_int_sts_countre | NUM_int_sts_setrc | PCT_int_sts_setrc | GRADING_int_sts_setrc | NUM_int_sts_dupl_ids | PCT_int_sts_dupl_ids | GRADING_int_sts_dupl_ids | NUM_int_sts_dupl_content | PCT_int_sts_dupl_content | GRADING_int_sts_dupl_content | MISSING | PCT_int_sts_element | NUM_int_sts_element | resp_vars | GRADING_int_sts_element
INTERVIEW | 4 | 0.186 | 1 | 1 | 0.046 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 0 | 0 | NA | 0
INTRO | 0 | 0.000 | 0 | 1 | 0.046 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 0 | 0 | NA | 0
LABORATORY | 1640 | 328.000 | 1 | 1 | 0.200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 0 | 0 | NA | 0
SOMATOMETRY | 1654 | 330.800 | 1 | 1 | 0.200 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 0 | 0 | NA | 0
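
Because the column names share the NUM, PCT, and GRADING prefixes, a table of this shape can be filtered by prefix. The sketch below shows one possible way to list the segments flagged by at least one check. It rebuilds only a few columns of the table above by hand so that it runs on its own; the object segment_table is a hypothetical stand-in for segment_structure$SegmentTable.

```r
# Hand-built excerpt of the table above (values copied from it).
segment_table <- data.frame(
  Segment = c("INTERVIEW", "INTRO", "LABORATORY", "SOMATOMETRY"),
  NUM_int_sts_countre = c(4, 0, 1640, 1654),
  PCT_int_sts_countre = c(0.186, 0, 328, 330.8),
  GRADING_int_sts_countre = c(1, 0, 1, 1),
  GRADING_int_sts_dupl_ids = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Select all grading columns by their shared prefix.
grading_cols <- grep("^GRADING_", names(segment_table), value = TRUE)

# A segment is flagged if any of its gradings is 1.
flagged <- segment_table$Segment[rowSums(segment_table[grading_cols]) > 0]

print(flagged)
```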


Interpretation

Any discrepancy indicates a data quality problem that needs to be investigated. In addition, the larger the discrepancy, the lower the data quality may be.

Algorithm of the implementation

  1. Compare the study data frames with the information provided in the metadata.
  2. Return the output in two summary tables: one with user-friendly descriptions, and one with concise column names that is used only during report generation.

Concept relations