This document describes the generation of a quality report using a 50% random data sample from the Study of Health in Pomerania START cohort, baseline examination (SHIP-START-0, 1997-2001). For further information on this study please see Völzke et al. 2010. Some noise has been introduced to the data to secure anonymity and for illustrative purposes.

Integrity

The first step in the data quality assessment workflow evaluates the compliance of the study data with the respective metadata regarding formal and structural requirements. Users must provide both data and metadata as data frames.

Note: The metadata file is the primary point of reference for generating data quality reports:

  1. It defines the number of variables for which to generate reports.
  2. It is the expected truth against which the study data are compared.

Study data

In this example, the SHIP data are loaded from the dataquieR package:

sd1 <- readRDS(system.file("extdata", "ship.RDS", package = "dataquieR"))

The imported study data consist of:

  • N = 2154 observations, and
  • P = 29 study variables.

Metadata

Similarly, the respective metadata is loaded from dataquieR:

metadata_file <- system.file("extdata", "ship_meta_v2.xlsx", package = "dataquieR")
prep_load_workbook_like_file(metadata_file)
md1 <- prep_get_data_frame("item_level") # item_level is a sheet in ship_meta_v2.xlsx

The imported metadata provide information for:

  • P = 29 study variables and
  • Q = 31 variable level attributes.

An identical number of variables in both files is desirable but not necessary. Attributes (i.e. columns in the metadata) comprise information on each variable of the study data file, such as labels or admissibility limits.
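Which variables are covered by only one of the two files can be checked by comparing the column names of the study data with the VAR_NAMES column of the metadata. The following is a minimal sketch on two small hypothetical data frames (toy_sd and toy_md are illustrative stand-ins, not the SHIP files):

```r
# Toy stand-ins for study data and item-level metadata (hypothetical values)
toy_sd <- data.frame(id = 1:3,
                     waist = c(80.5, 91.2, 77.0),
                     extra = c("a", "b", "c"))
toy_md <- data.frame(VAR_NAMES = c("id", "waist", "sbp"),
                     LABEL     = c("ID", "WAIST_CIRC_0", "SBP_0"),
                     stringsAsFactors = FALSE)

# Variables present in the study data but not described in the metadata
only_in_data <- setdiff(names(toy_sd), toy_md$VAR_NAMES)

# Variables described in the metadata but absent from the study data
only_in_meta <- setdiff(toy_md$VAR_NAMES, names(toy_sd))

only_in_data
only_in_meta
```

Applied to the objects of this example, the same comparison would use names(sd1) and md1$VAR_NAMES.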

Integrity

The integrity check starts by calling the function pro_applicability_matrix(). The data quality indicators covered by this function are:

pro_applicability_matrix() generates a heatmap-like plot for the applicability of all dataquieR functions to the study data, using the provided metadata as a point of reference:

appmatrix <- pro_applicability_matrix(study_data = sd1, 
                                      meta_data = md1, 
                                      label_col = LABEL, 
                                      split_segments = TRUE)

The heatmap can be retrieved by the command:

appmatrix$ApplicabilityPlot

As split_segments = TRUE was used as an argument, all output is organized by the study segments defined in the metadata. In this case, there are data from four examination segments: the computer-assisted interview, intro (basic information on the participants, such as sociodemographic information and examination date), laboratory analysis, and somatometric examination. The assignment of variables to segments is done in the metadata file.

The applicability check results are technical, i.e., the function compares, for example, the data type as defined in the metadata with the one observed in the study data. The light blue areas indicate that additional checks would be possible for many variables if additional metadata were provided.

Note: Applying all technically feasible data quality implementations to all study data variables is not advisable. For example, detection limits are not meaningful for participants' IDs. However, the variable ID is represented as an integer, which technically allows checking detection limits.

Solving integrity issues

All datatype issues found by pro_applicability_matrix() should be checked data element by data element. For instance, a major issue was found in the variable WAIST_CIRC_0. This variable is represented in the study data with datatype character, which differs from the expected datatype float defined in the metadata. Basic checks reveal that a comma was used as the decimal delimiter.

A direct conversion of WAIST_CIRC_0 to datatype numeric would coerce the affected values to NA, which should be avoided. Hence, we first replace the comma with the correct delimiter and then correct the datatype without losing data values. The resulting applicability plot shows no more issues.

# replace comma with the correct delimiter
sd1$waist <- as.numeric(gsub(",", ".", sd1$waist))

pro_applicability_matrix(study_data = sd1, meta_data = md1, label_col = LABEL)$ApplicabilityPlot
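Why the direct conversion loses data can be seen on a small hypothetical character vector (toy values, not SHIP data): as.numeric() returns NA for entries with a decimal comma, whereas replacing the comma first preserves them.

```r
# Hypothetical waist values, one of them written with a decimal comma
waist_chr <- c("85.5", "92,3", "101")

# Direct coercion: "92,3" cannot be parsed and becomes NA (with a warning)
direct <- suppressWarnings(as.numeric(waist_chr))

# Replace the comma first, then coerce: no value is lost
fixed <- as.numeric(gsub(",", ".", waist_chr, fixed = TRUE))

sum(is.na(direct))  # number of values lost by direct coercion
sum(is.na(fixed))   # number of values lost after the replacement
```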

Completeness

The next major step in the data quality assessment workflow is to assess the occurrence and patterns of missing data. The sequence of checks in this example is ordered according to common stages of a data collection:

Level                 Description
Unit missingness      Subjects without information on any of the provided study variables
Segment missingness   Subjects without information for all variables on a defined study segment (e.g., some examination)
Item missingness      Subjects without information on data fields within segments

Following this sequence enables calculating the correct denominators to compute item missingness. Such calculations are particularly important for complex cohort studies in which different levels of examination programs are conducted. For example, only half of a study population might be selected for an MRI examination. In the remaining 50%, the respective MRI variables are not included per study design. This should be considered if item missingness is examined.

Unit missingness

This check identifies subjects without any measurements on the provided target variables for a data quality check.

Note: The interpretation of findings depends on the scope of the provided variables and data records. In this example, the study data set comprises examined SHIP participants, not the target sample. Accordingly, the check is not about study participation. Rather, it identifies cases for which unexpectedly no information has been provided at all. Any identified case would indicate a data management problem.

The indicator covered by com_unit_missingness() is:

  • DQI-2001 Missing values with an implementation at the level “Units”

Unit missingness can be assessed with:

my_unit_missings2 <- com_unit_missingness(study_data  = sd1, 
                                          meta_data   = md1,
                                          label_col   = LABEL,
                                          id_vars     = "ID")
my_unit_missings2$SummaryData

In total, 0 units have missing values in all variables of the study data. Thus, for each participant there is at least one variable with information.
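The idea behind this check can be sketched in base R on a hypothetical toy data frame. This is only an illustration of the concept, not the implementation used by com_unit_missingness(): a unit counts as missing if every variable other than the ID is NA.

```r
# Toy data (hypothetical): the ID is always present, measurements may be missing
toy <- data.frame(id  = 1:4,
                  sbp = c(120, NA, 135, NA),
                  dbp = c(80, NA, NA, 85))

# Exclude the ID column, then flag rows in which every remaining value is NA
measure_cols <- setdiff(names(toy), "id")
unit_missing <- rowSums(!is.na(toy[, measure_cols, drop = FALSE])) == 0

toy$id[unit_missing]  # IDs of units without any measurement
sum(unit_missing)     # number of affected units
```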

Segment missingness

Subsequently, a check is performed that identifies subjects without any measurements within each of the four defined study segments.

The indicator covered by com_segment_missingness() is:

  • DQI-2001 Missing values with an implementation at the level “Segments”

The table output can be retrieved with:

MissSegs <- com_segment_missingness(study_data = sd1, 
                                    meta_data = md1, 
                                    threshold_value = 1, 
                                    direction = "high",
                                    exclude_roles = c("secondary", "process"))

MissSegs$SummaryData
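Conceptually, the same logic as for unit missingness is applied within each segment. A minimal base R sketch, assuming a hypothetical toy data set and a hypothetical assignment of variables to two segments (SOMA and LAB are made-up names for illustration):

```r
# Toy data (hypothetical) with two segments of two variables each
toy <- data.frame(id   = 1:4,
                  sbp  = c(120, NA, 135, 128),
                  dbp  = c(80, NA, NA, 85),
                  chol = c(NA, 5.2, NA, 4.8),
                  hdl  = c(NA, 1.3, NA, 1.1))
segments <- list(SOMA = c("sbp", "dbp"), LAB = c("chol", "hdl"))

# For each segment, flag subjects with no observed value in any of its variables
seg_missing <- sapply(segments, function(vars) {
  rowSums(!is.na(toy[, vars, drop = FALSE])) == 0
})

# Percentage of subjects with a completely missing segment
pct <- colMeans(seg_missing) * 100
pct
```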

Exploring segment missingness over time requires another variable in the study data. Information regarding this variable can be added to the metadata using the dataquieR function prep_add_to_meta():

# create a discretized version of the examination year
sd1$exyear <- as.integer(lubridate::year(sd1$exdate))

# add metadata for this variable
md1 <- dataquieR::prep_add_to_meta(VAR_NAMES = "exyear", 
                                   DATA_TYPE = "integer",
                                   LABEL = "EX_YEAR_0",
                                   VALUE_LABELS = "1997 = 1st | 1998 = 2nd | 1999 = 3rd | 2000 = 4th | 2001 = 5th",
                                   VARIABLE_ROLE = "process",
                                   meta_data = md1)

With a discretized variable for examination year (EX_YEAR_0) the occurrence pattern by year can subsequently be assessed using the command com_segment_missingness():

MissSegs <- com_segment_missingness(study_data = sd1, 
                                    meta_data = md1, 
                                    threshold_value = 1, 
                                    label_col = LABEL,
                                    group_vars = "EX_YEAR_0",
                                    direction = "high",
                                    exclude_roles = "process")

MissSegs$SummaryPlot

The plot is a descriptor of the indicator:

  • DQI-2001 Missing values with an implementation at the level “Segments”

It illustrates that missing information from the laboratory examination is distributed unequally across examination years, with the highest proportion of missing data occurring in the 1st, 2nd, and 5th years.

Item missingness

The final check in the completeness dimension identifies subjects with missing information in variables of all study segments. The indicators covered by the function com_item_missingness() are:

  • DQI-1008 Uncertain missingness status
  • DQI-2001 Missing values with an implementation at the level “Items”
  • DQI-2005 Missing due to specified reason

Item missingness can be assessed by using the following call:

item_miss <- com_item_missingness(study_data      = sd1, 
                                  meta_data       = md1, 
                                  show_causes     = TRUE, 
                                  label_col       = "LABEL",
                                  include_sysmiss = TRUE, 
                                  threshold_value = 95
                                ) 

Summary table

A result overview can be obtained by requesting a summary table of this function:

item_miss$SummaryTable

The table provides one line for each of the 29 variables. Of particular interest are:

  • System missings N : the number of data fields for each variable without any valid data entry, indicating a technically inferior coding (DQI-1008).
  • Missing Codes: the number of data fields with valid missing codes.
  • Jump codes: data fields, for which no data collection was attempted.
  • Measurements: provides an inverse of DQI-2001 Missing values with an implementation at the level “Items”.

The table shows that the variable HOUSE_INCOME_MONTH_0 (monthly net household income) is affected by many missing values. In addition, age of diabetes onset (DIAB_AGE_ONSET_0) was only coded for 173 subjects, but most values are missing because of an intended jump.

Note: If jump codes have been used, e.g., for the variable CONTRACEPTIVE_EVER_0, the denominator for the calculation of item missingness is corrected for the number of jump codes used.
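The effect of this correction can be illustrated with a small base R sketch. The code values 99980 and 88880 are hypothetical placeholders for a missing code and a jump code, and the calculation shows only the principle, not necessarily the exact computation in com_item_missingness():

```r
# Hypothetical codes: 99980 = refusal (missing code), 88880 = jump code
x <- c(1, 2, 88880, 88880, 99980, NA, 1, 2, 88880, 1)
jump_code    <- 88880
missing_code <- 99980

n_total   <- length(x)
n_jump    <- sum(x == jump_code, na.rm = TRUE)
n_missing <- sum(is.na(x) | x == missing_code, na.rm = TRUE)

# Observations with a jump were never expected, so they leave the denominator
n_expected <- n_total - n_jump

naive_pct     <- 100 * n_missing / n_total     # ignores the study design
corrected_pct <- 100 * n_missing / n_expected  # denominator corrected for jumps

c(naive = naive_pct, corrected = corrected_pct)
```

Without the correction, item missingness is underestimated whenever a question was skipped by design for part of the participants.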

Summary plot

The summary plot delivers a different view of missing data by providing the frequency of the specified reasons for missing data. The corresponding indicator is:

  • DQI-2005 Missing due to specified reason.

item_miss$SummaryPlot

In the plot, the balloon size is determined by the number of missing data fields. It can now be inferred that, for example, the elevated number of missing values for the item HOUSE_INCOME_MONTH_0 is mainly caused by refusals of participants to answer the respective question.

Consistency

Consistency is targeted after completeness has been examined because it requires data without missing and jump codes. Consistency, a main aspect of correctness, describes the degree to which data values are free of breaks in conventions or contradictions. Different data types may be addressed in respective checks.

Inadmissible numerical values

The indicator covered by con_limit_deviations() when specifying limits = "HARD_LIMITS" is:

  • DQI-3001 Inadmissible numerical values

Note: When specifying limits = "SOFT_LIMITS", the check identifies uncertain rather than inadmissible values, according to the specified ranges. The related indicator is then:

The call in this example is:

MyValueLimits <- con_limit_deviations(study_data = sd1,
                                      meta_data  = md1,
                                      label_col  = "LABEL",
                                      limits     = "HARD_LIMITS")

Summary data

A table output provides the number and percentage of all range violations for the variables with limits specified in the metadata:

MyValueLimits$SummaryData

The last column of the table also provides a GRADING. If the percentage of violations is above some threshold, a GRADING of 1 is assigned. In this case any occurrence is classified as problematic. Otherwise the GRADING is 0.

The following statement assigns all variables identified as problematic to the R object whichdeviate to enable a more targeted output, for example to plot the distributions for any variable with violations along the specified limits:

# select variables with deviations
whichdeviate <- as.character(MyValueLimits$SummaryTable$Variables)[MyValueLimits$SummaryTable$GRADING == 1]
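The logic of a hard limit check with such a grading can be sketched in base R. The values and limits below are hypothetical, and the sketch assumes that any violation leads to a GRADING of 1, as described above; it does not reproduce the internals of con_limit_deviations():

```r
# Hypothetical systolic blood pressure values and admissibility limits
sbp   <- c(118, 135, 20, 142, 310, NA, 127)
lower <- 40
upper <- 300

# Flag non-missing values outside the admissible range
below <- !is.na(sbp) & sbp < lower
above <- !is.na(sbp) & sbp > upper

n_viol   <- sum(below | above)
pct_viol <- 100 * n_viol / sum(!is.na(sbp))

# Any occurrence is treated as problematic in this sketch
grading <- as.integer(pct_viol > 0)

c(violations = n_viol, percent = round(pct_viol, 1), GRADING = grading)
```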

Summary plot

We can restrict the plots to those where variables have limit deviations, i.e., those with a GRADING of 1 in the table above, using MyValueLimits$SummaryPlotList[whichdeviate] (only the first two are displayed below to reduce file size):

Inadmissible categorical values

A comparable check may be performed for categorical variables using the command con_inadmissible_categorical().

The covered indicator is:

  • DQI-3003 Inadmissible categorical values

The call is:

IAVCatAll <- con_inadmissible_categorical(study_data = sd1, 
                                          meta_data  = md1, 
                                          label_col  = "LABEL")

As with inadmissible numerical values, a table output displays the observed categories, the defined categories, any non-matching level, its count, and a GRADING:

IAVCatAll$SummaryData

The results show that there is one variable, SCHOOL_GRAD_0, with one inadmissible level occurring.
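The comparison of observed and defined categories can be sketched in base R. The example assumes value labels written in the "code = label | code = label" style used for VALUE_LABELS earlier in this document. The labels and observed values are hypothetical, and the parsing is a simplified illustration, not the parser used by dataquieR:

```r
# Hypothetical value label definition and observed codes
value_labels <- "1 = primary | 2 = secondary | 3 = tertiary"
observed     <- c(1, 2, 2, 3, 9, 1, NA)

# Split the definition at "|" and keep the code in front of each "="
pairs   <- strsplit(value_labels, "|", fixed = TRUE)[[1]]
defined <- as.numeric(trimws(sub("=.*$", "", pairs)))

# Observed, non-missing levels that are not part of the definition
inadmissible <- setdiff(unique(observed[!is.na(observed)]), defined)

inadmissible
table(observed[observed %in% inadmissible])  # count per inadmissible level
```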

Contradictions

The second main type of checks within the consistency dimension concerns contradictions. The indicators covered by the function con_contradictions_redcap() are:

The rules to identify contradictions must first be loaded from a spreadsheet. The creation of this spreadsheet is supported by a Shiny App. Overall, 12 different logical comparisons can be applied. An overview is given in the respective tutorial. Each line within the spreadsheet defines one check rule.

checks <- prep_get_data_frame("cross-item_level") # cross-item_level is a sheet in ship_meta_v2.xlsx, which was loaded earlier

Subsequently, the contradictions assessment may be triggered by con_contradictions_redcap() using the table checks as the point of reference:

AnyContradictions <- con_contradictions_redcap(study_data           = sd1,
                                               meta_data            = md1,
                                               label_col            = "LABEL",
                                               meta_data_cross_item = checks,
                                               threshold_value      = 1)

Summary table

A summary table shows the number and percentage of contradictions for each defined rule:

AnyContradictions$SummaryTable

In this example, rule seven leads to the identification of 35 contradictions: age of onset for diabetes is provided (DIAB_AGE_ONSET_0), but the variable on the presence of diabetes (DIABETES_KNOWN_0) does not indicate a known disease.
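What a rule of this kind evaluates can be shown with a base R sketch on hypothetical toy data. The column names and values are made up, and the rule syntax of the spreadsheet is not reproduced here:

```r
# Toy data (hypothetical): known diabetes and reported age of onset
toy <- data.frame(id             = 1:5,
                  diabetes_known = c("yes", "no", "no", "yes", "no"),
                  diab_age_onset = c(52, NA, 47, 60, NA),
                  stringsAsFactors = FALSE)

# Contradiction: an age of onset is given although no diabetes is known
contradiction <- !is.na(toy$diab_age_onset) & toy$diabetes_known == "no"

toy$id[contradiction]      # affected subjects
sum(contradiction)         # number of contradictions
100 * mean(contradiction)  # percentage of contradictions
```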

Summary plot

The distributions may also be displayed as a plot:

AnyContradictions$SummaryPlot 

Accuracy

In contrast to most consistency related indicators, accuracy findings indicate an elevated probability that some data quality issue exists, rather than a certain issue.

Univariate outlier

Univariate outliers are assessed based on statistical criteria. The covered indicator is:

The function acc_robust_univariate_outlier() identifies outliers according to the approaches of Tukey, SixSigma, Hubert, and the heuristic approach of SigmaGap. It may be called as follows:

UnivariateOutlier <- dataquieR:::acc_robust_univariate_outlier(study_data      = sd1,
                                                               meta_data       = md1,
                                                               label_col       = "LABEL")

Summary table

The first output is a table that provides descriptive statistics and detected outliers according to the different criteria:

UnivariateOutlier$SummaryTable

Most variables have outliers according to at least two criteria, but only the diastolic blood pressure variables (DBP_0.1 and DBP_0.2) show outliers under the Sigma-gap criterion, where two were detected.
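That different criteria can disagree on the same data is easy to reproduce. The base R sketch below applies two textbook rules to a hypothetical vector: Tukey's fences (1.5 times the interquartile range beyond the quartiles) and a band of three standard deviations around the mean, which is assumed here as a common reading of the six-sigma criterion. The exact definitions used by dataquieR may differ.

```r
# Hypothetical measurements with one conspicuous value
x <- c(10, 11, 12, 12, 13, 13, 14, 15, 16, 40)

# Tukey: values beyond 1.5 * IQR from the first or third quartile
q     <- quantile(x, c(0.25, 0.75))
iqr   <- q[[2]] - q[[1]]
tukey <- x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr

# Three-sigma band: values more than 3 standard deviations from the mean
sigma <- abs(x - mean(x)) > 3 * sd(x)

data.frame(x = x, tukey = tukey, sigma = sigma)
```

In this toy vector the extreme value inflates the standard deviation so much that the three-sigma band no longer flags it, while the quartile-based rule does. This is one reason why robust criteria are reported side by side.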

Summary plot

To obtain better insight into univariate distributions, a plot is provided (call it with UnivariateOutlier$SummaryPlotList). It highlights observations for each variable according to the number of violated rules (only the first four are shown here):

Multivariate outlier

The function acc_multivariate_outlier() identifies outliers related to the indicator:

acc_multivariate_outlier() uses the same rules as acc_robust_univariate_outlier() for the identification of outliers.

The following function call relates systolic and diastolic blood pressure measurements to age and weight. A table output reports the number of detected multivariate outliers:

MVO_SBP0.1 <- acc_multivariate_outlier(variable_group = c("SBP_0.1", "DBP_0.1", "AGE_0", "BODY_WEIGHT_0"),
                                       study_data      = sd1,
                                       meta_data       = md1,
                                       id_vars         = "ID",
                                       label_col       = "LABEL")

MVO_SBP0.1$SummaryTable

The number of outliers varies considerably, depending on the criterion. Subsequently a parallel-coordinate-plot may be requested to further inspect results:

MVO_SBP0.1$SummaryPlot

Another example is the inspection of the first and second systolic blood pressure measurements:

MVO_DBP <- acc_multivariate_outlier(variable_group = c("SBP_0.1", "SBP_0.2"),
                                    study_data     = sd1,
                                    meta_data      = md1,
                                    label_col      = "LABEL")

MVO_DBP$SummaryTable
MVO_DBP$SummaryPlot


Distribution

The function acc_distributions() describes distributions using histograms and displays empirical cumulative distribution functions (ECDFs) if a grouping variable is provided. The function is only descriptive and not related to a specific indicator. Instead, the results may be relevant to most indicators within the unexpected distribution domain.
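What a group-wise ECDF expresses can be sketched with the base R function ecdf() on simulated data. The examiner labels and height values are hypothetical, and the sketch does not reproduce the plots of acc_distributions():

```r
set.seed(1)

# Simulated body heights (hypothetical) measured by two examiners
toy <- data.frame(examiner = rep(c("A", "B"), each = 50),
                  height   = c(rnorm(50, mean = 170, sd = 8),
                               rnorm(50, mean = 173, sd = 8)))

# One empirical cumulative distribution function per examiner
ecdf_by_examiner <- lapply(split(toy$height, toy$examiner), ecdf)

# Proportion of measurements at or below 170 cm for each examiner
p170 <- sapply(ecdf_by_examiner, function(f) f(170))
p170
```

Curves that are shifted against each other across examiners would suggest systematic differences in how the measurement was taken.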

The following example examines measurements in which a possible influence of the examiners is considered:

ECDFSoma <- acc_distributions(resp_vars = c("WAIST_CIRC_0", "BODY_HEIGHT_0", "BODY_WEIGHT_0"),
                              group_vars = "OBS_SOMA_0",
                              study_data      = sd1,
                              meta_data       = md1,
                              label_col       = "LABEL")

The respective list of plots may be displayed using ECDFSoma$SummaryPlotList (only the first 2 plots are displayed below):


Margins

The function acc_margins() is mainly related to the indicators:

However, it also provides descriptive output such as violin and box plots for continuous variables, count plots for categorical data, and density plots for both. The main application of acc_margins() is to make inferences about effects related to process variables such as examiners, devices, or study centers. The function determines whether measurements are provided as continuous or discrete. Alternatively, metadata specifications may provide this information.

In the first example acc_margins() is applied to the variable waist circumference (WAIST_CIRC_0). In this case, dependencies related to the examiners (OBS_SOMA_0) are assessed while the raw measurements are controlled for the variables age and sex (AGE_0, SEX_0):

Waist circumference

marginal_dists <- acc_margins(resp_vars  = "WAIST_CIRC_0",
                              co_vars    = c("AGE_0", "SEX_0"),
                              group_vars = "OBS_SOMA_0",
                              study_data = sd1,
                              meta_data  = md1,
                              label_col  = "LABEL")

A plot is provided to view the results:

marginal_dists$SummaryPlot

Based on a statistical test, the mean waist circumference of no examiner differed significantly (p < 0.05) from the overall mean.

Myocardial infarction

The situation is different when assessing the coded myocardial infarction (MYOCARD_YN_0) across examiners while controlling for age and sex:

marginal_dists <- acc_margins(resp_vars  = "MYOCARD_YN_0",
                              co_vars    = c("AGE_0", "SEX_0"),
                              group_vars = "OBS_INT_0",
                              study_data = sd1,
                              meta_data  = md1,
                              label_col  = "LABEL")

marginal_dists$SummaryPlot

The result shows elevated proportions for the examiners 05 and 07.

Variance components

An important and related issue is the quantification of the observed examiner effects, which is accomplished by the function acc_varcomp() related to the indicators:

acc_varcomp() computes the percentage of variance of a target variable that is attributable to the grouping variable while controlling for other variables (here, age and sex). The output may be reviewed in a table format:

vcs <- acc_varcomp(resp_vars  = "WAIST_CIRC_0",
                   co_vars    = c("AGE_0", "SEX_0"),
                   group_vars = "OBS_SOMA_0",
                   study_data = sd1,
                   meta_data  = md1,
                   label_col  = "LABEL")

vcs$SummaryTable

For the variable WAIST_CIRC_0, an ICC of 0.019 has been found, which is below the threshold. The same is the case for the variable MYOCARD_YN_0, probably because the case count for the two deviant examiners 05 and 07 is low:

vcs <- acc_varcomp(resp_vars  = "MYOCARD_YN_0",
                   co_vars    = c("AGE_0", "SEX_0"),
                   group_vars = "OBS_INT_0",
                   study_data = sd1,
                   meta_data  = md1,
                   label_col  = "LABEL")

vcs$SummaryTable
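The meaning of an ICC as the share of variance attributable to the grouping variable can be illustrated with a classical one-way ANOVA estimator on simulated, balanced data. This is a simplified sketch without covariate adjustment, and acc_varcomp() may use a different estimation approach. All names and values below are hypothetical:

```r
set.seed(42)

# Simulated data: 8 examiners with 30 measurements each (balanced design)
n_groups <- 8
k        <- 30
examiner <- factor(rep(seq_len(n_groups), each = k))

# Small examiner effect (SD 1) relative to the residual variation (SD 10)
examiner_effect <- rnorm(n_groups, mean = 0, sd = 1)
waist <- 90 + examiner_effect[as.integer(examiner)] +
  rnorm(n_groups * k, mean = 0, sd = 10)

# Mean squares between and within examiners from a one-way ANOVA
fit <- anova(lm(waist ~ examiner))
msb <- fit["examiner", "Mean Sq"]
msw <- fit["Residuals", "Mean Sq"]

# One-way ANOVA estimator of the ICC for a balanced design
icc <- (msb - msw) / (msb + (k - 1) * msw)
icc
```

An estimate close to zero means that knowing the examiner explains almost none of the variation in the measurements.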


LOESS

The study of effects across groups and times is particularly complex. The function acc_loess() provides a descriptor related to the indicator:

acc_loess() may also be used to obtain information related to other indicators in the domain of unexpected distributions. A sample call using waist circumference as the target variable is:

timetrends <- acc_loess(resp_vars  = "WAIST_CIRC_0",
                        co_vars    = c("AGE_0", "SEX_0"),
                        group_vars = "OBS_SOMA_0",
                        time_vars  = "EXAM_DT_0",
                        study_data = sd1,
                        meta_data  = md1,
                        label_col  = "LABEL")

invisible(lapply(timetrends$SummaryPlotList, print))

The graph for this variable indicates no major discrepancies between the examiners over the examination period.

Shape

Assessing the shape of a distribution is, next to location parameters, an important aspect of accuracy.

The related indicator is:

Observed distributions can be tested against expected distributions using the function acc_shape_or_scale().

In this example the uniform distribution for the use of measurement devices is examined:

MyUnexpDist1 <- acc_shape_or_scale(resp_vars  = "DEV_BP_0", 
                                   guess      = TRUE, 
                                   label_col  =  "LABEL",
                                   dist_col   = "DISTRIBUTION",
                                   study_data = sd1, 
                                   meta_data  = md1)

MyUnexpDist1$SummaryPlot

The plot illustrates that devices have not been used with comparable frequencies.
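A departure from uniform use can also be checked numerically. The sketch below applies a chi-squared goodness-of-fit test from base R to hypothetical device counts. It illustrates the general idea only and is not the method implemented in acc_shape_or_scale():

```r
# Hypothetical device assignments for 100 measurements
device <- c(rep("dev1", 60), rep("dev2", 25), rep("dev3", 15))
counts <- table(device)

# Goodness-of-fit test; by default all categories are expected equally often
test <- chisq.test(counts)

counts
test$statistic
test$p.value
```

A small p-value indicates that the observed frequencies are unlikely under equal use of all devices.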

Another example examines the normal distribution of blood pressure:

MyUnexpDist2 <- acc_shape_or_scale(resp_vars  = "SBP_0.2", 
                                   guess      = TRUE, 
                                   label_col  =  "LABEL",
                                   dist_col   = "DISTRIBUTION",
                                   study_data = sd1, 
                                   meta_data  = md1)

MyUnexpDist2$SummaryPlot

The result reveals a slight discrepancy from the normality assumption. It is up to the person responsible for the data quality assessments to decide whether such a discrepancy is relevant.


End digit preferences

The analysis of end digit preferences is a specific implementation related to the indicator:

In this example the uniform distribution of the end digits of body height is examined. Body height in SHIP-START-0 was a measurement which required the manual reading and transfer of data into an eCRF.

MyEndDigits <- acc_end_digits(resp_vars  = "BODY_HEIGHT_0", 
                              label_col  = LABEL,
                              study_data = sd1, 
                              meta_data  = md1)

MyEndDigits$SummaryPlot

The graph highlights no relevant effects across the ten categories.
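How end digits are extracted and tabulated can be sketched in base R. The heights below are hypothetical whole-centimeter values chosen to show a pronounced preference for 0 and 5, unlike the SHIP result above:

```r
# Hypothetical body heights in whole centimeters
height <- c(170, 165, 180, 175, 160, 170, 185, 172, 168, 175,
            180, 165, 170, 190, 163, 175, 180, 170, 165, 177)

# The end digit is the remainder after division by 10
end_digit <- factor(height %% 10, levels = 0:9)
counts    <- table(end_digit)

counts

# Share of values ending in 0 or 5; about 0.2 would be expected without preference
share_0_5 <- sum(counts[c("0", "5")]) / sum(counts)
share_0_5
```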

Output within the accuracy dimension frequently combines descriptive and inferential content, which is necessary to facilitate valid conclusions on data quality issues. Further details on all functions can be obtained following the links and in the software section.
