Introduction

The most basic data quality (DQ) assessments target single data elements (variables/items). Data quality assessments therefore require item level metadata. For detailed information on the item level metadata please see Richter et al. 2019. An overview on item level metadata used by dataquieR is provided below.


Item level metadata for data quality reporting


VARIABLE AND VALUE LABELS


VAR_NAMES

The first column specifies the variable name in the study data to be analysed. The input must be a string without blank spaces.


LABEL

Appropriate labels are a necessary precondition for readable data quality reports. Their absence, however, does not affect the functionality of the statistical implementations.

CAVEAT: A necessary convention for all labels in the current project phase is the definition of unique and short labels. This is necessary since labels that are too long may corrupt reports.

Assigning labels to variables is important because variable names in the study data are rather technical and may only be interpretable to a limited extent. As variable names, each variable label should be unique. In addition, labels should be as short as possible to ensure a readable output. To enhance the presentation and plotting quality, character length specified in LABELS should not exceed 20 characters.

VAR_NAMES LABEL
v00000 CENTER_0
v00001 PSEUDO_ID
v00002 SEX_0
v00003 AGE_0
v00103 AGE_GROUP_0

All implementations of dataquieR support the use of LABELS.


LONG_LABEL

Under some circumstances, a short label or variable name is insufficient to provide all necessary information. The column “LONG_LABEL” can be filled with self-explaining annotations for variables. Long labels are more relevant for tabular output than for graphical output.

Short or long labels can be defined in all implementations of dataquieR by specifying the label_col formal as an input.


VALUE_LABELS

Categorical variables in the study data are often coded as integers (e.g. 0, 1). Because the number is non-informative, labels are essential to secure understandable reports, e.g.:

  • The sex of participants can be coded as \(0 = females\) and \(1 = males\).
  • The presence of a disease can be coded as \(0 = no\) and \(1 = yes\).

To make use of VALUE_LABELS in dataquieR, the following convention has been made: all values of a study variable and respective labels can be summarized in a list using the pipe operator \(|\) for separation. For example:

VAR_NAMES LABEL VALUE_LABELS
3 v00002 SEX_0 0 = females | 1 = males
12 v00007 ASTHMA_0 0 = no | 1 = yes

To enhance presentation and plotting quality, the character length of a value label specified in VALUE_LABELS should not exceed 20 characters.


DATA_TYPE

In contrast to [LABEL] the definition of the DATA_TYPE is crucial because the applicability of dataquieR functions may depend on the data type.

The following DATA_TYPES are differentiated in dataquieR:

  • float
  • integer
  • datetime
  • string

The list appears small compared to some electronic data capturing systems (e.g. REDCap, Harris et al. 2009) or Shiny Apps (Chang et al. 2018). However, the data type should not be mixed up with data entry types which could be very different using sliders or radio buttons. Similarly, the data type is not a statistical property such as an ordinal characteristic.

VAR_NAMES LABEL DATA_TYPE
2 v00001 PSEUDO_ID string
3 v00002 SEX_0 integer
9 v00004 SBP_0 float
10 v00005 DBP_0 float
19 v00013 EXAM_DT_0 datetime


SCALE_LEVEL

The definition of the SCALE_LEVEL is important because it defines what type of mathematical operations and which dataquieR functions can be applied to the data.

The following SCALE_LEVEL entries are allowed in dataquieR:

  • nominal
  • ordinal
  • interval
  • ratio
  • na

The category na is used for variables that do not fit in the other categories (e.g., unstructured texts, json, xml).


MISSINGS

Data often contain a qualification of values which are not measurements. These are for example codes for missing values. The figure below shows the use of such codes in the variable V_0101. Both, measurement values and missing codes are considered as data values.


Categorization of measurements and missing values in dataquieR
Categorization of measurements and missing values in dataquieR


Using such codes may complicate the application of standardized routines for DQ assessment since coded missing measurements must be correctly interpreted. For example, it must be secured that a data value representing a missing code is not treated as a measurement value to avoid spurious results when addressing data accuracy. Therefore codes representing non-measurement values must be correctly identified and treated correctly.

dataquieR distinguishes missing codes through the MISSING_LIST metadata column and jump codes by the JUMP_LIST column. However, providing the codes in an additional table called missing table is recommended. For this, a string with the table name can be specified in the column MISSING_LIST_TABLE, referring to a spreadsheet in the same or another workbook or an URL.

CAVEAT: Currently, within a variable, dataquieR only accepts missing codes that match the data type of the variable. For dates, for example, the missing table can only contain missing codes in datetime format, such as 1800.01.01 00:00:00 AM; and not numeric codes such as 99981. This must be kept in mind for the metadata columns MISSING_LIST, JUMP_LIST and the MISSING_LIST_TABLE.


MISSING_LIST

Codes specified in the MISSING_LIST indicate unexpected missingness of measurements, for example missing values due to refusals or technical problems.

The MISSING_LIST is a list of pipe \(|\) separated numeric codes: \(99980\: |\: 99983\: |\: 99988\).


JUMP_LIST

Codes in the JUMP_LIST indicate measurements which are missing by design. For example, if a sub-sample of a study population does not participate in a specific examination (by design), then jump codes should be used to indicate this reason for missingness.

The JUMP_LIST is a list of pipe \(|\) separated numeric codes: \(88880\: |\: 88883\: |\: 88884\).


MISSING_LIST_TABLE

The name and location of a table containing the missing assignments for the respective variable. The input must be a string and can refer to a spreadsheet in the same or another workbook.

In the example below, the missing codes for the variables are specified in the sheet called missing_table of the same workbook.

VAR_NAMES LABEL MISSING_LIST_TABLE
9 v00004 SBP_0 missing_table
10 v00005 DBP_0 missing_table
11 v00006 GLOBAL_HEALTH_VAS_0 missing_table

If the codes were defined in another workbook, the path and name of the spreadsheet must be given. For example: "d:/data/questionnaire_data_codes.xlsx | missing_codes".


LIMITS

Limits describe ranges to check the plausibility of measurement values (hard, soft limits) or to identify measurements outside a measurable range (detection limits). Limits may apply to study data of type float, integer, and datetime. Specifying limits can be content-driven (e.g., based on clinical information) or may depend on properties of the used examination device or the outcome under study. For example, body weight cannot be negative.

Unfortunately, the definition of limits can be ambiguous:

  • A plausibility limit of “\(\gt10\)” may imply that all values above are plausible.
  • However, this notation is also frequently used to guide decisions in eCRFs, i.e. if a value is “\(\gt10\)”, the user should be alerted to an implausible value.


To avoid this ambiguity, HARD_LIMITS, SOFT_LIMITS, and DETECTION_LIMITS in the metadata are defined using interval notation. Values inside the interval are eligible/plausible/possible. The definition of intervals adheres also to a distinguished use of braces:

  • \((0;\:10)\): open interval, i.e. values \(>0\) and \(<10\) are inside the interval.
  • \((0;\:10]\): left-open interval, i.e. values \(>0\) and \(\le10\) are inside the interval.
  • \([0;\:10)\): right-open interval, i.e. values \(\ge0\) and \(<10\) are inside the interval.
  • \([0;\:10]\): is a closed interval, i.e. values \(\ge0\) and \(\le10\) are inside the interval.

Each side of the interval must be defined by a value of the same type as the measurement (including dates and datetimes). If the range is undefined, \(-Inf\) and/or \(Inf\) have to be defined. Please see the examples provided in [Metadata in dataquieR].

Two types of limits may be distinguished depending on whether the range indicates inadmissible or just unlikely values.


HARD_LIMITS

HARD_LIMITS should be specified to identify inadmissible values. Inadmissibility does not necessarily mean impossible. For example, while it is known that the heaviest man on earth did weigh more than 600kg, it may be reasonable to declare values above 250kg as inadmissible, because under the circumstances of a general population study in Germany it is deemed unlikely that a heavier person may arrive at the examination center.

For example, for blood pressure measurements, we may specify the following hard limits.

VAR_NAMES LABEL HARD_LIMITS
9 v00004 SBP_0 [80;180]
10 v00005 DBP_0 [50;Inf)


SOFT_LIMITS

The functionality of SOFT_LIMITS is similar to HARD_LIMITS. However, values outside the limits are not removed, because SOFT_LIMITS indicate improbable but not inadmissible measurements.

The formal setup of SOFT_LIMITS is identical to HARD_LIMITS, as shown in the metadata excerpt below.

VAR_NAMES LABEL SOFT_LIMITS
9 v00004 SBP_0 (90;170)
10 v00005 DBP_0 (55;100)


DETECTION_LIMITS

The definition of DETECTION_LIMITS can be necessary if measurement devices have predefined limits of sensitivity. It is possible that measurements are indicated as being below or above the DETECTION_LIMITS. Such information should result in a different management of respective data values as they are still informative and can be used in later analysis.

Values outside detection limits are not removed.

The formal setup of DETECTION_LIMITS is identical to HARD_LIMITS and SOFT_LIMITS, as can be seen in the metadata example below.

VAR_NAMES LABEL DETECTION_LIMITS
9 v00004 SBP_0 [0;265]
10 v00005 DBP_0 [0;265]
21 v00014 CRP_0 [0.16;Inf)


REPORT DESIGN

VARIABLE_ROLE

Usually not all variables of the study data will be subject to DQ reporting. To allow for simple filtering, different roles of variables can be defined. The number of roles is not limited. In dataquieR the following roles will be defined:

  • intro: administrative variables, for example indicating the participation in an examination;

  • primary: measurement variables of major importance. For this variable, all possible data quality results are created;

  • secondary: measurement variables of minor importance, with a reduced set of data quality results;

  • process: measurements of the data generating process under which study data were obtained. For example, room temperature or the respective examiner;

  • suppress: a variable needed in case referred to by other variables, but for which no data quality results will be created.

Currently only primary and suppress variable roles are supported.

The metadata defining the roles of the variables may be as follows.

VAR_NAMES LABEL VARIABLE_ROLE
2 v00001 PSEUDO_ID intro
5 v00103 AGE_GROUP_0 secondary
9 v00004 SBP_0 primary
10 v00005 DBP_0 primary
16 v00010 ARM_CUFF_0 process


VARIABLE_ORDER

In this column, the order of the variables in a data quality report can be defined. For example, this column may be as follows.

VAR_NAMES LABEL VARIABLE_ORDER
v00000 CENTER_0 1
v00001 PSEUDO_ID 2
v00002 SEX_0 3
v00003 AGE_0 4


How dataquieR uses item level metadata

dataquieR employs the predefined item level metadata in two ways:

  1. For each variable of the study data named in a function call of a DQ implementation, the respective metadata are interpreted from a data frame of metadata.

  2. Some implementations also search for relations between variables, such as a date-time-stamp that belongs to a measurement. The section GROUP COLUMNS explains the definition of such relations.

Therefore, metadata and study data must be defined in a 1:1 correspondence, i.e., each variable of the study data is identifiable in the metadata. The key for this mapping is the variable name, listed in the column VAR_NAMES in the metadata. A necessary convention regarding variable names is their uniqueness, i.e., none should have a duplicate (also implied by the 1:1 correspondence). Further, all metadata columns are defined in upper case letters to distinguish them from the study data.

Typical study data variables are participant identifiers, measurements (e.g., systolic blood pressure), and process variables (e.g., examiner ID) (left panel in the figure below). Metadata attributes are required for all variables in the study data (bottom right panel). For example, the metadata should contain a label and a list of codes for missing values for each of the study data variables (top right panel). Lists of expected categories should be specified if applicable.

Overview of metadata usage in dataquieR (Richter et al. 2019)
Overview of metadata usage in dataquieR (Richter et al. 2019)


dataquieR uses the following terms for data structures:

Key terms related to the data structure used in dataquieR (Schmidt et al. 2021)
Key terms related to the data structure used in dataquieR (Schmidt et al. 2021)


Back to Metadata

Chang, W., Cheng, J., Allaire, J., Xie, Y., McPherson, J., et al. (2018). Shiny: Web application framework for r, 2015. R Package Version 1, 14.
Harris, P.A., Taylor, R., Thielke, R., Payne, J., Gonzalez, N., and Conde, J.G. (2009). Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics 42, 377–381.
Richter, A., Schössow, J., Werner, A., Schauer, B., Radke, D., Henke, J., Struckmann, S., and Schmidt, C. (2019). Data quality monitoring in clinical and observational epidemiologic studies: The role of metadata and process information. GMS Med Inform Biom Epidemiol 15.
Schmidt, C.O., Struckmann, S., Enzenbach, C., Reineke, A., Stausberg, J., Damerow, S., Huebner, M., Schmidt, B., Sauerbrei, W., and Richter, A. (2021). Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in r. BMC Medical Research Methodology 21, 1–15.