Introduction

The most basic data quality (DQ) assessments target single data elements (variables/items). Data quality assessments therefore require item level metadata. For detailed information on the item level metadata please see Richter et al. 2019. An overview on item level metadata used by dataquieR is provided below.

Item level metadata for data quality reporting

VARIABLE AND VALUE LABELS

VAR_NAMES

The first column specifies the variable name in the study data to be analysed. The input must be a string without blank spaces.

LABEL

Appropriate labels are a necessary precondition for readable data quality reports. Their absence, however, does not affect the functionality of the statistical implementations.

CAVEAT: A necessary convention for all labels in the current project phase is the definition of unique and short labels. This is necessary since labels that are too long may corrupt reports.

Assigning labels to variables is important because variable names in the study data are rather technical and may only be interpretable to a limited extent. As variable names, each variable label should be unique. In addition, labels should be as short as possible to ensure a readable output. To enhance the presentation and plotting quality, character length specified in LABELS should not exceed 20 characters.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

VAR_NAMES	LABEL
v00000	CENTER_0
v00001	PSEUDO_ID
v00002	SEX_0
v00003	AGE_0
v00103	AGE_GROUP_0

All implementations of dataquieR support the use of LABELS.

LONG_LABEL

Under some circumstances, a short label or variable name is insufficient to provide all necessary information. The column “LONG_LABEL” can be filled with self-explaining annotations for variables. Long labels are more relevant for tabular output than for graphical output.

Short or long labels can be defined in all implementations of dataquieR by specifying the label_col formal as an input.

VALUE LABELS

Categorical variables in the study data are often coded as integers (e.g. 0, 1). Because the number is non-informative, labels are essential to secure understandable reports, e.g.:

The sex of participants can be coded as \(0 = females\) and \(1 = males\).
The presence of a disease can be coded as \(0 = no\) and \(1 = yes\).

VALUE_LABELS

To make use of VALUE_LABELS in dataquieR, the following convention has been made: all values of a study variable and respective labels can be summarized in a list using the pipe operator \(|\) for separation. For example:

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	VALUE_LABELS
3	v00002	SEX_0	0 = females \| 1 = males
12	v00007	ASTHMA_0	0 = no \| 1 = yes

To enhance presentation and plotting quality, the character length of a value label specified in VALUE_LABELS should not exceed 20 characters.

VALUE_LABEL_TABLE

In alternative, all values of a study variable and respective labels can be written in a table in a separate sheet of the same workbook. The name of a table containing these values can be indicated in the column VALUE_LABEL_TABLE. The input must be a string. The spreadsheet containing the corresponding table name and relative information is in a spreadsheet in the same workbook, called CODE_LIST_TABLE (see here for more details).

In the example below, the values and labels for the categorical variable SEX_0 are specified in the table called sex_vl.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

VAR_NAMES	LABEL	VALUE_LABEL_TABLE
v00002	SEX_0	sex_vl
v00007	ASTHMA_0	asthma_vl

The corresponding table name and relative information are in a spreadsheet in the same workbook, called [CODE_LIST_TABLE] (VIN_CODE_LIST_TABLE.html) that looks as follow:

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

CODE_VALUE	CODE_LABEL	CODE_CLASS	CODE_LIST_TABLE
0	females	VALUE	sex_vl
1	males	VALUE	sex_vl
0	no	VALUE	asthma_vl
1	yes	VALUE	asthma_vl

DATA_TYPE

In contrast to [LABEL] the definition of the DATA_TYPE is crucial because the applicability of dataquieR functions may depend on the data type.

The following DATA_TYPES are differentiated in dataquieR:

float
integer
datetime
string

The list appears small compared to some electronic data capturing systems (e.g. REDCap, Harris et al. 2009) or Shiny Apps (Chang et al. 2018). However, the data type should not be mixed up with data entry types which could be very different using sliders or radio buttons. Similarly, the data type is not a statistical property such as an ordinal characteristic.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	DATA_TYPE
2	v00001	PSEUDO_ID	string
3	v00002	SEX_0	integer
9	v00004	SBP_0	float
10	v00005	DBP_0	float
19	v00013	EXAM_DT_0	datetime

SCALE_LEVEL

The definition of the SCALE_LEVEL is important because it defines what type of mathematical operations and which dataquieR functions can be applied to the data.
The classification of measurement levels used by dataquieR is based one the one defined by Stevens (Stanley Smith 1946). The following SCALE_LEVEL entries are allowed:

nominal
ordinal
interval
ratio
na

The category na is used for variables that do not fit in the other categories (e.g., unstructured texts, json, xml).

MISSINGS

Data often contain a qualification of values which are not measurements. These are for example codes for missing values. The figure below shows the use of such codes in the variable V_0101. Both, measurement values and missing codes are considered as data values.

Categorization of measurements and missing values in dataquieR

Using such codes may complicate the application of standardized routines for DQ assessment since coded missing measurements must be correctly interpreted. For example, it must be secured that a data value representing a missing code is not treated as a measurement value to avoid spurious results when addressing data accuracy. Therefore codes representing non-measurement values must be correctly identified and treated correctly.

dataquieR distinguishes missing codes through the MISSING_LIST metadata column and jump codes by the JUMP_LIST column. However, providing the codes in an additional table called missing table is recommended. For this, a string with the table name can be specified in the column MISSING_LIST_TABLE, referring to a spreadsheet in the same or another workbook or an URL.

CAVEAT: Currently, within a variable, dataquieR only accepts missing codes that match the data type of the variable. For dates, for example, the missing table can only contain missing codes in datetime format, such as 1800.01.01 00:00:00 AM; and not numeric codes such as 99981. This must be kept in mind for the metadata columns MISSING_LIST, JUMP_LIST and the MISSING_LIST_TABLE.

MISSING_LIST

Codes specified in the MISSING_LIST indicate unexpected missingness of measurements, for example missing values due to refusals or technical problems.

The MISSING_LIST is a list of pipe \(|\) separated numeric codes: \(99980\: |\: 99983\: |\: 99988\).

JUMP_LIST

Codes in the JUMP_LIST indicate measurements which are missing by design. For example, if a sub-sample of a study population does not participate in a specific examination (by design), then jump codes should be used to indicate this reason for missingness.

The JUMP_LIST is a list of pipe \(|\) separated numeric codes: \(88880\: |\: 88883\: |\: 88884\).

MISSING_LIST_TABLE

The name and location of a table containing the missing assignments for the respective variable. The input must be a string and can refer to a spreadsheet in the same or another workbook.

In the example below, the missing codes for the variables are specified in the sheet called missing_table of the same workbook.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	MISSING_LIST_TABLE
9	v00004	SBP_0	missing_table
10	v00005	DBP_0	missing_table
11	v00006	GLOBAL_HEALTH_VAS_0	missing_table

If the codes were defined in another workbook, the path and name of the spreadsheet must be given. For example: "d:/data/questionnaire_data_codes.xlsx | missing_codes".

Note: in case of multiple sheets of missing tables, it is possible to have just one sheet in the same workbook called CODE_LIST_TABLE (read more here).

LIMITS

Limits describe ranges to check the plausibility of measurement values (hard, soft limits) or to identify measurements outside a measurable range (detection limits). Limits may apply to study data of type float, integer, and datetime. Specifying limits can be content-driven (e.g., based on clinical information) or may depend on properties of the used examination device or the outcome under study. For example, body weight cannot be negative.

Unfortunately, the definition of limits can be ambiguous:

A plausibility limit of “\(\gt10\)” may imply that all values above are plausible.
However, this notation is also frequently used to guide decisions in eCRFs, i.e. if a value is “\(\gt10\)”, the user should be alerted to an implausible value.

To avoid this ambiguity, HARD_LIMITS, SOFT_LIMITS, and DETECTION_LIMITS in the metadata are defined using interval notation. Values inside the interval are eligible/plausible/possible. The definition of intervals adheres also to a distinguished use of braces:

\((0;\:10)\): open interval, i.e. values \(>0\) and \(<10\) are inside the interval.
\((0;\:10]\): left-open interval, i.e. values \(>0\) and \(\le10\) are inside the interval.
\([0;\:10)\): right-open interval, i.e. values \(\ge0\) and \(<10\) are inside the interval.
\([0;\:10]\): is a closed interval, i.e. values \(\ge0\) and \(\le10\) are inside the interval.

Each side of the interval must be defined by a value of the same type as the measurement (including dates and datetimes). If the range is undefined, \(-Inf\) and/or \(Inf\) have to be defined. Please see the examples provided in [Metadata in dataquieR].

Two types of limits may be distinguished depending on whether the range indicates inadmissible or just unlikely values.

HARD_LIMITS

HARD_LIMITS should be specified to identify inadmissible values. Inadmissibility does not necessarily mean impossible. For example, while it is known that the heaviest man on earth did weigh more than 600kg, it may be reasonable to declare values above 250kg as inadmissible, because under the circumstances of a general population study in Germany it is deemed unlikely that a heavier person may arrive at the examination center.

For example, for blood pressure measurements, we may specify the following hard limits.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	HARD_LIMITS
9	v00004	SBP_0	[80;180]
10	v00005	DBP_0	[50;Inf)

SOFT_LIMITS

The functionality of SOFT_LIMITS is similar to HARD_LIMITS. However, values outside the limits are not removed, because SOFT_LIMITS indicate improbable but not inadmissible measurements.

The formal setup of SOFT_LIMITS is identical to HARD_LIMITS, as shown in the metadata excerpt below.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	SOFT_LIMITS
9	v00004	SBP_0	(90;170)
10	v00005	DBP_0	(55;100)

DETECTION_LIMITS

The definition of DETECTION_LIMITS can be necessary if measurement devices have predefined limits of sensitivity. It is possible that measurements are indicated as being below or above the DETECTION_LIMITS. Such information should result in a different management of respective data values as they are still informative and can be used in later analysis.

Values outside detection limits are not removed.

The formal setup of DETECTION_LIMITS is identical to HARD_LIMITS and SOFT_LIMITS, as can be seen in the metadata example below.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	DETECTION_LIMITS
9	v00004	SBP_0	[0;265]
10	v00005	DBP_0	[0;265]
21	v00014	CRP_0	[0.16;Inf)

ACCURACY RELATED

DISTRIBUTION

If the expected probability distribution of an outcome is known, deviations from this distribution can be examined. For example, systolic blood pressure may be assumed to approximately follow a normal distribution. One examiner of blood pressure measurement may round values to decadic numbers which can lead to a serious distortion of the data.

Currently, the following probability distributions are supported for an assessment by dataquieR with dedicated DQ-implementations:

normal
uniform
gamma

This restriction does not impede consideration of count data.

As shown below, we may expect a normal distribution for blood pressure variables, while the expectation may be a gamma distribution for the C-reactive protein measurement.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	DISTRIBUTION
9	v00004	SBP_0	normal
10	v00005	DBP_0	normal
21	v00014	CRP_0	gamma

DECIMALS

In the column DECIMALS, the accuracy of measurements in terms of expected decimal numbers can be specified. This may for example be necessary, if rounding is expected when addressing end digit preferences.

In the example below, no decimals are expected for the blood pressure variables, whereas three decimals are expected for the C-reactive protein measurement.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	DECIMALS
9	v00004	SBP_0	0
10	v00005	DBP_0	0
21	v00014	CRP_0	3

UNIVARIATE_OUTLIER_CHECKTYPE

Sets the type of check for the univariate assessment of outliers. Either Tukey, 3SD, Hubert or SigmaGap approaches are currently supported as input.

For instance, in the metadata excerpt below, SigmaGap and Tukey are specified.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	UNIVARIATE_OUTLIER_CHECKTYPE
37	v00028	INCOME_GROUP_0	Sigma_gap \| Tukey

N_RULES

Specifies the number of rules that must be violated for an observation to be flagged as an outlier. It applies to all potential assessment rules for univariate outliers.

The metadata example below specifies the default testing of the four outlier criteria (Tukey, 3SD, Hubert and SigmaGap) for the smoking variable. Since N_RULES is four, an observation will be flagged as an outlier if all criteria classify it as an outlier. In contrast, for the income variable, only SigmaGap and Tukey are specified; hence the maximum number of rules that can be violated is two, and an observation will be marked as an outlier if identified as such by both criteria.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	UNIVARIATE_OUTLIER_CHECKTYPE	N_RULES
33	v00024	SMOKING_0	NA	4
37	v00028	INCOME_GROUP_0	Sigma_gap \| Tukey	2

LOCATION_METRIC

Defines the location metric to be used. Either mean or median are supported as input. In the example below, the mean is specified for both blood pressure variables.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	LOCATION_METRIC
9	v00004	SBP_0	Mean
10	v00005	DBP_0	Mean

LOCATION_RANGE

Defines the range for the expected location parameter specified in LOCATION_METRIC, using the interval notation described above for limits. For the blood pressure variables, the example below shows the expected location of the mean value.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	LOCATION_METRIC	LOCATION_RANGE
9	v00004	SBP_0	Mean	(100;140)
10	v00005	DBP_0	Mean	(60;100)

PROPORTION_RANGE

Defines the range of expected proportions of individual values of categorical variables using the interval notation. For instance, the metadata excerpt below shows the specification of the expected sex proportions in the data.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	VALUE_LABELS	PROPORTION_RANGE
3	v00002	SEX_0	0 = females \| 1 = males	(48;52)
7	v01002	SEX_1	0 = females \| 1 = males	(48;52)

Ranges of expected proportions may differ between individual values of a categorical variable. Individual ranges can be specified using the pipe operator \(|\) for separation, for example:

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	VALUE_LABELS	PROPORTION_RANGE
11	obs_soma	OBS_SOMA_0	1 = Obs_01 \| 2 = Obs_02 \| 3 = Obs_03 \| 4 = Obs_04 \| 5 = Obs_05 \| 6 = Obs_06 \| 7 = Obs_07 \| 8 = Obs_08 \| 9 = Obs_09 \| 10 = Obs_10	1 = [0;5) \| 2 = [0;5) \| 3 = [15;30] \| 4 = [15;30] \| 5 = [15;30] \| 6 = [0;5) \| 7 = [15;30] \| 8 = [0;5) \| 9 = [15;30] \| 10 = [0;5)

CO_VARS

Defines the variables to be used as covariates in regression analyses, for example when assessing potential observer effects. For the blood pressure variables, both age and sex are taken as covariates in the metadata excerpt below.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	CO_VARS
9	v00004	SBP_0	AGE_0 \| SEX_0
10	v00005	DBP_0	AGE_0 \| SEX_0

PROCESS-RELATED METADATA

Process-related metadata provides information on the circumstances under which a measurement has been collected. Such information may be the ID of an examiner and device, the time when a measurement has been conducted or the way of data entry. Related information is an essential precondition for the computation of several data quality indicators.

DATA_ENTRY_TYPE

Some measurements may be transferred from a measurement device to the case report form. Manual data entry can be prone to typos and errors. The respective measurements should be checked for abnormalities. DATA_ENTRY_TYPE acts as an indicator for manual data transfer \((0=none, 1=yes)\).

For instance, the data entry type may be automated from a blood pressure measurement device, but it could be manual for the blood cell sedimentation rate (BSG).

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	DATA_ENTRY_TYPE
9	v00004	SBP_0	0
10	v00005	DBP_0	0
22	v00015	BSG_0	1

GROUP COLUMNS

Group columns in the metadata specify relations between different (study data) variables. For example, a link should be established between variables related to measurements (e.g., systolic blood pressure, abbreviated to SBP) and the variables that indicate which examiners conducted these examinations. This information is necessary to calculate observer effects.

There are different types of process variables, which may be assigned through group columns. These are

GROUP_VAR_OBSERVER for the examiners,
GROUP_VAR_DEVICE for devices,
TIME_VAR for variables indicating the date and time at which measurements were conducted, and
STUDY_SEGMENT to denote the study segment (e.g., examination) to which a variable belongs.

The GROUP_VAR specification is essential to enable a semi-automatic analysis of observer or device effects, and the TIME_VAR specification is required to assess effects over time. Specifying group columns is a prerequisite for powerful, automated DQ reports.

An example of the definition of such metadata can be seen below.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	GROUP_VAR_OBSERVER	GROUP_VAR_DEVICE	TIME_VAR	STUDY_SEGMENT
9	v00004	SBP_0	USR_BP_0	NA	EXAM_DT_0	PHYS_EXAM
10	v00005	DBP_0	USR_BP_0	NA	EXAM_DT_0	PHYS_EXAM
14	v00009	ARM_CIRC_0	NA	ARM_CUFF_0	EXAM_DT_0	PHYS_EXAM
16	v00010	ARM_CUFF_0	NA	NA	NA	PHYS_EXAM
18	v00012	USR_BP_0	NA	NA	NA	PHYS_EXAM
19	v00013	EXAM_DT_0	NA	NA	NA	PHYS_EXAM

REPORT DESIGN

VARIABLE_ROLE

Usually not all variables of the study data will be subject to DQ reporting. To allow for simple filtering, different roles of variables can be defined. The number of roles is not limited. In dataquieR the following roles will be defined:

intro: administrative variables, for example indicating the participation in an examination;
primary: measurement variables of major importance. For this variable, all possible data quality results are created;
secondary: measurement variables of minor importance, with a reduced set of data quality results;
process: measurements of the data generating process under which study data were obtained. For example, room temperature or the respective examiner;
suppress: a variable needed in case referred to by other variables, but for which no data quality results will be created.

Currently only primary and suppress variable roles are supported.

The metadata defining the roles of the variables may be as follows.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

	VAR_NAMES	LABEL	VARIABLE_ROLE
2	v00001	PSEUDO_ID	intro
5	v00103	AGE_GROUP_0	secondary
9	v00004	SBP_0	primary
10	v00005	DBP_0	primary
16	v00010	ARM_CUFF_0	process

VARIABLE_ORDER

In this column, the order of the variables in a data quality report can be defined. For example, this column may be as follows.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

VAR_NAMES	LABEL	VARIABLE_ORDER
v00000	CENTER_0	1
v00001	PSEUDO_ID	2
v00002	SEX_0	3
v00003	AGE_0	4

DATAFRAMES

This column is used to state which data frame contains the current variable. The data frame is indicated using a code. This code is associate to the complete name of the data frame in the dataframe_level metadata (column DF_CODE). If a variable is present in more than one data frame, this can be indicated using the pipe symbol |.

## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")

VAR_NAMES	LABEL	DATAFRAMES
v00000	CENTER_0	dfA \| dfB
v00001	PSEUDO_ID	dfA \| dfB
v00002	SEX_0	dfA
v00003	AGE_0	dfB

How `dataquieR` uses item level metadata

dataquieR employs the predefined item level metadata in two ways:

For each variable of the study data named in a function call of a DQ implementation, the respective metadata are interpreted from a data frame of metadata.
Some implementations also search for relations between variables, such as a date-time-stamp that belongs to a measurement. The section GROUP COLUMNS explains the definition of such relations.

Therefore, metadata and study data must be defined in a 1:1 correspondence, i.e., each variable of the study data is identifiable in the metadata. The key for this mapping is the variable name, listed in the column VAR_NAMES in the metadata. A necessary convention regarding variable names is their uniqueness, i.e., none should have a duplicate (also implied by the 1:1 correspondence). Further, all metadata columns are defined in upper case letters to distinguish them from the study data.

Typical study data variables are participant identifiers, measurements (e.g., systolic blood pressure), and process variables (e.g., examiner ID) (left panel in the figure below). Metadata attributes are required for all variables in the study data (bottom right panel). For example, the metadata should contain a label and a list of codes for missing values for each of the study data variables (top right panel). Lists of expected categories should be specified if applicable.

Overview of metadata usage in dataquieR (Richter et al. 2019)

dataquieR uses the following terms for data structures:

Key terms related to the data structure used in dataquieR (Schmidt et al. 2021)

Back to Metadata

Chang, W., Cheng, J., Allaire, J., Xie, Y., McPherson, J., et al. (2018). Shiny: Web application framework for r, 2015. R Package Version 1, 14.

Harris, P.A., Taylor, R., Thielke, R., Payne, J., Gonzalez, N., and Conde, J.G. (2009). Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics 42, 377–381.

Richter, A., Schössow, J., Werner, A., Schauer, B., Radke, D., Henke, J., Struckmann, S., and Schmidt, C. (2019). Data quality monitoring in clinical and observational epidemiologic studies: The role of metadata and process information. GMS Med Inform Biom Epidemiol 15.

Schmidt, C.O., Struckmann, S., Enzenbach, C., Reineke, A., Stausberg, J., Damerow, S., Huebner, M., Schmidt, B., Sauerbrei, W., and Richter, A. (2021). Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in r. BMC Medical Research Methodology 21, 1–15.

Stanley Smith, S. (1946). On the theory of scales of measurement. Science 103, 677–680.

Definition and use of item level metadata