Introduction

Metadata is commonly defined as “data that describe other data” (Nadkarni 2011). Metadata provides the information needed to interpret study data correctly and to guide data quality (DQ) assessments as well as statistical analyses. Examples of metadata are lists of value codes used to examine reasons for incomplete data, and value labels that make reports interpretable. Some metadata is specific to certain DQ assessments, while other metadata is used across most DQ implementations.


Storage of metadata

Metadata is commonly stored in data dictionaries (DDs). DDs frequently contain the name of a variable, its data type, and, if applicable, labels for the levels of a categorical variable (Meyer et al. 2012). A DD should be available for the study data of each research study. However, DDs often hold only a subset of the information necessary for data quality assessments and therefore need to be extended with information related to data quality. If this is not possible, metadata may also be stored in a spreadsheet-type format, such as data frames. dataquieR uses predefined metadata provided as data frames, as described below.


How dataquieR uses metadata

The metadata schema used by dataquieR is based on a formal data quality framework for observational studies (Schmidt et al. 2021). dataquieR uses metadata that has been organized in a structured form across tables or sheets:

  1. item_level: descriptions and expectations about single data elements (variables/items), e.g., columns in the study data table.
  2. cross-item_level: descriptions and expectations about the joint use of two or more data elements (variables/items) for data quality assessments.
  3. segment_level: descriptions and expectations about the provided segments (e.g., different study examinations).
  4. dataframe_level: descriptions and expectations about entire data frames.
  5. missings: defines missing and jump assignments per variable.
  6. expected_IDs: specifies reference tables for participant IDs at the segment and data frame levels.

Each metadata table is arranged as a spreadsheet in a workbook to facilitate user input. Users can provide metadata directly in the spreadsheet or by specifying the source file for a specific item (e.g., another spreadsheet or a URL). Additionally, the tables can contain information to control the report output (e.g., the role or order of variables in the report) and the calculation of the quality indicators.

NOTE 1: In all metadata tables, the column names are written in uppercase letters to distinguish them from the column names in the study data.

NOTE 2: The names of sheets 1 to 4 are predefined. However, the names of the missings (5) and expected_IDs (6) sheets are user-defined, as they must correspond to the entries in the other sheets.

For instance, in the item_level sheet of the example metadata, the MISSING_LIST_TABLE column contains the entry missing_table:

   VAR_NAMES LABEL               MISSING_LIST_TABLE
9  v00004    SBP_0               missing_table
10 v00005    DBP_0               missing_table
11 v00006    GLOBAL_HEALTH_VAS_0 missing_table


This means that the name of the missings (5) sheet in that same workbook is “missing_table”.

Similarly, the name of the expected_IDs (6) sheet depends on the ID_REF_TABLE defined in the segment_level or dataframe_level sheets:

DF_NAME            DF_ID_REF_TABLE
study_data         expected_id
lab_data           expected_id
questionnaire_data d:/data/questionnaire_data.xlsx | pseudo_id


In the example above, the sheets with the expected_IDs (6) are named “expected_id” (in the current workbook) and “pseudo_id” (a sheet in the questionnaire_data.xlsx workbook).
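The “file | sheet” notation in the table above can be illustrated with a small sketch. It is written in plain Python for illustration only and is not dataquieR code: the helper parse_table_reference and the placeholder "<current workbook>" are hypothetical names, and the sketch assumes that an entry without “|” refers to a sheet in the current workbook, as described in the text. dataquieR resolves these references internally and may handle more cases.

```python
def parse_table_reference(ref, default_workbook="<current workbook>"):
    """Split a reference such as 'file.xlsx | sheet' into (workbook, sheet).

    Hypothetical helper: entries without '|' are assumed to name a sheet
    in the current workbook.
    """
    if "|" in ref:
        workbook, sheet = (part.strip() for part in ref.split("|", 1))
        return workbook, sheet
    return default_workbook, ref.strip()


# The dataframe_level rows from the example above, as plain dictionaries.
dataframe_level = [
    {"DF_NAME": "study_data", "DF_ID_REF_TABLE": "expected_id"},
    {"DF_NAME": "lab_data", "DF_ID_REF_TABLE": "expected_id"},
    {"DF_NAME": "questionnaire_data",
     "DF_ID_REF_TABLE": "d:/data/questionnaire_data.xlsx | pseudo_id"},
]

# Map each data frame name to the (workbook, sheet) pair of its ID reference table.
resolved = {
    row["DF_NAME"]: parse_table_reference(row["DF_ID_REF_TABLE"])
    for row in dataframe_level
}

for name, (workbook, sheet) in resolved.items():
    print(f"{name}: sheet {sheet!r} in {workbook}")
```

Splitting only at the first “|” keeps the sheet name intact even if it should contain that character itself.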


NOTE 3 (for advanced users): By convention, tables referenced in metadata columns whose names end with the suffix “_TABLE” are imported automatically with the function prep_get_data_frame() whenever a function needs them. They are also stored in the data frame storage area of dataquieR together with all other metadata tables. This applies, for example, to the tables referenced in the item_level metadata columns STANDARDIZED_VOCABULARY_TABLE, MISSING_LIST_TABLE, and VALUE_LABEL_TABLE; in the segment_level metadata column SEGMENT_ID_TABLE; and in the dataframe_level metadata column DF_ID_REF_TABLE.

Use prep_list_dataframes() to see the content of this storage area. The dataquieR documentation provides more information on how to revise the content of the data frame storage area.
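The suffix convention from NOTE 3 can be sketched as a scan over all metadata tables for columns ending in “_TABLE”. This is a plain Python illustration, not dataquieR's implementation: the function referenced_tables and the dictionary layout of the metadata are assumptions made for this example, and the rows are a shortened copy of the example tables shown earlier.

```python
def referenced_tables(metadata_tables):
    """Collect the tables referenced in columns whose names end with '_TABLE'.

    Hypothetical helper: returns {(metadata table, column): sorted unique values},
    skipping empty cells.
    """
    found = {}
    for table_name, rows in metadata_tables.items():
        for row in rows:
            for column, value in row.items():
                if column.endswith("_TABLE") and value:
                    found.setdefault((table_name, column), set()).add(value)
    return {key: sorted(values) for key, values in found.items()}


# Shortened example metadata, one list of row dictionaries per sheet.
metadata_tables = {
    "item_level": [
        {"VAR_NAMES": "v00004", "LABEL": "SBP_0", "MISSING_LIST_TABLE": "missing_table"},
        {"VAR_NAMES": "v00005", "LABEL": "DBP_0", "MISSING_LIST_TABLE": "missing_table"},
    ],
    "dataframe_level": [
        {"DF_NAME": "study_data", "DF_ID_REF_TABLE": "expected_id"},
        {"DF_NAME": "lab_data", "DF_ID_REF_TABLE": "expected_id"},
    ],
}

tables = referenced_tables(metadata_tables)
for (table_name, column), names in tables.items():
    print(f"{table_name}.{column}: {', '.join(names)}")
```

Each distinct name found this way corresponds to a table that would have to be available, either as a sheet in the workbook or as an external file, before the metadata can be used.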


Meyer, J., Ostrzinski, S., Fredrich, D., Havemann, C., Krafczyk, J., and Hoffmann, W. (2012). Efficient data management in a large-scale epidemiology research project. Computer Methods and Programs in Biomedicine 107, 425–435.
Nadkarni, P.M. (2011). Metadata-driven software systems in biomedicine: Designing systems that can adapt to changing knowledge (Springer London).
Schmidt, C.O., Struckmann, S., Enzenbach, C., Reineke, A., Stausberg, J., Damerow, S., Huebner, M., Schmidt, B., Sauerbrei, W., and Richter, A. (2021). Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Medical Research Methodology 21, 1–15.