Metadata is commonly defined as “data that describe other data” (Nadkarni 2011). Metadata provides the information needed to interpret study data correctly and guides data quality (DQ) assessments as well as statistical analyses. Examples include lists of value codes, which help to examine the reasons for incomplete data, and value labels, which make reports interpretable. Some metadata is specific to certain DQ assessments, while other metadata is used across most DQ implementations.

Storage of metadata

Metadata is commonly stored in data dictionaries (DDs). A DD frequently contains the name of each variable, its data type, and, if applicable, labels for the levels of a categorical variable (Meyer et al. 2012). A DD should be available for the study data of each research study. However, DDs often hold only a subset of the information necessary for data quality assessments and therefore need to be extended with attributes related to data quality. If this is not possible, metadata may also be stored in a spreadsheet-type format, such as data frames. dataquieR uses predefined metadata provided as data frames, as described below.
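As an illustration, a minimal data dictionary can be held as a list of records, one per variable. The sketch below is plain Python rather than dataquieR code; the field names (name, data_type, value_labels) and the two example variables are hypothetical and only mirror the elements named above.

```python
# Hypothetical data dictionary: one record per study variable.
# Field names and variables are illustrative assumptions, not dataquieR columns.
data_dictionary = [
    {"name": "sex", "data_type": "integer", "value_labels": {0: "female", 1: "male"}},
    {"name": "sbp", "data_type": "float", "value_labels": None},
]


def label_value(dd, variable, value):
    """Return the label for a coded value, or the value itself if it has no label."""
    for entry in dd:
        if entry["name"] == variable:
            labels = entry["value_labels"] or {}
            return labels.get(value, value)
    raise KeyError(f"variable not in data dictionary: {variable}")


labeled_sex = label_value(data_dictionary, "sex", 1)      # categorical: code is translated
labeled_sbp = label_value(data_dictionary, "sbp", 127.5)  # numeric: value passes through
```

Looking values up through the dictionary, rather than hard-coding labels in the analysis, keeps the interpretation of codes in one place.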

How dataquieR uses metadata

The metadata schema used by dataquieR is based on a formal data quality framework for observational studies (Schmidt et al. 2021). dataquieR uses metadata that has been organized in a structured form across tables or sheets:

  1. item_level: descriptions and expectations about single data elements (variables/items), e.g. columns in the study data table.
  2. cross-item_level: descriptions and expectations about the joint use of two or more data elements (variables/items) for data quality assessments.
  3. segment_level: descriptions and expectations about the provided segments (e.g., different study examinations).
  4. dataframe_level: descriptions and expectations about entire data frames.
  5. missings: definitions of the missing and jump codes per variable.
  6. expected_IDs: specifies reference tables for participant IDs at the segment and data frame levels.
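The layout listed above can be pictured as a workbook that maps sheet names to tables. The following is a minimal sketch in plain Python, not dataquieR's own data structure: the column names VAR_NAMES and LABEL, the two extra sheet names, and the helper function are assumptions made for illustration.

```python
# Names of sheets 1 to 4 as listed above.
PREDEFINED_SHEETS = ["item_level", "cross-item_level", "segment_level", "dataframe_level"]

# Toy workbook: sheet name -> list of rows (each row is a dict keyed by column name).
# Column names other than MISSING_LIST_TABLE are assumed for this example.
workbook = {
    "item_level": [
        {"VAR_NAMES": "v00004", "LABEL": "SBP_0", "MISSING_LIST_TABLE": "missing_table"},
    ],
    "cross-item_level": [],
    "segment_level": [],
    "dataframe_level": [],
    "missing_table": [],  # sheet 5, under a name chosen for this example
    "expected_id": [],    # sheet 6, under a name chosen for this example
}


def absent_predefined_sheets(wb):
    """Return the predefined sheet names that are not present in the workbook."""
    return [name for name in PREDEFINED_SHEETS if name not in wb]


absent = absent_predefined_sheets(workbook)
```

An empty result indicates that all four predefined sheets are present.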

Each metadata table is arranged as a sheet in a workbook to facilitate user input. Users can enter metadata directly in a sheet or specify a source file for a specific item (e.g., another spreadsheet or a URL). Additionally, the tables can contain information that controls the report output (e.g., the role or order of variables in the report) and the calculation of the quality indicators.

NOTE 1: In all metadata tables, column names are written in uppercase letters to distinguish them from the column names in the study data.

NOTE 2: The names of sheets 1 to 4 are predefined. The names of the missings (5) and expected_IDs (6) sheets, however, are user-defined, because they must correspond to the entries made in the other sheets.

For instance, in the example metadata, the MISSING_LIST_TABLE column of the item_level sheet contains the entry missing_table:

    VAR_NAMES  LABEL                MISSING_LIST_TABLE
9   v00004     SBP_0                missing_table
10  v00005     DBP_0                missing_table
11  v00006     GLOBAL_HEALTH_VAS_0  missing_table

This means that the missings (5) sheet in the same workbook is named “missing_table”.
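The link between the item_level entries and the sheet name can be checked mechanically. The sketch below uses plain Python and assumes that the item_level rows and the set of sheet names have already been read from the workbook; the column names VAR_NAMES and LABEL, the sheet_names set, and the helper function are hypothetical.

```python
# item_level rows from the example above (column names partly assumed).
item_level = [
    {"VAR_NAMES": "v00004", "LABEL": "SBP_0", "MISSING_LIST_TABLE": "missing_table"},
    {"VAR_NAMES": "v00005", "LABEL": "DBP_0", "MISSING_LIST_TABLE": "missing_table"},
    {"VAR_NAMES": "v00006", "LABEL": "GLOBAL_HEALTH_VAS_0", "MISSING_LIST_TABLE": "missing_table"},
]

# Sheet names found in the workbook (assumed for this example).
sheet_names = {
    "item_level", "cross-item_level", "segment_level", "dataframe_level",
    "missing_table", "expected_id",
}


def referenced_missing_tables(rows):
    """Collect the distinct, non-empty MISSING_LIST_TABLE entries, sorted by name."""
    return sorted({row["MISSING_LIST_TABLE"] for row in rows if row.get("MISSING_LIST_TABLE")})


referenced = referenced_missing_tables(item_level)
# Entries that do not match any sheet in the workbook.
unresolved = [name for name in referenced if name not in sheet_names]
```

Any name left in unresolved would point to a sheet that does not exist in this workbook, for example because of a typo in the item_level sheet.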

Similarly, the name of the expected_IDs (6) sheet depends on the ID_REF_TABLE defined in the segment_level or dataframe_level sheets:

study_data          expected_id
lab_data            expected_id
questionnaire_data  d:/data/questionnaire_data.xlsx | pseudo_id

In the example above, the expected_IDs (6) sheets are named “expected_id” (in the current workbook) and “pseudo_id” (a sheet in the questionnaire_data.xlsx workbook).
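The two kinds of reference in this example, a bare sheet name and a "file | sheet" pair, can be told apart with a small parser. This is a minimal sketch in plain Python; the separator handling is inferred from the example only, the function name is hypothetical, and dataquieR's own resolution logic may differ.

```python
def parse_table_reference(reference):
    """Split a table reference into (workbook, sheet).

    A reference of the form 'file | sheet' yields both parts; a bare sheet name
    yields (None, sheet), meaning the sheet is in the current workbook.
    """
    if "|" in reference:
        workbook, sheet = (part.strip() for part in reference.split("|", 1))
        return workbook, sheet
    return None, reference.strip()


local_ref = parse_table_reference("expected_id")
external_ref = parse_table_reference("d:/data/questionnaire_data.xlsx | pseudo_id")
```

Here local_ref carries no workbook, so the sheet is looked up in the current workbook, while external_ref names both the external file and the sheet within it.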


Meyer, J., Ostrzinski, S., Fredrich, D., Havemann, C., Krafczyk, J., and Hoffmann, W. (2012). Efficient data management in a large-scale epidemiology research project. Computer Methods and Programs in Biomedicine 107, 425–435.
Nadkarni, P.M. (2011). Metadata-driven software systems in biomedicine: Designing systems that can adapt to changing knowledge (Springer Science & Business Media).
Schmidt, C.O., Struckmann, S., Enzenbach, C., Reineke, A., Stausberg, J., Damerow, S., Huebner, M., Schmidt, B., Sauerbrei, W., and Richter, A. (2021). Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Medical Research Methodology 21, 1–15.