Metadata are “data that describe other data” (Nadkarni 2011). They provide information to support the correct interpretation of study data and to guide data quality (DQ) assessments as well as statistical analyses. Examples of metadata are lists of value codes, used to examine reasons for incomplete data, and value labels, used to make reports interpretable. Some metadata are specific to certain DQ assessments, while others are used across most DQ implementations.
Metadata are commonly stored in data dictionaries (DDs). DDs frequently contain the name of a variable, its data type, and, if applicable, labels for the levels of a categorical variable (Meyer et al. 2012). DDs should be available for the study data of each research study. However, DDs often hold only a subset of the information necessary for data quality assessments. Thus, DDs need to be extended with attributes related to data quality. If this is not possible, metadata may also be stored in a spreadsheet-type format, such as data frames. dataquieR uses predefined metadata provided as data frames, as described below.
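As a rough illustration of the minimal content a DD usually holds, the following base R sketch stores a small data dictionary as a data frame. It is not taken from dataquieR: the `DATA_TYPE` column, its values, and the helper `label_of()` are assumptions made for this example only.

```r
# Hypothetical minimal data dictionary held as a data frame.
# DATA_TYPE and its values are illustrative assumptions.
data_dictionary <- data.frame(
  VAR_NAMES = c("v00004", "v00005"),
  LABEL     = c("SBP_0", "DBP_0"),
  DATA_TYPE = c("float", "float"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the label stored for a variable name
label_of <- function(var_name, dd) {
  dd$LABEL[match(var_name, dd$VAR_NAMES)]
}

label_of("v00005", data_dictionary)
```

Keeping the dictionary in a data frame means that further columns related to data quality can be added later without changing the lookup.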
The metadata schema used by dataquieR is based on a formal data quality framework for observational studies (Schmidt et al. 2021). dataquieR uses metadata that has been organized in a structured form across tables or sheets.
Each metadata table is arranged as a spreadsheet in a workbook to facilitate user input. Users can provide metadata directly in the spreadsheet or by specifying the source file for a specific item (e.g., another spreadsheet or a URL). Additionally, the tables can contain information to control the report output (e.g., the role or order of variables in the report) and the calculation of the quality indicators.
NOTE 1: In all metadata tables, the column names are written in upper-case letters to distinguish them from the column names in the study data.
NOTE 2: The names of sheets 1 to 4 are predefined. However, the names of the missing (5) and expected_IDs (6) sheets are user-defined, and they must match the names entered in the other sheets.
For instance, in the example metadata, in the item_level sheet, under MISSING_LIST_TABLE, there is a missing_table entry:
| | VAR_NAMES | LABEL | MISSING_LIST_TABLE |
|---|---|---|---|
| 9 | v00004 | SBP_0 | missing_table |
| 10 | v00005 | DBP_0 | missing_table |
| 11 | v00006 | GLOBAL_HEALTH_VAS_0 | missing_table |
This means that the name of the missing (5) sheet in that same workbook is “missing_table”.
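The relationship between the MISSING_LIST_TABLE column and the sheet names can be checked with a few lines of base R. This is only a sketch: the item_level rows are copied from the example above, whereas the vector `workbook_sheets` is a hypothetical list of sheet names assumed for illustration.

```r
# Rows copied from the item_level example above
item_level <- data.frame(
  VAR_NAMES = c("v00004", "v00005", "v00006"),
  LABEL = c("SBP_0", "DBP_0", "GLOBAL_HEALTH_VAS_0"),
  MISSING_LIST_TABLE = "missing_table",
  stringsAsFactors = FALSE
)

# Sheet names that the item_level sheet refers to
referenced_sheets <- unique(item_level$MISSING_LIST_TABLE)
referenced_sheets <- referenced_sheets[!is.na(referenced_sheets)]

# Hypothetical sheet names found in the workbook (assumption)
workbook_sheets <- c("item_level", "segment_level",
                     "dataframe_level", "missing_table")

# Referenced sheets that do not exist in the workbook (should be empty)
unresolved_sheets <- setdiff(referenced_sheets, workbook_sheets)
unresolved_sheets
```

A non-empty `unresolved_sheets` would indicate that the user-defined sheet name and the entry in MISSING_LIST_TABLE do not match.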
Similarly, the name of the expected_IDs (6) sheet depends on the ID_REF_TABLE defined in the segment_level or dataframe_level sheets:
| DF_NAME | DF_ID_REF_TABLE |
|---|---|
| study_data | expected_id |
| lab_data | expected_id |
| questionnaire_data | d:/data/questionnaire_data.xlsx \| pseudo_id |
In the example above, the sheets with the expected_IDs (6) are named “expected_id” (in the current workbook) and “pseudo_id” (which is a sheet in the questionnaire_data.xlsx workbook).
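The two forms of reference used in the example, a bare sheet name and a file followed by `|` and a sheet name, can be told apart as sketched below. The function `parse_table_ref()` is a hypothetical helper written for this illustration in base R; it is not part of dataquieR and does not claim to reproduce how the package resolves such entries.

```r
# Hypothetical helper: split a table reference into file and sheet.
# A bare name is treated as a sheet in the current workbook.
parse_table_ref <- function(ref) {
  parts <- trimws(strsplit(ref, "|", fixed = TRUE)[[1]])
  if (length(parts) == 1) {
    list(file = NA_character_, sheet = parts[1])
  } else {
    list(file = parts[1], sheet = parts[2])
  }
}

same_workbook  <- parse_table_ref("expected_id")
other_workbook <- parse_table_ref("d:/data/questionnaire_data.xlsx | pseudo_id")

other_workbook$file
other_workbook$sheet
```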
NOTE 3 (for advanced users): There is a convention for tables mentioned inside metadata columns ending with the suffix __“_TABLE”__. These tables are automatically imported with the function prep_get_data_frame() whenever a function needs them. They are also stored in the data frames storage area of dataquieR together with all the other metadata tables. For example, this happens for the tables mentioned in the item_level metadata columns STANDARDIZED_VOCABULARY_TABLE, MISSING_LIST_TABLE, and VALUE_LABEL_TABLE; the segment_level metadata column SEGMENT_ID_TABLE; and the dataframe_level metadata column DF_ID_REF_TABLE.
Use prep_list_dataframes() to see the content of this storage area. The dataquieR documentation gives more information on how to revise the content of the data frame storage area.
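A minimal sketch of working with the storage area is given below. It assumes that dataquieR is installed and that `prep_add_data_frames()`, a function not mentioned in this text, registers a named data frame in the storage area; this should be checked against the package reference for the installed version. The columns and values of `missing_table` are made up for illustration.

```r
# Made-up missing table; column names and values are assumptions
missing_table <- data.frame(
  CODE_VALUE = c(99980, 99981),
  CODE_LABEL = c("Missing - other reason", "Missing - refusal"),
  stringsAsFactors = FALSE
)

listed <- NULL  # stays NULL if dataquieR is not installed

if (requireNamespace("dataquieR", quietly = TRUE)) {
  # Assumption: registers the data frame under the name "missing_table"
  dataquieR::prep_add_data_frames(missing_table = missing_table)

  # Names of the data frames currently held in the storage area
  listed <- dataquieR::prep_list_dataframes()

  # Retrieve the table by name, as dataquieR functions do when needed
  cached <- dataquieR::prep_get_data_frame("missing_table")
}

listed
```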