Introduction

In large epidemiological studies, data may be provided in separate data frames. In this case, data frame level metadata needs to be specified to assess the data quality of the different data frames.


Data frame (DF) level metadata for data quality reporting

Currently, the following attributes can be used by dataquieR functions (with the exception of dq_report2, which is intended for a single study data frame):


DF_NAME

This column defines the names of the data frames to be assessed. The input must be a string, referring to a data frame in the data frame cache (prep_list_dataframes).

CAVEAT: dq_report2 will only find and use data frames that are in the data frame cache.

The single data frame passed to the dq_report2 function can always be referenced by entering “study_data” in this column, regardless of the actual name of that data frame.


DF_ELEMENT_COUNT

Specifies the number of expected data elements (columns) in each study data frame. The value must be an integer. The check will only be conducted if a number is entered.

As an example, the metadata for the data frame element count may contain the following information:

DF_NAME             DF_ELEMENT_COUNT
study_data          53
lab_data            6
questionnaire_data  10


DF_RECORD_COUNT

Specifies the number of expected data records (rows) in each study data frame. The value must be an integer. The check will only be conducted if a number is entered.

For instance, the metadata for the data frame record count may contain the following information:

DF_NAME             DF_RECORD_COUNT
study_data          3000
lab_data            2500
questionnaire_data  2900
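The two count checks (DF_ELEMENT_COUNT and DF_RECORD_COUNT) can be pictured with a minimal Python sketch. This is not dataquieR code: the function name check_counts and the toy data are hypothetical and only illustrate the logic described above, including the rule that a check is skipped when no number is entered.

```python
def check_counts(rows, columns, expected_records=None, expected_elements=None):
    """Compare the observed dimensions of a table with the expected counts.

    A check is skipped when its expectation is None, mirroring the rule
    that a check is only conducted if a number is entered.
    """
    findings = {}
    if expected_elements is not None:
        findings["element_count_ok"] = len(columns) == expected_elements
    if expected_records is not None:
        findings["record_count_ok"] = len(rows) == expected_records
    return findings


# Toy table (hypothetical): three columns, two rows.
columns = ["id", "sbp", "dbp"]
rows = [(1, 120, 80), (2, 135, 85)]

# Three columns are expected and found; three rows are expected but only two exist.
print(check_counts(rows, columns, expected_records=3, expected_elements=3))
```

Leaving both expectations empty returns an empty result, because neither check is run.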


DF_ID_REF_TABLE

The name of the table containing the reference IDs to be compared with the IDs in the targeted data frame. The input must be a string and can refer to a sheet in the same workbook, a sheet in another workbook, or a URL.

In the example below, for the data frames study_data and lab_data, the IDs are specified in the sheet called expected_id of the same workbook. In contrast, the IDs for questionnaire_data are provided in the pseudo_id sheet of the questionnaire_data.xlsx workbook. Since this is a different workbook, its path must be specified, followed by a pipe character (|) and the sheet name.

DF_NAME             DF_ID_REF_TABLE
study_data          expected_id
lab_data            expected_id
questionnaire_data  d:/data/questionnaire_data.xlsx | pseudo_id
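To make the notation in the example concrete, the following Python sketch splits a DF_ID_REF_TABLE entry into a workbook part and a sheet part. It is an illustration only: the helper parse_ref_table is hypothetical, and how dataquieR itself resolves such references is not described here.

```python
def parse_ref_table(entry):
    """Split 'workbook | sheet' into (workbook, sheet).

    An entry without a pipe is read as a sheet of the current workbook,
    so the workbook part is returned as None.
    """
    parts = [part.strip() for part in entry.split("|")]
    if len(parts) == 1:
        return None, parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"unexpected reference: {entry!r}")


for entry in ["expected_id", "d:/data/questionnaire_data.xlsx | pseudo_id"]:
    print(parse_ref_table(entry))
```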


DF_RECORD_CHECK

A string that sets the type of check to be conducted when comparing the reference ID table with the IDs in a data frame. Two assessments are possible:

  • exact: tests for an exact match between DF_ID_REF_TABLE and the IDs in DF_NAME, or
  • subset: expects that the IDs in DF_NAME are a subset of DF_ID_REF_TABLE.

For instance, the study_data may comprise all participants from a study, while particular sections, such as lab_data or questionnaire_data, may have only been collected from a smaller participant sample:

DF_NAME             DF_RECORD_CHECK
study_data          exact
lab_data            subset
questionnaire_data  subset
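The difference between the two check types can be expressed with set operations. The Python sketch below is a simplified illustration, not the dataquieR implementation; the function name check_ids and the example IDs are hypothetical.

```python
def check_ids(observed_ids, reference_ids, mode):
    """Compare observed IDs with reference IDs under 'exact' or 'subset'."""
    observed, reference = set(observed_ids), set(reference_ids)
    unexpected = observed - reference  # IDs in the data but not in the reference
    missing = reference - observed     # reference IDs absent from the data
    if mode == "exact":
        ok = not unexpected and not missing
    elif mode == "subset":
        ok = not unexpected            # missing IDs are tolerated
    else:
        raise ValueError(f"unknown check type: {mode!r}")
    return {"ok": ok, "unexpected": sorted(unexpected), "missing": sorted(missing)}


reference = ["p1", "p2", "p3"]
lab_ids = ["p1", "p3"]  # collected from a smaller participant sample

print(check_ids(lab_ids, reference, "subset"))
print(check_ids(lab_ids, reference, "exact"))
```

With these toy IDs the subset check passes, while the exact check fails because "p2" is missing from the data.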


DF_UNIQUE_ID

Defines expectations regarding the uniqueness of the IDs across the rows of a data frame, that is, the number of times an ID may occur. The input must be an integer defining the number of permissible repetitions (e.g., 1 equals uniqueness or no repetitions). Enter “-1” if the number of repetitions is unknown.

In many cases, we would not expect IDs to appear more than once, for example, if study_data contains the information on each study participant only once. In other cases, values may be measured multiple times; for instance, in lab_data three values may be measured per participant. Lastly, it may not be known whether repetitions are expected in the data, as in questionnaire_data, which is indicated by “-1”.

DF_NAME             DF_UNIQUE_ID
study_data          1
lab_data            3
questionnaire_data  -1
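A repetition check of this kind can be sketched in Python by counting how often each ID occurs. The sketch assumes that the DF_UNIQUE_ID value acts as an upper bound on the occurrences of an ID; whether dataquieR treats it as an upper bound or as an exact count is not stated above. The function name ids_exceeding is hypothetical.

```python
from collections import Counter


def ids_exceeding(ids, max_repetitions):
    """Return the IDs that occur more often than permitted, with their counts.

    Assumption: max_repetitions is an upper bound. A value of -1 means the
    number of repetitions is unknown, so nothing is flagged.
    """
    if max_repetitions == -1:
        return {}
    counts = Counter(ids)
    return {key: n for key, n in counts.items() if n > max_repetitions}


lab_ids = ["p1", "p1", "p1", "p2", "p2", "p2", "p2"]

print(ids_exceeding(lab_ids, 3))   # "p2" occurs four times
print(ids_exceeding(lab_ids, -1))  # check skipped
```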


DF_ID_VARS

Defines all variables to be used as one single ID variable (a combined key) in a data frame. The list of variables must be a string in which each variable is separated by a pipe character (|).

For example, the ID for study_data is specified in the variable “v00001”, while for lab_data it is “PSEUDO_ID”. In some situations, the ID may be defined by a combined key specified by a list of variables, as in questionnaire_data, where the key consists of the “id” and “exdate” variables.

DF_NAME             DF_ID_VARS
study_data          v00001
lab_data            PSEUDO_ID
questionnaire_data  id | exdate
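The idea of a combined key can be illustrated in Python by splitting the DF_ID_VARS string at the pipe character and joining the named values of each record into one tuple. This is a sketch under the assumption that records are held as dictionaries; the helper build_keys and the example records are hypothetical.

```python
def build_keys(records, id_vars):
    """Build one combined key per record from a pipe-separated variable list."""
    names = [name.strip() for name in id_vars.split("|")]
    return [tuple(record[name] for name in names) for record in records]


questionnaire = [
    {"id": "p1", "exdate": "2020-01-15", "score": 7},
    {"id": "p1", "exdate": "2021-02-03", "score": 9},
]

# The same participant appears twice, but the combined keys differ.
print(build_keys(questionnaire, "id | exdate"))
```

Used with a single variable name, the same helper yields one-element keys, so both cases can be handled uniformly.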


DF_UNIQUE_ROWS

Specifies whether identical data is permitted across rows in a data frame (excluding ID variables). The input is a boolean, meaning:

  • false: allow repeated rows, or
  • true: rows must be unique.

For instance, row repetitions may be allowed for lab_data but not for study_data and questionnaire_data.

DF_NAME             DF_UNIQUE_ROWS
study_data          true
lab_data            false
questionnaire_data  true
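A row uniqueness check that ignores the ID variables might look like the following Python sketch. It is an illustration of the principle only, not dataquieR code; the function name duplicated_rows and the example records are hypothetical, and records are assumed to be dictionaries with hashable values.

```python
def duplicated_rows(records, id_vars):
    """Return the positions of rows whose non-ID content repeats an earlier row."""
    seen = set()
    duplicates = []
    for index, record in enumerate(records):
        # Row content without the ID variables, in a stable column order.
        content = tuple(sorted(
            (name, value) for name, value in record.items() if name not in id_vars
        ))
        if content in seen:
            duplicates.append(index)
        else:
            seen.add(content)
    return duplicates


study = [
    {"v00001": "p1", "sbp": 120, "dbp": 80},
    {"v00001": "p2", "sbp": 120, "dbp": 80},  # same content, different ID
    {"v00001": "p3", "sbp": 135, "dbp": 85},
]

print(duplicated_rows(study, {"v00001"}))
```

For a data frame with DF_UNIQUE_ROWS set to true, a non-empty result would indicate a violation; for false, repeated rows would simply be accepted.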