In large epidemiological studies, data may be provided in separate data frames. In this case, data frame level metadata needs to be specified to assess the data quality of the different data frames.
Currently, the following attributes can be used by dataquieR functions (with the exception of `dq_report2`, which is intended for a single study data frame):
`DF_NAME` defines the names of the data frames to be assessed. The input must be a string referring to a data frame in the data frame cache (`prep_list_dataframes`).
Caveat: `dq_report2` will only find and use the data frames that are in the data frame cache.
It is always possible to refer to the single current data frame passed to the `dq_report2` function by entering “study_data” in this column, independently of the actual name of the data frame.
`DF_ELEMENT_COUNT` specifies the number of expected data elements (columns) in each study data frame. The value must be an integer. The check is only conducted if a number is entered.
As an example, the metadata for the data frame element count may contain the following information:
DF_NAME | DF_ELEMENT_COUNT |
---|---|
study_data | 53 |
lab_data | 6 |
questionnaire_data | 10 |
`DF_RECORD_COUNT` specifies the number of expected data records (rows) in each study data frame. The value must be an integer. The check is only conducted if a number is entered.
For instance, the metadata for the data frame record count may be:
DF_NAME | DF_RECORD_COUNT |
---|---|
study_data | 3000 |
lab_data | 2500 |
questionnaire_data | 2900 |
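
Both count checks amount to comparing the shape of a table with the expected numbers. The following Python sketch illustrates the idea only; it is not dataquieR code, and the helper `check_counts` and the toy `lab_data` table are hypothetical. It assumes that a missing expected number means the check is skipped, as described above.

```python
# Illustrative sketch only; check_counts is a hypothetical helper,
# not a dataquieR function.
def check_counts(data_frame, expected_columns=None, expected_rows=None):
    """Compare a table's shape with the expected counts.

    data_frame is a list of row dicts that share the same keys.
    A check is skipped (reported as None) when no number is given.
    """
    n_rows = len(data_frame)
    n_cols = len(data_frame[0]) if data_frame else 0
    return {
        "DF_ELEMENT_COUNT": (
            None if expected_columns is None else n_cols == expected_columns
        ),
        "DF_RECORD_COUNT": (
            None if expected_rows is None else n_rows == expected_rows
        ),
    }


# Toy table with 2 columns and 4 rows (made-up values).
lab_data = [{"PSEUDO_ID": i, "value": i * 0.5} for i in range(4)]
result = check_counts(lab_data, expected_columns=2, expected_rows=5)
print(result)
```

Here the column count matches, while the row count (4 instead of the expected 5) does not.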
`DF_ID_REF_TABLE` gives the name of the table containing the reference IDs to be compared with the IDs in the targeted data frame. The input must be a string and can refer to a spreadsheet in the same or another workbook, or to a URL.
In the example below, for the data frames `study_data` and `lab_data`, the IDs are specified in the sheet called `expected_id` of the same workbook. In contrast, the IDs for `questionnaire_data` are provided in the `pseudo_id` sheet of the `questionnaire_data.xlsx` workbook. Since this is a different workbook, its path must be specified.
DF_NAME | DF_ID_REF_TABLE |
---|---|
study_data | expected_id |
lab_data | expected_id |
questionnaire_data | d:/data/questionnaire_data.xlsx \| pseudo_id |
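
To make the two forms of reference explicit, the sketch below splits such an entry into a workbook part and a sheet part. It is written in Python for illustration and does not reflect how dataquieR resolves these strings internally; the function `parse_ref_table` is hypothetical, and it assumes that an entry without a pipe character names a sheet in the current workbook.

```python
# Illustrative sketch only; parse_ref_table is a hypothetical helper.
def parse_ref_table(entry):
    """Split a DF_ID_REF_TABLE entry into (workbook, sheet).

    Assumption: an entry without a pipe names a sheet in the current
    workbook, which is reported here as workbook None.
    """
    parts = [part.strip() for part in entry.split("|")]
    if len(parts) == 1:
        return None, parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"unexpected DF_ID_REF_TABLE entry: {entry!r}")


same_workbook = parse_ref_table("expected_id")
other_workbook = parse_ref_table("d:/data/questionnaire_data.xlsx | pseudo_id")
print(same_workbook)
print(other_workbook)
```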
`DF_RECORD_CHECK` is a string that sets the type of check to be conducted when comparing the reference ID table with the IDs in a data frame. Two assessments are possible:
- `exact`: the IDs in `DF_ID_REF_TABLE` and the IDs in `DF_NAME` must match exactly, or
- `subset`: the IDs in `DF_NAME` are a subset of `DF_ID_REF_TABLE`.

For instance, `study_data` may comprise all participants from a study, while particular sections, such as `lab_data` or `questionnaire_data`, may have only been collected from a smaller participant sample:
DF_NAME | DF_RECORD_CHECK |
---|---|
study_data | exact |
lab_data | subset |
questionnaire_data | subset |
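
The difference between the two check types can be expressed with set comparisons. The Python sketch below is a minimal illustration under that reading, not dataquieR's implementation; the function `check_ids` and the example IDs are hypothetical.

```python
# Illustrative sketch only; check_ids is a hypothetical helper.
def check_ids(data_ids, reference_ids, mode):
    """Compare the IDs of a data frame with the reference IDs.

    mode "exact":  both ID sets must be identical.
    mode "subset": every data ID must occur in the reference IDs.
    """
    data_ids, reference_ids = set(data_ids), set(reference_ids)
    if mode == "exact":
        return data_ids == reference_ids
    if mode == "subset":
        return data_ids <= reference_ids
    raise ValueError(f"unknown check type: {mode!r}")


expected = ["p1", "p2", "p3"]       # made-up reference IDs
lab_ids = ["p1", "p3"]              # only some participants have lab data
print(check_ids(lab_ids, expected, "subset"))
print(check_ids(lab_ids, expected, "exact"))
```

With these made-up IDs, the `subset` check passes, while the `exact` check fails because `p2` has no lab record.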
`DF_UNIQUE_ID` defines expectations about the uniqueness of IDs across the rows of a data frame, that is, how many times an ID may occur. The input must be an integer giving the permissible number of occurrences per ID (e.g., 1 means that each ID is unique, with no repetitions). Enter “-1” if the number of repetitions is unknown.
In many cases, we would not expect IDs to appear more than once, for example, if `study_data` contains information on each study participant only once. However, in other cases values may be measured multiple times; for instance, in `lab_data` three values may be measured per participant. Lastly, it may not be known whether repetitions are expected in the data, as in `questionnaire_data`, which is indicated with “-1”.
DF_NAME | DF_UNIQUE_ID |
---|---|
study_data | 1 |
lab_data | 3 |
questionnaire_data | -1 |
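
One way to picture this check is to count how often each ID occurs and to flag IDs that occur more often than permitted. The Python sketch below assumes that the integer acts as an upper bound and that “-1” disables the check; both are assumptions made for illustration, and `ids_exceeding` is a hypothetical helper rather than a dataquieR function.

```python
from collections import Counter


# Illustrative sketch only; ids_exceeding is a hypothetical helper.
def ids_exceeding(ids, max_occurrences):
    """Return {id: count} for IDs occurring more often than permitted.

    Assumptions: max_occurrences is an upper bound, and -1 means the
    number of repetitions is unknown, so nothing is flagged.
    """
    if max_occurrences == -1:
        return {}
    counts = Counter(ids)
    return {i: n for i, n in counts.items() if n > max_occurrences}


lab_ids = ["p1", "p1", "p1", "p2", "p2", "p2", "p2"]  # made-up IDs
print(ids_exceeding(lab_ids, 3))
print(ids_exceeding(lab_ids, -1))
```

In this toy example, `p2` occurs four times and is therefore flagged when three occurrences are permitted.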
`DF_ID_VARS` defines all variables to be used as one single ID variable (a combined key) in a data frame. The list of variables must be a string in which the variables are separated by a pipe character (|).
For example, the ID for `study_data` is specified in the variable “v00001”, while for `lab_data` it is `PSEUDO_ID`. In some situations, the ID may be defined by a combined key specified by a list of variables, as in `questionnaire_data`, where the key consists of the “id” and “exdate” variables.
DF_NAME | DF_ID_VARS |
---|---|
study_data | v00001 |
lab_data | PSEUDO_ID |
questionnaire_data | id \| exdate |
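
A combined key can be thought of as a tuple built from the listed variables. The following Python sketch illustrates this under the stated pipe-separated format; `combined_key` and the example row are hypothetical and not part of dataquieR.

```python
# Illustrative sketch only; combined_key is a hypothetical helper.
def combined_key(row, df_id_vars):
    """Build the ID of a row from a pipe-separated list of variables."""
    names = [name.strip() for name in df_id_vars.split("|")]
    return tuple(row[name] for name in names)


# Made-up questionnaire record.
row = {"id": "p7", "exdate": "2020-03-01", "answer": 4}
print(combined_key(row, "id | exdate"))
print(combined_key(row, "id"))
```

Two rows then refer to the same ID only if all listed variables agree.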
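`DF_UNIQUE_ROWS` specifies whether identical data is permitted across rows in a data frame (excluding ID variables). The input is a boolean, meaning:
- true: rows must be unique, so identical data across rows is not permitted
- false: identical data across rows is permitted

For instance, row repetitions may be allowed for `lab_data` but not for `study_data` and `questionnaire_data`.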
DF_NAME | DF_UNIQUE_ROWS |
---|---|
study_data | true |
lab_data | false |
questionnaire_data | true |
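
The idea of comparing rows while ignoring the ID variables can be sketched as follows. This Python example is an illustration only and not dataquieR's implementation; `duplicated_rows` and the toy records are hypothetical, and cell values are assumed to be hashable.

```python
from collections import Counter


# Illustrative sketch only; duplicated_rows is a hypothetical helper.
def duplicated_rows(rows, id_vars):
    """Return the rows whose non-ID content occurs more than once."""

    def content(row):
        # Ignore the ID variables and sort by column name so that the
        # column order does not matter.
        return tuple(
            sorted((k, v) for k, v in row.items() if k not in id_vars)
        )

    counts = Counter(content(row) for row in rows)
    return [row for row in rows if counts[content(row)] > 1]


# Made-up records: the first two differ only in the ID variable.
study_data = [
    {"v00001": "p1", "age": 40, "sex": "f"},
    {"v00001": "p2", "age": 40, "sex": "f"},
    {"v00001": "p3", "age": 51, "sex": "m"},
]
duplicates = duplicated_rows(study_data, id_vars={"v00001"})
print([row["v00001"] for row in duplicates])
```

If `DF_UNIQUE_ROWS` is true for this data frame, the two rows with identical content would be reported; if it is false, they would be accepted.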