Introduction

In an epidemiological study, data may be grouped by examination, such as laboratory, blood pressure, or ultrasound measurements. Each such group is a study segment, and the metadata that describe individual segments are termed segment level metadata.


How dataquieR uses segment level metadata

To analyze data quality at the segment level, the item level metadata must specify, in the column labelled STUDY_SEGMENT, which segment each variable belongs to.
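
The mapping from variables to segments can be pictured with a small sketch in plain Python. This is not dataquieR code: the column name VAR_NAMES, the variable names other than v00001, and the helper variables_by_segment are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical item level metadata: one row per variable.
# VAR_NAMES and most variable names are assumptions for illustration only.
item_level = [
    {"VAR_NAMES": "v00001", "STUDY_SEGMENT": "STUDY"},
    {"VAR_NAMES": "sbp_0", "STUDY_SEGMENT": "PHYS_EXAM"},
    {"VAR_NAMES": "dbp_0", "STUDY_SEGMENT": "PHYS_EXAM"},
    {"VAR_NAMES": "crp_0", "STUDY_SEGMENT": "LAB"},
]


def variables_by_segment(rows):
    """Group variable names by the segment named in STUDY_SEGMENT."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["STUDY_SEGMENT"]].append(row["VAR_NAMES"])
    return dict(groups)


segments = variables_by_segment(item_level)
```

Each key of segments is then one study segment, to which the segment level settings described in the following sections would apply.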


Segment level metadata for data quality reporting


STUDY_SEGMENT

This column contains the name of the study segment (a string) assigned to each variable.


SEGMENT_RECORD_COUNT

Specifies the number of expected data records in each study segment. The value must be an integer. The check will only be conducted if a number is entered.

For example, the segment level count metadata may be:

STUDY_SEGMENT  SEGMENT_RECORD_COUNT
STUDY          3000
PHYS_EXAM      2000
LAB            1990
INTERVIEW      3000
QUESTIONNAIRE  2981
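
The logic of such a count check can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the way dataquieR implements it: the function check_record_count is hypothetical, and a segment without an entered number is represented here by a missing dictionary entry.

```python
# Expected record counts per segment, taken from the example table above.
expected_counts = {
    "STUDY": 3000,
    "PHYS_EXAM": 2000,
    "LAB": 1990,
    "INTERVIEW": 3000,
    "QUESTIONNAIRE": 2981,
}


def check_record_count(segment, observed, expected=expected_counts):
    """Compare an observed record count with the expected one.

    Returns None when no count was entered for the segment (no check),
    otherwise True for a match and False for a mismatch.
    """
    target = expected.get(segment)
    if target is None:
        return None  # nothing entered, so the check is skipped
    if isinstance(target, bool) or not isinstance(target, int):
        raise ValueError(f"expected count for {segment!r} must be an integer")
    return observed == target
```

For instance, check_record_count("LAB", 1985) would flag a mismatch, whereas a segment that is not listed would simply be skipped.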


SEGMENT_ID_TABLE

The name of the table containing the reference IDs to be compared with the IDs in the targeted segment. The input must be a string and can refer to a spreadsheet in the same or another workbook, or to a URL.

In the example below, the IDs for the first four segments are specified in the sheet called expected_id of the same workbook. In contrast, the IDs for QUESTIONNAIRE are provided in the pseudo_id sheet of the questionnaire_data.xlsx workbook. Since this is a different workbook, its path must be specified, separated from the sheet name by a pipe character (|).

STUDY_SEGMENT  SEGMENT_ID_TABLE
STUDY          expected_id
PHYS_EXAM      expected_id
LAB            expected_id
INTERVIEW      expected_id
QUESTIONNAIRE  d:/data/questionnaire_data.xlsx | pseudo_id
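
How such a reference string could be taken apart is sketched below in plain Python. The helper parse_table_reference is hypothetical and only mirrors the two forms shown in the table (a sheet name alone, or a workbook path and a sheet name separated by a pipe); it does not reproduce how dataquieR resolves table names or URLs.

```python
def parse_table_reference(reference):
    """Split a table reference into (workbook, sheet).

    A bare sheet name yields (None, sheet), meaning "same workbook".
    A 'workbook | sheet' string yields both parts with whitespace trimmed.
    """
    parts = [part.strip() for part in reference.split("|")]
    if len(parts) == 1:
        return None, parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"unexpected table reference: {reference!r}")


same_workbook = parse_table_reference("expected_id")
other_workbook = parse_table_reference(
    "d:/data/questionnaire_data.xlsx | pseudo_id"
)
```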


SEGMENT_RECORD_CHECK

A string that sets the type of check to be conducted when comparing the reference ID table with the IDs in a segment. Two checks are possible:

  • exact: tests for an exact match between the IDs in SEGMENT_ID_TABLE and the IDs in the segment, or
  • subset: expects the IDs in the segment to be a subset of the IDs in SEGMENT_ID_TABLE.

For instance, STUDY, INTERVIEW and QUESTIONNAIRE may comprise all participants of a study, while particular segments, such as PHYS_EXAM and LAB, may have been collected from only a smaller participant sample:

STUDY_SEGMENT  SEGMENT_RECORD_CHECK
STUDY          exact
PHYS_EXAM      subset
LAB            subset
INTERVIEW      exact
QUESTIONNAIRE  exact
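
The difference between the two check types comes down to set equality versus set inclusion. A minimal sketch in plain Python follows; the function check_segment_ids and the sample IDs are hypothetical, and this is not dataquieR's own implementation.

```python
def check_segment_ids(segment_ids, reference_ids, mode):
    """Compare the IDs found in a segment with the reference IDs.

    mode "exact":  both ID sets must be identical.
    mode "subset": every segment ID must occur in the reference table.
    """
    observed = set(segment_ids)
    reference = set(reference_ids)
    if mode == "exact":
        return observed == reference
    if mode == "subset":
        return observed <= reference
    raise ValueError(f"unknown check type: {mode!r}")


reference = ["id1", "id2", "id3"]  # made-up reference IDs
lab_ids = ["id1", "id3"]           # made-up IDs found in a smaller segment

lab_as_subset = check_segment_ids(lab_ids, reference, "subset")
lab_as_exact = check_segment_ids(lab_ids, reference, "exact")
```

With these sample IDs, the segment passes as a subset but would fail an exact check, because id2 is missing from it.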


SEGMENT_ID_VARS

Defines all variables to be used as one single ID variable (a combined key) in a segment. The list of variables must be a string in which each variable is separated by a pipe character (|).

For example, the ID for PHYS_EXAM is a combined key consisting of the “PSEUDO_ID” and “CENTER_0” variables. For the remaining segments, the ID is the single variable “v00001”:

STUDY_SEGMENT  SEGMENT_ID_VARS
STUDY          v00001
PHYS_EXAM      PSEUDO_ID | CENTER_0
LAB            v00001
INTERVIEW      v00001
QUESTIONNAIRE  v00001
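
One way to picture a combined key is as a tuple built from the listed variables for every record. The sketch below, in plain Python, assumes that records are available as dictionaries; the function build_combined_ids and the sample records are hypothetical and not part of dataquieR.

```python
def build_combined_ids(rows, id_vars):
    """Build one combined ID per record from a pipe-separated variable list."""
    names = [name.strip() for name in id_vars.split("|")]
    return [tuple(row[name] for name in names) for row in rows]


# Made-up records for a segment that uses a combined key.
phys_exam_rows = [
    {"PSEUDO_ID": "A1", "CENTER_0": 1, "sbp": 120},
    {"PSEUDO_ID": "A1", "CENTER_0": 2, "sbp": 131},
]

combined = build_combined_ids(phys_exam_rows, "PSEUDO_ID | CENTER_0")
single = build_combined_ids(phys_exam_rows, "PSEUDO_ID")
```

In this made-up data, PSEUDO_ID alone repeats across the two records, while the combination with CENTER_0 identifies each record uniquely.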


SEGMENT_UNIQUE_ROWS

Specifies whether identical data is permitted across rows in a segment (excluding ID variables). The input is a Boolean, meaning:

  • false: allow repeated rows, or
  • true: rows must be unique.

For instance, repeated rows may be allowed for PHYS_EXAM and LAB but not for the other segments:

STUDY_SEGMENT  SEGMENT_UNIQUE_ROWS
STUDY          true
PHYS_EXAM      false
LAB            false
INTERVIEW      true
QUESTIONNAIRE  true
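
A uniqueness check of this kind can be sketched in plain Python as follows. The function has_duplicate_rows and the sample records are hypothetical; the sketch assumes records are dictionaries with hashable values and only illustrates the idea of comparing row content after excluding the ID variables.

```python
def has_duplicate_rows(rows, id_vars):
    """Return True if two records share identical content outside the ID variables."""
    ids = {name.strip() for name in id_vars.split("|")}
    seen = set()
    for row in rows:
        # Row content without the ID variables, in a stable key order.
        content = tuple(sorted(
            (key, value) for key, value in row.items() if key not in ids
        ))
        if content in seen:
            return True
        seen.add(content)
    return False


# Made-up records: different IDs, identical measurements.
rows = [
    {"v00001": "id1", "height": 172, "weight": 70},
    {"v00001": "id2", "height": 172, "weight": 70},
]

duplicates_found = has_duplicate_rows(rows, "v00001")
```

Under a setting of true for SEGMENT_UNIQUE_ROWS, such a finding would count as a violation; under false it would be tolerated.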

SEGMENT_PART_VARS

Provides the name of the variable that indicates participation in the respective segment. For instance:

STUDY_SEGMENT  SEGMENT_PART_VARS
STUDY          seg_study_part
PHYS_EXAM      seg_phys_exam_part
LAB            seg_lab_part
INTERVIEW      seg_interview_part
QUESTIONNAIRE  seg_questionnaire_part

In the study data, each segment participation variable contains participation and missing codes (e.g., -10000, 99980, 99981). If interpretation codes are provided in a separate table (e.g., segment_missing_table), the participation codes allow the calculation of qualified missingness rates per segment.