The Data type mismatch indicator can be calculated using int_datatype_matrix
in the following way:
# Load dataquieR
library(dataquieR)
# Load data
sd1 <- prep_get_data_frame("ship")
# Load metadata
file_name <- system.file("extdata", "ship_meta_v2.xlsx", package = "dataquieR")
prep_load_workbook_like_file(file_name)
meta_data_item <- prep_get_data_frame("item_level") # item_level is a sheet in ship_meta_v2.xlsx
# Apply indicator function
datatype_res <- int_datatype_matrix(
  study_data = sd1,
  meta_data = meta_data_item,
  label_col = "LONG_LABEL"
)
A plot and a table are provided to view the results:
datatype_res$SummaryPlot
datatype_res$SummaryData
 | Variables | Data type mismatch N (%) | Convertible mismatch, stable N (%) | Convertible mismatch, unstable N (%) | Data type match N (%) | Expected DATA_TYPE | Observed DATA_TYPE | State, given threshold | STUDY_SEGMENT |
---|---|---|---|---|---|---|---|---|---|
25 | Participant ID | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTRO |
10 | Diastolic blood pressure 2 | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | SOMATOMETRY |
28 | Somatometry examiner | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | SOMATOMETRY |
5 | Body height | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | SOMATOMETRY |
6 | Body height scale ID | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | SOMATOMETRY |
7 | Body weight | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | float | float | Matching datatype | SOMATOMETRY |
8 | Body weight scale ID | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | SOMATOMETRY |
33 | Waist circumference | 3 (0.14%) | 2148 ( 99.72%) | 0 (0.00%) | 3 ( 0.14%) | float | string | Non-matching datatype | SOMATOMETRY |
17 | Interview examiner | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTERVIEW |
16 | Highest educational level | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTERVIEW |
23 | Marital status | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTERVIEW |
14 | Examination date and time | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | datetime | datetime | Matching datatype | INTRO |
27 | Smoking status | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTERVIEW |
12 | Ever had stroke | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTERVIEW |
11 | Ever had myocardial infarction | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTERVIEW |
20 | Known diabetes | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTERVIEW |
2 | Age of diabetes onset | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTERVIEW |
13 | Ever taken birth control pills | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTERVIEW |
24 | Monthly household income | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTERVIEW |
15 | HDL-cholesterol | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | float | float | Matching datatype | LABORATORY |
22 | LDL-cholesterol | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | float | float | Matching datatype | LABORATORY |
32 | Total cholesterol | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | float | float | Matching datatype | LABORATORY |
26 | Sex | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTRO |
19 | Intro segment consent | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTRO |
29 | Somatometry segment consent | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTRO |
18 | Interview segment consent | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTRO |
21 | Laboratory segment consent | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTRO |
1 | Age | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | INTRO |
4 | Blood pressure examiner | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | SOMATOMETRY |
3 | Blood pressure device ID | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | SOMATOMETRY |
30 | Systolic blood pressure 1 | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | SOMATOMETRY |
31 | Systolic blood pressure 2 | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | SOMATOMETRY |
9 | Diastolic blood pressure 1 | 0 (0.00%) | 0 ( 0.00%) | 0 (0.00%) | 2154 (100.00%) | integer | integer | Matching datatype | SOMATOMETRY |
All datatype issues found by int_datatype_matrix
should be checked data element by data element. For instance, a major
issue was found in the variable WAIST_CIRC_0. This variable is
represented in the study data with datatype character, which differs from
the expected datatype float defined in the metadata. Some basic
checks show that commas were misused as the decimal delimiter.
int_inspect_char(sd1$waist)
Character | Count |
---|---|
, | 3 |
. | 2144 |
0 | 933 |
1 | 1443 |
2 | 908 |
3 | 884 |
4 | 889 |
5 | 898 |
6 | 1018 |
7 | 1279 |
8 | 1355 |
9 | 1409 |
NA | 3 |
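A similar character tally can be sketched in base R, which may help when checking other text columns by hand. The vector below is a made-up toy example, not the SHIP data, and the variable names are only illustrative.

```r
# Toy values standing in for a numeric column that was read as text
# (hypothetical data, not taken from the SHIP example)
x <- c("80.5", "91,2", "77.0", NA, "102,4")

# Split every non-missing value into single characters and tabulate them
chars <- unlist(strsplit(x[!is.na(x)], split = ""))
char_counts <- table(chars)
print(char_counts)

# List the entries that contain a comma, i.e. the suspicious ones
comma_values <- grep(",", x, value = TRUE, fixed = TRUE)
print(comma_values)
```

Listing the affected entries themselves, rather than only counting characters, makes it easier to judge whether the comma is really a decimal delimiter before anything is replaced.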
Directly converting WAIST_CIRC_0 to datatype numeric would coerce the
affected values to NA, which should be avoided. Instead, we replace the
comma with the correct delimiter and then convert the datatype, so that no
data values are lost. The resulting applicability plot shows no more issues.
# replace comma with the correct delimiter
sd1$waist <- as.numeric(gsub(",", ".", sd1$waist))
int_datatype_matrix(
  study_data = sd1,
  meta_data = meta_data_item,
  label_col = "LONG_LABEL"
)$SummaryPlot
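Why the delimiter has to be replaced before the conversion can be illustrated on a small toy vector. This is only a sketch with made-up values; the names `direct` and `fixed` are introduced here for illustration.

```r
# Toy values (hypothetical, not the SHIP data); one uses a comma as delimiter
x <- c("80.5", "91,2", "77.0")

# Direct coercion: "91,2" cannot be parsed and becomes NA (R also warns)
direct <- suppressWarnings(as.numeric(x))

# Replacing the comma first keeps all values
fixed <- as.numeric(gsub(",", ".", x, fixed = TRUE))

print(sum(is.na(direct)))
print(sum(is.na(fixed)))
```

Comparing the number of missing values before and after the conversion is a simple way to confirm that a datatype correction did not silently drop data.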