Description

The des_summary function provides descriptive statistics for numerical and categorical variables in the study data.
Depending on the type of data, the function provides the appropriate measures of central tendency (mean, median, and mode) and of dispersion (standard deviation, interquartile range, median absolute deviation (MAD), range of values, and coefficient of variation). It also reports the number of categories and their frequencies, the shape of the distribution (skewness and kurtosis), and the amount of missing data, and it provides plots that give an overview of the data distribution. The derived functions des_summary_categorical and des_summary_continuous provide only the descriptive statistics appropriate to the data type named in the function. The functions also work without metadata.
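As a rough illustration of the measures listed above, the following base R sketch computes them for a single numeric vector. The helper summarise_numeric() is hypothetical and not part of dataquieR. The estimators chosen here (default quantile type, moment-based skewness and excess kurtosis) are assumptions and may differ from what des_summary uses internally. In the example output further down, the MAD values of integer-valued variables (for example 1.483, 2.965, 4.448) are integer multiples of 1.4826 and the CV values match 100 * SD / mean, which is consistent with the scaled median absolute deviation returned by stats::mad() and a CV expressed in percent.

```r
# Hypothetical helper (not part of dataquieR): descriptive statistics
# for one numeric vector, ignoring missing values.
summarise_numeric <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sd(x)
  # Mode: most frequent value; ties are resolved by taking the smallest one.
  counts <- table(x)
  mode_value <- as.numeric(names(counts)[which.max(counts)])
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  # Standardise with the population SD for the moment-based shape measures.
  z <- (x - m) / sqrt(mean((x - m)^2))
  list(
    mean = m,
    sd = s,
    median = median(x),
    mode = mode_value,
    q1 = q[1],
    q3 = q[2],
    iqr = q[2] - q[1],
    mad = mad(x),              # median absolute deviation, scaled by 1.4826
    min = min(x),
    max = max(x),
    range = max(x) - min(x),
    cv = 100 * s / m,          # coefficient of variation in percent
    skewness = mean(z^3),      # assumption: simple moment estimator
    kurtosis = mean(z^4) - 3   # assumption: excess kurtosis
  )
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)
res <- summarise_numeric(x)
str(res)
```

The missing value is dropped before any statistic is computed, so the summary describes the eight valid observations only.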

Usage and arguments

des_summary(
  resp_vars = NULL,
  study_data = sd1,
  label_col = LABEL,
  meta_data = md1
)

The function has the following arguments:

  • resp_vars: optional, a character specifying the measurement variables of interest. If missing, all variables from the study_data are assessed;
  • study_data: mandatory, the data frame containing the measurements;
  • meta_data: optional, the data frame containing the item-level metadata. If it refers to missing tables, these must either exist as files or URLs, or be loaded beforehand using prep_load_workbook_like_file(), prep_load_folder_with_metadata(), or prep_get_data_frame();
  • label_col: optional, the column in the metadata data frame containing the labels of all the variables in the study data.
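
To show how resp_vars narrows the summary, the sketch below restricts it to the two blood pressure variables from the bundled example data. It reuses only the calls shown on this page. The object names vars and des_bp are made up for illustration, and the dataquieR calls are wrapped in a check so that the snippet does nothing if the package is not installed.

```r
# Labels of the variables of interest (SBP_0 and DBP_0 appear in the
# example output of this page).
vars <- c("SBP_0", "DBP_0")

# Run the dataquieR part only if the package is available.
if (requireNamespace("dataquieR", quietly = TRUE)) {
  library(dataquieR)

  # Load the bundled metadata workbook and the example study data.
  prep_load_workbook_like_file("meta_data_v2")
  sd1 <- prep_get_data_frame("study_data")

  # Summarise only the selected variables, addressed by their labels.
  des_bp <- des_summary(
    resp_vars = vars,
    study_data = sd1,
    label_col = LABEL
  )

  # The summary data frame should now contain one row per selected variable.
  des_bp$SummaryData
}
```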

Example output

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

prep_load_workbook_like_file("meta_data_v2")
sd1 <- prep_get_data_frame("study_data")
des_sum <- des_summary(
  study_data = sd1,
  label_col = LABEL
)

The function generates two outputs, SummaryData and SummaryTable. They are identical in this case but are used differently when a report is created.

Output 1: Summary data frame

The summary data frame is accessed with des_sum$SummaryData. It can be displayed in two ways.

Either as an interactive data.tables table:

des_sum

Or as a kable:

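The table has the following columns: Variables, Type, STUDY_SEGMENT, Mean, SD, Median, Mode, IQR (Quartiles), MAD, Range (Min - Max), CV, Skewness (SE), Kurtosis, No. categories (incl. NAs), Frequency table, Valid, Missing, and Graph. Each variable is listed below with its non-empty cells; the Graph column contains a histogram for each variable.

1 Examination center (CENTER_0, v00000)
  Type: nominal, integer; STUDY_SEGMENT: STUDY
  Mode: Berlin; No. categories (incl. NAs): 5
  Frequency table: ‘Berlin’ 632, ‘Leipzig’ 602, ‘Munich’ 597, ‘Hamburg’ 592, ‘Cologne’ 577
  Valid: 3000 (100%); Missing: 0 (0%)

31 Sex B/L (SEX_0, v00002)
  Type: nominal, integer; STUDY_SEGMENT: STUDY
  Mode: females; No. categories (incl. NAs): 3
  Frequency table: ‘females’ 1478, ‘males’ 1462
  Valid: 2940 (98%); Missing: 60 (2%)

42 Age B/L (AGE_0, v00003)
  Type: ratio, integer; STUDY_SEGMENT: STUDY
  Mean: 49.914; SD: 4.423; Median: 50; Mode: 51; IQR: 6 (Q1 = 47 | Q3 = 53); MAD: 4.448; Range: 30 (33 - 63); CV: 8.862; Skewness (SE): -0.037 (0.045); Kurtosis: -0.176
  Valid: 2940 (98%); Missing: 60 (2%)

15 Age group B/L (AGE_GROUP_0, v00103)
  Type: ordinal, string; STUDY_SEGMENT: STUDY
  Median: 50-59; Mode: 50-59; IQR: 1 (Q1 = 40-49 | Q3 = 50-59); Range: (30-39 - 60-69); No. categories (incl. NAs): 5
  Frequency table: ‘50-59’ 1554, ‘40-49’ 1322, ‘60-69’ 39, ‘30-39’ 25
  Valid: 2940 (98%); Missing: 60 (2%)

20 Age F/U (AGE_1, v01003)
  Type: ratio, integer; STUDY_SEGMENT: STUDY
  Mean: 49.872; SD: 4.429; Median: 50; Mode: 51; IQR: 6 (Q1 = 47 | Q3 = 53); MAD: 4.448; Range: 30 (33 - 63); CV: 8.881; Skewness (SE): -0.032 (0.045); Kurtosis: -0.186
  Valid: 2940 (98%); Missing: 60 (2%)

25 Sex F/U (SEX_1, v01002)
  Type: nominal, integer; STUDY_SEGMENT: STUDY
  Mode: females; No. categories (incl. NAs): 3
  Frequency table: ‘females’ 1472, ‘males’ 1468
  Valid: 2940 (98%); Missing: 60 (2%)

43 Study consent (PART_STUDY, v10000)
  Type: nominal, integer; STUDY_SEGMENT: STUDY
  Mode: yes; No. categories (incl. NAs): 2
  Frequency table: ‘yes’ 2940, ‘no’ 0
  Valid: 2940 (98%); Missing: 60 (2%)

53 Systolic blood pressure (SBP_0, v00004)
  Type: ratio, float; STUDY_SEGMENT: PHYS_EXAM
  Mean: 126.516; SD: 9.613; Median: 127; Mode: 130; IQR: 13 (Q1 = 120 | Q3 = 133); MAD: 8.896; Range: 63 (97 - 160); CV: 7.598; Skewness (SE): 0.064 (0.045); Kurtosis: -0.564
  Valid: 2561 (85.37%); Missing: 439 (14.63%)

2 Diastolic blood pressure (DBP_0, v00005)
  Type: ratio, float; STUDY_SEGMENT: PHYS_EXAM
  Mean: 81.29; SD: 9.214; Median: 81; Mode: 80; IQR: 12 (Q1 = 75 | Q3 = 87); MAD: 8.896; Range: 57 (54 - 111); CV: 11.335; Skewness (SE): 0.081 (0.045); Kurtosis: -0.556
  Valid: 2544 (84.8%); Missing: 456 (15.2%)

3 Self-reported global health (GLOBAL_HEALTH_VAS_0, v00006)
  Type: ratio, float; STUDY_SEGMENT: PHYS_EXAM
  Mean: 5.027; SD: 2.918; Median: 5; Mode: 3.2; IQR: 5.075 (Q1 = 2.5 | Q3 = 7.575); MAD: 3.706; Range: 10 (0 - 10); CV: 58.051; Skewness (SE): -0.002 (0.045); Kurtosis: -1.437
  Valid: 2618 (87.27%); Missing: 382 (12.73%)

4 Known asthma (ASTHMA_0, v00007)
  Type: nominal, integer; STUDY_SEGMENT: PHYS_EXAM
  Mode: no; No. categories (incl. NAs): 3
  Frequency table: ‘no’ 2117, ‘yes’ 524
  Valid: 2641 (88.03%); Missing: 359 (11.97%)

9 Aerobic capacity category (VO2_CAPCAT_0, v00008)
  Type: ordinal, string; STUDY_SEGMENT: PHYS_EXAM
  Median: good; Mode: excellent; IQR: 3 (Q1 = excellent | Q3 = restricted); Range: (excellent - pathological); No. categories (incl. NAs): 6
  Frequency table: ‘excellent’ 784, ‘good’ 647, ‘moderate’ 500, ‘restricted’ 380, ‘pathological’ 284
  Valid: 2595 (86.5%); Missing: 405 (13.5%)

10 Upper arm circumference (ARM_CIRC_0, v00009)
  Type: ratio, float; STUDY_SEGMENT: PHYS_EXAM
  Mean: 25.033; SD: 3.958; Median: 25; Mode: 24; IQR: 6 (Q1 = 22 | Q3 = 28); MAD: 4.448; Range: 27 (11 - 38); CV: 15.81; Skewness (SE): -0.024 (0.045); Kurtosis: -0.359
  Valid: 2657 (88.57%); Missing: 343 (11.43%)

11 Upper arm circumference cat (ARM_CIRC_DISC_0, v00109)
  Type: ordinal, integer; STUDY_SEGMENT: PHYS_EXAM
  Median: (20,30]; Mode: (20,30]; IQR: 0 (Q1 = (20,30] | Q3 = (20,30]); Range: ((-Inf,20] - (30, Inf]); No. categories (incl. NAs): 4
  Frequency table: ‘(20,30]’ 2071, ‘(-Inf,20]’ 344, ‘(30, Inf]’ 218
  Valid: 2633 (87.77%); Missing: 367 (12.23%)

12 Upper arm circumference device (ARM_CUFF_0, v00010)
  Type: ordinal, integer; STUDY_SEGMENT: PHYS_EXAM
  Median: (20,30]; Mode: (20,30]; IQR: 0 (Q1 = (20,30] | Q3 = (20,30]); Range: ((-Inf,20] - (30, Inf]); No. categories (incl. NAs): 4
  Frequency table: ‘(20,30]’ 2015, ‘(-Inf,20]’ 351, ‘(30, Inf]’ 257
  Valid: 2623 (87.43%); Missing: 377 (12.57%)

13 Aerobic capacity examiner (USR_VO2_0, v00011)
  Type: nominal, string; STUDY_SEGMENT: PHYS_EXAM
  Mode: USR_321; No. categories (incl. NAs): 16
  Frequency table: ‘USR_321’ 449, ‘USR_590’ 301, ‘USR_213’ 223, ‘USR_592’ 223, ‘USR_211’ 216, Others 1370
  Valid: 2782 (92.73%); Missing: 218 (7.27%)

14 Blood pressure examiner (USR_BP_0, v00012)
  Type: nominal, string; STUDY_SEGMENT: PHYS_EXAM
  Mode: USR_301; No. categories (incl. NAs): 16
  Frequency table: ‘USR_301’ 448, ‘USR_243’ 347, ‘USR_537’ 319, ‘USR_542’ 208, ‘USR_123’ 201, Others 1252
  Valid: 2775 (92.5%); Missing: 225 (7.5%)

16 Examination date and time (EXAM_DT_0, v00013)
  Type: interval, datetime; STUDY_SEGMENT: PHYS_EXAM
  Mean: 2018-07-02 10:09:59 UTC; SD: 3.45 months; Median: 2018-07-05 19:45:30 UTC; Mode: 2018-03-21 20:44:05 UTC and other 4 dates; IQR: 5.84 months (Q1 = 2018-04-03 04:24:05 UTC | Q3 = 2018-09-27 19:32:04 UTC); MAD: 4.38 months; Range: 11 months, 4 weeks, 2 days (2018-01-01 UTC - 2018-12-31 UTC)
  Valid: 2940 (98%); Missing: 60 (2%)

18 Physical exam consent (PART_PHYS_EXAM, v20000)
  Type: nominal, integer; STUDY_SEGMENT: PHYS_EXAM
  Mode: yes; No. categories (incl. NAs): 2
  Frequency table: ‘yes’ 2940, ‘no’ 0
  Valid: 2940 (98%); Missing: 60 (2%)

19 C-reactive protein (CRP_0, v00014)
  Type: ratio, float; STUDY_SEGMENT: LAB
  Mean: 2.888; SD: 1.805; Median: 2.587; Mode: 0.16; IQR: 2.27 (Q1 = 1.608 | Q3 = 3.878); MAD: 1.637; Range: 11.894 (0.118 - 12.012); CV: 62.507; Skewness (SE): 0.897 (0.045); Kurtosis: 0.998
  Valid: 2699 (89.97%); Missing: 301 (10.03%)

21 Erythrocyte sedimentation rate (BSG_0, v00015)
  Type: ratio, float; STUDY_SEGMENT: LAB
  Mean: 14.857; SD: 12.135; Median: 11; Mode: 10; IQR: 14 (Q1 = 6 | Q3 = 20); MAD: 10.378; Range: 96 (0 - 96); CV: 81.677; Skewness (SE): 1.377 (0.045); Kurtosis: 2.678
  Valid: 2686 (89.53%); Missing: 314 (10.47%)

22 Device ID (DEV_NO_0, v00016)
  Type: nominal, integer; STUDY_SEGMENT: LAB
  Mode: 2; No. categories (incl. NAs): 6
  Frequency table: ‘2’ 661, ‘3’ 626, ‘1’ 593, ‘4’ 412, ‘5’ 400
  Valid: 2692 (89.73%); Missing: 308 (10.27%)

23 Lab analysis date and time (LAB_DT_0, v00017)
  Type: interval, datetime; STUDY_SEGMENT: LAB
  Mean: 2018-07-02 12:00:36 UTC; SD: 3.45 months; Median: 2018-07-05 21:50:00 UTC; Mode: 2018-01-01 02:00:00 UTC and other 60 dates; IQR: 5.83 months (Q1 = 2018-04-03 06:21:05 UTC | Q3 = 2018-09-27 20:03:04 UTC); MAD: 4.38 months; Range: 11 months, 4 weeks, 2 days, 1 minute (2018-01-01 02:00:00 UTC - 2018-12-31 02:01:00 UTC)
  Valid: 2940 (98%); Missing: 60 (2%)

24 Lab analysis consent (PART_LAB, v30000)
  Type: nominal, integer; STUDY_SEGMENT: LAB
  Mode: yes; No. categories (incl. NAs): 2
  Frequency table: ‘yes’ 2940, ‘no’ 0
  Valid: 2940 (98%); Missing: 60 (2%)

26 Highest educational level B/L (EDUCATION_0, v00018)
  Type: ordinal, integer; STUDY_SEGMENT: INTERVIEW
  Median: uppersecond; Mode: uppersecond; IQR: 2 (Q1 = secondary | Q3 = postsecond); Range: (pre-primary - secondtertiary); No. categories (incl. NAs): 8
  Frequency table: ‘uppersecond’ 568, ‘secondary’ 557, ‘postsecond’ 432, ‘primary’ 376, ‘tertiary’ 258, Others 281
  Valid: 2472 (82.4%); Missing: 528 (17.6%)

28 Highest educational level F/U (EDUCATION_1, v01018)
  Type: ordinal, integer; STUDY_SEGMENT: INTERVIEW
  Median: uppersecond; Mode: uppersecond; IQR: 2 (Q1 = secondary | Q3 = postsecond); Range: (pre-primary - secondtertiary); No. categories (incl. NAs): 8
  Frequency table: ‘uppersecond’ 562, ‘secondary’ 541, ‘postsecond’ 428, ‘primary’ 360, ‘tertiary’ 260, Others 271
  Valid: 2422 (80.73%); Missing: 578 (19.27%)

29 Marital status (FAM_STAT_0, v00019)
  Type: nominal, integer; STUDY_SEGMENT: INTERVIEW
  Mode: NA; No. categories (incl. NAs): 1
  Frequency table: ‘single’ 0, ‘married’ 0, ‘divorced’ 0, ‘widowed’ 0
  Valid: 0 (0%); Missing: 3000 (100%)

30 Currently married (MARRIED_0, v00020)
  Type: nominal, integer; STUDY_SEGMENT: INTERVIEW
  Mode: no; No. categories (incl. NAs): 3
  Frequency table: ‘no’ 2005, ‘yes’ 361
  Valid: 2366 (78.87%); Missing: 634 (21.13%)

32 Number of children (N_CHILD_0, v00021)
  Type: ratio, integer; STUDY_SEGMENT: INTERVIEW
  Mean: 2.499; SD: 1.53; Median: 2; Mode: 2; IQR: 2 (Q1 = 1 | Q3 = 3); MAD: 1.483; Range: 9 (0 - 9); CV: 61.21; Skewness (SE): 0.491 (0.045); Kurtosis: -0.291
  Valid: 2336 (77.87%); Missing: 664 (22.13%)

33 Eating preferences (EATING_PREFS_0, v00022)
  Type: nominal, integer; STUDY_SEGMENT: INTERVIEW
  Mode: none; No. categories (incl. NAs): 4
  Frequency table: ‘none’ 1366, ‘vegetarian’ 712, ‘vegan’ 250
  Valid: 2328 (77.6%); Missing: 672 (22.4%)

34 Meat consumption (MEAT_CONS_0, v00023)
  Type: ordinal, integer; STUDY_SEGMENT: INTERVIEW
  Median: 1-2d a week; Mode: never; IQR: 2 (Q1 = never | Q3 = 3-4d a week); Range: (never - daily); No. categories (incl. NAs): 6
  Frequency table: ‘never’ 946, ‘3-4d a week’ 475, ‘1-2d a week’ 382, ‘5-6d a week’ 326, ‘daily’ 173
  Valid: 2302 (76.73%); Missing: 698 (23.27%)

35 Current smoker (SMOKING_0, v00024)
  Type: nominal, integer; STUDY_SEGMENT: INTERVIEW
  Mode: no; No. categories (incl. NAs): 3
  Frequency table: ‘no’ 1585, ‘yes’ 707
  Valid: 2292 (76.4%); Missing: 708 (23.6%)

36 Purchasing tobacco products (SMOKE_SHOP_0, v00025)
  Type: ordinal, integer; STUDY_SEGMENT: INTERVIEW
  Median: 3-4d a week; Mode: 3-4d a week; IQR: 2 (Q1 = 1-2d a week | Q3 = 5-6d a week); Range: (never - daily); No. categories (incl. NAs): 6
  Frequency table: ‘3-4d a week’ 176, ‘5-6d a week’ 169, ‘daily’ 154, ‘1-2d a week’ 150, ‘never’ 133
  Valid: 782 (26.07%); Missing: 2218 (73.93%)

37 Number of injuries (N_INJURIES_0, v00026)
  Type: ratio, integer; STUDY_SEGMENT: INTERVIEW
  Mean: 4.588; SD: 2.422; Median: 4; Mode: 4; IQR: 3 (Q1 = 3 | Q3 = 6); MAD: 2.965; Range: 14 (0 - 14); CV: 52.792; Skewness (SE): 0.474 (0.045); Kurtosis: -0.572
  Valid: 2199 (73.3%); Missing: 801 (26.7%)

38 Number of births (N_BIRTH_0, v00027)
  Type: ratio, integer; STUDY_SEGMENT: INTERVIEW
  Mean: 3.458; SD: 1.771; Median: 3; Mode: 3; IQR: 2 (Q1 = 2 | Q3 = 4); MAD: 1.483; Range: 12 (0 - 12); CV: 51.212; Skewness (SE): 0.211 (0.045); Kurtosis: -1.703
  Valid: 1099 (36.63%); Missing: 1901 (63.37%)

39 Income group (INCOME_GROUP_0, v00028)
  Type: ordinal, integer; STUDY_SEGMENT: INTERVIEW
  Median: [30-50k); Mode: [30-50k); IQR: 2 (Q1 = [10-30k) | Q3 = [50-70k)); Range: (below 10k - above 90k); No. categories (incl. NAs): 7
  Frequency table: ‘[30-50k)’ 622, ‘[10-30k)’ 584, ‘[50-70k)’ 395, ‘below 10k’ 294, ‘[70-90k)’ 205, Others 74
  Valid: 2174 (72.47%); Missing: 826 (27.53%)

40 Currently pregnant (PREGNANT_0, v00029)
  Type: nominal, integer; STUDY_SEGMENT: INTERVIEW
  Mode: no; No. categories (incl. NAs): 3
  Frequency table: ‘no’ 996, ‘yes’ 69
  Valid: 1065 (35.5%); Missing: 1935 (64.5%)

41 Medication use (MEDICATION_0, v00030)
  Type: nominal, integer; STUDY_SEGMENT: INTERVIEW
  Mode: yes; No. categories (incl. NAs): 2
  Frequency table: ‘yes’ 292, ‘no’ 0
  Valid: 292 (9.73%); Missing: 2708 (90.27%)

44 Number of ATC codes (N_ATC_CODES_0, v00031)
  Type: ratio, integer; STUDY_SEGMENT: INTERVIEW
  Mean: 2.262; SD: 2.726; Median: 1; Mode: 0; IQR: 3 (Q1 = 0 | Q3 = 3); MAD: 1.483; Range: 22 (0 - 22); CV: 120.479; Skewness (SE): 1.406 (0.045); Kurtosis: 3.055
  Valid: 2058 (68.6%); Missing: 942 (31.4%)

45 Sociodemographics examiner (USR_SOCDEM_0, v00032)
  Type: nominal, string; STUDY_SEGMENT: INTERVIEW
  Mode: USR_321; No. categories (incl. NAs): 15
  Frequency table: ‘USR_321’ 380, ‘USR_247’ 297, ‘USR_520’ 290, ‘USR_125’ 159, ‘USR_492’ 147, Others 841
  Valid: 2114 (70.47%); Missing: 886 (29.53%)

46 Interview date and time (INT_DT_0, v00033)
  Type: interval, datetime; STUDY_SEGMENT: INTERVIEW
  Mean: 2018-07-02 12:40:34 UTC; SD: 3.45 months; Median: 2018-07-05 22:27:30 UTC; Mode: 2018-08-15 22:55:34 UTC 2018-08-24 01:57:42 UTC; IQR: 5.84 months (Q1 = 2018-04-03 06:44:50 UTC | Q3 = 2018-09-27 22:05:04 UTC); MAD: 4.38 months; Range: 11 months, 4 weeks, 1 day, 23 hours, 56 minutes (2018-01-01 02:24:00 UTC - 2018-12-31 02:20:00 UTC)
  Valid: 2940 (98%); Missing: 60 (2%)

47 Interview consent (PART_INTERVIEW, v40000)
  Type: nominal, integer; STUDY_SEGMENT: INTERVIEW
  Mode: yes; No. categories (incl. NAs): 3
  Frequency table: ‘yes’ 2924, ‘no’ 16
  Valid: 2940 (98%); Missing: 60 (2%)

48 Item 1 (ITEM_1_0, v00034)
  Type: ratio, integer; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 3.037; SD: 1.764; Median: 3; Mode: 3; IQR: 2 (Q1 = 2 | Q3 = 4); MAD: 1.483; Range: 9 (0 - 9); CV: 58.085; Skewness (SE): 0.414 (0.045); Kurtosis: -0.589
  Valid: 2248 (74.93%); Missing: 752 (25.07%)

49 Item 2 (ITEM_2_0, v00035)
  Type: ratio, integer; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 2.988; SD: 1.701; Median: 3; Mode: 2; IQR: 2 (Q1 = 2 | Q3 = 4); MAD: 1.483; Range: 10 (0 - 10); CV: 56.928; Skewness (SE): 0.408 (0.045); Kurtosis: -0.648
  Valid: 2197 (73.23%); Missing: 803 (26.77%)

50 Item 3 (ITEM_3_0, v00036)
  Type: ratio, integer; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 3.014; SD: 1.718; Median: 3; Mode: 3; IQR: 2 (Q1 = 2 | Q3 = 4); MAD: 1.483; Range: 10 (0 - 10); CV: 56.99; Skewness (SE): 0.41 (0.045); Kurtosis: -0.606
  Valid: 2184 (72.8%); Missing: 816 (27.2%)

51 Item 4 (ITEM_4_0, v00037)
  Type: ratio, integer; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 3; SD: 1.721; Median: 3; Mode: 3; IQR: 2 (Q1 = 2 | Q3 = 4); MAD: 1.483; Range: 10 (0 - 10); CV: 57.37; Skewness (SE): 0.435 (0.045); Kurtosis: -0.556
  Valid: 2143 (71.43%); Missing: 857 (28.57%)

52 Item 5 (ITEM_5_0, v00038)
  Type: ratio, integer; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 6.021; SD: 2.374; Median: 6; Mode: 6; IQR: 4 (Q1 = 4 | Q3 = 8); MAD: 2.965; Range: 10 (0 - 10); CV: 39.424; Skewness (SE): -0.183 (0.045); Kurtosis: -1.352
  Valid: 2074 (69.13%); Missing: 926 (30.87%)

5 Item 6 (ITEM_6_0, v00039)
  Type: ratio, integer; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 5.948; SD: 2.371; Median: 6; Mode: 6; IQR: 4 (Q1 = 4 | Q3 = 8); MAD: 2.965; Range: 10 (0 - 10); CV: 39.857; Skewness (SE): -0.119 (0.045); Kurtosis: -1.414
  Valid: 2048 (68.27%); Missing: 952 (31.73%)

7 Item 7 (ITEM_7_0, v00040)
  Type: ratio, integer; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 6.037; SD: 2.401; Median: 6; Mode: 6; IQR: 4 (Q1 = 4 | Q3 = 8); MAD: 2.965; Range: 10 (0 - 10); CV: 39.769; Skewness (SE): -0.174 (0.045); Kurtosis: -1.398
  Valid: 2068 (68.93%); Missing: 932 (31.07%)

6 Item 8 (ITEM_8_0, v00041)
  Type: ratio, integer; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 5.895; SD: 2.397; Median: 6; Mode: 6; IQR: 4 (Q1 = 4 | Q3 = 8); MAD: 2.965; Range: 10 (0 - 10); CV: 40.67; Skewness (SE): -0.097 (0.045); Kurtosis: -1.487
  Valid: 2013 (67.1%); Missing: 987 (32.9%)

27 Questionnaire date and time (QUEST_DT_0, v00042)
  Type: interval, datetime; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 2018-08-03 23:09:06 UTC; SD: 4.59 months; Median: 2018-07-29 21:52:59 UTC; Mode: 2017-12-31 22:59:59 UTC; IQR: 6.31 months (Q1 = 2018-04-20 11:41:02 UTC | Q3 = 2018-10-29 13:49:41 UTC); MAD: 4.66 months; Range: 2 years, 10 months, 1 week, 14 hours, 43 minutes, 29.47 seconds (2017-12-31 22:59:59 UTC - 2020-11-08 13:43:28 UTC)
  Valid: 2940 (98%); Missing: 60 (2%)

8 Questionnaire consent (PART_QUESTIONNAIRE, v50000)
  Type: nominal, integer; STUDY_SEGMENT: QUESTIONNAIRE
  Mode: yes; No. categories (incl. NAs): 3
  Frequency table: ‘yes’ 2864, ‘no’ 76
  Valid: 2940 (98%); Missing: 60 (2%)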


Interpretation

Algorithm of the implementation

  1. From the metadata, determine the scale level and the data type of each variable.
  2. For variables of scale level nominal, calculate the mode, and the number of categories and their frequency.
  3. For variables of scale level ordinal, calculate the median, the mode, the range of values (min - max), the interquartile range (1st and 3rd quartiles), and the number of categories and their frequency.
  4. For variables of scale level interval, calculate the mean, the median, the mode, the standard deviation (SD), the MAD, the range of values (min - max), the interquartile range (1st and 3rd quartiles), and the coefficient of variation. In addition, only for data types integer or float, calculate the skewness and its standard error, and the kurtosis.
  5. For variables of scale level ratio, calculate the mean, the median, the mode, the standard deviation (SD), the MAD, the range of values (min - max), the interquartile range (1st and 3rd quartiles), and the coefficient of variation. In addition, only for data types integer or float, calculate the skewness and its standard error, and the kurtosis.
  6. Generate a data frame containing the summary for each variable.
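
Steps 2 and 3 can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the dataquieR implementation: the helper summarise_categorical() is hypothetical, ordinal variables are assumed to be stored as ordered factors, and quartiles are taken on the category ranks with quantile type 1 so that each quartile is an observed category. Counting NA as a category and reporting the IQR as a rank difference mirrors what the example output appears to show (for instance, Q1 = excellent and Q3 = restricted with an IQR of 3), but that reading is an inference from the table.

```r
# Hypothetical helper (not dataquieR code): summary of one categorical
# variable, with extra measures when the scale level is ordinal.
summarise_categorical <- function(x, scale_level = c("nominal", "ordinal")) {
  scale_level <- match.arg(scale_level)
  # Frequency table sorted by descending count; NAs are excluded by table().
  freq <- sort(table(x), decreasing = TRUE)
  out <- list(
    mode = names(freq)[1],
    n_categories = length(unique(x)),  # an NA counts as one category
    frequencies = freq,
    valid = sum(!is.na(x)),
    missing = sum(is.na(x))
  )
  if (scale_level == "ordinal") {
    stopifnot(is.ordered(x))
    codes <- as.integer(x[!is.na(x)])
    # Type 1 quantiles always return an observed value, so each result
    # can be mapped back to a category label.
    q <- quantile(codes, probs = c(0.25, 0.5, 0.75), type = 1, names = FALSE)
    out$q1 <- levels(x)[q[1]]
    out$median <- levels(x)[q[2]]
    out$q3 <- levels(x)[q[3]]
    out$iqr <- q[3] - q[1]             # difference in category ranks
    out$min <- levels(x)[min(codes)]
    out$max <- levels(x)[max(codes)]
  }
  out
}

x <- factor(
  c("low", "mid", "mid", "high", "mid", "high", NA),
  levels = c("low", "mid", "high"),
  ordered = TRUE
)
res_cat <- summarise_categorical(x, scale_level = "ordinal")
str(res_cat)
```

Calling the helper with scale_level = "nominal" returns only the mode, the number of categories, the frequencies, and the valid and missing counts, which corresponds to step 2.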

Concept relations