Description

The des_summary function provides descriptive statistics for numerical and categorical variables in the study data.
Depending on the type of data, the function provides the appropriate measures of central tendency (i.e., mean, median, and mode); measures of dispersion (i.e., standard deviation, interquartile range, median absolute deviation, range of values, and coefficient of variation); the number of categories and their frequencies; the shape of the distribution (skewness and kurtosis); and the amount of missing data. It also provides plots that give an overview of the data distribution.
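As a rough illustration of the measures listed above for a numeric variable, the following base-R sketch computes them for a small vector. It is not the code used inside des_summary: all helper names (stat_mode, central_moment, skewness, excess_kurtosis, numeric_summary) are hypothetical, the skewness and kurtosis use simple moment-based definitions that may differ from the package's estimators, and MAD is assumed to be R's mad(), i.e. the median absolute deviation scaled by 1.4826 (the MAD values in the example output further down, such as 1.483 and 2.965, are multiples of that constant).

```r
# Sketch only: helper names are hypothetical, not dataquieR internals.

# Most frequent value(s) of a numeric vector; all tied values are returned.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

# Moment-based skewness and excess kurtosis (one common definition;
# the estimator used by des_summary may differ).
central_moment <- function(x, k) mean((x - mean(x))^k)
skewness <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
excess_kurtosis <- function(x) central_moment(x, 4) / central_moment(x, 2)^2 - 3

numeric_summary <- function(x) {
  x <- x[!is.na(x)]  # all measures are computed on observed values only
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  list(
    mean      = mean(x),
    median    = median(x),
    mode      = stat_mode(x),
    sd        = sd(x),
    iqr       = q[2] - q[1],
    quartiles = q,
    mad       = mad(x),                  # scaled median absolute deviation
    range     = diff(range(x)),
    min_max   = range(x),
    cv        = 100 * sd(x) / mean(x),   # coefficient of variation in percent
    skewness  = skewness(x),
    kurtosis  = excess_kurtosis(x)
  )
}

x <- c(2, 4, 4, 5, 7, 9, NA)
s <- numeric_summary(x)
str(s)
```

The coefficient of variation is expressed in percent here, which matches the example output (for DBP_0, 100 * 9.2142 / 81.29 gives 11.335).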

Usage and arguments

des_summary(
  resp_vars = NULL,
  study_data = sd1,
  label_col = LABEL,
  meta_data = md1
)

The function has the following arguments:

  • resp_vars: optional, a character specifying the measurement variables of interest. If missing, all variables from the study_data are assessed;
  • study_data: mandatory, the data frame containing the measurements;
  • meta_data: mandatory, the data frame containing the item-level metadata. If it refers to missing tables, these must exist as files or URLs, or must have been loaded beforehand with prep_load_workbook_like_file(), prep_load_folder_with_metadata(), or prep_get_data_frame();
  • label_col: optional, the column in the metadata data frame containing the labels of all the variables in the study data.

Example output

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

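library(dataquieR)

# Load the example metadata workbook and the example study data
prep_load_workbook_like_file("meta_data_v2")
sd1 <- prep_get_data_frame("study_data")

# Compute the descriptive summary for all variables
des_sum <- des_summary(
  study_data = sd1,
  label_col = LABEL
)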

The function generates two outputs, SummaryData and SummaryTable. They are identical in this case but are used differently when a report is created.

Output 1: Summary data frame

The summary data frame is accessed with des_sum$SummaryData. It can be displayed either as an interactive DataTables table (via the DT package):

DT::datatable(des_sum$SummaryData, escape = FALSE)

Or as a kable:

Columns: Variables, Labels, STUDY_SEGMENT, Mean, Median, Mode, SD, IQR (Quartiles), MAD, Range (Min - Max), CV, Skewness, Kurtosis, No. categories/Freq. table, Valid, Missing, Graph. Each variable is listed as one record; columns that are empty for a variable are omitted. The Graph column contains a histogram of each variable (images not reproduced).

CENTER_0: Examination center, v00000, [nominal, integer]; STUDY_SEGMENT: STUDY
  Mode: Berlin
  No. unique values (incl. NA): 5; frequencies: ‘Berlin’ 632, ‘Leipzig’ 602, ‘Munich’ 597, ‘Hamburg’ 592, ‘Cologne’ 577
  Valid: 3000 (100.000%); Missing: 0 (0.000%)

DBP_0: Diastolic blood pressure, v00005, [ratio, float]; STUDY_SEGMENT: PHYS_EXAM
  Mean: 81.29; Median: 81; Mode: 80; SD: 9.2142; IQR: 12 (Q1 = 75 | Q3 = 87); MAD: 8.896; Range: 57 (54 - 111); CV: 11.335; Skewness: 0.0805 (0.0447); Kurtosis: -0.5562
  Valid: 2544 (84.800%); Missing: 456 (15.200%)

GLOBAL_HEALTH_VAS_0: Self-reported global health, v00006, [ratio, float]; STUDY_SEGMENT: PHYS_EXAM
  Mean: 5.027; Median: 5; Mode: 3.2; SD: 2.9184; IQR: 5.075 (Q1 = 2.5 | Q3 = 7.575); MAD: 3.706; Range: 10 (0 - 10); CV: 58.0511; Skewness: -0.0015 (0.0447); Kurtosis: -1.4368
  Valid: 2618 (87.267%); Missing: 382 (12.733%)

ASTHMA_0: Known asthma, v00007, [nominal, integer]; STUDY_SEGMENT: PHYS_EXAM
  Mode: no
  No. unique values (incl. NA): 3; frequencies: ‘no’ 2117, ‘yes’ 524
  Valid: 2641 (88.033%); Missing: 359 (11.967%)

VO2_CAPCAT_0: Aerobic capacity category, v00008, [ordinal, string]; STUDY_SEGMENT: PHYS_EXAM
  Median: good; Mode: excellent; IQR: 3 (Q1 = excellent | Q3 = restricted); Range: (excellent - pathological)
  No. unique values (incl. NA): 6; frequencies: ‘excellent’ 784, ‘good’ 647, ‘moderate’ 500, ‘restricted’ 380, ‘pathological’ 284
  Valid: 0 (0.000%); Missing: 405 (13.500%)

ARM_CIRC_0: Upper arm circumference, v00009, [ratio, float]; STUDY_SEGMENT: PHYS_EXAM
  Mean: 25.033; Median: 25; Mode: 24; SD: 3.9576; IQR: 6 (Q1 = 22 | Q3 = 28); MAD: 4.448; Range: 27 (11 - 38); CV: 15.8097; Skewness: -0.0237 (0.0447); Kurtosis: -0.3594
  Valid: 2657 (88.567%); Missing: 343 (11.433%)

ARM_CIRC_DISC_0: Upper arm circumference cat, v00109, [ordinal, integer]; STUDY_SEGMENT: PHYS_EXAM
  Median: (20,30]; Mode: (20,30]; IQR: 0 (Q1 = (20,30] | Q3 = (20,30]); Range: ((-Inf,20] - (30, Inf])
  No. unique values (incl. NA): 4; frequencies: ‘(20,30]’ 2071, ‘(-Inf,20]’ 344, ‘(30, Inf]’ 218
  Valid: 2633 (87.767%); Missing: 367 (12.233%)

ARM_CUFF_0: Upper arm circumference device, v00010, [ordinal, integer]; STUDY_SEGMENT: PHYS_EXAM
  Median: (20,30]; Mode: (20,30]; IQR: 0 (Q1 = (20,30] | Q3 = (20,30]); Range: ((-Inf,20] - (30, Inf])
  No. unique values (incl. NA): 4; frequencies: ‘(20,30]’ 2015, ‘(-Inf,20]’ 351, ‘(30, Inf]’ 257
  Valid: 2623 (87.433%); Missing: 377 (12.567%)

USR_VO2_0: Aerobic capacity examiner, v00011, [nominal, string]; STUDY_SEGMENT: PHYS_EXAM
  Mode: USR_321
  No. unique values (incl. NA): 16; frequencies: ‘USR_321’ 449, ‘USR_590’ 301, ‘USR_213’ 223, ‘USR_592’ 223, ‘USR_211’ 216, Others 1370
  Valid: 0 (0.000%); Missing: 218 (7.267%)

USR_BP_0: Blood pressure examiner, v00012, [nominal, string]; STUDY_SEGMENT: PHYS_EXAM
  Mode: USR_301
  No. unique values (incl. NA): 16; frequencies: ‘USR_301’ 448, ‘USR_243’ 347, ‘USR_537’ 319, ‘USR_542’ 208, ‘USR_123’ 201, Others 1252
  Valid: 0 (0.000%); Missing: 225 (7.500%)

EXAM_DT_0: Examination date and time, v00013, [interval, datetime]; STUDY_SEGMENT: PHYS_EXAM
  Mean: 2018-07-02 10:09:59 UTC; Median: 2018-07-05 19:45:30 UTC; Mode: 2018-03-21 20:44:05 UTC 2018-04-12 05:25:04 UTC and other 3 dates
  SD: 9064795 secs; IQR: 15347279 secs (Q1 = 2018-04-03 04:24:05 UTC | Q3 = 2018-09-27 19:32:04 UTC); MAD: 11520753 secs; Range: 364 days (2018-01-01 UTC - 2018-12-31 UTC)
  Valid: 2940 (98.000%); Missing: 60 (2.000%)

PART_PHYS_EXAM: Physical exam consent, v20000, [nominal, integer]; STUDY_SEGMENT: PHYS_EXAM
  Mode: yes
  No. unique values (incl. NA): 2; frequencies: ‘yes’ 2940
  Valid: 2940 (98.000%); Missing: 60 (2.000%)

CRP_0: C-reactive protein, v00014, [ratio, float]; STUDY_SEGMENT: LAB
  Mean: 2.888; Median: 2.587; Mode: 0.16; SD: 1.8053; IQR: 2.27 (Q1 = 1.608 | Q3 = 3.878); MAD: 1.637; Range: 11.894 (0.118 - 12.012); CV: 62.5065; Skewness: 0.8966 (0.0447); Kurtosis: 0.9983
  Valid: 2699 (89.967%); Missing: 301 (10.033%)

BSG_0: Erythrocyte sedimentation rate, v00015, [ratio, float]; STUDY_SEGMENT: LAB
  Mean: 14.857; Median: 11; Mode: 10; SD: 12.1348; IQR: 14 (Q1 = 6 | Q3 = 20); MAD: 10.378; Range: 96 (0 - 96); CV: 81.6771; Skewness: 1.3774 (0.0447); Kurtosis: 2.678
  Valid: 2686 (89.533%); Missing: 314 (10.467%)

DEV_NO_0: Device ID, v00016, [nominal, integer]; STUDY_SEGMENT: LAB
  Mode: 2
  No. unique values (incl. NA): 6; frequencies: ‘2’ 661, ‘3’ 626, ‘1’ 593, ‘4’ 412, ‘5’ 400
  Valid: 2692 (89.733%); Missing: 308 (10.267%)

LAB_DT_0: Lab analysis date and time, v00017, [interval, datetime]; STUDY_SEGMENT: LAB
  Mean: 2018-07-02 12:00:36 UTC; Median: 2018-07-05 21:50:00 UTC; Mode: 2018-01-01 02:00:00 UTC 2018-01-10 15:55:28 UTC and other 59 dates
  SD: 9064818 secs; IQR: 15342119 secs (Q1 = 2018-04-03 06:21:05 UTC | Q3 = 2018-09-27 20:03:04 UTC); MAD: 192027.4 mins; Range: 364.0007 days (2018-01-01 02:00:00 UTC - 2018-12-31 02:01:00 UTC)
  Valid: 2940 (98.000%); Missing: 60 (2.000%)

PART_LAB: Lab analysis consent, v30000, [nominal, integer]; STUDY_SEGMENT: LAB
  Mode: yes
  No. unique values (incl. NA): 2; frequencies: ‘yes’ 2940
  Valid: 2940 (98.000%); Missing: 60 (2.000%)

EDUCATION_0: Highest educational level B/L, v00018, [ordinal, integer]; STUDY_SEGMENT: INTERVIEW
  Median: uppersecond; Mode: uppersecond; IQR: 2 (Q1 = secondary | Q3 = postsecond); Range: (pre-primary - secondtertiary)
  No. unique values (incl. NA): 8; frequencies: ‘uppersecond’ 568, ‘secondary’ 557, ‘postsecond’ 432, ‘primary’ 376, ‘tertiary’ 258, Others 281
  Valid: 2472 (82.400%); Missing: 528 (17.600%)

EDUCATION_1: Highest educational level F/U, v01018, [ordinal, integer]; STUDY_SEGMENT: INTERVIEW
  Median: uppersecond; Mode: uppersecond; IQR: 2 (Q1 = secondary | Q3 = postsecond); Range: (pre-primary - secondtertiary)
  No. unique values (incl. NA): 8; frequencies: ‘uppersecond’ 562, ‘secondary’ 541, ‘postsecond’ 428, ‘primary’ 360, ‘tertiary’ 260, Others 271
  Valid: 2422 (80.733%); Missing: 578 (19.267%)

FAM_STAT_0: Marital status, v00019, [nominal, integer]; STUDY_SEGMENT: INTERVIEW
  Mode: 1
  No. unique values (incl. NA): 5; frequencies: ‘1’ 845, ‘2’ 721, ‘0’ 588, ‘3’ 235
  Valid: 2389 (79.633%); Missing: 611 (20.367%)

MARRIED_0: Currently married, v00020, [nominal, integer]; STUDY_SEGMENT: INTERVIEW
  Mode: no
  No. unique values (incl. NA): 3; frequencies: ‘no’ 2005, ‘yes’ 361
  Valid: 2366 (78.867%); Missing: 634 (21.133%)

SEX_0: Sex B/L, v00002, [nominal, integer]; STUDY_SEGMENT: STUDY
  Mode: females
  No. unique values (incl. NA): 3; frequencies: ‘females’ 1478, ‘males’ 1462
  Valid: 2940 (98.000%); Missing: 60 (2.000%)

N_CHILD_0: Number of children, v00021, [ratio, integer]; STUDY_SEGMENT: INTERVIEW
  Mean: 2.499; Median: 2; Mode: 2; SD: 1.5297; IQR: 2 (Q1 = 1 | Q3 = 3); MAD: 1.483; Range: 9 (0 - 9); CV: 61.2097; Skewness: 0.4907 (0.0447); Kurtosis: -0.2911
  Valid: 2336 (77.867%); Missing: 664 (22.133%)

EATING_PREFS_0: Eating preferences, v00022, [nominal, integer]; STUDY_SEGMENT: INTERVIEW
  Mode: none
  No. unique values (incl. NA): 4; frequencies: ‘none’ 1366, ‘vegetarian’ 712, ‘vegan’ 250
  Valid: 2328 (77.600%); Missing: 672 (22.400%)

MEAT_CONS_0: Meat consumption, v00023, [ordinal, integer]; STUDY_SEGMENT: INTERVIEW
  Median: 1-2d a week; Mode: never; IQR: 2 (Q1 = never | Q3 = 3-4d a week); Range: (never - daily)
  No. unique values (incl. NA): 6; frequencies: ‘never’ 946, ‘3-4d a week’ 475, ‘1-2d a week’ 382, ‘5-6d a week’ 326, ‘daily’ 173
  Valid: 2302 (76.733%); Missing: 698 (23.267%)

SMOKING_0: Current smoker, v00024, [nominal, integer]; STUDY_SEGMENT: INTERVIEW
  Mode: no
  No. unique values (incl. NA): 3; frequencies: ‘no’ 1585, ‘yes’ 707
  Valid: 2292 (76.400%); Missing: 708 (23.600%)

SMOKE_SHOP_0: Purchasing tobacco products, v00025, [ordinal, integer]; STUDY_SEGMENT: INTERVIEW
  Median: 3-4d a week; Mode: 3-4d a week; IQR: 2 (Q1 = 1-2d a week | Q3 = 5-6d a week); Range: (never - daily)
  No. unique values (incl. NA): 6; frequencies: ‘3-4d a week’ 176, ‘5-6d a week’ 169, ‘daily’ 154, ‘1-2d a week’ 150, ‘never’ 133
  Valid: 782 (26.067%); Missing: 2218 (73.933%)

N_INJURIES_0: Number of injuries, v00026, [ratio, integer]; STUDY_SEGMENT: INTERVIEW
  Mean: 4.588; Median: 4; Mode: 4; SD: 2.4221; IQR: 3 (Q1 = 3 | Q3 = 6); MAD: 2.965; Range: 14 (0 - 14); CV: 52.7921; Skewness: 0.4743 (0.0447); Kurtosis: -0.5724
  Valid: 2199 (73.300%); Missing: 801 (26.700%)

N_BIRTH_0: Number of births, v00027, [ratio, integer]; STUDY_SEGMENT: INTERVIEW
  Mean: 3.458; Median: 3; Mode: 3; SD: 1.7707; IQR: 2 (Q1 = 2 | Q3 = 4); MAD: 1.483; Range: 12 (0 - 12); CV: 51.2115; Skewness: 0.2105 (0.0447); Kurtosis: -1.7027
  Valid: 1099 (36.633%); Missing: 1901 (63.367%)

INCOME_GROUP_0: Income group, v00028, [ordinal, integer]; STUDY_SEGMENT: INTERVIEW
  Median: [30-50k); Mode: [30-50k); IQR: 2 (Q1 = [10-30k) | Q3 = [50-70k)); Range: (below 10k - above 90k)
  No. unique values (incl. NA): 7; frequencies: ‘[30-50k)’ 622, ‘[10-30k)’ 584, ‘[50-70k)’ 395, ‘below 10k’ 294, ‘[70-90k)’ 205, Others 74
  Valid: 2174 (72.467%); Missing: 826 (27.533%)

PREGNANT_0: Currently pregnant, v00029, [nominal, integer]; STUDY_SEGMENT: INTERVIEW
  Mode: no
  No. unique values (incl. NA): 3; frequencies: ‘no’ 996, ‘yes’ 69
  Valid: 1065 (35.500%); Missing: 1935 (64.500%)

MEDICATION_0: Medication use, v00030, [nominal, integer]; STUDY_SEGMENT: INTERVIEW
  Mode: yes
  No. unique values (incl. NA): 2; frequencies: ‘yes’ 292
  Valid: 292 (9.733%); Missing: 2708 (90.267%)

AGE_0: Age B/L, v00003, [ratio, integer]; STUDY_SEGMENT: STUDY
  Mean: 49.914; Median: 50; Mode: 51; SD: 4.4232; IQR: 6 (Q1 = 47 | Q3 = 53); MAD: 4.448; Range: 30 (33 - 63); CV: 8.8616; Skewness: -0.0367 (0.0447); Kurtosis: -0.1761
  Valid: 2940 (98.000%); Missing: 60 (2.000%)

N_ATC_CODES_0: Number of ATC codes, v00031, [ratio, integer]; STUDY_SEGMENT: INTERVIEW
  Mean: 2.262; Median: 1; Mode: 0; SD: 2.7257; IQR: 3 (Q1 = 0 | Q3 = 3); MAD: 1.483; Range: 22 (0 - 22); CV: 120.4786; Skewness: 1.4057 (0.0447); Kurtosis: 3.0554
  Valid: 2058 (68.600%); Missing: 942 (31.400%)

USR_SOCDEM_0: Sociodemographics examiner, v00032, [nominal, string]; STUDY_SEGMENT: INTERVIEW
  Mode: USR_321
  No. unique values (incl. NA): 16; frequencies: ‘USR_321’ 380, ‘USR_247’ 297, ‘USR_520’ 290, ‘USR_120’ 172, ‘USR_125’ 159, Others 988
  Valid: 0 (0.000%); Missing: 714 (23.800%)

INT_DT_0: Interview date and time, v00033, [interval, datetime]; STUDY_SEGMENT: INTERVIEW
  Mean: 2018-07-02 12:40:34 UTC; Median: 2018-07-05 22:27:30 UTC; Mode: 2018-08-15 22:55:34 UTC 2018-08-24 01:57:42 UTC
  SD: 9064782 secs; IQR: 15348014 secs (Q1 = 2018-04-03 06:44:50 UTC | Q3 = 2018-09-27 22:05:04 UTC); MAD: 11522577 secs; Range: 363.9972 days (2018-01-01 02:24:00 UTC - 2018-12-31 02:20:00 UTC)
  Valid: 2940 (98.000%); Missing: 60 (2.000%)

PART_INTERVIEW: Interview consent, v40000, [nominal, integer]; STUDY_SEGMENT: INTERVIEW
  Mode: yes
  No. unique values (incl. NA): 3; frequencies: ‘yes’ 2924, ‘no’ 16
  Valid: 2940 (98.000%); Missing: 60 (2.000%)

ITEM_1_0: Item 1, v00034, [ratio, integer]; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 3.037; Median: 3; Mode: 3; SD: 1.764; IQR: 2 (Q1 = 2 | Q3 = 4); MAD: 1.483; Range: 9 (0 - 9); CV: 58.0849; Skewness: 0.4138 (0.0447); Kurtosis: -0.5894
  Valid: 2248 (74.933%); Missing: 752 (25.067%)

ITEM_2_0: Item 2, v00035, [ratio, integer]; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 2.988; Median: 3; Mode: 2; SD: 1.7008; IQR: 2 (Q1 = 2 | Q3 = 4); MAD: 1.483; Range: 10 (0 - 10); CV: 56.9277; Skewness: 0.4079 (0.0447); Kurtosis: -0.648
  Valid: 2197 (73.233%); Missing: 803 (26.767%)

ITEM_3_0: Item 3, v00036, [ratio, integer]; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 3.014; Median: 3; Mode: 3; SD: 1.7175; IQR: 2 (Q1 = 2 | Q3 = 4); MAD: 1.483; Range: 10 (0 - 10); CV: 56.9898; Skewness: 0.4105 (0.0447); Kurtosis: -0.6058
  Valid: 2184 (72.800%); Missing: 816 (27.200%)

ITEM_4_0: Item 4, v00037, [ratio, integer]; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 3; Median: 3; Mode: 3; SD: 1.7214; IQR: 2 (Q1 = 2 | Q3 = 4); MAD: 1.483; Range: 10 (0 - 10); CV: 57.3701; Skewness: 0.4347 (0.0447); Kurtosis: -0.5563
  Valid: 2143 (71.433%); Missing: 857 (28.567%)

ITEM_5_0: Item 5, v00038, [ratio, integer]; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 6.021; Median: 6; Mode: 6; SD: 2.3738; IQR: 4 (Q1 = 4 | Q3 = 8); MAD: 2.965; Range: 10 (0 - 10); CV: 39.4237; Skewness: -0.1831 (0.0447); Kurtosis: -1.3516
  Valid: 2074 (69.133%); Missing: 926 (30.867%)

ITEM_6_0: Item 6, v00039, [ratio, integer]; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 5.948; Median: 6; Mode: 6; SD: 2.3706; IQR: 4 (Q1 = 4 | Q3 = 8); MAD: 2.965; Range: 10 (0 - 10); CV: 39.8567; Skewness: -0.1194 (0.0447); Kurtosis: -1.4139
  Valid: 2048 (68.267%); Missing: 952 (31.733%)

AGE_GROUP_0: Age group B/L, v00103, [ordinal, string]; STUDY_SEGMENT: STUDY
  Median: 50-59; Mode: 50-59
  No. unique values (incl. NA): 5; frequencies: ‘50-59’ 1554, ‘40-49’ 1322, ‘60-69’ 39, ‘30-39’ 25
  Valid: 0 (0.000%); Missing: 60 (2.000%)

ITEM_7_0: Item 7, v00040, [ratio, integer]; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 6.037; Median: 6; Mode: 6; SD: 2.4007; IQR: 4 (Q1 = 4 | Q3 = 8); MAD: 2.965; Range: 10 (0 - 10); CV: 39.7687; Skewness: -0.1739 (0.0447); Kurtosis: -1.3982
  Valid: 2068 (68.933%); Missing: 932 (31.067%)

ITEM_8_0: Item 8, v00041, [ratio, integer]; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 5.895; Median: 6; Mode: 6; SD: 2.3974; IQR: 4 (Q1 = 4 | Q3 = 8); MAD: 2.965; Range: 10 (0 - 10); CV: 40.6699; Skewness: -0.0971 (0.0447); Kurtosis: -1.4874
  Valid: 2013 (67.100%); Missing: 987 (32.900%)

QUEST_DT_0: Questionnaire date and time, v00042, [interval, datetime]; STUDY_SEGMENT: QUESTIONNAIRE
  Mean: 2018-08-04 14:59:47 UTC; Median: 2018-07-30 08:21:46 UTC; Mode: 2018-03-17 01:21:08 UTC 2018-04-07 19:44:05 UTC and other 4 dates
  SD: 12050776 secs; IQR: 16542372 secs (Q1 = 2018-04-21 09:39:33 UTC | Q3 = 2018-10-29 20:45:45 UTC); MAD: 12215994 secs; Range: 1032.387 days (2018-01-11 04:26:00 UTC - 2020-11-08 13:43:28 UTC)
  Valid: 2931 (97.700%); Missing: 69 (2.300%)

PART_QUESTIONNAIRE: Questionnaire consent, v50000, [nominal, integer]; STUDY_SEGMENT: QUESTIONNAIRE
  Mode: yes
  No. unique values (incl. NA): 3; frequencies: ‘yes’ 2864, ‘no’ 76
  Valid: 2940 (98.000%); Missing: 60 (2.000%)

AGE_1: Age F/U, v01003, [ratio, integer]; STUDY_SEGMENT: STUDY
  Mean: 49.872; Median: 50; Mode: 51; SD: 4.4291; IQR: 6 (Q1 = 47 | Q3 = 53); MAD: 4.448; Range: 30 (33 - 63); CV: 8.8808; Skewness: -0.0315 (0.0447); Kurtosis: -0.1855
  Valid: 2940 (98.000%); Missing: 60 (2.000%)

SEX_1: Sex F/U, v01002, [nominal, integer]; STUDY_SEGMENT: STUDY
  Mode: females
  No. unique values (incl. NA): 3; frequencies: ‘females’ 1472, ‘males’ 1468
  Valid: 2940 (98.000%); Missing: 60 (2.000%)

PART_STUDY: Study consent, v10000, [nominal, integer]; STUDY_SEGMENT: STUDY
  Mode: yes
  No. unique values (incl. NA): 2; frequencies: ‘yes’ 2940
  Valid: 2940 (98.000%); Missing: 60 (2.000%)

SBP_0: Systolic blood pressure, v00004, [ratio, float]; STUDY_SEGMENT: PHYS_EXAM
  Mean: 126.516; Median: 127; Mode: 130; SD: 9.613; IQR: 13 (Q1 = 120 | Q3 = 133); MAD: 8.896; Range: 63 (97 - 160); CV: 7.5982; Skewness: 0.0636 (0.0447); Kurtosis: -0.5639
  Valid: 2561 (85.367%); Missing: 439 (14.633%)
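
To make the Valid and Missing columns easier to read, here is a small base-R sketch of how such counts and percentages could be derived for one variable. The helper valid_missing is hypothetical, and it treats only NA as missing, which is a simplification: it does not handle missing codes defined in the metadata, so it is not a reimplementation of what des_summary does.

```r
# Sketch only: counts observed and missing values, treating NA as missing.
valid_missing <- function(x) {
  n <- length(x)
  n_valid <- sum(!is.na(x))
  n_missing <- n - n_valid
  # Percentages with three decimals, as in the example output
  pct <- function(k) sprintf("%.3f%%", 100 * k / n)
  c(
    Valid   = paste0(n_valid, " (", pct(n_valid), ")"),
    Missing = paste0(n_missing, " (", pct(n_missing), ")")
  )
}

# Toy vector with the same counts as DBP_0: 2544 observed, 456 missing
x <- c(rep(80, 2544), rep(NA, 456))
vm <- valid_missing(x)
print(vm)
```

With these counts, the helper returns "2544 (84.800%)" and "456 (15.200%)", the same figures shown for DBP_0.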


Interpretation

Algorithm of the implementation

  1. From the metadata, determine the scale level and the data type of each variable.
  2. For variables of scale level nominal, calculate the mode, and the number of categories and their frequency.
  3. For variables of scale level ordinal, calculate the median, the mode, the range of values (min - max), the interquartile range (1st and 3rd quartiles), and the number of categories and their frequency.
  4. For variables of scale level interval, calculate the mean, the median, the mode, the standard deviation (SD), the MAD, the range of values (min - max), the interquartile range (1st and 3rd quartiles), and the coefficient of variation. In addition, only for data types integer or float, calculate the skewness and its standard error, and the kurtosis.
  5. For variables of scale level ratio, calculate the mean, the median, the mode, the standard deviation (SD), the MAD, the range of values (min - max), the interquartile range (1st and 3rd quartiles), and the coefficient of variation. In addition, only for data types integer or float, calculate the skewness and its standard error, and the kurtosis.
  6. Generate a data frame containing the summary for each variable.
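
The selection logic in steps 1 to 5 can be sketched as a small lookup in base R. This is only an illustration of which measures apply to which combination of scale level and data type; the function measures_for, the measure labels, and the toy metadata frame (with assumed column names VAR_NAMES, SCALE_LEVEL, and DATA_TYPE) are hypothetical and do not reflect how dataquieR is implemented.

```r
# Sketch only: which measures apply for a given scale level and data type.
measures_for <- function(scale_level, data_type) {
  categorical   <- c("n_categories", "freq_table")
  ordered_stats <- c("median", "range", "iqr")
  numeric_stats <- c("mean", "sd", "mad", "cv")
  shape_stats   <- c("skewness", "skewness_se", "kurtosis")

  out <- switch(
    scale_level,
    nominal  = c("mode", categorical),                    # step 2
    ordinal  = c("mode", ordered_stats, categorical),     # step 3
    interval = ,                                          # step 4 (falls through)
    ratio    = c("mode", ordered_stats, numeric_stats),   # step 5
    stop("unknown scale level: ", scale_level)
  )
  # Shape measures only for integer or float interval/ratio variables
  if (scale_level %in% c("interval", "ratio") &&
      data_type %in% c("integer", "float")) {
    out <- c(out, shape_stats)
  }
  out
}

# Toy metadata using four variables from the example output above
meta <- data.frame(
  VAR_NAMES   = c("v00000", "v00008", "v00013", "v00005"),
  SCALE_LEVEL = c("nominal", "ordinal", "interval", "ratio"),
  DATA_TYPE   = c("integer", "string", "datetime", "float"),
  stringsAsFactors = FALSE
)

plan <- Map(measures_for, meta$SCALE_LEVEL, meta$DATA_TYPE)
names(plan) <- meta$VAR_NAMES
str(plan)
```

In this sketch the datetime variable (v00013) gets no skewness or kurtosis, while the float variable (v00005) does, mirroring the empty Skewness and Kurtosis cells for date-time variables in the example output.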

Concept relations