Description

The des_summary function provides descriptive statistics for numerical and categorical variables in the study data.
Depending on the type of data, the function provides the appropriate measures of central tendency (mean, median, and mode) and of dispersion (standard deviation, interquartile range, median absolute deviation, range of values, and coefficient of variation), as well as information on the number of categories and their frequencies, on the shape of the distribution (skewness and kurtosis), and on missing data. It also provides plots that give an overview of the data distribution. The derived functions des_summary_categorical and des_summary_continuous provide only the descriptive statistics appropriate for the data type named in the function. All of these functions also work without metadata.
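
As a rough illustration of these measures, the following base R sketch computes them for a small made-up vector. It is not the dataquieR implementation: the mode rule (first most frequent value), the moment-based skewness and kurtosis formulas, and the use of mad() (the median absolute deviation scaled by 1.4826) are assumptions made for this example. The MAD values in the example output of this tutorial, such as 1.483 and 2.965, are multiples of 1.4826, which appears consistent with that last assumption.

```r
# Toy numeric vector with one missing value (illustrative data, not study data)
x <- c(47, 50, 51, 51, 53, 49, NA)
valid <- x[!is.na(x)]

# Mode: the most frequent value (the first one if there are ties)
tab <- table(valid)
mode_value <- as.numeric(names(tab)[which.max(tab)])

# Moment-based skewness and excess kurtosis (one common definition, assumed here)
m <- mean(valid)
s2 <- mean((valid - m)^2)
skewness <- mean((valid - m)^3) / s2^1.5
kurtosis <- mean((valid - m)^4) / s2^2 - 3

summary_stats <- list(
  mean      = m,
  median    = median(valid),
  mode      = mode_value,
  sd        = sd(valid),
  iqr       = IQR(valid),
  mad       = mad(valid),            # scaled median absolute deviation
  range     = diff(range(valid)),    # max - min
  cv        = sd(valid) / m * 100,   # coefficient of variation in percent
  skewness  = skewness,
  kurtosis  = kurtosis,
  n_valid   = length(valid),
  n_missing = sum(is.na(x))
)
str(summary_stats)
```

With this definition, the CV is the standard deviation expressed as a percentage of the mean. For the Age B/L row of the example output, 4.4232 / 49.914 × 100 ≈ 8.86, in line with the CV reported there.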

Usage and arguments

des_summary(
  resp_vars = NULL,
  study_data = sd1,
  label_col = LABEL,
  meta_data = md1
)

The function has the following arguments:

  • resp_vars: optional, a character vector specifying the measurement variables of interest. If missing, all variables in study_data are assessed;
  • study_data: mandatory, the data frame containing the measurements;
  • meta_data: optional, the data frame containing the item-level metadata. If it refers to missing tables, these must either exist as files or URLs or be loaded beforehand using prep_load_workbook_like_file(), prep_load_folder_with_metadata(), or prep_get_data_frame();
  • label_col: optional, the column in the metadata data frame containing the labels of all the variables in the study data.

Example output

To illustrate the output, we use the example synthetic data and metadata that are bundled with the dataquieR package. See the introductory tutorial for instructions on importing these files into R, as well as details on their structure and contents.

prep_load_workbook_like_file("meta_data_v2")
sd1 <- prep_get_data_frame("study_data")
des_sum <- des_summary(
  study_data = sd1,
  label_col = LABEL
)
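
Building on the call above, the summary can presumably be restricted to selected variables through resp_vars. The sketch below assumes that, with label_col = LABEL, the variables are addressed by their labels (SBP_0 and DBP_0 in the example metadata), and it only runs the dataquieR calls if the package is installed. The object names vars_of_interest and des_bp are chosen for this illustration.

```r
# Variables of interest, given as labels (assumption: label_col = LABEL)
vars_of_interest <- c("SBP_0", "DBP_0")

if (requireNamespace("dataquieR", quietly = TRUE)) {
  library(dataquieR)
  # Load the bundled example metadata and study data, as in the example above
  prep_load_workbook_like_file("meta_data_v2")
  sd1 <- prep_get_data_frame("study_data")
  # Same call as above, restricted to the two blood pressure variables
  des_bp <- des_summary(
    resp_vars = vars_of_interest,
    study_data = sd1,
    label_col = LABEL
  )
}
```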

The function generates two outputs, SummaryData and SummaryTable, which are identical in this case but are used differently when creating a report.

Output 1: Summary data frame

The summary data frame is accessed using des_sum$SummaryData. It can be displayed in two ways.

Either as an interactive data.tables table:

des_sum

Or as a kable:

|  | Variables | Type | STUDY_SEGMENT | Mean | Median | Mode | SD | IQR (Quartiles) | MAD | Range (Min - Max) | CV | Skewness | Kurtosis | No. categories/Freq. table | Valid | Missing | Graph |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Examination center; CENTER_0; v00000 | nominal, integer | STUDY | NA |  | Berlin | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 5; ‘Berlin’: 632, ‘Leipzig’: 602, ‘Munich’: 597, ‘Hamburg’: 592, ‘Cologne’: 577 | 3000 (100.00%) | 0 (0.00%) | Histogram of Examination center |
| 2 | Sex B/L; SEX_0; v00002 | nominal, integer | STUDY | NA |  | females | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 3; ‘females’: 1478, ‘males’: 1462 | 2940 (98.00%) | 60 (2.00%) | Histogram of Sex B/L |
| 3 | Age B/L; AGE_0; v00003 | ratio, integer | STUDY | 49.914 | 50 | 51 | 4.4232 | 6 (Q1 = 47; Q3 = 53) | 4.448 | 30 (33 - 63) | 8.8616 | -0.0367 (0.0447) | -0.1761 | NA | 2940 (98.000%) | 60 (2.000%) | Histogram of Age B/L |
| 43 | Age group B/L; AGE_GROUP_0; v00103 | ordinal, string | STUDY | NA | 50-59 | 50-59 | NA | 1 (Q1 = 40-49; Q3 = 50-59) | NA | (30-39 - 60-69) | NA | NA | NA | No. unique values (incl. NA): 5; ‘50-59’: 1554, ‘40-49’: 1322, ‘60-69’: 39, ‘30-39’: 25 | 0 (0.00%) | 60 (2.00%) | Histogram of Age group B/L |
| 46 | Age F/U; AGE_1; v01003 | ratio, integer | STUDY | 49.872 | 50 | 51 | 4.4291 | 6 (Q1 = 47; Q3 = 53) | 4.448 | 30 (33 - 63) | 8.8808 | -0.0315 (0.0447) | -0.1855 | NA | 2940 (98.000%) | 60 (2.000%) | Histogram of Age F/U |
| 45 | Sex F/U; SEX_1; v01002 | nominal, integer | STUDY | NA |  | females | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 3; ‘females’: 1472, ‘males’: 1468 | 2940 (98.00%) | 60 (2.00%) | Histogram of Sex F/U |
| 48 | Study consent; PART_STUDY; v10000 | nominal, integer | STUDY | NA |  | yes | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 2; ‘yes’: 2940, ‘no’: 0 | 2940 (98.00%) | 60 (2.00%) | Histogram of Study consent |
| 4 | Systolic blood pressure; SBP_0; v00004 | ratio, float | PHYS_EXAM | 126.516 | 127 | 130 | 9.613 | 13 (Q1 = 120; Q3 = 133) | 8.896 | 63 (97 - 160) | 7.5982 | 0.0636 (0.0447) | -0.5639 | NA | 2561 (85.367%) | 439 (14.633%) | Histogram of Systolic blood pressure |
| 5 | Diastolic blood pressure; DBP_0; v00005 | ratio, float | PHYS_EXAM | 81.29 | 81 | 80 | 9.2142 | 12 (Q1 = 75; Q3 = 87) | 8.896 | 57 (54 - 111) | 11.335 | 0.0805 (0.0447) | -0.5562 | NA | 2544 (84.800%) | 456 (15.200%) | Histogram of Diastolic blood pressure |
| 6 | Self-reported global health; GLOBAL_HEALTH_VAS_0; v00006 | ratio, float | PHYS_EXAM | 5.027 | 5 | 3.2 | 2.9184 | 5.075 (Q1 = 2.5; Q3 = 7.575) | 3.706 | 10 (0 - 10) | 58.0511 | -0.0015 (0.0447) | -1.4368 | NA | 2618 (87.267%) | 382 (12.733%) | Histogram of Self-reported global health |
| 7 | Known asthma; ASTHMA_0; v00007 | nominal, integer | PHYS_EXAM | NA |  | no | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 3; ‘no’: 2117, ‘yes’: 524 | 2641 (88.03%) | 359 (11.97%) | Histogram of Known asthma |
| 8 | Aerobic capacity category; VO2_CAPCAT_0; v00008 | ordinal, string | PHYS_EXAM | NA | good | excellent | NA | 3 (Q1 = excellent; Q3 = restricted) | NA | (excellent - pathological) | NA | NA | NA | No. unique values (incl. NA): 6; ‘excellent’: 784, ‘good’: 647, ‘moderate’: 500, ‘restricted’: 380, ‘pathological’: 284 | 0 (0.00%) | 405 (13.50%) | Histogram of Aerobic capacity category |
| 9 | Upper arm circumference; ARM_CIRC_0; v00009 | ratio, float | PHYS_EXAM | 25.033 | 25 | 24 | 3.9576 | 6 (Q1 = 22; Q3 = 28) | 4.448 | 27 (11 - 38) | 15.8097 | -0.0237 (0.0447) | -0.3594 | NA | 2657 (88.567%) | 343 (11.433%) | Histogram of Upper arm circumference |
| 44 | Upper arm circumference cat; ARM_CIRC_DISC_0; v00109 | ordinal, integer | PHYS_EXAM | NA | (20,30] | (20,30] | NA | 0 (Q1 = (20,30]; Q3 = (20,30]) | NA | ((-Inf,20] - (30, Inf]) | NA | NA | NA | No. unique values (incl. NA): 4; ‘(20,30]’: 2071, ‘(-Inf,20]’: 344, ‘(30, Inf]’: 218 | 2633 (87.77%) | 367 (12.23%) | Histogram of Upper arm circumference cat |
| 10 | Upper arm circumference device; ARM_CUFF_0; v00010 | ordinal, integer | PHYS_EXAM | NA | (20,30] | (20,30] | NA | 0 (Q1 = (20,30]; Q3 = (20,30]) | NA | ((-Inf,20] - (30, Inf]) | NA | NA | NA | No. unique values (incl. NA): 4; ‘(20,30]’: 2015, ‘(-Inf,20]’: 351, ‘(30, Inf]’: 257 | 2623 (87.43%) | 377 (12.57%) | Histogram of Upper arm circumference device |
| 11 | Aerobic capacity examiner; USR_VO2_0; v00011 | nominal, string | PHYS_EXAM | NA |  | USR_321 | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 16; ‘USR_321’: 449, ‘USR_590’: 301, ‘USR_213’: 223, ‘USR_592’: 223, ‘USR_211’: 216, Others: 1370 | 0 (0.00%) | 218 (7.27%) | Histogram of Aerobic capacity examiner |
| 12 | Blood pressure examiner; USR_BP_0; v00012 | nominal, string | PHYS_EXAM | NA |  | USR_301 | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 16; ‘USR_301’: 448, ‘USR_243’: 347, ‘USR_537’: 319, ‘USR_542’: 208, ‘USR_123’: 201, Others: 1252 | 0 (0.00%) | 225 (7.50%) | Histogram of Blood pressure examiner |
| 13 | Examination date and time; EXAM_DT_0; v00013 | interval, datetime | PHYS_EXAM | 2018-07-02 10:09:59 UTC | 2018-07-05 19:45:30 UTC | 2018-03-21 20:44:05 UTC 2018-04-12 05:25:04 UTC and other 3 dates | 9064795 secs | 15347279 secs (Q1 = 2018-04-03 04:24:05 UTC; Q3 = 2018-09-27 19:32:04 UTC) | 11520753 secs | 364 days (2018-01-01 UTC - 2018-12-31 UTC) |  |  |  | NA | 2940 (98.000%) | 60 (2.000%) | Histogram of Examination date and time |
| 49 | Physical exam consent; PART_PHYS_EXAM; v20000 | nominal, integer | PHYS_EXAM | NA |  | yes | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 2; ‘yes’: 2940, ‘no’: 0 | 2940 (98.00%) | 60 (2.00%) | Histogram of Physical exam consent |
| 14 | C-reactive protein; CRP_0; v00014 | ratio, float | LAB | 2.888 | 2.587 | 0.16 | 1.8053 | 2.27 (Q1 = 1.608; Q3 = 3.878) | 1.637 | 11.894 (0.118 - 12.012) | 62.5065 | 0.8966 (0.0447) | 0.9983 | NA | 2699 (89.967%) | 301 (10.033%) | Histogram of C-reactive protein |
| 15 | Erythrocyte sedimentation rate; BSG_0; v00015 | ratio, float | LAB | 14.857 | 11 | 10 | 12.1348 | 14 (Q1 = 6; Q3 = 20) | 10.378 | 96 (0 - 96) | 81.6771 | 1.3774 (0.0447) | 2.678 | NA | 2686 (89.533%) | 314 (10.467%) | Histogram of Erythrocyte sedimentation rate |
| 16 | Device ID; DEV_NO_0; v00016 | nominal, integer | LAB | NA |  | 2 | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 6; ‘2’: 661, ‘3’: 626, ‘1’: 593, ‘4’: 412, ‘5’: 400 | 2692 (89.73%) | 308 (10.27%) | Histogram of Device ID |
| 17 | Lab analysis date and time; LAB_DT_0; v00017 | interval, datetime | LAB | 2018-07-02 12:00:36 UTC | 2018-07-05 21:50:00 UTC | 2018-01-01 02:00:00 UTC 2018-01-10 15:55:28 UTC and other 59 dates | 9064818 secs | 15342119 secs (Q1 = 2018-04-03 06:21:05 UTC; Q3 = 2018-09-27 20:03:04 UTC) | 192027.4 mins | 364.0007 days (2018-01-01 02:00:00 UTC - 2018-12-31 02:01:00 UTC) |  |  |  | NA | 2940 (98.000%) | 60 (2.000%) | Histogram of Lab analysis date and time |
| 50 | Lab analysis consent; PART_LAB; v30000 | nominal, integer | LAB | NA |  | yes | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 2; ‘yes’: 2940, ‘no’: 0 | 2940 (98.00%) | 60 (2.00%) | Histogram of Lab analysis consent |
| 18 | Highest educational level B/L; EDUCATION_0; v00018 | ordinal, integer | INTERVIEW | NA | uppersecond | uppersecond | NA | 2 (Q1 = secondary; Q3 = postsecond) | NA | (pre-primary - secondtertiary) | NA | NA | NA | No. unique values (incl. NA): 8; ‘uppersecond’: 568, ‘secondary’: 557, ‘postsecond’: 432, ‘primary’: 376, ‘tertiary’: 258, Others: 281 | 2472 (82.40%) | 528 (17.60%) | Histogram of Highest educational level B/L |
| 47 | Highest educational level F/U; EDUCATION_1; v01018 | ordinal, integer | INTERVIEW | NA | uppersecond | uppersecond | NA | 2 (Q1 = secondary; Q3 = postsecond) | NA | (pre-primary - 7) | NA | NA | NA | No. unique values (incl. NA): 9; ‘uppersecond’: 562, ‘secondary’: 541, ‘postsecond’: 428, ‘primary’: 360, ‘tertiary’: 260, Others: 274 | 2425 (80.83%) | 575 (19.17%) | Histogram of Highest educational level F/U |
| 19 | Marital status; FAM_STAT_0; v00019 | nominal, integer | INTERVIEW | NA |  | 1 | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 5; ‘1’: 845, ‘2’: 721, ‘0’: 588, ‘3’: 235, ‘single’: 0, Others: 0 | 2389 (79.63%) | 611 (20.37%) | Histogram of Marital status |
| 20 | Currently married; MARRIED_0; v00020 | nominal, integer | INTERVIEW | NA |  | no | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 3; ‘no’: 2005, ‘yes’: 361 | 2366 (78.87%) | 634 (21.13%) | Histogram of Currently married |
| 21 | Number of children; N_CHILD_0; v00021 | ratio, integer | INTERVIEW | 2.499 | 2 | 2 | 1.5297 | 2 (Q1 = 1; Q3 = 3) | 1.483 | 9 (0 - 9) | 61.2097 | 0.4907 (0.0447) | -0.2911 | NA | 2336 (77.867%) | 664 (22.133%) | Histogram of Number of children |
| 22 | Eating preferences; EATING_PREFS_0; v00022 | nominal, integer | INTERVIEW | NA |  | none | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 4; ‘none’: 1366, ‘vegetarian’: 712, ‘vegan’: 250 | 2328 (77.60%) | 672 (22.40%) | Histogram of Eating preferences |
| 23 | Meat consumption; MEAT_CONS_0; v00023 | ordinal, integer | INTERVIEW | NA | 1-2d a week | never | NA | 2 (Q1 = never; Q3 = 3-4d a week) | NA | (never - daily) | NA | NA | NA | No. unique values (incl. NA): 6; ‘never’: 946, ‘3-4d a week’: 475, ‘1-2d a week’: 382, ‘5-6d a week’: 326, ‘daily’: 173 | 2302 (76.73%) | 698 (23.27%) | Histogram of Meat consumption |
| 24 | Current smoker; SMOKING_0; v00024 | nominal, integer | INTERVIEW | NA |  | no | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 3; ‘no’: 1585, ‘yes’: 707 | 2292 (76.40%) | 708 (23.60%) | Histogram of Current smoker |
| 25 | Purchasing tobacco products; SMOKE_SHOP_0; v00025 | ordinal, integer | INTERVIEW | NA | 3-4d a week | 3-4d a week | NA | 2 (Q1 = 1-2d a week; Q3 = 5-6d a week) | NA | (never - 5) | NA | NA | NA | No. unique values (incl. NA): 7; ‘3-4d a week’: 176, ‘5-6d a week’: 169, ‘daily’: 154, ‘1-2d a week’: 150, ‘never’: 133, Others: 24 | 806 (26.87%) | 2194 (73.13%) | Histogram of Purchasing tobacco products |
| 26 | Number of injuries; N_INJURIES_0; v00026 | ratio, integer | INTERVIEW | 4.588 | 4 | 4 | 2.4221 | 3 (Q1 = 3; Q3 = 6) | 2.965 | 14 (0 - 14) | 52.7921 | 0.4743 (0.0447) | -0.5724 | NA | 2199 (73.300%) | 801 (26.700%) | Histogram of Number of injuries |
| 27 | Number of births; N_BIRTH_0; v00027 | ratio, integer | INTERVIEW | 3.458 | 3 | 3 | 1.7707 | 2 (Q1 = 2; Q3 = 4) | 1.483 | 12 (0 - 12) | 51.2115 | 0.2105 (0.0447) | -1.7027 | NA | 1099 (36.633%) | 1901 (63.367%) | Histogram of Number of births |
| 28 | Income group; INCOME_GROUP_0; v00028 | ordinal, integer | INTERVIEW | NA | [30-50k) | [30-50k) | NA | 2 (Q1 = [10-30k); Q3 = [50-70k)) | NA | (below 10k - above 90k) | NA | NA | NA | No. unique values (incl. NA): 7; ‘[30-50k)’: 622, ‘[10-30k)’: 584, ‘[50-70k)’: 395, ‘below 10k’: 294, ‘[70-90k)’: 205, Others: 74 | 2174 (72.47%) | 826 (27.53%) | Histogram of Income group |
| 29 | Currently pregnant; PREGNANT_0; v00029 | nominal, integer | INTERVIEW | NA |  | no | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 3; ‘no’: 996, ‘yes’: 69 | 1065 (35.50%) | 1935 (64.50%) | Histogram of Currently pregnant |
| 30 | Medication use; MEDICATION_0; v00030 | nominal, integer | INTERVIEW | NA |  | yes | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 4; ‘yes’: 292, ‘3’: 205, ‘2’: 144, ‘no’: 0 | 641 (21.37%) | 2359 (78.63%) | Histogram of Medication use |
| 31 | Number of ATC codes; N_ATC_CODES_0; v00031 | ratio, integer | INTERVIEW | 2.262 | 1 | 0 | 2.7257 | 3 (Q1 = 0; Q3 = 3) | 1.483 | 22 (0 - 22) | 120.4786 | 1.4057 (0.0447) | 3.0554 | NA | 2058 (68.600%) | 942 (31.400%) | Histogram of Number of ATC codes |
| 32 | Sociodemographics examiner; USR_SOCDEM_0; v00032 | nominal, string | INTERVIEW | NA |  | USR_321 | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 16; ‘USR_321’: 380, ‘USR_247’: 297, ‘USR_520’: 290, ‘USR_120’: 172, ‘USR_125’: 159, Others: 988 | 0 (0.00%) | 714 (23.80%) | Histogram of Sociodemographics examiner |
| 33 | Interview date and time; INT_DT_0; v00033 | interval, datetime | INTERVIEW | 2018-07-02 12:40:34 UTC | 2018-07-05 22:27:30 UTC | 2018-08-15 22:55:34 UTC 2018-08-24 01:57:42 UTC | 9064782 secs | 15348014 secs (Q1 = 2018-04-03 06:44:50 UTC; Q3 = 2018-09-27 22:05:04 UTC) | 11522577 secs | 363.9972 days (2018-01-01 02:24:00 UTC - 2018-12-31 02:20:00 UTC) |  |  |  | NA | 2940 (98.000%) | 60 (2.000%) | Histogram of Interview date and time |
| 51 | Interview consent; PART_INTERVIEW; v40000 | nominal, integer | INTERVIEW | NA |  | yes | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 3; ‘yes’: 2924, ‘no’: 16 | 2940 (98.00%) | 60 (2.00%) | Histogram of Interview consent |
| 34 | Item 1; ITEM_1_0; v00034 | ratio, integer | QUESTIONNAIRE | 3.037 | 3 | 3 | 1.764 | 2 (Q1 = 2; Q3 = 4) | 1.483 | 9 (0 - 9) | 58.0849 | 0.4138 (0.0447) | -0.5894 | NA | 2248 (74.933%) | 752 (25.067%) | Histogram of Item 1 |
| 35 | Item 2; ITEM_2_0; v00035 | ratio, integer | QUESTIONNAIRE | 2.988 | 3 | 2 | 1.7008 | 2 (Q1 = 2; Q3 = 4) | 1.483 | 10 (0 - 10) | 56.9277 | 0.4079 (0.0447) | -0.648 | NA | 2197 (73.233%) | 803 (26.767%) | Histogram of Item 2 |
| 36 | Item 3; ITEM_3_0; v00036 | ratio, integer | QUESTIONNAIRE | 3.014 | 3 | 3 | 1.7175 | 2 (Q1 = 2; Q3 = 4) | 1.483 | 10 (0 - 10) | 56.9898 | 0.4105 (0.0447) | -0.6058 | NA | 2184 (72.800%) | 816 (27.200%) | Histogram of Item 3 |
| 37 | Item 4; ITEM_4_0; v00037 | ratio, integer | QUESTIONNAIRE | 3 | 3 | 3 | 1.7214 | 2 (Q1 = 2; Q3 = 4) | 1.483 | 10 (0 - 10) | 57.3701 | 0.4347 (0.0447) | -0.5563 | NA | 2143 (71.433%) | 857 (28.567%) | Histogram of Item 4 |
| 38 | Item 5; ITEM_5_0; v00038 | ratio, integer | QUESTIONNAIRE | 6.021 | 6 | 6 | 2.3738 | 4 (Q1 = 4; Q3 = 8) | 2.965 | 10 (0 - 10) | 39.4237 | -0.1831 (0.0447) | -1.3516 | NA | 2074 (69.133%) | 926 (30.867%) | Histogram of Item 5 |
| 39 | Item 6; ITEM_6_0; v00039 | ratio, integer | QUESTIONNAIRE | 5.948 | 6 | 6 | 2.3706 | 4 (Q1 = 4; Q3 = 8) | 2.965 | 10 (0 - 10) | 39.8567 | -0.1194 (0.0447) | -1.4139 | NA | 2048 (68.267%) | 952 (31.733%) | Histogram of Item 6 |
| 40 | Item 7; ITEM_7_0; v00040 | ratio, integer | QUESTIONNAIRE | 6.037 | 6 | 6 | 2.4007 | 4 (Q1 = 4; Q3 = 8) | 2.965 | 10 (0 - 10) | 39.7687 | -0.1739 (0.0447) | -1.3982 | NA | 2068 (68.933%) | 932 (31.067%) | Histogram of Item 7 |
| 41 | Item 8; ITEM_8_0; v00041 | ratio, integer | QUESTIONNAIRE | 5.895 | 6 | 6 | 2.3974 | 4 (Q1 = 4; Q3 = 8) | 2.965 | 10 (0 - 10) | 40.6699 | -0.0971 (0.0447) | -1.4874 | NA | 2013 (67.100%) | 987 (32.900%) | Histogram of Item 8 |
| 42 | Questionnaire date and time; QUEST_DT_0; v00042 | interval, datetime | QUESTIONNAIRE | 2018-08-03 23:09:06 UTC | 2018-07-29 21:52:59 UTC | 2017-12-31 22:59:59 UTC | 12076278 secs | 16596520 secs (Q1 = 2018-04-20 11:41:02 UTC; Q3 = 2018-10-29 13:49:41 UTC) | 3404.74 hours | 1042.614 days (2017-12-31 22:59:59 UTC - 2020-11-08 13:43:28 UTC) |  |  |  | NA | 2940 (98.000%) | 60 (2.000%) | Histogram of Questionnaire date and time |
| 52 | Questionnaire consent; PART_QUESTIONNAIRE; v50000 | nominal, integer | QUESTIONNAIRE | NA |  | yes | NA |  | NA |  | NA | NA | NA | No. unique values (incl. NA): 3; ‘yes’: 2864, ‘no’: 76 | 2940 (98.00%) | 60 (2.00%) | Histogram of Questionnaire consent |


Interpretation

Algorithm of the implementation

  1. From the metadata, determine the scale level and the data type of each variable.
  2. For variables of scale level nominal, calculate the mode, and the number of categories and their frequency.
  3. For variables of scale level ordinal, calculate the median, the mode, the range of values (min - max), the interquartile range (1st and 3rd quartiles), and the number of categories and their frequency.
  4. For variables of scale level interval, calculate the mean, the median, the mode, the standard deviation (SD), the MAD, the range of values (min - max), the interquartile range (1st and 3rd quartiles), and the coefficient of variation. In addition, for data types integer or float only, calculate the skewness and its standard error, and the kurtosis.
  5. For variables of scale level ratio, calculate the mean, the median, the mode, the standard deviation (SD), the MAD, the range of values (min - max), the interquartile range (1st and 3rd quartiles), and the coefficient of variation. In addition, for data types integer or float only, calculate the skewness and its standard error, and the kurtosis.
  6. Generate a data frame containing the summary for each variable.
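
The steps above can be sketched in base R as a dispatch on the scale level. This is a simplified illustration, not the dataquieR code: the toy data, the metadata columns VAR and SCALE_LEVEL, the category order in grade_levels, and the helper summarise_one are all hypothetical, and only a subset of the statistics is computed.

```r
# Toy study data and item-level metadata (illustrative names and values only)
study <- data.frame(
  CENTER = c("Berlin", "Leipzig", "Berlin", NA, "Munich"),
  GRADE  = c("low", "high", "mid", "mid", NA),
  AGE    = c(47, 50, 51, 53, NA),
  stringsAsFactors = FALSE
)
meta <- data.frame(
  VAR         = c("CENTER", "GRADE", "AGE"),
  SCALE_LEVEL = c("nominal", "ordinal", "ratio"),
  stringsAsFactors = FALSE
)
grade_levels <- c("low", "mid", "high")  # assumed order of the ordinal categories

# Summarise one variable according to its scale level (steps 2 to 5)
summarise_one <- function(var, scale_level) {
  x <- study[[var]]
  valid <- x[!is.na(x)]
  tab <- table(valid)
  out <- data.frame(
    Variable = var, Scale = scale_level,
    Mean = NA_real_, Median = NA_character_,
    Mode = names(tab)[which.max(tab)],   # first most frequent value
    SD = NA_real_, Freq = NA_character_,
    Valid = length(valid), Missing = sum(is.na(x)),
    stringsAsFactors = FALSE
  )
  if (scale_level %in% c("nominal", "ordinal")) {
    # Frequency table of the categories
    out$Freq <- paste(names(tab), tab, sep = ": ", collapse = ", ")
  }
  if (scale_level == "ordinal") {
    # Median category; quantile type 1 is defined for ordered factors
    f <- factor(valid, levels = grade_levels, ordered = TRUE)
    out$Median <- as.character(quantile(f, probs = 0.5, type = 1))
  }
  if (scale_level %in% c("interval", "ratio")) {
    out$Mean <- mean(valid)
    out$Median <- as.character(median(valid))
    out$SD <- sd(valid)
  }
  out
}

# Step 6: one row per variable, combined into a single data frame
rows <- lapply(seq_len(nrow(meta)), function(i) {
  summarise_one(meta$VAR[i], meta$SCALE_LEVEL[i])
})
summary_df <- do.call(rbind, rows)
print(summary_df)
```

Statistics that do not apply to a scale level are left as NA in this sketch, which mirrors how the example output shows NA for the mean and SD of categorical variables.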

Concept relations