Introduction

dataquieR provides many outputs ready to be integrated with a quality report. However, users’ requirements are usually more specific. This tutorial explains how to adjust dataquieR’s main result types (data frames and ggplot2-graphics) to meet particular needs.

Example output of dataquieR

The basic example used in this documentation requires two objects, which are mandatory for all dataquieR functions:

  • study data, and
  • metadata.

These are loaded from the dataquieR package:

load(system.file("extdata", "study_data.RData", package = "dataquieR"))
sd1 <- study_data

load(system.file("extdata", "meta_data.RData", package = "dataquieR"))
md1 <- meta_data

The example output is generated using the dataquieR function: com_item_missingness().

tab_ex1 <- com_item_missingness(study_data = sd1,
                                meta_data = md1,
                                threshold_value = 90,
                                include_sysmiss = TRUE)

This function generates four objects: SummaryTable, SummaryData, SummaryPlot, ReportSummaryTable. The first two are data frames, and the last two are ggplots. The following steps show how to edit these objects.

Data frames

For the use of data frames in data quality reporting, there are two important aspects.

  1. They should be displayed in a neat and comprehensible way. Many packages exist for this purpose, e.g. xtable, kableExtra, pixiedust, huxtable and DT, each of which integrates with some of the output formats supported by rmarkdown/pandoc, namely html, docx, pdf, and flexdashboard. For details on using these packages, please refer to their documentation.

  2. Given the size of the data frames, there must be ways to filter and/or sort them, to add or remove columns, and to rename columns. For these tasks, the tidyverse with the dplyr package is a good choice.

Wide and long format is another consideration for tables, and it is related to the next topic (ggplot2 graphics generated by dataquieR). tidyr is one possible choice for transforming tables from long to wide format.
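As a minimal sketch of such a reshaping step, the following uses tidyr::pivot_wider() on a small hand-typed table. The data frame long_tab and its column names (measure, n) are assumptions made for illustration; they are not produced by dataquieR.

```r
library(tidyr)

# Hypothetical long-format table: one row per variable and measure.
long_tab <- data.frame(
  Variables = c("v00004", "v00004", "v00005", "v00005"),
  measure   = c("sysmiss", "missing_codes", "sysmiss", "missing_codes"),
  n         = c(239, 140, 233, 163)
)

# Long to wide: one row per variable, one column per measure.
wide_tab <- pivot_wider(long_tab, names_from = measure, values_from = n)
wide_tab
```

The reverse direction (wide to long) is available as tidyr::pivot_longer().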

The simplest output of the SummaryTable data frame looks like this (only the first 10 rows are shown to reduce file size):

knitr::kable(head(tab_ex1$SummaryTable, 10))
| Variables | Observations N | Sysmiss N (%) | Datavalues N (%) | Missing codes N (%) | Jumps N (%) | Measurements N (%) | GRADING |
| --- | --- | --- | --- | --- | --- | --- | --- |
| v00000 | 2940 | 0 (0) | 2940 (100) | 0 (0) | 0 (0) | 2940 (100) | 0 |
| v00001 | 2940 | 0 (0) | 2940 (100) | 0 (0) | 0 (0) | 2940 (100) | 0 |
| v00002 | 2940 | 0 (0) | 2940 (100) | 0 (0) | 0 (0) | 2940 (100) | 0 |
| v00003 | 2940 | 0 (0) | 2940 (100) | 0 (0) | 0 (0) | 2940 (100) | 0 |
| v00103 | 2940 | 0 (0) | 2940 (100) | 0 (0) | 0 (0) | 2940 (100) | 0 |
| v01003 | 2940 | 0 (0) | 2940 (100) | 0 (0) | 0 (0) | 2940 (100) | 0 |
| v01002 | 2940 | 0 (0) | 2940 (100) | 0 (0) | 0 (0) | 2940 (100) | 0 |
| v10000 | 3000 | 60 (2) | 2940 (98) | 0 (0) | 0 (0) | 2940 (98) | 0 |
| v00004 | 2940 | 239 (8.13) | 2701 (91.87) | 140 (4.76) | 0 (0) | 2561 (87.11) | 1 |
| v00005 | 2940 | 233 (7.93) | 2707 (92.07) | 163 (5.54) | 0 (0) | 2544 (86.53) | 1 |


Styling

The table above comprises information on missing values for all variables in the study data. However, it is not the most attractive output. We can use functionality of the kableExtra package and apply its formats to the present table using the pipe operator (%>%) made available by dplyr.

suppressPackageStartupMessages(library(dplyr))
library(kableExtra)
kable(tab_ex1$SummaryTable, "html") %>%
  kable_styling(bootstrap_options = c("hover"))
| Variables | Observations N | Sysmiss N (%) | Datavalues N (%) | Missing codes N (%) | Jumps N (%) | Measurements N (%) | GRADING |
| --- | --- | --- | --- | --- | --- | --- | --- |
| v00000 | 2940 | 0 (0) | 2940 (100) | 0 (0) | 0 (0) | 2940 (100) | 0 |
| v00001 | 2940 | 0 (0) | 2940 (100) | 0 (0) | 0 (0) | 2940 (100) | 0 |
| v00002 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v00003 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v00103 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v01003 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v01002 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v10000 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v00004 | 2940 | 299 (10.17) | 2641 (89.83) | 140 (4.76) | 0 (0) | 2501 (85.07) | 1 |
| v00005 | 2940 | 293 (9.97) | 2647 (90.03) | 163 (5.54) | 0 (0) | 2484 (84.49) | 1 |
| v00006 | 2940 | 306 (10.41) | 2634 (89.59) | 76 (2.59) | 0 (0) | 2558 (87.01) | 1 |
| v00007 | 2940 | 287 (9.76) | 2653 (90.24) | 72 (2.45) | 0 (0) | 2581 (87.79) | 1 |
| v00008 | 2940 | 285 (9.69) | 2655 (90.31) | 120 (4.08) | 0 (0) | 2535 (86.22) | 1 |
| v00009 | 2940 | 280 (9.52) | 2660 (90.48) | 63 (2.14) | 0 (0) | 2597 (88.33) | 1 |
| v00109 | 2940 | 298 (10.14) | 2642 (89.86) | 69 (2.35) | 0 (0) | 2573 (87.52) | 1 |
| v00010 | 2940 | 296 (10.07) | 2644 (89.93) | 81 (2.76) | 0 (0) | 2563 (87.18) | 1 |
| v00011 | 2940 | 149 (5.07) | 2791 (94.93) | 69 (2.35) | 0 (0) | 2722 (92.59) | 0 |
| v00012 | 2940 | 140 (4.76) | 2800 (95.24) | 85 (2.89) | 0 (0) | 2715 (92.35) | 0 |
| v00013 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v20000 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v00014 | 2940 | 232 (7.89) | 2708 (92.11) | 69 (2.35) | 0 (0) | 2639 (89.76) | 1 |
| v00015 | 2940 | 242 (8.23) | 2698 (91.77) | 72 (2.45) | 0 (0) | 2626 (89.32) | 1 |
| v00016 | 2940 | 308 (10.48) | 2632 (89.52) | 0 (0) | 0 (0) | 2632 (89.52) | 1 |
| v00017 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v30000 | 2940 | 60 (2.04) | 2880 (97.96) | 0 (0) | 0 (0) | 2880 (97.96) | 0 |
| v00018 | 2940 | 148 (5.03) | 2792 (94.97) | 380 (12.93) | 0 (0) | 2412 (82.04) | 1 |
| v01018 | 2924 | 159 (5.44) | 2765 (94.56) | 416 (14.23) | 0 (0) | 2349 (80.34) | 1 |
| v00019 | 2924 | 198 (6.77) | 2726 (93.23) | 413 (14.12) | 0 (0) | 2313 (79.1) | 1 |
| v00020 | 2924 | 202 (6.91) | 2722 (93.09) | 432 (14.77) | 0 (0) | 2290 (78.32) | 1 |
| v00021 | 2924 | 236 (8.07) | 2688 (91.93) | 428 (14.64) | 0 (0) | 2260 (77.29) | 1 |
| v00022 | 2924 | 224 (7.66) | 2700 (92.34) | 448 (15.32) | 0 (0) | 2252 (77.02) | 1 |
| v00023 | 2924 | 247 (8.45) | 2677 (91.55) | 451 (15.42) | 0 (0) | 2226 (76.13) | 1 |
| v00024 | 2924 | 259 (8.86) | 2665 (91.14) | 449 (15.36) | 0 (0) | 2216 (75.79) | 1 |
| v00025 | 2924 | 1681 (57.49) | 1243 (42.51) | 513 (17.54) | 0 (0) | 730 (24.97) | 1 |
| v00026 | 2924 | 320 (10.94) | 2604 (89.06) | 481 (16.45) | 0 (0) | 2123 (72.61) | 1 |
| v00027 | 2924 | 289 (9.88) | 2635 (90.12) | 499 (17.07) | 1113 (38.06) | 1023 (56.49) | 1 |
| v00028 | 2924 | 311 (10.64) | 2613 (89.36) | 515 (17.61) | 0 (0) | 2098 (71.75) | 1 |
| v00029 | 2924 | 350 (11.97) | 2574 (88.03) | 519 (17.75) | 1066 (36.46) | 989 (53.23) | 1 |
| v00030 | 2924 | 1809 (61.87) | 1115 (38.13) | 550 (18.81) | 0 (0) | 565 (19.32) | 1 |
| v00031 | 2924 | 386 (13.2) | 2538 (86.8) | 556 (19.02) | 0 (0) | 1982 (67.78) | 1 |
| v00032 | 2924 | 382 (13.06) | 2542 (86.94) | 332 (11.35) | 0 (0) | 2210 (75.58) | 1 |
| v00033 | 2924 | 60 (2.05) | 2864 (97.95) | 0 (0) | 0 (0) | 2864 (97.95) | 0 |
| v40000 | 2924 | 60 (2.05) | 2864 (97.95) | 0 (0) | 0 (0) | 2864 (97.95) | 0 |
| v00034 | 2924 | 453 (15.49) | 2471 (84.51) | 299 (10.23) | 0 (0) | 2172 (74.28) | 1 |
| v00035 | 2864 | 479 (16.72) | 2385 (83.28) | 324 (11.31) | 0 (0) | 2061 (71.96) | 1 |
| v00036 | 2864 | 491 (17.14) | 2373 (82.86) | 325 (11.35) | 0 (0) | 2048 (71.51) | 1 |
| v00037 | 2864 | 483 (16.86) | 2381 (83.14) | 374 (13.06) | 0 (0) | 2007 (70.08) | 1 |
| v00038 | 2864 | 552 (19.27) | 2312 (80.73) | 374 (13.06) | 0 (0) | 1938 (67.67) | 1 |
| v00039 | 2864 | 563 (19.66) | 2301 (80.34) | 389 (13.58) | 0 (0) | 1912 (66.76) | 1 |
| v00040 | 2864 | 531 (18.54) | 2333 (81.46) | 401 (14) | 0 (0) | 1932 (67.46) | 1 |
| v00041 | 2864 | 560 (19.55) | 2304 (80.45) | 427 (14.91) | 0 (0) | 1877 (65.54) | 1 |
| v00042 | 2864 | 60 (2.09) | 2804 (97.91) | 0 (0) | 0 (0) | 2804 (97.91) | 0 |
| v50000 | 2864 | 60 (2.09) | 2804 (97.91) | 0 (0) | 0 (0) | 2804 (97.91) | 0 |

Paging

The table above is getting very long. An alternative is paged output of data frames. For this, a single line (df_print: paged) must be added under output in the YAML header. Simply calling the data frame then allows browsing its rows and columns. Alternatively, you may use the DT package, even as the default printer for data frames.

tab_ex1$SummaryTable

To use DT, you would have to add a chunk like the following to your R-Markdown file:

```{r include=FALSE}
library(knitr)
library(DT)
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
knit_print.data.frame = function(x, ...) { knit_print(DT::datatable(x), ...) }
registerS3method("knit_print", "data.frame", knit_print.data.frame)
```

Remove columns

In some instances, it may be necessary to remove a column. For example, we could remove the Observations N column via the - operator:

tab_ex1$SummaryTable %>%
  select(-'Observations N') 

The column Variables contains rather technical variable names that do not help in interpreting the content. For this reason, all dataquieR functions have an option called label_col. The selected label can be any column in the metadata; our model suggests naming that column LABEL. For the time being, the labels must be valid in R formulas, which means they should basically not contain characters other than letters or numbers. We plan to relax this condition.

tab_ex2 <- com_item_missingness(study_data = sd1,
                                meta_data = md1,
                                threshold_value = 90,
                                label_col = "LABEL",
                                include_sysmiss = TRUE,
                                show_causes = FALSE)

tab_ex2$SummaryTable %>%
  select(-'Observations N')
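
Filtering rows and renaming columns follow the same pattern as removing columns. The following is a minimal sketch on a hand-made stand-in table; toy_tab, its values, and the new column name Variable are assumptions for illustration and are not dataquieR output.

```r
library(dplyr)

# Hypothetical stand-in for a SummaryTable with a non-syntactic column name.
toy_tab <- data.frame(
  Variables        = c("v00003", "v00004", "v00005"),
  `Observations N` = c(2940, 2940, 2940),
  GRADING          = c(0, 1, 1),
  check.names      = FALSE
)

flagged <- toy_tab %>%
  filter(GRADING == 1) %>%        # keep only rows with GRADING equal to 1
  select(-`Observations N`) %>%   # drop a column
  rename(Variable = Variables)    # new name on the left, old name on the right
flagged
```

Backticks are used here for the column name containing a space, which is the usual way to refer to non-syntactic names in dplyr.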

Order rows

We may also want to sort rows or columns. This can likewise be achieved with dplyr functions:

tab_ex2$SummaryTable %>%
  select(-'Observations N') %>%
  arrange(desc(`Measurements N (%)`)) 

Sorting numerically is currently somewhat more involved, because dataquieR returns text in these columns, so the call above sorts the entries as character strings. The percentages can be extracted from the text using the following code:

splitted_measurements_col <- # this will be a list of character vectors of length 2 (the part before and the part after the '(' character for each row)
  strsplit(tab_ex2$SummaryTable$`Measurements N (%)`, # the measurement count column
           '(', # split at the opening bracket
           fixed = TRUE # fixed string match, no pattern match
           )
percent_part_in_col <- # this will be a character vector of the percentages
  unlist( # we don't want a list but a vector of percentages, as usual for data frame columns
    lapply(splitted_measurements_col, `[[`, 2) # select the second element of each entry in the list
  )
sort_order <- as.numeric(sub(')', '', percent_part_in_col, fixed = TRUE)) # remove the closing bracket and convert the characters to numbers
tab_ex2$SummaryTable %>%
  select(-'Observations N') %>%
  arrange(desc(sort_order)) 
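
The same extraction can be sketched more compactly with a regular expression. The example below is self-contained and works on a hand-typed character vector in the "N (%)" layout; the vector measurements and the names pct and n are assumptions for illustration only.

```r
# Hypothetical character vector in the "N (%)" layout used in the tables above.
measurements <- c("2940 (100)", "2561 (87.11)", "730 (24.97)", "2880 (97.96)")

# Capture the text between the brackets and convert it to numeric.
pct <- as.numeric(sub("^.*\\(([0-9.]+)\\)$", "\\1", measurements))

# Capture the count in front of the opening bracket.
n <- as.numeric(sub("^([0-9]+) .*$", "\\1", measurements))

# Row indices sorted by descending percentage.
ord <- order(pct, decreasing = TRUE)
measurements[ord]
```

Storing pct or n as additional numeric columns keeps the original text columns intact for display while allowing numeric sorting.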

Reorder columns

Maybe the columns should be in some other order too:

tab_ex2$SummaryTable %>%
  select(-'Observations N') %>% # the Observations N column must be removed in a separate step, before everything() is used in the next line, so we keep two lines.
  select(`Variables`, `Measurements N (%)`, everything()) # everything() adds all columns not yet selected.
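
As an alternative sketch, dplyr::relocate() (available in dplyr 1.0.0 and later) moves individual columns without listing the others. The table toy_tab below is a hand-made stand-in, not dataquieR output.

```r
library(dplyr)

# Hypothetical stand-in table with non-syntactic column names.
toy_tab <- data.frame(
  Variables            = c("v00004", "v00005"),
  `Sysmiss N (%)`      = c("239 (8.13)", "233 (7.93)"),
  `Measurements N (%)` = c("2561 (87.11)", "2544 (86.53)"),
  check.names          = FALSE
)

# Move one column directly behind another; all other columns keep their order.
reordered <- toy_tab %>%
  relocate(`Measurements N (%)`, .after = Variables)
names(reordered)
```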

Plots from ggplot2

The versatile ggplot2 package makes it possible to modify graphics after they have been created, to render them in vector formats, and even to extract the underlying data. This makes it handy for interfacing with user code. ggplot2 is also built on a comprehensive concept, a grammar of graphics, which makes it highly structured and its code easy to understand. For more advice on the ggplot2 package, please refer to its vignettes:

browseVignettes(package = "ggplot2")

The package dataquieR generates two types of ggplot-objects.

  1. Either a single summary plot called SummaryPlot or
  2. a list of plots called SummaryPlotList.

The latter is used if several plots are generated, typically one for each variable of the study data. As the handling and manipulation of a single SummaryPlot is more straightforward, we illustrate a plot list using the dataquieR function acc_distributions:

ex1 <- acc_distributions(resp_vars      = NULL, 
                         group_vars     = NULL, 
                         label_col      = "LABEL",
                         study_data     = sd1, 
                         meta_data      = md1)
#> All variables defined to be integer or float in the metadata are used by acc_distributions.
#> Variable 'PART_STUDY' (resp_vars) has fewer distinct values than required for the argument 'resp_vars' of 'acc_distributions'
#> Variable 'PART_PHYS_EXAM' (resp_vars) has fewer distinct values than required for the argument 'resp_vars' of 'acc_distributions'
#> Variable 'PART_LAB' (resp_vars) has fewer distinct values than required for the argument 'resp_vars' of 'acc_distributions'
#> Variable 'MEDICATION_0' (resp_vars) has fewer distinct values than required for the argument 'resp_vars' of 'acc_distributions'

This yields a set of 39 figures, all of which are ggplot2 objects:

unique(unlist(lapply(ex1$SummaryPlotList, class)))
#> [1] "gg"     "ggplot"

The ggedit package allows ggplot2 objects to be edited easily. Nevertheless, the basics of editing them by hand are discussed in the following. For more complex adjustments, we recommend ggedit.

Lists of plots

To list them all, a simple print of the ex1$SummaryPlotList can be used, but this will also print the “normal” output of printing a list, i.e. the names or numbers of all its elements. To avoid this, you can simply print each element of the list separately:

# for (i in 1:length(ex1$SummaryPlotList)) # substituted by the next row to shorten the output of this vignette:
for (i in head(seq_along(ex1$SummaryPlotList), 4)) {
  print(ex1$SummaryPlotList[[i]])
}

Of course, an apply-style iteration would be possible too, but for the purpose of plotting figures, a for loop fits perfectly.
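
For comparison, a minimal sketch of the apply-style variant is shown here. It uses a small made-up list of histograms of the built-in mtcars data in place of ex1$SummaryPlotList; the name plot_list is an assumption for illustration.

```r
library(ggplot2)

# Hypothetical list of plots standing in for ex1$SummaryPlotList.
plot_list <- list(
  mpg = ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10),
  hp  = ggplot(mtcars, aes(x = hp)) + geom_histogram(bins = 10)
)

# print() each plot; invisible() hides the list that lapply() returns,
# so only the figures appear in the output.
invisible(lapply(plot_list, print))
```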

Using this code, all figures are printed one below the other. To arrange them in columns, the chunk option out.width can be handy. rmarkdown places figures side by side as long as the current row is not yet filled, so something like out.width=c('50%', '50%') can be used to achieve a two-column image list.

Arrange plots

Another possibility for arranging a list of plots is the ggpubr package, which handles a specific format for lists of ggplot2 objects.

ggpubr::ggarrange(plotlist = ex1$SummaryPlotList[1:4])

An alternative to ggpubr is the patchwork package, which provides a very intuitive way of aligning ggplot2 graphics:

library(patchwork)
p1 <- ex1$SummaryPlotList[[1]]
p2 <- ex1$SummaryPlotList[[2]]
p3 <- ex1$SummaryPlotList[[3]]

p1 | (p2 / p3)

See the patchwork vignette for more details.
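
When the plots are already in a list, patchwork::wrap_plots() can lay them out without naming each plot. The following is a minimal sketch on made-up histograms of the built-in mtcars data; plot_list stands in for a subset such as ex1$SummaryPlotList[1:4] and is an assumption for illustration.

```r
library(ggplot2)
library(patchwork)

# Hypothetical list of four plots.
plot_list <- list(
  ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10),
  ggplot(mtcars, aes(x = hp)) + geom_histogram(bins = 10),
  ggplot(mtcars, aes(x = wt)) + geom_histogram(bins = 10),
  ggplot(mtcars, aes(x = qsec)) + geom_histogram(bins = 10)
)

# wrap_plots() accepts a list of plots and arranges them in a grid.
combined <- wrap_plots(plot_list, ncol = 2)
combined
```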

Plot rotation

The following example rotates the plot so that the counts appear on the x-axis. For this, we can use the + operator in combination with the function coord_flip:

library(ggplot2)
print(
  ex1$SummaryPlotList[[3]] +
    coord_flip()
)
#> Coordinate system already present. Adding new coordinate system, which will
#> replace the existing one.

To add a red line, we use the annotate function to draw objects not directly mapped (by aes) to specific data points/samples (which avoids redundant plotting):

library(ggplot2)
print(
  ex1$SummaryPlotList[[3]] +
    coord_flip() +
    annotate("segment", x = -Inf, xend = Inf, y = 0, yend = 0, colour = "red") 
)
#> Coordinate system already present. Adding new coordinate system, which will
#> replace the existing one.

Highlighting

Then, we may like to highlight the largest bin in red. For this, we need to access the bins calculated by geom_histogram which the ggplot_build function makes accessible for ggplot2-objects:

p <- ex1$SummaryPlotList[[3]] # choose the third figure generated by dataquieR.
x <- ggplot_build(p) # make its graphical properties accessible.
largest_bin <- which.max(x[["data"]][[1]][["count"]]) # find the largest bin.
print(x[["data"]][[1]][largest_bin, c("xmin", "xmax", "ymin", "ymax")]) # this would print out the cartesian coordinates of the largest bin.
#>    xmin xmax ymin ymax
#> 18   50   51    0  264
# see also the helpful contribution there: https://community.rstudio.com/t/geom-histogram-max-bin-height/10026
print( # print
  p +  # the plot
    annotate("segment", x = -Inf, xend = Inf, y = 0, yend = 0, colour = "red") + # annotate it with the red line again
    annotate("rect", # and highlight the largest bin by overplotting it with red framed black rectangle.
             xmin = x[["data"]][[1]]$xmin[[largest_bin]], 
             xmax = x[["data"]][[1]]$xmax[[largest_bin]], 
             ymin = x[["data"]][[1]]$ymin[[largest_bin]], 
             ymax = x[["data"]][[1]]$ymax[[largest_bin]], color = "red")
)
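
The same idea can be sketched in a self-contained way with ggplot2::layer_data(), which returns the computed data of a single layer. The example uses a histogram of the built-in mtcars data; p_toy, bins and largest are names chosen for illustration and are not part of dataquieR.

```r
library(ggplot2)

# Hypothetical plot standing in for one element of ex1$SummaryPlotList.
p_toy <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 5)

# layer_data() returns the computed data of layer 1 (here: the histogram bins).
bins <- layer_data(p_toy, 1)
largest <- which.max(bins$count)

# Frame the largest bin in red without filling it.
p_toy +
  annotate("rect",
           xmin = bins$xmin[largest], xmax = bins$xmax[largest],
           ymin = 0, ymax = bins$count[largest],
           fill = NA, colour = "red")
```

Setting fill = NA keeps the original bar visible beneath the red frame.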

Annotation

Unfortunately, the documentation of the annotate function is somewhat sparse. Its geom parameter refers to existing implementations of graphics in ggplot2, all of which are prefixed with geom_. Usually, they extract their coordinates from the data using the mapping given in the aes parameter of the whole ggplot2 object or of the specific geom. A useful geom besides segment and rect is text, for actually annotating the plot:

print( # print
  p +  # the plot
    scale_y_continuous(limits = c(0, 300)) +  # expand y-axis to make annotations visible
    annotate("segment", x = -Inf, xend = Inf, y = 0, yend = 0, colour = "red") + # annotate it with the red line again
    annotate("rect", # and highlight the largest bin by overplotting it with red framed black rectangle.
             xmin = x[["data"]][[1]]$xmin[[largest_bin]], 
             xmax = x[["data"]][[1]]$xmax[[largest_bin]], 
             ymin = x[["data"]][[1]]$ymin[[largest_bin]], 
             ymax = x[["data"]][[1]]$ymax[[largest_bin]], color = "red") +
    annotate("text", label = "Largest bin", x = x[["data"]][[1]]$xmax[[largest_bin]], y = x[["data"]][[1]]$ymax[[largest_bin]], angle = 270, vjust = -.5)
)

You may see the documentation of ggplot2::annotate for some examples.

Coordinates are given in the same coordinate system that is shown in the plot, so drawing a line at 100 observations is as easy as directly choosing 100 as y coordinate.

print( # print
  p + # the plot
    scale_y_continuous(limits = c(0, 300)) +  # expand y-axis to make annotations visible
    annotate("segment", x = -Inf, xend = Inf, y = 100, yend = 100, colour = "red") + # add a second red line at 100 observations
    annotate("segment", x = -Inf, xend = Inf, y = 0, yend = 0, colour = "red") + # annotate it with the red line again
    annotate("rect", # and highlight the largest bin by overplotting it with red framed black rectangle.
             xmin = x[["data"]][[1]]$xmin[[largest_bin]], 
             xmax = x[["data"]][[1]]$xmax[[largest_bin]], 
             ymin = x[["data"]][[1]]$ymin[[largest_bin]], 
             ymax = x[["data"]][[1]]$ymax[[largest_bin]], color = "red") +
    annotate("text", label = "Largest bin", x = x[["data"]][[1]]$xmax[[largest_bin]], y = x[["data"]][[1]]$ymax[[largest_bin]], angle = 270, vjust = -.5) 
)

We will now rotate the annotation:

p2 <-  p +  # the plot
    annotate("segment", x = -Inf, xend = Inf, y = 100, yend = 100, colour = "red") + # annotate it with the red line again
    annotate("segment", x = -Inf, xend = Inf, y = 0, yend = 0, colour = "red") + # annotate it with the red line again
    annotate("rect", # and highlight the largest bin by overplotting it with red framed black rectangle.
             xmin = x[["data"]][[1]]$xmin[[largest_bin]], 
             xmax = x[["data"]][[1]]$xmax[[largest_bin]], 
             ymin = x[["data"]][[1]]$ymin[[largest_bin]], 
             ymax = x[["data"]][[1]]$ymax[[largest_bin]], color = "red") +
    annotate("text", label = "Largest bin", x = x[["data"]][[1]]$xmax[[largest_bin]], y = x[["data"]][[1]]$ymax[[largest_bin]], angle = 0, vjust = -.5)
# coord_cartesian() restores the original Cartesian coordinate system, replacing
# the flipped one introduced by acc_distributions. However, this emits a message
# about replacing the coordinate system, which we suppress here with suppressMessages.
suppressMessages(p2 + coord_cartesian())

# Note, that neither `ggplot2::coord_flip` nor `ggpubr::rotate` can solve this 
# issue. These functions are not aware of already-rotated plots, so the following 
# will *not* rotate the plot back:
# 
# ```{r}
# p2 + coord_flip()     # does not rotate the plot but prints
#                       # Coordinate system already present. Adding new coordinate
#                       # system, which will replace the existing one.
# 
# p2 + ggpubr::rotate() # does not rotate the plot but prints
#                       # Coordinate system already present. Adding new coordinate
#                       # system, which will replace the existing one.
# ```

Add new data

All dataquieR functions use the data as they are imported, i.e. variables of the study data can be examined and used for grouping/stratification of results. All information on these variables must be available in the metadata. In some situations, particularly during exploratory data quality reporting, it is necessary to use a newly calculated or transformed variable. Naturally, the respective information is not defined in the metadata, which would preclude the use of such variables in data quality reporting.

The need for a helper function is illustrated with the following example from com_segment_missingness():

MissSegs <- com_segment_missingness(study_data = sd1, 
                                    meta_data = md1, 
                                    threshold_value = 1, 
                                    color_gradient_direction = "above",
                                    exclude_roles = c("secondary", "process"))

The SummaryPlot shows the frequency of observations in which all measurements of respective study segments are missing.

MissSegs$SummaryPlot

Exploring the segment missingness over time would require another variable in the study data. We will generate such a variable using the lubridate package.

sd1$exq <- as.integer(lubridate::quarter(sd1$v00013))
table(sd1$exq)
#> 
#>   1   2   3   4 
#> 724 713 776 727
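
What lubridate::quarter() does can be seen in isolation on a few hand-typed dates. This is a minimal sketch; exam_dates and exam_quarter are made-up names, and it assumes the input is a date or date-time vector, as v00013 is treated above.

```r
library(lubridate)

# Hypothetical examination dates.
exam_dates <- as.Date(c("2020-01-15", "2020-05-01", "2020-11-30"))

# quarter() maps each date to the quarter of the year (1 to 4).
exam_quarter <- as.integer(quarter(exam_dates))
exam_quarter
```

In the tutorial, the same call yields the new study data variable sd1$exq.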

Information regarding this variable is then added to a copy of the metadata (md2) using the dataquieR function prep_add_to_meta():

md2 <- dataquieR::prep_add_to_meta(VAR_NAMES = "exq", 
                                   DATA_TYPE = "integer",
                                   LABEL = "EX_QUARTER_0",
                                   VALUE_LABELS = "1 = 1st | 2 = 2nd | 3 = 3rd | 4 = 4th",
                                   VARIABLE_ROLE = "process",
                                   MISSING_LIST = "",
                                   meta_data = md1)
MissSegs <- com_segment_missingness(study_data = sd1, 
                                    meta_data = md2, 
                                    threshold_value = 1, 
                                    label_col = "LABEL",
                                    group_vars = "EX_QUARTER_0",
                                    color_gradient_direction = "above",
                                    exclude_roles = "process")

MissSegs$SummaryPlot
