Abstract
This document is currently under heavy revision, so please contact us if you want to contribute. Nevertheless, it provides guidance on writing R functions for data quality assessments, ensuring exchangeable R code through a homogeneous structure of functions and data. To this end, conventions for the input and output of data are defined, naming conventions are introduced, and documentation requirements are stated. In addition, we provide recommendations regarding (machine-readable) output and visualization.
Code developed in teams, especially if intended to be used by a larger community, needs to be comprehensible for all team members and, ideally, for anyone trying to use that code. Here we provide some conventions that resulted from the first project phase.
Please note that this document is work in progress, and some sections are not complete yet. Stay tuned for the updates.
The concept distinguishes two types of data sources:
Study data:
Clinical data: Measurements (organized within variables) intended to be subject to data quality assessments
Process information: all data providing information on the measurement process, such as time, ambient variables, the respective device or examiner.
Meta data:
Contain the expected characteristics of study data on different levels, e.g., on the level of each variable: labels, limits, or missing codes. They also contain the allocation of the respective process information organized in variables.
Further tables referencing descriptions such as labels of missing codes.
For further information see Richter et al.
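To make the distinction concrete, here is a minimal sketch of the two data frame types. All concrete variable names and values are hypothetical; the meta data attributes VAR_NAMES, LABEL, and KEY_OBSERVER are the ones used in the examples later in this document:

# minimal sketch: study data hold measurements and process information,
# meta data describe them (all concrete names/values are hypothetical)
study_data <- data.frame(
  v00001 = c(120, 135, 118),          # a clinical measurement
  v00002 = c("obs1", "obs2", "obs1")  # process information: the examiner
)
meta_data <- data.frame(
  VAR_NAMES    = c("v00001", "v00002"),
  LABEL        = c("SBP_0", "OBSERVER_0"),
  KEY_OBSERVER = c("v00002", NA)  # KEY_ attribute linking to a process variable
)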
All implementations of the project were developed to be applied standalone or in a vectorized reporting pipeline.
So-called process variables store meta data about the measurement process. The content of these variables represents measurements and is therefore stored with the study data. The names of the variables to use are passed in a function argument. Process variable names can also be stored in the attributes of a study variable. Such variable attributes referring to other variables are usually prefixed by KEY_. Some such key attributes are listed in the table below. There is a wrapper function named pipeline_vectorized to automatically extract this information from the meta data and to provision parallel function calls with the respective function arguments. This is primarily used for calling functions of the dimension Accuracy for a set of variables at once, because this dimension's functions frequently require such process variables. For brevity, we here present pseudo-code:
my_function_4 <- function(resp_var,
                          group_vars,
                          study_data,
                          meta_data) {
  s_data <- study_data[, resp_var]
  group_data <- study_data[, group_vars]
  # ...
}
Calling this function using pipeline_vectorized would work as follows:
named_list_of_results <- pipeline_vectorized(
  fct = my_function_4,
  resp_vars = c("SBP_2", "DBP_2", "HF_2"),
  study_data = study_data,
  meta_data = meta_data,
  label_col = LABEL,
  args_from_meta = c(group_vars = KEY_OBSERVER),
  mc.cores = 4
)
# results are a named list of the univariate results:
named_list_of_results$SBP_2
named_list_of_results$HF_2
named_list_of_results$DBP_2
Later, this may also be extended by using classes for the variable-based function arguments:
my_function_5 <- function(resp_var,
                          group_vars,
                          study_data,
                          meta_data) {
  s_data <- study_data[, resp_var]
  if (inherits(group_vars, 'process_var_att')) {
    # look up the actual group variable name in the meta data attribute
    group_data <- study_data[, subset(meta_data, VAR_NAMES == resp_var,
                                      group_vars, drop = TRUE)]
  } else {
    group_data <- study_data[, group_vars]
  }
  # ...
}
proc_var <- function(x) {
  class(x) <- 'process_var_att'
  x
}
my_function_5(
  'SBP_0',
  proc_var('KEY_OBSERVER'),
  study_data,
  meta_data
)
Functions addressing items/variables can address one variable only (univariate). Additionally, they can address a set of variables (multivariate).
Functions are named according to their purpose:
Indicator functions: prefixes acc_, con_, com_, int_ (for the dimensions accuracy, consistency, completeness, and integrity)
Reporting and preparation functions: dq_report2, prep_, summary, print
Utility functions: util_
Functions that do not directly address QA issues but perform consistency checks, data preparation, pipelining, and other auxiliary tasks are called utility functions and are described in the section Use of Utility Functions.
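As an illustration, the naming convention can be inspected directly in the package namespace (assuming dataquieR is installed):

# list names in the namespace by prefix
ls(asNamespace("dataquieR"), pattern = "^acc_")   # Accuracy indicator functions
ls(asNamespace("dataquieR"), pattern = "^con_")   # Consistency indicator functions
ls(asNamespace("dataquieR"), pattern = "^util_")  # internal utility functions
ls(asNamespace("dataquieR"), pattern = "^prep_")  # data preparation functions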
# function calls:
my_function_1(resp_vars = colnames(study_data), co_vars = character(0),
              group_vars = NA, label_col = 'LABEL',
              study_data = study_data, meta_data = meta)
try(
  my_function_2(resp_vars = colnames(study_data), co_vars = character(0),
                group_vars = NA, label_col = 'LABEL',
                study_data = study_data, meta_data = meta) # expect to stop
)
R code should be structured as follows (derived from http://style.tidyverse.org/ (Hadley Wickham), and inspired by https://google.github.io/styleguide/Rguide.xml):
# required packages/code should be specified prior to user-defined functions ----
library(ggplot2)
# source required functions prior to the function definition --------------------
# such code will later be embedded in the R package
source("some_other_function.R")
my_function <- function(x, formal_1, formal_n) {
  # start with all checks that safeguard applicability of the function ----------
  if (missing(x) || length(x) == 0L || mode(x) != "numeric")
    stop("'x' must be a non-empty numeric vector")
  if (missing(formal_1) || missing(formal_n))
    stop("'formal_1' and 'formal_n' must be specified")
  # main body of the function ---------------------------------------------------
  x_mod <- ... # placeholder for the actual computation
  # call of nested function -----------------------------------------------------
  result <- some_other_function(x_mod)
  # the output -------------------------------------------------------------------
  return(result)
}
Since the targeted output is an R package (namely dataquieR), library and source should only be used during the internal drafting of code. In the R package, external libraries must be listed in the DESCRIPTION file of the package (generally in its Imports section) and can be imported to the package namespace using roxygen2 comments.
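For instance, a minimal sketch of such a roxygen2 comment could look as follows; the function name and body are illustrative only, and a matching ggplot2 entry in the Imports section of DESCRIPTION is still required:

# roxygen2 comment above a function definition; imports ggplot() and aes()
# from ggplot2 into the package namespace
#' @importFrom ggplot2 ggplot aes
my_plot_function <- function(ds1) {
  ggplot(ds1, aes(x = v00001)) # hypothetical example body
}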
Please also refer to https://cran.r-project.org/web/packages/policies.html
To ensure generic usability of R scripts, they should be organised in functions whose input arguments must not be handled in a static fashion. This is necessary because the names and the number of variables, meta data attributes, and process variables, as well as the names of the data frames, are not known a priori. All functions must be able to handle whatever variables and data sets are used, as long as these meet some structural preconditions as outlined above. A small sketch contrasting both styles follows the list below.
This comprises:
No hard coded variable names
No hard coded expected lengths of variable lists
No hard coded data frame names
No function embedded meta data attributes
No hard coded thresholds for decision making of quality assessments
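As announced above, a minimal sketch contrasting a hard-coded with a generic implementation; the function count_above and all variable names are illustrative only:

# hard coded -- avoid: variable name, data frame name and threshold are fixed
n_above <- sum(study_data$SBP_0 > 140, na.rm = TRUE)

# generic -- preferred: variables, data, and limits all enter as arguments
count_above <- function(resp_vars, study_data, limits) {
  vapply(resp_vars, function(v) {
    sum(study_data[[v]] > limits[[v]], na.rm = TRUE)
  }, numeric(1))
}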
To avoid misunderstandings: hard coded names of meta data attribute fields must be used to properly retrieve related information. All necessary information to run the scripts is transferred via an appropriate function call. There are some predefined names for formals that also provide metadata; these are described in the section Formals and arguments. These formals are filled by the Reporting Functions, if called from a pipeline.
We intentionally do not use the synonymous term “function parameters” to avoid ambiguities regarding the statistical term parameter related to probability distributions.
In the following, we give a table listing standardized function argument names. Functions can have additional arguments, but for the ones listed below, conventions exist. Also, these can be populated by Reporting Functions from the metadata when a report is being created.
Two formals are mandatory: in the table above, the two arguments study_data and meta_data are mandatory for all indicator functions. These are declared to be data frames, which are explained in the following.
The table with function arguments lists arguments mostly referring to study or process variables. There may be additional arguments such as certain threshold values, or arguments affecting the format of the generated output, like specific colors or fonts. The latter should not be part of the functions in the future, because the output, including ggplot2 plots, can be formatted later. The types of additional outputs depend on the specific use cases. Thresholds may also be generalized later, so using threshold arguments is discouraged in favor of returning filterable results.
All function arguments are user input, so they have to be verified carefully.
There are a number of utility functions for argument checking and preparation:
prep_prepare_dataframes
util_correct_variable_use
util_expect_data_frame
ls(asNamespace("dataquieR"), pattern = "^util.*")
For arguments referring to study variables, there is a family of utility functions: util_correct_variable_use and util_correct_variable_use2. These can check input arguments referring to variable names. Some examples:
util_correct_variable_use("resp_vars", # check function argument resp_vars
allow_null = TRUE, # allow resp_vars being NULL
allow_more_than_one = TRUE, # allow more than one entry in resp_vars
allow_any_obs_na = TRUE, # allow resp_vars in study_data contain NAs (see stats::na.fail)
need_type = "integer | float") # allow variabes of metadata-declared types integer or float
util_correct_variable_use("group_vars", # check function argument group_vars
allow_null = TRUE, # allow group_vars being NULL
allow_more_than_one = TRUE, # allow more than one entry in group_vars
allow_any_obs_na = TRUE, # allow group_vars in study_data contain NAs (see stats::na.fail)
need_type = "!float") # allow variabes of all possible metadata-declared types except float
Please refer to the full documentation of util_correct_variable_use / util_correct_variable_use2 for an exhaustive reference.
Note that the util_correct_variable_use* functions are utility functions and hence intended for package-internal use only. The package dataquieR does not export these functions; they will only be found if called from within that package, or if called explicitly with the discouraged ::: operator during development. When drafting functions, we recommend importing all used functions into the global environment as follows:
util_correct_variable_use <- dataquieR:::util_correct_variable_use
Checks for arguments not referring to variables can be performed using standard R functions such as is.numeric, na.fail, missing, is.null, length, stopifnot, and inherits. Be careful with is.integer: this function checks the declared storage type of a vector, not whether its values are actually whole numbers:
a <- 12
is.integer(a)
## [1] FALSE
b <- as.integer(12)
is.integer(b)
## [1] TRUE
a == b
## [1] TRUE
identical(a, b)
## [1] FALSE
str(a)
## num 12
str(b)
## int 12
Therefore, we have included a utility function as proposed in the manual page of is.integer, which is called util_is_integer. This function behaves as expected and returns TRUE also for the variable a from the example above. As for all utility functions, util_is_integer is not exported by the dataquieR package, but it can be accessed from functions within the package. Again, we recommend copying the function to the global environment when drafting a function without compiling the package.
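A minimal sketch of such a check, following the example in the manual page of is.integer; the actual util_is_integer in dataquieR may differ in its details:

# checks values rather than the declared storage type; NA handling and
# further details may differ from the real util_is_integer in dataquieR
util_is_integer <- function(x, tol = .Machine$double.eps^0.5) {
  is.numeric(x) && all(abs(x - round(x)) < tol, na.rm = TRUE)
}
util_is_integer(12)    # TRUE, although is.integer(12) is FALSE
util_is_integer(12.5)  # FALSE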
my_function_1 <- function(resp_vars,  # vector of response variables, i.e. each of
                                      # these variables is analysed
                          co_vars,    # vector of additional variables used for
                                      # adjustment or similar
                          group_vars, # CAVE: currently only one grouping variable
                          label_col,  # meta data variable attribute to use for
                                      # naming variables in the output
                          study_data, # data frame of study records
                          meta_data   # data frame of meta data attributes
                          ) {
  # Replace the column names of the data in "study_data" by the corresponding short
  # variable labels. This step ensures comprehensible output. Convention: not more
  # than 20 characters. "meta_data" must provide a row for each column in
  # "study_data", and a unique, alphanumeric label must be contained.
  translations <- setNames(meta_data[[label_col]],
                           nm = meta_data$VAR_NAMES)      # named vector translating
                                                          # names to labels
  translationEnv <- as.environment(as.list(translations)) # convert it to an
                                                          # environment for mget
  translated <- mget(colnames(study_data), translationEnv) # translated column labels
  ds1 <- study_data                   # do not modify the original data frame
  colnames(ds1) <- unlist(translated) # use the translated labels as new column names
  r <- lapply(seq_along(ds1), function(v) {
    sum(meta_data[v, "INCL_SOFT_LIMIT_UP"] < ds1[, v])
  })
  names(r) <- colnames(ds1)
  r <- simplify2array(r)
  return(r)
}
The mapping of variable names to meta data variable labels is performed by the utility function util_prepare_dataframes, which can be used like a C macro. After util_prepare_dataframes has been called without arguments from a function that follows the conventions listed here, a new object named ds1 is created in the calling function's local environment. Using this, the function above looks as follows:
my_function_1b <- function(resp_vars,  # vector of response variables, i.e. each of
                                       # these variables is analysed
                           co_vars,    # vector of additional variables used for
                                       # adjustment or similar
                           group_vars, # CAVE: currently only one grouping variable
                           label_col,  # meta data variable attribute to use for
                                       # naming variables in the output
                           study_data, # data frame of study records
                           meta_data   # data frame of meta data attributes
                           ) {
  util_prepare_dataframes()
  r <- lapply(seq_along(ds1), function(v) {
    sum(meta_data[v, "INCL_SOFT_LIMIT_UP"] < ds1[, v])
  })
  names(r) <- colnames(ds1)
  r <- simplify2array(r)
  return(r)
}
Note that util_prepare_dataframes is a utility function and hence intended for package-internal use only. The package dataquieR does not export that function; it will only be found if called from within that package, or if called explicitly with the discouraged ::: operator during development. When drafting functions, we recommend importing all used functions into the global environment as follows:
util_prepare_dataframes <- dataquieR:::util_prepare_dataframes
Once a function has been integrated into dataquieR, it will find the package-internal functions without any tweaks.
Please refer to the full documentation of util_prepare_dataframes for an exhaustive reference.
my_function_2 <- function(resp_vars,  # vector of response variables, i.e. each of
                                      # these variables is analysed
                          co_vars,    # vector of additional variables used for
                                      # adjustment or similar
                          group_vars, # CAVE: currently only one grouping variable
                          label_col,  # meta data variable attribute to use for
                                      # naming variables in the output
                          study_data, # data frame of study records
                          meta_data   # data frame of meta data attributes
                          ) {
  ## in case of a function that handles one variable at once:
  if (length(resp_vars) > 1)
    stop("my_function_2 cannot handle more than one variable at once.")
  # ...
}
All functions should carefully check all their input and abort execution with understandable error messages if some preconditions are not met. To cover the most common cases, some utility functions have been implemented (util_prepare_dataframes and util_correct_variable_use). util_prepare_dataframes checks, for the function it has been called from, whether the mandatory standard function arguments study_data and meta_data provide the expected valid data and whether these two data frames match. util_correct_variable_use can be called for each argument referring to one or more variables by their names. It can be parameterised to check for the most common mistakes, e.g. too few or too many variable names, or referred variables of unsuitable data types.
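For arguments not covered by these utilities, plain base R checks as introduced above suffice; a minimal sketch, where the function name my_check_demo is hypothetical:

# a hypothetical function demonstrating a base R check for the standard
# argument label_col, which refers to a meta data column, not to variables
my_check_demo <- function(label_col) {
  if (missing(label_col) || !is.character(label_col) ||
      length(label_col) != 1L || is.na(label_col)) {
    stop("'label_col' must be a single character value")
  }
  label_col
}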
There are more utility functions beyond the two mentioned in the section Checks to be performed / robustness. All internal utility functions should be prefixed by util_. The util_ functions will not be exported by the R package, because they are not intended to be used by end users directly. Because users of the functions will also need some utility functions for processing data and generating quality reports, there are two more prefixes, namely prep_ for general data processing and pipe_ for functionality related to automated report generation.
Documentation in this project is function specific and depends on whether the user is expected to edit the code.
Please refer to roxygen2's package documentation, the R documentation about packages, and the roxygen2 vignettes.
Documentation of all exported Data Quality Indicator Implementations should be machine readable by Square2.
Exported dataquieR functions will mostly be used by the users and therefore have two routes for documentation:
Vignettes (on the website only): tutorial style, RMarkdown.
R manual (roxygen2).
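For the R manual route, a minimal roxygen2 skeleton for an exported indicator function might look like this; the function name and all descriptions are illustrative only:

#' Illustrative indicator function (hypothetical)
#'
#' @param resp_vars   the names of the response variables
#' @param study_data  data frame of study records
#' @param meta_data   data frame of meta data attributes
#' @param label_col   meta data attribute used to label variables in the output
#'
#' @return a named list with SummaryTable and SummaryPlot
#' @export
my_indicator <- function(resp_vars, study_data, meta_data, label_col) {
  # ...
}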
The structure of study data has to comply with the following conventions to be applicable in our framework:
Study data is usually stored in tables (in R, we use instances of the class data.frame, i.e. data frames).
Study data frames have one sample/patient per row and one variable per column. This corresponds to a "wide format". Conversion from long/narrow format to wide format can be performed in R using several packages (see the sketch after this list).
The column headers of study data frames are variable names.
Variable names must be unique.
Variable names do not contain blanks or other non-alphanumeric characters except for dots and underscores. They do not start with non-alphanumeric characters.
In case of repeated measurements, the names of variables measured repeatedly should receive a suffix indicating the measurement order (e.g., blood_01, blood_02, blood_03).
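As referenced in the list above, a minimal sketch of a long-to-wide conversion using base R's reshape; the data and variable names are hypothetical:

# long format: one row per subject and visit (hypothetical data)
long <- data.frame(id = c(1, 1, 2, 2),
                   visit = c(1, 2, 1, 2),
                   blood = c(120, 125, 110, 118))
# wide format: one row per subject, repeated measurements as suffixed columns
wide <- reshape(long, idvar = "id", timevar = "visit",
                v.names = "blood", direction = "wide", sep = "_0")
# yields columns id, blood_01, blood_02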
Meta data are arguments for the indicator functions. They are provided to these functions as meta data frames in their function argument meta_data. For functions that handle only one variable at once, the structure of the meta data is identical to that for multivariate functions. All functions extract the relevant columns from the full meta data frame provided to them.1 For further details, see the specific examples below.
The output of a data quality function must contain the following elements:
The data quality related results as text, graph or table.
If possible, machine-readable output of the data underlying the results (particularly for graphs), preferably in the form of a data frame.
It is desirable not to implement a new function for each output option. Returned data frames as well as ggplot2-based graphics can be modified and laid out later.
All DQI functions return named lists.
If unavoidable, we accept function parameters to control the output.
The output of the functions is given as a named R list. The following names are used:
SummaryTable
SummaryPlot: a ggplot2 graph visualizing the results
These will be amended by:
DQvalue
If a function provides specific output for a set of response variables (resp_vars missing or a vector), these specific outputs should be elements in the list, named by the VAR_NAMES. Additionally, such functions can still provide a SummaryTable and/or a SummaryPlot for all variables. Also, a summary DQvalue should be available.
As an example, a function may generate the data frame below as its primary result:
df1
This data frame can then be used for a respective graph and both results are returned:
# call ggplot
p1 <- ggplot(df1, aes(x = x1, y = y_prob)) +
  theme_bw() +
  geom_bar(aes(fill = cave), stat = "identity") +
  scale_fill_manual(values = c("#2166AC", "#B2182B"), guide = "none") +
  geom_errorbar(aes(ymin = lcl, ymax = ucl), width = 0.1) +
  geom_line(data = df2, aes(x = x2, y = y_line, color = "#E69F00"),
            linewidth = 2) +
  scale_color_manual(values = c("#E69F00"), guide = "none")
return(list(SummaryTable = df1, SummaryPlot = p1))
See Color concept.
Data quality related output should:
not be too extensive (please create tailored ggplots)
allow for an overview over all checked data structures (e.g. variables)
allow for an overview over all checked data structures with a data quality finding
allow for an overview over all checked data structures with a data quality finding, crossing a defined threshold
use space as efficiently as possible
allow for an understanding of tables or graphs without using other information sources
Function arguments in R: technically, R passes arguments by reference but employs copy-on-write, making arguments look as if they were passed by value. Therefore, passing around large constant data frames is usually not a performance problem, except for specific forms of parallel computing.↩︎