Abstract
This document is currently under heavy revision, so please contact us if you want to contribute. Nevertheless, it provides guidance on writing R functions for data quality assessments, ensuring exchangeable R code through a homogeneous structure of functions and data. To this end, conventions for the input and output of data are defined, naming conventions are introduced, and documentation requirements are stated. In addition, we provide recommendations regarding (machine-readable) output and visualization.
Code developed in teams, especially if intended to be used by a larger community, needs to be comprehensible for all team members and, ideally, for anyone trying to use that code. Here we provide some conventions that resulted from the first project phase.
Please note that this document is work in progress, and some sections are not complete yet. Stay tuned for the updates.
The concept distinguishes two types of data sources:
Study data:
Clinical data: Measurements (organized within variables) intended to be subject to data quality assessments
Process information: all data providing information on the measurement process, such as time, ambient variables, the respective device or examiner.
Meta data:
Contain the expected characteristics of study data on different levels, e.g., on the level of each variable: labels, limits, or missing codes. They also contain the allocation of the respective process information organized in variables.
Further tables referencing descriptions such as labels of missing codes.
For further information see Richter et al.
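To make the distinction concrete, here is a minimal sketch of the two data frame types. All concrete variable names and values are hypothetical; the meta data attributes VAR_NAMES, LABEL, and KEY_OBSERVER are the ones used in the examples later in this document:

# minimal sketch: study data hold measurements and process information,
# meta data describe them (all concrete names/values are hypothetical)
study_data <- data.frame(
  v00001 = c(120, 135, 118),          # a clinical measurement
  v00002 = c("obs1", "obs2", "obs1")  # process information: the examiner
)
meta_data <- data.frame(
  VAR_NAMES    = c("v00001", "v00002"),
  LABEL        = c("SBP_0", "OBSERVER_0"),
  KEY_OBSERVER = c("v00002", NA)  # KEY_ attribute linking to a process variable
)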
All implementations of the project were developed to be applied standalone or in a vectorized reporting pipeline.
So-called process variables store meta data about the measurement process. The content of these variables represents measurements and is therefore stored with the study data. The names of the variables to use are passed in a function argument. Process variable names can also be stored in the attributes of a study variable. Such variable attributes referring to other variables are usually prefixed by KEY_. Some such key attributes are listed in the table below. There is a wrapper function named pipeline_vectorized to automatically extract this information from the meta data and to provision parallel function calls with the respective function arguments. This is primarily used for calling functions of the dimension Accuracy for a set of variables at once, because this dimension's functions frequently require such process variables. For brevity, we here present pseudo-code:
my_function_4 <- function(resp_var,
                          group_vars,
                          study_data,
                          meta_data) {
  s_data <- study_data[, resp_var]
  group_data <- study_data[, group_vars]
  # ...
}
Calling this function using pipeline_vectorized would work as follows:
named_list_of_results <- pipeline_vectorized(
  fct = my_function_4,
  resp_vars = c("SBP_2", "DBP_2", "HF_2"),
  study_data = study_data,
  meta_data = meta_data,
  label_col = LABEL,
  args_from_meta = c(group_vars = KEY_OBSERVER),
  mc.cores = 4
)
# results are a named list of the univariate results:
named_list_of_results$SBP_2
named_list_of_results$HF_2
named_list_of_results$DBP_2
Later, this may also be extended by using classes for the variable-based function arguments:
my_function_5 <- function(resp_var,
                          group_vars,
                          study_data,
                          meta_data) {
  s_data <- study_data[, resp_var]
  if (inherits(group_vars, 'process_var_att')) {
    # look up the actual group variable name in the meta data attribute
    group_data <- study_data[, subset(meta_data, VAR_NAMES == resp_var,
                                      group_vars, drop = TRUE)]
  } else {
    group_data <- study_data[, group_vars]
  }
  # ...
}
proc_var <- function(x) {
  class(x) <- 'process_var_att'
  x
}
my_function_5(
  'SBP_0',
  proc_var('KEY_OBSERVER'),
  study_data,
  meta_data
)
Functions addressing items/variables can address one variable only (univariate). Additionally, they can address a set of variables (multivariate).
Functions are named according to their purpose:
Indicator functions: prefixes acc_, con_, com_, int_ (for the dimensions accuracy, consistency, completeness, and integrity)
Reporting and preparation functions: dq_report2, prep_, summary, print
Utility functions: util_
Functions that do not directly address QA issues but perform consistency checks, data preparation, pipelining, and other auxiliary tasks are called utility functions and are described in the section Use of Utility Functions.
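As an illustration, the naming convention can be inspected directly in the package namespace (assuming dataquieR is installed):

# list names in the namespace by prefix
ls(asNamespace("dataquieR"), pattern = "^acc_")   # Accuracy indicator functions
ls(asNamespace("dataquieR"), pattern = "^con_")   # Consistency indicator functions
ls(asNamespace("dataquieR"), pattern = "^util_")  # internal utility functions
ls(asNamespace("dataquieR"), pattern = "^prep_")  # data preparation functions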
# function calls:
my_function_1(resp_vars = colnames(study_data), co_vars = character(0),
              group_vars = NA, label_col = 'LABEL',
              study_data = study_data, meta_data = meta)
try(
  my_function_2(resp_vars = colnames(study_data), co_vars = character(0),
                group_vars = NA, label_col = 'LABEL',
                study_data = study_data, meta_data = meta) # expect to stop
)
R code should be structured as follows (derived from http://style.tidyverse.org/ (Hadley Wickham), and inspired by https://google.github.io/styleguide/Rguide.xml):
# required packages/code should be specified prior to user-defined functions ----
library(ggplot2)
# source required functions prior to the function definition --------------------
# such code will later be embedded in the R package
source("some_other_function.R")
my_function <- function(x, formal_1, formal_n) {
  # start with all checks that safeguard applicability of the function ----------
  if (missing(x) || length(x) == 0L || mode(x) != "numeric")
    stop("'x' must be a non-empty numeric vector")
  if (missing(formal_1) || missing(formal_n))
    stop("'formal_1' and 'formal_n' must be specified")
  # main body of the function ---------------------------------------------------
  x_mod <- ... # placeholder for the actual computation
  # call of nested function -----------------------------------------------------
  result <- some_other_function(x_mod)
  # the output -------------------------------------------------------------------
  return(result)
}
Since the targeted output is an R package (namely dataquieR), library and source should only be used during the internal drafting of code. In the R package, external libraries must be listed in the DESCRIPTION file of the package (generally in its Imports section) and can be imported to the package namespace using roxygen2 comments.
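For instance, a minimal sketch of such a roxygen2 comment could look as follows; the function name and body are illustrative only, and a matching ggplot2 entry in the Imports section of DESCRIPTION is still required:

# roxygen2 comment above a function definition; imports ggplot() and aes()
# from ggplot2 into the package namespace
#' @importFrom ggplot2 ggplot aes
my_plot_function <- function(ds1) {
  ggplot(ds1, aes(x = v00001)) # hypothetical example body
}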
Please also refer to https://cran.r-project.org/web/packages/policies.html
To ensure generic usability of R scripts, they should be organised in functions whose input arguments must not be handled in a static fashion. This is necessary because the names and the number of variables, meta data attributes, and process variables, as well as the names of the data frames, are not known a priori. All functions must be able to handle whatever variables and data sets are used, as long as these meet some structural preconditions as outlined above. A small sketch contrasting both styles follows the list below.
This comprises:
No hard coded variable names
No hard coded expected lengths of variable lists
No hard coded data frame names
No function embedded meta data attributes
No hard coded thresholds for decision making of quality assessments
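As announced above, a minimal sketch contrasting a hard-coded with a generic implementation; the function count_above and all variable names are illustrative only:

# hard coded -- avoid: variable name, data frame name and threshold are fixed
n_above <- sum(study_data$SBP_0 > 140, na.rm = TRUE)

# generic -- preferred: variables, data, and limits all enter as arguments
count_above <- function(resp_vars, study_data, limits) {
  vapply(resp_vars, function(v) {
    sum(study_data[[v]] > limits[[v]], na.rm = TRUE)
  }, numeric(1))
}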
To avoid misunderstandings: hard coded names of meta data attribute fields must be used to properly retrieve related information. All necessary information to run the scripts is transferred via an appropriate function call. There are some predefined names for formals that also provide metadata; these are described in the section Formals and arguments. These formals are filled by the Reporting Functions, if called from a pipeline.
We intentionally do not use the synonymous term “function parameters” to avoid ambiguities regarding the statistical term parameter related to probability distributions.
In the following, we give a table listing standardized function argument names. Functions can have additional arguments, but for the ones listed below, conventions exist. Also, these can be populated by Reporting Functions from the metadata when a report is being created.
Two formals are mandatory: in the table above, the two arguments study_data and meta_data are mandatory for all indicator functions. These are declared to be data frames, which are explained in the following.
The table with function arguments lists arguments mostly referring to study or process variables. There may be additional arguments such as certain threshold values, or arguments affecting the format of the generated output, like specific colors or fonts. The latter should not be part of the functions in the future, because the output, including ggplot2 plots, can be formatted later. The types of additional outputs depend on the specific use cases. Thresholds may also be generalized later, so using threshold arguments is discouraged in favor of returning filterable results.
All function arguments are user input, so they have to be verified carefully.
There are a number of utility functions for argument checking and preparation:
prep_prepare_dataframes
util_correct_variable_use
util_expect_data_frame
ls(asNamespace("dataquieR"), pattern = "^util.*")
For arguments referring to study variables, there is a family of utility functions: util_correct_variable_use and util_correct_variable_use2. These can check input arguments referring to variable names. Some examples:
util_correct_variable_use("resp_vars", # check function argument resp_vars
allow_null = TRUE, # allow resp_vars being NULL
allow_more_than_one = TRUE, # allow more than one entry in resp_vars
allow_any_obs_na = TRUE, # allow resp_vars in study_data contain NAs (see stats::na.fail)
need_type = "integer | float") # allow variabes of metadata-declared types integer or float
util_correct_variable_use("group_vars", # check function argument group_vars
allow_null = TRUE, # allow group_vars being NULL
allow_more_than_one = TRUE, # allow more than one entry in group_vars
allow_any_obs_na = TRUE, # allow group_vars in study_data contain NAs (see stats::na.fail)
need_type = "!float") # allow variabes of all possible metadata-declared types except float
Please refer to the full documentation of util_correct_variable_use / util_correct_variable_use2 for an exhaustive reference.
Note that the util_correct_variable_use* functions are utility functions and hence intended for package-internal use only. The package dataquieR does not export these functions; they will only be found if called from within that package, or if called explicitly with the discouraged ::: operator during development. When drafting functions, we recommend importing all used functions into the global environment as follows:
util_correct_variable_use <- dataquieR:::util_correct_variable_use
Checks for arguments not referring to variables can be performed using standard R functions such as is.numeric, na.fail, missing, is.null, length, stopifnot, and inherits. Be careful with is.integer: this function checks the declared storage type of a vector, not whether its values are actually whole numbers:
a <- 12
is.integer(a)
## [1] FALSE
b <- as.integer(12)
is.integer(b)
## [1] TRUE
a == b
## [1] TRUE
identical(a, b)
## [1] FALSE
str(a)
## num 12
str(b)
## int 12
Therefore, we have included a utility function as proposed in the manual page of is.integer, which is called util_is_integer. This function behaves as expected and returns TRUE also for the variable a from the example above. As for all utility functions, util_is_integer is not exported by the dataquieR package, but it can be accessed from functions within the package. Again, we recommend copying the function to the global environment when drafting a function without compiling the package.
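A minimal sketch of such a check, following the example in the manual page of is.integer; the actual util_is_integer in dataquieR may differ in its details:

# checks values rather than the declared storage type; NA handling and
# further details may differ from the real util_is_integer in dataquieR
util_is_integer <- function(x, tol = .Machine$double.eps^0.5) {
  is.numeric(x) && all(abs(x - round(x)) < tol, na.rm = TRUE)
}
util_is_integer(12)    # TRUE, although is.integer(12) is FALSE
util_is_integer(12.5)  # FALSE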
my_function_1 <- function(resp_vars,  # vector of response variables, i.e. each of
                                      # these variables is analysed
                          co_vars,    # vector of additional variables used for
                                      # adjustment or similar
                          group_vars, # CAVE: currently only one grouping variable
                          label_col,  # meta data variable attribute to use for
                                      # naming variables in the output
                          study_data, # data frame of study records
                          meta_data   # data frame of meta data attributes
                          ) {
  # Replace the column names of the data in "study_data" by the corresponding short
  # variable labels. This step ensures comprehensible output. Convention: not more
  # than 20 characters. "meta_data" must provide a row for each column in
  # "study_data", and a unique, alphanumeric label must be contained.
  translations <- setNames(meta_data[[label_col]],
                           nm = meta_data$VAR_NAMES)      # named vector translating
                                                          # names to labels
  translationEnv <- as.environment(as.list(translations)) # convert it to an
                                                          # environment for mget
  translated <- mget(colnames(study_data), translationEnv) # translated column labels
  ds1 <- study_data                   # do not modify the original data frame
  colnames(ds1) <- unlist(translated) # use the translated labels as new column names
  r <- lapply(seq_along(ds1), function(v) {
    sum(meta_data[v, "INCL_SOFT_LIMIT_UP"] < ds1[, v])
  })
  names(r) <- colnames(ds1)
  r <- simplify2array(r)
  return(r)
}
The mapping of variable names to meta data variable labels is performed by the utility function util_prepare_dataframes, which can be used like a C macro. After util_prepare_dataframes has been called without arguments from a function that follows the conventions listed here, a new object named ds1 is created in the calling function's local environment. Using this, the function above looks as follows:
my_function_1b <- function(resp_vars,  # vector of response variables, i.e. each of
                                       # these variables is analysed
                           co_vars,    # vector of additional variables used for
                                       # adjustment or similar
                           group_vars, # CAVE: currently only one grouping variable
                           label_col,  # meta data variable attribute to use for
                                       # naming variables in the output
                           study_data, # data frame of study records
                           meta_data   # data frame of meta data attributes
                           ) {
  util_prepare_dataframes()
  r <- lapply(seq_along(ds1), function(v) {
    sum(meta_data[v, "INCL_SOFT_LIMIT_UP"] < ds1[, v])
  })
  names(r) <- colnames(ds1)
  r <- simplify2array(r)
  return(r)
}
Note that util_prepare_dataframes is a utility function and hence intended for package-internal use only. The package dataquieR does not export that function; it will only be found if called from within that package, or if called explicitly with the discouraged ::: operator during development. When drafting functions, we recommend importing all used functions into the global environment as follows:
util_prepare_dataframes <- dataquieR:::util_prepare_dataframes
Once a function has been integrated into dataquieR, it will find the package-internal functions without any tweaks.
Please refer to the full documentation of util_prepare_dataframes for an exhaustive reference.
my_function_2 <- function(resp_vars,  # vector of response variables, i.e. each of
                                      # these variables is analysed
                          co_vars,    # vector of additional variables used for
                                      # adjustment or similar
                          group_vars, # CAVE: currently only one grouping variable
                          label_col,  # meta data variable attribute to use for
                                      # naming variables in the output
                          study_data, # data frame of study records
                          meta_data   # data frame of meta data attributes
                          ) {
  ## in case of a function that handles one variable at once:
  if (length(resp_vars) > 1)
    stop("my_function_2 cannot handle more than one variable at once.")
  # ...
}
All functions should carefully check all their input and abort execution with understandable error messages if some preconditions are not met. To cover the most common cases, some utility functions have been implemented (util_prepare_dataframes and util_correct_variable_use). util_prepare_dataframes checks, for the function it has been called from, whether the mandatory standard function arguments study_data and meta_data provide the expected valid data and whether these two data frames match. util_correct_variable_use can be called for each argument referring to one or more variables by their names. It can be parameterised to check for the most common mistakes, e.g. too few or too many variable names, or referred variables of unsuitable data types.
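For arguments not covered by these utilities, plain base R checks as introduced above suffice; a minimal sketch, where the function name my_check_demo is hypothetical:

# a hypothetical function demonstrating a base R check for the standard
# argument label_col, which refers to a meta data column, not to variables
my_check_demo <- function(label_col) {
  if (missing(label_col) || !is.character(label_col) ||
      length(label_col) != 1L || is.na(label_col)) {
    stop("'label_col' must be a single character value")
  }
  label_col
}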
There are more utility functions beyond the two mentioned in the section Checks to be performed / robustness. All internal utility functions should be prefixed by util_. The util_ functions will not be exported by the R package, because they are not intended to be used by end users directly. Because users of the functions will also need some utility functions for processing data and generating quality reports, there are two more prefixes, namely prep_ for general data processing and pipe_ for functionality related to automated report generation.
Documentation in this project is function specific and depends on whether the user is expected to edit the code.
Please refer to roxygen2's package documentation, the R documentation about packages, and the roxygen2 vignettes.
Documentation of all exported Data Quality Indicator Implementations should be machine readable by Square2.
Exported dataquieR functions will mostly be used by the users and therefore have two routes for documentation:
Vignettes (on the website only): tutorial style, RMarkdown.
R manual (roxygen2).
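For the R manual route, a minimal roxygen2 skeleton for an exported indicator function might look like this; the function name and all descriptions are illustrative only:

#' Illustrative indicator function (hypothetical)
#'
#' @param resp_vars   the names of the response variables
#' @param study_data  data frame of study records
#' @param meta_data   data frame of meta data attributes
#' @param label_col   meta data attribute used to label variables in the output
#'
#' @return a named list with SummaryTable and SummaryPlot
#' @export
my_indicator <- function(resp_vars, study_data, meta_data, label_col) {
  # ...
}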
The structure of study data has to comply with the following conventions to be applicable in our framework:
Study data is usually stored in tables (in R, we use instances of the class data.frame, i.e. data frames).
Study data frames have one sample/patient per row and one variable per column. This corresponds to a "wide format". Conversion from long/narrow format to wide format can be performed in R using several packages (see the sketch after this list).
The column headers of study data frames are variable names.
Variable names must be unique.
Variable names do not contain blanks or other non-alphanumeric characters except for dots and underscores. They do not start with non-alphanumeric characters.
In case of repeated measurements, the names of variables measured repeatedly should receive a suffix indicating the measurement order (e.g., blood_01, blood_02, blood_03).
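As referenced in the list above, a minimal sketch of a long-to-wide conversion using base R's reshape; the data and variable names are hypothetical:

# long format: one row per subject and visit (hypothetical data)
long <- data.frame(id = c(1, 1, 2, 2),
                   visit = c(1, 2, 1, 2),
                   blood = c(120, 125, 110, 118))
# wide format: one row per subject, repeated measurements as suffixed columns
wide <- reshape(long, idvar = "id", timevar = "visit",
                v.names = "blood", direction = "wide", sep = "_0")
# yields columns id, blood_01, blood_02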
Meta data are arguments for the indicator functions. They are provided to these functions as meta data frames in their function argument meta_data. For functions that handle only one variable at once, the structure of the meta data is identical to that for multivariate functions. All functions extract the relevant columns from the full meta data frame provided to them.1 For further details, see the specific examples below.
The output of a data quality function must contain the following elements:
The data quality related results as text, graph or table.
If possible, machine-readable output of the data underlying the results (particularly for graphs), preferably in the form of a data frame.
It is desirable not to implement a new function for each output option. Returned data frames as well as ggplot2-based graphics can be modified and laid out later.
All DQI functions return named lists.
If unavoidable, we accept function parameters to control the output.
The output of the functions is given as a named R list. The following names are used:
SummaryTable
SummaryPlot: a ggplot2 graph visualizing the results
These will be amended by:
DQvalue
If a function provides specific output for a set of response variables (resp_vars missing or a vector), these specific outputs should be elements in the list, named by the VAR_NAMES. Additionally, such functions can still provide a SummaryTable and/or a SummaryPlot for all variables. Also, a summary DQvalue should be available.
As an example, a function may generate the data frame below as its primary result:
df1
This data frame can then be used for a respective graph and both results are returned:
# call ggplot
p1 <- ggplot(df1, aes(x = x1, y = y_prob)) +
  theme_bw() +
  geom_bar(aes(fill = cave), stat = "identity") +
  scale_fill_manual(values = c("#2166AC", "#B2182B"), guide = "none") +
  geom_errorbar(aes(ymin = lcl, ymax = ucl), width = 0.1) +
  geom_line(data = df2, aes(x = x2, y = y_line, color = "#E69F00"),
            linewidth = 2) +
  scale_color_manual(values = c("#E69F00"), guide = "none")
return(list(SummaryTable = df1, SummaryPlot = p1))
See Color concept.
Data quality related output should:
not be too extensive (please create tailored ggplots)
allow for an overview over all checked data structures (e.g. variables)
allow for an overview over all checked data structures with a data quality finding
allow for an overview over all checked data structures with a data quality finding, crossing a defined threshold
use space as efficiently as possible
allow for an understanding of tables or graphs without using other information sources
Function arguments in R: technically, R passes arguments by reference but employs copy-on-write, making arguments look as if they were passed by value. Therefore, passing around large constant data frames is usually not a performance problem, except for specific forms of parallel computing.↩︎