Several automated data quality assessments can be created
simultaneously using the function dq_report_by from the R
package dataquieR. This tutorial explains how to specify the criteria
for splitting the data into several reports, and how to define
the required study data (i.e., the collected data) and
metadata (i.e., information and requirements about the
study data).
The function dq_report_by can use multi-level metadata, which can be specified in two ways:
- in an Excel workbook with multiple spreadsheets, each containing one
  metadata table. These spreadsheets are named following dataquieR
  conventions: “item_level”, “cross-item_level”, “segment_level”,
  “dataframe_level”, “item_computation_level”. This type of metadata is
  passed through the function argument meta_data_v2;
- as separate metadata tables, using the following function arguments:
  - meta_data, for the “item_level” metadata;
  - meta_data_segment, for the “segment_level” metadata;
  - meta_data_dataframe, for the “dataframe_level” metadata;
  - meta_data_cross_item, for the “cross-item_level” metadata;
  - meta_data_item_computation, for the “item_computation_level” metadata;
  - missing_tables, for the table(s) containing the missing codes.

Study data can be provided in two main ways in this function.
study_data
The argument study_data can be used to specify one table
or multiple tables containing the collected data. They can be specified
using:
a path containing the file name and the file extension
study_data = "~/Desktop/sd1.xlsx", ...
And in case of multiple files, it can be a vector of several paths
study_data = c("~/Desktop/sd1.xlsx","~/Desktop/sd2.xlsx"), ...
a URL
study_data = "https://.../study_data.xlsx", ...
an object in R
study_data = sd1, ...
just the file name and the extension
In this case, the additional argument input_dir is mandatory to
state where the file is located.
study_data = "study_data.xlsx", input_dir = "~/Desktop/data/", ...
just the table name
This is a special case that works only with study data and metadata examples available from the dataquieR website.
study_data = "study_data" #for the synthetic example
study_data = "ship" #for the SHIP-based example
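For the variant that combines a bare file name with input_dir, the following base R sketch shows how the two pieces form a full path. It is only an illustration: the helper name resolve_study_data is hypothetical, and the code is not taken from dataquieR's internals.

```r
# Hypothetical helper (not part of dataquieR): join a directory and one or
# more bare file names into full paths, tolerating a trailing slash.
resolve_study_data <- function(study_data, input_dir) {
  file.path(sub("/+$", "", input_dir), study_data)
}

# One file
resolve_study_data("study_data.xlsx", "~/Desktop/data/")

# Several files at once (file.path is vectorized)
resolve_study_data(c("sd1.csv", "sd2.xlsx"), "~/Desktop/data/")
```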
Alternatively, the study_data argument is not used, and the single
table or the several tables containing the data are listed in the
dataframe_level metadata, in the column DF_NAME, one row per table,
as in the following example.
DF_NAME | DF_ELEMENT_COUNT | DF_ID_VARS |
---|---|---|
~/Desktop/data/sd1.csv | 10 | v00001 |
~/Desktop/data/sd2.xlsx | 12 | v00001 \| ID |
https://exampleURL/data/sd3.xlsx | 20 | v00001 |
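If the dataframe_level metadata is built directly in R rather than in a spreadsheet, the table above could be written as a data frame along the following lines. This is a sketch under assumptions: reading the second row as two ID variables stored in one cell and separated by " | " is an interpretation of the table, and the paths and URL are the placeholder values from the example.

```r
# dataframe_level metadata as a plain data frame, one row per study data table
meta_data_dataframe <- data.frame(
  DF_NAME = c("~/Desktop/data/sd1.csv",
              "~/Desktop/data/sd2.xlsx",
              "https://exampleURL/data/sd3.xlsx"),
  DF_ELEMENT_COUNT = c(10, 12, 20),
  # Assumption: several ID variables share one cell, separated by " | "
  DF_ID_VARS = c("v00001", "v00001 | ID", "v00001"),
  stringsAsFactors = FALSE
)

# Split the ID cells to inspect the individual ID variables per table
id_vars_per_table <- strsplit(meta_data_dataframe$DF_ID_VARS, " | ", fixed = TRUE)
id_vars_per_table
```

Such an object would then be handed over through the meta_data_dataframe argument described earlier.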
If you want to provide only the table name and the extension in the
dataframe_level metadata (as in the following example),
then the additional argument input_dir is mandatory.
input_dir = "~/Desktop/data/", ...
DF_NAME | DF_ELEMENT_COUNT | DF_ID_VARS |
---|---|---|
sd1.csv | 10 | v00001 |
sd2.xlsx | 12 | v00001 \| ID |
Resulting data quality reports can be stored in 2 ways.
The reports can be available inside R, in the Viewer panel. From there they can be opened in a browser. This is achieved using the argument:
also_print = FALSE
Reports can be saved in a folder on your device. In this case, an
overview page containing a list of all created reports and a summary of
the main data quality issues is provided. To save the reports in a folder
(for example “Reports_july”), two arguments are needed: also_print
set to TRUE, and output_dir to define the folder to create and its path
(note: the folder must not already exist):
also_print = TRUE, output_dir = "~/Desktop/Reports_july/", ...
You will then have a new folder “Reports_july” containing all the
reports. To open the overview page with the links to all the reports,
open the file index.html in that folder.
(Figure: example of the overview page.)
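Because the output folder must not already exist, it can help to check this before starting a long report run. The following base R sketch does so. It assumes a folder under tempdir() purely so that the example does not touch your Desktop, and the call to dq_report_by is only indicated in a comment.

```r
# Assumed location for illustration; replace with your own path,
# e.g. "~/Desktop/Reports_july/"
output_dir <- file.path(tempdir(), "Reports_july")

# Stop early if the folder is already there, since the reports
# are expected to be written into a newly created folder
if (dir.exists(output_dir)) {
  stop("Folder already exists, choose another output_dir: ", output_dir)
}

# dq_report_by(..., also_print = TRUE, output_dir = output_dir)
```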
Using the arguments listed so far, it is possible to create multiple reports. By default, if nothing is specified about how to separate the data into several reports, the column “STUDY_SEGMENT” in the item_level metadata is used automatically.
Note: If you do not want any separation, you can obtain a
single report using segment_column = NULL.
Here is an example with the synthetic data of dataquieR.
dq_report_by(
study_data = "study_data",
meta_data_v2 = "meta_data_v2",
output_dir = "/REPORT_by_synthetic_data/",
also_print = TRUE,
dimensions = "Completeness")
This will create 5 reports, one per STUDY_SEGMENT in the
item_level metadata.
Attention: The argument dimensions = "Completeness" limits the report
to the Integrity and Completeness dimensions (see the underlying data
quality concept) and the descriptive statistics. To obtain complete
reports (including all possible dimensions), the argument needs to be
set to NULL:
dimensions = NULL
Other arguments are available to define the reports to create.
To define how the variables are grouped into several reports (segments), there are the following arguments:
- segment_column: name of an item_level metadata attribute (i.e., column)
  usable to create reports for groups of variables, e.g., Blood_test,
  Body_measurements. By default, if nothing is specified, reports are
  created based on the content of STUDY_SEGMENT. Here is an example:
segment_column = "STUDY_SEGMENT"
or
segment_column = "REPORT_NAME"
- segment_select: if given, the reports are not created for all levels
  of segment_column, but only for the levels listed in this vector. A
  character vector or a regular expression (e.g., “.*_EXAM$”) can be
  provided. This argument cannot be used if no segment_column is
  provided. Here is an example:
segment_select = c("STUDY", "LAB")
or
segment_select = "STUDY | LAB"
- segment_exclude: if given, the reports are not created for all levels
  of segment_column, but only for the levels that are not listed in this
  vector. A character vector or a regular expression (e.g., “.*_EXAM$”)
  can be provided. It can only be specified if a segment_column is
  present. Here is an example:
segment_exclude = "^INT"
or
segment_exclude = c("STUDY", "INTERVIEW")
or
segment_exclude = "STUDY | LAB"
Note: There is only one exception in which both segment_select and
segment_exclude can be used without specifying segment_column: when
using levels of the column STUDY_SEGMENT.
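To get a feeling for the difference between a character vector and a regular expression in segment_select and segment_exclude, the matching can be tried out in base R first. The sketch below uses grepl() and setdiff() on a vector of segment names. The five names are assumed for illustration, and the code shows plain R regular expression behavior, not the internals of dq_report_by.

```r
# Assumed segment names, for illustration only
segments <- c("STUDY", "PHYS_EXAM", "LAB", "INTERVIEW", "QUESTIONNAIRE")

# Regular expression: names ending in "_EXAM"
exam_like <- segments[grepl(".*_EXAM$", segments)]

# Regular expression: names starting with "INT"
starts_int <- segments[grepl("^INT", segments)]

# Character vector: drop exactly these levels
remaining <- setdiff(segments, c("STUDY", "INTERVIEW"))

# Alternation without spaces matches "STUDY" and "LAB"
alt_no_spaces <- segments[grepl("STUDY|LAB", segments)]

# In plain R regex, the spaces around "|" are part of the pattern, so this
# looks for "STUDY " or " LAB" and matches none of the names above
alt_with_spaces <- segments[grepl("STUDY | LAB", segments)]
```

Whether dq_report_by itself trims the spaces around "|" is not verified here. If a pattern written with spaces does not select the expected segments, trying the variant without spaces is a reasonable first check.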
To define how the observations are separated into several reports (strata), there are the following arguments:
- strata_column: name of a study variable containing categories usable
  to separate the reports, e.g., the study centers. Both labels and
  VAR_NAMES are accepted. If the selected variable contains NAs, a
  separate report for the NA subset will be created. Here is an example:
strata_column = "SEX"
- strata_select: if given, the categories of strata_column usable to
  create reports are limited to the content of this vector. A character
  vector or a regular expression can be provided (e.g., “^a.*$”). This
  argument cannot be used if no strata_column is provided. Here is an
  example:
strata_select = "males | females"
- strata_exclude: if given, the reports are not created for all
  categories of strata_column, but only for the categories that are not
  listed in this vector. Here is an example:
strata_exclude = "1"
- selection_type: this optional argument specifies whether the content
  of strata_select or strata_exclude is an actual value (“value”), a
  value label (“v_label”), or a regular expression (“regex”). Here is an
  example:
strata_exclude = "females", selection_type = "v_label",
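The idea behind stratification, including the separate subset for missing values, can be sketched in base R with split(). The toy data, the coding 0 = females and 1 = males, and all object names are assumptions for illustration. The code does not reproduce how dq_report_by forms its strata internally.

```r
# Toy study data with a stratification variable containing NAs
sd <- data.frame(
  id  = 1:6,
  SEX = c(0, 1, NA, 1, 0, NA)
)

# Assumed value labels for SEX
sex_labels <- c("0" = "females", "1" = "males")

# Give the missing values their own stratum, then split the rows
stratum <- ifelse(is.na(sd$SEX), "NA", as.character(sd$SEX))
strata  <- split(sd, stratum)
names(strata)

# Selecting by label ("v_label") means translating the label to its value
keep_value <- names(sex_labels)[sex_labels == "females"]
females    <- strata[[keep_value]]
```

Selecting by “value” would use "0" directly, while “v_label” goes through the label lookup shown in the last two lines.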
It is possible to refine the selection of observational units, for example selecting only adults, or individuals taller than 170 cm, using the argument:
- subgroup: if given, it defines a subgroup of cases using REDCap
  rules. For example, to limit the reports to individuals older than 30:
subgroup = "[v_age] > 30"
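In base R terms, a rule such as "[v_age] > 30" corresponds roughly to a row filter on the study data. The sketch below shows that filter on toy data. The data frame and its column names are made up for illustration, and how dq_report_by treats missing values in the rule is not asserted here.

```r
# Toy study data; v_id and v_age are illustrative variable names
sd <- data.frame(
  v_id  = 1:5,
  v_age = c(25, 31, NA, 47, 30)
)

# which() drops rows where the comparison is NA, so only cases
# with a known age above 30 remain
subgroup_rows <- which(sd$v_age > 30)
sd_subgroup   <- sd[subgroup_rows, ]
sd_subgroup
```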
All previous arguments can be combined. For example, using the synthetic example data, you can create complete reports for two study segments (“INTERVIEW” and “LAB”), only for females older than 30. The code is as follows:
dq_report_by(
study_data = "study_data",
meta_data_v2 = "meta_data_v2",
output_dir = "~/Desktop/REPORT_by_synthetic_data_v2/",
also_print = TRUE,
dimensions = NULL,
segment_column = "STUDY_SEGMENT",
segment_select = c("INTERVIEW", "LAB"),
strata_column = "SEX_0",
strata_select = "females",
selection_type = "v_label",
subgroup = "[v00003] > 30")
Reports can be limited to a set of variables using the argument
resp_vars. The reports will be created using only those variables from
the study data. The function also retains all the referred variables
needed to assess the indicated variables. These referred variables
remain in the metadata, but they will not be part of the data quality
assessment. Here is an example of how to create a report using only the
six variables indicated.
dq_report_by(
resp_vars = c("CENTER_0", "PSEUDO_ID",
"SBP_0", "DBP_0",
"CRP_0", "BSG_0"),
study_data = "study_data",
meta_data_v2 = "meta_data_v2",
output_dir = "~/Desktop/REPORT_by_synthetic_data_v3/",
also_print=TRUE,
dimensions = NULL,
segment_column = "STUDY_SEGMENT")
If no dataframe_level metadata is available, it is possible to specify
which variables contain the information about the ID (useful in case of
multiple data frames that need to be combined). The argument to specify
the ID variables is id_vars. For example:
id_vars = "v_id"
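Why an ID variable matters when several data frames have to be combined can be seen in a small base R sketch with merge(). The two toy tables and their measurement columns are made up, and the full join (all = TRUE) is an assumption for illustration. The tutorial does not state which kind of join dq_report_by performs.

```r
# Two toy study data tables sharing the ID variable v_id
sd1 <- data.frame(v_id = c(1, 2, 3), SBP = c(120, 135, 128))
sd2 <- data.frame(v_id = c(2, 3, 4), CRP = c(1.2, 0.4, 3.1))

# Combine the tables on the shared ID; all = TRUE keeps cases that
# appear in only one table and fills the other columns with NA
combined <- merge(sd1, sd2, by = "v_id", all = TRUE)
combined
```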