Software


R

dataquieR is an R package designed to conduct extensive, automated, and standardized data quality assessments based on the dimensions defined in the data quality framework by Schmidt et al. (2021). It can be applied to many kinds of tabular data, e.g., data from population-based cohort studies, registries, or electronic health records (EHR).

Extensive spreadsheet-type metadata, organized in several tables, can be used to specify descriptions, expectations, and requirements about the data in a standardized, machine-readable way. More detailed information is provided in the metadata annotation tutorial.

All existing implementations in dataquieR, including links to their respective documentation, are listed below. Additional examples, alternative implementations, and guidelines for contributing code are available as tutorials.

Indicator functions

List of all Functions

These are functions from dataquieR that can be used to trigger single data quality checks. They are recommended for specific, targeted applications; for standard reports, the dq_report2 function is usually easier to use.
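As a rough illustration of the standard-report route, the following sketch computes a report with dq_report2 on the example data shipped with the package. It assumes that dataquieR is installed from CRAN and that the names "study_data" and "meta_data_v2" resolve to the bundled example study data and metadata workbook in the installed version; the argument choices shown here are illustrative, not a recommended configuration.

```r
# Minimal sketch, not a definitive workflow.
# Assumption: dataquieR is installed and the bundled example resources
# "study_data" and "meta_data_v2" are available under these names.
library(dataquieR)

# Load the metadata tables from the example workbook into dataquieR's
# internal data frame cache.
prep_load_workbook_like_file("meta_data_v2")

# Compute a report on the example study data.
# dimensions = NULL requests all data quality dimensions;
# label_col selects the metadata column used to label variables.
report <- dq_report2("study_data", dimensions = NULL, label_col = "LABEL")

# Printing the report object is expected to render it for viewing.
print(report)
```

For real projects, the example names would be replaced by the study data and a metadata workbook prepared as described in the metadata annotation tutorial.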

Mapping the Concept to Functions

All of dataquieR’s functions are linked to the underlying data quality concept as described in the table below.

Support functions

The indicator functions are aided by 413 support functions. The main task of these functions is to ensure stable operation of dataquieR when the data are potentially deficient, which requires extensive data preprocessing steps.


Stata

In Stata, the package dqrep can be used for data quality analyses. It can be installed using the following command syntax:

net from https://packages.qihs.uni-greifswald.de/repository/stata/dqrep
net install dqrep, replace
net get dqrep, replace

Note: In the rare case of issues when installing dqrep from the repository above, please contact us.

Description

dqrep stands for “Data Quality REPorter”. This wrapper command triggers an analysis pipeline to generate data quality assessments. Assessments range from simple descriptive variable overviews to full-scale data quality reports that cover missing data, extreme values, value distributions, observer and device effects, or the time course of measurements. Reports are provided as .pdf or .docx files, accompanied by a dataset of assessment results. Reports are highly customizable and visualize the severity and number of data quality issues. In addition, there are options for benchmarking results between examinations and studies.

There are two fundamentally different ways to run dqrep:

First, dqrep can be used to assess variables of the active dataset. While most functionalities are available, checks that depend on information that varies at the variable level (e.g. range violations) cannot be performed. Any variable used in a certain role (e.g. observervars, keyvars) must also be included in varlist.

Second, dqrep can be used to perform checks of variables across a number of datasets that are specified in the targetfiles option. In addition, the metadatafile option can be used to specify a metadata file that holds information on variables and checks. This allows for a more flexible application to variables in distinct datasets, making use of all implemented dqrep functionalities.

For more details on how to use dqrep, see this help file.


Schmidt, C.O., Struckmann, S., Enzenbach, C., Reineke, A., Stausberg, J., Damerow, S., Huebner, M., Schmidt, B., Sauerbrei, W., and Richter, A. (2021). Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Medical Research Methodology 21, 1–15.