Achieving high data quality is essential for the valid study of diseases, risk factors, and their consequences. This entails the need for informative data quality indicators and for tools to assess and report data quality. Yet, despite many available works (e.g. Kahn et al. 2016, Weiskopf et al. 2013, Weiskopf et al. 2017, Nonnemacher et al. 2014), no standards have been established in our field of research. Existing data quality frameworks target registries and electronic health records (EHR) rather than data collected directly for research purposes.
A lack of common standards is partially due to the large heterogeneity of data structures and data collection processes (Keller et al. 2017). If data quality is understood as “the degree to which a set of inherent characteristics of data fulfills requirements” (ISO 8000), this heterogeneity is quite understandable: requirements and their operationalizations differ considerably within and across areas of research, studies, and bodies of data.
Against this background we developed a data quality framework (Schmidt et al. 2021) with related implementations (Richter et al. 2021) to facilitate standardized assessments of data quality. The core area of application is observational data collections in medical research, yet applications are not limited to this area.
We focus on intrinsic data quality, i.e. “data have quality in their own right”, as opposed to contextual data quality, “which highlights the requirement that data quality must be considered within the context of the task” (Wang and Strong 1996).
The former targets basic aspects such as (1) processable data, (2) complete data, and (3) error-free data; these are illustrated in the sketch below. Such requirements are common to virtually all substantive scientific research. In contrast, contextual data quality is largely situation specific, which makes a uniform approach harder to devise. Contextual examples are the availability of the variables relevant to a given research question or sufficient statistical power to conduct analyses.
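As a minimal illustration of these three intrinsic aspects, the following base-R sketch runs simple checks on a small, hypothetical study data set. It is not the framework's implementation; the data frame, variable names, and limits are invented for demonstration only.

```r
# Hypothetical study data: mixed types, a missing-value code, an outlier
study_data <- data.frame(
  id  = 1:5,
  age = c("34", "n/a", "51", "67", "180"),
  sex = c("f", "m", "f", NA, "m"),
  stringsAsFactors = FALSE
)

# (1) Processable data: can 'age' be interpreted as numeric?
age_num <- suppressWarnings(as.numeric(study_data$age))
sum(is.na(age_num) & !is.na(study_data$age))   # non-convertible values: 1 ("n/a")

# (2) Complete data: proportion of missing values per variable
colMeans(is.na(study_data))

# (3) Error-free data: admissible range for age (assumed here: 0-110 years)
sum(age_num < 0 | age_num > 110, na.rm = TRUE) # range violations: 1 (180)
```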
The revised TMF guideline for data quality (Nonnemacher et al. 2014, Stausberg et al. 2019) was used as an initial point of reference for this work because it targets aspects of primary data collections. An empirical evaluation of the indicators described by the TMF guideline was conducted by representatives of the participating cohorts (Schmidt et al. 2019). This evaluation served to identify indicators of particular relevance as well as potential areas of improvement. The resulting concept is described in the respective section.
An important feature of this work is that the data quality framework is not provided on its own but is accompanied by statistical implementations to facilitate and harmonize assessments. The focus is on R, but a Stata environment has also been created; both are described in the Software section.
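To indicate what a standardized assessment looks like in the R implementation, the following sketch uses the reporting function of the dataquieR package (Richter et al. 2021). The exact arguments and the layout of the metadata table are assumptions here and may differ between package versions; consult the package documentation before use.

```r
# Sketch of a standardized assessment with the R implementation (dataquieR).
# Exact arguments may vary by version; see the package documentation.
library(dataquieR)

# study_data: one row per observation.
# meta_data: one row per variable, carrying the requirements
#            (labels, admissible ranges, missing-value codes, ...).
report <- dq_report(study_data = study_data,
                    meta_data  = meta_data,
                    label_col  = "LABEL")
summary(report)
```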
The development of the concept and its implementations is still ongoing. The scope of the content is therefore expected to grow.