
Definition

The degree of agreement between observed and expected distributions and associations.

Explanation

Indicators within the “Accuracy” subdimension predominantly target the detection of measurement error. “Consistency” checks make it possible to identify extraneous data values. However, a strong threat remains from values that are plausible but nevertheless wrong, and these are far more difficult to detect.

In most cases no gold standard or reference standard is available against which a clear-cut assessment would be possible. One approach is therefore to use knowledge about expected distributions or associations to detect data issues. Such issues may be a strongly skewed distribution where a symmetric distribution is expected, or large mean differences across examiners where no such differences should exist. From a theoretical perspective there is some resemblance to the concept of trueness according to ISO 3534-1, defined as “the closeness of agreement between the average value obtained from a large series of test results and an accepted reference value”. However, this does not extend to all indicators within this domain, foremost “Outliers”.
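As an illustration, a distributional check of this kind might look as follows. This is a minimal Python sketch, not the framework's implementation; the simulated data and the cut-offs for skewness and the test p-value are illustrative assumptions.

```python
# Minimal sketch of an expected-distribution check: flag a variable whose
# observed distribution deviates from an expected symmetric (here: normal)
# shape. Data and thresholds are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
observed = rng.gamma(shape=2.0, scale=15.0, size=500)  # skewed toy data

# Check 1: sample skewness, expected to be near 0 for a symmetric variable.
skewness = stats.skew(observed)

# Check 2: distribution-free ECDF comparison (Kolmogorov-Smirnov) against a
# normal distribution fitted to the observed mean and standard deviation.
ks_stat, p_value = stats.kstest(
    observed, "norm", args=(observed.mean(), observed.std(ddof=1))
)

if abs(skewness) > 1.0 or p_value < 0.01:  # illustrative cut-offs
    print(f"Possible accuracy issue: skewness={skewness:.2f}, KS p={p_value:.3g}")
```

Note that fitting the reference normal distribution to the observed mean and standard deviation makes the Kolmogorov–Smirnov p-value anti-conservative; in practice a corrected variant such as the Lilliefors test would be preferable.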

Where a gold standard is available, indicators within the “Accuracy” subdimension allow for comparisons against it. This enables inferences about data errors at the level of single data values.
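Where such a reference standard exists, a per-value comparison can be as simple as the following sketch; the column names, data, and acceptance tolerance are hypothetical.

```python
# Minimal sketch of a gold-standard comparison, assuming paired study and
# reference measurements are available for (a subset of) participants.
import pandas as pd

paired = pd.DataFrame({
    "sbp_study":     [118, 131, 142, 125, 160],  # study device, mmHg
    "sbp_reference": [120, 130, 139, 131, 154],  # gold standard, mmHg
})

error = paired["sbp_study"] - paired["sbp_reference"]
tolerance = 5  # mmHg; illustrative acceptance limit

print(f"Mean deviation (bias): {error.mean():.1f} mmHg")
print(f"Values outside tolerance: {(error.abs() > tolerance).mean():.0%}")
```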

Example

In a study in which a blood pressure examination is conducted by five examiners, study volunteers are allocated to the examiners approximately at random. The expected value of the mean should therefore be the same for each examiner, and the variance proportion attributable to the examiners should be close to zero. However, the computed intraclass correlation is 0.05, indicating relevant examiner effects.
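The variance proportion attributable to examiners can be estimated as a one-way random-effects intraclass correlation, ICC(1), computed from the ANOVA mean squares. The following Python sketch uses simulated blood pressure data; the sample sizes and variance components are illustrative assumptions, not the framework's code.

```python
# Minimal sketch: estimate the examiner-attributable variance proportion
# via the one-way random-effects intraclass correlation, ICC(1).
import numpy as np

rng = np.random.default_rng(seed=2)
n_examiners, n_per_examiner = 5, 50
examiner_shift = rng.normal(0.0, 3.0, size=n_examiners)  # examiner effects, mmHg
sbp = 125 + examiner_shift[:, None] + rng.normal(0, 12, size=(n_examiners, n_per_examiner))

grand_mean = sbp.mean()
group_means = sbp.mean(axis=1)

# One-way ANOVA mean squares: between examiners and within examiners.
ms_between = n_per_examiner * ((group_means - grand_mean) ** 2).sum() / (n_examiners - 1)
ms_within = ((sbp - group_means[:, None]) ** 2).sum() / (n_examiners * (n_per_examiner - 1))

icc1 = (ms_between - ms_within) / (ms_between + (n_per_examiner - 1) * ms_within)
print(f"ICC(1) = {icc1:.3f}")  # values well above 0 suggest examiner effects
```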

Guidance

While consistency-related checks are mainly suited to detecting single extraneous values, measurement error in plausible values is far more likely to pose a large threat because there is little possibility to detect and avoid it at the time of data entry. Sufficient sample sizes are needed to detect issues with a sufficient degree of certainty, which may hinder timely action.
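As a rough illustration of this sample-size requirement, the examiner comparison can be framed as a one-way ANOVA power calculation. The assumed effect size (Cohen's f = 0.10, a small effect) is an illustrative choice, not a recommendation.

```python
# Minimal sketch: total sample size needed to detect small mean differences
# across five examiners with 80% power at alpha = 0.05.
from statsmodels.stats.power import FTestAnovaPower

n_total = FTestAnovaPower().solve_power(
    effect_size=0.10,  # Cohen's f; illustrative assumption
    k_groups=5,
    alpha=0.05,
    power=0.80,
)
print(f"Total sample size needed: {n_total:.0f}")
```

Small examiner effects of this kind typically require samples in the four-digit range, which explains why such issues are often detectable only after substantial data collection.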

Examiner effects pose a considerable threat, and addressing them is likely to entail time-consuming training to reduce measurement error.

“Accuracy” checks should be conducted after “Consistency” checks to avoid double-counting of issues.

“Accuracy”-related checks can only be applied to study data. There is little point in addressing metadata, as metadata cannot be affected by measurement error.

Literature

  • Drion, E. F. 1952. “Some Distribution-Free Tests for the Difference Between Two Empirical Cumulative Distribution Functions.” The Annals of Mathematical Statistics 23 (4): 563–74.

  • Freedman, David, and Persi Diaconis. 1981. “On the Histogram as a Density Estimator: L2 Theory.” Probability Theory and Related Fields 57 (4): 453–76.

  • Grant, Robert. 2019. Data Visualization. Boca Raton: CRC Press.

  • Stausberg, Jürgen, Daniel Nasseh, and Michael Nonnemacher. 2015. “Measuring Data Quality: A Review of the Literature Between 2005 and 2013.” Studies in Health Technology and Informatics 210: 712–16.

  • Weiskopf, Nicole G., and Chunhua Weng. 2013. “Methods and Dimensions of Electronic Health Record Data Quality Assessment: Enabling Reuse for Clinical Research.” Journal of the American Medical Informatics Association 20 (1): 144–51.