Definition

The same data elements or data records appear multiple times.

Explanation

This indicator concerns violations of a uniqueness assumption for data elements or observational units. The uniqueness assumption is a necessary precondition for any further data quality implementation.

Example

For a given data set, it is known that each examined subject should appear only once. However, in the loaded data set, 10 out of 700 subjects are represented twice, leading to a data set with 710 rows. Further inspection of the data set leads to the following conclusion: the ten subjects were examined during two visits instead of one, leading to two rows in the data file instead of one row. As part of a subsequent data management process, the data for each of these subjects are merged into a single row, thereby restoring the required uniqueness.
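The example above can be sketched in a few lines of Python. This is a minimal illustration only: the field names (subject_id, sbp_visit1, sbp_visit2), the generated data, and the merge rule (the first non-missing value per field wins) are assumptions made for this sketch and are not prescribed by the indicator.

```python
from collections import Counter


def find_duplicate_ids(rows, id_key="subject_id"):
    """Map each identifier that occurs more than once to its number of rows."""
    counts = Counter(row[id_key] for row in rows)
    return {ident: n for ident, n in counts.items() if n > 1}


def merge_rows(rows, id_key="subject_id"):
    """Merge rows sharing an identifier; the first non-missing value per field wins."""
    merged = {}
    for row in rows:
        target = merged.setdefault(row[id_key], {})
        for key, value in row.items():
            if target.get(key) is None:
                target[key] = value
    return list(merged.values())


# Hypothetical data: 700 subjects, the first 10 of them recorded at a second visit.
rows = [{"subject_id": i, "sbp_visit1": 120, "sbp_visit2": None} for i in range(1, 701)]
rows += [{"subject_id": i, "sbp_visit1": None, "sbp_visit2": 125} for i in range(1, 11)]

duplicates = find_duplicate_ids(rows)
deduplicated = merge_rows(rows)
print(len(rows), len(duplicates), len(deduplicated))  # 710 10 700
```

Whether merging is the right remedy depends on why the duplicates arose. For exact copies of a record, dropping the surplus rows would be the more natural choice.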

Guidance

Violations of the uniqueness assumption are a severe data quality problem. First, information is incorrectly represented. Second, such violations may lead to erroneous estimates of other data quality measures. For example, measures of missing data may fail to capture the true amount of missing data if identical data structures are counted repeatedly.
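The effect on missing-data measures can be illustrated with a small, purely hypothetical data set in which one record with a missing value has been duplicated. The field names and the deduplication rule (keep the first row per identifier) are assumptions of this sketch.

```python
def missing_fraction(rows, field):
    """Fraction of rows in which the given field is missing (None)."""
    return sum(1 for row in rows if row[field] is None) / len(rows)


def keep_first_per_id(rows, id_key="subject_id"):
    """Drop surplus rows, keeping the first row seen for each identifier."""
    seen = set()
    unique = []
    for row in rows:
        if row[id_key] not in seen:
            seen.add(row[id_key])
            unique.append(row)
    return unique


# Hypothetical data: subject 2 has a missing weight and appears twice.
rows = [
    {"subject_id": 1, "weight": 70},
    {"subject_id": 2, "weight": None},
    {"subject_id": 3, "weight": 80},
    {"subject_id": 4, "weight": 75},
    {"subject_id": 2, "weight": None},
]

with_duplicates = missing_fraction(rows, "weight")
without_duplicates = missing_fraction(keep_first_per_id(rows), "weight")
print(with_duplicates, without_duplicates)  # 0.4 0.25
```

Here the duplicated record inflates the missing fraction from 0.25 to 0.4. Duplicates of complete records would bias the measure in the opposite direction.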

Any deficit related to duplicates should be remedied by appropriate data management processes to restore the required uniqueness. Afterwards, the data quality reporting processes should be restarted.

Issues with regard to duplicates normally become visible under DQI-1001 (unexpected data elements) or DQI-1002 (unexpected data records), because duplicates imply discrepancies between the expected and observed numbers of rows or columns in a data set. However, implementations that target duplicates are more specific with regard to the source of an error. In rare cases, duplicates may occur without a visible deviation on DQI-1001 or DQI-1002, if some other error leads to the exclusion of the same number of data elements or data records as there are duplicates.

Interpretation

The higher the number or percentage of duplicated occurrences, the lower the data quality.
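One possible way to obtain such a number and percentage is sketched below. The definition used here is an assumption of this sketch, not one given by the indicator: every occurrence of an identifier beyond its first counts as a duplicate, and the percentage is taken relative to all observed records.

```python
from collections import Counter


def duplicate_summary(ids):
    """Return (number, percentage) of surplus occurrences among the identifiers."""
    if not ids:
        return 0, 0.0
    counts = Counter(ids)
    surplus = sum(n - 1 for n in counts.values())
    return surplus, 100 * surplus / len(ids)


# Hypothetical identifiers: 700 subjects, 10 of them recorded twice (710 records).
ids = list(range(1, 701)) + list(range(1, 11))

number, percent = duplicate_summary(ids)
print(number, round(percent, 2))  # 10 1.41
```

Other denominators, such as the number of expected subjects, are equally defensible. Whichever definition is chosen should be reported together with the result.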

Implementations

Literature

Lee K, Weiskopf N, Pathak J. A framework for data quality assessment in clinical research datasets. AMIA Annu Symp Proc 2017;2017:1080-9.
Kahn MG, Callahan TJ, Barnard J, et al. A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. EGEMS (Wash DC). 2016;4(1):1244.
Nonnemacher M, Nasseh D, Stausberg J. Datenqualität in der medizinischen Forschung: Leitlinie zum Adaptiven Datenmanagement in Kohortenstudien und Registern. Berlin: TMF e.V.; 2014.
Stausberg J, Bauer U, Nasseh D, et al. Indicators of data quality: review and requirements from the perspective of networked medical research. MIBE 2019;15(1):1-8.

Indicator “Duplicates”