Literature about Data Quality

The list below contains selected literature references in the context of our work.

Paper
Struckmann et al. dataquieR 2: An updated R package for FAIR data quality assessments in observational studies and electronic health record data, 2024, Journal of Open Source Software, 9(98), 6581, https://doi.org/10.21105/joss.06581
Saleem et al. A review and empirical comparison of univariate outlier detection methods, 2021, Pakistan Journal of Statistics, 37(4)
Aguinis et al. Best-practice recommendations for defining, identifying, and handling outliers, 2013, Organizational Research Methods, 16(2), 270–301
American Association for Public Opinion Research Standard definitions: Final dispositions of case codes and outcome rates for surveys, 2011
Altman & Bland Assessing agreement between methods of measurement, 2017, Clin Chem, https://doi.org/10.1373/clinchem.2016.268870
Assenov et al. Comprehensive analysis of DNA methylation data with RnBeads, 2014, Nature Methods, 11(11), 1138
Bach The Freiburg visual acuity test–automatic measurement of visual acuity, 1996, Optom Vis Sci, 73(1), 49–53, https://www.ncbi.nlm.nih.gov/pubmed/8867682
Bamberg et al. Whole-body MR imaging in the German National Cohort: Rationale, design, and technical background, 2015, Radiology, 277(1), 206–220
Boehmke Data wrangling with R, 2016
Bakar et al. A comparative study for outlier detection techniques in data mining, 2006, 2006 IEEE Conference on Cybernetics and Intelligent Systems, 1–6
Bangia Dictionary of information technology, 2010
Bargaje Good documentation practice in clinical research, 2011, Perspectives in Clinical Research, 2(2), 59
Barnett & Lewis Outliers in statistical data, 1994
Begley & Ellis Drug development: Raise standards for preclinical cancer research, 2012, Nature, 483(7391), 531–533
Bennett How can I deal with missing data in my study?, 2001, Australian and New Zealand Journal of Public Health, 25(5), 464–469
Bretherton Reference model for metadata: A strawman, 1994, Whitepaper, University of Wisconsin, https://pdfs.semanticscholar.org/f941/4454ef0e25ef102831ed8c7a4b6e9c094b00.pdf
Brown & Forsythe Robust tests for the equality of variances, 1974, Journal of the American Statistical Association, 69(346), 364–367
Callahan et al. A comparison of data quality assessment checks in six data sharing networks, 2017, eGEMs (Generating Evidence & Methods to Improve Patient Outcomes), 5(1)
Chalmers & Glasziou Avoidable waste in the production and reporting of research evidence, 2009, Obstetrics & Gynecology, 114(6), 1341–1345
Chen et al. A review of data quality assessment methods for public health information systems, 2014, International Journal of Environmental Research and Public Health, 11(5), 5170–5207
Chang et al. Shiny: Web application framework for R, 2015, 2018, R Package Version, 1(0), 14
Callegaro et al. Web survey methodology, 2015
Cleveland et al. Regression by local fitting: Methods, properties, and computational algorithms, 1988, Journal of Econometrics, 37(1), 87–114
Cleveland & Devlin Locally weighted regression: An approach to regression analysis by local fitting, 1988, Journal of the American Statistical Association, 83(403), 596–610
Couchoud et al. Renal replacement therapy registries—time for a structured data quality evaluation programme, 2013, Nephrology Dialysis Transplantation, 28(9), 2215–2220
Das et al. A new method to evaluate the completeness of case ascertainment by a cancer registry, 2008, Cancer Causes & Control, 19(5), 515–525
Dasu & Johnson Exploratory data mining and data cleaning, 2003
Dong & Peng Principled missing data methods for researchers, 2013, SpringerPlus, 2(1), 222
Drion et al. Some distribution-free tests for the difference between two empirical cumulative distribution functions, 1952, The Annals of Mathematical Statistics, 23(4), 563–574
Durrleman & Simon Flexible regression models with cubic splines, 1989, Statistics in Medicine, 8(5), 551–561
Ebrahim & Davey Smith Commentary: Should we always deliberately be non-representative?, 2013, International Journal of Epidemiology, 42(4), 1022–1026
Edwards et al. Science friction: Data, metadata, and collaboration, 2011, Social Studies of Science, 41(5), 667–690
Fasano & Franceschini A multidimensional version of the Kolmogorov–Smirnov test, 1987, Monthly Notices of the Royal Astronomical Society, 225(1), 155–170
Feinstein & Cicchetti High agreement but low kappa: I. The problems of two paradoxes, 1990, Journal of Clinical Epidemiology, 43(6), 543–549
Filzmoser A multivariate outlier detection method, 2004
Finnie et al. EpiJSON: A unified data-format for epidemiology, 2016, Epidemics, 15, 20–26
Fletcher et al. Clinical epidemiology: The essentials, 2012
Freedman & Diaconis On the histogram as a density estimator: L2 theory, 1981, Probability Theory and Related Fields, 57(4), 453–476
Golub & Van Loan Matrix computations, 1996, Johns Hopkins University Press, Baltimore and London
Gonzalez-Chica et al. Test of association: Which one is the most appropriate for my study?, 2015, Anais Brasileiros de Dermatologia, 90(4), 523–528
Grant Data visualization: Charts, maps, and interactive graphics, 2018
Hahsler et al. Introduction to arules-a computational environment for mining association rules and frequent item sets, 2010, 2018
Hallgren Computing inter-rater reliability for observational data: An overview and tutorial, 2012, Tutorials in Quantitative Methods for Psychology, 8(1), 23
Hansen et al. Enabling longitudinal data comparison using DDI, 2011
Harrell Jr Regression modeling strategies: With applications to linear models, logistic and ordinal regression, and survival analysis, 2015
Harris et al. Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support, 2009, Journal of Biomedical Informatics, 42(2), 377–381
Hartge A dictionary of epidemiology, sixth edition, 2015, Am J Epidemiol, https://doi.org/10.1093/aje/kwv031
Hawkins Introduction, 1980, In Identification of outliers (pp. 1–12), https://doi.org/10.1007/978-94-015-3994-4_1
Hayat et al. Statistical methods used in the public health literature and implications for training of public health professionals, 2017, PloS One, 12(6), e0179032
Horton & Kleinman Using R and RStudio for data management, statistical analysis, and graphics, 2015
Hoyle et al. Metadata for the longitudinal data life cycle: The role and benefit of metadata management and reuse, 2010, DDI Working Paper Series: Longitudinal Data Best Practices, https://doi.org/10.3886/DDILongitudinal03
Hubert & Vandervieren An adjusted boxplot for skewed distributions, 2008, Computational Statistics & Data Analysis, 52(12), 5186–5201
Hu & Sung Detecting pattern-based outliers, 2003, Pattern Recognition Letters, 24(16), 3059–3068
Huebner et al. A contemporary conceptual framework for initial data analysis, 2018, Observational Studies, 4, 71–192, https://obsstudies.org/wp-content/uploads/2018/04/idarev2.pdf
Huser et al. Methods for examining data quality in healthcare integrated data repositories, 2017
Ioannidis Why most published research findings are false, 2005, PLoS Medicine, 2(8), e124
Ioannidis Discussion: Why an estimate of the science-wise false discovery rate and application to the top medical literature is false, 2013, Biostatistics, 15(1), 28–36
Ioannidis et al. Increasing value and reducing waste in research design, conduct, and analysis, 2014, The Lancet, 383(9912), 166–175
Jager & Leek An estimate of the science-wise false discovery rate and application to the top medical literature, 2013, Biostatistics, 15(1), 1–12
Jager & Leek Rejoinder: An estimate of the science-wise false discovery rate and application to the top medical literature, 2013, Biostatistics, 15(1), 39–45
Joshi et al. Likert scale: Explored and explained, 2015, British Journal of Applied Science & Technology, 7(4), 396
Jinyuan et al. Correlation and agreement: Overview and clarification of competing concepts and measures, 2016, Shanghai Archives of Psychiatry, 28(2), 115
Kahn et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data, 2016, eGEMs, 4(1)
Kalton The treatment of missing survey data, 1986, Survey Methodology, 12, 1–16
Kao & Green Analysis of variance: Is there a difference in means and what does it mean?, 2008, Journal of Surgical Research, 144(1), 158–170
Kahn et al. Quantifying clinical data quality using relative gold standards, 2010, AMIA Annual Symposium Proceedings, 2010, 356
Karr et al. Data quality: A statistical perspective, 2006, Statistical Methodology, 3(2), 137–173
Kalton & Kasprzyk The treatment of missing survey data, 1986, Survey Methodology, 12(1), 1–16
Keller et al. The evolution of data quality: Understanding the transdisciplinary origins of data quality concepts and approaches, 2017
Kleiber & Zeileis Visualizing count data regressions using rootograms, 2016, The American Statistician, 70(3), 296–303
Koo & Li A guideline of selecting and reporting intraclass correlation coefficients for reliability research, 2016, Journal of Chiropractic Medicine, 15(2), 155–163
Kullback & Leibler On information and sufficiency, 1951, The Annals of Mathematical Statistics, 22(1), 79–86
Kullback Information theory and statistics, 1997
Levene Robust tests for equality of variances, 1961, Contributions to Probability and Statistics. Essays in Honor of Harold Hotelling, 279–292
De Lusignan et al. Key concepts to assess the readiness of data for international research: Data quality, lineage and provenance, extraction and processing errors, traceability, and curation, 2011, Yearb Med Inform, 6(1), 112–120
Lang & Little Principled missing data treatments, 2016, Prevention Science, https://doi.org/10.1007/s11121-016-0644-5
Langeheine et al. Consequences of an extended recruitment on participation in the follow‐up of a child study: Results from the German IDEFICS cohort, 2017, Paediatric and Perinatal Epidemiology, 31(1), 76–86
Lee et al. A framework for data quality assessment in clinical research datasets, 2017, AMIA Annual Symposium Proceedings, 2017, 1080
Lehmann & Casella Theory of point estimation, 2006
Lenth et al. Least-squares means: The R package lsmeans, 2016, Journal of Statistical Software, 69(1), 1–33
Liaw et al. Towards an ontology for data quality in integrated chronic disease management: A realist review of the literature, 2013, International Journal of Medical Informatics, 82(1), 10–24
Lindsey Comparison of probability distributions, 1974, Journal of the Royal Statistical Society. Series B (Methodological), 38–47
Lindsey & Mersch Fitting and comparing probability distributions with log linear models, 1992, Computational Statistics & Data Analysis, 13(4), 373–384
Little & Rubin Statistical analysis with missing data, 2014
Mayr et al. A permutation test to analyse systematic bias and random measurement errors of medical devices via boosting location and scale models, 2017, Statistical Methods in Medical Research, 26(3), 1443–1460
Mahalanobis On the generalized distance in statistics, 1936
Seo A review and comparison of methods for detecting outliers in univariate data sets, 2006
McMahon & Denaxas A novel framework for assessing metadata quality in epidemiological and public health research settings, 2016, AMIA Summits on Translational Science Proceedings, 2016, 199
Meyer et al. Efficient data management in a large-scale epidemiology research project, 2012, Computer Methods and Programs in Biomedicine, 107(3), 425–435
Mitchell et al. Data management using Stata: A practical handbook, 2010
Morgenthaler A survey of robust statistics, 2007, Statistical Methods and Applications, 15(3), 271–293
Müller & Büttner A critical discussion of intraclass correlation coefficients, 1994, Statistics in Medicine, 13(23-24), 2465–2476
Nadkarni Metadata-driven software systems in biomedicine: Designing systems that can adapt to changing knowledge, 2011, https://doi.org/10.1007/978-0-85729-510-1
German National Cohort (GNC) Consortium The German National Cohort: Aims, study design and organization, 2014, European Journal of Epidemiology, 29, 371–382
Newsom Longitudinal structural equation modeling: A comprehensive introduction, 2015
Nohr & Olsen Commentary: Epidemiologists have debated representativeness for more than 40 years—has the time come to move on?, 2013, International Journal of Epidemiology, 42(4), 1016–1017
Nonnemacher et al. Datenqualität in der medizinischen Forschung, 2014
Potter et al. Web application teaching tools for statistics using R and Shiny, 2016, Technology Innovations in Statistics Education, 9(1)
Plantier et al. Biomedical engineering systems and technologies: 7th international joint conference, BIOSTEC 2014, Angers, France, 3-6, 2014, revised selected papers, 2016
Porta A dictionary of epidemiology, 2014
Press & Teukolsky Kolmogorov-Smirnov test for two-dimensional data: How to tell whether a set of (x, y) data points are consistent with a particular probability distribution, or with another data set, 1988, Computers in Physics, 2(4), 74–77
Prinz et al. Believe it or not: How much can we rely on published data on potential drug targets?, 2011, Nature Reviews Drug Discovery, 10(9), 712
Priyadarshana & Sofronov Multiple break-points detection in array CGH data via the cross-entropy method, 2014, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12(2), 487–498
Ranganathan et al. Common pitfalls in statistical analysis: Measures of agreement, 2017, Perspectives in Clinical Research, 8(4), 187
Rasmussen & Blank The data documentation initiative: A preservation standard for research, 2007, Archival Science, 7(1), 55–71
Rossini et al. Simple parallel statistical computing in R, 2007, Journal of Computational and Graphical Statistics, 16(2), 399–420
Reineke et al. Modys – ein modulares Steuerungs- und Dokumentationssystem für epidemiologische Studien, 2006, Medizinische Dokumentation – wichtig oder nichtig
A. Richter et al. Data quality monitoring in clinical and observational epidemiologic studies: The role of metadata and process information, 2019, GMS Med Inform Biom Epidemiol, 15(1), https://doi.org/10.3205/mibe000202
R. Rigby et al. Distributions for modelling location, scale, and shape: Using GAMLSS in R, 2017, www.gamlss.org (last accessed 5 March 2018)
R. A. Rigby & Stasinopoulos Generalized additive models for location, scale and shape, 2005, Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3), 507–554
Risch Searching for genetic determinants in the new millennium, 2000, Nature, 405(6788), 847
Rothman et al. Why representativeness should be avoided, 2013, International Journal of Epidemiology, 42(4), 1012–1014
Rothman et al. Modern epidemiology, 2008
Rothwell External validity of randomised controlled trials: “To whom do the results of this trial apply?” 2005, The Lancet, 365(9453), 82–93
R Core Team R: A language and environment for statistical computing, 2020, https://www.R-project.org/
Ryssevik The data documentation initiative (DDI) metadata specification, 2001, Ann Arbor, MI: Data Documentation Alliance. Retrieved from http://www.ddialliance.org/sites/default/files/ryssevik_0.pdf
Schafer & Graham Missing data: Our view of the state of the art, 2002, Psychol Methods, 7(2), 147–177, https://www.ncbi.nlm.nih.gov/pubmed/12090408
C. Schmidt et al. Square2-a web application for data monitoring in epidemiological and clinical studies, 2017, Studies in Health Technology and Informatics, 235, 549–553
C. O. Schmidt et al. Assessment of a data quality guideline by representatives of German epidemiologic cohort studies, 2019, MIBE, 15(1), https://doi.org/10.3205/mibe000203
Schmidberger et al. State-of-the-art in parallel computing with R, 2009, Journal of Statistical Software, 47(1)
Signorell et al. DescTools: Tools for descriptive statistics. R package version 0.99. 18, 2016, R Foundation for Statistical Computing, Vienna, Austria
Sison & Glaz Simultaneous confidence intervals and sample size determination for multinomial proportions, 1995, Journal of the American Statistical Association, 90(429), 366–369
Snijders & Bosker Multilevel analysis: An introduction to basic and advanced multilevel modeling, 1999
Stang & Jöckel Avoidance of representativeness in presence of effect modification, 2014, International Journal of Epidemiology, 43(2), 630–631
Stausberg et al. Indicators of data quality: Review and requirements from the perspective of networked medical research / Indikatoren zur Datenqualität: Stand und Anforderungen aus Sicht der vernetzten medizinischen Forschung, 2019, GMS Med Inform Biom Epidemiol, 15(1), https://doi.org/10.3205/mibe000199
Sterne & Smith Sifting the evidence—what’s wrong with significance tests?, 2001, Physical Therapy, 81(8), 1464–1469
Sturges The choice of a class interval, 1926, Journal of the American Statistical Association, 21(153), 65–66
Teppo et al. Data quality and quality control of a population-based cancer registry: Experience in Finland, 1994, Acta Oncologica, 33(4), 365–369
Thygesen & Ersbøll When the entire population is the sample: Strengths and limitations in register-based epidemiology, 2014, European Journal of Epidemiology, 29(8), 551–558
Tukey Exploratory data analysis, 1977
Van der Loo The stringdist package for approximate string matching, 2014, The R Journal, 6(1), 111–122
Vardaki et al. A statistical metadata model for clinical trials’ data management, 2009, Computer Methods and Programs in Biomedicine, 95(2), 129–145
Vardigan et al. Data documentation initiative: Toward a standard for the social sciences, 2008, International Journal of Digital Curation, 3(1), 107–113
Wager et al. Model selection for penalized spline smoothing using Akaike information criteria, 2007, Australian & New Zealand Journal of Statistics, 49(2), 173–190
Wang & Strong Beyond accuracy: What data quality means to data consumers, 1996, Journal of Management Information Systems, 12(4), 5–33
Watts et al. Data quality assessment in context: A cognitive perspective, 2009, Decision Support Systems, 48(1), 202–211
Nicole G. Weiskopf et al. A data quality assessment guideline for electronic health record data reuse, 2017, eGEMs (Generating Evidence & Methods to Improve Patient Outcomes), 5(1)
Nicole G. Weiskopf et al. Defining and measuring completeness of electronic health records for secondary use, 2013, Journal of Biomedical Informatics, 46(5), 830–836
Nicole Gray Weiskopf & Weng Methods and dimensions of electronic health record data quality assessment: Enabling reuse for clinical research, 2013, Journal of the American Medical Informatics Association, 20(1), 144–151
World Health Organization International statistical classification of diseases and related health problems, 2004
Wilson Toward releasing the metadata bottleneck, 2011, Library Resources & Technical Services, 51(1), 16–28
Wickham Advanced R, 2014
Wickham R packages: Organize, test, document, and share your code, 2015
De Leeuw et al. Prevention and treatment of item nonresponse, 2003, Journal of Official Statistics, 19, 153–176
Carsten Oliver Schmidt et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R, 2021, BMC Medical Research Methodology, 21(1), 1–15, https://doi.org/10.1186/s12874-021-01252-7
Völzke et al. Cohort profile: The Study of Health in Pomerania, 2010, International Journal of Epidemiology, 40(2), 294–307, https://doi.org/10.1093/ije/dyp394
Adrian Richter et al. dataquieR: Assessment of data quality in epidemiological research, 2021, Journal of Open Source Software, 6(61), 3093, https://doi.org/10.21105/joss.03093
The American Association for Public Opinion Research Standard definitions: Final dispositions of case codes and outcome rates for surveys, 2016
ISO ISO 8000-1:2022 data quality part 1: overview, 2022, https://www.iso.org/obp/ui/#iso:std:iso:8000:-1:ed-1:v1:en
Stanley Smith Stevens On the theory of scales of measurement, 1946, Science, 103, 677–680