---------------------------------------------------------------------------------------------------
help for dqrep (Carsten Oliver Schmidt 2018, last updated 2023/03, version 1.31)
---------------------------------------------------------------------------------------------------

Generate data quality reports

Syntax

    dqrep [varlist] [if] [in], [ targetfiles(strings) sd(strings) hd(strings) rd(strings) rdd(strings) srdd(strings) dqrd(strings) gd(strings) ld(strings) sdmd(strings) metadatafile(strings) dataquieR interpretationtextfile(strings) subgroupfolder(#) store not(varlist) lowercase(#) keyvars(varlist) minorvars(varlist) processvars(varlist) controlvars(varlist) observervars(varlist) devicevars(varlist) centervars(varlist) timevars(varlist) idvars(varlist) casemissvars(varlist) casemisstype(strings) casemisslogic(strings) reportname(strings) segmentname(varname) segmentselect(strings) segmentexclude(strings) varselect(varname) reporttitle(strings) reportsubtitle(strings) reportformat(strings) reporttemplate(strings) authors(strings) replacereport(int 0) maxvarlabellength(#) view_interpretation(#) view_integrity(#) view_dqi(#) view_changelog(#) histkat varlinebreak sectionlinebreak linenumberpagebreak clustercolorpalettes(strings) decimals(#) widthadd(#) heightadd(#) language(strings) subgroup(strings) forcecalc(#) breakreport(#) itemmisslist(numlist) itemjumplist(numlist) outcheck(#) outsens(#) outintegrate(#) binaryrecodelimit(#) metriclevels(#) minreportn(#) minvarnum(#) minclustersize_icc(#) minclustersize_lowess(#) minevent_lowess(#) problemvarreport(#) gradingfile(strings) benchmark(#) resultreport(#) nomod(#) nocompress ]

Description

dqrep stands for "Data Quality REPorter". This wrapper command triggers an analysis pipeline to generate data quality assessments.
Assessments range from simple descriptive variable overviews to full-scale data quality reports that cover missing data, extreme values, value distributions, observer and device effects, or the time course of measurements. Reports are provided as pdf or docx files, accompanied by a data set of assessment results. Reports are highly customizable and visualize the severity and number of data quality issues. In addition, there are options for benchmarking results between examinations and studies.

There are two essentially different approaches to run dqrep:

First, dqrep can be used to assess variables within the active dataset. While most functionalities are available, checks that depend on varying information at the variable level (e.g. range violations) cannot be performed. Any variable used in a certain role must be called in varlist.

Second, dqrep can be used to perform checks of variables across a number of datasets that are specified in the targetfiles option. In addition, a metadata file can be specified that holds information on variables and checks using the metadatafile option. This allows for a more flexible application on variables in distinct data sets, making use of all dqrep functionalities. Specifying any variables before the comma in [varlist] will disable approach #2.

Note: Both approaches cannot be combined. Analyses will either cover variables in the active dataset or variables stored in a set of n study data files. For more details on the conduct of dqrep read the sections: Requirements, Input, Instructions for the metadata file, Output, Notes, and Examples.

Options

Options specifically for the assessment of variables in the active dataset

not(varlist) Specifies a list of variables to be excluded from assessments.

[if] [in] Used to select cases of interest. For use with datasets that still need to be loaded, use the subgroup option instead.
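A minimal sketch of the two approaches, using the ancillary SHIP files referenced in the Examples section below (the variable stem *bp* is an illustrative assumption about the blood pressure variable names; exact file-name conventions follow the option descriptions):

    . * Approach 1: assess variables in the active dataset
    . use SHIP_study.dta, clear
    . dqrep *bp*

    . * Approach 2: assess variables across datasets, optionally with metadata
    . dqrep , targetfiles("SHIP_study") metadatafile("SHIP_metadata")

Specifying variables before the comma, as in approach 1, disables approach 2.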
Options related to study data files and result handling

Note that only a minimum input of folders is required (sd suffices). However, a wide range of folders is permitted and recommended in case of complex data quality assessment tasks.

targetfiles(strings) Contains the names of all data files to be analyzed (without .dta suffix). The file names must not contain blanks. At least one file must be specified to run dqrep if an analysis is intended on variables other than those in the active dataset. If more than one name is specified, the files are merged 1:1 based on one or multiple key variables (option idvars).

sd(strings) Source directory with the Stata data files to be analyzed. More than one file may be used, but all files must be in this directory. The modified data set is also stored here in a subfolder. If nothing is specified, the current working directory is used.

rd(strings) Result directory for any reports in pdf or docx and Stata dta files containing all results.

rdd(strings) Result data directory to store result files of data quality reports containing the merged study data set. If nothing is specified, a subfolder "_dqrep_dataresultfiles" is generated in the source directory "sd".

srdd(strings) Scalar result data directory to store numerical results of the data quality analyses. These result files may be used to generate result overview reports. If nothing is specified, a subfolder is generated in the result directory "rd".

dqrd(strings) Directory to store data quality result reports. If nothing is specified, the folder "rd" will be used.

gd(strings) Result directory for graphical output. Graphical output is stored in a separate folder because of the large number of related files. If nothing is specified, a subfolder "dq_graph" is generated in the result directory "rd". Within the "gd" folder two subfolders are created: one for thumb graphics of distributions and one for data quality report overviews.

ld(strings) Name of the directory for log files.
If nothing is specified, a subfolder for log files within the results folder "rd" is used.

subgroupfolder(int 0) Specifies how to store results related to subgroups. The default 0 leads to the use of the same folders across subgroups. Choosing subgroupfolder=1 will lead to separated storage of results for each targeted subgroup.

lowercase(int 1) Whether or not to change variable names to lowercase letters (0=no, 1=yes:default) to more easily ensure a match between the names in the study data and metadata files. The lowercase option will also convert strings in the metadata file to lower case to enable a match of variable names.

store Specify store to save all auxiliary output, such as graphs and result files, in addition to a log file and report. If store is not specified, all such output will be deleted.

Options related to metadata files and folders

sdmd(strings) Source directory of metadata files. If a file is provided via "metadatafile", the folder should be specified using this option. If nothing is specified, the source directory sd is used.

hd(strings) Name of the directory containing help files to enable report generation, such as language and indicator files. If nothing is specified, the metadata directory sd is used.

metadatafile(strings) Contains the name of the corresponding metadata file that provides additional information for improved data quality analyses. The expected format is Excel (xlsx). The provision of a metadata file is optional but strongly recommended; please see the examples.

gradingfile(strings) Specifies the name of the Excel xlsx file that contains the rules and output formats for data quality gradings. This file must be stored in the help file directory hd.

interpretationtextfile(strings) Contains the name of an Excel sheet that provides manual texts with flexible content to be included in the report. The formatting permits addressing result output parameters.
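A hedged sketch of a call with explicit folder options (the directory paths are hypothetical placeholders):

    . dqrep , targetfiles("SHIP_study") sd(C:\mystudy\data) rd(C:\mystudy\results) store

Here study data are read from the sd folder, all reports and log files go to the rd folder, and store keeps auxiliary output such as graphs and result files instead of deleting it.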
dataquieR If dataquieR is specified, dqrep assumes that metadata is delivered in the format of the R dataquieR package. Note that only those dataquieR columns will be used that have a correspondence in dqrep metadata.

Options related to variable selections and the role of variables in reports

keyvars(varlist) List of all primary variables for data quality assessments. For these, the most extensive computations take place. Commonly each variable of this type receives a dedicated output page with graphs. Not specifying keyvars leads to the use of all variables as key variables unless they have been assigned to another variable category.

minorvars(varlist) List of secondary variables for which a briefer scope of data quality assessments should take place. Commonly each variable of this type receives only a table overview.

processvars(varlist) Variables that are predominantly related to process aspects of the examination, such as examination times or ambient conditions. Typically, they play no role as outcome variables.

controlvars(varlist) Control variables used as covariates in regression analyses, e.g. related to the estimation of cluster effects. Other than that, they are treated like minorvars.

observervars(varlist) Cluster variables defined by observers.

devicevars(varlist) Cluster variables defined by devices.

centervars(varlist) Cluster variables defined by centers.

timevars(varlist) Time variables.

idvars(varlist) ID variables which may be used to merge datasets or to check for duplicates.

segmentname(varname) If the report is to be conceived as one segment among several reports, the name of the variable defining a segment needs to be specified. It will be used for the appropriate generation of result reports to enable a distinct bar for each segment. This is to be used in the absence of a segment name definition in a metadata file.
segmentselect(strings) If segments are defined in the metadata file, all segments to be included can be defined by segmentselect. All segments will be targeted with one report per segment.

segmentexclude(strings) If segments are defined in the metadata file, all segments to be excluded can be defined by segmentexclude. In this case all segments but the excluded ones will be targeted with one report per segment.

varselect(varname) Specifies a variable in a metadata file that defines which variables are to be included in a report. Specifying '1' means that the variable is to be included. Any other code is ignored. This option has been introduced to enable more flexible reporting using distinct lists.

casemissvars(varlist) List of variables that define unit/segment missingness. The variables need to follow a hierarchical order, with the first variable defining the first and the following variables defining subsequent selection processes.

casemisstype(strings) Optionally provides the definition of the missing variables casemissvars. There should be as many definitions as variables; definitions must be a single word. This information should be provided to ensure a clear meaning of the respective variable.

casemisslogic(strings) Specifies the logic to identify available observations. Any logic must be provided as a single term without blanks.

Report formatting

reportname(strings) Defines the name of the report to store results. This name should be short and concise without blanks. If it contains blanks, they will be replaced by "_".

reporttitle(strings) Defines the title of the report to be displayed in output documents.

reportsubtitle(strings) Defines the subtitle of the report to be displayed in output documents.

reportformat(strings) To select the format of the report, either pdf (default) or docx.

reporttemplate(strings) Defines the scope of the report. If nothing is specified, one of two default settings is chosen.
If a dataset is provided without any specification of metadata, dqrep assumes that a simple descriptive overview is of interest. If metadata is specified, a standard data quality report is requested. The default may be overridden by specifying one of the following options: "D" descriptive statistics only, "M" missingness only, "C" consistency only, "A" accuracy only, "Q" quality view in addition to other output, "tables" a full report with tables only, "standard" normal report scope with detailed information on key variables only, "extended" in addition to a standard report, detailed coverage of minor and process variables as well.

authors(strings) The authors of the report. These appear on the title page below the report name.

replacereport(int 0) Flag to replace an existing report: 0=no replacement, 1=always replace, 2=replace only pdf.

maxvarlabellength(int 40) The maximum length of a variable label. Longer variable labels are abbreviated while retaining meaningful content.

view_interpretation(int 1) Insert an empty interpretation section in the report. This will only be realized in docx files. (0=no, 1=yes:default)

view_integrity(int 1) Display information on the integrity of the variables regarding existence or variable type. (0=no, 1=yes:default)

view_dqi(int 1) Display data quality classifications if available. (0=no, 1=yes:default)

view_changelog(int 1) Display the change log for variable modifications. (0=no, 1=yes:default)

histkat(int 15) Number of categories up to which a display takes place as a bar chart. (Default is n=15.)

varlinebreak(int 1) Whether or not a page break occurs after each single variable table. (0=no, 1=yes:default)

sectionlinebreak(int 1) Whether or not a page break occurs after each summary table and report section. (0=no, 1=yes:default)

linenumberpagebreak(int 7) Number of rows required for a table to accept a page break afterwards.
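For instance, a missingness-only docx report could be requested as follows (a sketch; the report name is an arbitrary placeholder):

    . dqrep , targetfiles("SHIP_study") reporttemplate("M") reportformat("docx") reportname("ship_missingness") replacereport(1)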
(Default is n=7 rows in a table.)

clustercolorpalettes(strings) Specify a list of color palettes to be assigned to clusters. The first palette is assigned to the first cluster variable (e.g. examiners, devices), the second to the second, and so on. The current default palettes are "s1 economist s2 burd s1r s2 plottig". When specifying only one color, the intensity is graded according to the number of clusters. With m palettes and n cluster variables, if n>m, the palettes are reused for the remaining cluster variables.

decimals(#) Number of decimals to be displayed in output tables (default n=2).

heightadd(int 0) Add a constant to modify the height of graphs (default n=0). This may be of importance, for example, to enable page breaks at desired positions for the single variable output. Another use is the availability of larger graphs for use outside pdf or docx reports.

widthadd(int 0) Add a constant to modify the width of graphs (default n=0). This may be of importance, for example, to enable page breaks at desired positions for the single variable output. Another use is the availability of larger graphs for use outside pdf or docx reports.

language(strings) Report language (e=English; d=German).

breakreport(#) By setting breakreport to 1, the threshold for ending report conduct is lowered. Foremost, in case multiple reports are requested, any critical incident will end computations with breakreport=1, whereas per default, even in case of an erroneous report, dqrep continues to compute the remaining reports (default n=0).

Analysis settings

subgroup(strings) Defines the logic to make a subgroup selection. For example, if only persons younger than 30 years of age are to be selected: subgroup(age<30). Do not use blanks.

forcecalc(int 1) Force new calculations instead of taking existing results. 0=use available results and add only new ones to save computational time, 1=calculate everything new (default).

itemmisslist(numlist) A list of numerical values to be treated as missing values (value not encountered but expected).
The numerical lists may contain Stata missing codes (e.g. ".j" ".z"). A numerical list should contain at least one value, and there are several options to simplify input: 90/99 -> "90 91 92 93 94 95 96 97 98 99" / 900(10)990 -> "900 910 920 930 940 950 960 970 980 990" / 8(.5)10 90 91 96/99 -> "8 8.5 9 9.5 10 90 91 96 97 98 99"

itemjumplist(numlist) A list of numerical values to be treated as permitted jumps (value not encountered and not expected). These are codes indicating a non-existent value for some designed reason, for example not asking males about pregnancy. The same input rules apply as for itemmisslist.

outcheck(int 1) List of potential outlier checks to be performed: 1 Medcouple approach and Grubbs test (default); 2 all checks but Tukey; 3 all checks including Tukey; 10 only Medcouple approach (rule according to G. Bray et al. (2005)); 11 only standard deviation based (default 3*SD); 12 only Grubbs test (default CI level 95); 13 only adjusted Tukey (default p10-2*(p25-p5) / p90+2*(p95-p75)); 19 only Tukey

outsens(int 1) Adjustment of the sensitivity with which outlier checks are performed; default is 1, meaning no change. Increasing the value increases the threshold by the chosen factor (e.g. entering 1.5 will alter the default 3*SD margin to 4.5 SD).

outintegrate(int 1) Determines the way in which results from different checks are integrated; by default only the adjusted boxplot is used, or the single selected test. If nothing is selected, the default will match the chosen tests specified with outcheck. The following options exist: 1 adaptive approach with Medcouple and Grubbs (default): on the longer tail of the distribution both must indicate an outlier, while on the shorter tail only the Medcouple approach must be positive; 2 all selected checks must be positive; 10 adjusted boxplot with Medcouple must be positive; 11 standard deviation based must be positive; 12 Grubbs test must be positive; 13 adjusted Tukey must be positive; 19 Tukey must be positive.
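To illustrate, the following hypothetical call would restrict outlier detection to the Grubbs test and widen its threshold by the factor 1.5:

    . dqrep , targetfiles("SHIP_study") outcheck(12) outsens(1.5) outintegrate(12)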
Rather strict default settings are used because the interest is foremost in error outliers, not influential observations per se. The importance of these tests is greatly reduced if check ranges are specified within the metadata file. Particularly ranges for uncertain values mostly render outlier checks superfluous.

binaryrecodelimit(int 8) Number of categories up to which a recoding should take place. If set to 0, recoding is suppressed.

metriclevels(int 25) Number of categories after which a variable is classified as being of interval or ratio scale type. This is an adaptable but potentially error-prone heuristic if the respective metadata attribute scalelevel is not provided.

minreportn(#) Minimum case number to generate a report, default is N=30.

minvarnum(int 1) Minimum number of valid variables to generate a report, default is N=1. Higher numbers may be specified to avoid nuisance reports.

minclustersize_icc(#) Minimum cluster size to compute ICC values, default is n=10.

minclustersize_lowess(#) Minimum number of cases to compute Lowess graphs, default is n=40. Very low numbers may result in unstable results.

minevent_lowess(#) Minimum number of events for computations in Lowess for binary outcomes, default is n=2.

problemvarreport(#) Produce an additional report which contains an in-depth analysis of all variables assigned to an issue category n=# or higher. Default is 0, no creation of an additional report; 1 will create a separate report. This is useful in case of many variables, to obtain a shorter overview restricted to the problematic variables according to the defined thresholds.

resultreport(int 0) If a data quality report is to be generated from already existing dqrep result files, the option resultreport must be set to 1. If only a benchmark report is to be created based on a set of existing result files, the option resultreport must be set to 2.
In the latter case, regular data quality reports will not be generated, only the benchmark report. This is important to save computational time. In both cases, all relevant dqrep report files must be provided with the option targetfiles.

benchmark(int 0) In case multiple reports are generated, specifying the benchmark option creates an additional report to benchmark results across the different separate reports, primarily based on the applied data quality gradings. The number specified indicates the minimum problem level a variable must have to be displayed in the data quality overview table for better focus.

nomod(int 0) Controls modification of variables related to range violations and outliers. The default is all modifications permitted, to ensure that common data quality issues are only counted in one indicator. Modifications can be fine-tuned, but at the price of losing clarity in the assignment of indicators. Disabling modifications mainly makes sense when running selected functionalities in the pipeline only (see reporttemplate). The options are: 0 'all modifications permitted' (default); 1 'variable modifications only permitted for inadmissible values and uncertain values'; 2 'variable modifications only permitted for inadmissible values'; 3 'no variable modifications due to detected extreme values, uncertain values and range violations permitted'; -1 'variable modifications only permitted for extreme values'.

nocompress By default the working data set is compressed to better assess data types. However, a user may also wish to assess the data types as provided originally. In this case the option nocompress must be specified.

Requirements

At least Stata 15 is necessary to generate pdf or docx reports because dqrep makes use of putdocx or putpdf. dqrep outputs log files in earlier Stata versions. Tests have been conducted as of Stata 13. To run properly, dqrep must have sufficient rights to write files into the specified working directory.
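As a sketch of result-file reuse (the result file names are hypothetical and must match files produced by earlier dqrep runs):

    . dqrep , targetfiles("dqrep_result_exam1 dqrep_result_exam2") resultreport(2) benchmark(1)

This would skip the regular per-file reports and only build a benchmark report comparing the two sets of results, restricted to variables at problem level 1 or higher.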
Please install the ados linest, catplot, colorpalette, coefplot, grubbs, robstat, and moremata to ensure proper functioning of dqrep. If not installed, dqrep stops with a warning message. Note that for some ados, the most recent version must be installed and Stata should be restarted, e.g. with robstat and moremata, to ensure proper functioning.

Input

Multiple study data files can be specified. To merge different data files, multiple key variables may be named to ensure a 1:1 match using the idvars option. Variable names must be unique across data files, except for variables used for matching. As specified in the examples below, the full dqrep functionalities can only be used if a second file, a spreadsheet-type metadata file, is provided. Such metadata contains, for example, labels, missing value codes, and admissible values, and provides links to process variables such as examiners or devices. Only a single metadata file is expected. The structure is explained in the following section. Alternatively, metadata can be provided via the command call. This is only useful for information that does not vary across study variables and may result in misleading results otherwise. For further details, please see the examples below.

Instructions for the metadata file

All columns in the metadata file must adhere to defined formatting rules (e.g. defined column names and content formatting). Only columns with the default names as specified below will be read in; all others are ignored. Adhere to lower case column names. Metadata is expected in Excel xlsx format. The xlsx format was chosen to facilitate nontechnical editing of metadata content. The following information can be used by dqrep; the expected column name in the first row is displayed in bold.

var_name The variable name. It must be an exact match with the corresponding name in the study data files to match content. Variable names must be unique.
varlabel The label of the variable.

varshortlabel The short label of the variable. This can be provided to ensure more easily readable graphs.

value_label A value list may be provided. It works either if it adheres to Stata syntax 0 "no" 1 "yes" 2 "don't know" or if it follows this format: 0=no | 1=yes | 2=don't know

data_type Used to specify the data type of the variable. The data type is used for integrity checks of the data. A basic classification, consisting of four data types, is used: - string: for character or string variables - integer: for variables containing integer values only - float: for variables with decimals - datetime: for any variable in a date/time format. Entries other than these will be ignored and should, if needed, be mapped before using dqrep.

scalelevel Assigns the appropriate scale level to a variable. The accepted values are 'nominal', 'ordinal', 'interval', and 'ratio'. For explanations see the respective literature. This attribute controls the conduct of certain statistics such as range checks, outliers, and data management procedures.

missinglist A list of values (numeric or Stata missing codes) that indicate a missing value. (Example data field entry in column 'missinglist': 998 999 .x .z)

jumplist A list of values (numeric or Stata missing codes) that indicate a jump in the data. These are data fields where a data value was not expected by design. (Example data field entry in column 'jumplist': 777 888 .j)

refcat Used to recode categorical variables to binary. refcat contains a numlist, optionally also Stata missing codes, to define the reference category (=0). (Example data field entry in column 'refcat': 0 1 2)

eventcat Used to recode categorical variables to binary. eventcat contains a numlist, optionally also Stata missing codes, to define the event category (=1). (Example data field entry in column 'eventcat': 3 4 5 6 7)

limit_hard_low The lower bound of an inadmissibility limit.
(Example data field entry in column 'limit_hard_low': >0)

limit_hard_up The upper bound of an inadmissibility limit. (Example data field entry in column 'limit_hard_up': <300)

limit_soft_low The lower bound of an uncertainty limit. (Example data field entry in column 'limit_soft_low': >=50)

limit_soft_up The upper bound of an uncertainty limit. (Example data field entry in column 'limit_soft_up': <=150)

key_observer Links the variable name of some examiner, observer, or reader to the variable var_name to assess observer effects.

key_device Links the variable of some device to the variable var_name to assess device effects.

key_datetime Links a date-time variable to the variable var_name to assess time trends.

variablerole Assigns a role to the variable var_name for the report. The strings used correspond to the options keyvars, minorvars, processvars, controlvars, observervars, devicevars, centervars, timevars, idvars; please see the explanations above.

var_order Specifies the numerical order of variables in the report; integer input is expected. The order of variables is sorted accordingly.

sourcefilename The name of the Stata dta file in which some variable var_name is contained. This is useful for reports for which various data files need to be merged. If a metadata file is provided but no source data file is specified with the option targetfiles, dqrep assumes that a column sourcefilename is available. If not, program execution stops.

In addition, the metadata file may contain attributes with freely chosen names that are specified via a command option:

segmentname The variable naming the segment to which a variable belongs may be specified. The segment may be some unit of a study, like an examination. This column must be specified to generate multiple reports at once.

varselect Specifies a variable in a metadata file that defines which variables are to be included in a report. Specifying '1' means that the variable is to be included. Any other code is ignored.
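Putting the columns above together, the first rows of a metadata file might look as follows (an illustrative sketch; the column names are as defined above, while the variable names, codes, and limits are hypothetical):

    var_name  varlabel                  missinglist  limit_hard_low  limit_hard_up  key_observer  variablerole
    sbp       Systolic blood pressure   99900 99901  >0              <300           observer_id   keyvars
    dbp       Diastolic blood pressure  99900 99901  >0              <200           observer_id   keyvars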
A metadata file and command syntax specifications may be combined. If done so, command option settings override information in the metadata file. An example for a metadata file is the ancillary metadata file SHIP_metadata.xlsx.

Output

The main output formats are pdf and docx files as well as machine-readable result summaries to facilitate post-processing. If requested via the store option, additional output is stored as follows:

The program creates a number of default results and folders as follows; however, default options may be overridden: A results folder is created in the requested rd folder. It contains the report file in the requested format (current default pdf, alternatively docx) and a log file. A subfolder named "_dataresultfiles" contains all obtained results in xlsx and dta format, including distinct overviews on the generated data quality output. A subfolder "graph" contains all graphs (png).

Notes

Because ancillary files may be loaded and checked before the actual study data are checked, the common approach of loading the single file of interest before calling dqrep only works with limitations. Rather, all files of interest should be specified in the program call.

Because dqrep will write files into the working directory, make sure that it is appropriately chosen. If only a source directory is specified, any results will be written to that folder. If no directory is specified, dqrep will use the current working directory and create subfolders therein.

dqrep aims to create usable results even with deficient data sets and with a priori unknown variable properties. This may encompass the need to modify variables in order to ensure a stable performance of the analysis pipeline. For the sake of transparency, such modifications are either reported in the integrity output or in a change log. However, default settings can mostly be overridden via dedicated command options.
The most important potential data modifications are explained below:

To reduce chances of mismatches between the variable names in the study data files and their counterparts in the metadata file, by default all variable names are changed to lowercase letters. This has proven useful because imports from non-case-sensitive data sources may cause problems. This behavior can be disabled using the lowercase option.

dqrep tries to ensure that data quality issues are only counted once. Therefore, when identifying, for example, a range violation, this value is changed to missing by default to avoid it being an issue in subsequent checks like extreme values. However, modifications can be suppressed using the nomod option.

Should the dataset contain a temporary Stata variable name such as '__000...', the variable is deleted with a log error message in the output and no further effects on the analysis pipeline.

dqrep works with and contrasts results based on the provided original variables and their modified counterparts. For example, while an original variable remains as is, the modified counterpart may contain values with range violations or extreme values set to missing to ensure proper performance in later stages of the pipeline. So each variable is doubled in the working data set. Currently, modified variables are stored in a second variable named "m"[varname]. To avoid rather rare problems with data sets which contain variables that only differ by an 'm', a '_' is added as a suffix to the variable name.

There are automated checks for duplicate entries. Some categories are "weaker" than others to avoid conflicts if a variable is mentioned in more than one category. This leads to a deletion of identical variables, which have been assigned to more than one category, from the weaker category. The hierarchy in case of double mentioning of a variable is: 1. idvars 2. casemissvars 3. segmentmissvars 4. timevars 5.
observervars 6. devicevars 7. centervars 8. processvars 9. keyvars 10. controlvars 11. minorvars

Example: The variable "observer" has been included in the variable lists observervars and keyvars. Because keyvars is ranked lower than observervars, the variable is deleted from the keyvars list. Note that nevertheless full output can be requested through the reporttemplate option.

Producing useful output for time trends and group comparisons of categorical variables as part of a data quality assessment poses problems for several reasons. First, graphical readability. Second, excessive output quantity, especially when making assessments across clusters (e.g. with many examiners and devices). Third, computational stability and interpretability with sparse cell counts. Therefore, dqrep has implemented the following simplification as default: Categorical variables with more than two levels are collapsed to two levels. If nothing else is specified, the default algorithm is to use the most frequent category as reference and all other categories as the event group. Information on the conducted changes forms part of the output. In applied settings for data quality assessments, this was a workable simplification. However, this may not be true in all cases. If for categorical variables all or selected single categories are of relevance, they currently must be provided in a dummy-coded form.

If a variable of string type is encountered, dqrep assumes that it has been mistakenly provided and tries to transform it to a numeric variable. Related warnings appear in the integrity section of a report. If successful, dqrep proceeds with the converted variable; if not, the variable is omitted from further reporting. If a supposed date-time variable, as provided with the option timevars, is not of such a format, dqrep tries to convert it. If not successful, dqrep terminates.
If a normal data quality report is requested but no date-time variable is specified, dqrep assumes that the order of data records can be used as a substitute for the temporal order; an artificial order variable is created, along with a related warning in the integrity section.

Limitations

There are many options to tailor the use of dqrep. Nevertheless, it has predominantly been designed with standard sequences of data quality checks in mind that have proven important in the context of observational health studies. dqrep may not be suitable for highly specific assessment tasks due to the restricted range of checks. Extensions may be requested. dqrep makes use of more than 60 ados and tries as much as possible to deliver meaningful results with deficient data. Yet, due to the many dependencies within the package, uncontrolled program aborts or issues related to the presentation of results may occasionally occur. It is combinatorially impossible to test all combinations of dqrep options and data constellations that may affect report conduct and output. Reporting issues back would therefore be of interest to further improve the robustness and ergonomics of the analysis pipeline. dqrep currently does not assess string variables (see the Notes section). Stata may crash when handling reports with many tables; therefore, the number of single variables should not exceed about 100 in one report if a single-variable view is requested. After prolonged use of dqrep, Stata occasionally produces cryptic error messages; restarting Stata commonly solves the problem.

Examples

A sample dataset with anonymized data from the Study of Health in Pomerania (SHIP) is used for illustrative purposes. It is provided as ancillary data from the same site as dqrep. The working directory should be specified as follows:

    . cd [your working directory containing SHIP_study.dta and SHIP_metadata.xlsx]

Example 1a.
Creating a descriptive overview of variable properties in blood pressure variables of an active data set (e.g. if SHIP_study.dta is active)

    . use SHIP_study.dta, clear
    . dqrep *bp*

Example 1b. Creating a descriptive overview of all variables in the data set SHIP_study.dta. Note that a default report name is chosen because reportname has not been specified, and the pdf/docx file from the previous example 1a may be overwritten.

    . dqrep , targetfiles("SHIP_study")

This creates a single pdf table overview with descriptive variable properties and an additional section on integrity warnings in a default subfolder "DQ-resultfolder". Single aspects can be changed, for example suppressing the integrity output by adding view_integrity(0). The output also shows that variables like blood pressure do not seem to be presented well. That is because missing value codes are not recognized.

Example 2. Creating a descriptive overview of variable properties in a data set with specifications of missing values

    . dqrep , rd(Example2) targetfiles("SHIP_study") itemmisslist(99900 99901 99902 99914) itemjumplist(99800 99801 99802)

The basic output is comparable, except that the distributions now look better because missing values are properly recognized. dqrep distinguishes two classes of reasons why values are missing: codes that indicate values that should have been collected but are not present are listed in itemmisslist; codes that indicate values that were not intended to be collected or available in the data set are listed in itemjumplist. This distinction is essential for appropriate missing value analyses. A dedicated results directory has also been specified via rd.

Example 3.
Creating a full data quality report with metadata provided via command syntax

    . dqrep, rd(Example3) targetfiles("SHIP_study") ///
          itemmisslist(99900 99901 99902 99914) itemjumplist(99800 99801 99802) ///
          reportname("SHIP-Samplereport") reporttitle("SHIP-0 Data quality report") ///
          reportsubtitle("Report using anonymized SHIP-0 sample data") ///
          reportformat("docx") keyvars("sbp1 sbp2 dbp1 dbp2") ///
          minorvars(cholesterol stroke diab_known waist weight contraception) ///
          observervars(obs_bp) devicevars(dev_bp) controlvars(age sex) idvars(id) timevars("exdate") store

This command provides metadata information via dqrep options. These specify missing value codes (itemmisslist), permitted jump codes (itemjumplist), report names, titles, and subtitles. A docx file is requested. This report emphasizes blood pressure related variables (keyvars) with a single report page per variable. Only a table output is requested for a range of other variables (minorvars). Furthermore, examiner (observervars) and device (devicevars) variables are identified, as well as variables to control for when checking observer or device effects (controlvars). The id variable is identified (idvars). Finally, the storage of all auxiliary files is requested (store).

Example 4. Creating a full data quality report using an xlsx spreadsheet to provide metadata

    . dqrep, rd(Example4) metadatafile("SHIP_metadata.xlsx") store

While being much shorter, this command is very powerful because the spreadsheet allows for a detailed provision of metadata. The main difference from the former example is that all attributes may vary at the level of individual variables. For example, distinct process variables can be assigned to any variable in the report. In addition, checks based on variable-specific admissibility ranges become possible.

Example 5.
Creating multiple data quality reports with a benchmarking of graded data quality issues across reports

    . dqrep, rd(Example5) metadatafile("SHIP_metadata.xlsx") ///
          reporttitle("SHIP-0 Data quality report") ///
          segmentselect(INTERVIEW LABORATORY SOMATOMETRY) segmentname(segments) benchmark(3)

In addition to the previous command, the segmentname option provides the name of a variable that, in this case, contains information on different areas of the examination, so-called segments. A distinct report is now created for each segment. Specific segments may be selected for inclusion using the segmentselect option. To obtain a comparison of the obtained gradings, the benchmark option is specified in addition. In the current example, this leads to three separate data quality reports with a brief benchmarking overview of the assigned quality grades. Using the segmentname option is also a strategy to handle studies with too many variables for a single report, by assigning different 'artificial' segments. Note that in reports without defined key variables, no detailed section for single variables appears.

Example 6. Creating multiple data quality reports with some benchmarking and a detailed overview for all variables

    . dqrep, rd(Example6) metadatafile("SHIP_metadata.xlsx") ///
          reporttitle("SHIP-0 Data quality report") ///
          segmentselect(INTERVIEW LABORATORY SOMATOMETRY) segmentname(segments) ///
          reporttemplate("extended") benchmark(1)

Everything is comparable to the previous report, except that all reports now contain a single-variable section because the reporttemplate "extended" has been requested.

Authors

Prof. Dr. Carsten Oliver Schmidt, SHIP-KEF, Quality in the Health Sciences, University Medicine Greifswald, Germany. Email carsten.schmidt@uni-greifswald.de for comments and problems.
Acknowledgements

This work was supported by the German Research Foundation (DFG: SCHM 2744/3-1, NFDI 13/1, SCHM 2744/9-1) and by the European Union's Horizon 2020 research and innovation programme under grant agreement No 825903 (euCanSHare project). Feedback and technical support from team members were essential, e.g. from Dr. Birgit Schauer, Dr. Janka Schössow, and Dr. Stephan Struckmann.

Bibliography and Sources

Schmidt CO, Struckmann S, Enzenbach C, et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Med Res Methodol 2021; 21.

Richter A, Schössow J, Werner A, et al. Data quality monitoring in clinical and observational epidemiologic studies: the role of metadata and process information. MIBE 2019; 15.