How to import the data needed to generate data quality reports with dataquieR

Automated data quality assessments with the R package dataquieR rely on two inputs: the collected data (study data), and information and requirements about the study data (metadata). This tutorial explains how to import study data and metadata to generate data quality reports with dataquieR.

The functions in dataquieR can use two types of metadata:

  • Item level metadata: a csv file or a single spreadsheet in an Excel file containing information and expectations about single variables. It is identified by the function argument meta_data =

  • Multiple levels metadata: an Excel workbook with multiple spreadsheets containing metadata organized in several tables named following dataquieR conventions: “item_level”, “cross-item_level”, “segment_level”, “dataframe_level”. This type of metadata is identified by the function argument meta_data_v2 =

See the metadata tutorial for more information.

You will need the dataquieR and rio packages installed for this tutorial. Here is the code to install both, if needed.

install.packages("dataquieR")
install.packages("rio")



First case: import data to create a report. dq_report2 does it all

The report generation function dq_report2() does not require the study data and metadata to be loaded into R beforehand. The function imports files directly from a path, a URL, or, for the sample data shipped with the package, just the file name. An example for each case follows.

1. Local files (use their path)

Both study data (study_data1.csv) and metadata (metadata1.xlsx) are locally available in the “Downloads” folder.

dq_report2(study_data = "~/Downloads/study_data1.csv", 
           meta_data_v2 = "~/Downloads/metadata1.xlsx")

2. Data available online (use the URL)

Data available online can be imported directly by passing the URL.

dq_report2(study_data = "https://.../study_data1.csv", 
           meta_data_v2 = "https://.../metadata1.xlsx")

3. Study sample data from dataquieR (use only the file name)

For the two sets of sample study data and metadata shipped with the dataquieR package, and only for these, you can specify the name of the file without the path or the extension.

dq_report2(study_data = "ship", meta_data_v2 = "ship_meta_v2") # for the SHIP-based example data
dq_report2(study_data = "study_data", meta_data_v2 = "meta_data_v2") # for the synthetic example data

4. Data previously imported as objects in R

If the data are already available in the R Global Environment, you can pass the object names to the function. In the following example, the study data is the object sd1 and the item level metadata is the object md1.

dq_report2(study_data = sd1, 
           meta_data = md1) 
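
Before passing objects to dq_report2(), it can help to check that both are data frames and that the variable names match. The following is a minimal sketch with made-up toy objects standing in for sd1 and md1. It assumes that the item level metadata stores the variable names in a column called VAR_NAMES, following dataquieR conventions; the values and variable names are purely illustrative.

```r
# Toy objects standing in for sd1 and md1 (hypothetical values)
sd1 <- data.frame(v00001 = c(34L, 51L),
                  v00002 = c("f", "m"),
                  v00099 = c(1, 0))
md1 <- data.frame(VAR_NAMES = c("v00001", "v00002"), # assumed column name
                  LABEL     = c("AGE", "SEX"),
                  DATA_TYPE = c("integer", "string"))

# Both inputs should be data frames
stopifnot(is.data.frame(sd1), is.data.frame(md1))

# Study data columns without a row in the item level metadata
undocumented <- setdiff(names(sd1), md1$VAR_NAMES)
# Metadata rows describing variables that are absent from the study data
unused <- setdiff(md1$VAR_NAMES, names(sd1))

print(undocumented)
print(unused)
```

In this toy example the study data contain one column that the metadata do not describe, which is the kind of mismatch worth resolving before generating a report.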



Second case: import data to apply dataquieR functions

To apply single dataquieR functions or to inspect the data, it is necessary to import the data first. The dataquieR functions for importing data are:

  • prep_get_data_frame

  • prep_load_workbook_like_file

  • prep_load_folder_with_metadata


I. Import single data files

Study data and item level metadata can each consist of a single spreadsheet in an Excel file or of a csv file. You can import them with the dataquieR function prep_get_data_frame in the following ways, depending on their location.

1. Import local data files

Study data (study_data1.csv) or item level metadata (item_md1.xlsx) are available in a local directory, for example in “Documents” inside a folder named “data”.

library(dataquieR)
sd1 <- prep_get_data_frame("~/Documents/data/study_data1.csv") 
md1 <- prep_get_data_frame("~/Documents/data/item_md1.xlsx") #There is only one spreadsheet in this Excel file
# This path is just an example, replace it with the correct path to the file you want to import
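
To try the import without touching your own files, the sketch below writes a small made-up data set to a temporary csv file and reads it back with prep_get_data_frame. The toy values and the temporary location are assumptions for illustration only, and the dataquieR part runs only if the package is installed.

```r
# Write a small, made-up data set to a temporary csv file
csv_path <- file.path(tempdir(), "study_data1.csv")
toy <- data.frame(id = 1:3, age = c(34, 51, 47))
write.csv(toy, csv_path, row.names = FALSE)

if (requireNamespace("dataquieR", quietly = TRUE)) {
  # Import the file by its path, as in the example above
  sd1 <- dataquieR::prep_get_data_frame(csv_path)
  # Inspect column names and types before applying other functions
  str(sd1)
} else {
  message("dataquieR is not installed; only the csv file was written")
}
```

Inspecting the result with str() is a quick way to confirm that the columns were read as expected.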

2. Import data files from a URL

Study data (study_data1.csv) or item level metadata (item_md1.xlsx) can also be imported from a URL with the prep_get_data_frame function.

sd1 <- prep_get_data_frame("https://.../study_data1.csv")
md1 <- prep_get_data_frame("https://.../item_md1.xlsx") #There is only one spreadsheet in this Excel file

3. Import dataquieR study data or item level metadata examples

To import an example of study data (“study_data” or “ship”) or metadata (“meta_data” or “ship_meta”) shipped with the dataquieR package, you have to write the example file name without the path or extension.

sd1 <- prep_get_data_frame("study_data") # for the synthetic example data
sd2 <- prep_get_data_frame("ship") # for the SHIP-based example data

md1 <- prep_get_data_frame("meta_data") # for the synthetic example item level metadata
md2 <- prep_get_data_frame("ship_meta") # for the SHIP-based example item level metadata

Old option to import data shipped with the package

As an alternative, you can first retrieve the path of a dataquieR study data or metadata example and then pass it to prep_get_data_frame.

file_name <- system.file("extdata", "study_data.xlsx", package = "dataquieR")
sd1 <- prep_get_data_frame(file_name)

4. Other options to import study data

Any function in dataquieR can also be used with data already imported by other R packages, for example rio or readxl. Note: for .dta files we suggest using the function import from the rio package.

library(rio)
sd1 <- import("~/Documents/data/study_data1.csv") 
md1 <- import("~/Documents/data/item_md1.xlsx") #There is only one spreadsheet in this Excel file
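
As an illustration of the readxl route, the sketch below lists the sheets of an Excel file and reads one of them. It uses the demo workbook shipped with readxl rather than a dataquieR file, so the sheet content is unrelated to metadata. Converting the result to a plain data frame is a cautious choice here, not a documented requirement.

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  # Demo workbook shipped with readxl (not a dataquieR file)
  xlsx_path <- readxl::readxl_example("datasets.xlsx")

  # List the sheets, then read one of them by name
  sheets <- readxl::excel_sheets(xlsx_path)
  print(sheets)

  # read_excel() returns a tibble; convert it to a plain data frame
  tab <- as.data.frame(readxl::read_excel(xlsx_path, sheet = sheets[1]))
  str(tab)
} else {
  message("readxl is not installed; skipping the example")
}
```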


II. Import Metadata Excel Workbook

Multiple levels metadata are Excel workbook files containing multiple spreadsheets named following dataquieR conventions: “item_level”, “cross-item_level”, “segment_level”, “dataframe_level”. For more information see the metadata tutorial.

You can import this type of metadata using the dataquieR function prep_load_workbook_like_file in the following ways depending on their location:

1. Import metadata Excel workbook

The metadata file (metadata_type2.xlsx) is present in a local directory, for example in a folder named “data” inside “Documents”.

prep_load_workbook_like_file("C:/Users/Documents/data/metadata_type2.xlsx") 

After running this function, no object will be visible in the Global Environment, because the data frames are loaded into the dataquieR data frame storage area. See “Revise the content of the dataquieR data frame storage area” at the end of this tutorial for more information about managing this storage area.

You can then get the data frames in the Global Environment using the function prep_get_data_frame.

# To get the item level metadata
md1 <- prep_get_data_frame("item_level")
# To get the cross-item level metadata
cil_md1 <- prep_get_data_frame("cross-item_level")
# To get the segment level metadata
sl_md1 <- prep_get_data_frame("segment_level")
# To get the data frame level metadata
df_md1 <- prep_get_data_frame("dataframe_level")

2. Import dataquieR metadata Excel workbook examples

If you are importing the metadata shipped with the dataquieR package (meta_data_v2 or ship_meta_v2), the dataquieR function prep_load_workbook_like_file can be used with the name of the files without any path or extension.

prep_load_workbook_like_file("meta_data_v2") # for the synthetic example data
prep_load_workbook_like_file("ship_meta_v2") # for the SHIP-based example data

Then you can get the data frames in the Global Environment using the function prep_get_data_frame, as shown before.
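
When all four tables are needed, they can be fetched in one step instead of one call per table. The following is a minimal sketch using the synthetic example workbook; the object names level_names and md are chosen here for illustration, and the code runs only if dataquieR is installed.

```r
if (requireNamespace("dataquieR", quietly = TRUE)) {
  library(dataquieR)

  # Load the example workbook into the data frame storage area
  prep_load_workbook_like_file("meta_data_v2")

  # Sheet names that follow the dataquieR conventions
  level_names <- c("item_level", "cross-item_level",
                   "segment_level", "dataframe_level")

  # Fetch each table and keep them together in a named list
  md <- lapply(setNames(level_names, level_names), prep_get_data_frame)

  # Number of rows per metadata table
  print(sapply(md, nrow))
} else {
  message("dataquieR is not installed; skipping the example")
}
```

A single table can then be accessed by name, for example md[["item_level"]].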


III. Import all metadata and study data files from a folder

You can load all the metadata files and the study data contained in one folder with a single dataquieR function, prep_load_folder_with_metadata. This function loads all the files into the dataquieR data frame storage area.

For example, to load all the metadata files in the folder “My_metadata_files” in D:/, you can use the following code:

# Load all the metadata files in the folder into the dataquieR storage area
prep_load_folder_with_metadata("D:/My_metadata_files/")

Then you can get the tables in the Global Environment using the function prep_get_data_frame, as shown before.
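
The sketch below tries the folder import on a temporary folder containing one made-up csv file, and compares the storage area before and after to see what was added. The folder location and the toy file are assumptions for illustration. The names under which the tables are stored are not predicted here, so the code simply prints them.

```r
# Create a temporary folder with one small, made-up csv file
meta_dir <- file.path(tempdir(), "My_metadata_files")
dir.create(meta_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, age = c(34, 51, 47)),
          file.path(meta_dir, "study_data1.csv"),
          row.names = FALSE)

if (requireNamespace("dataquieR", quietly = TRUE)) {
  library(dataquieR)

  # Content of the storage area before loading the folder
  before <- prep_list_dataframes()

  prep_load_folder_with_metadata(meta_dir)

  # Entries added to the storage area by the folder import
  print(setdiff(prep_list_dataframes(), before))
} else {
  message("dataquieR is not installed; only the folder was created")
}
```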



Revise the content of the dataquieR data frame storage area

The functions prep_load_workbook_like_file() and prep_load_folder_with_metadata() add a set of data frames from a file or folder to the dataquieR data frame storage area. The following functions can be used to see or modify the content of this storage area:

  • prep_list_dataframes(), to see the content of this storage area, which is not visible in the Global Environment;

  • prep_get_data_frame(), to get a data frame from the dataquieR data frame storage area and have it available in the Global Environment;

  • prep_purge_data_frame_cache(), to delete everything from this storage area;

  • prep_remove_from_cache(), to delete only one data frame from this storage area;

  • prep_add_data_frames(), to add data frames to this storage area.

Example:

# Import the dataquieR metadata example for the synthetic data 
prep_load_workbook_like_file("meta_data_v2") 
# Look at the content of the dataquieR data frame storage area
prep_list_dataframes() 
# To add a data frame from the storage area to the Global Environment
df_md <- prep_get_data_frame("dataframe_level") 
# Remove all data frames from the dataquieR data frame storage area 
prep_purge_data_frame_cache() 
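
The two remaining functions from the list, prep_add_data_frames() and prep_remove_from_cache(), can be sketched in the same way. The example assumes that a data frame passed as a named argument is stored under that name; toy_table and its values are made up for illustration, and the code runs only if dataquieR is installed.

```r
if (requireNamespace("dataquieR", quietly = TRUE)) {
  library(dataquieR)

  # A small, made-up data frame
  toy <- data.frame(id = 1:3, value = c(10.5, 11.2, 9.8))

  # Add it to the storage area under the name "toy_table"
  prep_add_data_frames(toy_table = toy)
  print("toy_table" %in% prep_list_dataframes())

  # Remove only this data frame, leaving the rest of the storage area intact
  prep_remove_from_cache("toy_table")
  print("toy_table" %in% prep_list_dataframes())
} else {
  message("dataquieR is not installed; skipping the example")
}
```

Removing a single entry is useful when other loaded tables should stay available, whereas prep_purge_data_frame_cache() clears everything.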

When a table is available in the dataquieR storage area, it can be directly used in a dataquieR function by its name.

For example, let’s import the synthetic example metadata and look at the content of the storage area:

# Import the dataquieR metadata example for the synthetic data 
prep_load_workbook_like_file("meta_data_v2") 
# Look at the content of the dataquieR data frame storage area
prep_list_dataframes() 
##  [1] "cross-item_level"                   "dataframe_level"                   
##  [3] "expected_id"                        "item_level"                        
##  [5] "meta_data_v2.xlsx|cross-item_level" "meta_data_v2.xlsx|dataframe_level" 
##  [7] "meta_data_v2.xlsx|expected_id"      "meta_data_v2.xlsx|item_level"      
##  [9] "meta_data_v2.xlsx|missing_table"    "meta_data_v2.xlsx|segment_level"   
## [11] "meta_data_v2|cross-item_level"      "meta_data_v2|dataframe_level"      
## [13] "meta_data_v2|expected_id"           "meta_data_v2|item_level"           
## [15] "meta_data_v2|missing_table"         "meta_data_v2|segment_level"        
## [17] "missing_table"                      "segment_level"

Now you can use any of these tables in dataquieR functions simply by using their names as they appear in the list. For example, you can pass “item_level” to the meta_data argument to check for data type mismatches.

datatype_mismatch <- int_datatype_matrix(study_data = "study_data", 
                                         meta_data = "item_level")