Automated data quality assessments with the R package dataquieR use the collected data (study data) together with information and requirements about the study data (metadata). This tutorial explains how to import study data and metadata to generate data quality reports with dataquieR.
The functions in dataquieR can use 2 types of metadata:
Item level metadata: a csv file or a single spreadsheet in an Excel file containing information and expectations about single variables. It is identified by the function argument meta_data =
Multiple levels metadata: an Excel workbook with multiple spreadsheets containing metadata organized in several tables named following dataquieR conventions: “item_level”, “cross-item_level”, “segment_level”, “dataframe_level”. This type of metadata is identified by the function argument meta_data_v2 =
See the metadata tutorial for more information.
You will need the dataquieR and rio packages installed for this tutorial. Here is the code to install both, if needed.
install.packages("dataquieR")
install.packages("rio")
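If you run the tutorial more than once, you may prefer to install only what is missing. The following is a minimal sketch using base R only; the helper name missing_packages is hypothetical and not part of dataquieR or rio, and the install call is left commented out so that nothing is installed unintentionally.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("dataquieR", "rio"))

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment the next line to install them
  # install.packages(to_install)
}
```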
The dq_report2() report generation function does not require the study data and metadata to be imported beforehand. The function imports files directly from a path, a URL, or, for the sample data shipped with the package, just the file name. An example of each case follows.
Both study data (study_data1.csv) and metadata (metadata1.xlsx) are locally available in the “Downloads” folder.
dq_report2(study_data = "~/Downloads/study_data1.csv",
meta_data_v2 = "~/Downloads/metadata1.xlsx")
Data can also be imported directly from a URL:
dq_report2(study_data = "https://.../study_data1.csv",
meta_data_v2 = "https://.../metadata1.xlsx")
Only for the two sample study data sets and metadata files shipped with the dataquieR package can you specify the file name without the path or the extension.
dq_report2(study_data = "ship", meta_data_v2 = "ship_meta_v2") # for the SHIP-based example data
dq_report2(study_data = "study_data", meta_data_v2 = "meta_data_v2") # for the synthetic example data
If the data are already available in the R Global Environment, you can pass the object names to the function. In the following example, the study data is the object sd1 and the item level metadata is the object md1.
dq_report2(study_data = sd1,
meta_data = md1)
To apply single dataquieR functions or to inspect the data, it is necessary to import the data first. The dataquieR functions for importing data are:
prep_get_data_frame
prep_load_workbook_like_file
prep_load_folder_with_metadata
Study data and item level metadata can consist of a single spreadsheet in an Excel file or a csv file. You can import them using the dataquieR function prep_get_data_frame in the following ways, depending on their location.
Study data (study_data1.csv) or item level metadata (item_md1.xlsx) are available in a local directory, for example in “Documents” inside a folder named “data”.
library(dataquieR)
sd1 <- prep_get_data_frame("~/Documents/data/study_data1.csv")
md1 <- prep_get_data_frame("~/Documents/data/item_md1.xlsx") #There is only one spreadsheet in this Excel file
# This path is just an example, replace it with the correct path to the file you want to import
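To try the path-based import without having your own files at hand, you can write a small table to a temporary file first. The sketch below is only an illustration: the toy data, the file name toy_study_data.csv, and the object names are hypothetical, and the base R fallback is used only when dataquieR or rio is not installed.

```r
# Hypothetical toy data, written to a temporary csv file
toy <- data.frame(id = 1:3, sbp = c(120, 135, 128))
csv_path <- file.path(tempdir(), "toy_study_data.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Stop early if the file is not where we expect it
stopifnot(file.exists(csv_path))

if (requireNamespace("dataquieR", quietly = TRUE) &&
    requireNamespace("rio", quietly = TRUE)) {
  # Import by path, as described in the text
  sd_toy <- dataquieR::prep_get_data_frame(csv_path)
} else {
  # Base R fallback so the example still runs without dataquieR
  sd_toy <- utils::read.csv(csv_path)
}

# Inspect what was imported (column types may differ between the two branches)
str(sd_toy)
```

Checking the structure right after the import is a cheap way to notice a wrong path or an unexpected column layout before running any data quality function.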
Study data (study_data1.csv) or item level metadata (item_md1.xlsx) can also be imported from a URL with the prep_get_data_frame function.
sd1 <- prep_get_data_frame("https://.../study_data1.csv")
md1 <- prep_get_data_frame("https://.../item_md1.xlsx") #There is only one spreadsheet in this Excel file
To import an example of study data (“study_data” or “ship”) or metadata (“meta_data” or “ship_meta”) shipped with the dataquieR package, you have to write the example file name without the path or extension.
sd1 <- prep_get_data_frame("study_data") # for the synthetic example data
sd2 <- prep_get_data_frame("ship") # for the SHIP-based example data
md1 <- prep_get_data_frame("meta_data") # for the synthetic example item level metadata
md2 <- prep_get_data_frame("ship_meta") # for the SHIP-based example item level metadata
Old option to import data shipped with the package
As an alternative, you can first retrieve the path of a dataquieR study data or metadata example and then pass it to prep_get_data_frame.
file_name <- system.file("extdata", "study_data.xlsx", package = "dataquieR")
sd1 <- prep_get_data_frame(file_name)
Any function in dataquieR can also be used with data already imported by other R packages, for example rio or readxl. Note: for .dta files, we suggest using the function import from the rio package.
library(rio)
sd1 <- import("~/Documents/data/study_data1.csv")
md1 <- import("~/Documents/data/item_md1.xlsx") #There is only one spreadsheet in this Excel file
Multiple levels metadata are Excel workbook files containing multiple spreadsheets named following dataquieR conventions: “item_level”, “cross-item_level”, “segment_level”, “dataframe_level”. For more information see the metadata tutorial.
You can import this type of metadata using the dataquieR function prep_load_workbook_like_file in the following ways, depending on their location:
The metadata file (metadata_type2.xlsx) is present in a local directory. For example, metadata_type2.xlsx is in “Documents” inside the folder “data”.
prep_load_workbook_like_file("C:/Users/Documents/data/metadata_type2.xlsx")
After running this function, no object will be visible in the Global Environment, because the data frames are loaded into the dataquieR data frame storage area. See “Revise content in the dataquieR data frame storage area” at the end of this tutorial for more information about managing this storage area.
You can then get the data frames in the Global Environment using the function prep_get_data_frame.
# To get the item level metadata
md1 <- prep_get_data_frame("item_level")
# To get the cross-item level metadata
cil_md1 <- prep_get_data_frame("cross-item_level")
# To get the segment level metadata
sl_md1 <- prep_get_data_frame("segment_level")
# To get the data frame level metadata
df_md1 <- prep_get_data_frame("dataframe_level")
If you are importing the metadata shipped with the dataquieR package (meta_data_v2 or ship_meta_v2), the dataquieR function prep_load_workbook_like_file can be used with the name of the files without any path or extension.
prep_load_workbook_like_file("meta_data_v2") # for the synthetic example data
prep_load_workbook_like_file("ship_meta_v2") # for the SHIP-based example data
Then you can get the data frames in the Global Environment using the function prep_get_data_frame, as shown before.
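Instead of one call per table, the four levels can be fetched in a loop. The following is a minimal sketch, assuming the shipped synthetic metadata “meta_data_v2” contains all four sheets; the object names level_names and md_list are hypothetical, and the dataquieR calls are skipped if the package is not installed.

```r
# Sheet names following the dataquieR conventions described above
level_names <- c("item_level", "cross-item_level",
                 "segment_level", "dataframe_level")

md_list <- list()

if (requireNamespace("dataquieR", quietly = TRUE)) {
  # Load the shipped workbook into the dataquieR data frame storage area
  dataquieR::prep_load_workbook_like_file("meta_data_v2")

  # Fetch each table by name and keep them together in one named list
  md_list <- lapply(
    stats::setNames(level_names, level_names),
    function(nm) dataquieR::prep_get_data_frame(nm)
  )

  # Number of rows per metadata table
  print(vapply(md_list, nrow, integer(1)))
}
```

A single table is then available as, for example, md_list[["item_level"]].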
If you want, you can load all the metadata files and the study data inside one folder with a single dataquieR function, prep_load_folder_with_metadata. This function loads all the files into the dataquieR storage area.
For example, to load all the metadata files in the folder “My_metadata_files” in D:/, you can use the following code:
# Load all the metadata files in the folder into the dataquieR storage area
prep_load_folder_with_metadata("D:/My_metadata_files/")
Then you can get the tables in the Global Environment using the function prep_get_data_frame, as shown before.
The functions prep_load_workbook_like_file() and prep_load_folder_with_metadata() add a set of data frames from a file or folder to the dataquieR data frame storage area. The following functions can be used to see or modify the content of this storage area:
prep_list_dataframes(), to see the content of this storage area, which is not visible in the Global Environment;
prep_get_data_frame(), to get a data frame from the dataquieR data frame storage area and make it available in the Global Environment;
prep_purge_data_frame_cache(), to delete everything from this storage area;
prep_remove_from_cache(), to delete only one data frame from this storage area;
prep_add_data_frames(), to add data frames to this storage area.
Example:
# Import the dataquieR metadata example for the synthetic data
prep_load_workbook_like_file("meta_data_v2")
# Look at the content of the dataquieR data frame storage area
prep_list_dataframes()
# To add a data frame from the storage area to the Global Environment
df_md <- prep_get_data_frame("dataframe_level")
# Remove all data frames from the dataquieR data frame storage area
prep_purge_data_frame_cache()
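The example above does not use prep_add_data_frames() or prep_remove_from_cache(). The following is a minimal sketch of both, under the assumption that prep_add_data_frames() takes the storage name from the argument name and that prep_remove_from_cache() accepts the name as a character string; please check the function help pages for your installed version. The table toy_table and its content are hypothetical.

```r
# Hypothetical small table to place in the storage area
toy_table <- data.frame(VAR_NAMES = c("v1", "v2"),
                        LABEL = c("AGE", "SEX"))

names_with <- character(0)
names_without <- character(0)

if (requireNamespace("dataquieR", quietly = TRUE)) {
  # Add the table under the name "toy_table" (assumed naming behaviour)
  dataquieR::prep_add_data_frames(toy_table = toy_table)
  names_with <- dataquieR::prep_list_dataframes()

  # Remove only this one table again
  dataquieR::prep_remove_from_cache("toy_table")
  names_without <- dataquieR::prep_list_dataframes()

  # Compare the listings before and after the removal
  print("toy_table" %in% names_with)
  print("toy_table" %in% names_without)
}
```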
When a table is available in the dataquieR storage area, it can be directly used in a dataquieR function by its name.
For example, let’s import the synthetic example metadata and look at the content of the storage area:
# Import the dataquieR metadata example for the synthetic data
prep_load_workbook_like_file("meta_data_v2")
# Look at the content of the dataquieR data frame storage area
prep_list_dataframes()
## [1] "cross-item_level" "dataframe_level"
## [3] "expected_id" "item_level"
## [5] "meta_data_v2.xlsx|cross-item_level" "meta_data_v2.xlsx|dataframe_level"
## [7] "meta_data_v2.xlsx|expected_id" "meta_data_v2.xlsx|item_level"
## [9] "meta_data_v2.xlsx|missing_table" "meta_data_v2.xlsx|segment_level"
## [11] "meta_data_v2|cross-item_level" "meta_data_v2|dataframe_level"
## [13] "meta_data_v2|expected_id" "meta_data_v2|item_level"
## [15] "meta_data_v2|missing_table" "meta_data_v2|segment_level"
## [17] "missing_table" "segment_level"
Now you can use any of these tables in the functions just by using their names as they appear in the list. For example, you can use “item_level” in the meta_data argument to check for data type mismatches.
datatype_mismatch <- int_datatype_matrix(study_data = "study_data",
meta_data = "item_level")
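The call above stores the result in an object without showing it. A minimal sketch for inspecting it follows; it assumes the result is a list and that it has an element named SummaryTable, which is why the element names are printed first and the table is shown only if it is present. The dataquieR calls are skipped if the package is not installed.

```r
datatype_mismatch <- NULL

if (requireNamespace("dataquieR", quietly = TRUE)) {
  # Make sure the metadata tables are in the storage area
  dataquieR::prep_load_workbook_like_file("meta_data_v2")

  datatype_mismatch <- dataquieR::int_datatype_matrix(
    study_data = "study_data",
    meta_data = "item_level"
  )

  # See which elements the result contains
  print(names(datatype_mismatch))

  # Show the first rows of the summary table, if there is one (assumed name)
  if ("SummaryTable" %in% names(datatype_mismatch)) {
    print(head(datatype_mismatch$SummaryTable))
  }
}
```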