This high-level function simplifies the process of creating a dataset from
the ABCD or HBCD Study data by allowing users to create an analysis-ready
dataset in a single step. It executes the lower-level functions provided in
the NBDCtools package in sequence to load, join, and transform the data.
The function expects study data to be stored as one .parquet or .tsv file
per database table within a specified directory, provided as dir_data.
Variables specified in vars and tables will be full-joined together,
while variables specified in vars_add and tables_add will be left-joined
to these variables. For more details, see join_tabulated().
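The difference between the two kinds of join can be pictured with a small base R sketch. It does not call NBDCtools; the toy tables and the id key column are hypothetical stand-ins, and the real join keys may differ.

```r
# Toy tables keyed by a hypothetical `id` column (illustration only)
tbl_a   <- data.frame(id = c(1, 2), var1 = c("a", "b"))
tbl_b   <- data.frame(id = c(2, 3), var2 = c(10, 20))
tbl_add <- data.frame(id = c(1, 3, 4), var3 = c(TRUE, FALSE, TRUE))

# vars/tables: a full join keeps every id found in either table
core <- merge(tbl_a, tbl_b, by = "id", all = TRUE)

# vars_add/tables_add: a left join keeps only ids already present in `core`,
# so id 4 (which exists only in tbl_add) is dropped
result <- merge(core, tbl_add, by = "id", all.x = TRUE)
print(result)
```

In this sketch, rows that exist only in the additional table never enlarge the dataset, which is the practical reason to pass supplementary variables through vars_add or tables_add.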
In addition to the main create_dataset() function, there are two
study-specific variations:
- create_dataset_abcd(): for the ABCD study.
- create_dataset_hbcd(): for the HBCD study.
They have the same arguments as the create_dataset() function, except
that the study argument is set to the respective study by default, and
should not be set by the user.
Usage
create_dataset(
  dir_data,
  study,
  vars = NULL,
  tables = NULL,
  vars_add = NULL,
  tables_add = NULL,
  release = "latest",
  format = "parquet",
  bypass_ram_check = FALSE,
  categ_to_factor = TRUE,
  add_labels = TRUE,
  value_to_label = FALSE,
  value_to_na = FALSE,
  time_to_hms = FALSE,
  bind_shadow = FALSE,
  ...
)
create_dataset_abcd(...)
create_dataset_hbcd(...)
Arguments
- dir_data
character. Path to the directory with the data files in .parquet or .tsv format.
- study
character. NBDC study (One of "abcd" or "hbcd").
- vars
character (vector). Name(s) of variable(s) to be joined (Default: NULL, i.e., no variables are selected; one of tables or vars has to be provided).
- tables
character (vector). Name(s) of table(s) to be joined (Default: NULL, i.e., no tables are selected; one of tables or vars has to be provided).
- vars_add
character (vector). Name(s) of additional variable(s) to be left-joined to the variables selected in vars and tables (Default: NULL, i.e., no additional variables are selected).
- tables_add
character (vector). Name(s) of additional table(s) to be left-joined to the variables selected in vars and tables (Default: NULL, i.e., no additional tables are selected).
- release
character. Release version (Default: "latest").
- format
character. Data format (One of "parquet" or "tsv"; default: "parquet").
- bypass_ram_check
logical. If TRUE, the function will not abort when the number of variables exceeds 10000 and the currently available RAM is less than 75% of the estimated RAM usage (Default: FALSE). The check exists to avoid spending a long time loading the data only to fail midway because of insufficient RAM. For large datasets, it is recommended to have at least twice the estimated RAM available before running this function. This argument is only used for the ABCD study, as the HBCD data is currently small enough to be loaded without RAM issues on most personal computers. This may change as the HBCD data grows.
- categ_to_factor
logical. Whether to convert categorical variables to factor class, see transf_factor() (Default: TRUE).
- add_labels
logical. Whether to add variable and value labels to the variables, see transf_label() (Default: TRUE).
- value_to_label
logical. Whether to convert the categorical variables' numeric values to labels, see transf_value_to_label() (Default: FALSE). To run this process, categ_to_factor and add_labels must be TRUE.
- value_to_na
logical. Whether to convert categorical missingness/non-response codes to NA, see transf_value_to_na() (Default: FALSE). To run this process, categ_to_factor and add_labels must be TRUE.
- time_to_hms
logical. Whether to convert time variables to hms class, see transf_time_to_hms() (Default: FALSE).
- bind_shadow
logical. Whether to bind the shadow matrix to the dataset (Default: FALSE). See the Details section for more.
- ...
additional arguments passed to downstream functions after the join_tabulated() step. See examples for details.
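The rule described under bypass_ram_check can be written out as a small predicate. This is only a sketch of the documented condition, not the package's implementation; the function name ram_check_would_abort() and its arguments are hypothetical, and how NBDCtools actually estimates RAM usage is not covered here.

```r
# Hypothetical helper (not part of NBDCtools): returns TRUE when the
# documented conditions for aborting are met.
ram_check_would_abort <- function(n_vars, ram_available, ram_estimated,
                                  bypass_ram_check = FALSE) {
  # bypass_ram_check = TRUE disables the check entirely
  if (bypass_ram_check) {
    return(FALSE)
  }
  # Abort only if BOTH conditions hold: more than 10000 variables AND
  # available RAM below 75% of the estimated usage (same unit for both)
  n_vars > 10000 && ram_available < 0.75 * ram_estimated
}

# 12000 variables, 8 GB available, 16 GB estimated: 8 < 12, so abort
ram_check_would_abort(12000, ram_available = 8, ram_estimated = 16)
```

Both conditions must hold for the check to trigger, so a request for a few hundred variables proceeds regardless of the RAM estimate.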
Details
Order
This high-level function executes the different steps in the following order:
1. Read the data/shadow matrix using join_tabulated().
2. Convert categorical variables to factors using transf_factor().
3. Add labels to the variables and values using transf_label().
4. Convert categorical variables' numeric values to labels using transf_value_to_label().
5. Convert categorical missingness/non-response codes to NA using transf_value_to_na().
6. Convert time variables to hms class using transf_time_to_hms().
7. If bind_shadow is TRUE and the study is HBCD, replace the missing values in the shadow that result from joining multiple datasets using shadow_replace_binding_missing().
8. If bind_shadow is TRUE, bind the shadow matrix to the data using shadow_bind_data().
Not all steps are executed by default. The above order represents the maximal order of execution.
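To make the gating concrete, the sketch below lists which steps would run for a given set of arguments, in the order given above. The helper planned_steps() is hypothetical and only returns step names; it does not call NBDCtools. One assumption to flag: when value_to_label or value_to_na is requested without its prerequisites, the sketch simply leaves the step out, whereas the real function may behave differently (for example, by raising an error).

```r
# Hypothetical helper (not part of NBDCtools): names of the steps that
# would run, using the defaults shown in the Usage section.
planned_steps <- function(study = "abcd",
                          categ_to_factor = TRUE,
                          add_labels = TRUE,
                          value_to_label = FALSE,
                          value_to_na = FALSE,
                          time_to_hms = FALSE,
                          bind_shadow = FALSE) {
  # Documented prerequisite for value_to_label and value_to_na
  prereq <- categ_to_factor && add_labels

  steps <- c(
    join_tabulated                 = TRUE,  # always runs first
    transf_factor                  = categ_to_factor,
    transf_label                   = add_labels,
    transf_value_to_label          = value_to_label && prereq,
    transf_value_to_na             = value_to_na && prereq,
    transf_time_to_hms             = time_to_hms,
    shadow_replace_binding_missing = bind_shadow && study == "hbcd",
    shadow_bind_data               = bind_shadow
  )
  names(steps)[steps]
}

planned_steps()
planned_steps(study = "hbcd", value_to_na = TRUE, bind_shadow = TRUE)
```

With the defaults, only the join, factor conversion, and labeling steps are selected; the remaining steps are opt-in.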
bind_shadow
If bind_shadow is TRUE, the shadow matrix will be added to the data using
shadow_bind_data().
- HBCD study: For the HBCD study, this function uses the shadow matrix from the dir_data directory by default (the HBCD Study releases a _shadow.parquet/_shadow.tsv file per table that accompanies the data). Alternatively, one can set naniar_shadow = TRUE as part of the ... arguments to use naniar::as_shadow() to create a shadow matrix from the data.
- ABCD study: The ABCD Study does not currently release shadow matrices. If bind_shadow is set to TRUE, the function will create the shadow matrix from the data using naniar::as_shadow(); no extra naniar_shadow = TRUE argument is needed.
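For readers unfamiliar with shadow matrices, the following base R sketch shows the general idea: one companion column per variable that records whether each value is missing. It mimics the naming convention used by naniar (a _NA suffix and the levels "!NA" and "NA") but does not call naniar or NBDCtools, and the shadow files released by the HBCD Study may encode more detailed missingness information than this binary version.

```r
# Toy data with some missing values (illustration only)
dat <- data.frame(
  var1 = c(1, NA, 3),
  var2 = c("x", "y", NA)
)

# Build a shadow matrix: same shape as `dat`, each cell flags missingness
shadow <- as.data.frame(lapply(dat, function(col) {
  factor(ifelse(is.na(col), "NA", "!NA"), levels = c("!NA", "NA"))
}))
names(shadow) <- paste0(names(dat), "_NA")

# "Binding" the shadow places the companion columns next to the data
bound <- cbind(dat, shadow)
print(bound)
```

Keeping the flags next to the values lets downstream code distinguish missing from observed entries without re-deriving that information.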
Examples
if (FALSE) { # \dontrun{
# most common use case
create_dataset(
  dir_data = "6_0/data",
  study = "abcd",
  vars = c("var1", "var2", "var3")
)
# to handle tagged missingness
create_dataset(
  dir_data = "1_0/data",
  study = "hbcd",
  vars = c("var1", "var2", "var3"),
  value_to_na = TRUE
)
# to bind shadow matrices to the data
create_dataset(
  dir_data = "1_0/data/",
  study = "hbcd",
  vars = c("var1", "var2", "var3"),
  bind_shadow = TRUE
)
# to use the additional arguments
# for example, for the `value_to_na` option, the underlying function
# `transf_value_to_na()` has 2 more arguments,
# which can be passed to the `create_dataset()` function
create_dataset(
  dir_data = "6_0/data",
  study = "abcd",
  vars = c("var1", "var2", "var3"),
  value_to_na = TRUE,
  missing_codes = c("999", "888", "777", "666", "555", "444", "333", "222"),
  ignore_col_pattern = "__dk$|__dk__l$"
)
# use study-specific functions
create_dataset_abcd(
  dir_data = "6_0/data",
  vars = c("var1", "var2", "var3")
)
} # }