This high-level function simplifies the process of creating a dataset from
the ABCD or HBCD Study data by allowing users to create an analysis-ready
dataset in a single step. It executes the lower-level functions provided in
the NBDCtools package in sequence to load, join, and transform the data.
The function expects study data to be stored as one .parquet or .tsv file
per database table within a specified directory, provided as dir_data.
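As a rough illustration of the expected layout, the following base R sketch builds a throwaway directory with one .tsv file per table and recovers the table names from the file names. The table names (table_a, table_b) and the participant_id column are made up for the example; real releases use their own names, and this is not how the package itself scans the directory.

```r
# Hypothetical layout: one file per table inside a fresh temporary directory.
dir_data <- tempfile("nbdc_demo_")
dir.create(dir_data)

# "table_a" and "table_b" are invented table names for illustration only.
demo_tables <- c("table_a", "table_b")
for (tbl in demo_tables) {
  write.table(
    data.frame(participant_id = c("P1", "P2"), value = c(1, 2)),
    file = file.path(dir_data, paste0(tbl, ".tsv")),
    sep = "\t", row.names = FALSE, quote = FALSE
  )
}

# Recover the table names from the file names for the chosen format.
format <- "tsv"
files <- list.files(dir_data, pattern = paste0("\\.", format, "$"))
tables_found <- sort(tools::file_path_sans_ext(files))
print(tables_found)
```

A directory prepared this way is what the dir_data argument would point to.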
Variables specified in vars and tables will be full-joined together, while
variables specified in vars_add and tables_add will be left-joined to these
variables. For more details, see join_tabulated().
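The difference between the two join types can be sketched with base R's merge() on toy data. This only illustrates the join semantics and is not the implementation of join_tabulated(); the key column id and all values are invented for the example.

```r
# Toy tables; `id` is a hypothetical join key.
main_a <- data.frame(id = c("P1", "P2"), var1 = c(10, 20))
main_b <- data.frame(id = c("P2", "P3"), var2 = c("x", "y"))
extra  <- data.frame(id = c("P1", "P4"), var_add = c(TRUE, FALSE))

# Full join (like vars/tables): keep every id found in either table.
core <- merge(main_a, main_b, by = "id", all = TRUE)

# Left join (like vars_add/tables_add): keep only ids already in `core`.
result <- merge(core, extra, by = "id", all.x = TRUE)
print(result)
```

In this sketch P3 is kept because it appears in one of the full-joined tables, while P4 is dropped because it appears only in the left-joined table.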
In addition to the main create_dataset() function, there are two
study-specific variations:
- create_dataset_abcd(): for the ABCD study.
- create_dataset_hbcd(): for the HBCD study.
They have the same arguments as the create_dataset() function, except that
the study argument is set to the respective study by default and should not
be set by the user.
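One plausible way such a wrapper could be written is shown below with toy functions. The names toy_create_dataset() and toy_create_dataset_abcd() are hypothetical stand-ins, and the actual package code may differ. The sketch also shows why passing study to a wrapper of this shape would fail.

```r
# Toy stand-in for create_dataset(); NOT the real function.
toy_create_dataset <- function(dir_data, study, ...) {
  paste("loading", study, "data from", dir_data)
}

# Hypothetical wrapper that presets `study` and forwards everything else.
toy_create_dataset_abcd <- function(...) {
  toy_create_dataset(study = "abcd", ...)
}

msg <- toy_create_dataset_abcd(dir_data = "6_0/data")
print(msg)

# Supplying `study` again collides with the preset value and raises an error.
err <- tryCatch(
  toy_create_dataset_abcd(dir_data = "6_0/data", study = "hbcd"),
  error = function(e) conditionMessage(e)
)
print(err)
```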
Usage
create_dataset(
  dir_data,
  study,
  vars = NULL,
  tables = NULL,
  vars_add = NULL,
  tables_add = NULL,
  release = "latest",
  format = "parquet",
  bypass_ram_check = FALSE,
  categ_to_factor = TRUE,
  add_labels = TRUE,
  value_to_label = FALSE,
  value_to_na = FALSE,
  time_to_hms = FALSE,
  bind_shadow = FALSE,
  ...
)
create_dataset_abcd(...)
create_dataset_hbcd(...)
Arguments
- dir_data
character. Path to the directory with the data files in .parquet or .tsv format.
- study
character. NBDC study (one of "abcd" or "hbcd").
- vars
character (vector). Name(s) of variable(s) to be joined (Default: NULL, i.e., no variables are selected; one of tables or vars has to be provided).
- tables
character (vector). Name(s) of table(s) to be joined (Default: NULL, i.e., no tables are selected; one of tables or vars has to be provided).
- vars_add
character (vector). Name(s) of additional variable(s) to be left-joined to the variables selected in vars and tables (Default: NULL, i.e., no additional variables are selected).
- tables_add
character (vector). Name(s) of additional table(s) to be left-joined to the variables selected in vars and tables (Default: NULL, i.e., no additional tables are selected).
- release
character. Release version (Default: "latest").
- format
character. Data format (one of "parquet" or "tsv"; default: "parquet").
- bypass_ram_check
logical. If TRUE, the function will not abort when the number of variables exceeds 10000 and the currently available RAM is less than 75% of the estimated RAM usage. The check is meant to avoid a long data load that fails midway because of insufficient RAM. For large datasets, it is recommended to have at least 2 times the estimated RAM available before running this function. This argument is only used for the ABCD study, as the HBCD data is small enough to be loaded without RAM issues on most personal computers. As HBCD data grows in the future, this may change.
- categ_to_factor
logical. Whether to convert categorical variables to factor class, see transf_factor() (Default: TRUE).
- add_labels
logical. Whether to add variable and value labels to the variables, see transf_label() (Default: TRUE).
- value_to_label
logical. Whether to convert the categorical variables' numeric values to labels, see transf_value_to_label() (Default: FALSE). To run this process, categ_to_factor and add_labels must be TRUE.
- value_to_na
logical. Whether to convert categorical missingness/non-response codes to NA, see transf_value_to_na() (Default: FALSE). To run this process, categ_to_factor and add_labels must be TRUE.
- time_to_hms
logical. Whether to convert time variables to hms class, see transf_time_to_hms() (Default: FALSE).
- bind_shadow
logical. Whether to bind the shadow matrix to the dataset (Default: FALSE). See the Details section for more.
- ...
additional arguments passed to downstream functions after the join_tabulated() step. See examples for details.
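The abort rule described for bypass_ram_check can be written as a small predicate. This is only a sketch of the stated thresholds: the helper should_abort() is hypothetical, the RAM figures are arbitrary numbers in the same unit, and how the package estimates RAM usage is not covered here.

```r
# Hypothetical helper encoding the documented rule: abort only when the check
# is active, more than 10000 variables are requested, and available RAM is
# below 75% of the estimated usage.
should_abort <- function(n_vars, available_ram, estimated_ram,
                         bypass_ram_check = FALSE) {
  if (bypass_ram_check) {
    return(FALSE)
  }
  n_vars > 10000 && available_ram < 0.75 * estimated_ram
}

# Illustrative values (e.g., in GB); not measurements from real data.
print(should_abort(n_vars = 12000, available_ram = 8, estimated_ram = 16))
print(should_abort(n_vars = 12000, available_ram = 14, estimated_ram = 16))
print(should_abort(n_vars = 12000, available_ram = 8, estimated_ram = 16,
                   bypass_ram_check = TRUE))
```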
Details
Order
This high-level function executes the different steps in the following order:
1. Read the data/shadow matrix using join_tabulated().
2. Convert categorical variables to factors using transf_factor().
3. Add labels to the variables and values using transf_label().
4. Convert categorical variables' numeric values to labels using transf_value_to_label().
5. Convert categorical missingness/non-response codes to NA using transf_value_to_na().
6. Convert time variables to hms class using transf_time_to_hms().
7. If bind_shadow is TRUE and the study is "HBCD", replace the missing values in the shadow matrix that result from joining multiple datasets using shadow_replace_binding_missing().
8. Bind the shadow matrix to the data using shadow_bind_data().
Not all steps are executed by default. The above order represents the maximal order of execution.
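To make the categorical conversions more concrete, the base R sketch below applies conceptually similar operations to a toy vector. It does not call the transf_*() functions, whose behavior may differ in detail; the codes, labels, and the choice of "999" as a missingness code are assumptions for the example.

```r
# Toy categorical variable stored as character codes.
codes <- c("1", "2", "999", "1")

# Assumed value labels; in practice these come from the study metadata.
value_labels <- c("1" = "Yes", "2" = "No", "999" = "Don't know")

# Codes to factor: levels are the raw codes.
as_factor <- factor(codes, levels = names(value_labels))

# Values to labels: same levels, shown with their labels.
as_labels <- factor(codes, levels = names(value_labels),
                    labels = unname(value_labels))

# Missingness codes to NA: "999" is treated as a non-response code here.
missing_codes <- c("999")
codes_na <- codes
codes_na[codes_na %in% missing_codes] <- NA

print(as_factor)
print(as_labels)
print(codes_na)
```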
bind_shadow
If bind_shadow is TRUE, the shadow matrix will be added to the data using
shadow_bind_data().
- HBCD study: For the HBCD study, this function uses the shadow matrix from the dir_data directory by default (the HBCD Study releases a _shadow.parquet/_shadow.tsv file per table that accompanies the data). Alternatively, one can set naniar_shadow = TRUE as part of the ... arguments to use naniar::as_shadow() to create a shadow matrix from the data.
- ABCD study: The ABCD Study does not currently release shadow matrices. If bind_shadow is set to TRUE, the function will create the shadow matrix from the data using naniar::as_shadow(); no extra naniar_shadow = TRUE argument is needed.
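For readers unfamiliar with shadow matrices, the following base R sketch approximates the idea: each variable gets a companion column recording whether the value is missing, and the companions are bound to the data. The _NA suffix and the "!NA"/"NA" levels are assumed here to mirror naniar's convention; the output of naniar::as_shadow(), shadow_bind_data(), or the released HBCD shadow files may differ.

```r
# Toy data with some missing values; column names are invented.
dat <- data.frame(
  id   = c("P1", "P2", "P3"),
  var1 = c(1, NA, 3),
  var2 = c("a", "b", NA)
)

# Build one missingness indicator per column (approximation, base R only).
shadow <- as.data.frame(lapply(dat, function(col) {
  factor(ifelse(is.na(col), "NA", "!NA"), levels = c("!NA", "NA"))
}))
names(shadow) <- paste0(names(dat), "_NA")

# Bind the shadow columns next to the original data.
bound <- cbind(dat, shadow)
print(bound)
```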
Examples
if (FALSE) { # \dontrun{
  # most common use case
  create_dataset(
    dir_data = "6_0/data",
    study = "abcd",
    vars = c("var1", "var2", "var3")
  )

  # to handle tagged missingness
  create_dataset(
    dir_data = "1_0/data",
    study = "hbcd",
    vars = c("var1", "var2", "var3"),
    value_to_na = TRUE
  )

  # to bind shadow matrices to the data
  create_dataset(
    dir_data = "1_0/data/",
    study = "hbcd",
    vars = c("var1", "var2", "var3"),
    bind_shadow = TRUE
  )

  # to use the additional arguments
  # for example in `value_to_na` option, the underlying function
  # `transf_value_to_na()` has 2 more arguments,
  # which can be passed to the `create_dataset()` function
  create_dataset(
    dir_data = "6_0/data",
    study = "abcd",
    vars = c("var1", "var2", "var3"),
    value_to_na = TRUE,
    missing_codes = c("999", "888", "777", "666", "555", "444", "333", "222"),
    ignore_col_pattern = "__dk$|__dk__l$"
  )

  # use study-specific functions
  create_dataset_abcd(
    dir_data = "6_0/data",
    vars = c("var1", "var2", "var3")
  )
} # }