In this vignette, we demonstrate how to apply different transformations to a joined dataset, such as converting categorical columns to factors, adding variable and value labels, converting categorical values to labels, and more.
Prepare data
NOTE: Before running the code in this vignette, we assume that you have already read the Get started vignette or the Join data vignette, and know how to load the data for a given NBDC study release.
To demonstrate the transformation functions, we load a simulated ABCD dataset that is included with the package.
library(NBDCtools)
#> Welcome to the `NBDCtools` package! For more information, visit: https://software.nbdc-datahub.org/NBDCtools/
#> This package is developed by the ABCD Data Analysis, Informatics & Resource Center (DAIRC) at the J. Craig Venter Institute (JCVI)
data <- readRDS(
  system.file("extdata", "simulated_data_abcd.rds", package = "NBDCtools")
)
dplyr::glimpse(data)
#> Rows: 10
#> Columns: 10
#> $ participant_id <chr> "sub-0000000006", "sub-0000000007", …
#> $ session_id <chr> "ses-04A", "ses-05A", "ses-02A", "se…
#> $ ab_g_dyn__visit_type <chr> "3", "1", "1", "1", "3", "3", "2", "…
#> $ ab_g_dyn__cohort_grade <chr> "9", "6", NA, "7", "8", NA, "8", "8"…
#> $ ab_g_dyn__visit__day1_dt <date> 2020-10-19, 2022-05-19, 2022-09-12,…
#> $ ab_g_stc__gen_pc__01 <dbl> -0.022995395, 0.006506271, 0.0039703…
#> $ ab_g_dyn__visit_age <dbl> 11.63836, 13.18630, 10.11475, 10.114…
#> $ ab_g_dyn__visit_days <int> 1, 1, 2, 1, 1, 1, 2, 2, 1, 2
#> $ mr_y_qc__raw__dmri__r01__series_t <chr> NA, NA, NA, "11:05:35", "15:47:35", …
#> $ ab_g_dyn__visit_dtt <dttm> 2022-05-19 13:01:00, 2019-02-17 09:2…
Convert categorical columns to factors
Looking at the simulated data, we can see that categorical columns like ab_g_dyn__visit_type are of type “character”. We can use the transf_factor() function to convert these columns to type “factor”, which is the correct type for categorical variables in R.
data_transf <- data |>
  transf_factor(study = "abcd")
dplyr::glimpse(data_transf)
#> Rows: 10
#> Columns: 10
#> $ participant_id <chr> "sub-0000000006", "sub-0000000007", …
#> $ session_id <fct> ses-04A, ses-05A, ses-02A, ses-01A, …
#> $ ab_g_dyn__visit_type <fct> 3, 1, 1, 1, 3, 3, 2, 3, 2, 1
#> $ ab_g_dyn__cohort_grade <ord> 9, 6, NA, 7, 8, NA, 8, 8, 8, NA
#> $ ab_g_dyn__visit__day1_dt <date> 2020-10-19, 2022-05-19, 2022-09-12, …
#> $ ab_g_stc__gen_pc__01 <dbl> -0.022995395, 0.006506271, 0.0039703…
#> $ ab_g_dyn__visit_age <dbl> 11.63836, 13.18630, 10.11475, 10.11…
#> $ ab_g_dyn__visit_days <int> 1, 1, 2, 1, 1, 1, 2, 2, 1, 2
#> $ mr_y_qc__raw__dmri__r01__series_t <chr> NA, NA, NA, "11:05:35", "15:47:35", …
#> $ ab_g_dyn__visit_dtt <dttm> 2022-05-19 13:01:00, 2019-02-17 09:2…
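The transf_factor() function automatically detects categorical columns and converts them to ordered or unordered factors based on the specification in the data dictionary and levels table for a given study. In the output above, session_id and ab_g_dyn__visit_type became unordered factors (<fct>), while ab_g_dyn__cohort_grade became an ordered factor (<ord>).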
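To make the difference between unordered and ordered factors concrete, the following base R sketch builds both kinds by hand. The values and level sets are assumptions chosen for illustration; transf_factor() takes them from the data dictionary and levels table instead.

```r
# Illustrative values only; not read from the package's levels table.
visit_type <- c("3", "1", "1", "2")
cohort_grade <- c("9", "6", NA, "10")

# Unordered factor: the levels are categories without a ranking.
visit_type_fct <- factor(visit_type, levels = c("1", "2", "3"))

# Ordered factor: the order of `levels` defines the ranking, so "10"
# ranks above "9" even though it would not when compared as a string.
cohort_grade_ord <- factor(
  cohort_grade,
  levels = as.character(0:12),
  ordered = TRUE
)

is.ordered(visit_type_fct)
is.ordered(cohort_grade_ord)

# Comparisons are only meaningful for the ordered factor.
cohort_grade_ord[4] > cohort_grade_ord[1]
```

Comparison operators such as > are defined for ordered factors only, which is why the distinction matters for variables like school grade.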
Apply variable and value labels
Next, we can add variable and value labels to the dataset.
data_transf <- data_transf |>
  transf_label(study = "abcd")
To inspect the variable labels, we can use the sjlabelled::get_label() function:
sjlabelled::get_label(data_transf)
#> participant_id
#> "Participant identifier"
#> session_id
#> "Event identifier"
#> ab_g_dyn__visit_type
#> "Visit information: Type of visit (In-person, remote, hybrid)"
#> ab_g_dyn__cohort_grade
#> "Cohort description: Current school grade [Cross-listed: ab_p_demo__ed__yth_001]"
#> ab_g_dyn__visit__day1_dt
#> "Visit information (day 1): Visit date"
#> ab_g_stc__gen_pc__01
#> "Genetics: First principal component of genetic ancestry"
#> ab_g_dyn__visit_age
#> "Visit information: Youth's age at the start of the event"
#> ab_g_dyn__visit_days
#> "Visit information: Number of visit days"
#> mr_y_qc__raw__dmri__r01__series_t
#> "Diffusion MRI (run 1): Series time"
#> ab_g_dyn__visit_dtt
#> "Visit information: Date and time at the start of the visit"
To inspect the value labels, we can use the sjlabelled::get_labels() function:
sjlabelled::get_labels(data_transf, attr.only = TRUE, values = "n")
#> $participant_id
#> NULL
#>
#> $session_id
#> ses-00S ses-01A ses-02A ses-03A ses-04A ses-05A
#> "Screener" "1 Year" "2 Year" "3 Year" "4 Year" "5 Year"
#>
#> $ab_g_dyn__visit_type
#> 1 2 3
#> "On-site" "Remote" "Hybrid"
#>
#> $ab_g_dyn__cohort_grade
#> 0 1 10
#> "Kindergarten" "1st grade" "10th grade"
#> 11 12 13
#> "11th grade" "12th grade" "College"
#> 14 2 3
#> "Not enrolled in school" "2nd grade" "3rd grade"
#> 4 5 6
#> "4th grade" "5th grade" "6th grade"
#> 7 8 9
#> "7th grade" "8th grade" "9th grade"
#>
#> $ab_g_dyn__visit__day1_dt
#> NULL
#>
#> $ab_g_stc__gen_pc__01
#> NULL
#>
#> $ab_g_dyn__visit_age
#> NULL
#>
#> $ab_g_dyn__visit_days
#> NULL
#>
#> $mr_y_qc__raw__dmri__r01__series_t
#> NULL
#>
#> $ab_g_dyn__visit_dtt
#> NULL
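As a rough mental model, labels of this kind are usually stored as plain attributes on each column. The base R sketch below assumes the common haven-style convention (a "label" attribute for the variable label and a named "labels" vector for the value labels); this storage layout is an assumption for illustration, not something this vignette confirms.

```r
# Hand-built column; the values and labels mirror the output above.
visit_type <- factor(c("3", "1", "2"), levels = c("1", "2", "3"))

# Assumed convention: variable label in "label", value labels in "labels",
# where the names are the labels and the elements are the stored values.
attr(visit_type, "label") <-
  "Visit information: Type of visit (In-person, remote, hybrid)"
attr(visit_type, "labels") <- c("On-site" = "1", "Remote" = "2", "Hybrid" = "3")

# Read the variable label back.
variable_label <- attr(visit_type, "label")

# Flip the value labels so that the values become the names, similar to
# the layout of the printed output above.
stored_labels <- attr(visit_type, "labels")
value_labels <- setNames(names(stored_labels), stored_labels)
value_labels
```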
If the labeling is done incorrectly, we can simply rerun the transf_label() function to fix it. If we want to remove all labels, we can use sjlabelled::remove_all_labels():
data_labels_removed <- sjlabelled::remove_all_labels(data_transf)
sjlabelled::get_label(data_labels_removed)
#> participant_id session_id
#> "" ""
#> ab_g_dyn__visit_type ab_g_dyn__cohort_grade
#> "" ""
#> ab_g_dyn__visit__day1_dt ab_g_stc__gen_pc__01
#> "" ""
#> ab_g_dyn__visit_age ab_g_dyn__visit_days
#> "" ""
#> mr_y_qc__raw__dmri__r01__series_t ab_g_dyn__visit_dtt
#> "" ""
sjlabelled::get_labels(data_labels_removed, attr.only = TRUE, values = "n")
#> $participant_id
#> NULL
#>
#> $session_id
#> NULL
#>
#> $ab_g_dyn__visit_type
#> NULL
#>
#> $ab_g_dyn__cohort_grade
#> NULL
#>
#> $ab_g_dyn__visit__day1_dt
#> NULL
#>
#> $ab_g_stc__gen_pc__01
#> NULL
#>
#> $ab_g_dyn__visit_age
#> NULL
#>
#> $ab_g_dyn__visit_days
#> NULL
#>
#> $mr_y_qc__raw__dmri__r01__series_t
#> NULL
#>
#> $ab_g_dyn__visit_dtt
#> NULL
Convert time columns to hms format
Time columns in the dataset (e.g., mr_y_qc__raw__dmri__r01__series_t) are formatted as character strings "HH:MM:SS" by default. If we want to convert these columns into hms format, we can use the transf_time_to_hms() function:
data_transf <- data_transf |>
  transf_time_to_hms(study = "abcd")
dplyr::glimpse(data_transf)
#> Rows: 10
#> Columns: 10
#> $ participant_id <chr> "sub-0000000006", "sub-0000000007", …
#> $ session_id <fct> ses-04A, ses-05A, ses-02A, ses-01A, …
#> $ ab_g_dyn__visit_type <fct> 3, 1, 1, 1, 3, 3, 2, 3, 2, 1
#> $ ab_g_dyn__cohort_grade <ord> 9, 6, NA, 7, 8, NA, 8, 8, 8, NA
#> $ ab_g_dyn__visit__day1_dt <date> 2020-10-19, 2022-05-19, 2022-09-12, …
#> $ ab_g_stc__gen_pc__01 <dbl> -0.022995395, 0.006506271, 0.0039703…
#> $ ab_g_dyn__visit_age <dbl> 11.63836, 13.18630, 10.11475, 10.11…
#> $ ab_g_dyn__visit_days <int> 1, 1, 2, 1, 1, 1, 2, 2, 1, 2
#> $ mr_y_qc__raw__dmri__r01__series_t <time> NA, NA, NA, 11:05:…
#> $ ab_g_dyn__visit_dtt <dttm> 2022-05-19 13:01:00, 2019-02-17 09:2…
As we can see, the column is converted from type character to the time (hms) class.
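The benefit of a proper time class is that values can be compared and subtracted. The base R sketch below shows the underlying idea by parsing "HH:MM:SS" strings into seconds since midnight and wrapping them in a difftime, which is the base class that hms builds on. The helper time_to_seconds() is hypothetical and written for this illustration; it is not part of NBDCtools or hms.

```r
# Example strings taken from the simulated data shown above.
times <- c(NA, "11:05:35", "15:47:35")

# Hypothetical helper: parse "HH:MM:SS" into seconds since midnight.
# Missing or malformed entries become NA.
time_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(
    parts,
    function(p) {
      if (length(p) != 3L || anyNA(p)) {
        return(NA_real_)
      }
      sum(as.numeric(p) * c(3600, 60, 1))
    },
    numeric(1)
  )
}

seconds <- time_to_seconds(times)

# Wrap the seconds in a difftime so that arithmetic keeps its units.
times_dt <- as.difftime(seconds, units = "secs")

# Elapsed time between the two series times.
times_dt[3] - times_dt[2]
```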
Convert categorical column levels to labels
In some cases, such as for creating plots, it is useful to convert categorical values to labels. We can use the transf_value_to_label() function to do so:
data_transf |>
  transf_value_to_label()
#> # A tibble: 10 × 10
#> participant_id session_id ab_g_dyn__visit_type ab_g_dyn__cohort_grade
#> <chr> <fct> <fct> <ord>
#> 1 sub-0000000006 ses-04A Hybrid 9th grade
#> 2 sub-0000000007 ses-05A On-site 6th grade
#> 3 sub-0000000001 ses-02A On-site NA
#> 4 sub-0000000003 ses-01A On-site 7th grade
#> 5 sub-0000000010 ses-04A Hybrid 8th grade
#> 6 sub-0000000008 ses-03A Hybrid NA
#> 7 sub-0000000004 ses-04A Remote 8th grade
#> 8 sub-0000000009 ses-00S Hybrid 8th grade
#> 9 sub-0000000005 ses-01A Remote 8th grade
#> 10 sub-0000000002 ses-03A On-site NA
#> # ℹ 6 more variables: ab_g_dyn__visit__day1_dt <date>,
#> # ab_g_stc__gen_pc__01 <dbl>, ab_g_dyn__visit_age <dbl>,
#> # ab_g_dyn__visit_days <int>, mr_y_qc__raw__dmri__r01__series_t <time>,
#> # ab_g_dyn__visit_dtt <dttm>
NOTE: Before running this function, make sure that the data has been transformed with the transf_factor() and transf_label() functions, so that the variable and value labels are available.
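Conceptually, the conversion replaces each factor level with its value label while keeping the level order. A minimal base R sketch of that idea follows; the vector and the lookup table are hand-written assumptions that mirror the labels shown earlier, and the package's actual implementation may differ.

```r
# Hand-built factor and lookup table; illustrative only.
visit_type <- factor(c("3", "1", "2", "3"), levels = c("1", "2", "3"))
value_labels <- c("1" = "On-site", "2" = "Remote", "3" = "Hybrid")

# Rename the levels by looking each one up in the label table. The
# underlying integer codes, and therefore the level order, stay the same.
visit_type_labelled <- visit_type
levels(visit_type_labelled) <- unname(value_labels[levels(visit_type)])

as.character(visit_type_labelled)
levels(visit_type_labelled)
```

Because only the level names change, an ordered factor stays ordered after this kind of recoding, which matches the <ord> column type in the output above.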
Convert missing codes to NA
In ABCD and HBCD Study datasets, some of the categorical columns use specific codes to denote missingness/non-responses (e.g., "999" for “Don’t know” or "777" for “Decline to answer”). If we want to remove these values before analysis, we can use the transf_value_to_na() function to convert these codes to NA.
NOTE: By default, this function converts all standard categorical missingness codes ("222" through "999") to NA. In the ABCD Study, these codes are used consistently throughout the whole dataset; in the HBCD Study, however, columns may use different codes for non-responses or missing values. Please refer to the data dictionary and levels table for the specific study to see which codes to convert.
data_transf |>
  transf_value_to_na()
#> # A tibble: 10 × 10
#> participant_id session_id ab_g_dyn__visit_type ab_g_dyn__cohort_grade
#> <chr> <fct> <fct> <ord>
#> 1 sub-0000000006 ses-04A 3 9
#> 2 sub-0000000007 ses-05A 1 6
#> 3 sub-0000000001 ses-02A 1 NA
#> 4 sub-0000000003 ses-01A 1 7
#> 5 sub-0000000010 ses-04A 3 8
#> 6 sub-0000000008 ses-03A 3 NA
#> 7 sub-0000000004 ses-04A 2 8
#> 8 sub-0000000009 ses-00S 3 8
#> 9 sub-0000000005 ses-01A 2 8
#> 10 sub-0000000002 ses-03A 1 NA
#> # ℹ 6 more variables: ab_g_dyn__visit__day1_dt <date>,
#> # ab_g_stc__gen_pc__01 <dbl>, ab_g_dyn__visit_age <dbl>,
#> # ab_g_dyn__visit_days <int>, mr_y_qc__raw__dmri__r01__series_t <time>,
#> # ab_g_dyn__visit_dtt <dttm>
The simulated dataset does not contain categorical missingness codes, so the output above is unchanged. To convert a different set of codes, the function provides the missing_codes parameter. For example, if we want to convert the categorical values "1" and "2" to NA, we can use:
data_transf |>
  transf_value_to_na(missing_codes = c("1", "2"))
#> # A tibble: 10 × 10
#> participant_id session_id ab_g_dyn__visit_type ab_g_dyn__cohort_grade
#> <chr> <fct> <fct> <ord>
#> 1 sub-0000000006 ses-04A 3 9
#> 2 sub-0000000007 ses-05A NA 6
#> 3 sub-0000000001 ses-02A NA NA
#> 4 sub-0000000003 ses-01A NA 7
#> 5 sub-0000000010 ses-04A 3 8
#> 6 sub-0000000008 ses-03A 3 NA
#> 7 sub-0000000004 ses-04A NA 8
#> 8 sub-0000000009 ses-00S 3 8
#> 9 sub-0000000005 ses-01A NA 8
#> 10 sub-0000000002 ses-03A NA NA
#> # ℹ 6 more variables: ab_g_dyn__visit__day1_dt <date>,
#> # ab_g_stc__gen_pc__01 <dbl>, ab_g_dyn__visit_age <dbl>,
#> # ab_g_dyn__visit_days <int>, mr_y_qc__raw__dmri__r01__series_t <time>,
#> # ab_g_dyn__visit_dtt <dttm>
In the ab_g_dyn__visit_type column, we can see that the values "1" and "2" have been converted to NA.
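For a single factor column, the effect of this conversion can be sketched in base R as follows. The example vector is made up, with "777" and "999" standing in for the missingness codes mentioned above. Dropping the unused levels afterwards is a choice made in this sketch; whether the package does the same is not stated here.

```r
# Made-up responses that include two missingness codes.
responses <- factor(
  c("1", "777", "2", "999", "1"),
  levels = c("1", "2", "777", "999")
)
missing_codes <- c("777", "999")

# Set every observation that carries a missingness code to NA.
responses_clean <- responses
responses_clean[responses_clean %in% missing_codes] <- NA

# Sketch-specific choice: remove the now-unused levels as well.
responses_clean <- droplevels(responses_clean)

responses_clean
levels(responses_clean)
```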
The function has another parameter, ignore_col_pattern, that can be used to ignore specific columns, so that they are exempt from the conversion. This parameter accepts a regular expression pattern, meaning all columns that match the pattern will be ignored. For example, we can ignore all columns that start with ab_g_dyn__visit by using:
data_transf |>
  transf_value_to_na(ignore_col_pattern = "^ab_g_dyn__visit")
#> # A tibble: 10 × 10
#> participant_id session_id ab_g_dyn__visit_type ab_g_dyn__cohort_grade
#> <chr> <fct> <fct> <ord>
#> 1 sub-0000000006 ses-04A 3 9
#> 2 sub-0000000007 ses-05A 1 6
#> 3 sub-0000000001 ses-02A 1 NA
#> 4 sub-0000000003 ses-01A 1 7
#> 5 sub-0000000010 ses-04A 3 8
#> 6 sub-0000000008 ses-03A 3 NA
#> 7 sub-0000000004 ses-04A 2 8
#> 8 sub-0000000009 ses-00S 3 8
#> 9 sub-0000000005 ses-01A 2 8
#> 10 sub-0000000002 ses-03A 1 NA
#> # ℹ 6 more variables: ab_g_dyn__visit__day1_dt <date>,
#> # ab_g_stc__gen_pc__01 <dbl>, ab_g_dyn__visit_age <dbl>,
#> # ab_g_dyn__visit_days <int>, mr_y_qc__raw__dmri__r01__series_t <time>,
#> # ab_g_dyn__visit_dtt <dttm>
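To check which columns a pattern would exempt before running the conversion, the pattern can be tested against the column names directly. The sketch below assumes that the pattern is matched against column names with standard regular expression semantics, as the description above suggests; the column names are copied from the simulated dataset.

```r
# Column names of the simulated dataset used in this vignette.
cols <- c(
  "participant_id", "session_id", "ab_g_dyn__visit_type",
  "ab_g_dyn__cohort_grade", "ab_g_dyn__visit__day1_dt",
  "ab_g_stc__gen_pc__01", "ab_g_dyn__visit_age", "ab_g_dyn__visit_days",
  "mr_y_qc__raw__dmri__r01__series_t", "ab_g_dyn__visit_dtt"
)

# "^" anchors the match at the start of the column name.
ignored <- grep("^ab_g_dyn__visit", cols, value = TRUE)

# All remaining columns would still be eligible for conversion.
processed <- setdiff(cols, ignored)

ignored
processed
```

Note that ab_g_dyn__cohort_grade is not matched, because the pattern requires the name to continue with visit after the ab_g_dyn__ prefix.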