In this vignette, we demonstrate how to work with shadow matrices for
NBDC studies. A shadow matrix stores the reason why each value is
missing in the main data (see here for more details). Currently, the
HBCD Study releases shadow matrices, while the ABCD Study does not (see
below for how to use the R package naniar to create a basic shadow
matrix for ABCD data).
Prepare data
NOTE: Before running the code in this vignette, we assume that you have already read the Get started page or the Join data vignette, and know how to load the data/shadow from a given NBDC study release.
For demonstration purposes, we load a simulated HBCD dataset and shadow matrix that are included with the package.
library(NBDCtools)
#> Welcome to the `NBDCtools` package! For more information, visit: https://software.nbdc-datahub.org/NBDCtools/
#> This package is developed by the ABCD Data Analysis, Informatics & Resource Center (DAIRC) at the J. Craig Venter Institute (JCVI)
data <- readRDS(
  system.file("extdata", "simulated_data_hbcd.rds", package = "NBDCtools")
)
dplyr::glimpse(data)
#> Rows: 20
#> Columns: 10
#> $ participant_id <chr> "sub-0000000003…
#> $ session_id <chr> "ses-V01", "ses…
#> $ run_id <chr> NA, NA, NA, NA,…
#> $ bio_bm_biosample_nails_results_bio_test_ordered_n <chr> "1", "1", "2", …
#> $ mh_cg_ibqr_surg_001 <chr> NA, NA, NA, NA,…
#> $ mh_cg_ibqr_date_taken <dttm> NA, NA, NA, NA…
#> $ `bio_bm_biosample_nails_results_bio_c_delta-9-thc_n` <dbl> NA, NA, NA, NA,…
#> $ bio_bm_biosample_nails_results_bio_nail_weight <int> 76, 100, 49, NA…
#> $ bio_bm_biosample_nails_results_gestational_age <dbl> 26, 22, 37, NA,…
#> $ `img_brainswipes_xcpd-T2w_AnatOnAtlasBrainSwipes_nrev` <int> NA, NA, NA, NA,…
shadow <- readRDS(
  system.file("extdata", "simulated_data_hbcd_shadow.rds", package = "NBDCtools")
)
dplyr::glimpse(shadow)
#> Rows: 20
#> Columns: 10
#> $ participant_id <chr> "sub-0000000003…
#> $ session_id <chr> "ses-V01", "ses…
#> $ run_id <chr> NA, NA, NA, NA,…
#> $ bio_bm_biosample_nails_results_bio_test_ordered_n <chr> NA, NA, NA, NA,…
#> $ mh_cg_ibqr_surg_001 <chr> "Reason 1", "Re…
#> $ mh_cg_ibqr_date_taken <chr> "Reason 2", "Re…
#> $ `bio_bm_biosample_nails_results_bio_c_delta-9-thc_n` <chr> "Reason 1", "Re…
#> $ bio_bm_biosample_nails_results_bio_nail_weight <chr> NA, NA, NA, "Re…
#> $ bio_bm_biosample_nails_results_gestational_age <chr> NA, NA, NA, "Re…
#> $ `img_brainswipes_xcpd-T2w_AnatOnAtlasBrainSwipes_nrev` <chr> "Reason 3", "Re…
IMPORTANT: Before running any shadow-related functions, make sure that the main data and the shadow matrix match: they must have the same number of rows and columns, every column in the main data must also be present in the shadow matrix, and every identifier combination (participant_id, session_id, run_id) must exist in both data frames.
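The three conditions above can be checked before calling any shadow-related function. The following is a minimal sketch in base R; check_shadow_alignment() is a hypothetical helper written for this illustration (it is not part of NBDCtools), and the toy data frames are made-up stand-ins for real data.

```r
# Toy stand-ins for a main data frame and its shadow matrix (made-up values).
id_cols <- c("participant_id", "session_id", "run_id")
toy_data <- data.frame(
  participant_id = c("sub-001", "sub-002"),
  session_id = c("ses-V01", "ses-V01"),
  run_id = c(NA_character_, NA_character_),
  var1 = c(1.5, NA),
  stringsAsFactors = FALSE
)
toy_shadow <- data.frame(
  participant_id = c("sub-001", "sub-002"),
  session_id = c("ses-V01", "ses-V01"),
  run_id = c(NA_character_, NA_character_),
  var1 = c(NA, "Reason 1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not an NBDCtools function): returns TRUE only if
# dimensions, column names, and identifier combinations all agree.
check_shadow_alignment <- function(data, shadow, id_cols) {
  if (!identical(dim(data), dim(shadow))) return(FALSE)
  if (!setequal(names(data), names(shadow))) return(FALSE)
  # Collapse the identifier columns into one key per row for comparison.
  key <- function(df) do.call(paste, c(df[id_cols], sep = "|"))
  setequal(key(data), key(shadow))
}

check_shadow_alignment(toy_data, toy_shadow, id_cols)
```

If the check returns FALSE, it is worth inspecting which of the three conditions fails before proceeding.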
The join_tabulated() function has a parameter remove_empty_rows that,
by default, removes rows that contain only missing values. This can
leave the main data and the shadow matrix with different dimensions,
which causes problems in the subsequent shadow-related functions. If you
know that downstream processing involves shadow matrices, make sure to
set remove_empty_rows = FALSE when calling join_tabulated().
This is not a problem if you use the create_dataset() function, as it
automatically sets remove_empty_rows = FALSE when creating a dataset
with a shadow matrix.
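To see why dropping empty rows causes a mismatch, consider the following base R sketch. It does not call join_tabulated(); it only mimics the idea of removing rows whose non-identifier columns are all missing, using made-up toy data. A row that is entirely missing in the main data typically still carries reasons in the shadow matrix, so it would not be dropped there.

```r
id_cols <- c("participant_id", "session_id", "run_id")

# Toy main data: the second row has no observed values at all.
toy_data <- data.frame(
  participant_id = c("sub-001", "sub-002", "sub-003"),
  session_id = "ses-V01",
  run_id = NA_character_,
  var1 = c(1, NA, 3),
  var2 = c("a", NA, NA),
  stringsAsFactors = FALSE
)

# Toy shadow matrix: the same second row holds a reason for every value.
toy_shadow <- data.frame(
  participant_id = c("sub-001", "sub-002", "sub-003"),
  session_id = "ses-V01",
  run_id = NA_character_,
  var1 = c(NA, "Reason 1", NA),
  var2 = c(NA, "Reason 2", "Reason 2"),
  stringsAsFactors = FALSE
)

# Drop rows of the main data that have only missing values
# outside the identifier columns.
value_cols <- setdiff(names(toy_data), id_cols)
is_empty_row <- rowSums(!is.na(toy_data[value_cols])) == 0
data_trimmed <- toy_data[!is_empty_row, ]

dim(data_trimmed)  # one row fewer than the shadow matrix
dim(toy_shadow)
identical(dim(data_trimmed), dim(toy_shadow))
```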
Bind shadow matrix to main data
Using a provided shadow matrix
If you have a data frame with a shadow matrix, you can bind it to the
main data using the shadow_bind_data() function. This function will bind
the shadow matrix to the data using the identifier columns
(participant_id, session_id, and run_id by default). The appended shadow
columns will have the same column names as the corresponding columns in
the main data, but with a suffix (_shadow by default).
- If there are rows with identifier combinations in the main data that
  don’t exist in the shadow matrix, the appended shadow columns will be
  filled with NA values for these rows.
- If there are rows with identifier combinations in the shadow matrix
  that don’t exist in the main data, these rows will be dropped from the
  resulting data frame.
shadow_bind_data(
  data = data,
  shadow = shadow
) |>
  dplyr::glimpse()
#> Rows: 20
#> Columns: 17
#> $ participant_id <chr> "sub-000…
#> $ session_id <chr> "ses-V01…
#> $ run_id <chr> NA, NA, …
#> $ bio_bm_biosample_nails_results_bio_test_ordered_n <chr> "1", "1"…
#> $ mh_cg_ibqr_surg_001 <chr> NA, NA, …
#> $ mh_cg_ibqr_date_taken <dttm> NA, NA,…
#> $ `bio_bm_biosample_nails_results_bio_c_delta-9-thc_n` <dbl> NA, NA, …
#> $ bio_bm_biosample_nails_results_bio_nail_weight <int> 76, 100,…
#> $ bio_bm_biosample_nails_results_gestational_age <dbl> 26, 22, …
#> $ `img_brainswipes_xcpd-T2w_AnatOnAtlasBrainSwipes_nrev` <int> NA, NA, …
#> $ bio_bm_biosample_nails_results_bio_test_ordered_n_shadow <chr> NA, NA, …
#> $ mh_cg_ibqr_surg_001_shadow <chr> "Reason …
#> $ mh_cg_ibqr_date_taken_shadow <chr> "Reason …
#> $ `bio_bm_biosample_nails_results_bio_c_delta-9-thc_n_shadow` <chr> "Reason …
#> $ bio_bm_biosample_nails_results_bio_nail_weight_shadow <chr> NA, NA, …
#> $ bio_bm_biosample_nails_results_gestational_age_shadow <chr> NA, NA, …
#> $ `img_brainswipes_xcpd-T2w_AnatOnAtlasBrainSwipes_nrev_shadow` <chr> "Reason …
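The handling of unmatched rows described above corresponds to a left join of the shadow matrix onto the main data. The following base R sketch illustrates these semantics on made-up toy data; it is a conceptual illustration only and makes no claim about how shadow_bind_data() is implemented internally.

```r
id_cols <- c("participant_id", "session_id", "run_id")

# Toy main data and shadow matrix with partially overlapping identifiers.
toy_data <- data.frame(
  participant_id = c("sub-001", "sub-002"),
  session_id = c("ses-V01", "ses-V01"),
  run_id = c("run-01", "run-01"),
  var1 = c(NA, 2),
  stringsAsFactors = FALSE
)
toy_shadow <- data.frame(
  participant_id = c("sub-001", "sub-003"),
  session_id = c("ses-V01", "ses-V01"),
  run_id = c("run-01", "run-01"),
  var1 = c("Reason 1", "Reason 2"),
  stringsAsFactors = FALSE
)

# Add the suffix to the non-identifier shadow columns.
value_cols <- setdiff(names(toy_shadow), id_cols)
names(toy_shadow)[match(value_cols, names(toy_shadow))] <-
  paste0(value_cols, "_shadow")

# Left join: keep every row of the main data, drop shadow-only rows.
bound <- merge(toy_data, toy_shadow, by = id_cols, all.x = TRUE)
bound
```

In the result, sub-002 (present only in the main data) gets NA in var1_shadow, and sub-003 (present only in the shadow matrix) does not appear.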
Using a naniar shadow matrix
If no shadow matrix is provided, as is the case for ABCD data, or you do
not wish to use the provided shadow matrix, you can create one using the
naniar R package. To do so, set naniar_shadow = TRUE.
IMPORTANT: naniar is not a dependency of the NBDCtools package, so you need to install it separately. You can do so using the following command:
if (!requireNamespace("naniar", quietly = TRUE)) {
  install.packages("naniar")
}
To create a shadow matrix using naniar, you can use the following
command:
shadow_bind_data(
  data = data,
  shadow = NULL, # no shadow matrix provided
  naniar_shadow = TRUE
) |>
  dplyr::glimpse()
#> Rows: 20
#> Columns: 17
#> $ participant_id <chr> "sub-0000000…
#> $ session_id <chr> "ses-V01", "…
#> $ run_id <chr> NA, NA, NA, …
#> $ bio_bm_biosample_nails_results_bio_test_ordered_n <chr> "1", "1", "2…
#> $ mh_cg_ibqr_surg_001 <chr> NA, NA, NA, …
#> $ mh_cg_ibqr_date_taken <dttm> NA, NA, NA,…
#> $ `bio_bm_biosample_nails_results_bio_c_delta-9-thc_n` <dbl> NA, NA, NA, …
#> $ bio_bm_biosample_nails_results_bio_nail_weight <int> 76, 100, 49,…
#> $ bio_bm_biosample_nails_results_gestational_age <dbl> 26, 22, 37, …
#> $ `img_brainswipes_xcpd-T2w_AnatOnAtlasBrainSwipes_nrev` <int> NA, NA, NA, …
#> $ bio_bm_biosample_nails_results_bio_test_ordered_n_NA <fct> !NA, !NA, !N…
#> $ mh_cg_ibqr_surg_001_NA <fct> NA, NA, NA, …
#> $ mh_cg_ibqr_date_taken_NA <fct> NA, NA, NA, …
#> $ `bio_bm_biosample_nails_results_bio_c_delta-9-thc_n_NA` <fct> NA, NA, NA, …
#> $ bio_bm_biosample_nails_results_bio_nail_weight_NA <fct> !NA, !NA, !N…
#> $ bio_bm_biosample_nails_results_gestational_age_NA <fct> !NA, !NA, !N…
#> $ `img_brainswipes_xcpd-T2w_AnatOnAtlasBrainSwipes_nrev_NA` <fct> NA, NA, NA, …
The downside of this approach is that the shadow matrix will not
contain reasons for missing values, but will only indicate whether a
value is missing. The benefit of using naniar_shadow = TRUE is that you
can use functionality from the naniar R package to explore and visualize
missingness in the data (see here for more details).
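As a small illustration of the naniar functionality mentioned above, the sketch below builds an indicator shadow with naniar::bind_shadow() and summarizes missingness per variable with naniar::miss_var_summary(). The toy data frame is made up for this example, and the code only runs if naniar is installed.

```r
# Made-up toy data with a few missing values.
toy <- data.frame(
  var1 = c(1, NA, 3),
  var2 = c(NA, NA, "a"),
  stringsAsFactors = FALSE
)

if (requireNamespace("naniar", quietly = TRUE)) {
  # Append one indicator column per variable, suffixed with `_NA`.
  toy_bound <- naniar::bind_shadow(toy)
  print(names(toy_bound))

  # Number and percentage of missing values per variable.
  print(naniar::miss_var_summary(toy))
}
```

The indicator columns are factors with the levels !NA and NA, matching the columns shown in the glimpse() output above.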
Fix missing values in the shadow matrix
Joining shadow matrices for different tables using join_tabulated() can
result in missing values in the shadow matrix for certain observations
(e.g., because one instrument was administered at more time points than
another, or because a participant did not complete a certain
instrument). These missing values in the shadow matrix could be
mistakenly interpreted as indicating that the main data has values for
these cells. In fact, these cells are missing in both the main data and
the shadow matrix.
Here is a simplified example of this scenario:
my_table1 <- dplyr::tibble(
  participant_id = c("sub-001", "sub-002"),
  session_id = c("ses-001", "ses-002"),
  run_id = c("run-001", "run-002"),
  var1 = c("reason1", NA),
  var2 = c(NA, "reason2")
)
my_table1
#> # A tibble: 2 × 5
#> participant_id session_id run_id var1 var2
#> <chr> <chr> <chr> <chr> <chr>
#> 1 sub-001 ses-001 run-001 reason1 NA
#> 2 sub-002 ses-002 run-002 NA reason2
my_table2 <- dplyr::tibble(
  participant_id = c("sub-001", "sub-003"),
  session_id = c("ses-001", "ses-003"),
  run_id = c("run-001", "run-003"),
  var3 = c(NA, "reason3"),
  var4 = c("reason4", NA)
)
my_table2
#> # A tibble: 2 × 5
#> participant_id session_id run_id var3 var4
#> <chr> <chr> <chr> <chr> <chr>
#> 1 sub-001 ses-001 run-001 NA reason4
#> 2 sub-003 ses-003 run-003 reason3 NA
When binding these two tables together, we will get the following result:
id_table <- dplyr::full_join(
  dplyr::select(my_table1, participant_id, session_id, run_id),
  dplyr::select(my_table2, participant_id, session_id, run_id)
)
#> Joining with `by = join_by(participant_id, session_id, run_id)`
dplyr::left_join(id_table, my_table1) |>
  dplyr::left_join(my_table2)
#> Joining with `by = join_by(participant_id, session_id, run_id)`
#> Joining with `by = join_by(participant_id, session_id, run_id)`
#> # A tibble: 3 × 7
#> participant_id session_id run_id var1 var2 var3 var4
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 sub-001 ses-001 run-001 reason1 NA NA reason4
#> 2 sub-002 ses-002 run-002 NA reason2 NA NA
#> 3 sub-003 ses-003 run-003 NA NA reason3 NA
We can see that there are missing values (NA) in the var3 and var4
columns for the sub-002/ses-002/run-002 row and in the var1 and var2
columns for the sub-003/ses-003/run-003 row. If this table were a shadow
matrix, these NA values would suggest that the main data has values in
these cells, but in reality they are only a side effect of joining the
two tables.
To fix this issue, you can use the shadow_replace_binding_missing()
function, which will replace the NA values in the shadow matrix with a
custom value (default "Missing due to joining").
The simulated shadow matrix does not have cases where the shadow matrix
has NA values due to joining. For demonstration, we will manually
convert a few values to NA:
shadow$mh_cg_ibqr_surg_001
#> [1] "Reason 1" "Reason 1" "Reason 1" "Reason 1" "Reason 1" "Reason 1"
#> [7] "Reason 1" "Reason 1" "Reason 1" "Reason 1" "Reason 1" "Reason 1"
#> [13] "Reason 1" NA "Reason 1" "Reason 1" "Reason 1" "Reason 1"
#> [19] "Reason 1" "Reason 1"
# set them to NA
shadow$mh_cg_ibqr_surg_001[1:5] <- NA_character_
shadow$mh_cg_ibqr_surg_001
#> [1] NA NA NA NA NA "Reason 1"
#> [7] "Reason 1" "Reason 1" "Reason 1" "Reason 1" "Reason 1" "Reason 1"
#> [13] "Reason 1" NA "Reason 1" "Reason 1" "Reason 1" "Reason 1"
#> [19] "Reason 1" "Reason 1"
Now we can use the shadow_replace_binding_missing() function to replace
the NA values in the shadow matrix with an indicator:
shadow_replace_binding_missing(
  data = data,
  shadow = shadow
)
#> # A tibble: 20 × 10
#> participant_id session_id run_id bio_bm_biosample_nails…¹ mh_cg_ibqr_surg_001
#> <chr> <chr> <chr> <chr> <chr>
#> 1 sub-0000000003 ses-V01 NA NA Missing due to joi…
#> 2 sub-0000000004 ses-V01 NA NA Missing due to joi…
#> 3 sub-0000000003 ses-V02 NA NA Missing due to joi…
#> 4 sub-0000000005 ses-V01 NA NA Missing due to joi…
#> 5 sub-0000000002 ses-V02 1 NA Missing due to joi…
#> 6 sub-0000000009 ses-V01 NA Reason 3 Reason 1
#> 7 sub-0000000020 ses-V02 NA NA Reason 1
#> 8 sub-0000000013 ses-V01 NA NA Reason 1
#> 9 sub-0000000016 ses-V02 NA NA Reason 1
#> 10 sub-0000000014 ses-V01 NA NA Reason 1
#> 11 sub-0000000015 ses-V02 1 NA Reason 1
#> 12 sub-0000000011 ses-V02 NA NA Reason 1
#> 13 sub-0000000017 ses-V02 1 NA Reason 1
#> 14 sub-0000000007 ses-V01 NA NA NA
#> 15 sub-0000000006 ses-V01 NA NA Reason 1
#> 16 sub-0000000018 ses-V01 NA NA Reason 1
#> 17 sub-0000000019 ses-V02 2 NA Reason 1
#> 18 sub-0000000008 ses-V01 NA NA Reason 1
#> 19 sub-0000000010 ses-V01 NA NA Reason 1
#> 20 sub-0000000012 ses-V03 NA NA Reason 1
#> # ℹ abbreviated name: ¹bio_bm_biosample_nails_results_bio_test_ordered_n
#> # ℹ 5 more variables: mh_cg_ibqr_date_taken <chr>,
#> # `bio_bm_biosample_nails_results_bio_c_delta-9-thc_n` <chr>,
#> # bio_bm_biosample_nails_results_bio_nail_weight <chr>,
#> # bio_bm_biosample_nails_results_gestational_age <chr>,
#> # `img_brainswipes_xcpd-T2w_AnatOnAtlasBrainSwipes_nrev` <chr>
As you can see, the first five values of the mh_cg_ibqr_surg_001 column
are now "Missing due to joining". Row 14, which was already NA in the
original shadow matrix, is left unchanged.
IMPORTANT: Unlike shadow_bind_data(), which allows unmatched rows between the main data and the shadow matrix, this function requires the dimensions of the data and the shadow matrix to match, i.e., to have the same number of rows and columns. If they do not match, the function will throw an error.
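The output above is consistent with a rule that replaces an NA in the shadow matrix only where the main data is also missing in the same cell. The base R sketch below implements that rule on made-up toy data. Note that replace_where_both_missing() is a hypothetical helper written for this illustration, that the rule itself is an assumption inferred from the output, and that the sketch additionally assumes the rows of both data frames are in the same order. It is not the implementation of shadow_replace_binding_missing().

```r
id_cols <- c("participant_id", "session_id", "run_id")

# Made-up toy data: row 1 is missing in both data frames,
# row 2 has a value in the main data, row 3 has a recorded reason.
toy_data <- data.frame(
  participant_id = c("sub-001", "sub-002", "sub-003"),
  session_id = "ses-V01",
  run_id = NA_character_,
  var1 = c(NA, 2, NA),
  stringsAsFactors = FALSE
)
toy_shadow <- data.frame(
  participant_id = c("sub-001", "sub-002", "sub-003"),
  session_id = "ses-V01",
  run_id = NA_character_,
  var1 = c(NA, NA, "Reason 1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not an NBDCtools function).
replace_where_both_missing <- function(data, shadow, id_cols,
                                       value = "Missing due to joining") {
  # Mirror the requirement that dimensions must match.
  stopifnot(identical(dim(data), dim(shadow)))
  for (col in setdiff(names(data), id_cols)) {
    # Assumed rule: only cells missing in BOTH data frames are replaced.
    both_na <- is.na(data[[col]]) & is.na(shadow[[col]])
    shadow[[col]][both_na] <- value
  }
  shadow
}

fixed_shadow <- replace_where_both_missing(toy_data, toy_shadow, id_cols)
fixed_shadow$var1
```

Under this rule, the NA for sub-002 is preserved because the main data holds a value there, which matches the behavior seen for row 14 in the output above.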