Add initial vignettes showing (1) coastwide index and (2) spatial ind…

…ex using `split_` variable Also add all_yellowtail_indices.png to inst for vignette 2
pfmc-assessments · Jan 16, 2025 · 3537b8e · 3537b8e
1 parent b1f1b89
commit 3537b8e
Show file tree

Hide file tree

Showing 3 changed files with 316 additions and 0 deletions.
diff --git a/inst/all_yellowtail_indices.png b/inst/all_yellowtail_indices.png
diff --git a/vignettes/a1_single_index.Rmd b/vignettes/a1_single_index.Rmd
@@ -0,0 +1,115 @@
+---
+title: "Coastwide indices"
+date: "`r Sys.Date()`"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Coastwide indices}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r setup, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  warning = FALSE,
+  message = FALSE,
+  comment = "#>"
+)
+```
+
+## Overview
+
+This vignette highlights the ability to generate coastwide indices for stocks using the `indexwc` package.
+
+## Yellowtail rockfish example
+
+As a case study, we'll focus on yellowtail rockfish
+
+```{r}
+library(dplyr)
+library(indexwc)
+library(purrr)
+library(sdmTMB)
+library(tibble)
+```
+
+First we'll load in the configuration file. This is bringing in the data from Github -- but for users who download the source code, this is also available in the `data-raw` folder. 
+```{r}
+url <- "https://raw.githubusercontent.com/pfmc-assessments/indexwc/refs/heads/main/data-raw/configuration.csv"
+configuration <- tibble::as_tibble(read.csv(url))
+
+configuration <- configuration |>
+  dplyr::filter(species == "yellowtail rockfish")
+```
+
+The configuration file contains settings you may wish to change, including attributes of filters (years, latitudes, depths), and model components (anisotropy, knots, spatiotemporal effects, etc).
+
+
+### Statistical model
+
+For most indices developed on the west coast of the USA, the main formula is usually something similar to the following, where `fyear` represents a fixed year effect, and `pass_scaled` is factor describing whether the haul is part of pass 1 (early season) or pass 2 (late season).
+
+```{r}
+formula <- "catch_weight ~ 0 + fyear + pass_scaled"
+```
+
+### Preparing the data
+
+This block pulls trawl survey from the West Coast Groundfish Bottom Trawl Survey and applies filters (based on latitude, depth, and year) to each observation.
+
+```{r}
+data <- configuration |>
+  dplyr::rowwise() |>
+  dplyr::mutate(
+    data_raw = list(format_data(eval(parse(text = fxn)))),
+    data_filtered = list(data_raw |>
+      dplyr::filter(
+        depth <= min_depth, depth >= max_depth,
+        latitude >= min_latitude, latitude <= max_latitude,
+        year >= min_year, year <= max_year
+      ))
+  ) |>
+  dplyr::ungroup()
+```
+
+### Fitting the model with indexwc and sdmTMB
+
+Now we can use the `indexwc` package to fit the model. The `indexwc` package acts as a wrapper for `sdmTMB` here, combining the estimation process (`sdmTMB()`) with the index generation (`get_index()`).
+
+This code is optimized to fit multiple models (across stocks / species, or model configurations). In our case, the configurations dataframe has 1 row -- but this could include multiple rows. A key point to highlight is that in this code, we use `sdmTMBcontrol()` to pass in a list that contains the variables above we use to map off and initialize the parameters. For datasets without missing values, you may not want to include this `sdmTMBcontrol()` line, but it may be useful to adjust the `newton_loops`, etc.
+
+Note: this code will fit a single model for each row 
+
+```{r}
+best <- data |>
+  dplyr::mutate(
+    # Evaluate the call in family
+    family = purrr::map(family, .f = ~ eval(parse(text = .x))),
+    # Run the model on each row in data
+    results = purrr::pmap(
+      .l = list(
+        data = data_filtered,
+        formula = formula,
+        family = family,
+        anisotropy = anisotropy,
+        n_knots = knots,
+        share_range = share_range,
+        spatiotemporal = purrr::map2(spatiotemporal1, spatiotemporal2, list)
+      ),
+      .f = indexwc::run_sdmtmb
+    )
+  )
+```
+
+### Examining the index and diagnostics
+
+Output and diagnostics for all models and predictions are stored in a folder that `indexwc` creates; it's location is "yellowtail_rockfish/wcgbts/delta_gamma..". The "data" folder contains the raw data used for estimation (`data.rdata`) and a plot of the SPDE mesh is in `mesh.png`. The "index" folder contains diagnostics and predictions for 10 sets of indices. These 10 indices represent (1) N/S of Cape Mendocino, (2) N/S of Monterrey, (3) N/S of Pt Conception, a coastwide index, and state-specific indices. Plots include:
+* density plots (`density01.png`, `density02.png`, ...) which represent predictions for pairs of years 
+* a coastwide index (`index_coastwide.png`)
+* fixed effect parameters (`parameters_fixed_effects.png`)
+* QQ plot (`qq.png`)
+* residual plots (`residuals_1_01.png`, `residuals_1_02.png`, ..., `residuals_2_01.png`, `residuals_2_02.png`,...) where pairs of years are shown, and separate residual plots are made for the presence-absence and positive model components
+
+Additional output includes
+* the dataframe of `sdmTMB::sanity()` checks (in `sanity_data_frame.csv`) 
+* indices generated by area (`est_by_area.csv`)
diff --git a/vignettes/a2_multiple_indices.Rmd b/vignettes/a2_multiple_indices.Rmd
@@ -0,0 +1,201 @@
+---
+title: "Index example with multiple areas"
+date: "`r Sys.Date()`"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Index example with multiple areas}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r setup, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  warning = FALSE,
+  message = FALSE,
+  comment = "#>"
+)
+```
+
+## Overview
+
+For some stocks, a goal might be to create multiple coastwide indices, where these indices are defined by latitudinal breaks. This vignette highlights the ability to fit these models using the `indexwc` package, and how to deal with sparse or missing observations
+
+## Yellowtail rockfish example
+
+As a multi-index example, we'll focus on yellowtail rockfish
+
+```{r}
+library(dplyr)
+library(indexwc)
+library(purrr)
+library(sdmTMB)
+library(tibble)
+```
+
+First we'll load in the configuration file. This is bringing in the data from Github -- but for users who download the source code, this is also available in the `data-raw` folder. 
+```{r}
+url <- "https://raw.githubusercontent.com/pfmc-assessments/indexwc/refs/heads/main/data-raw/configuration.csv"
+configuration <- tibble::as_tibble(read.csv(url))
+
+configuration <- configuration |>
+  dplyr::filter(species == "yellowtail rockfish")
+```
+
+The configuration file contains settings you may wish to change, including attributes of filters (years, latitudes, depths), and model components (anisotropy, knots, spatiotemporal effects, etc).
+
+### Statistical model
+
+For most indices developed on the west coast of the USA, the main formula is usually something similar to the following, where `fyear` represents a fixed year effect, and `pass_scaled` is factor describing whether the haul is part of pass 1 (early season) or pass 2 (late season).
+
+```{r}
+formula <- "catch_weight ~ 0 + fyear + pass_scaled"
+```
+
+There are many ways to build on this general framework to allow for multi-area indices. For yellowtail rockfish, we're going to generate 2 indices, north and south of Cape Mendocino. Our main formula will be added to the configuration data frame as 
+
+```{r}
+configuration$formula <- "catch_weight ~ 0 + fyear*split_mendocino + pass_scaled"
+```
+
+Including the interaction `fyear*split_mendocino` allows each region to have a separate trend, and because no interaction is included with `pass_scaled`, a global effect is assumed. 
+
+As a quick aside, the `indexwc` package allows for other spatial splits, and in implementing them, the choice of names must be consistent with those used in the package (and specifically in the prediction grid). Variables currently included are:
+* `split_mendocino` representing a split N/S of 40.167 degrees
+* `split_conception` representing a split N/S of 34.45 degrees
+* `split_monterey` representing a split N/S of 36 degrees
+* `split_state` representing three splits for California, Oregon, Washington
+
+Generating new variables representing new splits will allow you to fit models with `sdmTMB`, but the predictions (index) will fail -- for these cases, a better solution is to merge them into the package.
+
+### Preparing the data
+
+This block applies filters (based on latitude, depth, and year) to each observation, and creates the new variable `split_mendocino` that already exists in the prediction dataframe.
+
+```{r}
+data <- configuration |>
+  # Row by row ... do stuff then ungroup
+  dplyr::rowwise() |>
+  # Pull the data based on the function found in fxn column
+  dplyr::mutate(
+    data_raw = list(format_data(eval(parse(text = fxn)))),
+    data_filtered = list(data_raw |>
+      dplyr::filter(
+        depth <= min_depth, depth >= max_depth,
+        latitude >= min_latitude, latitude <= max_latitude,
+        year >= min_year, year <= max_year
+      ) |>
+      dplyr::mutate(split_mendocino = ifelse(latitude > 40.1666667, "N", "S")))
+  ) |>
+  dplyr::ungroup()
+```
+
+Yellowtail rockfish occur more north of Cape Mendocino, and as a check we can see if there's any years with no positive catches. This filtering shows that no positive catches exist for 2007.
+
+```{r}
+dplyr::filter(data$data_filtered[[1]]) |>
+  dplyr::group_by(split_mendocino, year) |>
+  dplyr::summarise(sum_catch_weight = sum(catch_weight)) |>
+  dplyr::filter(sum_catch_weight == 0)
+```
+
+This is a problem because if we try to estimate the `fyear:split_mendocino` interaction, the coefficient will not be identifiable for 2007. This is not a unique problem for `sdmTMB` but a general issue for any regression / GLM / GAM / GLMM model. The workaround that we'll implment is to "map off" or not estimate the `fyear:split_mendocino` in 2007 -- but estimate it in all other years.
+
+As a first step, we'll summarize the names of all main effect coefficients in our model. We use `lm()` to do this for simplicity, but other approaches could also be used.
+
+```{r}
+lm <- lm(
+  formula = as.formula(configuration$formula),
+  data = data$data_filtered[[1]]
+)
+
+coef_names <- names(coef(lm))
+
+print(coef_names)
+```
+
+All of the coefficients from the presence - absence model look to be identifiable
+
+```{r}
+pres_not_identifiable <- names(which(is.na(coef(lm))))
+print(pres_not_identifiable)
+```
+
+To identify the labels of the coefficients from the positive model that are not estimable, we can re-fit the regression, using only positive observations. 
+
+```{r}
+lm_pos <- lm(
+  formula = as.formula(configuration$formula),
+  data = dplyr::filter(data$data_filtered[[1]], catch_weight > 0)
+)
+
+pos_not_identifiable <- names(which(is.na(coef(lm_pos))))
+print(pos_not_identifiable)
+```
+
+Finally, we can create some mapping variables to be passed to `sdmTMB`. We'll do this separately for the presence-absence and positive models.
+
+```{r}
+map_pres <- coef_names
+map_pres[coef_names %in% pres_not_identifiable] <- NA
+map_pres <- factor(map_pres)
+
+map_pos <- coef_names
+map_pos[coef_names %in% pos_not_identifiable] <- NA
+map_pos <- factor(map_pos)
+```
+
+Along with these mapping vectors, we'll create vectors of starting values (arbitrarily assigning -20 to values that are not estimable)
+
+```{r}
+start_pres <- rep(0, length(coef_names))
+start_pres[coef_names %in% pres_not_identifiable] <- -20
+
+start_pos <- rep(0, length(coef_names))
+start_pos[coef_names %in% pos_not_identifiable] <- -20
+```
+
+### Fitting the model with indexwc and sdmTMB
+
+Now we can use the `indexwc` package to fit the model. The `indexwc` package acts as a wrapper for `sdmTMB` here, combining the estimation process (`sdmTMB()`) with the index generation (`get_index()`).
+
+This code is optimized to fit multiple models (across stocks / species, or model configurations). In our case, the configurations dataframe has 1 row -- but this could include multiple rows. A key point to highlight is that in this code, we use `sdmTMBcontrol()` to pass in a list that contains the variables above we use to map off and initialize the parameters. For datasets without missing values, you may not want to include this `sdmTMBcontrol()` line, but it may be useful to adjust the `newton_loops`, etc.
+
+Note: this code will fit a single model for each row 
+
+```{r}
+best <- data |>
+  dplyr::mutate(
+    # Evaluate the call in family
+    family = purrr::map(family, .f = ~ eval(parse(text = .x))),
+    # Run the model on each row in data
+    results = purrr::pmap(
+      .l = list(
+        data = data_filtered,
+        formula = formula,
+        family = family,
+        anisotropy = anisotropy,
+        n_knots = knots,
+        share_range = share_range,
+        spatiotemporal = purrr::map2(spatiotemporal1, spatiotemporal2, list),
+        sdmtmb_control = list(
+          sdmTMB::sdmTMBcontrol(
+            map = list(b_j = map_pos, b_j2 = map_pos),
+            start = list(b_j = start_pos, b_j2 = start_pos),
+            newton_loops = 1 # increase for real models
+          )
+        )
+      ),
+      .f = indexwc::run_sdmtmb
+    )
+  )
+```
+
+### Examining the index and diagnostics
+
+Output and diagnostics for all models and predictions are stored in a folder that `indexwc` creates; it's location is "yellowtail_rockfish/wcgbts/delta_gamma..". The "index" folder contains  diagnostics and predictions for 10 sets of indices. These 10 indices represent (1) N/S of Cape Mendocino, (2) N/S of Monterrey, (3) N/S of Pt Conception, a coastwide index, and state-specific indices. For multi-region models, like the yellowtail rockfish used here, it wouldn't make sense to interpret indices other than the split used in the model (here, N/S of Cape Mendocino) and the coastwide index. 
+
+```{r echo = FALSE}
+knitr::include_graphics(system.file("figures/all_yellowtail_indices.png", package = "indexwc"))
+```
+