Skip to content

Commit

Permalink
Add initial vignettes showing (1) coastwide index and (2) spatial ind…
Browse files Browse the repository at this point in the history
…ex using `split_` variable

Also add all_yellowtail_indices.png to inst for vignette 2
  • Loading branch information
ericward-noaa committed Jan 16, 2025
1 parent b1f1b89 commit 3537b8e
Show file tree
Hide file tree
Showing 3 changed files with 316 additions and 0 deletions.
Binary file added inst/all_yellowtail_indices.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
115 changes: 115 additions & 0 deletions vignettes/a1_single_index.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
---
title: "Coastwide indices"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Coastwide indices}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
warning = FALSE,
message = FALSE,
comment = "#>"
)
```

## Overview

This vignette highlights the ability to generate coastwide indices for stocks using the `indexwc` package.

## Yellowtail rockfish example

As a case study, we'll focus on yellowtail rockfish

```{r}
library(dplyr)
library(indexwc)
library(purrr)
library(sdmTMB)
library(tibble)
```

First we'll load in the configuration file. This is bringing in the data from Github -- but for users who download the source code, this is also available in the `data-raw` folder.
```{r}
url <- "https://raw.githubusercontent.com/pfmc-assessments/indexwc/refs/heads/main/data-raw/configuration.csv"
configuration <- tibble::as_tibble(read.csv(url))
configuration <- configuration |>
dplyr::filter(species == "yellowtail rockfish")
```

The configuration file contains settings you may wish to change, including attributes of filters (years, latitudes, depths), and model components (anisotropy, knots, spatiotemporal effects, etc).


### Statistical model

For most indices developed on the west coast of the USA, the main formula is usually something similar to the following, where `fyear` represents a fixed year effect, and `pass_scaled` is factor describing whether the haul is part of pass 1 (early season) or pass 2 (late season).

```{r}
formula <- "catch_weight ~ 0 + fyear + pass_scaled"
```

### Preparing the data

This block pulls trawl survey from the West Coast Groundfish Bottom Trawl Survey and applies filters (based on latitude, depth, and year) to each observation.

```{r}
data <- configuration |>
dplyr::rowwise() |>
dplyr::mutate(
data_raw = list(format_data(eval(parse(text = fxn)))),
data_filtered = list(data_raw |>
dplyr::filter(
depth <= min_depth, depth >= max_depth,
latitude >= min_latitude, latitude <= max_latitude,
year >= min_year, year <= max_year
))
) |>
dplyr::ungroup()
```

### Fitting the model with indexwc and sdmTMB

Now we can use the `indexwc` package to fit the model. The `indexwc` package acts as a wrapper for `sdmTMB` here, combining the estimation process (`sdmTMB()`) with the index generation (`get_index()`).

This code is optimized to fit multiple models (across stocks / species, or model configurations). In our case, the configurations dataframe has 1 row -- but this could include multiple rows. A key point to highlight is that in this code, we use `sdmTMBcontrol()` to pass in a list that contains the variables above we use to map off and initialize the parameters. For datasets without missing values, you may not want to include this `sdmTMBcontrol()` line, but it may be useful to adjust the `newton_loops`, etc.

Note: this code will fit a single model for each row

```{r}
best <- data |>
dplyr::mutate(
# Evaluate the call in family
family = purrr::map(family, .f = ~ eval(parse(text = .x))),
# Run the model on each row in data
results = purrr::pmap(
.l = list(
data = data_filtered,
formula = formula,
family = family,
anisotropy = anisotropy,
n_knots = knots,
share_range = share_range,
spatiotemporal = purrr::map2(spatiotemporal1, spatiotemporal2, list)
),
.f = indexwc::run_sdmtmb
)
)
```

### Examining the index and diagnostics

Output and diagnostics for all models and predictions are stored in a folder that `indexwc` creates; it's location is "yellowtail_rockfish/wcgbts/delta_gamma..". The "data" folder contains the raw data used for estimation (`data.rdata`) and a plot of the SPDE mesh is in `mesh.png`. The "index" folder contains diagnostics and predictions for 10 sets of indices. These 10 indices represent (1) N/S of Cape Mendocino, (2) N/S of Monterrey, (3) N/S of Pt Conception, a coastwide index, and state-specific indices. Plots include:
* density plots (`density01.png`, `density02.png`, ...) which represent predictions for pairs of years
* a coastwide index (`index_coastwide.png`)
* fixed effect parameters (`parameters_fixed_effects.png`)
* QQ plot (`qq.png`)
* residual plots (`residuals_1_01.png`, `residuals_1_02.png`, ..., `residuals_2_01.png`, `residuals_2_02.png`,...) where pairs of years are shown, and separate residual plots are made for the presence-absence and positive model components

Additional output includes
* the dataframe of `sdmTMB::sanity()` checks (in `sanity_data_frame.csv`)
* indices generated by area (`est_by_area.csv`)
201 changes: 201 additions & 0 deletions vignettes/a2_multiple_indices.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,201 @@
---
title: "Index example with multiple areas"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Index example with multiple areas}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
warning = FALSE,
message = FALSE,
comment = "#>"
)
```

## Overview

For some stocks, a goal might be to create multiple coastwide indices, where these indices are defined by latitudinal breaks. This vignette highlights the ability to fit these models using the `indexwc` package, and how to deal with sparse or missing observations

## Yellowtail rockfish example

As a multi-index example, we'll focus on yellowtail rockfish

```{r}
library(dplyr)
library(indexwc)
library(purrr)
library(sdmTMB)
library(tibble)
```

First we'll load in the configuration file. This is bringing in the data from Github -- but for users who download the source code, this is also available in the `data-raw` folder.
```{r}
url <- "https://raw.githubusercontent.com/pfmc-assessments/indexwc/refs/heads/main/data-raw/configuration.csv"
configuration <- tibble::as_tibble(read.csv(url))
configuration <- configuration |>
dplyr::filter(species == "yellowtail rockfish")
```

The configuration file contains settings you may wish to change, including attributes of filters (years, latitudes, depths), and model components (anisotropy, knots, spatiotemporal effects, etc).

### Statistical model

For most indices developed on the west coast of the USA, the main formula is usually something similar to the following, where `fyear` represents a fixed year effect, and `pass_scaled` is factor describing whether the haul is part of pass 1 (early season) or pass 2 (late season).

```{r}
formula <- "catch_weight ~ 0 + fyear + pass_scaled"
```

There are many ways to build on this general framework to allow for multi-area indices. For yellowtail rockfish, we're going to generate 2 indices, north and south of Cape Mendocino. Our main formula will be added to the configuration data frame as

```{r}
configuration$formula <- "catch_weight ~ 0 + fyear*split_mendocino + pass_scaled"
```

Including the interaction `fyear*split_mendocino` allows each region to have a separate trend, and because no interaction is included with `pass_scaled`, a global effect is assumed.

As a quick aside, the `indexwc` package allows for other spatial splits, and in implementing them, the choice of names must be consistent with those used in the package (and specifically in the prediction grid). Variables currently included are:
* `split_mendocino` representing a split N/S of 40.167 degrees
* `split_conception` representing a split N/S of 34.45 degrees
* `split_monterey` representing a split N/S of 36 degrees
* `split_state` representing three splits for California, Oregon, Washington

Generating new variables representing new splits will allow you to fit models with `sdmTMB`, but the predictions (index) will fail -- for these cases, a better solution is to merge them into the package.

### Preparing the data

This block applies filters (based on latitude, depth, and year) to each observation, and creates the new variable `split_mendocino` that already exists in the prediction dataframe.

```{r}
data <- configuration |>
# Row by row ... do stuff then ungroup
dplyr::rowwise() |>
# Pull the data based on the function found in fxn column
dplyr::mutate(
data_raw = list(format_data(eval(parse(text = fxn)))),
data_filtered = list(data_raw |>
dplyr::filter(
depth <= min_depth, depth >= max_depth,
latitude >= min_latitude, latitude <= max_latitude,
year >= min_year, year <= max_year
) |>
dplyr::mutate(split_mendocino = ifelse(latitude > 40.1666667, "N", "S")))
) |>
dplyr::ungroup()
```

Yellowtail rockfish occur more north of Cape Mendocino, and as a check we can see if there's any years with no positive catches. This filtering shows that no positive catches exist for 2007.

```{r}
dplyr::filter(data$data_filtered[[1]]) |>
dplyr::group_by(split_mendocino, year) |>
dplyr::summarise(sum_catch_weight = sum(catch_weight)) |>
dplyr::filter(sum_catch_weight == 0)
```

This is a problem because if we try to estimate the `fyear:split_mendocino` interaction, the coefficient will not be identifiable for 2007. This is not a unique problem for `sdmTMB` but a general issue for any regression / GLM / GAM / GLMM model. The workaround that we'll implment is to "map off" or not estimate the `fyear:split_mendocino` in 2007 -- but estimate it in all other years.

As a first step, we'll summarize the names of all main effect coefficients in our model. We use `lm()` to do this for simplicity, but other approaches could also be used.

```{r}
lm <- lm(
formula = as.formula(configuration$formula),
data = data$data_filtered[[1]]
)
coef_names <- names(coef(lm))
print(coef_names)
```

All of the coefficients from the presence - absence model look to be identifiable

```{r}
pres_not_identifiable <- names(which(is.na(coef(lm))))
print(pres_not_identifiable)
```

To identify the labels of the coefficients from the positive model that are not estimable, we can re-fit the regression, using only positive observations.

```{r}
lm_pos <- lm(
formula = as.formula(configuration$formula),
data = dplyr::filter(data$data_filtered[[1]], catch_weight > 0)
)
pos_not_identifiable <- names(which(is.na(coef(lm_pos))))
print(pos_not_identifiable)
```

Finally, we can create some mapping variables to be passed to `sdmTMB`. We'll do this separately for the presence-absence and positive models.

```{r}
map_pres <- coef_names
map_pres[coef_names %in% pres_not_identifiable] <- NA
map_pres <- factor(map_pres)
map_pos <- coef_names
map_pos[coef_names %in% pos_not_identifiable] <- NA
map_pos <- factor(map_pos)
```

Along with these mapping vectors, we'll create vectors of starting values (arbitrarily assigning -20 to values that are not estimable)

```{r}
start_pres <- rep(0, length(coef_names))
start_pres[coef_names %in% pres_not_identifiable] <- -20
start_pos <- rep(0, length(coef_names))
start_pos[coef_names %in% pos_not_identifiable] <- -20
```

### Fitting the model with indexwc and sdmTMB

Now we can use the `indexwc` package to fit the model. The `indexwc` package acts as a wrapper for `sdmTMB` here, combining the estimation process (`sdmTMB()`) with the index generation (`get_index()`).

This code is optimized to fit multiple models (across stocks / species, or model configurations). In our case, the configurations dataframe has 1 row -- but this could include multiple rows. A key point to highlight is that in this code, we use `sdmTMBcontrol()` to pass in a list that contains the variables above we use to map off and initialize the parameters. For datasets without missing values, you may not want to include this `sdmTMBcontrol()` line, but it may be useful to adjust the `newton_loops`, etc.

Note: this code will fit a single model for each row

```{r}
best <- data |>
dplyr::mutate(
# Evaluate the call in family
family = purrr::map(family, .f = ~ eval(parse(text = .x))),
# Run the model on each row in data
results = purrr::pmap(
.l = list(
data = data_filtered,
formula = formula,
family = family,
anisotropy = anisotropy,
n_knots = knots,
share_range = share_range,
spatiotemporal = purrr::map2(spatiotemporal1, spatiotemporal2, list),
sdmtmb_control = list(
sdmTMB::sdmTMBcontrol(
map = list(b_j = map_pos, b_j2 = map_pos),
start = list(b_j = start_pos, b_j2 = start_pos),
newton_loops = 1 # increase for real models
)
)
),
.f = indexwc::run_sdmtmb
)
)
```

### Examining the index and diagnostics

Output and diagnostics for all models and predictions are stored in a folder that `indexwc` creates; it's location is "yellowtail_rockfish/wcgbts/delta_gamma..". The "index" folder contains diagnostics and predictions for 10 sets of indices. These 10 indices represent (1) N/S of Cape Mendocino, (2) N/S of Monterrey, (3) N/S of Pt Conception, a coastwide index, and state-specific indices. For multi-region models, like the yellowtail rockfish used here, it wouldn't make sense to interpret indices other than the split used in the model (here, N/S of Cape Mendocino) and the coastwide index.

```{r echo = FALSE}
knitr::include_graphics(system.file("figures/all_yellowtail_indices.png", package = "indexwc"))
```

0 comments on commit 3537b8e

Please sign in to comment.