---
output: html_document
editor_options:
  chunk_output_type: console
---
# Habitat Covariates {#habitat}
## Introduction {#habitat-intro}
Species distribution models work by finding associations between species occurrence or abundance and environmental variables. Using these relationships, it's possible to predict the distribution in areas that aren't sampled, provided we know the values of the environmental variables in those areas. Therefore, to proceed with the modeling in the next several chapters, we'll need to prepare a suite of environmental variables to be used as covariates in our models. The particular set of covariates that's most suitable for a given study will depend on the focal species, region, and time period, as well as the availability of data. Fortunately, there is an abundance of freely available landcover products derived from satellites such as [Landsat](https://en.wikipedia.org/wiki/Landsat_program), [SPOT](https://en.wikipedia.org/wiki/SPOT_(satellite)), and [MODIS](https://en.wikipedia.org/wiki/Moderate_Resolution_Imaging_Spectroradiometer) that are suitable for distribution modeling.
For the examples in this book, we'll use habitat covariates derived from the [MODIS MCD12Q1 v006](https://lpdaac.usgs.gov/dataset_discovery/modis/modis_products_table/mcd12q1_v006) landcover product [@friedlMCD12Q1MODISTerra2015]. This product has global coverage at 500 m spatial resolution and annual temporal resolution from 2001 to 2017. The data are provided under several different classification schemes; we'll use the University of Maryland (UMD) landcover classification, which, in our experience, provides a globally accurate classification of landcover. This system classifies each pixel into one of 16 landcover classes:
```{r habitat, echo = FALSE}
lc_classes <- readr::read_csv("data/mcd12q1_classes.csv")
names(lc_classes) <- stringr::str_to_title(names(lc_classes))
knitr::kable(lc_classes)
```
For a wide range of studies, this MODIS landcover dataset will be suitable for generating habitat covariates; however, there may be particular cases where the study species, habitat, or ecological question requires different, or more specialized, data. For example, shorebird distribution modeling would benefit from data on the [extent of tidal flats](https://www.intertidal.app), seabird distributions are often influenced by [ocean depth](https://eatlas.org.au/data/uuid/80301676-97fb-4bdf-b06c-e961e5c0cb0b), and in many regions [elevation](https://github.com/jhollist/elevatr) plays a critical role in shaping species distributions. Regardless of which habitat data you decide to use for your project, this chapter should provide a template for preparing those data as covariates for modeling species distributions.
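As one example of folding in a supplemental dataset, below is a minimal sketch of pulling elevation data for our study region with the `elevatr` package linked above. This sketch is not part of the workflow used in this book; the `get_elev_raster()` call, the zoom level, and the object names are assumptions you'd adapt to your own project.
```{r habitat-elevation-sketch, eval = FALSE}
# hedged sketch: elevation as a supplemental covariate via elevatr
# (not used elsewhere in this book; zoom level and names are assumptions)
library(sf)
library(elevatr)
# bcr 27 boundary prepared in the Introduction
bcr <- read_sf("data/gis-data.gpkg", "bcr")
bcr <- bcr[bcr$bcr_code == 27, ]
# download a digital elevation model covering the region;
# higher z values give finer resolution but larger downloads
elev <- get_elev_raster(locations = bcr, z = 7)
```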
We'll begin by showing how to access and download MODIS landcover data. Next, we'll demonstrate how to summarize these data within a neighborhood around each checklist location. Finally, we'll calculate the same covariates over a regular grid, which we'll use to make predictions of species distributions throughout our study area. If you want to skip this section and jump straight to the modeling, you can download the data package, which includes all the prepared MODIS data that we'll use in the remainder of this book.
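If you do take that route, the download can also be scripted. Here's a minimal sketch, assuming the data package URL given in the next section and assuming the archive unzips into a `data/` subdirectory of your project.
```{r habitat-data-package, eval = FALSE}
# hedged sketch: download and unzip the data package
# (adjust destfile/exdir if your project is organized differently)
dl_url <- "https://github.com/mstrimas/ebird-best-practices/raw/master/data/data.zip"
download.file(dl_url, destfile = "data.zip", mode = "wb")
unzip("data.zip", exdir = ".")
```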
## Downloading MODIS data {#habitat-dl}
As with most satellite data, MODIS data are provided as [1200 km by 1200 km tiles](https://modis-land.gsfc.nasa.gov/MODLAND_grid.html) for ease of download. Each tile is a [raster GIS dataset](http://desktop.arcgis.com/en/arcmap/10.3/manage-data/raster-and-images/what-is-raster-data.htm) consisting of a regular grid of 500 m resolution cells. The surface of the Earth is divided up into a grid of these tiles, each given an ID; for example, h10v12 is the tile from the 10th column and 12th row of the grid. Compiling MODIS data for a given region requires figuring out which set of tiles covers the region, downloading those tiles, combining them into a single raster dataset, and converting from the native MODIS HDF format, which R can't read, to a standard GeoTIFF format. This needs to be done for each year for which we want habitat data, and it can be a time-consuming and error-prone process. Fortunately, the [R package `MODIS`](https://github.com/MatMatt/MODIS) automates most of these steps. Unfortunately, this package can be challenging and confusing to get working. With this in mind, this section provides detailed instructions for setting up and using the `MODIS` package.
Let's start by figuring out the tile IDs for the tiles that BCR 27 spans. Recall that we prepared a BCR boundary in Section \@ref(intro-setup-gis) of the Introduction; if you haven't already done so, [download the data package](https://github.com/mstrimas/ebird-best-practices/raw/master/data/data.zip) now to get that boundary. Given a set of spatial features, the `MODIS` package can quickly tell us which MODIS tiles we need.
```{r habitat-dl-data}
library(sf)
library(raster)
library(MODIS)
library(velox)
library(viridis)
library(tidyverse)
# resolve namespace conflicts
select <- dplyr::select
projection <- raster::projection
# bcr 27 boundary
bcr <- read_sf("data/gis-data.gpkg", "bcr") %>%
  filter(bcr_code == 27)
# load ebird data
ebird <- read_csv("data/ebd_woothr_june_bcr27_zf.csv")
# get list of tiles required to cover this bcr
tiles <- getTile(bcr)
tiles@tile
```
So, we'll need to download these three tiles for each of the 10 years from 2009-2018: `r paste(tiles@tile, collapse = ", ")`.
### `MODIS` setup {#habit-dl-setup}
Before we start using `MODIS` for the first time, a bit of setup is required. First, [sign up for a NASA Earthdata account](https://urs.earthdata.nasa.gov/users/new) to get access to MODIS and other NASA data. Then run `MODIS::EarthdataLogin(usr = "username", pwd = "password")`, substituting the username and password you just created, to store your login credentials so the `MODIS` package can access them.
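As a concrete example, that one-time setup might look like the following; the credentials are placeholders and the storage paths are assumptions you can change to suit your project.
```{r habitat-modis-setup, eval = FALSE}
# store Earthdata credentials so the MODIS package can use them
# (replace the placeholders with your own username and password)
MODIS::EarthdataLogin(usr = "your_username", pwd = "your_password")
# optionally, set where raw HDF tiles and processed output are stored;
# these paths are examples, not requirements
MODIS::MODISoptions(localArcPath = "data/modis_hdf", outDirPath = "data")
```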
Next, we need to install [GDAL](https://www.gdal.org/), an open source library for working with geospatial data that's needed for processing the MODIS tiles. The steps for installing GDAL are system dependent:
- **Mac OS X:** First, check whether GDAL is installed with HDF4 support by running `gdal-config --formats` in Terminal. If you see `hdf4` in the list, you don't need to do anything else! If not, install the [Homebrew](https://brew.sh/) package manager by following the instructions on its website. Then, run the following commands in Terminal to install GDAL:
```
brew tap osgeo/osgeo4mac
brew install hdf4
brew link --overwrite hdf4
brew install osgeo-gdal
brew link --force osgeo-gdal
```
- **Windows:** install [OSGeo4W](http://trac.osgeo.org/osgeo4w/), a suite of open source geospatial tools that includes GDAL. In R, run `MODIS:::checkTools("GDAL")`, which will search your system for GDAL and suggest a command such as `MODIS::MODISoptions(gdalPath = "c:/OSGeo4W64/bin")` that will make GDAL available to the `MODIS` package. Run this command and, when it asks, agree to making the settings permanent.
- **Linux:** on Debian-based distributions such as Ubuntu, run `sudo apt-get install gdal-bin` in the terminal; on other distributions, install GDAL using your system's package manager.
Finally, run `MODIS:::checkTools("GDAL")` to check that GDAL is installed and that the `MODIS` package can find it. If GDAL can't be found, you'll need to manually locate it and use `MODIS::MODISoptions(gdalPath = "path/to/gdal/")` to tell the `MODIS` package where it is.
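Putting those checks together, the verification step might look like this; the `gdalPath` value is just a placeholder for wherever GDAL lives on your system.
```{r habitat-gdal-check, eval = FALSE}
# confirm that the MODIS package can find a GDAL installation
MODIS:::checkTools("GDAL")
# if GDAL isn't found automatically, point the package at it manually;
# the path below is a placeholder, adjust it for your system
# MODIS::MODISoptions(gdalPath = "path/to/gdal/")
```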
### Download using R {#habitat-dl-r}
Once all the setup steps have been completed, we can start downloading some data! The `MODIS` function `runGdal()` downloads and processes MODIS tiles into a single GeoTIFF for each year. Note that at the time of writing, landcover data from 2018 haven't been prepared yet, so we'll use 2017 data for both 2017 and 2018. The key arguments to `runGdal()` are:
- `product`: the specific MODIS product to download. For a full list of available products, use `MODIS::getProduct()`.
- `SDSstring`: a string specifying which bands to extract, with 0 for bands to drop and 1 for bands to keep. Most MODIS products have multiple bands stored in a single raster file, for example, reflectances in different wavelength ranges or, in our case, landcover under different classification systems. The [documentation for the MCD12Q1 dataset](https://lpdaac.usgs.gov/dataset_discovery/modis/modis_products_table/mcd12q1_v006) shows that there are 13 bands in the downloaded files, and we're interested in band 2, which contains the UMD landcover classification.
- `tileH` and `tileV`: the horizontal and vertical tile numbers as returned by `getTile()`.
- `begin` and `end`: the start and end dates of the time period from which to extract data. Although the landcover data are only available annually, we need to specify full dates because some other products are available on a more granular basis.
- `outDirPath`: directory to store processed MODIS data.
- `job`: a name for this task, which will become the sub-directory of `outDirPath` within which the processed data are stored.
```{r habitat-dl-r, eval = FALSE}
# earliest year of ebird data
begin_year <- format(min(ebird$observation_date), "%Y.01.01")
# end date for ebird data, mcd12q1 only exists up to 2017
end_year <- min(format(max(ebird$observation_date), "%Y.12.31"),
                "2017.12.31")
# download tiles and combine into a single raster for each year
tifs <- runGdal(product = "MCD12Q1", collection = "006", SDSstring = "01",
                tileH = tiles@tileH, tileV = tiles@tileV,
                begin = begin_year, end = end_year,
                outDirPath = "data", job = "modis") %>%
  pluck("MCD12Q1.006") %>%
  unlist()
# rename tifs to have more descriptive names
new_names <- format(as.Date(names(tifs)), "%Y") %>%
  sprintf("modis_mcd12q1_umd_%s.tif", .) %>%
  file.path(dirname(tifs), .)
file.rename(tifs, new_names)
```
If everything ran smoothly, we now have annual GeoTIFFs of MODIS landcover data from 2009 to 2017 that we can load into R.
```{r habitat-dl-load}
# load the landcover data
landcover <- list.files("data/modis", "^modis_mcd12q1_umd",
                        full.names = TRUE) %>%
  stack()
# label layers with year
landcover <- names(landcover) %>%
  str_extract("(?<=modis_mcd12q1_umd_)[0-9]{4}") %>%
  paste0("y", .) %>%
  setNames(landcover, .)
landcover
```
### Troubleshooting {#habitat-dl-trouble}
If the call to `runGdal()` didn't work for you, don't worry, you're not alone! It's challenging to get the `MODIS` package working, and errors are common when you're first setting it up. The most common problem is not having GDAL installed correctly, which produces an error like `GDAL not installed or configured`. Either you don't have GDAL at all, or you have it but it lacks support for HDF4 files (the native format for MODIS data). Try following the [above instructions](#habit-dl-setup) again. If it still doesn't work, consult the instructions on the `MODIStsp` website for [installing GDAL](http://ropensci.github.io/MODIStsp/articles/installation.html#installing-gdal-1-11-1).
Another error you may see is: `Make sure either 'wget' or 'curl' is available in order to download data from LP DAAC or NSIDC.` This should only arise on versions of Windows older than Windows 10. If you see this error, you'll need to install `curl`, which R uses to download the MODIS tiles. There is a StackOverflow question with [excellent instructions](https://stackoverflow.com/questions/9507353/how-do-i-install-and-use-curl-on-windows) for installing `curl` and getting it set up on your system.
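A quick way to see whether either tool is already available is to check from within R; this is just a convenience check, not part of the `MODIS` package.
```{r habitat-curl-check, eval = FALSE}
# check whether curl or wget is on the system PATH;
# an empty string means the tool wasn't found
Sys.which(c("curl", "wget"))
```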
If these tips haven't solved your particular problem, head over to the GitHub repository and file an issue. We'll help you get things working *and* you'll help us improve this troubleshooting section! **When you file an issue, please provide the output of `sessionInfo()` so we know a little bit about your system setup.**
## Landscape metrics {#habitat-lsm}
At this point we could use the MODIS landcover data directly, simply extracting the landcover class for each checklist location. However, we instead advocate summarizing the landcover data within a neighborhood around the checklist locations. As discussed in Section \@ref(intro-intro), checklist locations are not precise, so it's more appropriate to use the habitat in the surrounding area, rather than only at the checklist location. More fundamentally, organisms interact with their environment not at a single point, but at the scale of a landscape, so it's important to include habitat information characterizing a suitably-sized landscape around the observation location.
There are a variety of **landscape metrics** that can be used to characterize the composition (what habitat is available) and configuration (how that habitat is arranged spatially) of landscapes. The simplest metric of landscape composition is the percentage of the landscape in each landcover class (PLAND in the parlance of [FRAGSTATS](https://www.umass.edu/landeco/research/fragstats/fragstats.html)). For a broad range of scenarios, PLAND is a reliable choice for calculating habitat covariates in distribution modeling. Based on our experience working with eBird data, an approximately 2.5 km by 2.5 km square neighborhood (5 by 5 MODIS cells) centered on the checklist location is sufficient to account for the spatial precision in the data when the maximum distance of travelling counts has been limited to 5 km, while being a relevant ecological scale for many bird species.
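To make the PLAND calculation concrete, here's a toy example for a single neighborhood of 25 cells; the landcover values are invented purely for illustration.
```{r habitat-lsm-toy, eval = FALSE}
# toy example: pland for one 5 x 5 cell neighborhood (values are made up)
toy_cells <- c(rep(4, 15), rep(5, 7), rep(13, 3))
# proportion of the neighborhood in each landcover class
prop.table(table(toy_cells))
#> toy_cells
#>    4    5   13
#> 0.60 0.28 0.12
```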
We'll start by finding the full set of unique checklist locations for each year in the eBird data. Then we'll convert these locations to spatial `sf` features and project them to the sinusoidal equal-area projection used by MODIS. We'll buffer these points to create square polygons around each location that are the size of a 5 by 5 grid of MODIS landcover cells. Finally, we'll split the neighborhoods up by year so they can be matched to the MODIS landcover data from the corresponding year.
```{r habitat-lsm-buffer}
neighborhood_radius <- 2.5 * ceiling(max(res(landcover)))
ebird_buff <- ebird %>%
  distinct(year = format(observation_date, "%Y"),
           locality_id, latitude, longitude) %>%
  # for 2018 use 2017 landcover data
  mutate(year_lc = if_else(as.integer(year) > 2017, "2017", year),
         year_lc = paste0("y", year_lc)) %>%
  # convert to spatial features
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  # transform to modis projection
  st_transform(crs = projection(landcover)) %>%
  # buffer to create square neighborhood around each point
  st_buffer(dist = neighborhood_radius, endCapStyle = "SQUARE") %>%
  # nest by year
  nest(-year_lc)
```
Now, we'll loop over the years and for each square neighborhood extract all the raster values within that neighborhood. We use the `velox` package for this, since it's often orders of magnitude faster than using `raster::extract()`.
```{r habitat-lsm-extract}
# function to calculate pland for all checklists in a given year
calculate_pland <- function(yr, regions, lc) {
  # create a lookup table to get locality_id from row number
  locs <- st_set_geometry(regions, NULL) %>%
    mutate(id = row_number())
  # extract using velox
  lc_vlx <- velox(lc[[yr]])
  lc_vlx$extract(regions, df = TRUE) %>%
    # velox doesn't properly name columns, fix that
    set_names(c("id", "landcover")) %>%
    # join to lookup table to get locality_id
    inner_join(locs, ., by = "id") %>%
    select(-id)
}

# iterate over all years calculating pland for all checklists in each
lc_extract <- ebird_buff %>%
  mutate(pland = map2(year_lc, data, calculate_pland, lc = landcover)) %>%
  select(pland) %>%
  unnest()
```
Now we have the set of landcover values within a neighborhood around each checklist location. We can summarize these data within each neighborhood to calculate PLAND: the proportion of the neighborhood within each landcover class.
```{r habitat-lsm-pland}
pland <- lc_extract %>%
  # count landcovers
  count(locality_id, year, landcover) %>%
  # calculate proportion of the neighborhood in each landcover class
  group_by(locality_id, year) %>%
  mutate(pland = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  # remove NAs after tallying so pland is relative to total number of cells
  filter(!is.na(landcover))

# transform to wide format, filling in implicit missing values with 0s
pland <- pland %>%
  mutate(landcover = paste0("pland_", str_pad(landcover, 2, pad = "0"))) %>%
  spread(landcover, pland, fill = 0)

# save
write_csv(pland, "data/modis_pland_location-year.csv")
```
## Prediction surface {#habitat-prediction}
After fitting species distribution models, the goal is typically to make predictions throughout the study area. To do this, we'll need a regular grid of habitat covariates over which to make predictions. In this section, we'll create such a prediction surface for BCR 27 using the MODIS landcover data from 2017. To start, we'll need a template raster with cells equal in size to the neighborhoods we defined in the previous section: 5 by 5 MODIS landcover cells. We can use `raster::aggregate()` to achieve this. We'll also use `raster::rasterize()` to assign the value 1 to all cells within BCR 27 and leave all cells outside BCR 27 empty.
```{r habitat-prediction-template}
agg_factor <- round(2 * neighborhood_radius / res(landcover))
r <- raster(landcover) %>%
  aggregate(agg_factor)
r <- bcr %>%
  st_transform(crs = projection(r)) %>%
  rasterize(r, field = 1) %>%
  # remove any empty cells at edges
  trim(filename = "data/prediction-surface.tif", overwrite = TRUE)
```
Next, for each cell of this raster, we'll calculate the PLAND metrics using the same approach as the previous section.
```{r habitat-prediction-calc}
# get cell centers and create neighborhoods
r_centers <- rasterToPoints(r, spatial = TRUE) %>%
  st_as_sf() %>%
  transmute(id = row_number())
r_cells <- st_buffer(r_centers, dist = neighborhood_radius,
                     endCapStyle = "SQUARE")

# extract landcover values within neighborhoods, only need 2017
lc_vlx <- velox(landcover[["y2017"]])
lc_extract_pred <- lc_vlx$extract(r_cells, df = TRUE) %>%
  set_names(c("id", "landcover"))

# calculate the percent for each landcover class
pland_pred <- lc_extract_pred %>%
  count(id, landcover) %>%
  group_by(id) %>%
  mutate(pland = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  # remove NAs after tallying so pland is relative to total number of cells
  filter(!is.na(landcover))

# transform to wide format, filling in implicit missing values with 0s
pland_pred <- pland_pred %>%
  mutate(landcover = paste0("pland_", str_pad(landcover, 2, pad = "0"))) %>%
  spread(landcover, pland, fill = 0) %>%
  mutate(year = 2017L) %>%
  select(id, year, everything())

# join in coordinates
pland_coords <- st_transform(r_centers, crs = 4326) %>%
  st_coordinates() %>%
  as.data.frame() %>%
  cbind(id = r_centers$id, .) %>%
  rename(longitude = X, latitude = Y) %>%
  inner_join(pland_pred, by = "id")

# save
write_csv(pland_coords, "data/modis_pland_prediction-surface.csv")
glimpse(pland_coords)
```
Keeping these data in a data frame is a compact way to store them and will be required once we make model predictions in later chapters. However, we can always use the raster template to convert these PLAND metrics into a spatial format, for example, if we want to map them. Let's look at how this works for landcover class 4: deciduous broadleaf forest.
```{r habitat-prediction-map}
forest_cover <- pland_coords %>%
  # convert to spatial features
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs = projection(r)) %>%
  # rasterize points
  rasterize(r, field = "pland_04") %>%
  # project to albers equal-area for mapping
  projectRaster(crs = st_crs(102003)$proj4string, method = "ngb") %>%
  # trim off empty edges of raster
  trim()

# make a map
par(mar = c(0.25, 0.25, 2, 0.25))
plot(forest_cover, axes = FALSE, box = FALSE, col = viridis(10),
     main = "Proportion of Deciduous Broadleaf Forest\n2017 MODIS Landcover")
```
This completes the data preparation and the remaining chapters will focus on using these data to predict species distributions.
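As a preview of how these covariates will be picked up again, the files saved above can simply be read back in when needed; the object names below are placeholders.
```{r habitat-reload, eval = FALSE}
# reload the habitat covariates produced in this chapter
library(readr)
pland_checklists <- read_csv("data/modis_pland_location-year.csv")
pland_prediction <- read_csv("data/modis_pland_prediction-surface.csv")
```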