---
title: "ETC3550: Applied forecasting for business and economics"
author: "Ch7. Exponential smoothing"
date: "OTexts.org/fpp2/"
fontsize: 14pt
output:
  beamer_presentation:
    fig_width: 7
    fig_height: 3.5
    highlight: tango
    theme: metropolis
    includes:
      in_header: header.tex
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, cache=TRUE, warning=FALSE, message=FALSE)
library(fpp2)
source("nicefigs.R")
options(digits=4)
```

# Simple exponential smoothing

## Simple methods
\fontsize{14}{16}\sf

Time series $y_1,y_2,\dots,y_T$.

\begin{block}{Random walk forecasts}
  \centerline{$\pred{y}{T+h}{T} = y_T$}
\end{block}\pause

\begin{block}{Average forecasts}
  \centerline{$\displaystyle\pred{y}{T+h}{T} = \frac1T\sum_{t=1}^T y_t$}
\end{block}\pause\vspace*{-0.2cm}

* Want something in between that weights most recent data more highly.
* Simple exponential smoothing uses a weighted moving average with weights that decrease exponentially.

## Simple Exponential Smoothing

\begin{block}{Forecast equation}
$\pred{y}{T+1}{T} = \alpha y_T + \alpha(1-\alpha) y_{T-1} + \alpha(1-\alpha)^2 y_{T-2}+ \cdots$
\end{block}
where $0 \le \alpha \le1$.\pause\vspace*{0.2cm}

\small\begin{tabular}{lllll}
\toprule
& \multicolumn{4}{l}{Weights assigned to observations for:}\\
Observation  &   $\alpha = 0.2$   &   $\alpha = 0.4$  &   $\alpha = 0.6$  & $\alpha = 0.8$ \\
\midrule
$y_{T}$      & 0.2         & 0.4          & 0.6         & 0.8\\
$y_{T-1}$    & 0.16        & 0.24         & 0.24        & 0.16\\
$y_{T-2}$    & 0.128       & 0.144        & 0.096       & 0.032\\
$y_{T-3}$    & 0.1024      & 0.0864       & 0.0384      & 0.0064\\
$y_{T-4}$    & $(0.2)(0.8)^4$  & $(0.4)(0.6)^4$   & $(0.6)(0.4)^4$  & $(0.8)(0.2)^4$\\
$y_{T-5}$    & $(0.2)(0.8)^5$  & $(0.4)(0.6)^5$   & $(0.6)(0.4)^5$  & $(0.8)(0.2)^5$\\
\bottomrule
\end{tabular}
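
## Simple Exponential Smoothing: weights in R
\fontsize{10}{11}\sf

The weights in the table can be reproduced directly from $\alpha(1-\alpha)^j$. This is a minimal base R sketch; the object names `alpha`, `j` and `w` are chosen here for illustration only.

```r
# Weight alpha * (1 - alpha)^j attached to y_{T-j}, for j = 0, ..., 5
alpha <- c(0.2, 0.4, 0.6, 0.8)
j <- 0:5
w <- outer(j, alpha, function(j, a) a * (1 - a)^j)
dimnames(w) <- list(paste0("y_{T-", j, "}"), paste0("alpha=", alpha))
round(w, 4)

# Total weight on the six most recent observations: 1 - (1 - alpha)^6
colSums(w)
```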

## Simple Exponential Smoothing
\fontsize{14}{16}\sf

\begin{block}{Component form}\vspace*{-0.4cm}
\begin{align*}
\text{Forecast equation}&&\pred{y}{t+h}{t} &= \ell_{t}\\
\text{Smoothing equation}&&\ell_{t} &= \alpha y_{t} + (1 - \alpha)\ell_{t-1}
\end{align*}
\end{block}\vspace*{-0.2cm}

* $\ell_t$ is the level (or the smoothed value) of the series at time $t$.
* $\pred{y}{t+1}{t} = \alpha y_t + (1-\alpha) \pred{y}{t}{t-1}$\newline
  Iterate to get exponentially weighted moving average form.

\begin{block}{Weighted average form}
$\displaystyle\pred{y}{T+1}{T}=\sum_{j=0}^{T-1} \alpha(1-\alpha)^j y_{T-j}+(1-\alpha)^T \ell_{0}$
\end{block}
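
## Simple Exponential Smoothing: checking the two forms
\fontsize{9}{10}\sf

As a rough check, the component form and the weighted average form can be compared numerically. The helper `ses_level()`, the data and the parameter values below are made up for illustration; this is not how `ses()` is implemented.

```r
# Level recursion: l_t = alpha * y_t + (1 - alpha) * l_{t-1}
ses_level <- function(y, alpha, l0) {
  level <- numeric(length(y))
  prev <- l0
  for (i in seq_along(y)) {
    level[i] <- alpha * y[i] + (1 - alpha) * prev
    prev <- level[i]
  }
  level
}

y <- c(3, 5, 4, 6, 7)   # made-up data
alpha <- 0.3; l0 <- 4   # made-up parameter values
n <- length(y)

recursive <- ses_level(y, alpha, l0)[n]
j <- 0:(n - 1)
weighted <- sum(alpha * (1 - alpha)^j * y[n - j]) + (1 - alpha)^n * l0
all.equal(recursive, weighted)
```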

## Optimisation

  * Need to choose value for $\alpha$ and $\ell_0$
  * Similarly to regression --- we choose $\alpha$ and $\ell_0$ by minimising SSE:
$$
  \text{SSE}=\sum_{t=1}^T(y_t - \pred{y}{t}{t-1})^2.
$$
  * Unlike regression, there is no closed-form solution, so we use numerical optimisation.
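
## Optimisation: a sketch in R
\fontsize{9}{10}\sf

One way to carry out the numerical step is with base R's `optim()`. The function `ses_sse()` and the data are hypothetical, and the bounded `"L-BFGS-B"` method is just one reasonable choice; `ses()` uses its own routine.

```r
# One-step forecasts are the lagged levels, so SSE = sum((y_t - l_{t-1})^2)
ses_sse <- function(par, y) {
  alpha <- par[1]
  level <- par[2]            # l_0
  sse <- 0
  for (obs in y) {
    err <- obs - level       # y_t - yhat_{t|t-1}
    sse <- sse + err^2
    level <- level + alpha * err
  }
  sse
}

y <- c(446, 454, 455, 423, 456, 440, 425, 486, 500, 521)  # made-up data
start <- c(alpha = 0.5, l0 = y[1])
opt <- optim(start, ses_sse, y = y, method = "L-BFGS-B",
             lower = c(0, -Inf), upper = c(1, Inf))
opt$par    # estimated alpha and l_0
opt$value  # minimised SSE
```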

## Example: Oil production

\fontsize{10}{11}\sf

```{r sesfit, echo=TRUE, cache=TRUE}
oildata <- window(oil, start=1996)
# Estimate parameters
fc <- ses(oildata, h=5)
summary(fc[["model"]])
```

```{r sesparam, echo=FALSE, cache=TRUE}
#tmp <- accuracy(fc)
#print(round(c(tmp[,c("MAE","RMSE","MAPE")],SSE=sum(residuals(fc)^2)),1))
alpha <- fc$model$par[1]
l0 <- fc$model$par[2]
```

## Example: Oil production

\fontsize{8}{8}\sf\vspace*{-0.2cm}

```{r oilses, echo=FALSE, cache=TRUE}
# Data set for table
x <- oildata
# Generate forecasts
fc <- ses(x, h=3)
# Now set up the table
n <- length(x)
year0 <- min(time(x))-1
tab <- matrix(NA,nrow=n+6,ncol=5)
colnames(tab) <- c("Year","Time","Observation","Level","Forecast")
tab[2:(n+6),1] <- year0 + 0:(n+4)
tab[2:(n+6),2] <- 0:(n+4)
# Add data, level and fitted values
tab[3:(n+2),3] <- x
tab[2:(n+2),4] <- fc$model$states
tab[3:(n+2),5] <- fitted(fc)
# Add forecasts
tab[n+(4:6),1] <- max(time(x))+1:3
tab[n+(4:6),2] <- 1:3
tab[n+(4:6),5] <- fc$mean
# Convert to characters
tab <- as.data.frame(tab)
class(tab$Year) <- class(tab$Time) <- "integer"
tab <- format(tab, digits=5)
# Remove missing values
tab <- apply(tab, 2, function(x){j <- grep("[ ]*NA",x); x[j] <- ""; return(x)})
# Add math notation rows
tab[1,] <- c("","$t$","$y_t$","$\\ell_t$","$\\hat{y}_{t+1|t}$")
tab[n+3,] <- c("","$h$","","","$\\hat{y}_{T+h|T}$")
# Show table
knitr::kable(tab, booktabs=TRUE)
```

## Example: Oil production

\fontsize{12}{12}\sf

```{r ses, echo=TRUE, cache=TRUE}
autoplot(fc) +
  autolayer(fitted(fc), series="Fitted") +
  ylab("Oil (millions of tonnes)") + xlab("Year")
```

# Trend methods

## Holt's linear trend

\begin{block}{Component form}\vspace*{-.4cm}
\begin{align*}
\text{Forecast }&& \pred{y}{t+h}{t} &= \ell_{t} + hb_{t} \\
\text{Level }&& \ell_{t} &= \alpha y_{t} + (1 - \alpha)(\ell_{t-1} + b_{t-1})\\
\text{Trend }&& b_{t} &= \beta^*(\ell_{t} - \ell_{t-1}) + (1 -\beta^*)b_{t-1},
\end{align*}
\end{block}
\pause\vspace*{-0.2cm}

  * Two smoothing parameters $\alpha$ and $\beta^*$ ($0\le\alpha,\beta^*\le1$).
  * $\ell_t$ level: weighted average between $y_t$ and one-step ahead forecast for time $t$, $(\ell_{t-1} + b_{t-1}=\pred{y}{t}{t-1})$
  * $b_t$ slope: weighted average of $(\ell_{t} - \ell_{t-1})$ and $b_{t-1}$, current and previous estimate of slope.
  * Choose $\alpha, \beta^*, \ell_0, b_0$ to minimise SSE.
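
## Holt's linear trend: recursion sketch
\fontsize{9}{10}\sf

The component form translates almost line by line into a loop. The helper `holt_filter()` and all numbers below are illustrative assumptions, not the internals of `holt()`.

```r
holt_filter <- function(y, alpha, beta_star, l0, b0) {
  level <- l0; slope <- b0
  fitted <- numeric(length(y))
  for (i in seq_along(y)) {
    fitted[i] <- level + slope                        # yhat_{t|t-1}
    new_level <- alpha * y[i] + (1 - alpha) * (level + slope)
    slope <- beta_star * (new_level - level) + (1 - beta_star) * slope
    level <- new_level
  }
  list(level = level, slope = slope, fitted = fitted)
}

y <- 10 + 2 * (1:8)   # made-up, exactly linear series
fit <- holt_filter(y, alpha = 0.8, beta_star = 0.2, l0 = 10, b0 = 2)
fit$level + (1:3) * fit$slope   # forecasts for h = 1, 2, 3
```

Because the series is exactly linear and the initial states match it, the fitted values reproduce the data and the forecasts continue the same line.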

## Holt's method in R
\fontsize{12}{15}\sf

```{r, fig.height=3.6, echo=TRUE}
window(ausair, start=1990, end=2004) %>%
  holt(h=5, PI=FALSE) %>%
  autoplot()
```

## Damped trend method
\begin{block}{Component form}\vspace*{-0.4cm}
\begin{align*}
\pred{y}{t+h}{t} &= \ell_{t} + (\phi+\phi^2 + \dots + \phi^{h})b_{t} \\
\ell_{t} &= \alpha y_{t} + (1 - \alpha)(\ell_{t-1} + \phi b_{t-1})\\
b_{t} &= \beta^*(\ell_{t} - \ell_{t-1}) + (1 -\beta^*)\phi b_{t-1}.
\end{align*}
\end{block}
\pause

  * Damping parameter $0<\phi<1$.
  * If $\phi=1$, identical to Holt's linear trend.
  * As $h\rightarrow\infty$, $\pred{y}{T+h}{T}\rightarrow \ell_T+\phi b_T/(1-\phi)$.
  * Short-run forecasts trended, long-run forecasts constant.
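
## Damped trend method: the damping multiplier
\fontsize{10}{11}\sf

The multiplier on $b_t$ and its limit can be tabulated with a few lines of R. The helper `damp_sum()` and the value $\phi=0.9$ are chosen purely for illustration.

```r
# Multiplier on b_t in the damped forecast: phi + phi^2 + ... + phi^h
damp_sum <- function(phi, h) sum(phi^seq_len(h))

phi <- 0.9                        # made-up damping parameter
h <- c(1, 5, 10, 50, 200)
multiplier <- sapply(h, damp_sum, phi = phi)
cbind(h, multiplier, limit = phi / (1 - phi))

# With phi = 1 the multiplier is h, which recovers Holt's linear trend
damp_sum(1, 5)
```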

## Example: Air passengers
\fontsize{12}{15}\sf

```{r, echo=TRUE, fig.height=3.6}
window(ausair, start=1990, end=2004) %>%
  holt(damped=TRUE, h=5, PI=FALSE) %>%
  autoplot()
```

## Example: Sheep in Asia
\fontsize{13}{15}\sf

```{r, echo=TRUE}
livestock2 <- window(livestock, start=1970,
                     end=2000)
fit1 <- ses(livestock2)
fit2 <- holt(livestock2)
fit3 <- holt(livestock2, damped = TRUE)
```

```r
accuracy(fit1, livestock)
accuracy(fit2, livestock)
accuracy(fit3, livestock)
```

## Example: Sheep in Asia
\fontsize{13}{15}\sf

```{r echo=FALSE}
tab <- matrix(NA, ncol=3,nrow=10)
colnames(tab) <- c("SES","Linear trend","Damped trend")
rownames(tab) <- c("$\\alpha$","$\\beta^*$","$\\phi$","$\\ell_0$","$b_0$",
                   "Training RMSE","Test RMSE","Test MAE","Test MAPE","Test MASE")
# SES
tab[1,1] <- fit1$model$par["alpha"]
tab[4,1] <- fit1$model$par["l"]
tab[6,1] <- sqrt(fit1$model$mse)
tab[c(7:10),1] <- accuracy(fit1,livestock)["Test set",c("RMSE","MAE","MAPE","MASE")]
# Holt
tab[1,2] <- fit2$model$par["alpha"]
tab[2,2] <- fit2$model$par["beta"]/fit2$model$par["alpha"]
tab[4,2] <- fit2$model$par["l"]
tab[5,2] <- fit2$model$par["b"]
tab[6,2] <- sqrt(fit2$model$mse)
tab[c(7:10),2] <- accuracy(fit2,livestock)["Test set",c("RMSE","MAE","MAPE","MASE")]
# Damped trend
tab[1,3] <- fit3$model$par["alpha"]
tab[2,3] <- fit3$model$par["beta"]/fit3$model$par["alpha"]
tab[3,3] <- fit3$model$par["phi"]
tab[4,3] <- fit3$model$par["l"]
tab[5,3] <- fit3$model$par["b"]
tab[6,3] <- sqrt(fit3$model$mse)
tab[c(7:10),3] <- accuracy(fit3,livestock)["Test set",c("RMSE","MAE","MAPE","MASE")]
# Convert to characters
tab <- as.data.frame(formatC(tab, format="f", digits=2))
# Remove missing values
tab <- apply(tab, 2, function(x){j <- grep("[ ]*NA",x); x[j] <- ""; return(x)})
# Show table
knitr::kable(tab, booktabs=TRUE)

```

```{r fig-7-comp}
tmp <- cbind(Data=window(livestock, start=1970),
  SES=fit1$mean, "Holt's"=fit2$mean, "Damped trend"=fit3$mean)
autoplot(tmp) + xlab("Year") +
  ylab("Livestock, sheep in Asia (millions)") +
  scale_color_manual(name="",
    values=c("#dd0000","#000000","#00dd00","#0000dd"),
    breaks=c("Data","SES","Holt's","Damped trend"))
```

## Your turn

`eggs` contains the price of a dozen eggs in the United States from 1900–1993.

 1. Use SES and Holt’s method (with and without damping) to forecast “future” data.

     [Hint: use h=100 so you can clearly see the differences between the options when plotting the forecasts.]
 1. Which method gives the best training RMSE?
 1. Are these RMSE values comparable?
 1. Do the residuals from the best fitting method look like white noise?

# Seasonal methods
## Holt-Winters additive method
\fontsize{13}{15}\sf

Holt and Winters extended Holt's method to capture seasonality.
\begin{block}{Component form}\vspace*{-0.4cm}
\begin{align*}
\pred{y}{t+h}{t} &= \ell_{t} + hb _{t} + s_{t+h-m(k+1)} \\
\ell_{t} &= \alpha(y_{t} - s_{t-m}) + (1 - \alpha)(\ell_{t-1} + b_{t-1})\\
b_{t} &= \beta^*(\ell_{t} - \ell_{t-1}) + (1 - \beta^*)b_{t-1}\\
s_{t} &= \gamma (y_{t}-\ell_{t-1}-b_{t-1}) + (1-\gamma)s_{t-m},
\end{align*}
\end{block}\fontsize{12}{14}\sf

  * $k=$ integer part of $(h-1)/m$. Ensures estimates from the final year are used for forecasting.
  * Parameters:&nbsp; $0\le \alpha\le 1$,&nbsp; $0\le \beta^*\le 1$,&nbsp; $0\le \gamma\le 1-\alpha$&nbsp;  and $m=$  period of seasonality (e.g. $m=4$ for quarterly data).
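
## Holt-Winters additive method: seasonal index
\fontsize{10}{11}\sf

The subscript $t+h-m(k+1)$ is easier to read with numbers. The sketch below uses a hypothetical helper `seasonal_lag()` and quarterly data ($m=4$) to show which seasonal state each horizon picks up.

```r
# Offset (relative to t) of the seasonal state used for an h-step forecast
seasonal_lag <- function(h, m) {
  k <- (h - 1) %/% m        # integer part of (h - 1) / m
  h - m * (k + 1)           # always one of -(m - 1), ..., 0
}

m <- 4
data.frame(h = 1:9, offset = seasonal_lag(1:9, m))
```

The offsets cycle through $-3,-2,-1,0$, so every forecast reuses a seasonal estimate from the final year of data.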

## Holt-Winters additive method

  * Seasonal component is usually expressed as
        $s_{t} = \gamma^* (y_{t}-\ell_{t})+ (1-\gamma^*)s_{t-m}.$
  * Substitute in for $\ell_t$:
        $s_{t} = \gamma^*(1-\alpha) (y_{t}-\ell_{t-1}-b_{t-1})+ [1-\gamma^*(1-\alpha)]s_{t-m}$
  * We set $\gamma=\gamma^*(1-\alpha)$.
  * The usual parameter restriction is $0\le\gamma^*\le1$, which translates to $0\le\gamma\le(1-\alpha)$.
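
## Holt-Winters additive method: checking the substitution
\fontsize{9}{10}\sf

The substitution can be verified numerically for a single update step. All values below are made up, and the variable names are chosen only for this sketch.

```r
# Made-up values for one update step
alpha <- 0.3; gamma_star <- 0.5
y_t <- 120; l_prev <- 100; b_prev <- 2; s_lag <- 15

# Level update from the additive Holt-Winters method
l_t <- alpha * (y_t - s_lag) + (1 - alpha) * (l_prev + b_prev)

# Seasonal update written with gamma*
s_usual <- gamma_star * (y_t - l_t) + (1 - gamma_star) * s_lag

# Seasonal update written with gamma = gamma* (1 - alpha)
gamma <- gamma_star * (1 - alpha)
s_model <- gamma * (y_t - l_prev - b_prev) + (1 - gamma) * s_lag

all.equal(s_usual, s_model)
```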

## Holt-Winters multiplicative method
\fontsize{13}{14}\sf

Use when seasonal variations change in proportion to the level of the series.

\begin{block}{Component form}\vspace*{-0.3cm}
    \begin{align*}
        \pred{y}{t+h}{t} &= (\ell_{t} + hb_{t})s_{t+h-m(k+1)} \\
        \ell_{t} &= \alpha \frac{y_{t}}{s_{t-m}} + (1 - \alpha)(\ell_{t-1} + b_{t-1})\\
        b_{t} &= \beta^*(\ell_{t}-\ell_{t-1}) + (1 - \beta^*)b_{t-1}        \\
        s_{t} &= \gamma \frac{y_{t}}{(\ell_{t-1} + b_{t-1})} + (1 - \gamma)s_{t-m}
    \end{align*}
\end{block}\vspace*{-0.1cm}\fontsize{11}{12}\sf

  * $k$ is integer part of $(h-1)/m$.
  * With additive method $s_t$ is in absolute terms:\newline within each year $\sum_i s_i \approx 0$.
  * With multiplicative method $s_t$ is in relative terms:\newline within each year $\sum_i s_i \approx m$.

## Example: Visitor Nights

```{r 7-HW, echo=TRUE}
aust <- window(austourists,start=2005)
fit1 <- hw(aust,seasonal="additive")
fit2 <- hw(aust,seasonal="multiplicative")
```

```{r, fig.height=3.2}
tmp <- cbind(Data=aust,
  "HW additive forecasts" = fit1[["mean"]],
  "HW multiplicative forecasts" = fit2[["mean"]])

autoplot(tmp) + xlab("Year") +
  ylab("International visitor nights in Australia (millions)") +
  scale_color_manual(name="",
    values=c('#000000','#1b9e77','#d95f02'),
    breaks=c("Data","HW additive forecasts","HW multiplicative forecasts"))
```

## Estimated components

```{r fig-7-LevelTrendSeas}
addstates <- fit1$model$states[,1:3]
multstates <- fit2$model$states[,1:3]
colnames(addstates) <- colnames(multstates) <-
  c("level","slope","season")
p1 <- autoplot(addstates, facets=TRUE) + xlab("Year") +
  ylab("") + ggtitle("Additive states")
p2 <- autoplot(multstates, facets=TRUE) + xlab("Year") +
  ylab("") + ggtitle("Multiplicative states")
gridExtra::grid.arrange(p1,p2,ncol=2)
```

## Holt-Winters damped method
Often the single most accurate forecasting method for seasonal data:
\begin{block}{}\vspace*{-0.4cm}
\begin{align*}
\pred{y}{t+h}{t} &= [\ell_{t} + (\phi+\phi^2 + \dots + \phi^{h})b_{t}]s_{t+h-m(k+1)} \\
\ell_{t} &= \alpha(y_{t} / s_{t-m}) + (1 - \alpha)(\ell_{t-1} + \phi b_{t-1})\\
b_{t} &= \beta^*(\ell_{t} - \ell_{t-1}) + (1 - \beta^*)\phi b_{t-1}       \\
s_{t} &= \gamma \frac{y_{t}}{(\ell_{t-1} + \phi b_{t-1})} + (1 - \gamma)s_{t-m}
\end{align*}
\end{block}

## Your turn

Apply Holt-Winters’ multiplicative method to the `gas` data.

 1. Why is multiplicative seasonality necessary here?
 1. Experiment with making the trend damped.
 1. Check that the residuals from the best method look like white noise.

# Taxonomy of exponential smoothing methods

## Exponential smoothing methods
\fontsize{12}{14}\sf

\begin{block}{}
\begin{tabular}{ll|ccc}
& &\multicolumn{3}{c}{\bf Seasonal Component} \\
\multicolumn{2}{c|}{\bf Trend}& N & A & M\\
\multicolumn{2}{c|}{\bf Component}  & (None)    & (Additive)  & (Multiplicative)\\
\cline{3-5} &&&&\\[-0.4cm]
N & (None) & (N,N) & (N,A) & (N,M)\\
&&&&\\[-0.4cm]
A & (Additive) & (A,N) & (A,A) & (A,M)\\
&&&&\\[-0.4cm]
A\damped & (Additive damped) & (A\damped,N) & (A\damped,A) & (A\damped,M)
\end{tabular}
\end{block}\fontsize{12}{14}\sf

\begin{tabular}{lp{9.7cm}}
\textcolor[rgb]{0.90,0.,0.00}{(N,N)}:        &Simple exponential smoothing\\
\textcolor[rgb]{0.90,0.,0.00}{(A,N)}:        &Holt's linear method\\
\textcolor[rgb]{0.90,0.,0.00}{(A\damped,N)}: &Additive damped trend method\\
\textcolor[rgb]{0.90,0.,0.00}{(A,A)}:~~ &Additive Holt-Winters' method\\
\textcolor[rgb]{0.90,0.,0.00}{(A,M)}: &Multiplicative Holt-Winters' method\\
\textcolor[rgb]{0.90,0.,0.00}{(A\damped,M)}: &Damped multiplicative Holt-Winters' method
\end{tabular}

\begin{block}{}\fontsize{12}{14}\sf
There are also multiplicative trend methods (not recommended).
\end{block}

## Recursive formulae

\placefig{0}{1.4}{width=12.8cm}{pegelstable.pdf}

## R functions
\fontsize{11.5}{13}\sf

* Simple exponential smoothing: no trend. \newline
  `ses(y)`
* Holt's method: linear trend. \newline
  `holt(y)`
* Damped trend method. \newline
  `holt(y, damped=TRUE)`
* Holt-Winters methods\newline
  `hw(y, damped=TRUE, seasonal="additive")`\newline
  `hw(y, damped=FALSE, seasonal="additive")`\newline
  `hw(y, damped=TRUE, seasonal="multiplicative")`\newline
  `hw(y, damped=FALSE, seasonal="multiplicative")`

* Combination of no trend with seasonality not possible using these functions.

# Innovations state space models

## Methods v Models

### Exponential smoothing methods

  * Algorithms that return point forecasts.

\pause

### Innovations state space models

  * Generate same point forecasts but can also generate forecast intervals.
  * A stochastic (or random) data generating process that can generate an entire forecast distribution.
  * Allow for "proper" model selection.

## ETS models

   * Each model has an \textit{observation} equation and \textit{transition} equations, one for each state (level, trend, seasonal), i.e., state space models.
   * Two models for each method: one with additive and one with multiplicative errors, i.e., in total \color{orange}{18 models}.
   * ETS(Error,Trend,Seasonal):
      * Error $=\{$A,M$\}$
      * Trend $=\{$N,A,A\damped$\}$
      * Seasonal $=\{$N,A,M$\}$.
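
## ETS models: counting the combinations
\fontsize{10}{11}\sf

The count of 18 follows from crossing the three component sets. A small sketch, writing the damped trend as the string `"Ad"`; the object name `models` is arbitrary.

```r
models <- expand.grid(Error = c("A", "M"),
                      Trend = c("N", "A", "Ad"),
                      Seasonal = c("N", "A", "M"),
                      stringsAsFactors = FALSE)
models$Name <- with(models, sprintf("ETS(%s,%s,%s)", Error, Trend, Seasonal))
nrow(models)   # 2 x 3 x 3 = 18
```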

## Exponential smoothing methods
\fontsize{12}{14}\sf

\begin{block}{}
\begin{tabular}{ll|ccc}
& &\multicolumn{3}{c}{\bf Seasonal Component} \\
\multicolumn{2}{c|}{\bf Trend}& N & A & M\\
\multicolumn{2}{c|}{\bf Component}  & ~(None)~    & (Additive)  & (Multiplicative)\\
\cline{3-5} &&&&\\[-0.3cm]
N & (None) & N,N & N,A & N,M\\
&&&&\\[-0.3cm]
A & (Additive) & A,N & A,A & A,M\\
&&&&\\[-0.3cm]
A\damped & (Additive damped) & A\damped,N & A\damped,A & A\damped,M
\end{tabular}
\end{block}

\vspace*{10cm}

## Exponential smoothing methods
\fontsize{12}{14}\sf

\begin{block}{}
\begin{tabular}{ll|ccc}
& &\multicolumn{3}{c}{\bf Seasonal Component} \\
\multicolumn{2}{c|}{\bf Trend}& N & A & M\\
\multicolumn{2}{c|}{\bf Component}  & ~(None)~    & (Additive)  & (Multiplicative)\\
\cline{3-5} &&&&\\[-0.3cm]
N & (None) & N,N & N,A & N,M\\
&&&&\\[-0.3cm]
A & (Additive) & A,N & A,A & A,M\\
&&&&\\[-0.3cm]
A\damped & (Additive damped) & A\damped,N & A\damped,A & A\damped,M
\end{tabular}
\end{block}

\begin{tabular}{l@{}p{2.3cm}@{}c@{}l}
\structure{General n\rlap{otation}}
    &       & ~E T S~  & ~:\hspace*{0.3cm}\textbf{E}xponen\textbf{T}ial \textbf{S}moothing               \\ [-0.2cm]
    & \hfill{$\nearrow$\hspace*{-0.1cm}}        & {$\uparrow$} & {\hspace*{-0.2cm}$\nwarrow$} \\
    & \hfill{\textbf{E}rror\hspace*{0.2cm}} & {\textbf{T}rend}      & {\hspace*{0.2cm}\textbf{S}easonal}
\end{tabular}
\pause\vspace*{-0.4cm}

\structure{Examples:}\newline\footnotesize\vspace*{-0.5cm}

\begin{tabular}{ll}
A,N,N: &Simple exponential smoothing with additive errors\\
A,A,N: &Holt's linear method with additive errors\\

M,A,M: &Multiplicative Holt-Winters' method with multiplicative errors
\end{tabular}

\pause
\color{orange}{\bf There are 18 separate models in the ETS framework}

## A model for SES

\begin{block}{Component form}\vspace*{-0.4cm}
\begin{align*}
\text{Forecast equation}&&\pred{y}{t+h}{t} &= \ell_{t}\\
\text{Smoothing equation}&&\ell_{t} &= \alpha y_{t} + (1 - \alpha)\ell_{t-1}
\end{align*}
\end{block}\pause
Forecast error: $e_t = y_t - \pred{y}{t}{t-1} = y_t - \ell_{t-1}$.\pause
\begin{block}{Error correction form}\vspace*{-0.4cm}
\begin{align*}
y_t &= \ell_{t-1} + e_t\\
\ell_{t}
         &= \ell_{t-1}+\alpha( y_{t}-\ell_{t-1})\\
         &= \ell_{t-1}+\alpha e_{t}
\end{align*}
\end{block}\pause\vspace*{-0.2cm}

Specify probability distribution for $e_t$, we assume $e_t = \varepsilon_t\sim\text{NID}(0,\sigma^2)$.

## ETS(A,N,N)

\begin{block}{}\vspace*{-0.4cm}
\begin{align*}
\text{Measurement equation}&& y_t &= \ell_{t-1} + \varepsilon_t\\
\text{State equation}&& \ell_t&=\ell_{t-1}+\alpha \varepsilon_t
\end{align*}
\end{block}
where $\varepsilon_t\sim\text{NID}(0,\sigma^2)$.

  * "Innovations" or "single source of error" because the same error process, $\varepsilon_t$, drives both equations.
  * Measurement equation: relationship between observations and states.
  * Transition equation(s): evolution of the state(s) through time.
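
## ETS(A,N,N): simulation sketch
\fontsize{9}{10}\sf

Because the model is a full data generating process, it can be simulated. The function `simulate_ann()` and its argument values are assumptions made for this sketch; for fitted models the forecast package provides a `simulate()` method.

```r
simulate_ann <- function(n, alpha, l0, sigma) {
  eps <- rnorm(n, mean = 0, sd = sigma)
  y <- numeric(n)
  level <- l0
  for (i in seq_len(n)) {
    y[i] <- level + eps[i]               # measurement equation
    level <- level + alpha * eps[i]      # state equation
  }
  y
}

set.seed(1)
y <- simulate_ann(n = 50, alpha = 0.5, l0 = 100, sigma = 2)  # made-up values
```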

## ETS(A,A,N)

Holt's linear method with additive errors.

  * Assume $\varepsilon_t=y_t-\ell_{t-1}-b_{t-1} \sim \text{NID}(0,\sigma^2)$.
  * Substituting into the error correction equations for Holt's linear method\vspace*{-0.2cm}
  \begin{align*}
      y_t&=\ell_{t-1}+b_{t-1}+\varepsilon_t\\
      \ell_t&=\ell_{t-1}+b_{t-1}+\alpha \varepsilon_t\\
      b_t&=b_{t-1}+\alpha\beta^* \varepsilon_t
  \end{align*}
  * For simplicity, set $\beta=\alpha \beta^*$.

## Your turn
\large

 * Write down the model for ETS(A,A\damped,N)

## ETS(A,A,A)

Holt-Winters additive method with additive errors.

\begin{block}{}\vspace*{-0.4cm}
\begin{align*}
\text{Forecast equation} && \hat{y}_{t+h|t} &= \ell_{t} + hb_{t} + s_{t+h-m(k+1)}\\
\text{Observation equation}&& y_t&=\ell_{t-1}+b_{t-1}+s_{t-m} + \varepsilon_t\\
\text{State equations}&& \ell_t&=\ell_{t-1}+b_{t-1}+\alpha \varepsilon_t\\
&&        b_t&=b_{t-1}+\beta \varepsilon_t \\
&&s_t &= s_{t-m} + \gamma\varepsilon_t
\end{align*}
\end{block}

* Forecast errors: $\varepsilon_{t} = y_t - \hat{y}_{t|t-1}$
* $k$ is integer part of $(h-1)/m$.

## Your turn
\large

 * Write down the model for ETS(A,N,A)

## ETS(M,N,N)

SES with multiplicative errors.

  * Specify relative errors  $\varepsilon_t=\frac{y_t-\pred{y}{t}{t-1}}{\pred{y}{t}{t-1}}\sim \text{NID}(0,\sigma^2)$
  * Substituting $\pred{y}{t}{t-1}=\ell_{t-1}$ gives:
    * $y_t = \ell_{t-1}+\ell_{t-1}\varepsilon_t$
    * $e_t = y_t - \pred{y}{t}{t-1} = \ell_{t-1}\varepsilon_t$

 \pause
\begin{block}{}\vspace*{-0.4cm}
\begin{align*}
\text{Measurement equation}&& y_t &= \ell_{t-1}(1 + \varepsilon_t)\\
\text{State equation}&& \ell_t&=\ell_{t-1}(1+\alpha \varepsilon_t)
\end{align*}
\end{block}
\pause

  * Models with additive and multiplicative errors with the same parameters generate the same point forecasts but different prediction intervals.
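
## ETS(M,N,N): same level as ETS(A,N,N)
\fontsize{9}{10}\sf

To see why the point forecasts agree, filter the same data through both sets of equations with the same $\alpha$ and $\ell_0$. The data and parameter values are made up for illustration.

```r
y <- c(50, 54, 47, 52, 58, 55)   # made-up positive data
alpha <- 0.4; l0 <- 50           # made-up parameter values

level_add <- level_mult <- l0
eps_add <- eps_mult <- numeric(length(y))
for (i in seq_along(y)) {
  eps_add[i]  <- y[i] - level_add                     # additive error
  eps_mult[i] <- (y[i] - level_mult) / level_mult     # relative error
  level_add  <- level_add + alpha * eps_add[i]
  level_mult <- level_mult * (1 + alpha * eps_mult[i])
}
all.equal(level_add, level_mult)   # same final level, so same point forecasts
```

Only the errors differ in scale, which is what changes the prediction intervals.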

## ETS(M,A,N)

Holt's linear method with multiplicative errors.

  * Assume $\varepsilon_t=\frac{y_t-(\ell_{t-1}+b_{t-1})}{(\ell_{t-1}+b_{t-1})}$
  * Following a similar approach as above, the innovations state space model underlying Holt's linear method with multiplicative errors is specified as\vspace*{-0.4cm}
  \begin{align*}
      y_t&=(\ell_{t-1}+b_{t-1})(1+\varepsilon_t)\\
      \ell_t&=(\ell_{t-1}+b_{t-1})(1+\alpha \varepsilon_t)\\
      b_t&=b_{t-1}+\beta(\ell_{t-1}+b_{t-1}) \varepsilon_t
  \end{align*}
  where again  $\beta=\alpha \beta^*$ and $\varepsilon_t \sim \text{NID}(0,\sigma^2)$.

## Additive error models

\placefig{0}{1.5}{width=12.8cm,trim=0 120 0 0,clip=true}{fig_7_ets_add.pdf}

## Multiplicative error models

\placefig{0}{1.5}{width=12.8cm,trim=0 120 0 0,clip=true}{fig_7_ets_multi.pdf}

## Estimating ETS models

  * Smoothing parameters $\alpha$, $\beta$, $\gamma$ and $\phi$, and the initial states $\ell_0$, $b_0$, $s_0,s_{-1},\dots,s_{-m+1}$ are estimated by maximising the "likelihood" = the probability of the data arising from the specified model.
  * For models with additive errors, this is equivalent to minimising SSE.
  * For models with multiplicative errors, it is \textbf{not} equivalent to minimising SSE.
  * We will estimate models with the \Verb|ets()| function in the forecast package.

## Innovations state space models
\fontsize{12}{14}\sf

Let $\bm{x}_t = (\ell_t, b_t, s_t, s_{t-1}, \dots, s_{t-m+1})$ and
$\varepsilon_t\stackrel{\mbox{\scriptsize iid}}{\sim}
\mbox{N}(0,\sigma^2)$.
\begin{block}{}
\begin{tabular}{lcl}
$y_t$ &=& $\underbrace{h(\bm{x}_{t-1})} +
\underbrace{k(\bm{x}_{t-1})\varepsilon_t}$\\
&& \hspace*{0.5cm}$\mu_t$ \hspace*{1.45cm} $e_t$ \\[0.2cm]
$\bm{x}_t$ &=& $f(\bm{x}_{t-1}) +
g(\bm{x}_{t-1})\varepsilon_t$\\
\end{tabular}
\end{block}

Additive errors
: \mbox{}\vspace*{-0.5cm}\newline
  $k(\bm{x}_{t-1})=1$.\qquad $y_t = \mu_{t} + \varepsilon_t$.

Multiplicative errors
: \mbox{}\vspace*{-0.5cm}\newline
  $k(\bm{x}_{t-1}) = \mu_{t}$.\qquad $y_t = \mu_{t}(1 + \varepsilon_t)$.\newline
  $\varepsilon_t = (y_t - \mu_t)/\mu_t$ is relative error.

## Innovations state space models

\structure{Estimation}\vspace*{0.5cm}

\begin{block}{}
\begin{align*}
L^*(\bm\theta,\bm{x}_0) &= n\log\!\bigg(\sum_{t=1}^n \varepsilon^2_t\bigg) + 2\sum_{t=1}^n \log|k(\bm{x}_{t-1})|\\
&= -2\log(\text{Likelihood}) + \mbox{constant}
\end{align*}
\end{block}

* Estimate parameters $\bm\theta = (\alpha,\beta,\gamma,\phi)$ and
initial states $\bm{x}_0 = (\ell_0,b_0,s_0,s_{-1},\dots,s_{-m+1})$ by
minimizing $L^*$.

## Parameter restrictions
\fontsize{12}{14}\sf

### *Usual* region

  * Traditional restrictions in the methods $0< \alpha,\beta^*,\gamma^*,\phi<1$\newline (equations interpreted as weighted averages).
  * In models we set $\beta=\alpha\beta^*$ and $\gamma=(1-\alpha)\gamma^*$.
  * Therefore $0< \alpha <1$, &nbsp;&nbsp; $0 < \beta < \alpha$ &nbsp;&nbsp; and $0< \gamma < 1-\alpha$.
  * $0.8<\phi<0.98$ --- to prevent numerical difficulties.
 \pause

### *Admissible* region

  * To prevent observations in the distant past having a continuing effect on current forecasts.
  * Usually (but not always) less restrictive than the \textit{traditional} region.
  * For example for ETS(A,N,N): \newline \textit{traditional} $0< \alpha <1$ --- \textit{admissible} is $0< \alpha <2$.

## Model selection
\begin{block}{Akaike's Information Criterion}
\[
\text{AIC} = -2\log(\text{L}) + 2k
\]
\end{block}\vspace*{-0.2cm}
where $L$ is the likelihood and $k$ is the number of parameters and initial states estimated in the model.\pause

\begin{block}{Corrected AIC}
\[
\text{AIC}_{\text{c}} = \text{AIC} + \frac{2k(k+1)}{T-k-1}
\]
\end{block}
which is the AIC corrected (for small sample bias).
\pause
\begin{block}{Bayesian Information Criterion}
\[
\text{BIC} = \text{AIC} + k(\log(T)-2).
\]
\end{block}
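
## Model selection: AIC and BIC in R
\fontsize{10}{11}\sf

A short sketch of how AIC and BIC relate, using made-up values for the log-likelihood, $k$ and $T$. The helper names `aic()` and `bic()` are hypothetical; base R's `AIC()` and `BIC()` work on fitted model objects instead.

```r
aic <- function(loglik, k) -2 * loglik + 2 * k
bic <- function(loglik, k, n) aic(loglik, k) + k * (log(n) - 2)

loglik <- -105.3; k <- 4; n <- 40   # made-up values
c(AIC = aic(loglik, k), BIC = bic(loglik, k, n))

# BIC written directly: -2 log(L) + k log(T)
all.equal(bic(loglik, k, n), -2 * loglik + k * log(n))
```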

## Automatic forecasting

**From Hyndman et al.\ (IJF, 2002):**

* Apply each model that is appropriate to the data.
Optimize parameters and initial values using MLE (or some other
criterion).
* Select best method using AICc.
* Produce forecasts using best method.
* Obtain forecast intervals using underlying state space model.

Method performed very well in M3 competition.

## Some unstable models

* Some of the combinations of (Error, Trend, Seasonal) can lead to numerical difficulties; see equations with division by a state.
* These are: ETS(A,N,M), ETS(A,A,M), ETS(A,A\damped,M).
* Models with multiplicative errors are useful for strictly positive data, but are not numerically stable with data containing zeros or negative values. In that case only the six fully additive models will be applied.

## Exponential smoothing models
\fontsize{11}{12}\sf

\begin{block}{}
\begin{tabular}{ll|ccc}
  \multicolumn{2}{l}{\alert{\bf Additive Error}} &        \multicolumn{3}{c}{\bf Seasonal Component}         \\
          \multicolumn{2}{c|}{\bf Trend}         &         N         &         A         &         M         \\
        \multicolumn{2}{c|}{\bf Component}       &     ~(None)~      &    (Additive)     & (Multiplicative)  \\ \cline{3-5}
           &                                     &                   &                   &  \\[-0.3cm]
  N        & (None)                              &       A,N,N       &       A,N,A       &    \st{A,N,M}     \\
           &                                     &                   &                   &  \\[-0.3cm]
  A        & (Additive)                          &       A,A,N       &       A,A,A       &    \st{A,A,M}     \\
           &                                     &                   &                   &  \\[-0.3cm]
  A\damped & (Additive damped)                   &   A,A\damped,N    &   A,A\damped,A    & \st{A,A\damped,M}
\end{tabular}
\end{block}

\begin{block}{}
\begin{tabular}{ll|ccc}
  \multicolumn{2}{l}{\alert{\bf Multiplicative Error}} &     \multicolumn{3}{c}{\bf Seasonal Component}      \\
             \multicolumn{2}{c|}{\bf Trend}            &      N       &         A         &        M         \\
           \multicolumn{2}{c|}{\bf Component}          &   ~(None)~   &    (Additive)     & (Multiplicative) \\ \cline{3-5}
           &                                           &              &                   &  \\[-0.3cm]
  N        & (None)                                    &    M,N,N     &       M,N,A       &      M,N,M       \\
           &                                           &              &                   &  \\[-0.3cm]
  A        & (Additive)                                &    M,A,N     &       M,A,A       &      M,A,M       \\
           &                                           &              &                   &  \\[-0.3cm]
  A\damped & (Additive damped)                         & M,A\damped,N &   M,A\damped,A    &   M,A\damped,M
\end{tabular}
\end{block}

## Example: International tourists
\fontsize{7.8}{8.7}\sf

```{r, echo=TRUE}
aust <- window(austourists, start=2005)
fit <- ets(aust)
summary(fit)
```

## Example: International tourists

Model selected: ETS(M,A,M)
\begin{align*}
y_{t} &= (\ell_{t-1} + b_{t-1})s_{t-m}(1 + \varepsilon_t)\\
\ell_t &= (\ell_{t-1} + b_{t-1})(1 + \alpha \varepsilon_t)\\
b_t &=b_{t-1} + \beta(\ell_{t-1} + b_{t-1})\varepsilon_t\\
s_t &=  s_{t-m}(1+ \gamma \varepsilon_t).
\end{align*}

$\hat\alpha=`r format(fit$par[1],nsmall=4,digits=4)`$, $\hat\beta=`r format(fit$par[2],nsmall=3,digits=3, scientific=FALSE)`$, and $\hat\gamma=`r format(fit$par[3],digits=2,nsmall=2, scientific=FALSE)`$.

## Example: International tourists

```{r MAMstates, fig.height=3.5,fig.width=6, echo=TRUE}
autoplot(fit)
```

## Example: International tourists
\fontsize{9.5}{12}\sf
```{r, echo=TRUE}
cbind('Residuals' = residuals(fit),
      'Forecast errors' = residuals(fit, type='response')) %>%
  autoplot(facets=TRUE) + xlab("Year") + ylab("")
```

## Residuals
\fontsize{16}{18}\sf

### Response residuals
$$\hat{e}_t = y_t - \hat{y}_{t|t-1}$$

### Innovation residuals
Additive error model:
$$\hat\varepsilon_t = y_t - \hat{y}_{t|t-1}$$

Multiplicative error model:
$$\hat\varepsilon_t = \frac{y_t - \hat{y}_{t|t-1}}{\hat{y}_{t|t-1}}$$
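
## Residuals: a small numerical example
\fontsize{10}{11}\sf

The two residual types differ only in scale when the error is multiplicative. The observations and one-step forecasts below are made up, and the object names are chosen for this sketch.

```r
y    <- c(102, 98, 110, 105)    # made-up observations
yhat <- c(100, 101, 104, 107)   # made-up one-step forecasts

response   <- y - yhat              # e_t, in the units of the data
innov_add  <- y - yhat              # innovation, additive error model
innov_mult <- (y - yhat) / yhat     # innovation, multiplicative error model
cbind(response, innov_add, innov_mult)
```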

## Forecasting with ETS models

\structure{Point forecasts:} iterate the equations for $t=T+1,T+2,\dots,T+h$ and set all $\varepsilon_t=0$ for $t>T$.\pause

* Not the same as $\text{E}(y_{t+h} | \bm{x}_t)$ unless trend and seasonality are both additive.
* Point forecasts for ETS(A,x,y) are identical to ETS(M,x,y) if the parameters are the same.

## Example: ETS(A,A,N)

\vspace*{-1.3cm}

\begin{align*}
y_{T+1} &= \ell_T + b_T  + \varepsilon_{T+1}\\
\hat{y}_{T+1|T} & = \ell_{T}+b_{T}\\
y_{T+2}         & = \ell_{T+1} + b_{T+1} + \varepsilon_{T+2}\\
                & =
                      (\ell_T + b_T + \alpha\varepsilon_{T+1}) +
                      (b_T + \beta \varepsilon_{T+1}) +
                      \varepsilon_{T+2} \\
\hat{y}_{T+2|T} &= \ell_{T}+2b_{T}
\end{align*}
etc.
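
## Example: ETS(A,A,N) in R
\fontsize{9}{10}\sf

The same iteration can be written as a loop that sets every future error to zero. The function `forecast_aan()` and the state values are hypothetical, used only to illustrate the recursion.

```r
forecast_aan <- function(level, slope, alpha, beta, h) {
  fc <- numeric(h)
  for (i in seq_len(h)) {
    eps <- 0                              # future errors set to zero
    fc[i] <- level + slope + eps          # observation equation
    new_level <- level + slope + alpha * eps
    slope <- slope + beta * eps
    level <- new_level
  }
  fc
}

forecast_aan(level = 20, slope = 1.5, alpha = 0.8, beta = 0.1, h = 4)
# 21.5 23.0 24.5 26.0
```

With zero errors the smoothing parameters drop out, which gives $\ell_T + hb_T$.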

## Example: ETS(M,A,N)
\fontsize{13}{16}\sf

\vspace*{-1.3cm}

\begin{align*}
y_{T+1} &= (\ell_T + b_T )(1+ \varepsilon_{T+1})\\
\hat{y}_{T+1|T} & = \ell_{T}+b_{T}.\\
y_{T+2}         & = (\ell_{T+1} + b_{T+1})(1 + \varepsilon_{T+2})\\
                & = \left\{
                    (\ell_T + b_T) (1+ \alpha\varepsilon_{T+1}) +
                    \left[b_T + \beta (\ell_T + b_T)\varepsilon_{T+1}\right]
                    \right\}
                   (1 + \varepsilon_{T+2}) \\
\hat{y}_{T+2|T} &= \ell_{T}+2b_{T}
\end{align*}
etc.

## Forecasting with ETS models

\structure{Prediction intervals:} cannot be generated using the methods, only the models.

  * The prediction intervals will differ between models with additive and multiplicative errors.
  * Exact formulae for some models.
  * More general to simulate future sample paths, conditional on the last estimate of the states, and to obtain prediction intervals from the percentiles of these simulated future paths.
  * Options are available in R using the `forecast` function in the forecast package.
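
## Forecasting with ETS models: simulated intervals
\fontsize{9}{10}\sf

The simulation approach can be sketched for ETS(A,N,N) in base R. The parameter values, final level and number of paths are made up; in practice the `forecast` function handles this for fitted models.

```r
set.seed(123)
alpha <- 0.6; sigma <- 2; level_T <- 100   # made-up values
h <- 5; nsim <- 20000

paths <- matrix(NA_real_, nrow = nsim, ncol = h)
for (s in seq_len(nsim)) {
  level <- level_T                 # condition on the last estimated state
  eps <- rnorm(h, sd = sigma)
  for (i in seq_len(h)) {
    paths[s, i] <- level + eps[i]
    level <- level + alpha * eps[i]
  }
}

# 95% interval at each horizon from the simulated percentiles
apply(paths, 2, quantile, probs = c(0.025, 0.975))
```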

## Prediction intervals
\fontsize{12}{13}\sf\vspace*{-0.2cm}

PI for most ETS models: $\hat{y}_{T+h|T} \pm c \sigma_h$, where $c$ depends on coverage probability and $\sigma_h$ is forecast standard deviation.

\fontsize{10}{12}\sf\vspace*{0.2cm}

\hspace*{-0.8cm}\begin{tabular}{ll}
\hline
(A,N,N) & $\sigma_h^2 = \sigma^2\big[1 + \alpha^2(h-1)\big]$\\
(A,A,N) & $\sigma_h^2 = \sigma^2\Big[1 + (h-1)\big\{\alpha^2 + \alpha\beta h + \frac16\beta^2h(2h-1)\big\}\Big]$\\
(A,A$_d$,N) & $\sigma_h^2 = \sigma^2\biggl[1 + \alpha^2(h-1) + \frac{\beta\phi h}{(1-\phi)^2} \left\{2\alpha(1-\phi) +\beta\phi\right\}$\\
      & \hspace*{1.5cm}$\mbox{} - \frac{\beta\phi(1-\phi^h)}{(1-\phi)^2(1-\phi^2)} \left\{ 2\alpha(1-\phi^2)+ \beta\phi(1+2\phi-\phi^h)\right\}\biggr]$\\
(A,N,A) &              $\sigma_h^2 = \sigma^2\Big[1 + \alpha^2(h-1) + \gamma k(2\alpha+\gamma)\Big]$\\
(A,A,A) &              $\sigma_h^2 = \sigma^2\Big[1 + (h-1)\big\{\alpha^2 + \alpha\beta h + \frac16\beta^2h(2h-1)\big\} + \gamma k \big\{2\alpha+ \gamma + \beta m (k+1)\big\} \Big]$\\
(A,A$_d$,A) &  $\sigma_h^2 = \sigma^2\biggl[1 + \alpha^2(h-1) +\frac{\beta\phi h}{(1-\phi)^2} \left\{2\alpha(1-\phi)  + \beta\phi \right\}$\\
  & \hspace*{1.5cm}$\mbox{} - \frac{\beta\phi(1-\phi^h)}{(1-\phi)^2(1-\phi^2)} \left\{ 2\alpha(1-\phi^2)+ \beta\phi(1+2\phi-\phi^h)\right\}$ \\
  & \hspace*{1.5cm}$\mbox{} + \gamma k(2\alpha+\gamma)  + \frac{2\beta\gamma\phi}{(1-\phi)(1-\phi^m)}\left\{k(1-\phi^m) - \phi^m(1-\phi^{mk})\right\}\biggr]$
\end{tabular}
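
## Prediction intervals: closed forms in R
\fontsize{9}{10}\sf

The first two rows of the table can be coded directly. This sketch treats the bracketed expression times $\sigma^2$ as the forecast variance, so its square root is the forecast standard deviation; the function names and numbers are assumptions for illustration.

```r
# Forecast variance for ETS(A,N,N) and ETS(A,A,N)
var_ann <- function(h, sigma2, alpha) sigma2 * (1 + alpha^2 * (h - 1))
var_aan <- function(h, sigma2, alpha, beta) {
  sigma2 * (1 + (h - 1) * (alpha^2 + alpha * beta * h +
                           beta^2 * h * (2 * h - 1) / 6))
}

h <- 1:4; sigma2 <- 4; alpha <- 0.6; point <- 100   # made-up values
se <- sqrt(var_ann(h, sigma2, alpha))
cbind(h, lower = point - qnorm(0.975) * se,
         upper = point + qnorm(0.975) * se)
```

Setting `beta = 0` in `var_aan()` reproduces `var_ann()`, as the formulas suggest.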

# ETS in R

## Example: drug sales
\fontsize{8}{8}\sf

```{r, echo=TRUE}
ets(h02)
```

## Example: drug sales
\fontsize{8}{8}\sf

```{r, echo=TRUE}
ets(h02, model="AAA", damped=FALSE)
```

## The `ets()` function

* Automatically chooses a model by default using the AIC, AICc or BIC.
* Can handle any combination of trend, seasonality and damping.
* Ensures the parameters are admissible (equivalent to invertible).
* Produces an object of class "ets".

## `ets` objects

* **Methods:** `coef()`, `autoplot()`, `plot()`, `summary()`, `residuals()`, `fitted()`, `simulate()` and `forecast()`
* `autoplot()` shows time plots of the original time series along with the extracted components (level, growth and seasonal).

## Example: drug sales
```{r, echo=TRUE, fig.height=4}
h02 %>% ets() %>% autoplot()
```

## Example: drug sales

```{r, echo=TRUE, fig.height=4}
h02 %>% ets() %>% forecast() %>% autoplot()
```

## Example: drug sales
\fontsize{11}{13}\sf

```{r, echo=TRUE}
h02 %>% ets() %>% accuracy()

h02 %>% ets(model="AAA", damped=FALSE) %>% accuracy()
```

## The `ets()` function
\fontsize{8}{10}\sf

The `ets()` function can also refit a previously fitted model to a new data set.

```{r, echo=TRUE}
train <- window(h02, end=c(2004,12))
test <- window(h02, start=2005)
fit1 <- ets(train)
fit2 <- ets(test, model = fit1)
accuracy(fit2)
accuracy(forecast(fit1,10), test)
```

## The `ets()` function in R
\fontsize{12}{13}\sf

```r
ets(y, model = "ZZZ", damped = NULL,
  additive.only = FALSE,
  lambda = NULL, biasadj = FALSE,
  lower = c(rep(1e-04, 3), 0.8),
  upper = c(rep(0.9999, 3), 0.98),
  opt.crit = c("lik","amse","mse","sigma","mae"),
  nmse = 3,
  bounds = c("both", "usual", "admissible"),
  ic = c("aicc", "aic", "bic"),
  restrict = TRUE,
  allow.multiplicative.trend = FALSE, ...)
```

## The `ets()` function in R
\fontsize{13}{14}\sf

* `y` \newline The time series to be forecast.
* `model` \newline A three-letter code using the ETS classification and notation: "N" for none, "A" for additive, "M" for multiplicative, or "Z" for automatic selection. With the default `"ZZZ"`, all components are selected using the information criterion.
* `damped`
  * If `damped=TRUE`, then a damped trend will be used (either A\damped\ or M\damped).
  * If `damped=FALSE`, then a non-damped trend will be used.
  * If `damped=NULL` (default), then either a damped or a non-damped trend will be selected according to the information criterion chosen.
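
## `model` and `damped`: a sketch
\fontsize{10}{12}\sf

A short illustration of how `model` and `damped` could be combined, assuming the `h02` series from `fpp2`. The object names are arbitrary, and the particular model forms are chosen only for illustration.

```r
library(fpp2)

# Fully automatic: error, trend, seasonality and damping chosen by the IC
fit_auto <- ets(h02)

# Fixed form: multiplicative error, additive damped trend,
# multiplicative seasonality
fit_madm <- ets(h02, model = "MAM", damped = TRUE)

# Fix only the error type; trend and seasonality are selected,
# but damped trends are ruled out
fit_mzz <- ets(h02, model = "MZZ", damped = FALSE)

fit_auto$method
fit_madm$method
fit_mzz$method
```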

## The `ets()` function in R
\fontsize{13}{14.5}\sf\vspace*{-0.2cm}

  * `additive.only`\newline
      Only models with additive components will be considered if \Verb"additive.only=TRUE". Otherwise all models will be considered.
  * `lambda`\newline
      Box-Cox transformation parameter. It will be ignored if `lambda=NULL` (default). Otherwise, the time series will be transformed before the model is estimated. When `lambda` is not `NULL`, `additive.only` is set to `TRUE`.
  * `biasadj`\newline
      Uses bias-adjustment when undoing Box-Cox transformation for fitted values.
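
## `additive.only`, `lambda` and `biasadj`: a sketch
\fontsize{10}{12}\sf

The following sketch shows how these three arguments might be passed, again assuming the `h02` series from `fpp2`. The choice `lambda = 0` (a log transformation) is an assumption made for illustration, not a recommendation for this series.

```r
library(fpp2)

# Restrict the search to purely additive models
fit_add <- ets(h02, additive.only = TRUE)

# Log transformation (lambda = 0); only additive models are considered
fit_log <- ets(h02, lambda = 0)

# As above, with bias-adjusted back-transformation of fitted values
fit_log_adj <- ets(h02, lambda = 0, biasadj = TRUE)

fit_add$method
fit_log$method
```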

## The `ets()` function in R
\fontsize{12}{13}\sf\vspace*{-0.2cm}

* `lower,upper` bounds for the parameter estimates of $\alpha$, $\beta^*$, $\gamma^*$ and $\phi$.
* `opt.crit="lik"` (default) optimisation criterion used for estimation.
* `bounds` Constraints on the parameters.

    * \textit{usual} region -- `bounds="usual"`;
    * \textit{admissible} region -- `bounds="admissible"`;
    * `bounds="both"` (default) requires the parameters to satisfy both sets of constraints.

* `ic="aicc"` (default) information criterion to be used in selecting models.
* `restrict=TRUE` (default) models that cause numerical problems are not considered in model selection.
* `allow.multiplicative.trend` allows models with a multiplicative trend.
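
## Estimation and selection options: a sketch
\fontsize{10}{12}\sf

A sketch of how the estimation and selection arguments above might be varied, assuming the `h02` series from `fpp2`. The object names and the fixed `"AAA"` form are illustrative choices only.

```r
library(fpp2)

# Select the model using the BIC instead of the default AICc
fit_bic <- ets(h02, ic = "bic")

# Estimate a fixed model by minimising the MSE
# rather than maximising the likelihood
fit_mse <- ets(h02, model = "AAA", damped = FALSE, opt.crit = "mse")

# Impose only the admissibility constraints on the parameters
fit_adm <- ets(h02, model = "AAA", damped = FALSE, bounds = "admissible")

# Allow multiplicative trends in the automatic search
fit_mt <- ets(h02, allow.multiplicative.trend = TRUE)
```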

## The `forecast()` function in R
\fontsize{13}{15}\sf\vspace*{-0.2cm}

```r
  forecast(object,
    h=ifelse(object$m>1, 2*object$m, 10),
    level=c(80,95), fan=FALSE,
    simulate=FALSE, bootstrap=FALSE,
    npaths=5000, PI=TRUE,
    lambda=object$lambda, biasadj=FALSE,...)
```

* `object`: the object returned by the \Verb|ets()| function.
* `h`: the number of periods to be forecast.
* `level`: the confidence level for the prediction intervals.
* `fan`: if `fan=TRUE`, `level` is set to `seq(51,99,by=3)`, which is suitable for fan plots.
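
## `h`, `level` and `fan`: a sketch
\fontsize{10}{12}\sf

The arguments above could be used as in the following sketch, which assumes the `h02` series from `fpp2`. The horizon of 36 months is an arbitrary choice.

```r
library(fpp2)

fit <- ets(h02)

# Three years of monthly forecasts with 80% and 95% prediction intervals
fc <- forecast(fit, h = 36, level = c(80, 95))
fc$mean    # point forecasts
fc$lower   # lower limits, one column per level
fc$upper   # upper limits, one column per level

# Many intervals at once, for a fan plot
fc_fan <- forecast(fit, h = 36, fan = TRUE)
autoplot(fc_fan)
```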

## The `forecast()` function in R
\fontsize{13}{14}\sf\vspace*{-0.2cm}

* `simulate`: If `TRUE`, prediction intervals are generated via simulation rather than analytic formulae. Even if `FALSE`, simulation will be used if no algebraic formulae exist.
* `bootstrap`: If `bootstrap=TRUE` and \Verb"simulate=TRUE", then simulated prediction intervals use re-sampled errors rather than normally distributed errors.
* `npaths`: The number of sample paths used in computing simulated prediction intervals.
* `PI`: If `PI=TRUE`, then prediction intervals are produced; otherwise only point forecasts are calculated. If `PI=FALSE`, then `level`, `fan`, `simulate`, `bootstrap` and `npaths` are all ignored.
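
## Simulated intervals: a sketch
\fontsize{10}{12}\sf

A sketch of simulated and bootstrapped prediction intervals, assuming the `h02` series from `fpp2`. The seed and the reduced `npaths` value are arbitrary choices made to keep the example quick.

```r
library(fpp2)

fit <- ets(h02)

# Intervals from simulated sample paths using re-sampled residuals
set.seed(2024)
fc_boot <- forecast(fit, h = 12, simulate = TRUE,
                    bootstrap = TRUE, npaths = 1000)

# Point forecasts only; the interval-related arguments are ignored
fc_point <- forecast(fit, h = 12, PI = FALSE)
fc_point$mean
```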

## The `forecast()` function in R

* `lambda`: The Box-Cox transformation parameter. Ignored if `lambda=NULL`. Otherwise, forecasts are back-transformed via inverse Box-Cox transformation.
* `biasadj`: If `biasadj=TRUE`, the back-transformed forecasts are bias-adjusted so that they are means rather than medians.

## Your turn

* Use `ets()` on some of these series:\vspace*{0.2cm}

  > `bicoal`, `chicken`, `dole`, `usdeaths`, `bricksq`, `lynx`, `ibmclose`, `eggs`, `ausbeer`

* Does it always give good forecasts?

* Find an example where it does not work well. Can you figure out why?