# Classification
**Learning objectives:**
- Compare and contrast **classification** with linear regression.
- Perform classification using **logistic regression**.
- Perform classification using **linear discriminant analysis (LDA)**.
- Perform classification using **quadratic discriminant analysis (QDA)**.
- Perform classification using **naive Bayes**.
- Identify the **strengths and weaknesses** of the various classification models.
- Model count data using **Poisson regression**.
## An Overview of Classification
- **Classification**: approaches for making inferences about and/or predicting a qualitative (categorical) response variable
- A few common classification techniques (classifiers):
- logistic regression
- linear discriminant analysis (LDA)
- quadratic discriminant analysis (QDA)
- naive Bayes
- K-nearest neighbors
<br>
- **Examples of classification problems:**
<br>
1. A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
- Predictor variable: Symptoms
    - Response variable: Type of medical condition
<br>
2. An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
    - Predictor variables: User's IP address, past transaction history, etc.
- Response variable: Fraudulent activity (Yes/No)
<br>
3. On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.
- Predictor variable: DNA sequence data
- Response variable: Presence of deleterious gene (Yes/No)
<br>
- In the following section, we are going to explore the `Default` dataset. The annual incomes ($X_1$ = `income`) and monthly credit card balances ($X_2$ = `balance`) are used to predict whether an individual will default on his or her credit card payment.
```{r fig4-1, cache=FALSE, echo=FALSE, fig.align="center", fig.cap="The distribution of balance and income split by the binary default variable respectively; Note. Defaulters represented as orange plus sign; non-defaulters represented as blue circle"}
knitr::include_graphics("./images/fig4_1.jpg", error = FALSE)
```
## Why NOT Linear Regression?
- there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression; any numeric coding, such as the one below, implies an ordering and equal spacing between the levels
$$Y = \left\{ \begin{array}{ll}
1 & \mbox{if stroke};\\
2 & \mbox{if epileptic seizure};\\
3 & \mbox{if drug overdose}.\end{array} \right.$$
- even with just two classes, linear regression does not provide meaningful estimates of $Pr(Y|X)$: some estimates may fall outside the [0, 1] probability interval
```{r fig4-2, cache=FALSE, echo=FALSE, fig.align="center", fig.cap="Classification using the Default data. Left: Estimated probability of default using linear regression. Some estimated probabilities are negative! The orange ticks indicate the 0/1 values coded for default(No or Yes). Right: Predicted probabilities of default using logistic regression. All probabilities lie between 0 and 1."}
knitr::include_graphics("./images/fig4_2.jpg", error = FALSE)
```
## Logistic Regression
- **Logistic regression**: models the probability that $Y$ belongs to a particular category, given the predictor(s) $X$
- here $Y$ is binary (0/1)
$$p(X) = \beta_0 + \beta_1X \space \Longrightarrow {Linear \space regression}$$
$$p(X) = \frac{e^{\beta_{0} + \beta_{1}X}}{1 + e^{\beta_{0} + \beta_{1}X}} \space \Longrightarrow {Logistic \space function}$$
$$\frac{p(X)}{1 - p(X)} = e^{\beta_{0} + \beta_{1}X} \Longrightarrow {odds, \space taking \space values \space in \space (0, ∞)}$$
Taking the log of the odds gives
$$\log \biggl(\frac{p(X)}{1- p(X)}\biggr) = \beta_{0} + \beta_{1}X \Longrightarrow {log \space odds/logit}$$
To estimate the regression coefficients, we use **maximum likelihood (ML)**.
***Likelihood Function***
$$\ell(\beta_{0}, \beta_{1}) = \prod_{i: y_{i}= 1} p(x_i) \prod_{i': y_{i'}= 0} (1- p(x_{i'})) \Longrightarrow {Likelihood \space function}$$
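To make this concrete, here is a minimal fitting sketch in R. It assumes the `ISLR2` package (which ships the `Default` data used in this chapter) is installed; the book's lab code may differ.
```{r logistic-sketch, eval=FALSE}
# Simple logistic regression on the Default data (sketch; assumes ISLR2 is installed)
library(ISLR2)

# Model the probability of default as a function of credit card balance
fit_logit <- glm(default ~ balance, data = Default, family = binomial)
summary(fit_logit)$coefficients

# Predicted probability of default at balances of $1,000 and $2,000
predict(fit_logit, newdata = data.frame(balance = c(1000, 2000)),
        type = "response")
```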
### Multiple Logistic Regression
$$\log \biggl(\frac{p(X)}{1- p(X)}\biggr) = \beta_{0} + \beta_{1}X_1 + ... + \beta_{p}X_p \\ \Downarrow \\ p(X) = \frac{e^{\beta_{0} + \beta_{1}X_1 + ... + \beta_{p}X_p}}{1 + e^{\beta_{0} + \beta_{1}X_1 + ... + \beta_{p}X_p}}$$
```{r fig4-3, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Confounding in the Default data. Left: Default rates are shown for students (orange) and non-students (blue). The solid lines display default rate as a function of balance, while the horizontal broken lines display the overall default rates. Right: Boxplots of balance for students (orange) and non-students (blue) are shown."}
knitr::include_graphics("./images/fig4_3.jpg", error = FALSE)
```
### Multinomial Logistic Regression
- This is used when there are K > 2 classes. In multinomial logistic regression, we select a single class to serve as the baseline.
- However, the interpretation of the coefficients in a multinomial logistic regression model must be done with care, since it is tied to the choice of baseline.
- Alternatively, we can use _softmax coding_, where we _treat all K classes symmetrically_ rather than selecting a baseline: we estimate coefficients for all K classes, rather than for K − 1 classes (a minimal fitting sketch follows).
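As a hedged sketch of the multinomial case, `nnet::multinom()` is one common R implementation; the `iris` data is used below only because it has three classes and is not one of the chapter's data sets.
```{r multinom-sketch, eval=FALSE}
# Multinomial logistic regression sketch with nnet::multinom()
# (one of several implementations; it uses the first factor level as the baseline class)
library(nnet)

fit_multi <- multinom(Species ~ Sepal.Length + Sepal.Width,
                      data = iris, trace = FALSE)
coef(fit_multi)                           # one row of coefficients per non-baseline class
head(predict(fit_multi, type = "probs"))  # estimated class probabilities
```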
## Generative Models for Classification
**Why is logistic regression not always ideal?**
- When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are surprisingly unstable.
- If the distribution of the predictors X is approximately normal in each of the classes and the sample size is small, then generative modelling may be more accurate than logistic regression.
- Generative modelling can be naturally extended to the case of more than two response classes.
<br>
**Common notations:**
<br>
- $K \Longrightarrow$ the number of response classes
- $π_k \Longrightarrow$ the overall or _prior_ probability that a randomly chosen observation comes from the kth class; easily estimated from a random sample from the population
- $f_k(X) ≡ Pr(X|Y = k) \Longrightarrow$ the _density function_ of X for an observation that comes from the kth class; estimating it requires some underlying assumptions
<br>
Bayes’ theorem states that
$$Pr(Y = k|X = x) = \frac {π_k f_k(x)}{\sum_{l =1}^{K} π_lf_l(x)}$$
- $p_k(x) = Pr(Y = k|X = x) \Longrightarrow$ the _posterior probability_ that an observation $X = x$ belongs to the kth class; computed by plugging estimates of $π_k$ and $f_k(x)$ into Bayes' theorem (a small numeric sketch follows)
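To see Bayes' theorem in action, here is a tiny numeric sketch; all parameter values below are invented purely for illustration.
```{r bayes-sketch, eval=FALSE}
# Posterior probabilities for K = 2 classes and one predictor (illustrative values)
pi_k  <- c(0.8, 0.2)          # prior probabilities pi_1, pi_2
mu_k  <- c(0, 3)              # class-specific means
sigma <- 1                    # shared standard deviation
x     <- 1.5                  # a new observation

f_k  <- dnorm(x, mean = mu_k, sd = sigma)   # f_k(x): density of x within each class
post <- pi_k * f_k / sum(pi_k * f_k)        # Pr(Y = k | X = x) via Bayes' theorem
post
which.max(post)                             # the Bayes classifier picks this class
```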
## A Comparison of Classification Methods
Each of the classifiers below uses a different estimate of $f_k(x)$:
- linear discriminant analysis
- quadratic discriminant analysis
- naive Bayes
### Linear Discriminant Analysis for p = 1
- one predictor
- classify an observation to the class for which $p_k(x)$ is greatest
**Assumptions:**
- we assume that $f_k(x)$ is normal (Gaussian) with a class-specific mean, and
- a shared variance term across all K classes [$\sigma^2_1 = \cdots = \sigma^2_K$]
The normal density takes the form
$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\left(- \frac{1}{2\sigma_k^2}(x- \mu_k)^2\right)$$
Then, the posterior probability (probability that the observation belongs to the kth class, given the predictor value for that observation) is
$$p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(- \frac{1}{2\sigma^2}(x- \mu_k)^2\right)}{\sum^K_{l=1} \pi_l \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(- \frac{1}{2\sigma^2}(x- \mu_l)^2\right)}$$
**Additional mathematical formula**
After taking the log of the above equation and rearranging, we get the formula below. With two classes and equal priors, the Bayes classifier assigns an observation to class 1 if $2x (\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2$, and to class 2 otherwise.
$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k) \Longrightarrow {Equation \space 4.18}$$
When the priors are equal, the Bayes decision boundary is the point for which $\delta_1(x) = \delta_2(x)$:
$$x = \frac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} = \frac{\mu_1 + \mu_2}{2}$$
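A quick numeric sketch of Equation 4.18, using illustrative parameter values chosen here for the sake of example ($\mu_1 = -1.25$, $\mu_2 = 1.25$, $\sigma^2 = 1$, equal priors); these are assumptions for the sketch, not results from the text.
```{r lda-p1-sketch, eval=FALSE}
# Discriminant functions for p = 1 with two classes (illustrative parameter values)
mu     <- c(-1.25, 1.25)
sigma2 <- 1
pi_k   <- c(0.5, 0.5)

delta <- function(x, k) x * mu[k] / sigma2 - mu[k]^2 / (2 * sigma2) + log(pi_k[k])

x <- 0.4
c(delta(x, 1), delta(x, 2))   # assign x to the class with the larger delta
(mu[1] + mu[2]) / 2           # Bayes decision boundary when the priors are equal
```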
```{r fig4-4, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Left: Two one-dimensional normal density functions are shown. The dashed vertical line represents the Bayes decision boundary. Right: 20 observations were drawn from each of the two classes, and are shown as histograms. The Bayes decision boundary is again shown as a dashed vertical line. The solid vertical line represents the LDA decision boundary estimated from the training data."}
knitr::include_graphics("./images/fig4_4.jpg", error = FALSE)
```
The **linear discriminant analysis (LDA)** method approximates the Bayes classifier by plugging estimates for $\pi_k$, $\mu_k$, and $\sigma^2$ into Equation 4.18.
$\hat μ_k$ is the average of all the training observations from the kth class
$$\hat{\mu}_{k} = \frac{1}{n_{k}}\sum_{i: y_{i}= k} x_{i}$$
$\hat σ^2$ is the weighted average of the sample variances for each of the K classes
$$\hat{\sigma}^2 = \frac{1}{n - K} \sum_{k = 1}^{K} \sum_{i: y_{i}= k} (x_{i} - \hat{\mu}_{k})^2$$
Note:
- $n$ = total number of training observations
- $n_k$ = number of training observations in the kth class
- $\hat\pi_k$ is estimated as the proportion of training observations that belong to the kth class: $\hat\pi_k = \frac{n_k}{n}$
The LDA classifier assigns an observation $X = x$ to the class for which $\hat\delta_k(x)$ is largest.
$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k) \Longrightarrow {Equation \space 4.18} \\ \Downarrow \\ \hat \delta_k(x) = x \cdot \frac{\hat \mu_k}{\hat \sigma^2} - \frac{\hat \mu_k^2}{2\hat \sigma^2} + \log(\hat \pi_k)$$
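A minimal LDA sketch with `MASS::lda()`, again assuming the `ISLR2` package for the `Default` data (the predictors are chosen here just for illustration).
```{r lda-sketch, eval=FALSE}
# LDA on the Default data (sketch)
library(ISLR2)
library(MASS)

fit_lda <- lda(default ~ balance + income, data = Default)
fit_lda$prior    # estimated pi_k: class proportions in the training data
fit_lda$means    # estimated class-specific means mu_k

pred_lda <- predict(fit_lda)
table(Predicted = pred_lda$class, True = Default$default)   # confusion matrix
```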
### Linear Discriminant Analysis for p > 1
- multiple predictors; p > 1 predictors
- observations come from a multivariate Gaussian (multivariate normal) distribution, with a **class-specific mean vector** and a common **covariance matrix**: $X \sim N(\mu_k, \Sigma)$
**Assumptions:**
- each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors
```{r fig4-5, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Two multivariate Gaussian density functions are shown, with p = 2. Left: The two predictors are uncorrelated and it has a circular base. Var(X_1) = Var(X_2) and Cor(X_1,X_2) = 0; Right: The two variables have a correlation of 0.7 with a elliptical base"}
knitr::include_graphics("./images/fig4_5.jpg", error = FALSE)
```
The multivariate Gaussian density is defined as:
$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)\right)$$
The Bayes classifier assigns an observation $X = x$ to the class for which $\delta_k(x)$ is largest.
$$\delta_k(x) = x^T \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k \Longrightarrow {vector/matrix \space version} \\ \delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k) \Longrightarrow {Equation \space 4.18 \space (p = 1)}$$
```{r fig4-6, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "An example with three classes. The observations from each class are drawn from a multivariate Gaussian distribution with p = 2, with a class-specific mean vector and a common covariance matrix. Left: Ellipses that contain 95% of the probability for each of the three classes are shown. The dashed lines are the Bayes decision boundaries. Right: 20 observations were generated from each class, and the corresponding LDA decision boundaries are indicated using solid black lines. The Bayes decision boundaries are once again shown as dashed lines. Overall, the LDA decision boundaries are pretty close to the Bayes decision boundaries, shown again as dashed lines. The test error rates for the Bayes and LDA classifiers are 0.0746 and 0.0770, respectively."}
knitr::include_graphics("./images/fig4_6.jpg", error = FALSE)
```
All classification models have a training error rate, which can be summarized with a **confusion matrix**.
**Caveats of the error rate:**
- training error rates will usually be lower than test error rates, which are the real quantity of interest; the higher the ratio of the number of parameters _p_ to the number of samples _n_, the more we expect this _overfitting_ to play a role
- in the Default example, the trivial null classifier (always predicting "no default") achieves an error rate only a bit higher than the LDA training error rate
- a binary classifier such as this one can make two types of errors (Type I and II)
- Class-specific performance _(sensitivity and specificity)_ is important in certain fields (e.g., medicine)
In the Default example, LDA has low sensitivity (it misses many defaulters) because:
1. LDA approximates the Bayes classifier, which has the lowest _total_ error rate out of all classifiers.
2. The Bayes classifier therefore yields the smallest possible total number of misclassified observations, regardless of the class from which the errors stem.
3. It uses a threshold of 50% for the posterior probability of default in order to assign an observation to the default class; lowering the threshold (e.g., to 20%) increases sensitivity at the cost of more false positives:
$$Pr(default = Yes|X = x) > 0.5 \quad \text{vs.} \quad Pr(default = Yes|X = x) > 0.2$$
```{r fig4-7, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "The figure illustrates the trade-off that results from modifying the threshold value for the posterior probability of default. For the Default data set, error rates are shown as a function of the threshold value for the posterior probability that is used to perform the assignment. The black solid line displays the overall error rate. The blue dashed line represents the fraction of defaulting customers that are incorrectly classified, and the orange dotted line indicates the fraction of errors among the non-defaulting customers."}
knitr::include_graphics("./images/fig4_7.jpg", error = FALSE)
```
- As the threshold is reduced, the error rate among individuals who default decreases steadily, but the error rate among the individuals who do not default increases. The decision on the threshold must be based on **domain knowledge** (e.g., detailed information about the costs associated with default)
- The ROC curve is a way to illustrate the two types of errors at all possible thresholds.
```{r fig4-8, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "The true positive rate is the sensitivity: the fraction of defaulters that are correctly identified, using a given threshold value. The false positive rate is 1-specificity: the fraction of non-defaulters that we classify incorrectly as defaulters, using that same threshold value. The ideal ROC curve hugs the top left corner, indicating a high true positive rate and a low false positive rate. The dotted line represents the “no information” classifier; this is what we would expect if student status and credit card balance are not associated with probability of default."}
knitr::include_graphics("./images/fig4_8.jpg", error = FALSE)
```
An ideal ROC curve hugs the top left corner, so the larger the **area under the ROC curve (AUC)**, the better the classifier.
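The threshold trade-off can be explored directly by sweeping the threshold over the LDA posterior probabilities. This sketch reuses the `fit_lda` object from the LDA sketch above (an assumption; any fitted classifier that returns posterior probabilities would do).
```{r threshold-sketch, eval=FALSE}
# Sensitivity/specificity at several thresholds for Pr(default = Yes | X)
post_yes <- predict(fit_lda)$posterior[, "Yes"]
truth    <- Default$default == "Yes"

rates <- sapply(c(0.5, 0.2, 0.1), function(thr) {
  pred_yes <- post_yes > thr
  c(threshold   = thr,
    overall_err = mean(pred_yes != truth),
    sensitivity = mean(pred_yes[truth]),     # defaulters correctly flagged
    specificity = mean(!pred_yes[!truth]))   # non-defaulters correctly passed
})
t(rates)
```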
```{r tbl4_6, cache=FALSE, echo=FALSE, fig.align="center", fig.cap="Possible results when applying a classifier or diagnostic test to a population"}
library("htmlTable")
library("magrittr")
matrix(c("True Neg. (TN)", "False Pos. (FP)", "N", "False Neg. (FN)", "True Pos. (TP)", "P", "N∗", "P∗", ""),
ncol = 3,
dimnames = list("Predicted class" = c(" − or Null", " + or Non-null", "Total"),
"True class" = c("Neg. or Null", "Pos. or Non-null", "Total"))) %>%
addHtmlTableStyle(align = "lcr") %>%
htmlTable
```
Important measures for classification and diagnostic testing:
- **False Positive rate (FP/N)** $\Longrightarrow$ Type I error, 1−Specificity
- **True Positive rate (TP/P)** $\Longrightarrow$ 1−Type II error, power, sensitivity, recall
- **Pos. Predicted value (TP/P∗)** $\Longrightarrow$ Precision, 1−false discovery proportion
- **Neg. Predicted value (TN/N∗)**
### Quadratic Discriminant Analysis (QDA)
- Like LDA, QDA assumes that observations from each class are drawn from a Gaussian distribution, and plugs estimates for the parameters into Bayes' theorem in order to perform prediction
- Unlike LDA, QDA assumes that each class has its own covariance matrix
$$X \sim N(\mu_k,\Sigma_k) \Longrightarrow {\Sigma_k \space is \space the \space covariance \space matrix \space for \space the \space kth \space class}$$
**Bayes classifier**
$$\delta_k(x) = - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log|\Sigma_k| + \log(\pi_k) \\ \Downarrow \\ \delta_k(x) = - \frac{1}{2}x^T \Sigma_k^{-1}x + x^T \Sigma_k^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma_k^{-1} \mu_k - \frac{1}{2}\log|\Sigma_k| + \log \pi_k$$
QDA classifier involves plugging estimates for **$Σ_k$, $μ_k$, and $π_k$** into the above equation, and then assigning an observation X = x to the class for which this quantity is **largest**.
The quantity $x$ appears as a quadratic function in $\delta_k(x)$, hence the name _quadratic_ discriminant analysis.
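A minimal QDA sketch with `MASS::qda()`, under the same assumptions as the LDA sketch above (ISLR2 installed, illustrative choice of predictors).
```{r qda-sketch, eval=FALSE}
# QDA on the Default data (sketch); qda() estimates a separate covariance matrix per class
library(ISLR2)
library(MASS)

fit_qda  <- qda(default ~ balance + income, data = Default)
pred_qda <- predict(fit_qda)$class
mean(pred_qda != Default$default)                 # training error rate
table(Predicted = pred_qda, True = Default$default)
```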
<br>
**When is LDA preferred to QDA, or vice versa?**
<br>
1. **Bias-variance trade-off**
<br>
- Pro LDA: LDA assumes that the K classes share a common covariance matrix, so the discriminant becomes linear in $x$ and there are only $Kp$ linear coefficients to estimate. LDA is a much less flexible classifier than QDA, and so has substantially *lower variance*, which can improve prediction performance.
- Con LDA: if the assumption that the K classes share a common covariance matrix is badly off, LDA can suffer from *high bias*.
- Conclusion: use LDA when there are relatively few training observations; use QDA when the training set is very large or when a common covariance matrix is clearly untenable.
```{r fig4-9, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Left: The Bayes (purple dashed), LDA (black dotted), and QDA (green solid) decision boundaries for a two-class problem with Σ1 = Σ2. The shading indicates the QDA decision rule. Since the Bayes decision boundary is linear, it is more accurately approximated by LDA than by QDA. Right: Details are as given in the left-hand panel, except that Σ1 ̸= Σ2. Since the Bayes decision boundary is non-linear, it is more accurately approximated by QDA than by LDA."}
knitr::include_graphics("./images/fig4_9.jpg", error = FALSE)
```
### Naive Bayes
- Estimating a p-dimensional density function is challenging; naive Bayes makes a different assumption from LDA and QDA.
- It is an alternative to LDA that does not assume normally distributed predictors.
$$f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p),$$
where $f_{kj}$ is the density function of the jth predictor among observations in the kth class.
*Within the kth class, the p predictors are assumed to be independent.*
**Why is the naive Bayes assumption powerful?**
1. By assuming that the p covariates are independent within each class, we assume that there is no association between the predictors. When estimating a p-dimensional density function, the hard part is modelling the *joint distribution* of the predictors, not just the *marginal distribution* of each one; the independence assumption removes the need to model those associations.
2. Although the p covariates are often not truly independent within each class, the assumption is convenient and gives pretty decent results, especially when n is small and p is large.
3. It reduces variance, though it introduces some bias (the bias-variance trade-off).
**Options for estimating the one-dimensional density function $f_{kj}$ from the training data**
1. [For quantitative $X_j$] -> Assume $X_j |Y = k \sim N(\mu_{jk},\sigma_{jk}^2)$: within each class, the jth predictor is drawn from a (univariate) normal distribution. This amounts to **QDA with a diagonal class-specific covariance matrix**.
2. [For quantitative $X_j$] -> Use a *non-parametric estimate* for $f_{kj}$: make a histogram of the within-class observations of the jth predictor and estimate $f_{kj}(x_j)$ from it, or use a **kernel density estimator**.
3. [For qualitative $X_j$] -> Count the proportion of training observations for the jth predictor corresponding to each class.
Note: at a fixed threshold on the Default data, naive Bayes has a higher overall error rate than LDA but higher sensitivity (it correctly flags more defaulters).
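A minimal naive Bayes sketch using `e1071::naiveBayes()`, one common R implementation (assumed installed). Quantitative predictors are modelled with class-conditional univariate Gaussians (option 1 above) and factor predictors with within-class proportions (option 3).
```{r nb-sketch, eval=FALSE}
# Naive Bayes on the Default data (sketch)
library(ISLR2)
library(e1071)

fit_nb  <- naiveBayes(default ~ balance + income + student, data = Default)
pred_nb <- predict(fit_nb, Default)                 # predicted classes
head(predict(fit_nb, Default, type = "raw"))        # posterior probabilities
table(Predicted = pred_nb, True = Default$default)
```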
## Summary of the classification methods
### An Analytical Comparison
- **LDA** and **logistic regression** assume that the log odds of the posterior probabilities is _linear_ in x.
- **QDA** assumes that the log odds of the posterior probabilities is _quadratic_ in x.
- **LDA** is simply a restricted version of QDA with $Σ_1 = · · · = Σ_K = Σ$
- Under particular assumptions, **LDA** is a special case of naive Bayes, and naive Bayes is a special case of LDA!
- **LDA** assumes that the features are normally distributed with a common within-class covariance matrix, and naive Bayes instead assumes _independence_ of the features.
- **Naive Bayes** can produce a more _flexible_ fit.
- **QDA** might be more accurate in settings where interactions among the predictors are important in discriminating between classes.
- **LDA** tends to outperform **logistic regression** when the observations within each class are approximately normal.
- **K-nearest neighbors (KNN)** will be the better classifier when the decision boundary is highly non-linear, n is large, and p is small.
- **KNN** has low bias but high variance; as such, KNN requires a lot of observations relative to the number of predictors.
- If the decision boundary is non-linear but n is only modest or p is not very small, then QDA may be preferred to KNN.
- KNN does not tell us which predictors are important!
<br>
_Final note._ The choice of method depends on (1) the true distribution of the predictors in each of the K classes and (2) the values of n and p, i.e., the bias-variance trade-off.
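Since KNN features prominently in the comparison above, here is a minimal sketch with `class::knn()`; the train/test split and choice of k are arbitrary, and the predictors are standardized first because KNN is distance-based.
```{r knn-sketch, eval=FALSE}
# KNN on the Default data (sketch)
library(ISLR2)
library(class)

X <- scale(Default[, c("balance", "income")])   # standardize the predictors
set.seed(1)
train <- sample(nrow(Default), 5000)            # simple train/test split

pred_knn <- knn(train = X[train, ], test = X[-train, ],
                cl = Default$default[train], k = 5)
mean(pred_knn != Default$default[-train])       # test error rate
```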
### An Empirical Comparison
```{r fig4-11, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Boxplots of the test error rates for each of the linear scenarios described in the main text."}
knitr::include_graphics("./images/fig4_11.jpg", error = FALSE)
```
**When Bayes decision boundary is linear,**
_Scenario 1_: Binary class response, equal observations in each class, uncorrelated predictors
_Scenario 2_: Similar to Scenario 1, but the predictors had a correlation of −0.5.
_Scenario 3_: The predictors had a negative correlation and were drawn from a t-distribution (more extreme points in the tails).
```{r fig4-12, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Boxplots of the test error rates for each of the non-linear scenarios described in the main text"}
knitr::include_graphics("./images/fig4_12.jpg", error = FALSE)
```
**When Bayes decision boundary is non-linear,**
_Scenario 4_: Normal distribution, with a correlation of 0.5 between the predictors in the first class and a correlation of −0.5 in the second class.
_Scenario 5_: Normal distribution, uncorrelated predictors
_Scenario 6_: Normal distribution, different diagonal covariance matrix for each class, small n
## Generalized Linear Models
- Used here to model count data (the number of hourly users in the `Bikeshare` data set)
```{r fig4-13, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "A least squares linear regression model was fit to predict bikers in the Bikeshare data set. Left: The coefficients associated with the month of the year. Bike usage is highest in the spring and fall, and lowest in the winter. Right: The coefficients associated with the hour of the day. Bike usage is highest during peak commute times, and lowest overnight."}
knitr::include_graphics("./images/fig4_13.jpg", error = FALSE)
```
```{r fig4-14, cache=FALSE, echo=FALSE, fig.align="center", fig.cap= "Left: On the Bikeshare dataset, the number of bikers is displayed on the y-axis, and the hour of the day is displayed on the x-axis. Jitter was applied for ease of visualization. For the most part, as the mean number of bikers increases, so does the variance in the number of bikers. A smoothing spline fit is shown in green. Right: The log of the number of bikers is now displayed on the y-axis."}
knitr::include_graphics("./images/fig4_14.jpg", error = FALSE)
```
**Problems:**
1. The linear model predicts a negative number of users for ~10% of the observations
2. Heteroscedasticity of the data
3. Nature of response is integer-value (i.e., count)
**Potential solutions:**
1. Log-transform the response -> but the resulting estimates are difficult to interpret
2. Poisson regression model
### Poisson Regression Model
- The Poisson distribution is typically used to model counts, which take nonnegative integer values. Poisson regression models the log of the mean count as a linear function of the predictors: $\log(\lambda(X)) = \beta_0 + \beta_1X_1 + \cdots + \beta_pX_p$, where $\lambda = E(Y)$.
Some important distinctions between the Poisson regression model and linear regression:
1. **Interpretation:** an increase in $X_j$ by one unit is associated with a change in $E(Y) = \lambda$ by a factor of $\exp(\beta_j)$.
2. **Mean-variance relationship:** Poisson regression implicitly assumes that mean bike usage in a given hour equals the variance of bike usage during that hour, whereas linear regression assumes constant variance.
3. **Nonnegative fitted values:** there are no negative predictions from the Poisson regression model (see the code sketch below).
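A minimal Poisson regression sketch on the `Bikeshare` data (assumes ISLR2 is installed; this is a simplified version of the model discussed in the text, which also includes `weathersit`).
```{r poisson-sketch, eval=FALSE}
# Poisson regression on the Bikeshare data (sketch)
library(ISLR2)

fit_pois <- glm(bikers ~ workingday + temp + mnth + hr,
                data = Bikeshare, family = poisson)

# A one-unit increase in temp multiplies the expected count E(Y) = lambda by exp(beta_temp)
exp(coef(fit_pois)["temp"])

# Fitted values are always nonnegative
range(predict(fit_pois, type = "response"))
```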
## Lab: Classification Methods
## Meeting Videos
### Cohort 1
`r knitr::include_url("https://www.youtube.com/embed/URL")`
<details>
<summary> Meeting chat log </summary>
```
ADD LOG HERE
```
</details>