_nnet.qmd

## Neural Networks with Tensorflow (by Giovanni Lunetta)

A neural network is a type of machine learning algorithm that is inspired by the structure and function of the human brain. It consists of layers of interconnected nodes, or neurons, that can learn to recognize patterns in data and make predictions or decisions based on that input.

Neural networks are used in a wide variety of applications, including image and speech recognition, natural language processing, predictive analytics, robotics, and more. They have been especially effective in tasks that require pattern recognition, such as identifying objects in images, translating between languages, and predicting future trends in data.

### Neural Network Architecture

A neural network consists of one or more layers of neurons, each of which takes input from the previous layer and produces output for the next layer. The input layer receives raw data, while the output layer produces predictions or decisions based on that input. The hidden layers in between contain neurons that can learn to recognize patterns in the data and extract features that are useful for making predictions.

Each neuron in a neural network has a set of weights and biases that determine how it responds to input. These values are adjusted during training to improve the accuracy of the network's predictions. The activation function of a neuron determines how it responds to input, such as by applying a threshold or sigmoid function.

```{python}
#| code-fold: true
from IPython.display import Image
# Image(filename='ai-artificial-neural-network-alex-castrounis.png')
```

The input layer: The three blue nodes on the left side of the diagram represent the input layer. This layer receives input data, such as pixel values from an image or numerical features from a dataset.

The hidden layer: The four white nodes in the middle of the diagram represent the hidden layer. This layer performs computations on the input data and generates output values that are passed to the output layer.

The output layer: The orange node on the right side of the diagram represents the output layer. This layer generates the final output of the neural network, which can be a binary classification (0 or 1) or a continuous value.

The arrows: The arrows in the diagram represent the connections between nodes in adjacent layers. Each arrow has an associated weight, which is a parameter learned during the training process. The weights determine the strength of the connections between the nodes and are used to compute the output values of each node.

### ReLu Activation Function

The ReLU (Rectified Linear Unit) activation function is used in neural networks to introduce non-linearity into the model. Non-linearity allows neural networks to learn more complex relationships between inputs and outputs.

ReLU is a simple function that returns the input if it is positive, and 0 otherwise. This means that ReLU "activates" (returns a non-zero output) only if the input is positive, which can be thought of as a way for the neuron to "turn on" when the input is significant enough. In contrast, a linear function would simply scale the input by a constant factor, which would not introduce any non-linearity into the model.

In simple terms, ReLU allows the neural network to selectively activate certain neurons based on the importance of the input, which helps it learn more complex patterns in the data.

```{python}
import numpy as np
import matplotlib.pyplot as plt

def linear(x):
    return x

def relu(x):
    return np.maximum(0, x)

x = np.linspace(-10, 10, 100)
y_linear = linear(x)
y_relu = relu(x)

plt.plot(x, y_linear, label='Linear')
plt.plot(x, y_relu, label='ReLU')
plt.legend()
plt.xlabel('Input')
plt.ylabel('Output')
plt.show()
```

### Demonstration

TensorFlow is an open-source software library developed by Google that is widely used for building and training machine learning models, including neural networks. TensorFlow provides a range of tools and abstractions that make it easier to build and optimize complex models, as well as tools for deploying models in production.

Here's an example of how to use TensorFlow to build a neural network for a softmax regression model:

First we start by importing the proper packages:

```{python}
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import plot_model
from tensorflow.keras.losses import SparseCategoricalCrossentropy

import numpy as np

from sklearn.datasets import make_blobs

import matplotlib.pyplot as plt
```

TensorFlow and Keras are closely related, as Keras is a high-level API that is built on top of TensorFlow. Keras provides a user-friendly interface for building neural networks, making it easy to create, train, and evaluate models without needing to know the details of TensorFlow's low-level API.

Keras was initially developed as a standalone library, but since version 2.0, it has been integrated into TensorFlow as its official high-level API. This means that Keras can now be used as a part of TensorFlow, providing a unified and comprehensive platform for deep learning.

In other words, Keras is essentially a wrapper around TensorFlow that provides a simpler and more intuitive interface for building neural networks. While TensorFlow provides a lower-level API that offers more control and flexibility, Keras makes it easier to get started with building deep learning models, especially for beginners.

```{python}
# make dataset for example
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=2.0,random_state=75)

# plot the example dataset
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.title('Example Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```

We will talk about three ways to implement a softmax regression machine learning model. 
The first using Stochastic Gradient Descent as the loss function.
Next, using a potentially more efficient algoritm called the Adam Algoritm.
Finally, using the Adam Algoritm again, but more efficiently.

#### Stochastic Gradient Descent
```{python}
sgd_model = tf.keras.Sequential([
        Dense(10, activation = 'relu'),
        Dense(5, activation = 'relu'),
        Dense(4, activation = 'softmax')    # <-- softmax activation here
    ]
)
sgd_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),  # <-- Note
)
sgd_history = sgd_model.fit(
                    X_train,y_train,
                    epochs=30
)
```

Here is a step-by-step explanation of the code:

1. First, we create a sequential model using the `tf.keras.Sequential()` function. This is a linear stack of layers where we can add layers using the `.add()` method.

2. Then we add three dense layers to the model using the `.add()` method. The first two layers have the relu activation function and the last layer has the softmax activation function.

3. We import `SparseCategoricalCrossentropy` from `tensorflow.keras.losses`. This is our loss function, which will be used to evaluate the model during training.

4. We compile the model using `model.compile()`, specifying the `SparseCategoricalCrossentropy()` as our loss function.

5. We fit the model to the training data using `model.fit()`, specifying the training data (X_train and y_train) and the number of epochs* (10).

In summary, the code creates a sequential model with three dense layers, using the relu activation function in the first two layers and the softmax activation function in the output layer. The model is then compiled using the `SparseCategoricalCrossentropy()` loss function, and finally, the model is trained for 10 epochs using the `model.fit()` method.

*In machine learning, the term "epochs" refers to the number of times the entire training dataset is used to train the model. During each epoch, the model processes the entire dataset, updates its parameters based on the computed errors, and moves on to the next epoch until the desired level of accuracy is achieved. Increasing the number of epochs may improve the model accuracy, but it also increases the risk of overfitting on the training data. Therefore, the number of epochs is a hyperparameter that must be tuned to achieve the best possible results.

```{python}
sgd_model.summary()
```

In this example, the first hidden layer has 10 neurons, so there are 10 * 3 = 30 parameters (3 input features). The second hidden layer has 5 neurons, so there are 5 * 10 + 5 = 55 parameters (10 inputs from the previous layer, plus 5 bias terms). The output layer has 4 neurons, so there are 5 * 4 + 4 = 24 parameters (5 inputs from the previous layer, plus 4 bias terms).

The output None for the total number of trainable parameters means that none of the layers have been marked as non-trainable.

The None values in the output shape column represent the variable batch size that is inputted during the training process.

```{python}
p_nonpreferred = sgd_model.predict(X_train)
print(p_nonpreferred [:2])
print("largest value", np.max(p_nonpreferred), "smallest value", np.min(p_nonpreferred))
```

`p_nonpreferred = model.predict(X_train)`: This line uses the predict method of the model object to make predictions on the input data X_train. The resulting predictions are stored in the p_nonpreferred variable.

`print(p_nonpreferred [:2])`: This line prints the first two rows of p_nonpreferred. Each row represents the predicted probabilities for a single observation in the training set. The four columns represent the predicted probabilities for each of the four classes in the dataset.

`print("largest value", np.max(p_nonpreferred), "smallest value", np.min(p_nonpreferred))`: This line prints out the largest and smallest values from p_nonpreferred, which can give an idea of the range of the predictions. The np.max and np.min functions from NumPy are used to find the maximum and minimum values in p_nonpreferred.

The output is a matrix with two rows (because we have two input examples) and four columns (because the output layer has four neurons). Each element of the matrix is the probability that the input example belongs to the corresponding class. For example, the probability that the first input example belongs to class 3 (which has the highest probability) is 0.99254191.

#### ADAM Algoritm
```{python}
adam_model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'softmax')    # < softmax activation here
    ]
)
adam_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001), # < change to 0.01 and rerun
)

adam_history = adam_model.fit(
                    X_train,y_train,
                    epochs=30
)
```

```{python}
adam_model.summary()
```

The None values in the output shape column represent the variable batch size that is inputted during the training process. The number of parameters in each layer depends on the number of inputs and the number of neurons in the layer, along with any additional bias terms.

In this example, the first hidden layer has 25 neurons, so there are 25 * 3 = 75 parameters (3 input features). The second hidden layer has 15 neurons, so there are 15 * 25 + 15 = 390 parameters (25 inputs from the previous layer, plus 15 bias terms). The output layer has 4 neurons, so there are 15 * 4 + 4 = 64 parameters (15 inputs from the previous layer, plus 4 bias terms).

The output None for the total number of trainable parameters means that none of the layers have been marked as non-trainable.

```{python}
p_nonpreferred = adam_model.predict(X_train)
print(p_nonpreferred [:2])
print("largest value", np.max(p_nonpreferred), "smallest value", np.min(p_nonpreferred))
```

Here, the only difference between the these two machine learning models is the optimizer. That line of code, optimizer=tf.keras.optimizers.Adam(0.001), specifies the optimizer to be used during training. In this case, it uses the Adam optimizer with a learning rate of 0.001. The Adam optimizer is an adaptive optimization algorithm that is commonly used in deep learning for its ability to dynamically adjust the learning rate during training, which can help prevent the model from getting stuck in local minima.

```{python}
#| code-fold: true

import numpy as np
import matplotlib.pyplot as plt

# Define the objective function (quadratic)
def objective(x, y):
    return x**2 + y**2

# Define the Adam update rule
def adam_update(x, y, m, v, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    g = np.array([2*x, 2*y])
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    dx = - alpha * m_hat[0] / (np.sqrt(v_hat[0]) + eps)
    dy = - alpha * m_hat[1] / (np.sqrt(v_hat[1]) + eps)
    return dx, dy, m, v

# Define the parameters for the optimization
theta = np.array([2.0, 2.0])
m = np.zeros(2)
v = np.zeros(2)
t = 0
alpha = 0.1
beta1 = 0.9
beta2 = 0.999
eps = 1e-8

# Generate the parameter space grid
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = objective(X, Y)

# Generate the parameter space plot
fig, ax = plt.subplots()
ax.contour(X, Y, Z, levels=30, cmap='jet')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Parameter Space of Adam')

# Perform several iterations of Adam and plot the updates
for i in range(20):
    t += 1
    dx, dy, m, v = adam_update(theta[0], theta[1], m, v, t, alpha, beta1, beta2, eps)
    theta += np.array([dx, dy])
    ax.arrow(theta[0]-dx, theta[1]-dy, dx, dy, head_width=0.1, head_length=0.1, fc='b', ec='b')
plt.show()
```

```{python}
plt.plot(sgd_history.history['loss'], label='SGD')
plt.plot(adam_history.history['loss'], label='Adam')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
```

#### Preferred ADAM Algorithm

As we have talked about in class before, numerical roundoff errors happen when coding in python due to memory overflow.

```{python}
x1 = 2.0 / 10000
print(f"{x1:.18f}") # print 18 digits to the right of the decimal point
```

```{python}
x2 = 1 + (1/10000) - (1 - 1/10000)
print(f"{x2:.18f}")
```

It turns out that while the implementation of the loss function for softmax was correct, there is a different and better way of reducing numerical roundoff errors which leads to more accurate computations.

If we go back to how a loss function for softmax regression is implemented we see that the loss function is expressed in the following formula:
$$
\text{loss}(a_1, a_2, \dots, a_n, y) =
\begin{cases}
-\log(a_1) & \text{if } y = 1 \\
-\log(a_2) & \text{if } y = 2 \\
\vdots & \vdots \\
-\log(a_n) & \text{if } y = n
\end{cases}
$$

where $a_j$ is computed from:
$$
a_j = \frac{e^{z_j}}{\sum\limits_{k=1}^n e^{z_k}} = P(y=j \mid \vec{x})
$$

This can lead to numerical roundoff errors in tensorflow as the loss function is not directly computing $a_j$.

In terms of code, that is exactly what `loss=SparseCategoricalCrossentropy()` is doing. Therefore, it would be more accurate if we could implement the loss function as follows:
$$
\text{loss}(a_1, a_2, \dots, a_n, y) =
\begin{cases}
-\log(\frac{e^{z_1}}{e^{z_1} + e^{z_2} + ... + e^{z_n}}) & \text{if } y = 1 \\
-\log(\frac{e^{z_2}}{e^{z_1} + e^{z_2} + ... + e^{z_n}}) & \text{if } y = 2 \\
\vdots & \vdots \\
-\log(\frac{e^{z_j}}{\sum\limits_{k=1}^n e^{z_k}}) & \text{if } y = n
\end{cases}
$$

We achieve this in two steps. The first is making the output layer a linear activation, and additionally adding a `from_logits=True` parameter to the `loss=tf.keras.losses.SparseCategoricalCrossentropy` line of code. By using a linear activation function instead of softmax, the model will output a vector of real numbers rather than probabilities.

```{python}
preferred_model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'linear')   #<-- Note
    ]
)
preferred_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  #<-- Note
    optimizer=tf.keras.optimizers.Adam(0.001),
)

preferred_history = preferred_model.fit(
                    X_train,y_train,
                    epochs=30
)
```

```{python}
p_preferred = preferred_model.predict(X_train)
print(f"two example output vectors:\n {p_preferred[:2]}")
print("largest value", np.max(p_preferred), "smallest value", np.min(p_preferred))
```

Notice that in the preferred model, the outputs are not probabilities, but can range from large negative numbers to large positive numbers. The output must be sent through a softmax when performing a prediction that expects a probability. 

If the desired output are probabilities, the output should be be processed by a [softmax](https://www.tensorflow.org/api_docs/python/tf/nn/softmax).

```{python}
sm_preferred = tf.nn.softmax(p_preferred).numpy()
print(f"two example output vectors:\n {sm_preferred[:2]}")
print("largest value", np.max(sm_preferred), "smallest value", np.min(sm_preferred))
```

This code applies the softmax activation function to the output of a neural network model p_preferred, and then converts the resulting tensor to a numpy array using the `.numpy()` method. The resulting array sm_preferred contains the probabilities for each of the possible output classes for the input data.

The second line of code then prints the first two rows of sm_preferred, which correspond to the probabilities for the first two input examples in the dataset.

Lets check the loss functions one final time:

```{python}
plt.plot(adam_history.history['loss'], label='ADAM')
plt.plot(preferred_history.history['loss'], label='Pref_ADAM')
plt.plot(sgd_history.history['loss'], label='SGD')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
```

### References

1. https://www.tensorflow.org/api_docs/python/tf/nn/softmax
2. https://www.tensorflow.org/
3. https://www.whyofai.com/blog/ai-explained
4. https://www.coursera.org/specializations/machine-learning-introduction