The value of each output neuron can be computed as follows:
$$
y_{j}=b_{j}+\sum_{i}x_{i}w_{i j}
$$
With matrices, we can compute this formula for all output neurons at once with a single matrix product:
$$
X=[x_{1}\ \ \cdots\ \ x_{i}] \quad \quad W=\begin{bmatrix}
w_{11} & \cdots & w_{1j} \\
\vdots & \ddots & \vdots \\
w_{i1} & \cdots & w_{ij}
\end{bmatrix} \quad \quad B=\left[b_{1}\ \ \cdots\ \ b_{j}\right]
$$
$$
Y = XW + B
$$
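As a minimal NumPy sketch (assuming $X$ is stored as a row vector of shape `(1, i)`, with made-up sizes), the forward pass is a single matrix product:

```python
import numpy as np

# Hypothetical sizes: i = 3 inputs, j = 2 outputs.
i, j = 3, 2
rng = np.random.default_rng(0)

X = rng.standard_normal((1, i))   # input row vector, shape (1, i)
W = rng.standard_normal((i, j))   # w[i][j] connects input i to output j
B = rng.standard_normal((1, j))   # one bias per output neuron

Y = X @ W + B                     # Y = XW + B, shape (1, j)
print(Y.shape)                    # (1, 2)
```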
For the backward pass, we first need the derivative of the error with respect to the input, ∂E/∂X, since it is what gets passed back as the ∂E/∂Y of the previous layer.
$$
\frac{\partial E}{\partial X}=\left[\frac{\partial E}{\partial x_{1}}\quad\frac{\partial E}{\partial x_{2}}\quad\ ...\quad\frac{\partial E}{\partial x_{i}}\right]
$$
Using the chain rule:
$$
{\frac{\partial E}{\partial x_{i}}}={\frac{\partial E}{\partial y_{1}}}{\frac{\partial y_{1}}{\partial x_{i}}}+\dots+{\frac{\partial E}{\partial y_{j}}}{\frac{\partial y_{j}}{\partial x_{i}}}
$$
$$
=\frac{\partial E}{\partial y_{1}}w_{i1}+\ldots+\frac{\partial E}{\partial y_{j}}w_{i j}
$$
We can then write the whole matrix:
$$
\frac{\partial E}{\partial X}=\left[(\frac{\partial E}{\partial y_{1}}w_{11}+\dots+\frac{\partial E}{\partial y_{j}}w_{1j})\right.\ \dots\ \ \left.(\frac{\partial E}{\partial y_{1}}w_{i1}+\dots+\frac{\partial E}{\partial y_{j}}w_{i j})\right]
$$
$$
=\left[\frac{\partial E}{\partial y_{1}}\quad\cdots\quad\frac{\partial E}{\partial y_{j}}\right]
\begin{bmatrix}
w_{11} & \cdots & w_{i1} \\
\vdots & \ddots & \vdots \\
w_{1j} & \cdots & w_{ij}
\end{bmatrix}
$$
$$
={\frac{\partial E}{\partial Y}}W^{t}
$$
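As a quick self-contained shape check of this formula (NumPy, with hypothetical sizes $i=3$, $j=2$), the input gradient has one entry per input, just like $X$:

```python
import numpy as np

rng = np.random.default_rng(0)
i, j = 3, 2                          # hypothetical layer sizes
W = rng.standard_normal((i, j))      # same shape as in the forward pass
dE_dY = rng.standard_normal((1, j))  # gradient coming back from the next layer

dE_dX = dE_dY @ W.T                  # (1, j) @ (j, i) -> (1, i): one entry per input
print(dE_dX.shape)                   # (1, 3)
```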
To update the network weights, we need the error derivative with respect to every weight:
$$
\frac{\partial E}{\partial W}=
\begin{bmatrix}
\frac{\partial E}{\partial w_{11}} & \cdots & \frac{\partial E}{\partial w_{1j}} \\
\vdots & \ddots & \vdots \\
\frac{\partial E}{\partial w_{i1}} & \cdots & \frac{\partial E}{\partial w_{ij}}
\end{bmatrix}
$$
Using the chain rule (only $y_j$ depends on $w_{ij}$, so every other term vanishes):
$$
{\frac{\partial E}{\partial w_{i j}}}={\frac{\partial E}{\partial y_{1}}}{\frac{\partial y_{1}}{\partial w_{i j}}}+\dots+{\frac{\partial E}{\partial y_{j}}}{\frac{\partial y_{j}}{\partial w_{i j}}}
$$
$$
{}={\frac{\partial E}{\partial y_{j}}}x_{i}
$$
Therefore,
$$
\frac{\partial E}{\partial W}=
\begin{bmatrix}
{\frac{\partial E}{\partial y_{1}}}x_{1} & \cdots & {\frac{\partial E}{\partial y_{j}}}x_{1} \\
\vdots & \ddots & \vdots \\
{\frac{\partial E}{\partial y_{1}}}x_{i} & \cdots & {\frac{\partial E}{\partial y_{j}}}x_{i}
\end{bmatrix}
$$
$$
{}=
\begin{bmatrix}
x_{1} \\
\vdots \\
x_{i}
\end{bmatrix}
\left[{\frac{\partial E}{\partial y_{1}}}\quad\quad...\quad\quad{\frac{\partial E}{\partial y_{j}}}\right]
$$
$$
=X^t\frac{\partial E}{\partial Y}
$$
Now for the biases (one gradient per bias):
$$
{\frac{\partial E}{\partial B}}=\left[{\frac{\partial E}{\partial b_{1}}}\quad{\frac{\partial E}{\partial b_{2}}}\quad...\quad{\frac{\partial E}{\partial b_{j}}}\right]
$$
Again, using the chain rule (only $y_j$ depends on $b_j$):
$$
{\frac{\partial E}{\partial b_{j}}}={\frac{\partial E}{\partial y_{1}}}{\frac{\partial y_{1}}{\partial b_{j}}}+\dots+{\frac{\partial E}{\partial y_{j}}}{\frac{\partial y_{j}}{\partial b_{j}}}
$$
$$
={\frac{\partial E}{\partial y_{j}}}
$$
Therefore:
$$
\frac{\partial E}{\partial B}=\left[\frac{\partial E}{\partial y_{1}}\quad\frac{\partial E}{\partial y_{2}}\quad...\quad\frac{\partial E}{\partial y_{j}}\right]
$$
$$
={\frac{\partial E}{\partial Y}}
$$
Finally, we have the three formulas that we need for the backward propagation:
$$
\frac{\partial E}{\partial X}=\frac{\partial E}{\partial Y}W^{t}
$$
$$
\frac{\partial E}{\partial W}=X^{t}\frac{\partial E}{\partial Y}
$$
$$
\frac{\partial E}{\partial B}=\frac{\partial E}{\partial Y}
$$
The parameters are then updated by gradient descent, where $\alpha$ is the learning rate. For the weights:
$$
W \leftarrow W - \alpha \frac{\partial E}{\partial W}
$$
And for the biases:
$$
B \leftarrow B - \alpha \frac{\partial E}{\partial B}
$$
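Putting the three gradients and the update rule together, here is a minimal dense-layer sketch in NumPy; the class name, the initialization, and the in-place update inside `backward` are illustrative choices, not a prescribed implementation:

```python
import numpy as np

class Dense:
    """Fully connected layer: Y = XW + B, with X a row vector of shape (1, input_size)."""

    def __init__(self, input_size, output_size):
        rng = np.random.default_rng(0)
        self.W = rng.standard_normal((input_size, output_size)) * 0.1
        self.B = np.zeros((1, output_size))

    def forward(self, X):
        self.X = X                      # cache the input for the backward pass
        return X @ self.W + self.B

    def backward(self, dE_dY, alpha):
        dE_dW = self.X.T @ dE_dY        # dE/dW = X^t · dE/dY
        dE_dB = dE_dY                   # dE/dB = dE/dY
        dE_dX = dE_dY @ self.W.T        # dE/dX = dE/dY · W^t (passed to the previous layer)

        self.W -= alpha * dE_dW         # gradient-descent update
        self.B -= alpha * dE_dB
        return dE_dX

# Tiny usage example with made-up numbers.
layer = Dense(3, 2)
y = layer.forward(np.array([[1.0, 2.0, 3.0]]))
dE_dX = layer.backward(np.ones_like(y), alpha=0.01)
print(y.shape, dE_dX.shape)             # (1, 2) (1, 3)
```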
Activation layers apply a function element-wise; below are common choices and their derivatives. ReLU:
$$
f(x)=
\begin{cases}
0& \text{if } x < 0\\
x& \text{if } x\geq 0
\end{cases}
$$
$$
f'(x)=
\begin{cases}
0& \text{if } x < 0\\
1& \text{if } x\geq 0
\end{cases}
$$
Sigmoid:
$$
f(x)={\frac{1}{1+e^{-x}}}
$$
$$
f'(x)=f(x)\,(1-f(x))
$$
Tanh:
$$
f(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}
$$
$$
f'(x)=1-f(x)^2
$$
Softmax (the derivative below is the diagonal of its Jacobian):
$$
f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
$$
$$
\frac{\partial f(x_i)}{\partial x_i}=f(x_i)\,(1-f(x_i))
$$
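A NumPy sketch of these activations and their derivatives (element-wise; for softmax, only the diagonal of the Jacobian is implemented, matching the formula above):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_prime(x):
    return (x >= 0).astype(x.dtype)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_prime(x):
    return 1.0 - np.tanh(x) ** 2      # np.tanh itself serves as the tanh activation

def softmax(x):
    e = np.exp(x - np.max(x))         # subtract the max for numerical stability
    return e / np.sum(e)

def softmax_prime_diag(x):
    s = softmax(x)
    return s * (1.0 - s)              # diagonal of the softmax Jacobian only

x = np.array([-1.0, 0.0, 2.0])
print(relu(x), sigmoid_prime(x), softmax(x))
```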
Mean squared error (MSE) loss, where $y_i$ is the true value and $y_i^*$ the network's prediction (the output $Y$ of the last layer):
$$
L=\frac{1}{n}\sum_{i=1}^{n}\bigl(y_{i}-y_{i}^{*}\bigr)^{2}
$$
The gradient is taken with respect to the prediction; it is the $\frac{\partial E}{\partial Y}$ that the last layer receives during backpropagation:
$$
{\frac{\partial L}{\partial Y^*}}=\left[{\frac{\partial L}{\partial y_{1}^*}}\quad\cdots\quad{\frac{\partial L}{\partial y_{n}^*}}\right]
$$
$$
{}=\frac{2}{n}\left[y_{1}^* - y_1 \quad \cdots \quad y_{n}^* - y_n\right]
$$
$$
{}=\frac{2}{n}(Y^* - Y)
$$
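In code (NumPy sketch; `y_true` plays the role of $Y$ and `y_pred` of $Y^*$):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mse_prime(y_true, y_pred):
    # Gradient with respect to the prediction: (2/n) * (Y* - Y)
    return 2.0 * (y_pred - y_true) / y_true.size

y_true = np.array([[0.0, 1.0]])
y_pred = np.array([[0.1, 0.8]])
print(mse(y_true, y_pred), mse_prime(y_true, y_pred))
```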
Binary cross-entropy loss:
$$
L = -\frac{1}{n} \sum_{i=1}^{n} \bigl(y_{i} \log(y_{i}^{*}) + (1 - y_{i}) \log(1 - y_{i}^{*})\bigr)
$$
$$
{\frac{\partial L}{\partial Y^*}} = \frac{1}{n}\left({\frac{-Y}{Y^*}}-{\frac{1-Y}{1-{Y^*}}}\cdot(-1)\right)
$$
$$
{}=\frac{1}{n}\left({\frac{-Y}{Y^*}}+{\frac{1-Y}{1-Y^*}}\right)
$$
Categorical cross-entropy loss:
$$
L=-\sum_{i=1}^{N}y_{i}\log(y_{i}^*)
$$
$$
{\frac{\partial L}{\partial Y^*}} = - \frac{Y}{Y^*}
$$
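A corresponding sketch for the two cross-entropy losses under the same convention; the clipping and epsilon guarding the logarithms are implementation details added here, not part of the formulas:

```python
import numpy as np

EPS = 1e-12  # avoids log(0) and division by zero

def binary_cross_entropy(y_true, y_pred):
    y_pred = np.clip(y_pred, EPS, 1.0 - EPS)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

def binary_cross_entropy_prime(y_true, y_pred):
    y_pred = np.clip(y_pred, EPS, 1.0 - EPS)
    return (-y_true / y_pred + (1.0 - y_true) / (1.0 - y_pred)) / y_true.size

def categorical_cross_entropy(y_true, y_pred):
    return -np.sum(y_true * np.log(y_pred + EPS))

def categorical_cross_entropy_prime(y_true, y_pred):
    return -y_true / (y_pred + EPS)

y_true = np.array([[0.0, 1.0, 0.0]])
y_pred = np.array([[0.2, 0.7, 0.1]])
print(binary_cross_entropy(y_true, y_pred), categorical_cross_entropy(y_true, y_pred))
```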
Gradient descent with momentum, where $V_t$ is an exponential moving average of the gradient and $\alpha$ the learning rate:
$$
V_{t}=\beta V_{t-1}+\left(1-\beta\right)\frac{\partial L}{\partial W}
$$
$$
W_t=W_{t-1}-\alpha V_{t}
$$
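A sketch of one momentum step; the default values for `alpha` and `beta` are typical but arbitrary choices:

```python
import numpy as np

def momentum_step(W, dL_dW, V, alpha=0.01, beta=0.9):
    """One momentum update: V tracks an exponential moving average of the gradient."""
    V = beta * V + (1.0 - beta) * dL_dW
    W = W - alpha * V
    return W, V

W = np.array([[0.5, -0.3]])
V = np.zeros_like(W)
grad = np.array([[0.2, 0.1]])
W, V = momentum_step(W, grad, V)
print(W, V)
```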
Adam keeps moving averages of both the gradient ($m_t$) and the squared gradient ($v_t$), with bias correction:
$$
m_{t}=\beta_{1}m_{t-1}+\left(1-\beta_{1}\right)\frac{\partial L}{\partial W}
$$
$$
v_{t}=\beta_{2}v_{t-1}+\left(1-\beta_{2}\right)\left(\frac{\partial L}{\partial W}\right)^{2}
$$
$$
\hat{m_t}=\frac{m_{t}}{1-\beta_{1}^t}
$$
$$
\hat{v_t}=\frac{v_{t}}{1-\beta_{2}^t}
$$
$$
W_t = W_{t-1} - \alpha \frac{\hat{m_t}}{\sqrt{\hat{v_t}}+\epsilon}
$$
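A sketch of one Adam step, with the commonly used default hyperparameters assumed:

```python
import numpy as np

def adam_step(W, dL_dW, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * dL_dW        # first moment (mean of gradients)
    v = beta2 * v + (1.0 - beta2) * dL_dW ** 2   # second moment (mean of squared gradients)
    m_hat = m / (1.0 - beta1 ** t)               # bias-corrected estimates
    v_hat = v / (1.0 - beta2 ** t)
    W = W - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

W = np.array([[0.5, -0.3]])
m = np.zeros_like(W)
v = np.zeros_like(W)
grad = np.array([[0.2, 0.1]])
W, m, v = adam_step(W, grad, m, v, t=1)
print(W)
```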
Forward pass of a vanilla RNN cell, where $W$, $U$ and $V$ are the recurrent, input and output weight matrices, $h^{(t)}$ the hidden state and $o^{(t)}$ the output at time step $t$:
$$
a^{(t)} = W h^{(t-1)}+U x^{(t)}
$$
$$
h^{(t)} = \operatorname{activation}\left(a^{(t)}\right)
$$
$$
o^{(t)}=Vh^{(t)}
$$
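Finally, a sketch of one forward step of such a cell (column-vector convention, `tanh` as the activation, and illustrative shapes):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, V):
    """One time step of a vanilla RNN: a = W h_{t-1} + U x_t, h = tanh(a), o = V h."""
    a_t = W @ h_prev + U @ x_t
    h_t = np.tanh(a_t)                 # the "activation" in the formula above
    o_t = V @ h_t
    return h_t, o_t

hidden, inputs, outputs = 4, 3, 2      # hypothetical sizes
rng = np.random.default_rng(0)
W = rng.standard_normal((hidden, hidden))
U = rng.standard_normal((hidden, inputs))
V = rng.standard_normal((outputs, hidden))

h = np.zeros((hidden, 1))
x = rng.standard_normal((inputs, 1))
h, o = rnn_step(x, h, W, U, V)
print(h.shape, o.shape)                # (4, 1) (2, 1)
```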