A feedforward neural network is an neural network where connections between layers are in one direction.
The outputs of a layer become inputs (x) to the next.
Coincidentally, matrix math also establishes a one-way relationship.
Consider the following, which multiplies a layer by its weights that map to the next layer.

f \left(
\underbrace{\begin{bmatrix}
1 & x_{i,1} & \cdots & x_{i,n}
\end{bmatrix}_{(1 \times (n+1))}}_{layer_{i}}
\cdot
\underbrace{\begin{bmatrix}
w_{0,0} & \cdots & w_{0,m-1} \\
\vdots & \ddots & \vdots \\
w_{n,0} & \cdots & w_{n,m-1}
\end{bmatrix}_{((n+1) \times m)}}_{weights_{i \rightarrow i+1}}
\right)_{(1 \times m)}

=
\underbrace{\begin{bmatrix}
x_{i+1,1} & \cdots & x_{i+1,m}
\end{bmatrix}_{(1\times m)}}_{layer_{i+1}}

The goal of gradient descent is to calculate an update to the weights,
such that the next calculation of the network will result in a lower error when estimating the output.
In other words, we aim to calculate the change in error with respect to a change in weights (allowing us to know how much to change the weights).

\frac{\partial L}{\partial w_{i \rightarrow i+1}}

Since the product of layer times weights is going to be used a lot, we can simplify it by calling it z.

z_{i(1 \times m)}=\underbrace{\begin{bmatrix}
1 & x_{i,1} & \cdots & x_{i,n}
\end{bmatrix}_{(1 \times (n+1))}}_{layer_{i}}
\cdot
\underbrace{\begin{bmatrix}
w_{0,0} & \cdots & w_{0,m-1} \\
\vdots & \ddots & \vdots \\
w_{n,0} & \cdots & w_{n,m-1}
\end{bmatrix}_{((n+1) \times m)}}_{weights_{i \rightarrow i+1}}

This notation can be summarized as follows:

z = x \cdot w

such that we revisit the matrix differentiation rules:

\frac{\partial L}{\partial w} = x^T \frac{\partial L}{\partial z}

\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z}^T \cdot w

\frac{\partial z}{\partial x} = w^T

\frac{\partial z}{\partial w} = x

This is important when we consider the function f(z).
This could be a number of different functions - typically of the sigmoid family.
logistic, hyperbolic tangent, arctangent, etc.
Sigmoid functions range from 0 to 1 across all inputs (not actually reaching 0 or 1 in practice) and are both reversible and differentiable.
One such sigmoid function is the logistic function:

f \left(z \right ) = \frac{1}{1 + e^{-z}}

which can be differentiated as follows

f' \left(z \right ) = \frac{\partial }{\partial z}f(z) = \frac{e^{-z}}{\left(1 + e^{-z} \right )^2}

and expanded

\frac{e^{-z}}{\left(1 + e^{-z} \right )^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}

further expands to

\frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \frac{1}{1 + e^{-z}} \cdot \left(1 - \frac{1}{1 + e^{-z}} \right)

which simplifies to

\frac{1}{1 + e^{-z}} \cdot \left(1 - \frac{1}{1 + e^{-z}} \right) = f(z) \cdot \left(1- f(z) \right )

such that

f'(z)_{(1 \times m)} = f(z)_{(1 \times m)} * \left(1- f(z) \right )_{(1 \times m)}

where * is the Schur/Hadamard product.
This is the product of each element - not the matrix product.
In the same manner, the logistic function can be inverted using.

f^{-1} \left(z \right ) = ln \left(\frac{1}{z} - 1\right)

This, however, assumes that x never reaches 0 or 1 exactly.

We now think about what the network hopes to minimize, and that is error/residual/loss.
The loss function is often defined as the 1/2 residual sum squared.

L = \frac{1}{2}\cdot \left( \underbrace{\tilde{x}_{(1 \times m)}}_{measured} - \underbrace{\hat{x}_{(1 \times m)}}_{estimated} \right )^2

Such that it can be conveniently differentiated as follows:

\frac{\partial L}{\partial x} = \underbrace{\hat{x}_{(1 \times m)}}_{estimated} - \underbrace{\tilde{x}_{(1 \times m)}}_{measured}

Since we also know that error is driven by the estimates,

\frac{\partial}{\partial z}\hat{x} = \frac{\partial}{\partial z}f(z) = f'(z) = f(z) * (1-f(z))

the chain rule can then be applied

\frac{\partial}{\partial z}L =\frac{\partial L}{\partial \hat{x}} \frac{\partial \hat{x}}{\partial z} =
\left(\underbrace{\hat{x}_{i_{(1 \times m)}}}_{estimated} - \underbrace{\tilde{x}_{i_{(1 \times m)}}}_{measured} \right ) * f'(z_{i-1})_{(1 \times m)}

and substituting in we see

\frac{\partial L}{\partial w_{i \rightarrow i+1}} = x^T_{i} \cdot \frac{\partial L}{\partial z_i}

=
\begin{bmatrix}
1 \\
x_i,1 \\
\vdots \\
x_i,n
\end{bmatrix}_{((n + 1) \times 1)}
\cdot
\begin{bmatrix}
\left(\underbrace{\hat{x}_{i+1,(1 \times m)}}_{estimated} - \underbrace{\tilde{x}_{i+1,(1 \times m)}}_{measured} \right )
*
f' \left(z_i_{(1 \times m)} \right )_{(1 \times m)}
\end{bmatrix}_{(1 \times m)}

=
\begin{bmatrix}
1 \\
x_i,1 \\
\vdots \\
x_i,n
\end{bmatrix}_{((n + 1) \times 1)}
\cdot
\begin{bmatrix}
\left(\underbrace{\hat{x}_{i+1,(1 \times m)}}_{estimated} - \underbrace{\tilde{x}_{i+1,(1 \times m)}}_{measured} \right )
*
f' \left(x_i_{(1 \times (n+1))} \cdot w_{i \rightarrow i+1, ((n+1) \times m)} \right )_{(1 \times m)}
\end{bmatrix}_{(1 \times m)}

The last part is to set a learning rate such that

\delta w_{i \rightarrow i+1} += \gamma \cdot \frac{\partial L}{\partial w_{i \rightarrow i+1}}

Great, but what about the weights of the previous layers?
First we need to know how L varies with x.

\frac{\partial L}{\partial x_{i}} = \frac{\partial L}{\partial z_{i}} \frac{\partial z_{i}}{\partial x_{i}}

and from the above rules of differentiation

\frac{\partial L}{\partial z_{i}}\frac{\partial z_{i}}{\partial x_{i}} = \frac{\partial L}{\partial z_{i}} \cdot w^T_{i \rightarrow i+1}

Now, again, we can use the chain rule

\frac{\partial L}{\partial w_{i-1 \rightarrow i}} = \frac{\partial L}{\partial x_{i}} \frac{\partial x_i}{\partial w_{i-1 \rightarrow i}}

and we know

\frac{\partial x_i}{\partial w_{i-1 \rightarrow i}} = x^T_{i-1} \cdot \frac{\partial L}{\partial z_{i-1}}

where

\frac{\partial L}{\partial z_{i-1 }} = \frac{\partial L}{\partial x_{i}} * \frac{\partial x_{i}}{\partial w_{i-1 \rightarrow i}} =\frac{\partial L}{\partial x_{i}} * f'(z_{i-1})

so now

\frac{\partial L}{\partial w_{i-1 \rightarrow i}} =
x^T_{i-1} \cdot
\left(
\underbrace{\left(
\left(
\left(\hat{x}_{i+1} - \tilde{x}_{i+1} \right ) * f'(x_i \cdot w_{i \rightarrow i+1}) \right) \cdot w^T_{i \rightarrow i+1}
\right )}_{\frac{\partial L}{\partial x_{i}}} *
f'(x_{i-1} \cdot w_{i-1 \rightarrow i})
\right )

This is what is referred to as back-propagation.
It is performed from the final layer back to the initial layer before the new residual is calculated.