Backpropagation

Backpropagation is the core algorithm for training artificial neural networks. It efficiently computes the gradient of the loss function with respect to the network's weights and biases, enabling iterative weight updates that minimize prediction errors.

Backpropagation, short for 'backward propagation of errors,' is the fundamental algorithm enabling artificial neural networks to learn. It computes the gradient of a defined loss function with respect to every weight and bias within the network. This gradient information is then fed into an optimization algorithm (e.g., gradient descent) to systematically adjust the network's parameters. The goal is to minimize the discrepancy between the network's predictions and the actual target values. The process inherently involves two passes: a forward pass to generate predictions and a backward pass to propagate error signals and calculate gradients.

Core Steps:

  1. Forward Pass: Input data traverses the network, layer by layer. Each neuron applies a weighted sum of its inputs, adds a bias, and passes the result through an activation function to produce an output.
  2. Loss Computation: A loss function measures the error between the network's final output and the ground truth.
  3. Backward Pass (Gradient Computation): Starting from the output layer, the error is propagated backward. Using the chain rule of calculus, the algorithm calculates how much each weight and bias contributed to the overall error (i.e., the gradient).
  4. Parameter Update: An optimization algorithm utilizes these computed gradients to adjust the weights and biases, moving them in a direction that reduces the loss.

This cycle repeats, allowing the network to progressively refine its internal parameters and improve its predictive accuracy.
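The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration for a toy one-hidden-layer network with sigmoid activations and a squared-error loss; the architecture, learning rate, and training point are assumptions chosen for brevity, not part of any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network: 2 inputs -> 3 hidden units (sigmoid) -> 1 output (sigmoid).
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = np.array([0.5, -1.0]), np.array([1.0])  # a single (input, target) pair
lr = 0.1
losses = []

for step in range(100):
    # 1. Forward pass: weighted sum + bias, then activation, layer by layer.
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)

    # 2. Loss computation: squared error between output and target.
    loss = 0.5 * np.sum((a2 - y) ** 2)
    losses.append(loss)

    # 3. Backward pass: chain rule, starting at the output layer.
    #    sigmoid'(z) = a * (1 - a)
    delta2 = (a2 - y) * a2 * (1 - a2)          # dL/dz2
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # dL/dz1, propagated backward

    # 4. Parameter update: plain gradient descent.
    W2 -= lr * np.outer(delta2, a1)
    b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x)
    b1 -= lr * delta1
```

Each pass through the loop is one full cycle: after enough iterations the recorded losses shrink as the parameters converge toward values that fit the training point.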

```mermaid
graph LR
  Center["Backpropagation"]:::main
  Pre_linear_algebra["linear-algebra"]:::pre --> Center
  click Pre_linear_algebra "/terms/linear-algebra"
  Rel_gradient_descent["gradient-descent"]:::related -.-> Center
  click Rel_gradient_descent "/terms/gradient-descent"
  Rel_deep_learning["deep-learning"]:::related -.-> Center
  click Rel_deep_learning "/terms/deep-learning"
  classDef main fill:#7c3aed,stroke:#8b5cf6,stroke-width:2px,color:white,font-weight:bold,rx:5,ry:5;
  classDef pre fill:#0f172a,stroke:#3b82f6,color:#94a3b8,rx:5,ry:5;
  classDef child fill:#0f172a,stroke:#10b981,color:#94a3b8,rx:5,ry:5;
  classDef related fill:#0f172a,stroke:#8b5cf6,stroke-dasharray: 5 5,color:#94a3b8,rx:5,ry:5;
  linkStyle default stroke:#4b5563,stroke-width:2px;
```

🧒 Explain Like I'm 5

Think of training a [neural network](/en/terms/neural-network) like teaching a student a difficult subject. The student tries to answer a question (forward pass), gets a score (loss), and you tell them exactly *how* they were wrong and by how much for each concept they used (backward pass). They then adjust their understanding of those concepts (weight update) to do better next time. Backpropagation is this precise feedback mechanism for AI, allowing it to learn by correcting its own errors.

🤓 Expert Deep Dive

Backpropagation is an efficient algorithm for computing the gradient $\nabla_{\theta} L(\theta; D)$ of a loss function $L$ with respect to the parameters $\theta$ (weights $W$ and biases $b$) of a neural network, given a dataset $D = \{(x^{(i)}, y^{(i)})\}$. It utilizes the chain rule of calculus to decompose the gradient computation into a sequence of local operations performed layer by layer.

Let $a^{(l)}$ denote the activation vector of layer $l$, and $z^{(l)} = W^{(l)}a^{(l-1)} + b^{(l)}$ the pre-activation vector. The output of layer $l$ is $a^{(l)} = f(z^{(l)})$, where $f$ is the activation function. The loss $L$ is typically a function of the final output $a^{(L)}$ and the target $y$. The backward pass computes an error term $\delta^{(l)}$ for each layer $l$, starting from the output layer. For the output layer $L$, $\delta^{(L)} = \frac{\partial L}{\partial a^{(L)}} \odot f'(z^{(L)})$, where $\odot$ denotes element-wise multiplication and $f'$ is the derivative of the activation function. For hidden layers $l < L$, the error term is computed recursively: $\delta^{(l)} = ((W^{(l+1)})^T \delta^{(l+1)}) \odot f'(z^{(l)})$. The gradients of the loss with respect to the parameters of layer $l$ then follow as $\nabla_{W^{(l)}} L = \delta^{(l)} (a^{(l-1)})^T$ and $\nabla_{b^{(l)}} L = \delta^{(l)}$.

These gradients are the essential inputs to optimization algorithms such as Stochastic Gradient Descent (SGD) or Adam, which update $\theta$ to minimize $L$.
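The $\delta^{(l)}$ recursion above translates almost line for line into NumPy, and its correctness can be validated against a centered finite difference of the loss. The tanh activation, layer sizes, and squared-error loss below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(params, x):
    """Forward pass; returns pre-activations z and activations a for each layer."""
    a, zs, acts = x, [], [x]
    for W, b in params:
        z = W @ a + b          # z^(l) = W^(l) a^(l-1) + b^(l)
        a = np.tanh(z)         # a^(l) = f(z^(l)), here f = tanh
        zs.append(z)
        acts.append(a)
    return zs, acts

def loss(params, x, y):
    """Squared-error loss L = 0.5 * ||a^(L) - y||^2."""
    _, acts = forward(params, x)
    return 0.5 * np.sum((acts[-1] - y) ** 2)

def backprop(params, x, y):
    """Analytic gradients via the delta recursion; tanh'(z) = 1 - tanh(z)^2."""
    zs, acts = forward(params, x)
    grads = [None] * len(params)
    # Output layer: delta^(L) = dL/da^(L) * f'(z^(L))
    delta = (acts[-1] - y) * (1 - acts[-1] ** 2)
    for l in reversed(range(len(params))):
        # dL/dW^(l) = delta^(l) (a^(l-1))^T,  dL/db^(l) = delta^(l)
        grads[l] = (np.outer(delta, acts[l]), delta)
        if l > 0:
            W = params[l][0]
            # Hidden layers: delta^(l-1) = (W^(l))^T delta^(l) * f'(z^(l-1))
            delta = (W.T @ delta) * (1 - acts[l] ** 2)
    return grads

# Compare one analytic weight gradient against a centered finite difference.
params = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
x, y = rng.normal(size=3), rng.normal(size=2)
analytic = backprop(params, x, y)[0][0][1, 2]

eps = 1e-6
params[0][0][1, 2] += eps
l_plus = loss(params, x, y)
params[0][0][1, 2] -= 2 * eps
l_minus = loss(params, x, y)
params[0][0][1, 2] += eps          # restore the perturbed weight
numeric = (l_plus - l_minus) / (2 * eps)
```

A gradient check of this kind is standard practice when implementing backpropagation by hand: `analytic` and `numeric` should agree to several decimal places, and a mismatch almost always indicates an indexing or derivative error in the recursion.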

🔗 Related Terms

Prerequisites: [linear-algebra](/terms/linear-algebra)

Related: [gradient-descent](/terms/gradient-descent), [deep-learning](/terms/deep-learning)