Gradient Descent

A first-order iterative optimization algorithm for finding a local minimum of a differentiable function.

Used widely in linear regression, logistic regression, and neural networks, Gradient Descent is the engine that lets AI models learn from data. The algorithm's effectiveness depends heavily on the learning rate hyperparameter: if it is too high, the model may oscillate or diverge and fail to converge; if it is too low, convergence is slow and training becomes computationally expensive. Modern deep learning frameworks (TensorFlow, PyTorch) compute these gradients automatically via automatic differentiation.
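The learning-rate tradeoff can be seen on a toy problem. This is a minimal sketch (not from the source, names are illustrative), minimizing f(x) = x², whose gradient is 2x and whose minimum is at x = 0:

```python
# Minimal gradient descent sketch on f(x) = x^2 (gradient: f'(x) = 2x).
# All names here are illustrative, not from any particular framework.
def gradient_descent(x0, learning_rate, steps):
    x = x0
    for _ in range(steps):
        grad = 2 * x                  # gradient of f(x) = x^2 at x
        x = x - learning_rate * grad  # step downhill
    return x

# A moderate learning rate shrinks x toward the minimum at 0 each step...
print(gradient_descent(x0=5.0, learning_rate=0.1, steps=50))
# ...while a rate above 1.0 overshoots so badly that |x| grows (divergence).
print(gradient_descent(x0=5.0, learning_rate=1.1, steps=50))
```

With rate 0.1 each step multiplies x by 0.8, so the iterate decays toward zero; with rate 1.1 each step multiplies x by −1.2, so it oscillates with growing magnitude.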

```mermaid
graph LR
  Center["Gradient Descent"]:::main
  Rel_backpropagation["backpropagation"]:::related -.-> Center
  click Rel_backpropagation "/terms/backpropagation"
  Rel_neural_network["neural-network"]:::related -.-> Center
  click Rel_neural_network "/terms/neural-network"
  classDef main fill:#7c3aed,stroke:#8b5cf6,stroke-width:2px,color:white,font-weight:bold,rx:5,ry:5;
  classDef pre fill:#0f172a,stroke:#3b82f6,color:#94a3b8,rx:5,ry:5;
  classDef child fill:#0f172a,stroke:#10b981,color:#94a3b8,rx:5,ry:5;
  classDef related fill:#0f172a,stroke:#8b5cf6,stroke-dasharray: 5 5,color:#94a3b8,rx:5,ry:5;
  linkStyle default stroke:#4b5563,stroke-width:2px;
```

🧒 Explain Like I'm 5

🌍 Imagine you are standing on a foggy mountain and want to find the very bottom of the valley. Since you can't see the bottom, you feel the ground with your feet and take a step in whichever direction goes down most steeply. You keep doing this until the ground is flat, meaning you've reached the bottom.

🤓 Expert Deep Dive

Gradient Descent updates parameters (weights) by subtracting the product of the learning rate η and the gradient of the loss function with respect to those parameters: w ← w − η∇L(w). In deep learning this is combined with backpropagation to propagate errors through the layers. Variants like Stochastic Gradient Descent (SGD) introduce randomness by using a single sample (or a small mini-batch) per step, which helps escape local minima and saddle points. Advanced optimizers such as Adam or RMSProp adapt the learning rate for each parameter using momentum and squared-gradient estimates, yielding faster, more stable convergence.
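The per-sample update w ← w − η∇L(w) described above can be sketched as plain SGD on a one-parameter problem. This is a hypothetical illustration (the function and parameter names are assumptions, not from the source), fitting y = 2x with squared-error loss L(w) = (w·x − y)², whose gradient is 2(w·x − y)·x:

```python
import random

# Hypothetical SGD sketch: fit the slope w of y = w*x to data generated
# with true slope 2.0, using one randomly chosen sample per update step.
def sgd_fit(data, eta=0.05, epochs=100, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        x, y = rng.choice(data)      # stochastic part: one sample per step
        grad = 2 * (w * x - y) * x   # gradient of the per-sample loss
        w -= eta * grad              # update: w <- w - eta * grad
    return w

data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]
print(sgd_fit(data))  # w should approach the true slope 2.0
```

Each step uses only one sample's gradient, so individual updates are noisy, but on average they point toward the minimum; that noise is what helps SGD move past saddle points on non-convex losses.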

📚 Sources