# Inference
Inference is the process of using a trained machine learning model to make predictions or decisions based on new, unseen data.
Key metrics: TTFT (Time to First Token) and tokens per second. Common optimizations: distillation, quantization, and KV caching.
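As a minimal sketch of how these metrics are measured in practice, the snippet below times a hypothetical streaming generator (`stream_tokens` is a stand-in for a real model's streaming API): TTFT is the delay until the first token arrives, and throughput is total tokens divided by total wall-clock time.

```python
import time

def stream_tokens():
    # Hypothetical stand-in for a model's streaming generate() call.
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.01)  # simulate per-token decode latency
        yield token

start = time.perf_counter()
first_token_at = None
count = 0
for token in stream_tokens():
    if first_token_at is None:
        first_token_at = time.perf_counter()  # TTFT endpoint
    count += 1
end = time.perf_counter()

ttft = first_token_at - start            # Time to First Token (seconds)
tps = count / (end - start)              # tokens per second over the response
print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {tps:.1f} tokens/s")
```

Real serving stacks report these same two numbers; TTFT dominates perceived responsiveness in chat UIs, while tokens per second dominates cost per request.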
```mermaid
graph LR
Center["Inference"]:::main
classDef main fill:#7c3aed,stroke:#8b5cf6,stroke-width:2px,color:white,font-weight:bold,rx:5,ry:5;
classDef pre fill:#0f172a,stroke:#3b82f6,color:#94a3b8,rx:5,ry:5;
classDef child fill:#0f172a,stroke:#10b981,color:#94a3b8,rx:5,ry:5;
classDef related fill:#0f172a,stroke:#8b5cf6,stroke-dasharray: 5 5,color:#94a3b8,rx:5,ry:5;
linkStyle default stroke:#4b5563,stroke-width:2px;
```
## 🧒 Explain Like I'm 5
Training is when you teach a chef how to cook. [Inference](/en/terms/inference) is the chef actually cooking a meal for you. The chef isn't learning new recipes while cooking; they're just using the skills they already have to make your dinner.
## 🤓 Expert Deep Dive
Technically, inference is a *forward pass* through the neural network. Unlike training, it involves no *backpropagation*: the weights stay fixed. The goals are to maximize *throughput* (requests or tokens processed per second) and minimize *latency* (time per request). To make this efficient, developers use *quantization* (reducing numeric precision, e.g. from 32-bit floats to 8-bit integers) and *model pruning* (removing weights or neurons that contribute little to the result). High-performance inference often runs on specialized hardware such as GPUs, TPUs, or ASICs, driven by optimized inference engines like TensorRT or ONNX Runtime.
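To make the quantization idea concrete, here is a minimal sketch of symmetric int8 quantization in plain Python (a real engine like TensorRT does this per-tensor or per-channel with calibration; the function names here are illustrative, not from any library):

```python
def quantize_int8(weights):
    # Symmetric quantization: map floats into [-127, 127] with one scale factor.
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    # Recover approximate float values from the int8 codes.
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.08, 0.93]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

The int8 codes take a quarter of the memory of 32-bit floats and enable faster integer arithmetic, at the cost of a small rounding error bounded by half the scale factor.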