# Inference
Inference is the process of using a trained machine learning model to make predictions or decisions based on new, unseen data.
Key metrics: TTFT (Time to First Token) and tokens per second. Common optimizations: distillation, quantization, and KV caching.
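As a minimal sketch of how these metrics are measured in practice, the snippet below times a hypothetical streaming generator (`stream_tokens` is a stand-in for a real model's streaming API): TTFT is the delay until the first token arrives, and throughput is total tokens divided by total wall-clock time.

```python
import time

def stream_tokens():
    # Hypothetical stand-in for a model's streaming generate() call.
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.01)  # simulate per-token decode latency
        yield token

start = time.perf_counter()
first_token_at = None
count = 0
for token in stream_tokens():
    if first_token_at is None:
        first_token_at = time.perf_counter()  # TTFT endpoint
    count += 1
end = time.perf_counter()

ttft = first_token_at - start            # Time to First Token (seconds)
tps = count / (end - start)              # tokens per second over the response
print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {tps:.1f} tokens/s")
```

Real serving stacks report these same two numbers; TTFT dominates perceived responsiveness in chat UIs, while tokens per second dominates cost per request.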
```mermaid
graph LR
Center["Inference"]:::main
classDef main fill:#7c3aed,stroke:#8b5cf6,stroke-width:2px,color:white,font-weight:bold,rx:5,ry:5;
classDef pre fill:#0f172a,stroke:#3b82f6,color:#94a3b8,rx:5,ry:5;
classDef child fill:#0f172a,stroke:#10b981,color:#94a3b8,rx:5,ry:5;
classDef related fill:#0f172a,stroke:#8b5cf6,stroke-dasharray: 5 5,color:#94a3b8,rx:5,ry:5;
linkStyle default stroke:#4b5563,stroke-width:2px;
```
## 🧒 Explain Like I'm 5
Training is when you teach a chef how to cook. [Inference](/en/terms/inference) is the chef actually cooking a meal for you. The chef isn't learning new recipes while cooking; they're just using the skills they already have to make your dinner.
## 🤓 Expert Deep Dive
Technically, inference is a *forward pass* through the neural network. Unlike training, it involves no *backpropagation*: the weights stay fixed. The goals are to maximize *throughput* (requests or tokens processed per second) and minimize *latency* (time per request). To make this efficient, developers use *quantization* (reducing numeric precision, e.g. from 32-bit floats to 8-bit integers) and *model pruning* (removing weights or neurons that contribute little to the result). High-performance inference often runs on specialized hardware such as GPUs, TPUs, or ASICs, driven by optimized inference engines like TensorRT or ONNX Runtime.
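To make the quantization idea concrete, here is a minimal sketch of symmetric int8 quantization in plain Python (a real engine like TensorRT does this per-tensor or per-channel with calibration; the function names here are illustrative, not from any library):

```python
def quantize_int8(weights):
    # Symmetric quantization: map floats into [-127, 127] with one scale factor.
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    # Recover approximate float values from the int8 codes.
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.08, 0.93]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

The int8 codes take a quarter of the memory of 32-bit floats and enable faster integer arithmetic, at the cost of a small rounding error bounded by half the scale factor.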