Reinforcement Learning

Reinforcement learning (RL) is a machine learning paradigm in which an agent learns to make decisions in an environment so as to maximize a reward signal.

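In RL, an agent interacts with an environment, takes actions, and receives feedback in the form of rewards or penalties. The agent's goal is to learn a policy, a strategy for choosing actions that maximizes cumulative reward over time. This learning process is often modeled as a Markov decision process (MDP), in which the agent's actions influence the state of the environment and the environment provides rewards based on those state transitions.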
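To make "cumulative reward over time" concrete, here is a minimal sketch of how the total reward for one episode might be computed. The discount factor `gamma`, which down-weights rewards that arrive later, is a common convention assumed here and is not something the text above specifies. The function name and the sample rewards are hypothetical.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum the rewards of one episode, scaling a reward t steps ahead by gamma**t."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: no reward for two steps, then a reward of 1.
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.5))
```

Setting `gamma=1.0` reduces this to a plain sum, which matches the undiscounted reading of "cumulative reward".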

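RL algorithms explore the environment through trial and error and gradually improve the policy. The exploration-exploitation trade-off is central: the agent must balance trying new actions (exploration) against using the knowledge it has already gained (exploitation). A variety of algorithms, such as Q-learning, SARSA, and policy gradient methods, are used to train RL agents. These algorithms update the agent's policy or value function based on the rewards received, steering it toward optimal behavior.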
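As an illustration, the sketch below runs tabular Q-learning with an epsilon-greedy rule on a tiny made-up environment. The five-state chain, the `step` function, and all hyperparameter values are assumptions chosen for brevity, not part of any particular library or benchmark.

```python
import random

N_STATES = 5      # states 0..4; reaching state 4 ends the episode with reward 1
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Hypothetical chain environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon, otherwise exploit current estimates.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                # Break ties randomly so the untrained agent does not get stuck.
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Q-learning bootstraps from the best action in the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)  # one action per non-terminal state
```

SARSA would differ only in the target line: it would use the value of the action actually taken next rather than the maximum over next actions.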

        graph LR
  Center["Reinforcement Learning"]:::main
  Pre_machine_learning["machine-learning"]:::pre --> Center
  click Pre_machine_learning "/terms/machine-learning"
  Rel_deep_learning["deep-learning"]:::related -.-> Center
  click Rel_deep_learning "/terms/deep-learning"
  Rel_game_theory["game-theory"]:::related -.-> Center
  click Rel_game_theory "/terms/game-theory"
  Rel_logistic_regression["logistic-regression"]:::related -.-> Center
  click Rel_logistic_regression "/terms/logistic-regression"
  classDef main fill:#7c3aed,stroke:#8b5cf6,stroke-width:2px,color:white,font-weight:bold,rx:5,ry:5;
  classDef pre fill:#0f172a,stroke:#3b82f6,color:#94a3b8,rx:5,ry:5;
  classDef child fill:#0f172a,stroke:#10b981,color:#94a3b8,rx:5,ry:5;
  classDef related fill:#0f172a,stroke:#8b5cf6,stroke-dasharray: 5 5,color:#94a3b8,rx:5,ry:5;
  linkStyle default stroke:#4b5563,stroke-width:2px;

🧒 Explain Like I'm 5

🎮 Training a computer program like a puppy: rewarding good behavior and ignoring bad behavior until it learns to be helpful.

🤓 Expert Deep Dive

## RLHF: Aligning Human and Machine
Reinforcement Learning from Human Feedback (RLHF) is a key technique behind modern chatbots like ChatGPT. Since it is impractical to write a mathematical formula for 'a good, helpful answer,' humans are shown pairs of model answers and asked to rank them. A separate 'Reward Model' is then trained to predict these rankings, and its scores serve as the reward signal when the LLM is fine-tuned with RL, guiding it toward safe and helpful output.
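One common way to train on ranked pairs is a pairwise (Bradley-Terry style) loss, which is small when the human-preferred answer gets the higher score. The sketch below assumes that objective. The text above does not say which loss is used, and `preference_loss` and the sample scores are hypothetical.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the human-preferred answer scores higher.

    Computes -log(sigmoid(reward_chosen - reward_rejected)) in a
    numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) rewritten so exp() never sees a large positive input.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical scores a reward model might assign to two answers.
print(preference_loss(2.0, 0.0))  # preferred answer scored higher: small loss
print(preference_loss(0.0, 2.0))  # preferred answer scored lower: larger loss
```

In a full pipeline the two scores would come from a learned model and the loss would be minimized by gradient descent. Here they are plain numbers to keep the example self-contained.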