PPO Algorithm (for the Leatherback Project)
Paper: https://arxiv.org/pdf/1707.06347
Intro: High-Level Overview of Reinforcement Learning and Its Approaches
1. Reinforcement Learning (RL)
Reinforcement Learning is a framework where an agent learns to make decisions by interacting with an environment. The core idea is to learn a policy—a strategy for choosing actions—by maximizing the cumulative reward over time. Key aspects include:
- Trial and Error: The agent explores the environment and learns from the consequences of its actions.
- Reward-Based Learning: Instead of relying on labeled data, the agent receives rewards (or penalties) based on the outcomes of its actions.
- Exploration vs. Exploitation: The agent must balance trying new actions (exploration) and leveraging known rewarding actions (exploitation).
RL can be divided into two broad categories:
- Model-Free RL: The agent learns directly from interactions without an internal model of the environment. Examples: PPO, Q-Learning.
- Model-Based RL: The agent builds a model of the environment to plan its actions.
2. Approaches to Solving RL Problems
Value-Based Methods
These methods focus on estimating the value (or quality) of states or state-action pairs.
- Examples: Q-Learning, Deep Q-Networks (DQN).
- Key Idea: Learn a value function that predicts future rewards, then derive the policy indirectly by choosing actions that maximize this value.
Policy Gradient Methods
Policy gradient methods take a direct approach by optimizing the policy itself.
- Direct Optimization: They adjust policy parameters by computing gradients of a performance measure (often using the likelihood ratio trick).
- Advantage Estimation: They use estimates of the advantage—how much better an action is compared to the average—to guide updates.
- Flexibility: These methods naturally handle continuous and stochastic action spaces.
Actor-Critic Methods
These methods blend the strengths of both value-based and policy gradient approaches.
- Actor: The component that directly updates the policy.
- Critic: The component that evaluates the current policy by estimating value functions.
- Benefit: This combination often leads to more stable and efficient learning.
3. Proximal Policy Optimization (PPO)
PPO is a modern algorithm that builds on policy gradient methods while addressing some of their stability challenges. Key features include:
- Data Collection: The agent gathers experiences by interacting with the environment, generating trajectories of states, actions, and rewards.
- Surrogate Objective: Instead of optimizing the true expected reward (which is difficult to compute), PPO optimizes a surrogate objective. This objective uses the ratio of new to old policy probabilities, weighted by an advantage estimate.
- Proximal Updates: PPO employs a clipping mechanism that limits the size of policy updates. This ensures that the new policy remains "proximal" or close to the current policy, preventing drastic changes that could destabilize learning.
Quick Intro to the Leatherback Project
Leatherback - Community Project
python scripts\reinforcement_learning\skrl\train.py --task Isaac-Leatherback-Direct-v0 --num_envs 32
tensorboard --logdir=logs\skrl\leatherback_direct\2025-03-24_17-22-19_ppo_torch
- 4 Revolute Joints for the Wheels (moving forward-backward)
- 2 Revolute Joints for the Knuckles (steering left-right)
PPO - Neural Networks: Actor - Critic
Policy Network (Actor) - “The Decision Maker”
Observations from the Environment
The Policy Network takes in 8 observations:
- Position Error
- Cosine of Target Heading Error
- Sine of Target Heading Error
- Linear Velocity X
- Linear Velocity Y
- Angular Velocity Z
- Current Throttle State
- Current Steering State
These observations are fed into our neural network’s shared layers.
Action Space
After processing the observations, the network outputs 2 continuous action values:
Throttle Control
- Purpose: Controls the speed of all four wheels.
Steering Control
- Purpose: Controls the steering angle of the front wheels.
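To make the actor concrete, here is a minimal PyTorch sketch of how the 8 observations map through shared layers to the 2 action means. The layer sizes and activations are illustrative assumptions, not the exact architecture from the project's skrl config.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    # Illustrative actor: 8 observations in, 2 action means out (throttle, steering).
    def __init__(self, obs_dim=8, act_dim=2, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.mean_head = nn.Linear(hidden, act_dim)        # outputs [mu_throttle, mu_steering]
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # one learnable log_std per action

    def forward(self, obs):
        mu = self.mean_head(self.shared(obs))
        return mu, self.log_std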
Gaussian Policy Distribution
Instead of directly outputting actions, our Policy Network provides parameters for a Gaussian distribution for each action:
- Mean (μ): The network's "best guess" for each action.
Example:
- For throttle: μ_throttle (e.g., +0.3) means a small forward movement (positive).
- For steering: μ_steering (e.g., -0.2) means a small turn to the left (negative).
- Log Standard Deviation (log_std): A learnable parameter for each action.
- One for throttle: Controls exploration in speed.
- One for steering: Controls exploration in turning.
Example Calculation
- Network Output: The network outputs a mean (μ) for each action.
- Learnable Parameter: The log_std is learned and later exponentiated to yield the standard deviation (σ = exp(log_std)).
- Action Sampling: The final action is sampled from the Gaussian distribution defined by μ and σ: action ∼ N(μ, σ²).
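A small sketch of this sampling step; the μ and log_std numbers are the example values from above, not outputs of the trained network.

import torch

mu = torch.tensor([0.3, -0.2])       # network output: [throttle mean, steering mean]
log_std = torch.tensor([0.0, 0.0])   # learnable parameters (example values)
std = log_std.exp()                  # sigma = exp(log_std) -> [1.0, 1.0]

dist = torch.distributions.Normal(mu, std)
action = dist.sample()                    # action ~ N(mu, sigma^2), one sample per action dimension
log_prob = dist.log_prob(action).sum(-1)  # joint log probability, summed over throttle and steering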
Exploration vs. Exploitation
- Early Training:
- Higher log_std values (e.g., 0.0 → σ = 1.0)
- Effect: More exploration; the robot tries varied combinations of speed and steering to discover effective actions.
- Late Training:
- Lower log_std values (e.g., -2.0 → σ ≈ 0.135)
- Effect: More exploitation; the robot makes precise, confident movements, crucial for accurate waypoint navigation.
Value Network (Critic) - “The Situation Evaluator”
Input & Output
- Input: Same 8 observations as the Policy Network.
- Output: A single scalar value, V(s), estimating the expected sum of future rewards.
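A matching critic sketch follows; as with the actor, the hidden layer sizes are illustrative rather than the project's exact configuration.

import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    # Illustrative critic: the same 8 observations in, one scalar V(s) out.
    def __init__(self, obs_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)  # V(s): estimated expected sum of future rewards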
Reward Components Considered
- Position Progress Reward: Weighted by 1.0.
- Heading Alignment Reward: Weighted by 0.05.
- Goal Reached Bonus: 10.0
Example:
- If V(s) = 15, the network predicts:
- Likely reaching at least one waypoint (bonus of 10).
- Approximately 5 units accumulated from progress/heading rewards.
PPO Update Process
1. Data Collection Phase
During each rollout, the agent collects experiences by interacting with the environment:
- Observations: The current observations are read from each parallel environment.
- Action Sampling: The Policy Network uses these observations to sample actions and returns the log probabilities of those actions.
- Environment Interaction: The actions are executed in the environment, which returns the next state and a reward.
- Storage: All data (observations, actions, rewards, next observations) is stored for the update phase.
Example:
For one rollout, if 32 steps are taken in 4096 parallel environments, you collect a total of 32×4096 = 131,072 samples.
Since decimation is set to 4 in the code, each environment step corresponds to 4 physics steps.
Important:
The data collection uses the current policy parameters but DOESN'T update them during collection. This separation of collection and update is key to PPO's stability.
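A schematic version of this collection phase, written against hypothetical env, policy, and critic objects rather than the actual skrl trainer API:

import torch

def collect_rollout(env, policy, critic, num_steps=32):
    # Collection phase: the current policy acts, but its parameters are NOT updated here.
    buffer = []
    obs = env.reset()                      # vectorized observations, e.g. shape (4096, 8)
    for _ in range(num_steps):             # 32 steps x 4096 envs = 131,072 samples
        with torch.no_grad():
            action, log_prob = policy.act(obs)
            value = critic(obs)
        next_obs, reward, done, info = env.step(action)  # 1 env step = 4 physics steps (decimation = 4)
        buffer.append((obs, action, log_prob, reward, done, value))
        obs = next_obs
    return buffer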
2. Epochs & Mini-Batches in PPO
After each rollout (collecting 131,072 samples), PPO processes the data as follows:
- Mini-Batches: The data is split into 8 mini-batches (each with ~16,384 samples) to improve stability and reduce memory usage.
- Epochs: The same data is reused for 8 epochs (8 passes over each mini-batch).
Result: 8 epochs × 8 mini-batches = 64 updates per rollout.
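Sketched as a loop; ppo_update is a hypothetical placeholder for the actual gradient step.

import torch

num_samples = 32 * 4096                    # 131,072 samples per rollout
mini_batches = 8
epochs = 8
batch_size = num_samples // mini_batches   # 16,384 samples per mini-batch

update_count = 0
for epoch in range(epochs):
    perm = torch.randperm(num_samples)             # reshuffle the rollout data each epoch
    for start in range(0, num_samples, batch_size):
        batch_idx = perm[start:start + batch_size]
        # ppo_update(buffer, batch_idx)            # hypothetical: one gradient update per mini-batch
        update_count += 1
print(update_count)                                # 8 epochs x 8 mini-batches = 64 updates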
3. Temporal Difference (TD) Error
Formula:
δₜ = rₜ + γ V(sₜ₊₁) − V(sₜ)
In plain words:
- rₜ: The immediate reward received after taking an action.
- γ (discount_factor = 0.99): Sets how much future rewards are valued compared to immediate rewards.
- γ V(sₜ₊₁): The predicted value of the next state, discounted by γ (gamma).
- V(sₜ): The expected value of the current state.
Explanation:
This formula calculates the error between the outcome (immediate reward plus the discounted future potential) and the expectation. A positive δₜ means the outcome was better than expected; a negative δₜ means it was worse.
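A worked example of the same formula; the reward and value numbers are chosen purely for illustration.

import torch

gamma = 0.99                         # discount_factor
reward = torch.tensor([0.5])         # r_t
value = torch.tensor([14.0])         # V(s_t)
value_next = torch.tensor([14.8])    # V(s_{t+1})
done = torch.tensor([0.0])           # 1.0 if the episode ended at this step

delta = reward + gamma * value_next * (1.0 - done) - value
# delta = 0.5 + 0.99 * 14.8 - 14.0 = 1.152 -> the outcome was better than expected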
4. Advantage Estimation (using Generalized Advantage Estimation, GAE)
Formula:
Aₜ = δₜ + γλ Aₜ₊₁
In plain words:
- Aₜ: The advantage at the current timestep, showing how much better or worse an action performed compared to expectations.
- δₜ: The immediate error from the TD Error calculation.
- λ (lambda = 0.95): Balances bias and variance in the Generalized Advantage Estimation (GAE) by controlling the contribution of future advantages.
- γλ Aₜ₊₁: A discounted sum of future advantages, where λ (lambda) determines the influence of future errors.
Explanation:
This recursive formula adds immediate and future errors to provide a balanced measure of an action’s overall performance.
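The recursion is typically evaluated backwards over the rollout. A minimal sketch is shown below; the done mask that resets the recursion at episode boundaries is an implementation detail not visible in the formula itself.

import torch

def compute_gae(deltas, dones, gamma=0.99, lam=0.95):
    # A_t = delta_t + gamma * lambda * A_{t+1}, computed from the last step back to the first.
    advantages = torch.zeros_like(deltas)
    next_advantage = 0.0
    for t in reversed(range(len(deltas))):
        next_advantage = deltas[t] + gamma * lam * (1.0 - dones[t]) * next_advantage
        advantages[t] = next_advantage
    return advantages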
Advantage Calculation
- Definition:
A = Actual_Return − V(s)
- Positive Advantage:
Indicates the action performed better than expected.
Example: A = +5 might mean a more efficient path was found.
- Negative Advantage:
Indicates the action performed worse than expected.
Example: A = −3 might suggest the robot steered too sharply and lost time.
This advantage signal helps the Policy Network update its actions relative to the baseline prediction of the Value Network.
Loss Function Used to Update the Network Parameters
5. Probability Ratio for Policy Updates
Formula:
rₜ(θ) = π_new(aₜ|sₜ) / π_old(aₜ|sₜ)
In plain words:
- This ratio compares how likely the new policy is to take a certain action versus the old policy.
- A ratio of 1 indicates no change in likelihood.
- A ratio greater than 1 means the new policy favors that action more, while less than 1 means it favors it less.
Explanation:
This comparison is used to measure the change in policy, which is crucial for adjusting the network without making overly large updates.
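In practice the ratio is computed from stored log probabilities for numerical stability; the log-probability values below are illustrative.

import torch

old_log_prob = torch.tensor([-1.20])   # log pi_old(a_t | s_t), stored during the rollout
new_log_prob = torch.tensor([-1.10])   # log pi_new(a_t | s_t), recomputed with current parameters
ratio = torch.exp(new_log_prob - old_log_prob)  # ~1.105: the new policy favors this action slightly more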
6. Clipping Mechanism in PPO
Concept:
During training, the probability ratio is clipped within a small range (typically 1 ± ε).
In plain words:
- This prevents the policy from changing too drastically in a single update.
- It ensures that updates are gradual, which helps maintain stable training.
Explanation:
By clipping the ratio, we avoid large swings that could destabilize the learning process, thereby promoting smoother policy improvements.
Example:
With ε = 0.2:
- If the ratio > 1.2: it is clipped down to 1.2.
- If the ratio < 0.8: it is clipped up to 0.8.
This prevents the steering or throttle actions from changing too drastically in one update.
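The clipped surrogate (policy) loss, sketched with illustrative ratios and advantages:

import torch

epsilon = 0.2
ratio = torch.tensor([1.4, 0.7, 1.05])       # example probability ratios
advantage = torch.tensor([2.0, -1.0, 0.5])   # example advantages

surr1 = ratio * advantage
surr2 = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
policy_loss = -torch.min(surr1, surr2).mean()  # negative sign: gradient descent maximizes the surrogate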
7. Value Loss
Formula:
Value Loss = Mean Squared Error between predicted V(s) and actual returns
In plain words:
- It measures the error between what the Value Network predicts and the actual rewards received.
- The goal is to reduce this error so that the network better predicts future rewards.
Explanation:
Minimizing the value loss helps the Value Network become more accurate, which in turn provides a better baseline for policy updates.
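For example, with illustrative predictions and returns:

import torch

predicted_values = torch.tensor([14.0, 9.5])   # V(s) predicted by the critic
actual_returns = torch.tensor([15.0, 8.0])     # returns actually observed for the same states
value_loss = torch.nn.functional.mse_loss(predicted_values, actual_returns)  # mean squared error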
8. Total Loss for PPO
Formula:
Total Loss = Policy Loss + (Value Coefficient × Value Loss) − (Entropy Coefficient × Entropy Bonus)
In plain words:
- Policy Loss: Encourages actions that perform better than expected by using the probability ratio and advantage.
- Value Loss: Improves the accuracy of the state-value predictions.
- Entropy Bonus: Promotes exploration by rewarding randomness in the policy.
- The coefficients balance how much each component influences the overall loss.
Explanation:
This combined loss ensures that the policy learns to select better actions while also fine-tuning the value predictions and maintaining a level of exploration necessary for robust learning.
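Putting the pieces together; the coefficients and loss values below are placeholders for illustration, not the project's configured values.

import torch

value_coef = 0.5                    # placeholder weighting for the value loss
entropy_coef = 0.01                 # placeholder weighting for the entropy bonus

policy_loss = torch.tensor(0.02)    # from the clipped surrogate objective
value_loss = torch.tensor(1.25)     # MSE between V(s) and actual returns
entropy_bonus = torch.tensor(1.40)  # mean entropy of the Gaussian policy

total_loss = policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus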
9. Loss Update
The total loss serves as the combined objective that the PPO algorithm minimizes during training. It integrates several components so that both the policy network and the value network are improved simultaneously. Here’s what it does:
- Guides Policy Improvements: The policy loss (using the probability ratio and advantage) encourages the network to favor actions that perform better than expected.
- Enhances Value Predictions: The value loss minimizes the error between the network's predicted state values and the actual returns, helping the critic become more accurate.
- Maintains Exploration: The entropy bonus rewards the network for maintaining some randomness, which prevents the policy from becoming overly deterministic too quickly.
- Balances All Components: The coefficients (for value loss and entropy bonus) control the relative importance of each component in the overall loss.
- Learning Rate & KL Divergence: The learning rate (η) determines how large the update steps are during gradient descent. PPO uses a KL Adaptive Learning Rate Scheduler, which monitors the KL divergence, a measure of how much the new policy differs from the old one.
- KL Divergence: A statistical measure of the difference between two probability distributions. In PPO, a low KL divergence means the new policy remains close to the old policy, while a high KL divergence indicates a significant change.
- Integration with Learning Rate: If the KL divergence exceeds a preset threshold, the learning rate is reduced to prevent overly large updates, ensuring that the policy changes gradually. Conversely, if the KL divergence is much lower than the target, the learning rate may be increased to allow for faster learning.
By minimizing this total loss (using gradient descent) and adjusting the learning rate based on KL divergence, the PPO algorithm updates the network parameters to both maximize rewards (via the policy network) and accurately estimate future returns (via the value network), all while ensuring stable and gradual policy improvements.
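A sketch of such a KL-adaptive rule is given below; the target, thresholds, and scaling factors are assumptions for illustration and not the project's configured values.

def kl_adaptive_lr(lr, kl_divergence, kl_target=0.008,
                   lr_factor=1.5, min_lr=1e-6, max_lr=1e-2):
    # If the policy moved too far from the old one, shrink the step size;
    # if it barely moved, allow larger steps.
    if kl_divergence > kl_target * 2.0:
        lr = max(lr / lr_factor, min_lr)
    elif kl_divergence < kl_target * 0.5:
        lr = min(lr * lr_factor, max_lr)
    return lr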