RL Algorithm - PPO (Proximal Policy Optimization)

PPO Algorithm (for the Leatherback Project)

Paper: https://arxiv.org/pdf/1707.06347

Intro: High-Level Overview of Reinforcement Learning and Its Approaches

1. Reinforcement Learning (RL)

Reinforcement Learning is a framework where an agent learns to make decisions by interacting with an environment. The core idea is to learn a policy—a strategy for choosing actions—by maximizing the cumulative reward over time. Key aspects include:

  • Trial and Error: The agent explores the environment and learns from the consequences of its actions.
  • Reward-Based Learning: Instead of relying on labeled data, the agent receives rewards (or penalties) based on the outcomes of its actions.
  • Exploration vs. Exploitation: The agent must balance trying new actions (exploration) and leveraging known rewarding actions (exploitation).

RL can be divided into two broad categories:

  • Model-Free RL: The agent learns directly from interactions without an internal model of the environment (e.g., PPO, Q-Learning).
  • Model-Based RL: The agent builds a model of the environment to plan its actions.

2. Approaches to Solving RL Problems

Value-Based Methods

These methods focus on estimating the value (or quality) of states or state-action pairs.

  • Examples: Q-Learning, Deep Q-Networks (DQN).
  • Key Idea: Learn a value function that predicts future rewards, then derive the policy indirectly by choosing actions that maximize this value.

Policy Gradient Methods

Policy gradient methods take a direct approach by optimizing the policy itself.

  • Direct Optimization: They adjust policy parameters by computing gradients of a performance measure (often using the likelihood ratio trick).
  • Advantage Estimation: They use estimates of the advantage—how much better an action is compared to the average—to guide updates.
  • Flexibility: These methods naturally handle continuous and stochastic action spaces.

Actor-Critic Methods

These methods blend the strengths of both value-based and policy gradient approaches.

  • Actor: The component that directly updates the policy.
  • Critic: The component that evaluates the current policy by estimating value functions.
  • Benefit: This combination often leads to more stable and efficient learning.

3. Proximal Policy Optimization (PPO)

PPO is a modern algorithm that builds on policy gradient methods while addressing some of their stability challenges. Key features include:

  • Data Collection: The agent gathers experiences by interacting with the environment, generating trajectories of states, actions, and rewards.
  • Surrogate Objective: Instead of optimizing the true expected reward (which is difficult to compute), PPO optimizes a surrogate objective. This objective uses the ratio of new to old policy probabilities, weighted by an advantage estimate.
  • Proximal Updates: PPO employs a clipping mechanism that limits the size of policy updates. This ensures that the new policy remains "proximal" or close to the current policy, preventing drastic changes that could destabilize learning.

Quick Intro to the Leatherback Project

Leatherback - Community Project

Train the policy and monitor progress with:

python scripts\reinforcement_learning\skrl\train.py --task Isaac-Leatherback-Direct-v0 --num_envs 32
tensorboard --logdir=logs\skrl\leatherback_direct\2025-03-24_17-22-19_ppo_torch

The Leatherback vehicle is actuated by:

  • 4 Revolute Joints for the Wheels (moving forward-backward)
  • 2 Revolute Joints for the Knuckles (steering left-right)
image

PPO - Neural Networks: Actor - Critic

Policy Network (Actor) - “The Decision Maker”

Observations from the Environment

The Policy Network takes in 8 observations:

  • Position Error
  • Cosine of Target Heading Error
  • Sine of Target Heading Error
  • Linear Velocity X
  • Linear Velocity Y
  • Angular Velocity Z
  • Current Throttle State
  • Current Steering State

These observations are fed into our neural network’s shared layers.
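
To make this concrete, here is a minimal sketch of how such an 8-dimensional observation vector could be assembled with PyTorch. The tensor names are illustrative placeholders, not the actual Leatherback environment attributes; the heading error is encoded as cosine and sine so the network never sees the ±π wrap-around.

import torch

# Hypothetical per-environment quantities (names are illustrative placeholders).
num_envs = 32
position_error = torch.rand(num_envs, 1)   # distance to the current waypoint
heading_error = torch.rand(num_envs, 1)    # angle to the current waypoint (rad)
lin_vel_xy = torch.rand(num_envs, 2)       # linear velocity x, y
ang_vel_z = torch.rand(num_envs, 1)        # angular velocity around z
throttle_state = torch.rand(num_envs, 1)   # last applied throttle action
steering_state = torch.rand(num_envs, 1)   # last applied steering action

# Concatenate into the 8-dimensional observation fed to the shared layers.
obs = torch.cat(
    (
        position_error,
        torch.cos(heading_error),
        torch.sin(heading_error),
        lin_vel_xy,
        ang_vel_z,
        throttle_state,
        steering_state,
    ),
    dim=-1,
)  # shape: (num_envs, 8)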

image

Action Space

After processing the observations, the network outputs 2 continuous action values:

Throttle Control

  • Purpose: Controls the speed of all four wheels.

Steering Control

  • Purpose: Controls the steering angle of the front wheels.

Gaussian Policy Distribution

Instead of directly outputting actions, our Policy Network provides parameters for a Gaussian distribution for each action:

  • Mean (μ): The network's "best guess" for each action.
  • Log Standard Deviation (log_std): A learnable parameter for each action.
    • One for throttle: Controls exploration in speed.
    • One for steering: Controls exploration in turning.

💡

Example:

  • For throttle: μ_throttle = +0.3 → a small forward (positive) movement.
  • For steering: μ_steering = −0.2 → a small turn to the left (negative).

Example Calculation

  1. Network Output: The network outputs a mean (μ) for each action.
  2. Learnable Parameter: The log_std is learned and later exponentiated to yield the standard deviation (σ = exp(log_std)).
  3. Action Sampling: The final action is sampled from the Gaussian distribution defined by μ and σ: action ∼ N(μ, σ²).
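
A minimal PyTorch sketch of this sampling step, using the illustrative μ values from the example above and a log_std of 0.0 for both actions:

import torch
from torch.distributions import Normal

# Illustrative values: the means come from the policy head, log_std is a learnable parameter.
mu = torch.tensor([0.3, -0.2])            # [throttle, steering] means
log_std = torch.tensor([0.0, 0.0])        # learnable, state-independent

std = log_std.exp()                       # sigma = exp(log_std) -> [1.0, 1.0]
dist = Normal(mu, std)

action = dist.sample()                    # action ~ N(mu, sigma^2)
log_prob = dist.log_prob(action).sum(-1)  # stored for the PPO probability ratio later

Storing log_prob at sampling time is what later allows the probability ratio in the PPO update to be computed.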

Exploration vs. Exploitation

  • Early Training:
    • Higher log_std values (e.g., 0.0 → σ = 1.0)
    • Effect: More exploration; the robot tries varied combinations of speed and steering to discover effective actions.
  • Late Training:
    • Lower log_std values (e.g., -2.0 → σ ≈ 0.135)
    • Effect: More exploitation; the robot makes precise, confident movements, crucial for accurate waypoint navigation.

Value Network (Critic) - “The Situation Evaluator”

Input & Output

  • Input: Same 8 observations as the Policy Network.
  • Output: A single scalar value, V(s), estimating the expected sum of future rewards.
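
As a rough sketch (hidden-layer sizes and activations here are assumptions, not the project's exact architecture), the Critic is a small MLP that maps the 8 observations to a single scalar:

import torch
import torch.nn as nn

class Critic(nn.Module):
    """Minimal value-network sketch: 8 observations in, one scalar V(s) out."""
    def __init__(self, obs_dim: int = 8, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, 1),   # single scalar: expected sum of future rewards
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)        # shape: (batch, 1)

critic = Critic()
values = critic(torch.rand(32, 8))  # V(s) for a batch of 32 observations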

Reward Components Considered

  • Position Progress Reward: Weighted by 1.0.
  • Heading Alignment Reward: Weighted by 0.05.
  • Goal Reached Bonus: 10.0
💡

Example:

  • If V(s) = 15, the network predicts:
    • Likely reaching at least one waypoint (bonus of 10).
    • Approximately 5 units accumulated from progress/heading rewards.

PPO Update Process

1. Data Collection Phase

During each rollout, the agent collects experiences by interacting with the environment:

  • Observations: The agent reads the current state of each environment.
  • Action Sampling: The Policy Network uses these observations to sample actions and returns the log probabilities of those actions.
  • Environment Interaction: Actions are executed in the environment, which returns the next state and a reward.
  • Storage: All data (observations, actions, rewards, next observations) is stored for the update phase.

💡

Example:

For one rollout, if 32 steps are taken in 4096 parallel environments, you collect a total of 32 × 4096 = 131,072 samples.

Because decimation is set to 4 in the code, each environment step includes 4 physics steps.

💡

Important:

The data collection uses the current policy parameters but DOESN'T update them during collection. This separation of collection and update is key to PPO's stability.
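
Below is a hedged sketch of that collection loop. DummyEnv, the toy policy head, and the tensor shapes are stand-ins that keep the snippet runnable; the real project uses the Isaac Lab environment and skrl's rollout memory instead.

import torch

class DummyEnv:
    """Stand-in for the Isaac Lab environment (illustrative only)."""
    def __init__(self, num_envs=4096, obs_dim=8, act_dim=2):
        self.num_envs, self.obs_dim, self.act_dim = num_envs, obs_dim, act_dim
    def reset(self):
        return torch.rand(self.num_envs, self.obs_dim)
    def step(self, action):
        next_obs = torch.rand(self.num_envs, self.obs_dim)
        reward = torch.rand(self.num_envs)
        done = torch.zeros(self.num_envs, dtype=torch.bool)
        return next_obs, reward, done

env = DummyEnv()
policy_mean = torch.nn.Linear(8, 2)   # toy actor head producing [throttle, steering] means
log_std = torch.zeros(2)              # toy learnable log standard deviation

buffer = []
obs = env.reset()
for step in range(32):                # 32 rollout steps per environment
    with torch.no_grad():             # parameters are frozen during collection
        dist = torch.distributions.Normal(policy_mean(obs), log_std.exp())
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(-1)
    next_obs, reward, done = env.step(action)
    buffer.append((obs, action, log_prob, reward, done))
    obs = next_obs
# 32 steps x 4096 envs = 131,072 samples, gathered before any parameter update.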

2. Epochs & Mini-Batches in PPO

After each rollout (collecting 131,072 samples), PPO processes the data as follows:

  • Mini-Batches: The data is split into 8 mini-batches (16,384 samples each) to improve stability and reduce memory usage.
  • Epochs: The same data is reused for 8 epochs (8 passes over each mini-batch).

    Result: 8 epochs × 8 mini-batches = 64 updates per rollout.
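
A small sketch of that splitting logic (the shuffling strategy here is an assumption; skrl's internal batching may differ):

import torch

total_samples = 32 * 4096                              # 131,072 samples per rollout
num_epochs = 8
num_mini_batches = 8
mini_batch_size = total_samples // num_mini_batches    # 16,384 samples per mini-batch

for epoch in range(num_epochs):
    # Reshuffle the full rollout each epoch, then walk it in mini-batches.
    indices = torch.randperm(total_samples)
    for start in range(0, total_samples, mini_batch_size):
        batch_idx = indices[start:start + mini_batch_size]
        # ... compute the PPO losses on this mini-batch and take one optimizer step
# 8 epochs x 8 mini-batches = 64 gradient updates per rollout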

3. Temporal Difference (TD) Error

Formula:

δₜ = rₜ + γ V(sₜ₊₁) − V(sₜ)

In plain words:

  • rₜ: The immediate reward received after taking an action.
  • γ (discount_factor = 0.99): Sets how much future rewards are valued compared to immediate rewards.
  • γ V(sₜ₊₁): The predicted value of the next state, discounted by γ.
  • V(sₜ): The expected value of the current state.

Explanation:

This formula calculates the error between the outcome (immediate reward plus the discounted future potential) and the expectation. A positive δₜ means the outcome was better than expected; a negative δₜ means it was worse.
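
💡

Example (with illustrative numbers):

If rₜ = 1.0, γ = 0.99, V(sₜ₊₁) = 10.0 and V(sₜ) = 9.5, then:

δₜ = 1.0 + 0.99 × 10.0 − 9.5 = 1.4

The positive error means this step turned out 1.4 reward units better than the critic expected.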

4. Advantage Estimation (using Generalized Advantage Estimation, GAE)

Formula:

Aₜ = δₜ + γλ Aₜ₊₁

In plain words:

  • Aₜ: The advantage at the current timestep, showing how much better or worse an action performed compared to expectations.
  • δₜ: The immediate error from the TD Error calculation.
  • λ (lambda = 0.95): Balances bias and variance in Generalized Advantage Estimation (GAE) by controlling the contribution of future advantages.
  • γλ Aₜ₊₁: The discounted future advantage; λ determines how much influence future errors have.

Explanation:

This recursive formula adds immediate and future errors to provide a balanced measure of an action’s overall performance.
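
A sketch of how this recursion is typically computed by walking the rollout backwards (the function name and tensor shapes are illustrative, not the skrl implementation):

import torch

def compute_gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """GAE over one rollout with tensors of shape (T, num_envs)."""
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    next_advantage = torch.zeros_like(rewards[0])
    for t in reversed(range(T)):                   # walk the rollout backwards
        not_done = (~dones[t]).float()             # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]   # TD error
        advantages[t] = delta + gamma * lam * next_advantage * not_done      # A_t = delta_t + gamma*lambda*A_{t+1}
        next_advantage = advantages[t]
    returns = advantages + values                  # targets for the value loss
    return advantages, returns

# Toy usage: 32 steps, 4096 environments.
T, N = 32, 4096
adv, ret = compute_gae(torch.rand(T, N), torch.rand(T, N), torch.rand(T, N),
                       torch.zeros(T, N, dtype=torch.bool))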

Advantage Calculation

  • Definition: A = Actual_Return − V(s)
  • Positive Advantage: Indicates the action performed better than expected.

    💡

    Example: A = +5 might mean a more efficient path was found.

  • Negative Advantage: Indicates the action performed worse than expected.

    💡

    Example: A = −3 might suggest the robot steered too sharply and lost time.

This advantage signal helps the Policy Network update its actions relative to the baseline prediction of the Value Network.

Loss function to Update Network Parameters

image

5. Probability Ratio for Policy Updates

Formula:

rₜ(θ) = π_new(aₜ|sₜ) / π_old(aₜ|sₜ)

In plain words:

  • This ratio compares how likely the new policy is to take a certain action versus the old policy.
  • A ratio of 1 indicates no change in likelihood.
  • A ratio greater than 1 means the new policy favors that action more, while less than 1 means it favors it less.

Explanation:

This comparison is used to measure the change in policy, which is crucial for adjusting the network without making overly large updates.
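
In practice the ratio is computed from the stored log probabilities rather than raw probabilities; a tiny sketch with made-up numbers:

import torch

# Log probabilities of the same stored actions under the old policy (saved during
# the rollout) and under the current policy (recomputed during the update).
old_log_prob = torch.tensor([-1.20, -0.85, -2.10])
new_log_prob = torch.tensor([-1.05, -0.95, -2.10])

# r_t(theta) = pi_new / pi_old, evaluated in log space for numerical stability.
ratio = torch.exp(new_log_prob - old_log_prob)   # -> tensor([1.1618, 0.9048, 1.0000])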

6. Clipping Mechanism in PPO

Concept:

During training, the probability ratio is clipped within a small range (typically 1 ± ε).

In plain words:

  • This prevents the policy from changing too drastically in a single update.
  • It ensures that updates are gradual, which helps maintain stable training.

Explanation:

By clipping the ratio, we avoid large swings that could destabilize the learning process, thereby promoting smoother policy improvements.

💡

Example:

With ε = 0.2:

  • If ratio > 1.2: it is clipped down to 1.2.
  • If ratio < 0.8: it is clipped up to 0.8.

This prevents the steering or throttle actions from changing too drastically in one update.
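
A short sketch of the resulting clipped policy loss (the sign is flipped because optimizers minimize; the numbers are illustrative):

import torch

def clipped_policy_loss(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate objective, negated so it can be minimized."""
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# One ratio above 1.2 and one below 0.8 get clipped before the update.
ratio = torch.tensor([1.35, 0.70, 1.05])
advantage = torch.tensor([2.0, -1.0, 0.5])
loss = clipped_policy_loss(ratio, advantage)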

7. Value Loss

Formula:

Value Loss = Mean Squared Error between predicted V(s) and actual returns

In plain words:

  • It measures the error between what the Value Network predicts and the actual rewards received.
  • The goal is to reduce this error so that the network better predicts future rewards.

Explanation:

Minimizing the value loss helps the Value Network become more accurate, which in turn provides a better baseline for policy updates.
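
A minimal sketch with made-up numbers:

import torch
import torch.nn.functional as F

predicted_values = torch.tensor([14.2, 9.8, 11.5])   # V(s) from the Value Network
actual_returns = torch.tensor([15.0, 9.0, 12.0])     # returns computed from the rollout

value_loss = F.mse_loss(predicted_values, actual_returns)   # mean squared error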

8. Total Loss for PPO

Formula:

Total Loss = Policy Loss + (Value Coefficient × Value Loss) − (Entropy Coefficient × Entropy Bonus)

In plain words:

  • Policy Loss: Encourages actions that perform better than expected by using the probability ratio and advantage.
  • Value Loss: Improves the accuracy of the state-value predictions.
  • Entropy Bonus: Promotes exploration by rewarding randomness in the policy.
  • The coefficients balance how much each component influences the overall loss.

Explanation:

This combined loss ensures that the policy learns to select better actions while also fine-tuning the value predictions and maintaining a level of exploration necessary for robust learning.
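
A sketch of how the three terms combine (the coefficient values here are assumptions, not necessarily the project's configuration):

import torch

value_coef = 1.0        # weight of the value loss (assumed)
entropy_coef = 0.01     # weight of the entropy bonus (assumed)

policy_loss = torch.tensor(0.032)   # from the clipped surrogate objective
value_loss = torch.tensor(0.510)    # from the MSE between V(s) and returns
entropy = torch.tensor(1.350)       # mean entropy of the Gaussian policy

total_loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
# total_loss is the single objective minimized for both networks at once.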

9. Loss Update

The total loss serves as the combined objective that the PPO algorithm minimizes during training. It integrates several components so that both the policy network and the value network are improved simultaneously. Here’s what it does:

  • Guides Policy Improvements: The policy loss (using the probability ratio and advantage) encourages the network to favor actions that perform better than expected.

  • Enhances Value Predictions: The value loss minimizes the error between the network's predicted state values and the actual returns, helping the critic become more accurate.

  • Maintains Exploration: The entropy bonus rewards the network for maintaining some randomness, which prevents the policy from becoming overly deterministic too quickly.

  • Balances All Components: The coefficients (for value loss and entropy bonus) control the relative importance of each component in the overall loss.

  • Learning Rate & KL Divergence:
    1. Learning Rate: The learning rate (η) determines how large the update steps are during gradient descent. PPO uses a KL Adaptive Learning Rate Scheduler, which monitors the KL divergence, a measure of how much the new policy differs from the old one.
    2. KL Divergence: This is a statistical measure of the difference between two probability distributions. In PPO, a low KL divergence means the new policy remains close to the old policy, while a high KL divergence indicates a significant change.
    3. Integration with Learning Rate: If the KL divergence exceeds a preset threshold, the learning rate is reduced to prevent overly large updates, ensuring that the policy changes gradually. Conversely, if the KL divergence is much lower than the target, the learning rate may be increased to allow for faster learning.

By minimizing this total loss (using gradient descent) and adjusting the learning rate based on KL divergence, the PPO algorithm updates the network parameters to both maximize rewards (via the policy network) and accurately estimate future returns (via the value network), all while ensuring stable and gradual policy improvements.
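
A hedged sketch of such a KL-adaptive rule (the threshold, factor, and bounds are assumptions, not skrl's exact scheduler):

def kl_adaptive_lr(lr, kl, kl_target=0.008, factor=2.0, lr_min=1e-6, lr_max=1e-2):
    """Shrink the learning rate when the policy moved too far, grow it when it barely moved."""
    if kl > kl_target * 2.0:        # policy changed too much -> slow down
        lr = max(lr / factor, lr_min)
    elif kl < kl_target / 2.0:      # policy barely changed -> speed up
        lr = min(lr * factor, lr_max)
    return lr

# Usage: after each update, measure the KL divergence on the batch and adjust.
lr = 3e-4
measured_kl = 0.02                     # illustrative value
lr = kl_adaptive_lr(lr, measured_kl)   # KL above threshold -> learning rate halved to 1.5e-4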