Config Explanation

We use verl as the base RL training framework and support multi-turn online training for RAGEN. To configure training for RAGEN, we provide the configuration file config/base.yaml.

base.yaml for RAGEN Environment

Training Paradigm

rl_or_sft: "rl"
rl_or_sft: Specifies the training paradigm. Can be either "rl" for reinforcement learning or "sft" for supervised fine-tuning.

System Settings

system:
  cuda_visible_devices: 0
  n_gpus: 1
  multi_processing: "ray"
  vllm_attention_backend: "XFORMERS"
system:

  • cuda_visible_devices: Specifies which CUDA devices are visible to the process. Default is 0.
  • n_gpus: Number of GPUs to use for training. Default is 1.
  • multi_processing: Backend for multiprocessing operations. Set to "ray".
  • vllm_attention_backend: Attention implementation backend. Set to "XFORMERS".
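As a rough illustration of how the system block could reach the underlying libraries, the sketch below turns two of its fields into environment variables. CUDA_VISIBLE_DEVICES and VLLM_ATTENTION_BACKEND are environment variables read by CUDA and vLLM respectively; the helper name system_env and the idea that RAGEN exports them in exactly this way are assumptions made for the example.

```python
import os


def system_env(system_cfg):
    """Map the `system` block to environment variables (assumed mapping, not RAGEN's code)."""
    return {
        "CUDA_VISIBLE_DEVICES": str(system_cfg["cuda_visible_devices"]),
        "VLLM_ATTENTION_BACKEND": str(system_cfg["vllm_attention_backend"]),
    }


# Values copied from the default `system` block above.
system_cfg = {
    "cuda_visible_devices": 0,
    "n_gpus": 1,
    "multi_processing": "ray",
    "vllm_attention_backend": "XFORMERS",
}

env = system_env(system_cfg)

# Export before importing torch or vllm so the libraries pick the values up.
os.environ.update(env)
```

When several GPUs are used, cuda_visible_devices would presumably become a comma-separated list such as "0,1", with n_gpus set to match the number of entries.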

Model Settings

model:
  base_model: "Qwen/Qwen2.5-0.5B-Instruct"
  experiment_name: "ragen-main-exp"
  gradient_checkpointing: true
model:

  • base_model: The foundation model path. Points to "Qwen/Qwen2.5-0.5B-Instruct".
  • experiment_name: Name of the experiment for tracking purposes.
  • gradient_checkpointing: Whether to enable gradient checkpointing to save memory. Default is true.

Training Parameters

training:
  train_data_num: null
  val_data_num: 50
  micro_batch_size: 1
  train_batch_size: 8
  val_batch_size: 10
  ppo_batch_size: 128
  max_start_length: 400
  max_response_length: 400
  max_obs_length: 200
  max_turns: 5
  rollout_tp_size: 1
  n_rollout: 16
  total_epochs: 5
  temperature: 0.7
  use_sft: false
  use_kl_loss: true # this means actor KL Loss
  no_think_rl: false
  state_masking: false
  binary_reward: false
  mask_state: false
  length_penalty: false
  ref_update_steps: null # every ref_update_steps, the reference model will be updated using the latest actor model
  total_training_steps: null
training:

  • train_data_num: Number of training examples to use. If null, uses all available data.
  • val_data_num: Number of validation examples. Default is 50.
  • micro_batch_size: Actual batch size processed on the GPU in each forward pass. Default is 1.
  • train_batch_size: Overall training batch size. Default is 8.
  • val_batch_size: Batch size during validation. Default is 10.
  • ppo_batch_size: Batch size for PPO updates. Default is 128.
  • max_start_length: Maximum length of the initial prompt. Default is 400.
  • max_response_length: Maximum length of model responses. Default is 400.
  • max_obs_length: Maximum length of observations. Default is 200.
  • max_turns: Maximum number of interaction turns. Default is 5.
  • rollout_tp_size: Tensor parallelism size for rollouts. Default is 1.
  • n_rollout: Number of rollouts per training example, so each step collects train_batch_size * n_rollout rollouts. Default is 16.
  • total_epochs: Total number of training epochs. Default is 5.
  • temperature: Sampling temperature for generation. Default is 0.7.
  • use_sft: Whether to use supervised fine-tuning. Default is false.
  • use_kl_loss: Whether to use KL divergence loss for actor. Default is true.
  • no_think_rl: Whether to disable thinking during RL. Default is false.
  • state_masking: Whether to enable state masking. Default is false.
  • binary_reward: Whether to use binary rewards. Default is false.
  • mask_state: Whether to mask the state. Default is false.
  • length_penalty: Whether to apply length penalty. Default is false.
  • ref_update_steps: Frequency of reference model updates. If null, no updates occur.
  • total_training_steps: Total number of training steps. If null, determined by epochs.

Note

train_batch_size * n_rollout must be greater than or equal to ppo_batch_size and divisible by ppo_batch_size. In practice, we recommend setting train_batch_size * n_rollout to 4 times ppo_batch_size.
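The batch-size constraint can be checked before launching a run. The following is a minimal sketch, not part of RAGEN; check_batch_sizes is a hypothetical helper that raises when the constraint is violated and otherwise returns how many PPO batches fit into one rollout batch.

```python
def check_batch_sizes(train_batch_size, n_rollout, ppo_batch_size):
    """Validate the constraint from the note and return the rollout-to-PPO ratio."""
    total = train_batch_size * n_rollout
    if total < ppo_batch_size:
        raise ValueError(
            f"train_batch_size * n_rollout = {total} is smaller than "
            f"ppo_batch_size = {ppo_batch_size}"
        )
    if total % ppo_batch_size != 0:
        raise ValueError(
            f"train_batch_size * n_rollout = {total} is not divisible by "
            f"ppo_batch_size = {ppo_batch_size}"
        )
    return total // ppo_batch_size


# Defaults from base.yaml: 8 * 16 = 128, exactly one PPO batch.
ratio = check_batch_sizes(train_batch_size=8, n_rollout=16, ppo_batch_size=128)
print(ratio)  # 1
```

With the defaults the ratio is 1. Reaching the recommended ratio of 4 while keeping ppo_batch_size at 128 would require, for example, train_batch_size: 32 with n_rollout: 16.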

Optimization Parameters

optimization:
  actor_lr: 1e-6
  critic_lr: 1e-5
  kl_coef: 0.001
  kl_loss_type: low_var_kl
  adv_estimator: grpo
  gpu_memory_utilization: 0.4
optimization:

  • actor_lr: Learning rate for the actor model. Default is 1e-6.
  • critic_lr: Learning rate for the critic model. Default is 1e-5.
  • kl_coef: Coefficient for KL divergence term. Default is 0.001.
  • kl_loss_type: Type of KL loss calculation. Set to "low_var_kl".
  • adv_estimator: Type of advantage estimator. Set to "grpo".
  • gpu_memory_utilization: Fraction of GPU memory to utilize. Default is 0.4.
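To make adv_estimator and kl_loss_type more concrete, here is a pure-Python sketch of the two ideas they name. It assumes that "grpo" means normalizing each rollout's reward against the other rollouts of the same prompt, and that "low_var_kl" refers to the low-variance estimator exp(d) - d - 1 with d = ref_logprob - logprob, clipped to a fixed range. The function names, the epsilon, the clip value, and the use of the population standard deviation are illustrative assumptions; the verl implementation operates on tensors and may differ in detail.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for the rollouts of one prompt (sketch)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; real implementations may use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def low_var_kl(logprob, ref_logprob, clip=10.0):
    """Per-token low-variance KL estimate between actor and reference (sketch)."""
    d = ref_logprob - logprob
    kl = math.exp(d) - d - 1.0  # non-negative, zero when the two log-probs match
    return max(-clip, min(clip, kl))


# Four rollouts of the same prompt with binary-looking rewards.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# KL penalty weighted by kl_coef from the optimization block.
kl_coef = 0.001
penalty = kl_coef * low_var_kl(logprob=-1.2, ref_logprob=-1.0)
```

Because the advantage is computed relative to the group, a prompt whose rollouts all receive the same reward yields zero advantage for every rollout, which is one reason n_rollout should be larger than 1 when this estimator is used.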

Logging Settings

logging:
  mode: "['wandb']"
  log_images: true
  log_image_dir: "log/trajectory"
  log_image_step_size: 4
  log_n_image_per_batch: 32
logging:

  • mode: Logging backends to use. Default is "['wandb']".
  • log_images: Whether to log images. Default is true.
  • log_image_dir: Directory for saving logged images. Default is "log/trajectory".
  • log_image_step_size: Frequency of image logging. Default is 4.
  • log_n_image_per_batch: Number of images to log per batch. Default is 32.
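Note that mode is a string containing a list literal, not a YAML list. If you need the backends as a real list in your own tooling, one option is ast.literal_eval from the standard library, as sketched below. The helper parse_logger_list is hypothetical, and how RAGEN itself consumes this string is not shown here.

```python
import ast


def parse_logger_list(mode):
    """Turn a string such as "['console','wandb']" into a list of backend names."""
    backends = ast.literal_eval(mode)  # safely evaluates literals only, no code execution
    if not isinstance(backends, list) or not all(isinstance(b, str) for b in backends):
        raise ValueError(f"expected a list of strings, got {mode!r}")
    return backends


rl_backends = parse_logger_list("['wandb']")
sft_backends = parse_logger_list("['console','wandb']")
print(rl_backends)   # ['wandb']
print(sft_backends)  # ['console', 'wandb']
```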

Trainer Settings

trainer:
  val_before_train: true
  val_only: false
  default_hdfs_dir: null
  nnodes: 1
  save_freq: 100
  test_freq: 200 # 100
  project_name: "RAGEN"
trainer:

  • val_before_train: Whether to validate before training. Default is true.
  • val_only: Whether to run only validation. Default is false.
  • default_hdfs_dir: HDFS directory for checkpoints. Default is null.
  • nnodes: Number of nodes for training. Default is 1.
  • save_freq: Frequency of model saving. Default is 100.
  • test_freq: Frequency of validation. Default is 200.
  • project_name: Name of the project. Set to "RAGEN".

SFT Settings

# SFT settings
sft:
  env_type: "sokoban"  # or "frozenlake"
  output_dir: "models/sft"

  # Data generation parameters
  data_generation:
    data_dir: "sft/data"
    algo: "bfs"
    seed: 100000 # needs to be different from the seed in the RL config
    train_size: 1000
    test_size: 100
    bfs_max_depths: 100
    prefix: "message"
    num_processes: 16

  # Training parameters
  training:
    num_gpus: 1
    max_length: 2048
    learning_rate: 1e-4
    train_batch_size: 128
    micro_batch_size: 4
    experiment_name: "test_sft_lora"
    logger: "['console','wandb']"
    epochs: 5
    hdfs_dir: null
    validate_before_training: true
    lora_rank: 64
    lora_alpha: 32
    target_modules: "all-linear"
    enable_gradient_checkpointing: false
    base_model: "Qwen/Qwen2.5-0.5B-Instruct"
    project_name: "RAGEN"

  # Sokoban-specific settings
  sokoban:
    dim_x: 6
    dim_y: 6
    num_boxes: 1
    max_steps: 100
    search_depth: 30

  # FrozenLake-specific settings
  frozenlake:
    size: 4
    p: 0.8
    is_slippery: true
sft: Configuration for supervised fine-tuning

Environment Settings

  • env_type: Type of environment. Can be "sokoban" or "frozenlake".
  • output_dir: Directory for saving SFT models. Default is "models/sft".

Data Generation Parameters

data_generation:

  • data_dir: Directory for generated data. Default is "sft/data".
  • algo: Algorithm for data generation. Default is "bfs".
  • seed: Random seed. Default is 100000.
  • train_size: Number of training examples. Default is 1000.
  • test_size: Number of test examples. Default is 100.
  • bfs_max_depths: Maximum depth for BFS. Default is 100.
  • prefix: Message prefix. Default is "message".
  • num_processes: Number of processes for data generation. Default is 16.

SFT Training Parameters

training:

  • num_gpus: Number of GPUs for SFT. Default is 1.
  • max_length: Maximum sequence length. Default is 2048.
  • learning_rate: Learning rate for SFT. Default is 1e-4.
  • train_batch_size: Training batch size. Default is 128.
  • micro_batch_size: Micro batch size. Default is 4.
  • experiment_name: Name of the experiment. Default is "test_sft_lora".
  • logger: Logging backends. Default is "['console','wandb']".
  • epochs: Number of training epochs. Default is 5.
  • hdfs_dir: HDFS directory. Default is null.
  • validate_before_training: Whether to validate before training. Default is true.
  • lora_rank: Rank for LoRA adaptation. Default is 64.
  • lora_alpha: Alpha parameter for LoRA. Default is 32.
  • target_modules: Target modules for LoRA. Default is "all-linear".
  • enable_gradient_checkpointing: Whether to enable gradient checkpointing. Default is false.
  • base_model: Base model path. Default is "Qwen/Qwen2.5-0.5B-Instruct".
  • project_name: Project name. Set to "RAGEN".
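Two quantities follow from these defaults and are worth checking when changing them. The sketch below assumes the standard LoRA convention in which the adapter update is scaled by lora_alpha / lora_rank, and assumes that micro_batch_size is the per-GPU batch of a single forward pass so that the remaining factor is made up by gradient accumulation. Both helper names are hypothetical.

```python
def lora_scaling(lora_alpha, lora_rank):
    """Scaling applied to the low-rank update under the standard LoRA convention."""
    return lora_alpha / lora_rank


def grad_accum_steps(train_batch_size, micro_batch_size, num_gpus):
    """Forward/backward passes per optimizer step (assumes per-GPU micro batches)."""
    per_pass = micro_batch_size * num_gpus
    if train_batch_size % per_pass != 0:
        raise ValueError(
            f"train_batch_size = {train_batch_size} is not divisible by "
            f"micro_batch_size * num_gpus = {per_pass}"
        )
    return train_batch_size // per_pass


scaling = lora_scaling(lora_alpha=32, lora_rank=64)
accum = grad_accum_steps(train_batch_size=128, micro_batch_size=4, num_gpus=1)
print(scaling)  # 0.5
print(accum)    # 32
```

Under these assumptions, raising lora_rank without raising lora_alpha shrinks the effective update, and raising micro_batch_size reduces the number of accumulation steps at the cost of more GPU memory.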

Environment-Specific Settings

sokoban:

  • dim_x: X dimension of Sokoban grid. Default is 6.
  • dim_y: Y dimension of Sokoban grid. Default is 6.
  • num_boxes: Number of boxes in Sokoban. Default is 1.
  • max_steps: Maximum steps allowed. Default is 100.
  • search_depth: Search depth for solution finding. Default is 30.

frozenlake:

  • size: Size of FrozenLake grid. Default is 4.
  • p: Probability that a tile is frozen when the random map is generated, following the Gymnasium FrozenLake convention. Default is 0.8.
  • is_slippery: Whether the lake is slippery. Default is true.

ppo_trainer.yaml for verl Trainer

Detailed configuration options for the verl trainer are described in the Configuration Explanation section of the official verl documentation.