Config Explanation¶
We use verl as the base RL training framework and support multi-turn online training for RAGEN. To configure training for RAGEN, we provide the configuration file config/base.yaml.
base.yaml for RAGEN Environment¶
Training Paradigm¶
rl_or_sft: Specifies the training paradigm. Can be either "rl" for reinforcement learning or "sft" for supervised fine-tuning.
System Settings¶
system:
  cuda_visible_devices: 0
  n_gpus: 1
  multi_processing: "ray"
  vllm_attention_backend: "XFORMERS"
system:
  cuda_visible_devices: Specifies which CUDA devices are visible to the process. Default is 0.
  n_gpus: Number of GPUs to use for training. Default is 1.
  multi_processing: Backend for multiprocessing operations. Set to "ray".
  vllm_attention_backend: Attention implementation backend. Set to "XFORMERS".
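As a rough illustration of what these settings control, the sketch below maps the system section onto the environment variables CUDA_VISIBLE_DEVICES and VLLM_ATTENTION_BACKEND, which CUDA and vLLM read at startup. The helper name system_env and the idea that the launcher performs exactly this mapping are assumptions for illustration, not a description of RAGEN's launcher code.

```python
import os

# Values copied from the system section of config/base.yaml.
system_cfg = {
    "cuda_visible_devices": 0,
    "n_gpus": 1,
    "multi_processing": "ray",
    "vllm_attention_backend": "XFORMERS",
}


def system_env(cfg):
    """Hypothetical helper: translate system settings into environment variables."""
    return {
        # Environment variables are always strings, so the device index is converted.
        "CUDA_VISIBLE_DEVICES": str(cfg["cuda_visible_devices"]),
        "VLLM_ATTENTION_BACKEND": cfg["vllm_attention_backend"],
    }


# Merge with the current environment, e.g. to pass to a subprocess.
env = {**os.environ, **system_env(system_cfg)}
print(env["CUDA_VISIBLE_DEVICES"], env["VLLM_ATTENTION_BACKEND"])
```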
Model Settings¶
model:
  base_model: "Qwen/Qwen2.5-0.5B-Instruct"
  experiment_name: "ragen-main-exp"
  gradient_checkpointing: true
model:
  base_model: The foundation model path. Points to "Qwen/Qwen2.5-0.5B-Instruct".
  experiment_name: Name of the experiment for tracking purposes.
  gradient_checkpointing: Whether to enable gradient checkpointing to save memory. Default is true.
Training Parameters¶
training:
  train_data_num: null
  val_data_num: 50
  micro_batch_size: 1
  train_batch_size: 8
  val_batch_size: 10
  ppo_batch_size: 128
  max_start_length: 400
  max_response_length: 400
  max_obs_length: 200
  max_turns: 5
  rollout_tp_size: 1
  n_rollout: 16
  total_epochs: 5
  temperature: 0.7
  use_sft: false
  use_kl_loss: true # this means actor KL Loss
  no_think_rl: false
  state_masking: false
  binary_reward: false
  mask_state: false
  length_penalty: false
  ref_update_steps: null # every ref_update_steps, the reference model will be updated using the latest actor model
  total_training_steps: null
training:
  train_data_num: Number of training examples to use. If null, uses all available data.
  val_data_num: Number of validation examples. Default is 50.
  micro_batch_size: Actual batch size for an individual forward pass on the GPU. Default is 1.
  train_batch_size: Overall training batch size. Default is 8.
  val_batch_size: Batch size during validation. Default is 10.
  ppo_batch_size: Batch size for PPO updates. Default is 128.
  max_start_length: Maximum length of the initial prompt. Default is 400.
  max_response_length: Maximum length of model responses. Default is 400.
  max_obs_length: Maximum length of observations. Default is 200.
  max_turns: Maximum number of interaction turns. Default is 5.
  rollout_tp_size: Tensor parallelism size for rollouts. Default is 1.
  n_rollout: Number of rollouts to perform. Default is 16.
  total_epochs: Total number of training epochs. Default is 5.
  temperature: Sampling temperature for generation. Default is 0.7.
  use_sft: Whether to use supervised fine-tuning. Default is false.
  use_kl_loss: Whether to use a KL divergence loss for the actor. Default is true.
  no_think_rl: Whether to disable thinking during RL. Default is false.
  state_masking: Whether to enable state masking. Default is false.
  binary_reward: Whether to use binary rewards. Default is false.
  mask_state: Whether to mask the state. Default is false.
  length_penalty: Whether to apply a length penalty. Default is false.
  ref_update_steps: Frequency of reference model updates: every ref_update_steps steps, the reference model is replaced with the latest actor model. If null, no updates occur.
  total_training_steps: Total number of training steps. If null, determined by epochs.
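To get a feel for how the length limits interact, the sketch below computes a rough upper bound on the tokens one trajectory can accumulate. It assumes each turn appends at most one response and one observation to the initial prompt; this accounting and the function name max_context_tokens are our assumptions for illustration, and the trainer may budget lengths differently.

```python
def max_context_tokens(max_start_length, max_response_length, max_obs_length, max_turns):
    """Rough upper bound on the tokens accumulated over one multi-turn trajectory.

    Assumption: every turn adds at most one model response and one
    environment observation on top of the initial prompt.
    """
    per_turn = max_response_length + max_obs_length
    return max_start_length + max_turns * per_turn


# Defaults from config/base.yaml: 400 + 5 * (400 + 200)
print(max_context_tokens(400, 400, 200, 5))  # 3400
```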
Note
train_batch_size * n_rollout must be greater than or equal to ppo_batch_size and divisible by ppo_batch_size. In practice, we recommend setting train_batch_size * n_rollout to 4 times ppo_batch_size. With the defaults above, train_batch_size * n_rollout = 8 * 16 = 128, which equals ppo_batch_size.
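The constraint in the note can be checked before launching a run. The following is a minimal sketch; check_batch_config is a hypothetical helper, not part of RAGEN or verl.

```python
def check_batch_config(train_batch_size, n_rollout, ppo_batch_size):
    """Validate the batch-size constraint and return how many PPO batches fit.

    Raises ValueError if train_batch_size * n_rollout is smaller than
    ppo_batch_size or not divisible by it.
    """
    total = train_batch_size * n_rollout
    if total < ppo_batch_size:
        raise ValueError(
            f"train_batch_size * n_rollout = {total} is smaller than "
            f"ppo_batch_size = {ppo_batch_size}"
        )
    if total % ppo_batch_size != 0:
        raise ValueError(
            f"train_batch_size * n_rollout = {total} is not divisible by "
            f"ppo_batch_size = {ppo_batch_size}"
        )
    return total // ppo_batch_size


print(check_batch_config(8, 16, 128))   # defaults: ratio 1
print(check_batch_config(32, 16, 128))  # recommended ratio 4
```

A ratio of 4 corresponds to the recommendation above; the default configuration yields a ratio of 1.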
Optimization Parameters¶
optimization:
  actor_lr: 1e-6
  critic_lr: 1e-5
  kl_coef: 0.001
  kl_loss_type: low_var_kl
  adv_estimator: grpo
  gpu_memory_utilization: 0.4
optimization:
  actor_lr: Learning rate for the actor model. Default is 1e-6.
  critic_lr: Learning rate for the critic model. Default is 1e-5.
  kl_coef: Coefficient for KL divergence term. Default is 0.001.
  kl_loss_type: Type of KL loss calculation. Set to "low_var_kl".
  adv_estimator: Type of advantage estimator. Set to "grpo".
  gpu_memory_utilization: Fraction of GPU memory to utilize. Default is 0.4.
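To make the "grpo" and "low_var_kl" options more concrete, here is a minimal pure-Python sketch of the two ideas as they are commonly described: GRPO normalizes each reward against the other rollouts in its group, and the low-variance KL estimate uses the form exp(d) - d - 1 with d = ref_logprob - logprob, which is never negative. The function names, the use of the population standard deviation, the eps value, and the clipping bound are our assumptions; verl's actual implementation operates on tensors and may differ in these details.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; a sample std is also possible
    return [(r - mu) / (sigma + eps) for r in rewards]


def low_var_kl(logprob, ref_logprob, clip=10.0):
    """Per-token low-variance KL estimate, clipped for numerical stability."""
    diff = ref_logprob - logprob
    kld = math.exp(diff) - diff - 1.0
    return max(-clip, min(clip, kld))


# One group of rollouts for the same prompt (illustrative rewards).
print(grpo_advantages([0.0, 1.0, 1.0, 0.0]))
# Identical log-probabilities give zero penalty.
print(low_var_kl(-1.5, -1.5))  # 0.0
```

Under this reading, n_rollout plays the role of the group size, and kl_coef scales the KL term when it is added to the actor loss.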
Logging Settings¶
logging:
  mode: "['wandb']"
  log_images: true
  log_image_dir: "log/trajectory"
  log_image_step_size: 4
  log_n_image_per_batch: 32
logging:
  mode: Logging backends to use. Default is "['wandb']".
  log_images: Whether to log images. Default is true.
  log_image_dir: Directory for saving logged images. Default is "log/trajectory".
  log_image_step_size: Frequency of image logging. Default is 4.
  log_n_image_per_batch: Number of images to log per batch. Default is 32.
Trainer Settings¶
trainer:
  val_before_train: true
  val_only: false
  default_hdfs_dir: null
  nnodes: 1
  save_freq: 100
  test_freq: 200 # 100
  project_name: "RAGEN"
trainer:
  val_before_train: Whether to validate before training. Default is true.
  val_only: Whether to run only validation. Default is false.
  default_hdfs_dir: HDFS directory for checkpoints. Default is null.
  nnodes: Number of nodes for training. Default is 1.
  save_freq: Frequency of model saving. Default is 100.
  test_freq: Frequency of validation. Default is 200.
  project_name: Name of the project. Set to "RAGEN".
SFT Settings¶
# SFT settings
sft:
  env_type: "sokoban" # or "frozenlake"
  output_dir: "models/sft"
  # Data generation parameters
  data_generation:
    data_dir: "sft/data"
    algo: "bfs"
    seed: 100000 # needs to be different from the seed in the RL config
    train_size: 1000
    test_size: 100
    bfs_max_depths: 100
    prefix: "message"
    num_processes: 16
  # Training parameters
  training:
    num_gpus: 1
    max_length: 2048
    learning_rate: 1e-4
    train_batch_size: 128
    micro_batch_size: 4
    experiment_name: "test_sft_lora"
    logger: "['console','wandb']"
    epochs: 5
    hdfs_dir: null
    validate_before_training: true
    lora_rank: 64
    lora_alpha: 32
    target_modules: "all-linear"
    enable_gradient_checkpointing: false
    base_model: "Qwen/Qwen2.5-0.5B-Instruct"
    project_name: "RAGEN"
  # Sokoban-specific settings
  sokoban:
    dim_x: 6
    dim_y: 6
    num_boxes: 1
    max_steps: 100
    search_depth: 30
  # FrozenLake-specific settings
  frozenlake:
    size: 4
    p: 0.8
    is_slippery: true
sft: Configuration for supervised fine-tuning
Environment Settings¶
env_type: Type of environment. Can be "sokoban" or "frozenlake".
output_dir: Directory for saving SFT models. Default is "models/sft".
Data Generation Parameters¶
data_generation:
  data_dir: Directory for generated data. Default is "sft/data".
  algo: Algorithm for data generation. Default is "bfs".
  seed: Random seed. Must differ from the seed used in the RL config. Default is 100000.
  train_size: Number of training examples. Default is 1000.
  test_size: Number of test examples. Default is 100.
  bfs_max_depths: Maximum depth for BFS. Default is 100.
  prefix: Message prefix. Default is "message".
  num_processes: Number of processes for data generation. Default is 16.
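To illustrate what "bfs" data generation with a depth limit means, the sketch below runs a breadth-first search over a small FrozenLake-style grid and returns a shortest action sequence, giving up on paths longer than max_depth. The grid encoding (S start, F frozen, H hole, G goal), the function bfs_actions, and the deterministic (non-slippery) moves are assumptions for illustration; this is not RAGEN's data-generation code.

```python
from collections import deque

# Action name -> (row delta, column delta)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def bfs_actions(grid, max_depth=100):
    """Return a shortest list of actions from 'S' to 'G' that avoids 'H', or None."""
    rows, cols = len(grid), len(grid[0])
    start = next(
        (r, c) for r in range(rows) for c in range(cols) if grid[r][c] == "S"
    )
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if grid[r][c] == "G":
            return path
        if len(path) >= max_depth:
            continue  # do not expand beyond the depth limit
        for name, (dr, dc) in MOVES.items():
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "H" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [name]))
    return None


grid = ["SFFF", "FHFH", "FFFH", "HFFG"]
solution = bfs_actions(grid, max_depth=100)
print(len(solution), solution)
```

Each solved instance could then be turned into a training example pairing the rendered state with the action sequence; instances with no solution within the depth limit would simply be skipped.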
SFT Training Parameters¶
training:
  num_gpus: Number of GPUs for SFT. Default is 1.
  max_length: Maximum sequence length. Default is 2048.
  learning_rate: Learning rate for SFT. Default is 1e-4.
  train_batch_size: Training batch size. Default is 128.
  micro_batch_size: Micro batch size. Default is 4.
  experiment_name: Name of the experiment. Default is "test_sft_lora".
  logger: Logging backends. Default is "['console','wandb']".
  epochs: Number of training epochs. Default is 5.
  hdfs_dir: HDFS directory. Default is null.
  validate_before_training: Whether to validate before training. Default is true.
  lora_rank: Rank for LoRA adaptation. Default is 64.
  lora_alpha: Alpha parameter for LoRA. Default is 32.
  target_modules: Target modules for LoRA. Default is "all-linear".
  enable_gradient_checkpointing: Whether to enable gradient checkpointing. Default is false.
  base_model: Base model path. Default is "Qwen/Qwen2.5-0.5B-Instruct".
  project_name: Project name. Set to "RAGEN".
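The effect of lora_rank and lora_alpha can be worked out by hand. In standard LoRA, the low-rank update is scaled by lora_alpha / lora_rank, and adapting a d_in x d_out linear layer adds rank * (d_in + d_out) trainable parameters. The sketch below computes both; the 1024 x 1024 layer is a hypothetical size chosen for illustration, not a dimension taken from the base model.

```python
def lora_scaling(rank, alpha):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / rank


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Defaults from the SFT training section: lora_rank=64, lora_alpha=32.
print(lora_scaling(64, 32))  # 0.5

# Hypothetical 1024 x 1024 linear layer, for illustration only.
full = 1024 * 1024
adapter = lora_trainable_params(1024, 1024, 64)
print(adapter, adapter / full)  # 131072 0.125
```

With the defaults, the adapter update is applied at half strength, so raising lora_rank without also raising lora_alpha reduces the effective scale of the update.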
Environment-Specific Settings¶
sokoban:
  dim_x: X dimension of Sokoban grid. Default is 6.
  dim_y: Y dimension of Sokoban grid. Default is 6.
  num_boxes: Number of boxes in Sokoban. Default is 1.
  max_steps: Maximum steps allowed. Default is 100.
  search_depth: Search depth for solution finding. Default is 30.
frozenlake:
  size: Size of FrozenLake grid. Default is 4.
  p: Success probability for intended action. Default is 0.8.
  is_slippery: Whether the lake is slippery. Default is true.
ppo_trainer.yaml for verl Trainer¶
Detailed configuration options for the verl trainer are described in the Configuration Explanation section of the official verl documentation.