Configuration Files¶
Base Configuration (base.yaml)¶
The base configuration file serves as the foundation for all training runs. It defines core parameters for the RAGEN system.
Defaults and System Settings¶
defaults:
  - ppo_trainer # symbolic link to verl/verl/trainer/config/ppo_trainer.yaml
  - envs
system:
  CUDA_VISIBLE_DEVICES: "0"
- defaults: Specifies which configuration files to include
  - ppo_trainer: Links to the PPO trainer configuration
  - envs: Links to the environment-specific configurations
- system.CUDA_VISIBLE_DEVICES: Specifies which GPU to use for training
Core Training Parameters¶
micro_batch_size_per_gpu: 4
ppo_mini_batch_size: 32
model_path: Qwen/Qwen2.5-0.5B-Instruct
enable_response_mask: True
- micro_batch_size_per_gpu: Batch size for each GPU during training
- ppo_mini_batch_size: Batch size for PPO policy updates
- model_path: Base model to use for training
- enable_response_mask: Whether to enable response masking for improved stability in rollout/old_log_prob calculations
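As a rough illustration of how the two batch sizes relate, the sketch below assumes that each PPO mini-batch is split evenly across GPUs and then processed in micro-batches with gradient accumulation. That splitting rule is an assumption about the trainer, not something this page states, and `accumulation_steps` is a hypothetical helper.

```python
def accumulation_steps(ppo_mini_batch_size: int,
                       micro_batch_size_per_gpu: int,
                       n_gpus: int = 1) -> int:
    """Micro-batches each GPU processes per PPO mini-batch (assumed even split)."""
    if ppo_mini_batch_size % n_gpus != 0:
        raise ValueError("ppo_mini_batch_size must be divisible by n_gpus")
    per_gpu = ppo_mini_batch_size // n_gpus
    if per_gpu % micro_batch_size_per_gpu != 0:
        raise ValueError("per-GPU batch must be divisible by micro_batch_size_per_gpu")
    return per_gpu // micro_batch_size_per_gpu


# Values from base.yaml: ppo_mini_batch_size=32, micro_batch_size_per_gpu=4, one GPU.
steps = accumulation_steps(32, 4, n_gpus=1)
print(steps)  # 8
```

Under this assumption, raising micro_batch_size_per_gpu lowers the number of accumulation steps at the cost of more GPU memory per forward pass.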
LoRA Settings¶
- lora.rank: Rank of LoRA adaptation; 0 means not using LoRA
- lora.alpha: Alpha parameter for LoRA scaling
- lora.target_modules: Which modules to apply LoRA to (all-linear means all linear layers)
Actor-Rollout-Ref Settings¶
actor_rollout_ref:
  model:
    path: ${model_path}
    lora_rank: ${lora.rank}
    lora_alpha: ${lora.alpha}
    target_modules: ${lora.target_modules}
  actor:
    ppo_mini_batch_size: ${ppo_mini_batch_size} # by default, ppo_mini_batch_size = train_batch_size / 4
    micro_batch_size_per_gpu: ${micro_batch_size_per_gpu} # following micro_batch_size_per_gpu
    ppo_micro_batch_size_per_gpu: ${micro_batch_size_per_gpu} # following micro_batch_size_per_gpu
    use_ref: True
    entropy_coeff: 0.001
    use_kl_loss: False
    kl_loss_coef: 0.000
    kl_loss_type: kl
    clip_ratio_low: 0.2
    clip_ratio_high: 0.28
    optim:
      betas: [0.9, 0.999]
  ref:
    log_prob_micro_batch_size_per_gpu: ${micro_batch_size_per_gpu} # following micro_batch_size_per_gpu
  rollout:
    log_prob_micro_batch_size_per_gpu: ${micro_batch_size_per_gpu} # following micro_batch_size_per_gpu
    tensor_model_parallel_size: 1
    max_model_len: 3600
    prompt_length: 1 # useless. Just put it here
    response_length: 400 # single-turn response length
    gpu_memory_utilization: 0.5
    max_num_batched_tokens: 8192 # set only when enable_chunked_prefill is true
    temperature: 1
    rollout_filter_ratio: 0.25
    rollout_filter_type: std # max_mean or std
    enforce_eager: True # for small models, set both enforce_eager and free_cache_engine to False to make rollout faster
    free_cache_engine: True
    val_kwargs:
      do_sample: True
      temperature: 0.5
    tp_size_check: true
Model Settings¶
- actor_rollout_ref.model.path: Path to the base model, inherited from the global model_path setting.
- actor_rollout_ref.model.lora_rank: Rank of LoRA adaptation for the actor. Inherited from the global LoRA settings.
- actor_rollout_ref.model.lora_alpha: Alpha parameter for LoRA scaling in the actor. Inherited from the global LoRA settings.
- actor_rollout_ref.model.target_modules: Which modules to apply LoRA to in the actor. Inherited from the global LoRA settings.
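The `${...}` references above read like OmegaConf-style interpolation, where a nested value points back to a global key. The toy resolver below is only a sketch of that idea in plain Python; it is not the library the project uses, `resolve` is a hypothetical helper, and it handles only values that consist of a single whole reference.

```python
import re


def resolve(config: dict) -> dict:
    """Return a copy of `config` with whole-value ${a.b} references resolved."""

    def lookup(path: str):
        node = config
        for key in path.split("."):
            node = node[key]
        return node

    def walk(node):
        if isinstance(node, dict):
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, str):
            match = re.fullmatch(r"\$\{([^}]+)\}", node)
            if match:
                # Resolve recursively in case the target is itself a reference.
                return walk(lookup(match.group(1)))
        return node

    return walk(config)


config = {
    "model_path": "Qwen/Qwen2.5-0.5B-Instruct",
    "lora": {"rank": 0, "target_modules": "all-linear"},
    "actor_rollout_ref": {
        "model": {
            "path": "${model_path}",
            "lora_rank": "${lora.rank}",
            "target_modules": "${lora.target_modules}",
        }
    },
}

resolved = resolve(config)
print(resolved["actor_rollout_ref"]["model"])
```

The practical consequence is that changing model_path or the lora.* values once at the top level updates every section that references them.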
Actor Settings¶
- actor_rollout_ref.actor.ppo_mini_batch_size: Batch size for PPO policy updates. By default, this is train_batch_size / 4.
- actor_rollout_ref.actor.micro_batch_size_per_gpu: Batch size for each GPU during actor forward passes. Follows the global micro_batch_size_per_gpu setting.
- actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu: Batch size for each GPU during PPO updates. Also follows the global micro_batch_size_per_gpu setting.
- actor_rollout_ref.actor.use_ref: Whether to use a reference policy during training. When True, the reference policy is used to compute KL divergence.
- actor_rollout_ref.actor.entropy_coeff: Coefficient for entropy regularization. An entropy bonus in the policy objective encourages exploration.
- actor_rollout_ref.actor.use_kl_loss: Whether to add a KL divergence loss to the actor's objective. When False, only the PPO clip loss is used.
- actor_rollout_ref.actor.kl_loss_coef: Coefficient for the KL divergence loss term. Controls how strongly the policy is regularized towards the reference policy.
- actor_rollout_ref.actor.kl_loss_type: Type of KL divergence calculation. Set to kl for standard KL divergence.
- actor_rollout_ref.actor.clip_ratio_low: Lower clipping margin for the PPO probability ratio. With 0.2, ratios below 1 - 0.2 = 0.8 are clipped.
- actor_rollout_ref.actor.clip_ratio_high: Upper clipping margin for the PPO probability ratio. With 0.28, ratios above 1 + 0.28 = 1.28 are clipped.
- actor_rollout_ref.actor.optim.betas: Beta parameters for the Adam optimizer. [0.9, 0.999] are Adam's default values.
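To make the two clip settings concrete, here is a minimal sketch of an asymmetric PPO clipped surrogate for a single token. It assumes the usual convention that the probability ratio is clamped to [1 - clip_ratio_low, 1 + clip_ratio_high]; `clipped_surrogate` is a hypothetical helper written for this page, not the trainer's actual loss function.

```python
def clipped_surrogate(ratio: float, advantage: float,
                      clip_ratio_low: float = 0.2,
                      clip_ratio_high: float = 0.28) -> float:
    """PPO clipped objective for one token (to be maximized)."""
    lower = 1.0 - clip_ratio_low    # assumed lower bound on the ratio
    upper = 1.0 + clip_ratio_high   # assumed upper bound on the ratio
    clipped_ratio = min(max(ratio, lower), upper)
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


inside = clipped_surrogate(1.1, advantage=1.0)     # ratio within bounds, unchanged
too_high = clipped_surrogate(1.5, advantage=1.0)   # capped at the upper bound
too_low = clipped_surrogate(0.5, advantage=-1.0)   # capped at the lower bound
print(inside, too_high, too_low)
```

A larger upper margin than lower margin, as in the values above, leaves more room for increasing the probability of actions with positive advantage than for decreasing it.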
Reference Policy Settings¶
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu: Batch size for computing log probabilities with the reference policy. Synchronized with the global micro_batch_size_per_gpu setting.
Rollout Settings¶
- actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu: Batch size for computing log probabilities during rollouts. Follows the global micro_batch_size_per_gpu setting.
- actor_rollout_ref.rollout.tensor_model_parallel_size: Tensor parallelism size for model distribution. Set to 1 for single-GPU training.
- actor_rollout_ref.rollout.max_model_len: Maximum context length for the model. Set to 3600 to accommodate long sequences.
- actor_rollout_ref.rollout.prompt_length: Length of the prompt (currently unused, kept for compatibility).
- actor_rollout_ref.rollout.response_length: Maximum length of a single-turn response. Set to 400 tokens.
- actor_rollout_ref.rollout.gpu_memory_utilization: Fraction of GPU memory to use during rollouts. Set to 0.5.
- actor_rollout_ref.rollout.max_num_batched_tokens: Maximum number of tokens to process in a single batch. Only used when enable_chunked_prefill is true.
- actor_rollout_ref.rollout.temperature: Sampling temperature for generation during rollouts. Set to 1 for standard sampling.
- actor_rollout_ref.rollout.rollout_filter_ratio: Fraction of rollouts to keep after filtering. Set to 0.25 to keep the top 25% of rollouts.
- actor_rollout_ref.rollout.rollout_filter_type: Criterion used to rank rollouts for filtering. std uses standard deviation, max_mean uses maximum mean.
- actor_rollout_ref.rollout.enforce_eager: Whether to enforce eager execution mode. Set to True here; for small models, set it to False to make rollouts faster.
- actor_rollout_ref.rollout.free_cache_engine: Whether to free the cache engine after each rollout. Set to True here; for small models, set it to False together with enforce_eager to make rollouts faster.
- actor_rollout_ref.rollout.tp_size_check: Whether to check the tensor parallelism size. Set to true to verify that the tensor parallelism configuration matches the model's requirements.
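The rollout filter can be pictured with a small sketch. It assumes that filtering works on groups of trajectories, that the std criterion ranks groups by the standard deviation of their rewards, and that the top fraction is kept; the population standard deviation, the group names, and the `filter_groups` helper are all illustrative assumptions rather than the project's implementation.

```python
from statistics import pstdev


def filter_groups(group_rewards: dict[str, list[float]],
                  rollout_filter_ratio: float = 0.25) -> list[str]:
    """Keep the fraction of groups with the highest reward standard deviation."""
    keep = max(1, int(len(group_rewards) * rollout_filter_ratio))
    ranked = sorted(group_rewards,
                    key=lambda name: pstdev(group_rewards[name]),
                    reverse=True)
    return ranked[:keep]


# Made-up rewards for four groups of four trajectories each.
group_rewards = {
    "group_a": [0.0, 0.0, 0.0, 0.0],   # always fails: no variation
    "group_b": [0.0, 1.0, 0.0, 1.0],   # mixed outcomes: highest variation
    "group_c": [1.0, 1.0, 1.0, 1.0],   # always succeeds: no variation
    "group_d": [0.0, 0.0, 0.0, 1.0],
}

kept = filter_groups(group_rewards, rollout_filter_ratio=0.25)
print(kept)  # ['group_b']
```

The intuition behind ranking by standard deviation is that groups where every trajectory gets the same reward carry little learning signal.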
Rollout Validation Settings¶
- actor_rollout_ref.rollout.val_kwargs.do_sample: Whether to use sampling during validation.
- actor_rollout_ref.rollout.val_kwargs.temperature: Temperature for validation generation.
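The effect of using a lower temperature for validation (0.5) than for training rollouts (1) can be seen in a small softmax sketch. The logits below are made up for illustration and `softmax_with_temperature` is a hypothetical helper; real sampling happens inside the inference engine.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]  # illustrative values only
p_train = softmax_with_temperature(logits, temperature=1.0)
p_val = softmax_with_temperature(logits, temperature=0.5)
print([round(p, 3) for p in p_train])
print([round(p, 3) for p in p_val])
```

The lower temperature concentrates probability on the highest-scoring token, so validation outputs vary less than training rollouts while still being sampled.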
Critic Settings¶
critic:
  ppo_mini_batch_size: ${ppo_mini_batch_size} # by default, ppo_mini_batch_size = train_batch_size / 4
  ppo_micro_batch_size_per_gpu: ${micro_batch_size_per_gpu} # following micro_batch_size_per_gpu
  model:
    path: ${model_path}
  optim:
    betas: [0.9, 0.999]
  lora:
    enabled: ${lora.enabled}
    rank: ${lora.rank}
    alpha: ${lora.alpha}
    target_modules: ${lora.target_modules}
Critic Training Parameters¶
- critic.ppo_mini_batch_size: Batch size for the critic's PPO updates
- critic.ppo_micro_batch_size_per_gpu: Batch size per GPU for the critic's forward passes
Critic Model Settings¶
critic.model.path: Path to the critic model, inherited from model_path
Critic Optimizer Settings¶
critic.optim.betas: Beta parameters for Adam optimizer [0.9, 0.999]
Critic LoRA Settings¶
- critic.lora.enabled: Whether to use LoRA for the critic model
- critic.lora.rank: Rank of LoRA adaptation
- critic.lora.alpha: Alpha parameter for LoRA scaling
- critic.lora.target_modules: Which modules to apply LoRA to
Data Settings¶
- data.max_prompt_length: Maximum prompt length (null means use default)
- data.max_response_length: Maximum response length (null means use default)
- data.train_batch_size: Training batch size (null means use default)
Algorithm Settings¶
algorithm:
  gamma: 1.0
  lam: 1.0
  adv_estimator: gae
  kl_penalty: kl # how to estimate kl divergence
  kl_ctrl:
    type: fixed
    kl_coef: 0.000
- algorithm.gamma: Discount factor for rewards
- algorithm.lam: Lambda parameter for GAE
- algorithm.adv_estimator: Type of advantage estimator (gae/grpo)
- algorithm.kl_penalty: Type of KL penalty
- algorithm.kl_ctrl.type: Type of KL control
- algorithm.kl_ctrl.kl_coef: Coefficient for KL control
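For readers unfamiliar with how gamma and lam enter the gae estimator, the following is a minimal sketch of Generalized Advantage Estimation over one trajectory. The rewards and value estimates are made up, the value after the final step is assumed to be 0, and `gae` is a hypothetical helper rather than the trainer's code.

```python
def gae(rewards: list[float], values: list[float],
        gamma: float = 1.0, lam: float = 1.0) -> list[float]:
    """Generalized Advantage Estimation for a single finished trajectory."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Value of the next state; assumed 0 after the last step.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Sparse reward at the end, constant value estimates (illustrative numbers).
advantages = gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5])
print(advantages)  # [0.5, 0.5, 0.5]
```

With gamma = 1.0 and lam = 1.0, as in base.yaml, each advantage reduces to the undiscounted return from that step minus the value estimate, which is why all three entries equal 1.0 - 0.5.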
Trainer Settings¶
trainer:
  project_name: ragen_latest
  experiment_name: test
  total_training_steps: 200
  validation_steps: 1 # validation instances = validation_steps * val_env_groups * group_size
  val_before_train: True
  n_gpus_per_node: 1
  test_freq: 10
  generations_to_log_to_wandb:
    train: 128 # TODO: will be implemented
    val: 20
  logger: [ 'console', 'wandb' ]
- trainer.project_name: Name of the project
- trainer.experiment_name: Name of the experiment
- trainer.total_training_steps: Total number of training steps
- trainer.validation_steps: Number of validation steps
- trainer.val_before_train: Whether to validate before training
- trainer.n_gpus_per_node: Number of GPUs per node
- trainer.test_freq: Frequency of testing
- trainer.generations_to_log_to_wandb: Number of generations to log
- trainer.logger: Logging backends to use
Agent Proxy Settings¶
agent_proxy:
  max_turn: 5
  action_sep: "||"
  max_actions_per_turn: 5 # how many actions can be output at most in a single turn
  use_turn_scores: False # important to GAE when applying token-level rewards to token-level advantages. If False, will take the sum of scores as the reward for the last turn.
  enable_think: True # False -> no think RL
  reward_normalization:
    grouping: "state" # state / batch / inductive
    method: "identity" # asym_clip / identity / mean_std
- agent_proxy.max_turn: Maximum number of turns
- agent_proxy.action_sep: Separator for actions
- agent_proxy.max_actions_per_turn: Maximum actions per turn
- agent_proxy.use_turn_scores: Whether to use turn-level scores
- agent_proxy.enable_think: Whether to enable thinking steps
- agent_proxy.reward_normalization.grouping: How to group rewards
- agent_proxy.reward_normalization.method: Method for reward normalization
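How action_sep and max_actions_per_turn interact can be sketched with a small parser for the `<answer>...</answer>` format used in the environment instructions. This is an illustrative assumption about the parsing step: `parse_actions` is a hypothetical helper, and truncating extra actions (rather than rejecting the response) is a guess.

```python
import re


def parse_actions(response: str, action_sep: str = "||",
                  max_actions_per_turn: int = 5) -> list[str]:
    """Extract actions from an <answer> block, keeping at most the allowed number."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return []  # no well-formed answer block in this response
    actions = [part.strip() for part in match.group(1).split(action_sep)]
    actions = [action for action in actions if action]
    return actions[:max_actions_per_turn]


response = "<think>Push the box to the right.</think><answer>Right || Right || Up</answer>"
actions = parse_actions(response)
print(actions)  # ['Right', 'Right', 'Up']
```

A response without an answer block yields an empty list here, which is the kind of case a format penalty would typically apply to.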
Environment Manager Settings¶
es_manager:
  format_penalty: -0.1
  train:
    env_groups: 8
    # under the same group, the env config and env seed are ensured to be equal
    group_size: 16
    env_configs:
      tags: ["SimpleSokoban"]
      n_groups: [8] # If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation
  val:
    env_groups: 256
    group_size: 1 # should be set to 1 because when val temperature is set to 0 and group size > 1, there will be repetitive prompts which leads to same trajectory.
    env_configs:
      tags: ["SimpleSokoban"]
      n_groups: [256] # TODO: If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation
- es_manager.format_penalty: Penalty for format violations
- es_manager.train.env_groups: Number of environment groups for training
- es_manager.train.group_size: Number of trajectories under a single initial state rollout
- es_manager.train.env_configs.tags: Environment types to use
- es_manager.train.env_configs.n_groups: Number of initial states
- es_manager.val.*: Similar settings for validation
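The group settings translate into trajectory counts through simple arithmetic. The validation formula below is taken from the comment on trainer.validation_steps; the training count assumes that every group contributes group_size trajectories per rollout step, which this page implies but does not state outright.

```python
# Values from base.yaml.
train_env_groups, train_group_size = 8, 16
val_env_groups, val_group_size = 256, 1
validation_steps = 1

# Assumed: each training group is rolled out group_size times per step.
train_trajectories = train_env_groups * train_group_size

# From the config comment: validation_steps * val_env_groups * group_size.
validation_instances = validation_steps * val_env_groups * val_group_size

print(train_trajectories)    # 128
print(validation_instances)  # 256
```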
Context Manager Settings¶
ctx_manager:
  generation: # go to vllm
    gen_config:
      response_length: ${actor_rollout_ref.rollout.response_length}
      temperature: ${actor_rollout_ref.rollout.temperature}
      top_p: ${actor_rollout_ref.rollout.top_p}
      top_k: ${actor_rollout_ref.rollout.top_k}
      kwargs: null
- ctx_manager.generation.gen_config: Generation configuration
  - response_length: Maximum response length
  - temperature: Sampling temperature
  - top_p: Top-p sampling parameter
  - top_k: Top-k sampling parameter
  - kwargs: Additional generation parameters
ppo_trainer.yaml for verl Trainer¶
Detailed configuration options for the verl trainer are described in the "Configuration Explanation" part of the official verl documentation.
Environment-Specific Configuration¶
Environment-specific configurations (e.g., _2_sokoban.yaml) inherit from base.yaml and add task-specific settings. These files only override a few parameters specific to the environment.
Custom Environment Configuration (envs.yaml)¶
The envs.yaml file defines custom environments that can be used in training. Here's an example of how to configure different environments:
custom_envs:
  SimpleSokoban:
    env_type: sokoban
    max_actions_per_traj: 10 # Maximum actions allowed per trajectory
    env_instruction: "You are solving the Sokoban puzzle. You are the player and you need to push all boxes to targets. When you are right next to a box, you can push it by moving in the same direction. You cannot push a box through a wall, and you cannot pull a box. The answer should be a sequence of actions, like <answer>Right || Right || Up</answer>"
    max_tokens: 100 # Maximum tokens for LLM prompt
    env_config: # Environment-specific configuration
      dim_x: 6
      dim_y: 6
      num_boxes: 1
      max_steps: 100
  FrozenLake:
    env_type: frozen_lake
    max_actions_per_traj: 10
    env_instruction: "You are solving the FrozenLake puzzle. Forbid the whole and go to the target. You may move to the unintended direction due to the slippery ice. Example answer format: <think>To forbid the hole and go to the target, I should go left then go up.</think><answer>Left || Up</answer>"
    max_tokens: 100
    env_config: null
Key Configuration Parameters¶
Common Parameters¶
- env_type: The type of environment (e.g., sokoban, frozen_lake, webshop)
- max_actions_per_traj: Maximum number of actions allowed in a single trajectory
- env_instruction: Instructions for the LLM agent
- max_tokens: Maximum token length for the LLM prompt
- env_config: Environment-specific settings
Environment-Specific Configurations¶
Sokoban Environment¶
env_config:
  dim_x: 6 # Width of the grid
  dim_y: 6 # Height of the grid
  num_boxes: 1 # Number of boxes to push
  max_steps: 100 # Maximum steps allowed
  search_depth: 10 # Optional: Search depth for environment generation
  grid_lookup: {0: "W", 1: ".", 2: "G", 3: "C", 4: "B", 5: "A", 6: "@"} # Optional: Custom grid representation
  grid_vocab: {"W": "wall", ".": "empty", "G": "target", "C": "box on target", "B": "box", "A": "player", "@": "player on target"} # Optional: Custom grid vocabulary
  render_mode: "rgb_array" # Optional: For visual rendering
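The role of grid_lookup and grid_vocab is easiest to see on a tiny grid. The sketch below assumes the environment stores the room as integer cell codes and renders each code through grid_lookup; the 3x5 room and the `render` helper are made up for illustration and are not the environment's own renderer.

```python
grid_lookup = {0: "W", 1: ".", 2: "G", 3: "C", 4: "B", 5: "A", 6: "@"}
grid_vocab = {"W": "wall", ".": "empty", "G": "target", "C": "box on target",
              "B": "box", "A": "player", "@": "player on target"}

# Hypothetical room: the player stands left of a box, with the target to its right.
room = [
    [0, 0, 0, 0, 0],
    [0, 5, 4, 2, 0],
    [0, 0, 0, 0, 0],
]


def render(room: list[list[int]], lookup: dict[int, str]) -> str:
    """Turn a grid of integer cell codes into text rows using the lookup table."""
    return "\n".join("".join(lookup[cell] for cell in row) for row in room)


text = render(room, grid_lookup)
print(text)
# WWWWW
# WABGW
# WWWWW

legend = ", ".join(f"{symbol}: {meaning}" for symbol, meaning in grid_vocab.items())
print(legend)
```

Changing grid_lookup alters the symbols the agent sees, and grid_vocab supplies the matching legend, so the two should be edited together.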
Bandit Environment¶
env_config:
  lo_arm_name: "Phoenix" # Name of the low-reward arm
  hi_arm_name: "Dragon" # Name of the high-reward arm
Using Custom Environments¶
To use a custom environment in your training:
- Define the environment in config/envs.yaml
- Reference it in your training configuration, for example by listing its name under es_manager.train.env_configs.tags
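A quick consistency check between the two files can catch typos in environment names before a run starts. The sketch below works on already-loaded dictionaries and assumes that every tag must name an entry under custom_envs and that n_groups, when set, has one entry per tag and sums to env_groups. Both rules are inferred from the example values on this page, and `check_env_references` is a hypothetical helper.

```python
def check_env_references(custom_envs: dict, split_cfg: dict) -> None:
    """Raise if the es_manager split refers to unknown or inconsistent environments."""
    env_configs = split_cfg["env_configs"]
    tags = env_configs["tags"]
    unknown = [tag for tag in tags if tag not in custom_envs]
    if unknown:
        raise KeyError(f"tags not defined in custom_envs: {unknown}")
    n_groups = env_configs.get("n_groups")
    if n_groups is not None:
        if len(n_groups) != len(tags):
            raise ValueError("n_groups needs one entry per tag")
        if sum(n_groups) != split_cfg["env_groups"]:
            raise ValueError("n_groups should sum to env_groups (assumed rule)")


custom_envs = {"SimpleSokoban": {"env_type": "sokoban"},
               "FrozenLake": {"env_type": "frozen_lake"}}
train_cfg = {"env_groups": 8, "group_size": 16,
             "env_configs": {"tags": ["SimpleSokoban"], "n_groups": [8]}}

check_env_references(custom_envs, train_cfg)  # passes silently
print("training split references are consistent")
```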
Best Practices¶
- Always specify critical parameters in YAML files rather than relying on command-line overrides for better reproducibility.
- Use environment-specific YAML files for parameters that are consistent across runs for that environment.
- Use command-line overrides for experimental variations or one-off changes.
- Document any non-standard parameter combinations in experiment logs.