
Curriculum Learning Framework

Deep multi-agent reinforcement learning (MARL) algorithms such as PPO and MAPPO struggle with sparse rewards in very large state spaces. Training an agent directly on a 100-node enterprise network often fails to converge.

To solve this, NetForge RL includes a native CurriculumWrapper that scales the difficulty of the environment progressively as the policy improves.

The Curriculum Wrapper

The CurriculumWrapper wraps the core NetForgeRLEnv. It monitors the episodic reward and automatically advances the environment to the next "phase" of complexity once the moving average of that reward reaches the current phase's threshold.
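The advancement rule can be sketched in a few lines. This is a minimal illustration, not the wrapper's actual implementation: the class name RollingPhaseTracker, the window size, the second threshold (120.0), and the choices to require a full window and to clear it after advancing are all assumptions made for the example.

```python
from collections import deque


class RollingPhaseTracker:
    """Hypothetical helper: tracks episodic rewards and decides when to advance."""

    def __init__(self, thresholds, window=20):
        self.thresholds = list(thresholds)   # one advance threshold per phase
        self.window = window                 # number of episodes in the moving average
        self.phase_index = 0
        self.rewards = deque(maxlen=window)  # oldest episodes fall out automatically

    @property
    def mean_reward(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

    @property
    def window_fill(self):
        # 1.0 means the rolling window is full
        return len(self.rewards) / self.window

    def record_episode(self, episode_reward):
        """Add one episode's total reward; return True if the phase advanced."""
        self.rewards.append(episode_reward)
        is_last_phase = self.phase_index == len(self.thresholds) - 1
        window_full = len(self.rewards) == self.window
        threshold = self.thresholds[self.phase_index]
        if not is_last_phase and window_full and self.mean_reward >= threshold:
            self.phase_index += 1
            self.rewards.clear()  # assumption: each phase starts with a fresh window
            return True
        return False


# 60.0 matches the novice threshold shown in this page; 120.0 is made up.
tracker = RollingPhaseTracker(thresholds=[60.0, 120.0, float("inf")], window=3)
advanced = [tracker.record_episode(r) for r in (50.0, 65.0, 70.0)]
```

Waiting for a full window before comparing against the threshold keeps a single lucky episode from triggering a premature phase change.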

Phase Progression

The curriculum is divided into stages (PhaseConfig), gradually expanding the state space and introducing complex mechanics.

  1. Novice Phase:
     - Scale: 5 maximum active hosts.
     - Scenarios: ransomware only.
     - Mechanics: Static topology, no DHCP shuffling, high reward scaling (3.0x).
     - Goal: Teach the agent the basic mechanics of scanning, exploiting, and lateral movement.

  2. Intermediate Phase:
     - Scale: 25 maximum active hosts.
     - Scenarios: ransomware, apt_espionage.
     - Mechanics: Periodic DHCP IP reshuffling (every 80 ticks), standard reward scaling (1.5x).
     - Goal: Introduce subnet routing and invalidate overfit IP-address memorization.

  3. Expert Phase:
     - Scale: 100 maximum active hosts.
     - Scenarios: ransomware, apt_espionage, iot_grid.
     - Mechanics: Fast DHCP IP reshuffling (every 40 ticks), active Dynamic Topologies (churn, migration, arrival), unscaled rewards (1.0x).
     - Goal: Full mastery of zero-trust environments with high non-stationarity.
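To make the progression concrete, the three phases described above can be written down as plain data. The sketch below is illustrative only: PhaseSketch and its field names are hypothetical and do not reflect the real PhaseConfig schema, and the intermediate phase's topology setting is an assumption because the text does not state it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PhaseSketch:
    """Hypothetical stand-in for a phase definition (not the real PhaseConfig)."""
    name: str
    max_hosts: int
    scenarios: Tuple[str, ...]
    dhcp_shuffle_ticks: Optional[int]  # None means no DHCP reshuffling
    dynamic_topology: bool
    reward_scale: float


SKETCH_PHASES = (
    PhaseSketch("novice", 5, ("ransomware",), None, False, 3.0),
    # dynamic_topology=False here is assumed; the text does not say either way.
    PhaseSketch("intermediate", 25, ("ransomware", "apt_espionage"), 80, False, 1.5),
    PhaseSketch("expert", 100, ("ransomware", "apt_espionage", "iot_grid"), 40, True, 1.0),
)


def scaled_reward(raw_reward: float, phase: PhaseSketch) -> float:
    """Apply the phase's reward multiplier to a raw environment reward."""
    return raw_reward * phase.reward_scale


# The same raw reward of 10.0 is worth less as the curriculum progresses.
rewards_by_phase = {p.name: scaled_reward(10.0, p) for p in SKETCH_PHASES}
```

The larger multiplier in early phases amplifies a sparse learning signal, and the multiplier falls to 1.0x by the expert phase so that final performance is measured on unscaled rewards.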

Usage

Wrapping your environment in the curriculum is straightforward:

from netforge_rl.environment.curriculum import CurriculumWrapper, PHASES

# Initialize the curriculum with the predefined phases
env = CurriculumWrapper(
    phases=PHASES,
    base_cfg={'nlp_backend': 'tfidf'},
    start_phase=0
)

obs, info = env.reset()

Observation Info Dictionary

The wrapper injects a special __curriculum__ key into the info dictionary returned by env.step() and env.reset(). In the example below the key sits inside an agent's entry (red_operator). You can read it on every step to log curriculum progression to W&B or TensorBoard:

info['red_operator']['__curriculum__'] = {
    'phase': 'novice',
    'phase_index': 0,
    'advance_threshold': 60.0,
    'mean_reward': 45.2,
    'window_fill': 1.0,         # 1.0 means the rolling window is full
    'phase_advanced': False     # True if the phase upgraded on this step
}
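A logger usually wants flat scalar tags instead of a nested dictionary. The helper below is a hypothetical sketch (curriculum_scalars is not part of NetForge RL) that extracts the numeric and boolean fields from the structure shown above and skips the string phase name.

```python
def curriculum_scalars(info, agent="red_operator", prefix="curriculum"):
    """Return {tag: float} for the numeric and boolean curriculum fields.

    Hypothetical helper; assumes the info layout shown in the example above.
    """
    payload = info.get(agent, {}).get("__curriculum__")
    if payload is None:
        return {}  # the key is absent, e.g. when the env is not wrapped
    scalars = {}
    for key, value in payload.items():
        # Strings such as the phase name are skipped; booleans become 0.0 or 1.0.
        if isinstance(value, (bool, int, float)):
            scalars[f"{prefix}/{key}"] = float(value)
    return scalars


example_info = {
    "red_operator": {
        "__curriculum__": {
            "phase": "novice",
            "phase_index": 0,
            "advance_threshold": 60.0,
            "mean_reward": 45.2,
            "window_fill": 1.0,
            "phase_advanced": False,
        }
    }
}
metrics = curriculum_scalars(example_info)
```

The resulting dictionary can be passed directly to wandb.log(metrics), or written tag by tag with TensorBoard's SummaryWriter.add_scalar(tag, value, global_step).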