Research focus
Current bottleneck
Formalize measurable leading indicators and a falsifiable benchmark.
Leading indicators in policy behavior and learning dynamics can reveal objective exploitation before aggregate reward declines.
Current flagship
Develop mathematically rigorous methods that detect reward hacking before reward degradation becomes observable.
Research focus
Formalize measurable leading indicators and a falsifiable benchmark.
Leading indicators in policy behavior and learning dynamics can reveal objective exploitation before aggregate reward declines.
Problem map
Reliable measurement of agent capability and failure.
Instrumentation for understanding agent internals and behavior.
Performance changes when deployment differs from training.
Mismatch between optimized and intended objectives.
Policies exploiting flaws in specified rewards.
Changes in reward model behavior over time.
Sequential decision making under explicit safety constraints.
Systems that expose leading indicators of unsafe behavior.
Current bottleneck: Formalize measurable leading indicators and a falsifiable benchmark.
Deferred: Specify a controlled reward-hacking onset benchmark