Flight Recorder

Develop mathematically rigorous methods that detect reward hacking before reward degradation becomes observable.

19 research problems

3 active research tasks

Research focus

Current bottleneck

Formalize measurable leading indicators and a falsifiable benchmark.

Working hypothesis

Leading indicators in policy behavior and learning dynamics can reveal objective exploitation before aggregate reward declines.
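One way to make the hypothesis concrete is to run the same change detector on an indicator series and on the true-reward series, then compare when each raises an alarm. The sketch below is a minimal illustration under stated assumptions, not the project's method: it swaps in a one-sided CUSUM detector as a stand-in for whatever criterion is eventually formalized. The function names `cusum_alarm` and `lead_time`, the baselines, slack values, thresholds, and the toy series are all hypothetical.

```python
from typing import Optional, Sequence


def cusum_alarm(series: Sequence[float], baseline: float,
                slack: float, threshold: float) -> Optional[int]:
    """Return the first index where a one-sided (upward) CUSUM statistic
    exceeds `threshold`, or None if it never does."""
    stat = 0.0
    for t, x in enumerate(series):
        # Accumulate only deviations above baseline + slack; clamp at zero.
        stat = max(0.0, stat + (x - baseline - slack))
        if stat > threshold:
            return t
    return None


def lead_time(indicator_alarm: Optional[int],
              reward_alarm: Optional[int]) -> Optional[int]:
    """Steps by which the indicator alarm precedes the reward alarm.
    Positive means the indicator fired first; None if either never fired."""
    if indicator_alarm is None or reward_alarm is None:
        return None
    return reward_alarm - indicator_alarm


# Hypothetical toy data: the indicator shifts upward at step 5,
# while the true reward only drops at step 10.
indicator = [0.10] * 5 + [0.30] * 10
true_reward = [1.0] * 10 + [0.6] * 5

ind_alarm = cusum_alarm(indicator, baseline=0.10, slack=0.05, threshold=0.4)

# Detect a downward shift in reward by negating the series and its baseline.
rew_alarm = cusum_alarm([-r for r in true_reward],
                        baseline=-1.0, slack=0.05, threshold=0.4)

lead = lead_time(ind_alarm, rew_alarm)
```

Under this framing the hypothesis is falsifiable: on a benchmark with a known hacking onset, it fails if `lead` is not reliably positive at a fixed false-alarm rate.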

Problem map

8 core

Agent Evaluation
Reliable measurement of agent capability and failure.

Agent Observability
Instrumentation for understanding agent internals and behavior.

Distribution Shift
Performance changes when deployment differs from training.

Objective Misalignment
Mismatch between optimized and intended objectives.

Reward Hacking
Policies exploiting flaws in specified rewards.

Reward Model Drift
Changes in reward model behavior over time.

Safe RL
Sequential decision making under explicit safety constraints.

Safety Instrumentation
Systems that expose leading indicators of unsafe behavior.

Problem → paper → project loop
Evidence-backed and reviewable


deep work: Formalize a leading-indicator detection criterion
Highest-priority deep-work task (prove).
communication: Clear an outreach follow-up
Review due follow-ups and prepare a draft (Atlas will not send it).
maintenance: Tidy the inbox / review queue
Process the inbox and accept/reject proposed links.

Deferred: Specify a controlled reward-hacking onset benchmark