SciLayer

Adaptive Scientific Discovery Benchmark (ASDB): A Two-Track Framework for Evaluating Interactive Agents

Most agent benchmarks assume documented tool semantics and static ground-truth answers. Real scientific inquiry instead requires agents to learn what interventions do from state transitions, then infer hidden mechanisms, design discriminating experiments, and predict held-out observables. ASDB unifies two complementary tracks: Action Semantics Discovery (inferring an action map φ̂(a) from unlabeled controls) and Scientific Discovery Evaluation (recovering hidden theory classes under an intervention budget). Both tracks share one interaction loop but score different constructs. Linked A→B episodes, decoy falsification, tiered difficulty, and decomposable metrics are intended to support construct validity when evaluating adaptive scientific reasoning.
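To make the Action Semantics Discovery idea concrete, the following is a minimal sketch of estimating an action map from observed state transitions. It is not ASDB's environment or scoring code: the 1-D state, the action labels a0 to a2, the hidden effects, the noise level, and the function names step and estimate_action_map are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical toy setup (an assumption, not part of ASDB): each opaque
# action label shifts a 1-D state by a hidden amount.
HIDDEN_EFFECTS = {"a0": 1.0, "a1": -2.0, "a2": 0.5}


def step(state, action, rng, noise=0.05):
    """Apply the hidden effect of `action` plus small Gaussian noise."""
    return state + HIDDEN_EFFECTS[action] + rng.gauss(0.0, noise)


def estimate_action_map(actions, trials_per_action=50, seed=0):
    """Estimate phi_hat(a) as the mean observed state change per action."""
    rng = random.Random(seed)
    deltas = defaultdict(list)
    state = 0.0
    for action in actions:
        for _ in range(trials_per_action):
            next_state = step(state, action, rng)
            # The agent sees only the transition, never HIDDEN_EFFECTS.
            deltas[action].append(next_state - state)
            state = next_state
    return {a: sum(d) / len(d) for a, d in deltas.items()}


# The estimator receives only the unlabeled action names.
phi_hat = estimate_action_map(list(HIDDEN_EFFECTS))
```

In this toy setting, averaging state differences is enough because each action has a fixed additive effect. A benchmark environment could have state-dependent or interacting effects, in which case a richer estimator would be needed.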

interactive agents
