SciLayer

Evaluation protocol for measuring genuine learning across repeated ARC-AGI-3 episodes: full-game and single-level setups, primary metrics (action count, dead-end rate, visit redundancy), cross-run persistence requirements, and mapping to ASRA Phase 1–3 transition logging, semantics inference, and episodic memory.

Status: Published preprint (SciLayer Systems)
Companion: ASRA Phase 1, Phase 3

Purpose: Formalize an evaluation protocol for measuring genuine learning across repeated ARC-AGI-3 episodes — not single-episode peak score alone. Defines setups, metrics, persistence requirements, and how ASRA's transition-centric stack produces measurable run-over-run improvement.

1. Motivation

ARC-AGI-3 is an interactive benchmark: agents act, observe state changes, and accumulate evidence over many steps. A single competition score answers:

Did this agent solve anything on this run?

It does not answer:

Does the agent learn from prior interaction — fewer actions, fewer dead ends, better semantics — when the same game or level is attempted again?

Repeated-run evaluation separates memorization of a fixed policy from adaptive reasoning: updating action semantics, visit memory, and plans from transition evidence collected on earlier attempts.

This protocol draws on community discussion of full-game vs single-level repetition (including games ls20 and bp35 as practical test beds) and aligns with ASRA's core loop: log transitions → infer semantics → avoid dead ends → explore efficiently on the next attempt.

2. Evaluation question

Primary hypothesis (adaptive agent):

Repeated interaction with the same game or level should monotonically improve efficiency metrics — holding environment stochasticity and agent seed policy fixed — if learning is occurring.

Null hypothesis (fresh agent each run):

Run k and run k+1 are independent; action count and error rate do not decrease unless by chance.

Distinguishing these requires explicit persistence of memory artifacts between runs and paired comparison of trajectories.

3. Setup A — full-game repetition

Procedure:

Run 1: Agent plays the full game from initial state to terminal (win, loss, or max actions). Log full transition sequence τ₁.
Run 2: Agent replays the same game with memory from Run 1 loaded. Log τ₂.
Run 3+: Continue until action count plateaus or max runs R reached.

Compare: τ₁ vs τ₂ vs τ₃ — action sequences, visit maps, dead-end revisits, time-to-win.

Success criterion (Setup A):

actions(run_k) < actions(run_{k-1})   for some k ≥ 2
AND
dead_end_revisits(run_k) < dead_end_revisits(run_{k-1})

Interpretation: If Run 3 beats Run 2 on action count and reduces repeated dead-end (state, action) pairs, the agent is using cross-run memory — not just lucky exploration noise.

4. Setup B — single-level repetition

Procedure:

Fix a single level within a game (e.g. one stage of ls20 or bp35).
Run 1: Play level from entry state to level terminal or cap.
Run 2+: Replay the same level with persisted memory.

Compare: Per-level action count, unique states visited, subgoal completion time.

Success criterion (Setup B):

median_actions_per_level(run_k) decreases across k
AND
visit_redundancy(run_k) decreases  (fewer re-visits to same state_hash)

Setup B isolates learning at finer granularity — useful when full-game variance dominates.

5. Primary metrics

Metric	Definition	Learning signal
Action count	Total `GameAction` steps to terminal	Lower on later runs → efficiency gain
Dead-end revisit rate	Fraction of steps repeating a taboo `(state_hash, action)`	Lower → semantics / dead-end memory working
Visit redundancy	`revisits / unique_states` per episode	Lower → exploration memory working
Time-to-first-progress	Steps until reward proxy increases	Lower → goal/progress detection improving
Semantic confidence	Mean confidence of inferred action labels on repeated `(s,a)` pairs	Higher → causality layer converging
Trajectory edit distance	Levenshtein distance between action sequences run k vs k−1	Decreasing divergence with lower action count → stable improvement, not random

Secondary metrics (competition-aligned):

Metric	Source
Public Kaggle score	`submission.parquet` after gateway scoring
Levels completed	Episode summary
Win rate	Terminal status across runs

Important: Kaggle public score is a single-run aggregate. Repeated-run metrics must be collected in local or instrumented episodes (research library) or via custom logging inside the competition agent — the default daily submit does not expose run-over-run series.

6. Cross-run persistence contract

For repeated-run eval to be meaningful, specify what persists between runs:

Artifact	Persists?	Producer (ASRA)	Purpose
Transition log JSONL	Yes	Phase 1 `EpisodeLogger`	Replay, compare τ₁…τₙ
`(state_hash, action) → semantics` map	Yes	Phase 1 inferencer, Phase 4 store	Avoid re-discovering action effects
Dead-end taboo set	Yes	Phase 1 `DeadEndDetector`	Skip known bad actions
Visit counts / exploration graph	Yes	Phase 3 `VisitationMemory`, `ExplorationGraph`	Reduce redundant exploration
Goal hypotheses	Optional	Phase 5	Faster goal lock-in on repeat
Plan cache	Optional	Phase 6	Reuse successful action sequences
Raw grid frames	No (derive from hash)	—	Storage efficiency

Fresh-agent control: Run the same protocol with persistence disabled between runs. Adaptive agents should beat fresh-agent controls on action count by run 2–3; if not, claimed "learning" is likely within-run exploration only.

Kaggle constraint: Competition reruns typically start a new scoring episode without filesystem persistence between daily submits. Repeated-run eval is therefore primarily a research-library or custom instrumented kernel measurement — not inferable from one public score per day.

7. ASRA layer mapping

How each ASRA layer contributes to measurable run-over-run improvement:

Run	Phase	Expected contribution
Run 1	Phase 1	Collect τ = (s, a, s′, r); build state graph; infer preliminary semantics; mark dead ends
Run 2	Phase 1 + 3	Taboo dead ends; bias away from high-visit states; replay buffer informs exploration prior
Run 2	Phase 4	Higher semantics confidence on seen (s, a); lower uncertainty bonus → less random probing
Run 3	Phase 5–6	Goal hypotheses rank faster; plan reuse from strategy library
Run 3+	Phase 7	Stuck/waste detectors penalize loops that survived earlier runs

Canonical transition (join key across layers):

τ = (s, a, s′, r, terminal, diff, metadata)
state_hash = SHA-256(canonical grid JSON)

All cross-run comparisons should key on state_hash and episode_id scope — not raw pixel buffers.

8. Recommended experimental protocol

8.1 Games and levels

Start with games suitable for short iteration cycles:

Target	Setup	Rationale
ls20	Setup A or B	Community-suggested test bed for repeated-run comparison
bp35	Setup A or B	Same
Additional ARC-AGI-3 games	Setup A	Generalization of learning signal

Document game ID, level ID, max actions cap, and random seed policy for each run.

8.2 Run matrix

Minimum experiment for publication-quality evidence:

Condition	Persistence	Runs
Adaptive	Full Phase 1–3 memory	R = 5
Fresh control	None between runs	R = 5
Random baseline	N/A	1 per run slot

Report mean ± std of action count per run index k.

8.3 Analysis

Plot actions(k) and dead_end_revisits(k) for adaptive vs fresh.
Align trajectories on state_hash sequence; compute edit distance between consecutive runs.
Export transitions to JSONL; verify hash stability (audit PASS).
Optional: ablate Phase 3 memory off — if run-over-run gain disappears, memory layer is causal.

9. Relation to competition scoring

Two tracks (do not conflate):

Track	Question	Where measured
A. Evaluation plumbing	Does Kaggle score the notebook?	Gateway sidecar submit — see Gateway Deployment Spec
B. Agent intelligence	How well does the agent play?	Public score (single run)
C. Adaptive learning	Does the agent improve on repeat?	This protocol — local / instrumented runs

Track C requires persistence and paired runs — outside default Kaggle daily submit semantics.

Practical recommendation: Use Track A+B for leaderboard presence; use Track C for research claims about adaptive reasoning.

10. Submission frequency (eval infrastructure note)

Measuring learning curves meaningfully benefits from multiple evaluations per day (e.g. 5–10 submits) to iterate persistence and memory policies quickly. A single daily submit constrains feedback to one scalar — insufficient for run-over-run analysis on the platform itself.

Local replay evaluation (python -m asra complete-phase1, transition export, replay viewer) remains the primary loop for Track C regardless of submit quota.

11. Reporting template

Minimum tables for a repeated-run results section:

Table 1 — Run-over-run action count (Setup A, game ls20)

Run k	Actions	Dead-end revisits	Unique states	Win
1
2
3

Table 2 — Adaptive vs fresh (Run 3)

Condition	Actions (mean ± std)	Δ vs Run 1
Adaptive
Fresh control

Figure 1: actions(k) line plot, adaptive vs fresh, R runs.

12. Limitations

Stochastic environments: If level layout varies between runs, action count may not decrease monotonically — report distribution, not single pairs.
Kaggle isolation: Default competition rerun does not persist agent memory between daily submits; Track C is research-library first.
Score vs efficiency: Public score may not correlate with action efficiency on repeated runs — report both.
Overfitting to one game: Gains on ls20/bp35 must be replicated on held-out games.

13. References

14. One-line takeaway

A single ARC-AGI-3 score measures one run; repeated-run eval measures whether the agent learns. Formalize Setup A (full game) or Setup B (single level), persist transition-derived memory between runs, and report action count and dead-end revisits — adaptive agents should beat fresh-agent controls by run 2–3 if reasoning is real.

Repeated-Run Learning Evaluation for ARC-AGI-3 — Protocol and ASRA Mapping