SciLayer

Interim consolidated evaluation report for ASRA: three tracks (gateway plumbing, competition scores, repeated-run learning), Phase 1–2 Kaggle results, Phase 2 Original ARC benchmark summary, agent version ladder, and roadmap to v1 after Stage 1 gateway migration completes.

Status: Interim preprint v0 (SciLayer Systems) — in progress
Repository copy: documents/evaluation/asra-arc-agi-3-evaluation-report-v0.md
SciLayer: https://sci-layer.vercel.app/articles/asra-arc-agi-3-evaluation-report-v0
Target: v1 after Stage 1 gateway migration completes (Phases 1–9 all Succeeded)

Purpose: Consolidated metrics document for ASRA Phases 1–9: competition scores, benchmark results, agent version ladder, and evaluation tracks. v0 publishes available evidence now; sections marked TBD fill as Stage 1 submits complete and Stage 2 score work begins.

Document status (v0)

Section	Status	Notes
Executive summary	Partial	P1–2 scores only
Track A — gateway plumbing	Complete	See gateway spec
Track B — competition scores	Partial	2/9 phases submitted
Track C — repeated-run learning	Protocol only	Results TBD
Phase 2 Original ARC	Complete	See Phase 2 results paper
Phases 3–9 benchmarks	TBD	Pending Stage 1 + library evals
Agent version ladder	Partial	v0.1–v0.4 verified
Limitations	Draft

Upgrade path: v0 → v1 when Stage 1 exit criteria met (gateway migration complete).

1. Executive summary

ASRA implements a nine-layer transition-centric cognitive stack for adaptive reasoning on ARC-AGI-3. This report consolidates three evaluation tracks:

Track	Question	v0 status
A. Plumbing	Does Kaggle score the notebook?	Solved — gateway sidecar pattern
B. Intelligence	How well does each phase agent play?	Partial — P1 0.03, P2 0.00
C. Learning	Does the agent improve on repeat?	Protocol published; results pending

Key v0 findings:

Generic Kaggle Error on ARC-AGI-3 was an evaluation-contract mismatch, not platform failure — fixed by official gateway pattern (Phase 1 v3 ref 53652655).
Gateway migration verified on Phase 2 (ref 53660658, score 0.00).
Phase 2 Observation Engine: 100% rule-candidate coverage on 800 Original ARC tasks (results paper).
Stage 1 in progress: Phases 3–9 gateway submits pending (1 submission/day limit).

2. Evaluation tracks

2.1 Track A — Evaluation plumbing (Kaggle gateway)

Status: Complete for Phase 1–2; migration tooling ready for Phases 3–9.

Requirement	Status
Official 4-cell notebook layout	✅ `_shared/gateway_notebook.py`
Template agent (`MyAgent`, no Swarm)	✅ Phases 1–9 extracted
Dummy parquet schema	✅ `row_id, game_id, end_of_game, score`
Scoring rerun via gateway sidecar	✅ Verified P1, P2

Reference: ARC-AGI-3 Kaggle Gateway Deployment Spec

2.2 Track B — Competition intelligence (public score)

Metric: Kaggle public score after gateway scoring rerun (single run per submit).

Stage 1 goal: Green Succeeded status per phase — not score optimization.

Phase	Agent tag	Kernel	Ref	Status	Public score
1	`asra-v0.1-phase1`	v3	53652655	Succeeded	0.03
2	`asra-v0.4-phase2`	v5	53660658	Succeeded	0.00
3	`asra-v0.5-phase3`	v2	—	Push done; submit pending	TBD
4	`asra-v0.6-phase4`	—	—	Not submitted	TBD
5	`asra-v0.7-phase5`	—	—	Not submitted	TBD
6	`asra-v0.8-phase6`	—	—	Not submitted	TBD
7	`asra-v0.85-phase7`	—	—	Not submitted	TBD
8	`asra-v0.9-phase8`	—	—	Not submitted	TBD
9	`asra-v1.0-phase9`	—	—	Not submitted	TBD

Interpretation (v0): Phase 1 baseline (transition logging + exploration) achieves non-zero score (0.03). Phase 2 adds object-scene hints without score gain on first gateway submit — expected at Stage 1; score iteration is Stage 2+.

Historical note: Pre-gateway submits (v1/v2) for all phases returned generic Kaggle Error — not comparable scores.

2.3 Track C — Repeated-run adaptive learning

Status: Protocol published; empirical results TBD.

Protocol: Repeated-Run Learning Eval for ARC-AGI-3

Setup	Description	v0 results
A	Full-game repetition (ls20, bp35)	TBD
B	Single-level repetition	TBD

Primary metrics (when run): action count, dead-end revisit rate, visit redundancy, adaptive vs fresh-agent control.

Constraint: Kaggle daily submit does not persist memory between competition reruns — Track C measured in research library (asra-arc) or instrumented kernels.

3. Phase-by-phase metrics

3.1 Phase 1 — Experience Engine

Metric	Value	Source
Transition schema tests	24 pass	`pytest`
Hash stability audit	PASS	`complete-phase1`
Export pipeline	JSONL + Parquet + CSV	`data/exports/asra_v0_*`
Kaggle gateway submit	Succeeded	ref 53652655
Public score	0.03	Track B
Live ARC-AGI-3 API episode	Pending credentials	—

Theory: Phase 1 preprint

3.2 Phase 2 — Observation Engine

Metric	Training	Evaluation
Rule-candidate coverage	100% (400/400)	100% (400/400)
Common-rule coverage (conf 1.0)	98.0%	97.75%
Avg objects / scene	13.16	25.40
Avg transform events / pair	16.86	30.90
Parse errors	0	0

Full report: Phase 2 Original ARC Evaluation Results

Metric	Value	Source
Kaggle gateway submit	Succeeded	ref 53660658
Public score	0.00	Track B

3.3 Phase 3 — Exploration & Memory

Metric	v0 status
MiniGrid / BabyAI benchmarks	Library complete — TBD consolidated table
DoorKey benchmark	TBD
Kaggle gateway submit	Push complete — submit TBD
Public score	TBD

Theory: Phase 3 preprint
Spec: Phase 3 technical specification

3.4 Phase 4 — Causal Action Semantics

Metric	v0 status
Semantic accuracy / calibration	TBD
Counterfactual query tests	TBD
Kaggle submit	TBD

3.5 Phase 5 — Goal Inference

Metric	v0 status
Hypothesis ranking accuracy	TBD
Progress detection	TBD
Kaggle submit	TBD

3.6 Phase 6 — Planning & Strategy

Metric	v0 status
Planner success rate	TBD
Actions-to-win	TBD
Kaggle submit	TBD

3.7 Phase 7 — Robustness & Generalization

Metric	v0 status
Stuck / waste detection	TBD
Procgen / DMLab delta	TBD
Eval dashboard	TBD — `data/robustness/dashboard/`
Kaggle submit	TBD

3.8 Phase 8 — Decision Biology Bridge

Metric	v0 status
Pathway hypothesis rank	TBD
Kaggle submit	TBD

3.9 Phase 9 — Integration

Metric	v0 status
Full stack self-test	Local pass — TBD Kaggle gateway
Integrated hint weights	`ASRA_*_HINT_WEIGHT` env-tunable
Kaggle submit (v1.0)	TBD
Demo video	TBD
Architecture diagram SVG	TBD

4. Agent version ladder

Cumulative agent tags embedded in Kaggle template agents:

Version	Phase	Layers active	Kaggle verified (gateway)	Score
v0.1	1	Experience	✅ ref 53652655	0.03
v0.4	2	+ Observation	✅ ref 53660658	0.00
v0.5	3	+ Memory / exploration	Push only	TBD
v0.6	4	+ Causality	—	TBD
v0.7	5	+ Goals	—	TBD
v0.8	6	+ Planning	—	TBD
v0.85	7	+ Robustness	—	TBD
v0.9	8	+ Biology bridge	—	TBD
v1.0	9	Full integration	—	TBD

v1 report goal: Complete score column + layer ablation (disable Phase N hints → measure Δ score).

5. Data sources

Path	Content
`asra-arc/data/analysis/phase2/`	Original ARC eval (complete)
`asra-arc/data/exports/`	Transition datasets
`asra-arc/data/robustness/`	Phase 7 dashboard (partial)
`kaggle-notebooks/phaseN/`	Per-phase agents + notebooks
Kaggle submissions API	Competition refs + scores

6. Roadmap to v1

Step	Action	Unblocks
1	Complete Stage 1 submits (P3→P9)	Full Track B score ladder
2	Stage 2 — iterate Phase 1/9 score	Non-zero competitive scores
3	Run repeated-run protocol (ls20/bp35)	Track C results
4	Consolidate Phase 3–7 library benchmarks	Phase sections 3.3–3.7
5	Architecture SVG + demo video	Phase 9 narrative
6	Publish Evaluation Report v1	Final Phase 9 deliverable

7. Limitations (v0)

Incomplete score ladder — only Phases 1–2 gateway-verified at time of writing.
Single-run scores — no repeated-run learning evidence in Track B.
Original ARC ≠ ARC-AGI-3 — Phase 2 benchmark is static perception; competition is interactive.
Daily submit cap — 1/day slows Stage 1 completion.
Pre-gateway scores invalid — earlier ERROR submits excluded from comparison.

8. References

9. One-line takeaway (v0)

ASRA v0 evaluation confirms gateway scoring works (Track A), establishes Phase 2 perception on 800 ARC tasks, and records first competition scores (P1: 0.03, P2: 0.00) — v1 completes the phase ladder once Stage 1 migration finishes.

ASRA Evaluation Report — ARC-AGI-3 & Phase Benchmarks (v0)